How many actual bits are in an unfolded Morgan fingerprint?

The simplest descriptors used in ML for chemistry are fragment-based, and they are usually used in a folded form. A typical folded size is 1024, with some recommendations to use higher dimensionality (>=4096) for computational chemogenomics tasks.

Descriptors are folded for several reasons: to save space and to get a uniform length. The price is bit collisions and, sometimes, insensitivity to small changes. Let's check how many actual bits occur in drugs, as an example of FDA-approved chemical space, versus a random selection of 300K compounds from the Enamine REAL virtual space of 3B compounds commonly used in virtual screening.
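To make the collision cost concrete, here is a minimal sketch of folding. It assumes folding is done by taking each unfolded feature identifier modulo the folded length, which is a common scheme but not necessarily the one used by every toolkit. The function names and the example identifiers are hypothetical.

```python
from collections import defaultdict


def fold(feature_ids, n_bits=1024):
    """Fold unfolded feature identifiers into a fixed-length bit set.

    Assumption: folding is a plain modulo on the integer identifier.
    """
    return {fid % n_bits for fid in feature_ids}


def collisions(feature_ids, n_bits=1024):
    """Return folded positions shared by more than one distinct identifier."""
    buckets = defaultdict(set)
    for fid in set(feature_ids):
        buckets[fid % n_bits].add(fid)
    return {pos: ids for pos, ids in buckets.items() if len(ids) > 1}


# Hypothetical identifiers, chosen so that 5 and 1029 collide at 1024 bits.
example_ids = [5, 1029, 2048, 77]
folded = fold(example_ids)
clashes = collisions(example_ids)
print(sorted(folded))   # [0, 5, 77]
print(clashes)          # {5: {5, 1029}}
```

Four distinct fragments end up as three bits here, so a model trained on the folded vector cannot tell fragments 5 and 1029 apart.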

Number of unique bits

As we can see, for radius 3 there are about 16 unique bits per compound on average in drug space, dropping to about 4 and 2 bits per compound for the larger sets of 100K and 300K diverse compounds from the Enamine REAL library.
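The per-compound figures above can be read as the number of distinct bits seen across a set divided by the number of compounds in it. That reading is an assumption, and the sketch below only illustrates the arithmetic on toy data. Each compound is represented as a set of unfolded feature identifiers; with RDKit these could, for instance, be the keys of `AllChem.GetMorganFingerprint(mol, 3).GetNonzeroElements()`, but the helper name and the toy sets are hypothetical.

```python
def unique_bits_per_compound(fingerprints):
    """Count distinct feature ids over a dataset and divide by its size.

    fingerprints: iterable of sets of unfolded feature identifiers,
    one set per compound.
    Returns (number of unique bits, unique bits per compound).
    """
    seen = set()
    n_compounds = 0
    for fp in fingerprints:
        seen.update(fp)
        n_compounds += 1
    if n_compounds == 0:
        raise ValueError("need at least one compound")
    return len(seen), len(seen) / n_compounds


# Toy data: four "compounds" that share most of their fragments.
toy = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5}]
total, ratio = unique_bits_per_compound(toy)
print(total, ratio)  # 5 1.25
```

The ratio falls as a set grows because new compounds mostly reuse fragments already seen, even while the total number of unique bits keeps increasing.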

Enamine virtual space is, of course, repetitive, but it still indicates that in a large chemical space the number of unique bits grows significantly, to more than half a million. This raises the good old question about using folded descriptors for ML tasks, especially in very large chemical spaces.
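A rough back-of-the-envelope check shows why this matters. The sketch below assumes identifiers are spread uniformly and independently over the folded positions, which real fragment hashes only approximate. Under that assumption, the expected number of occupied positions for N unique bits folded into m positions is m(1 - (1 - 1/m)^N). The function name is hypothetical and 500,000 is just the "half a million" order of magnitude from the text.

```python
def expected_occupied(n_unique, n_bits):
    """Expected number of folded positions hit by at least one identifier.

    Assumption: identifiers fall uniformly and independently on positions.
    """
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_unique)


n_unique = 500_000  # order of magnitude quoted in the text
for n_bits in (1024, 4096, 2 ** 20):
    occupied = expected_occupied(n_unique, n_bits)
    # Average number of distinct fragments sharing one occupied position.
    print(n_bits, round(occupied), round(n_unique / occupied, 1))
```

At 1024 or 4096 positions essentially every position is occupied, so each folded bit stands for hundreds of different fragments across the whole space. A single compound sets far fewer bits, so collisions within one molecule are much rarer than this dataset-level number suggests.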

PS: The Google Colab notebook is here (not sure why GitHub is breaking the preview).
