How many actual bits are in an unfolded Morgan fingerprint?
The simplest descriptors used in ML for chemistry are fragment-based fingerprints, and they are usually used in a folded form. A typical default folded size is 1024 bits, with some recommendations to use higher dimensionality (>=4096) for computational chemogenomics tasks.
Descriptors are folded for several reasons: to save space and to get a uniform length. The price is bit collisions and, sometimes, insensitivity to small structural changes. Let's check how many actual bits occur in drugs, as an example of FDA-approved chemical space, versus a random selection of 300K compounds from the Enamine REAL virtual space of 3B compounds commonly used in virtual screening.
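To make the folding trade-off concrete, here is a minimal sketch in plain Python. It assumes folding is a simple modulo of the hashed feature identifier, which is a common scheme but not necessarily what every toolkit does. The identifiers are synthetic random integers standing in for unfolded fingerprint bits, not real Morgan output, and the helper names `fold` and `count_collisions` are made up for illustration.

```python
import random


def fold(feature_ids, n_bits=1024):
    """Map unfolded feature identifiers onto n_bits positions via modulo."""
    return {fid % n_bits for fid in feature_ids}


def count_collisions(feature_ids, n_bits=1024):
    """Count distinct features that no longer have a bit position of their own."""
    unique = set(feature_ids)
    return len(unique) - len(fold(unique, n_bits))


# Synthetic stand-in for unfolded identifiers (assumption: NOT real Morgan bits).
rng = random.Random(0)
features = {rng.getrandbits(32) for _ in range(5000)}

for n_bits in (1024, 4096, 16384):
    occupied = len(fold(features, n_bits))
    lost = count_collisions(features, n_bits)
    print(f"n_bits={n_bits}: occupied={occupied}, collided={lost}")
```

A larger folded size leaves more features with their own position, which is the intuition behind the >=4096 recommendation.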
[Table: number of unique bits]
So, as we can see, for radius 3 there are on average about 16 unique bits per compound in drug space, and this drops to about 4 and 2 unique bits per compound for the larger chemical spaces of 100K and 300K diverse compounds from the Enamine REAL library.
Enamine virtual space is, of course, repetitive, but it still indicates that in a large chemical space the number of unique bits grows significantly, to more than half a million. This raises the good old question about using folded descriptors for ML tasks, especially in very large chemical spaces.
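The per-compound figure above can be read as the number of distinct bits across the whole set divided by the number of compounds. That reading is an assumption, but it matches the totals quoted in the post. A small sketch of the calculation, using toy feature identifiers and a hypothetical helper `unique_bits_per_compound`; in practice each set would come from a toolkit's unfolded Morgan fingerprint, which is not shown here:

```python
def unique_bits_per_compound(fingerprints):
    """Return (number of distinct bits, distinct bits / number of compounds).

    fingerprints: iterable of per-compound collections of feature identifiers.
    """
    all_bits = set()
    n_compounds = 0
    for bits in fingerprints:
        all_bits.update(bits)  # bits shared between compounds are counted once
        n_compounds += 1
    if n_compounds == 0:
        raise ValueError("no compounds given")
    return len(all_bits), len(all_bits) / n_compounds


# Toy data (assumption: made-up identifiers, three "compounds").
toy = [{11, 42, 97}, {11, 42, 130}, {11, 205}]
n_unique, ratio = unique_bits_per_compound(toy)
print(n_unique, round(ratio, 2))
```

Because shared fragments are counted only once, the ratio falls as the set grows and becomes more repetitive, even while the total number of unique bits keeps rising.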
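A back-of-the-envelope check shows how crowded a folded vector gets at this scale. The sketch below assumes features are spread uniformly over the folded positions, which real fingerprints only approximate, and uses 500,000 as a round stand-in for "more than half a million". The function names are hypothetical.

```python
def features_per_bit(n_unique, n_bits):
    """Average number of distinct features sharing one folded position."""
    return n_unique / n_bits


def expected_occupied(n_unique, n_bits):
    """Expected number of non-empty positions under uniform random assignment."""
    return n_bits * (1 - (1 - 1 / n_bits) ** n_unique)


n_unique = 500_000  # assumption: round figure for the unique bits observed
for n_bits in (1024, 4096, 65536):
    print(
        n_bits,
        round(features_per_bit(n_unique, n_bits), 1),
        round(expected_occupied(n_unique, n_bits)),
    )
```

Under these assumptions, each position of a 1024-bit fingerprint stands for several hundred different fragments across the library, although any single compound sets only a small number of them.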
PS: The Google Colab notebook is here (not sure why GitHub is messing up the preview).