Posts

Showing posts from April, 2023

Can we "talk" with the data? A tiny case for testing pandas_ai for human clearance data

This is a test of the pandas_ai package by Gabriele Venturi. PandasAI is a layer between the pandas data frame and LLMs that simplifies interaction with the data and makes it more conversational. I ran a quick test on AstraZeneca clearance data from ChEMBL to see whether pandas_ai eases data manipulation. Google Colab notebook is here. **TLDR**: the proposed solutions are not always optimal, but it works! There are some issues with linking abstract concepts, but current LLMs do not have a defined knowledge graph, so this is no surprise. More packages that ease interaction with private data will appear soon. The algorithm behind the package generates real code for data frame filtering and manipulation. Because of this, it will fail if the data types of columns are not correctly defined. Simple request: Provide names for 5 compounds with the highest LogP. Answer: looks excellent, proper linking name to 'Molecule Name' and 'IDs' to 'Molecule ChEMBL ID...
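The data type caveat can be illustrated with plain pandas, independent of pandas_ai. The sketch below is only an illustration: the data frame is a made-up stand-in for the ChEMBL export, the compound names and values are invented, and the `LogP` column name is an assumption. It shows how a "top 5 by LogP" query goes wrong when the column is stored as text, and how coercing it to numeric first fixes the result.

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration.
df = pd.DataFrame({
    "Molecule ChEMBL ID": ["ID_A", "ID_B", "ID_C", "ID_D", "ID_E", "ID_F"],
    "Molecule Name": ["cmpd_a", "cmpd_b", "cmpd_c", "cmpd_d", "cmpd_e", "cmpd_f"],
    # LogP loaded as text, with one missing value written as "None".
    "LogP": ["1.2", "10.5", "3.4", "None", "2.8", "0.9"],
})

# Sorting the text column is lexicographic: "None" ranks first and "10.5" sorts below "2.8".
top5_text = (
    df.sort_values("LogP", ascending=False)["Molecule Name"].head(5).tolist()
)

# Coerce to numbers first; unparsable entries become NaN and are ignored by nlargest.
df["LogP"] = pd.to_numeric(df["LogP"], errors="coerce")
top5_numeric = df.nlargest(5, "LogP")["Molecule Name"].tolist()

print(top5_text)
print(top5_numeric)
```

Checking `df.dtypes` before handing a data frame to any code-generating layer is a cheap way to catch this kind of problem early.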

How many actual bits in unfolded Morgan fingerprint?

The simplest descriptors used in ML for chemistry are fragments, and they are usually used in a folded form. The default folded size is 1024, with some recommendations to use higher dimensionality (>=4096) for computational chemogenomics tasks. Descriptors are folded for several reasons: to save space and to get a uniform length, at the price of bit collisions and, sometimes, insensitivity to small changes. Let's check how many actual bits are in drugs, as an example of FDA-approved chemical space, vs a random selection of 300K compounds from the Enamine REAL virtual space of 3B compounds commonly used in virtual screening. Number of unique bits. As we can see, for radius 3 there are on average 16 bits per compound in drug space, dropping to 4 and 2 bits per compound for the larger chemical spaces of 100K and 300K diverse compounds from the Enamine REAL library. Enamine virtual space is, of course, repetitive, but it still indicates that with a l...
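The bit collision trade-off described in this post can be sketched without any chemistry toolkit. The example below is a minimal illustration under stated assumptions: the unfolded feature IDs are random 32-bit integers standing in for hashed fragment identifiers, and folding is done by taking the ID modulo the target length, which is one common scheme and not necessarily the one used in the post. The helper names `fold` and `collisions` are hypothetical.

```python
import random


def fold(bit_ids, n_bits=1024):
    """Fold sparse feature IDs into a fixed-length space via modulo."""
    return {b % n_bits for b in bit_ids}


def collisions(bit_ids, n_bits=1024):
    """Number of distinct unfolded IDs lost because they share a folded position."""
    unique = set(bit_ids)
    return len(unique) - len(fold(unique, n_bits))


# Random stand-ins for hashed fragment IDs (not real fingerprint data).
rng = random.Random(0)
unfolded = {rng.getrandbits(32) for _ in range(5000)}

for n_bits in (1024, 4096, 16384):
    print(n_bits, collisions(unfolded, n_bits))
```

Because each larger size here is a multiple of the smaller one, the collision count can only stay the same or shrink as the folded length grows, which is the intuition behind recommending 4096 bits or more for large chemical spaces.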

Kaggle Solubility competition: overview

Kaggle Solubility competition is over. It drew the attention of 100 participants, who highlighted some serious issues. I decided to share my dive into this... Data: 70 711 compounds in train set: a very good and diverse dataset. 30 308 compounds in test set. Quality metric: Kappa coefficient (was it changed later?). Results: The initial winner (by leaderboard) was the Ohue laboratory of Tokyo Institute of Technology, with a combination of the BART transformer language model and a GNN with focal loss (solution). The gap between the winner and the rest of the participants was dramatic and raised a lot of questions. Bernhard Rohde (Novartis) was second on the leaderboard. He underlined the same issue and showed that “plate/row/column” has a drastic effect on model quality (explicit IDs were provided: EOS12286). Still, the Bayesian Instance-Based Offset Learning solution he used drew my attention, because it can be app...