Posts

Paper comment: "MELLODDY: cross pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information"

Intro: I finally got my hands on the papers by the MELLODDY consortium, led by my ex-supervisor (and fantastic visionary) Hugo Ceulemans, and read them carefully. I participated in some very early discussions and saw the painful process of the project's initiation. What Hugo did is colossal, and he helped to set a PRECEDENT! Honestly, the project got too little PR (compared to the PR of some other ML-for-chemistry companies). The project united ten big pharma companies to build a federated multi-task model for activity assays (activity on proteins, PK/PD, toxicity): 40,000+ assays on 21+ million compounds with 2.6+ billion endpoints. MELLODDY used federated learning to improve the individual models of the consortium partners by confidentially sharing the "model" without sharing the private data itself. It is an exciting approach with many challenges, small and large, related to fairness, data security, MLOps, ML engineering, ML algorithms, and legal questions. ...

ChatGPT will bring revolution to knowledge management and insight generation. Are we ready for this?

I have never done a cyclic peptide design in my life. So, when I was approached with a question on how to optimize the permeability of one, I guessed that more lipophilicity would help, and that likely there are some specific neutral "vectors" that would also help. But the usual step is to go through several papers and get the answer, obvious, right? I started with a very good article: "Optimizing PK properties of cyclic peptides: the effect of side chain substitutions on permeability and clearance". I read it, drew my conclusions, and decided to feed it to ChatGPT (3.5) with the request to "please extract very concise medicinal chemistry insights that can be further reused for peptide optimization". Here is what I got, and it was very neat: 1. Modification of peptide side chains significantly alters their log D values and subsequently their in vitro properties. For instance, aromatic or polar aliphatic side chains didn't significantly reduce log D, whereas ioniz...

Can we "talk" with the data? A tiny case for testing pandas_ai for human clearance data

This is a test of the pandas_ai package by Gabriele Venturi. PandasAI is a layer between the pandas data frame and LLMs that simplifies interaction with the data and makes it more conversational. I did a quick test on AstraZeneca clearance data from ChEMBL to see whether pandas_ai eases data manipulation. Google colab notebook is here. **TLDR**: the proposed solution could be more optimal, but it works! There are some issues with linking abstract things, but current LLMs do not have a defined knowledge graph, so there are no surprises. More packages that ease interaction with private data will appear soon. The algorithm behind the package produces real code for data frame filtering and manipulation. Because of this, it will fail if the data types of the columns are not correctly defined. Simple request: Provide names for 5 compounds with the highest LogP. Answer: looks excellent, proper linking name to 'Molecule Name' and 'IDs' to 'Molecule ChEMBL ID...
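To illustrate why column data types matter for generated code, here is a minimal sketch of the kind of pandas code such a request could translate to. This is not the code produced by pandas_ai; the toy table, its values, and the `LogP` column name are hypothetical, and only 'Molecule Name' comes from the post.

```python
import pandas as pd

# Hypothetical toy table: values are made up, and "LogP" is an assumed column name.
df = pd.DataFrame(
    {
        "Molecule Name": ["cmpd_a", "cmpd_b", "cmpd_c", "cmpd_d", "cmpd_e", "cmpd_f"],
        # Numbers stored as strings, as often happens after a CSV export.
        "LogP": ["2.1", "4.7", "None", "3.3", "0.9", "5.2"],
    }
)

# On string columns a numeric ranking either fails or sorts lexicographically,
# so coerce to numbers first; unparseable entries become NaN.
df["LogP"] = pd.to_numeric(df["LogP"], errors="coerce")

# "Provide names for 5 compounds with the highest LogP."
top5 = df.dropna(subset=["LogP"]).nlargest(5, "LogP")[["Molecule Name", "LogP"]]
print(top5)
```

Coercing with `errors="coerce"` is one possible choice; it silently turns bad entries into missing values, so it is worth checking how many rows were lost.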

How many actual bits in unfolded Morgan fingerprint?

The simplest descriptors used in ML for chemistry are fragments, and they are usually used in a folded form. The default folded size is 1024, with some recommendations to use higher dimensionality (>=4096) for computational chemogenomics tasks. Descriptors are folded for several reasons: to save space and to get a uniform length, at the price of bit collisions and, sometimes, insensitivity to small changes. Let's check how many actual bits there are in drugs, as an example of FDA-approved chemical space, vs a random selection of 300K compounds from the Enamine REAL virtual space of 3B compounds commonly used in virtual screening. (Figure: number of unique bits.) So, as we can see, for radius 3 there are on average 16 bits per compound in drug space, going down to 4 and 2 bits for the larger chemical spaces of 100K and 300K diverse compounds from the Enamine REAL library. Enamine virtual space is, of course, repetitive, but it still indicates that with a l...
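The bit-collision cost of folding can be sketched in plain Python. This is a toy illustration, not a real fingerprint calculation: the fragment identifiers are random 32-bit integers standing in for unfolded fragment hashes, and folding by modulo is an assumption about how the folding is done.

```python
import random


def fold(fragment_ids, n_bits):
    """Fold sparse fragment identifiers into n_bits positions via modulo."""
    return {fid % n_bits for fid in fragment_ids}


def collision_count(fragment_ids, n_bits):
    """Number of distinct fragments lost because they share a folded bit."""
    unique = set(fragment_ids)
    return len(unique) - len(fold(unique, n_bits))


# Hypothetical stand-in for the unfolded fragment identifiers of a data set.
rng = random.Random(0)
fragment_ids = [rng.getrandbits(32) for _ in range(5000)]

for n_bits in (1024, 4096, 2**20):
    print(n_bits, collision_count(fragment_ids, n_bits))
```

With 5000 distinct fragments and only 1024 positions, most fragments must collide; the count drops as the folded size grows.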

Kaggle Solubility competition: overview

Kaggle Solubility competition is over. It drew the attention of 100 participants, who raised some serious issues. I decided to share my dive into this... Data: 70 711 compounds in train set: a very good and diverse dataset. 30 308 compounds in test set. Quality metric: Kappa coefficient (was it changed later?). Results: The initial winner (by leaderboard) was the Ohue laboratory of Tokyo Institute of Technology, with a combination of the BART transformer language model and a GNN with focal loss (solution). The difference between the winner and the rest of the participants was dramatic and raised a lot of questions. Bernhard Rohde (Novartis) was the second leaderboard winner. He underlined the same issue and showed that “plate/row/column” has a drastic effect on model quality (explicit IDs were provided: EOS12286). Still, the Bayesian Instance-Based Offset Learning solution he used drew my attention, because it can be app...
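For readers unfamiliar with the metric, here is a minimal sketch of the plain (unweighted) Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. Whether the competition used this variant or a weighted one is an assumption; the labels below are made up.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal class frequencies of both sequences.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    if p_e == 1.0:
        return float("nan")  # undefined when both sequences use one single class
    return (p_o - p_e) / (1 - p_e)


# Made-up solubility classes for six compounds.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(cohen_kappa(y_true, y_pred))
```

Unlike plain accuracy, kappa discounts agreement that a model would reach just by predicting the dominant class.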