Paper comment: "MELLODDY: cross pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information"

Intro



I finally got my hands on carefully reading the papers by the MELLODDY consortium, led by my ex-supervisor (and fantastic visionary) Hugo Ceulemans. I participated in some very, very early discussions and saw the painful process of the project's initiation. What Hugo did is colossal, and he helped to build a PRECEDENT! Honestly, the project got far too little PR (compared to the PR of some other ML-for-chemistry companies).


The project united ten big pharma companies to build a federated multi-task model for activity assays (activity on proteins, PK/PD, toxicity): 40,000+ assays on 21+ million compounds with 2.6+ billion end-points. MELLODDY used federated learning to improve the individual models of the consortium partners by confidentially sharing the "model" without sharing the private data itself. It is an exciting approach with a lot of tiny, small, and big challenges related to fairness, data security, MLOps, ML engineering, ML algorithms, and legal questions, and it is arguably the only viable solution for costly, privacy-sensitive data. Everything in italic is a direct copy-paste from the paper.


Modeling setup

Data: mostly classical cleaning procedures for any activity data, described well in the paper and in great detail in the official EU report.

Descriptors: MELLODDY used ECFP6 chemical fingerprints folded to 32K bits (which, in my opinion, is a bit too small). Gobbi 2D pharmacophore fingerprints, atom-pair, and topological-torsion fingerprints were also explored "but showed no clear advantage over ECFP fingerprints".
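For the curious, here is a minimal sketch of how such fingerprints are typically generated with RDKit (an assumed pipeline, not the consortium's code): RDKit's Morgan fingerprint with radius 3 corresponds to ECFP6, and nBits controls the folding size.

```python
# Sketch (assumed pipeline, not the consortium's code): ECFP6-style
# fingerprints folded to 32K bits with RDKit. ECFP6 corresponds to a
# Morgan fingerprint of radius 3.
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp6_bits(smiles, n_bits=32768):
    """Return a folded ECFP6 bit vector for one compound, or None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable structure
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=n_bits)

fp = ecfp6_bits("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```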

ML algorithm:
The main ML package was SparseChem, a feedforward neural network library designed to handle large sparse data efficiently. The model is trained on a combined pool of compounds from all partners.


The models (cls, reg) have a shared "trunk" that learns a compound representation common to all partners, and private "heads" that learn each partner's own tasks: "On the platform, the weights of the common trunk could be trained in a federated way by applying secure aggregation of the individual gradients from each minibatch of the contributing partners." Compounds are thus matched across partners without sharing any assay information, keeping end-points completely private: a very secure setup, but one that limits the "feedback" loop significantly.
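SparseChem is open source, but the sketch below is deliberately not its API: just a hedged toy illustration in PyTorch of the trunk/head split, with plain gradient averaging standing in for the platform's secure aggregation, and private labels never leaving the partner loop.

```python
# Toy illustration (NOT SparseChem's API): a shared trunk learns a compound
# representation; each partner keeps a private head for its own tasks.
# Plain gradient averaging stands in for the platform's secure aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_BITS, HIDDEN, N_TASKS = 32768, 2000, 100

class Trunk(nn.Module):                     # shared across all partners
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_BITS, HIDDEN), nn.ReLU())
    def forward(self, x):
        return self.net(x)

trunk = Trunk()
heads = {p: nn.Linear(HIDDEN, N_TASKS) for p in ("partner_a", "partner_b")}

# One conceptual federated round: each partner computes trunk gradients on
# its own private minibatch; only those gradients are aggregated.
agg = [torch.zeros_like(p) for p in trunk.parameters()]
for name, head in heads.items():
    x = (torch.rand(64, N_BITS) < 0.001).float()    # sparse fingerprint batch
    y = torch.randint(0, 2, (64, N_TASKS)).float()  # private labels stay local
    loss = F.binary_cross_entropy_with_logits(head(trunk(x)), y)
    grads = torch.autograd.grad(loss, list(trunk.parameters()))
    for a, g in zip(agg, grads):
        a += g / len(heads)                 # stand-in for secure aggregation

with torch.no_grad():                          # shared trunk update only;
    for p, g in zip(trunk.parameters(), agg):  # head updates stay local
        p -= 0.01 * g
```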


Splits: MELLODDY used a very conservative setup to generate the train-test splits, based on either chemical clusters or scaffolds. This gives lower model-evaluation metrics than random partitions, which are too optimistic and detached from a real drug-discovery project setup. The train-test split for multi-task modeling under federated learning poses an additional challenge, which Jaak Simm addressed in the ChemFold paper.
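The consortium's actual fold assignment is more involved (see ChemFold), but a plain Bemis-Murcko scaffold split, sketched here with RDKit, captures the conservative idea: whole chemotypes are held out together.

```python
# Sketch: a plain Bemis-Murcko scaffold split with RDKit. Compounds sharing
# a scaffold land in the same fold, so the test set holds unseen chemotypes.
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    by_scaffold = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        by_scaffold[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(i)
    # assign whole scaffold groups, largest first, filling train up to quota
    groups = sorted(by_scaffold.values(), key=len, reverse=True)
    n_train_target = (1 - test_fraction) * len(smiles_list)
    train, test = [], []
    for group in groups:
        (train if len(train) < n_train_target else test).extend(group)
    return train, test
```

Because whole scaffold groups are assigned together, the realized split sizes only approximate the requested fraction.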

Quality control
  1. Standard quality metrics were used: AUC-ROC and AUC-PR for classification tasks, and Pearson correlation and RMSE for regression. All of these metrics have many limitations, but they are acceptable for the final goal of comparing models on individual tasks.
  2. Model improvement is hard to measure, and almost impossible to measure on unlabeled datasets. MELLODDY used conformal efficiency as a metric of model improvement: the percentage of a given chemical space where the conformal predictor returns a decisive single-label prediction (rather than "both" or neither class). For example, an efficiency of 50% means that for 50% of compounds the conformal predictor will provide a predicted label, e.g. inhibitor at 10uM or non-inhibitor at 10uM. And this can be applied to unlabeled data (with the IID limitation). See the sketch after this list.
  3. RIPtoP (Relative Improvement of Proximity to Perfection) was used to compare models against a baseline and a hypothetical perfect model, because each metric follows its own scale.
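To make items 2 and 3 concrete, here is a small, hedged sketch: a label-conditional (Mondrian) inductive conformal predictor for a binary task, with efficiency as the fraction of single-label predictions, plus my reading of RIPtoP as the fraction of the remaining gap to a perfect score that the improved model closes (check the paper for the exact definition).

```python
import numpy as np

def riptop(baseline, improved, perfect=1.0):
    # My reading of RIPtoP: how much of the remaining gap to a perfect
    # score the improved model closes (assumption, not the paper's text).
    return (improved - baseline) / (perfect - baseline)

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Label-conditional (Mondrian) inductive conformal predictor for a
    binary task. Nonconformity = 1 - predicted probability of the class."""
    thresholds = {}
    for c in (0, 1):
        scores = 1 - cal_probs[cal_labels == c, c]
        k = int(np.ceil((1 - alpha) * (len(scores) + 1))) - 1
        thresholds[c] = np.sort(scores)[min(k, len(scores) - 1)]
    return [{c for c in (0, 1) if 1 - p[c] <= thresholds[c]} for p in test_probs]

def efficiency(prediction_sets):
    # fraction of compounds with a decisive single-label prediction
    return float(np.mean([len(s) == 1 for s in prediction_sets]))

# Usage on stand-in predicted probabilities:
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet([1, 1], size=200)
cal_labels = rng.integers(0, 2, size=200)
sets = conformal_sets(cal_probs, cal_labels, rng.dirichlet([1, 1], size=50))
print("efficiency:", efficiency(sets))
print("RIPtoP example:", riptop(baseline=0.70, improved=0.74))  # 0.133...
```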

Results
  • All ten partners reported an improvement in their models. This means federated learning worked.
  • The outcomes clearly demonstrate that the partners' original models improved on average by over 4% in AUC-PR and more than 2% in R², with gains of up to 12.5% in AUC-PR and 4.8% in R². Some may consider this a minor enhancement; however, even the slightest increase in AUC-PR or AUC-ROC can significantly impact hit rate, project hit triaging, the hit-to-lead optimization pipeline, and compound enrichment. Wait, why wasn't enrichment actually used?
  • The median increase in conformal efficiency of the federated model over the single-partner model was 5.5% (with gains up to 9.7%). And please, don't let those numbers disappoint you: 5.5% of 1 million compounds is 55,000 compounds that can be used in a decision flow with 95% accuracy. Unfortunately, most of those fall into the inactive region of the prediction space (meaning the models mostly got better at predicting inactives), but we always have this heavy imbalance in drug-discovery projects. The conformal efficiencies reported by Wouter hide a crucial point: for some tasks the improvement is rather significant, up to 20%. The paper also has a big limitation IMO: the sampled datasets used for the CE calculations are very small. Why not run predictions on the Enamine library, 500M compounds, or even any billion-size virtual library?
Notes:
  • Only 0.3% of the chemogenomics matrix is filled, meaning the data is extremely sparse (very low density); a quick sanity check of that fill rate is sketched below. The total number of compounds per partner is around 2.1 million.
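A back-of-the-envelope check using the headline numbers from the intro (21M+ compounds, 40K+ assays, 2.6B+ end-points):

```python
# Back-of-the-envelope check of the ~0.3% fill rate using the headline
# numbers from the intro (21M+ compounds, 40K+ assays, 2.6B+ end-points).
compounds, assays, endpoints = 21e6, 40e3, 2.6e9
print(f"fill rate: {endpoints / (compounds * assays):.2%}")  # ~0.31%
```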
Open questions:
  1. Averaging model performance across all assays is very tricky. Some inherently noisy assays can produce lower-quality models, thus dragging the median model metrics down. It would be very interesting to see the improvement per assay versus how much data is available for that assay (quality vs. data density).
  2. Conformal prediction assumes IID (exchangeable) data, which is hard to achieve in a federated learning environment. Should we use CP modifications for this, e.g. "Conformal prediction beyond exchangeability"? (See the sketch after this list.)
  3. The chemical descriptors used likely limit the generalization power of the FFNN. This is actually exactly where 3D descriptors might be very useful.
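On question 2: the central trick of "Conformal prediction beyond exchangeability" (Barber et al.) is to replace the plain calibration quantile with a weighted one. A minimal sketch, assuming the weighting scheme itself is given as an input:

```python
import numpy as np

def weighted_conformal_threshold(cal_scores, weights, alpha=0.05):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.
    With equal weights this reduces to standard conformal calibration;
    down-weighting less-exchangeable calibration points (the weighting
    scheme is an assumed input here) softens the IID requirement."""
    order = np.argsort(cal_scores)
    s = np.asarray(cal_scores, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    w = np.append(w, 1.0)                 # weight of the test point itself
    cum = np.cumsum(w) / np.sum(w)
    idx = int(np.searchsorted(cum[:-1], 1 - alpha))
    if idx >= len(s):
        return np.inf                     # too little mass: full prediction set
    return s[idx]
```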

Links:



