Paper comment: "MELLODDY: cross pharma federated learning at unprecedented scale unlocks benefits in QSAR without compromising proprietary information"
Intro
I finally got around to carefully reading the papers by the MELLODDY consortium, led by my ex-supervisor (and fantastic visionary) Hugo Ceulemans. I participated in some very early discussions and saw the painful process of the project's initiation. What Hugo did is colossal, and he helped to build a PRECEDENT! Honestly, the project got far too little PR (compared to the PR of some other ML-for-chemistry companies).
The project united ten big pharma companies to build a federated multi-task model for activity assays (activity on proteins, PK/PD, toxicity): 40,000+ assays on 21+ million compounds with 2.6+ billion end-points. MELLODDY used federated learning to improve the individual models of the consortium partners by confidentially sharing the "model" without sharing the private data itself. It is an exciting approach with a lot of tiny, small, and big challenges related to fairness, data security, MLOps, ML engineering, ML algorithms, and legal questions. In my view, it is the only solution for costly and privacy-centric data. Everything in italic is a direct copy-paste from the paper.
Modeling setup
- Standard quality metrics were used: AUC-ROC and AUC-PR for classification tasks, and Pearson correlation and RMSE for regression. All of those metrics have many limitations, but for the final goal of comparing models on individual tasks they are acceptable.
- Model improvement is hard to measure, and almost impossible for unlabeled datasets. MELLODDY used conformal efficiency as a metric of model improvement: it is the percentage of a given chemical space where the conformal prediction is in-domain (active, inactive or both). For example, an efficiency of 50% means that for 50% of the compounds the conformal prediction will provide a predicted label, e.g. inhibitor at 10 µM or non-inhibitor at 10 µM. And this can be applied to unlabeled data (with the IID limitation).
- RIPtoP (Relative Improvement of Proximity to Perfection) was used to express how much of the gap between a baseline model and a perfect model is closed. It makes improvements comparable across metrics, because each metric follows an individual scale.
- All ten partners reported an improvement in their models, which means federated learning worked.
- The outcomes clearly demonstrate that the partners' original models improved by over 4% in AUC-PR and more than 2% in R², and by up to 12.5% in AUC-PR and 4.8% in R² on average. Some may consider this a minor enhancement; however, even the slightest increase in AUC-PR or AUC-ROC can significantly impact hit rate, project hit triaging, the hit-to-lead optimization pipeline, and compound enrichment. Wait, why wasn't enrichment actually used?
- The median increase in conformal efficiency of the federated over the single-partner model was 5.5% (with gains up to 9.7%). And please, don't let those numbers disappoint you: 5.5% of 1 million compounds is 55,000 compounds that can be used in a decision flow with 95% accuracy. Unfortunately, most of those fall into the inactive space of the prediction flow (meaning that the models got better at predictions in the inactive space), but we always have this heavy imbalance in drug discovery projects. The conformal efficiencies reported by Wouter hide a crucial thing: for some tasks the improvement is rather significant, up to 20%. Also, the paper has a huge limitation IMO: the sampled datasets that were used for the CE calculations are very small. Why not run predictions on an Enamine library, 500M compounds, or even any virtual library of a billion compounds?
- Only 0.3% of the chemogenomics matrix is filled, which means the data is very sparse (the density is very low). The total number of compounds per partner is around 2.1 million.
- Averaging model performance across all assays is very tricky. Some originally noisy assays can produce lower-quality models, thus dragging the median of the model metrics down. It would be very interesting to see the improvement per assay vs how much data is available for that assay (quality vs data density).
- Conformal prediction assumes IID data, which is hard to achieve in a federated learning environment. Should we use CP modifications for this, e.g. "Conformal prediction beyond exchangeability"?
- The chemical descriptors used likely limit the generalization power of the FFNN. This is exactly where 3D descriptors might be very useful.
- Code: https://github.com/melloddy
- Main page: https://www.melloddy.eu/
- Main papers:
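
To make the metrics discussed above a bit more concrete, here is a minimal Python sketch. It is not taken from the MELLODDY code base: the function names (`riptop`, `prediction_set`, `conformal_efficiency`, `fill_fraction`) are my own, the RIPtoP formula is my reading of the name (the share of the baseline-to-perfect gap that gets closed), and I assume that conformal efficiency counts only single-label prediction sets. The p-values in the toy example are made up for illustration.

```python
from typing import Dict, Iterable, Set


def riptop(baseline: float, improved: float, perfect: float = 1.0) -> float:
    """Relative improvement of proximity to perfection (assumed form).

    Returns the share of the gap between the baseline score and a perfect
    score that the improved model closes. Works for higher-is-better
    metrics (perfect=1.0) and lower-is-better ones such as RMSE (perfect=0.0).
    """
    gap = perfect - baseline
    if gap == 0:
        raise ValueError("baseline is already perfect, RIPtoP is undefined")
    return (improved - baseline) / gap


def prediction_set(p_values: Dict[str, float], significance: float = 0.05) -> Set[str]:
    """Keep every label whose conformal p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


def conformal_efficiency(prediction_sets: Iterable[Set[str]]) -> float:
    """Fraction of compounds with exactly one predicted label.

    Assumption: empty sets and two-label ("both") sets count as inefficient.
    """
    sets = list(prediction_sets)
    if not sets:
        raise ValueError("need at least one prediction set")
    return sum(1 for s in sets if len(s) == 1) / len(sets)


def fill_fraction(n_endpoints: float, n_assays: int, n_compounds: int) -> float:
    """Share of the compounds x assays matrix that holds a measurement."""
    return n_endpoints / (n_assays * n_compounds)


if __name__ == "__main__":
    # AUC-PR going from 0.80 to 0.81 closes 5% of the remaining gap to 1.0.
    print(f"RIPtoP: {riptop(0.80, 0.81):.1%}")

    # Toy p-values (made up) for four unlabeled compounds.
    toy_p_values = [
        {"active": 0.40, "inactive": 0.02},  # single label: active
        {"active": 0.01, "inactive": 0.60},  # single label: inactive
        {"active": 0.30, "inactive": 0.20},  # both labels: not efficient
        {"active": 0.03, "inactive": 0.04},  # empty set: not efficient
    ]
    sets = [prediction_set(p) for p in toy_p_values]
    print(f"Conformal efficiency: {conformal_efficiency(sets):.0%}")

    # Headline counts from the intro: 2.6 billion end-points,
    # 40,000 assays, 21 million compounds.
    print(f"Matrix fill: {fill_fraction(2.6e9, 40_000, 21_000_000):.2%}")
```

Note that nothing in `conformal_efficiency` needs true labels, which is why the metric can be computed on unlabeled compounds. The last line also shows that the headline counts from the intro are consistent with the roughly 0.3% fill of the chemogenomics matrix.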