Kaggle Solubility competition: overview
The Kaggle Solubility competition is over. It drew the attention of 100 participants, who raised some serious issues along the way. I decided to share what I found when I dug into it...
Data:
70 711 compounds in the training set: a very good and diverse dataset. 30 308 compounds in the test set. Quality metric: Kappa coefficient (was it changed later?).
Results
The initial winner (by leaderboard) was the Ohue laboratory of the Tokyo Institute of Technology, with a combination of the BART transformer language model and a GNN with focal loss (solution). The gap between the winner and the rest of the participants was dramatic and raised a lot of questions.
Bernhard Rohde (Novartis) was second on the leaderboard. He pointed out the same issue and showed that “plate/row/column” has a drastic effect on model quality (explicit IDs were provided: EOS12286). Still, the Bayesian Instance-Based Offset Learning solution he used drew my attention, because it can be applied to bioassay data, where the plate effect is usually prominent.
Third on the leaderboard was the Qubit Pharmaceuticals team. Their model, I guess, was based on smart quantum descriptors developed internally, and of course the solution was not provided to the Kaggle competition (which means they cannot be competition winners).
The issue with this challenge was “pattern leakage”:
The challenge dataset is composed of two datasets: the primary screen and the confirmation screen. All the compounds tested in the confirmation screen (n=2670) were originally labeled as having “low” solubility in the primary screen; some of them were relabeled as having “medium” solubility by the confirmation screen. The confirmation screen result was used to rewrite the rows in the primary screen result table to generate the final challenge data.
This process placed the compounds tested in the confirmation screen at the end of the table. After that, stratified random sampling was used to split out a training set, and the remaining rows were used to generate the test set. As a result, the data points from the confirmation screen, mostly compounds in the “low” and “medium” categories, ended up at the end of the test.csv file. The same pattern did not appear in the training data because of the random sampling. In the test set, the last 817 data points, made up of 681 “low” and 136 “medium”, come from the confirmation screen. The remaining test set comprises 194 “low”, 1081 “medium” and 28216 “high”.
That means a compound's solubility class can be predicted from compound labels and row order alone, without looking at the structure. Bernhard Rohde found this deliberately, and the olab team exploited it unintentionally. The olab team re-calculated the model without the “pattern leakage” and quality dropped significantly. Read the discussion here and here.
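The size of this leak can be estimated from the class counts quoted above, without any chemistry. The sketch below is my own illustration, not anything used in the competition: it rebuilds the test-set labels from those counts and scores two structure-blind baselines with an unweighted Cohen's kappa. The unweighted variant is an assumption, since the post only says "Kappa coefficient", and the names `cohen_kappa`, `HEAD` and `TAIL` are made up for this example.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the two marginal distributions.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Class counts quoted in the text. The order inside each segment is unknown,
# but it does not matter for a rule that only looks at the segment.
HEAD = {"low": 194, "medium": 1081, "high": 28216}  # rows before the confirmation block
TAIL = {"low": 681, "medium": 136}                  # last 817 rows of test.csv


def expand(counts):
    """Turn {label: count} into a flat list of labels."""
    return [label for label, k in counts.items() for _ in range(k)]


head_true, tail_true = expand(HEAD), expand(TAIL)
y_true = head_true + tail_true

# Baseline 1: always predict the majority class.
majority_pred = ["high"] * len(y_true)
# Baseline 2: row position only, "high" before the tail and "low" inside it.
position_pred = ["high"] * len(head_true) + ["low"] * len(tail_true)

kappa_majority = cohen_kappa(y_true, majority_pred)
kappa_position = cohen_kappa(y_true, position_pred)
print(f"majority: {kappa_majority:.3f}, position-only: {kappa_position:.3f}")
```

Under these assumptions the majority-class baseline scores a kappa of zero, while the position-only rule reaches roughly 0.5. The numbers are not directly comparable to the leaderboard if the competition used a different kappa variant, but they show how much signal row order alone carries.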
The actual announced winners are:
- a.kopp.chem, with a pretty complex ensemble of CatBoost and various types of neural networks (mostly GNNs).
- “Beardy Polonium”, with an ensemble of graph neural networks.
- Mr. Maniac, with an AutoML solution (using FLAML) built on 2D and fragment descriptors augmented by other solubility models.
The strongest concern I personally have is that the first-place unbiased score, 0.15731 (Qubit Pharmaceuticals), is not that different from the 20th-place score, 0.09261. This indicates that either a more robust metric was needed for estimating model quality or the challenge data should have been prepared differently.
PS
I participated myself, but for family reasons I did not have time to participate in full force. Still, my simple ensemble-based XGBoost multiclass model with under-sampling initially gave me 12th place, later falling to 20th on the private leaderboard and 24th on the public one. Pretty much OK, taking into account that I spent only 2 hours on this model.
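For readers unfamiliar with ensemble under-sampling, here is a minimal sketch of the idea, not my actual competition code: each ensemble member is trained on a different subsample in which every class is cut down to the size of the rarest class. The function name `balanced_subsamples` and the toy label counts are made up for illustration.

```python
import random
from collections import defaultdict


def balanced_subsamples(labels, n_models, seed=0):
    """Yield one list of row indices per ensemble member.

    Every class is randomly under-sampled to the size of the rarest class,
    and each member gets its own independent draw.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    size = min(len(idx) for idx in by_class.values())
    for _ in range(n_models):
        sample = []
        for idx in by_class.values():
            sample.extend(rng.sample(idx, size))
        rng.shuffle(sample)
        yield sample


# Toy labels with a strong class imbalance (made-up counts, not the real data).
labels = ["high"] * 900 + ["medium"] * 60 + ["low"] * 40
subsets = list(balanced_subsamples(labels, n_models=5))
print([len(s) for s in subsets])
```

Each index list would then be used to fit one multiclass classifier, with the predicted class probabilities averaged across members. Because the majority class is resampled for every member, the ensemble as a whole still sees most of the "high" compounds even though each single model sees a balanced set.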
Update 1:
A great comment by Guillaume Godin, which I completely forgot to mention, indicates that TransformerCNN has the most impact on predictions in the ensemble of 28 models.