TDC: Machine Learning and Biomedicine (Datasets and Leaderboards)

Biomedicine is one of the most important application areas of machine learning. But biomedicine covers a wide variety of tasks, its data are complex, and acquiring and processing those data requires substantial domain expertise. As a result, many machine learning researchers work only on a small number of well-known, heavily studied tasks, while a large number of meaningful tasks that badly need new machine learning methods are ignored. To address this problem, a group of students and professors from Harvard, MIT, Stanford, CMU, UIUC, Georgia Tech, and IQVIA launched Therapeutics Data Commons (TDC), the first large-scale collection of machine learning datasets for biomedicine.

TDC currently contains 20+ meaningful tasks and more than 70 high-quality datasets, covering target discovery, pharmacokinetics, safety, and drug manufacturing. It includes not only small molecules but also antibodies, vaccines, miRNAs, and more; CRISPR, clinical trials, and other areas will be added later. All data are processed so they can be fed directly into machine learning models, and many of the tasks are still largely unexplored by the ML community. We also provide leaderboards for comparing models against the state of the art.

Everyone is welcome to use TDC and provide suggestions! For more information, please visit the website and GitHub !

Website : zitniklab.hms.harvard.edu 

GitHub : github.com/mims-harvard 

Citation:

@misc{tdc,
author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, Connor and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
title ={Therapeutics Data Commons: Machine Learning Datasets for Therapeutics},
howpublished={\url{https://zitniklab.hms.harvard.edu/TDC/}},
month=nov,
year=2020
}

Team : Harvard University, Georgia Tech, Massachusetts Institute of Technology, Carnegie Mellon University, Stanford University, IQVIA, University of Illinois at Urbana-Champaign


Background

In recent years, machine learning (ML) has been increasingly developed and applied in biomedicine. For example, AlphaFold2 recently produced a dramatic improvement in protein structure prediction [1], and ML was used to discover the powerful antibiotic Halicin [2]. However, acquiring raw biomedical data and processing it into ML-ready form requires a great deal of domain expertise, and it is difficult for ML researchers to do this quickly and accurately. Moreover, biomedicine is a huge field: many datasets are scattered across different corners of the literature, and there is no central platform that organizes them and makes them easy to obtain. For these reasons, ML researchers currently focus their method development on a handful of tasks, improving results on a few small datasets, while a large number of meaningful tasks have seen no ML progress at all. This greatly slows down ML research in biomedicine.

TDC introduction


TDC Team

To solve this problem, a group of us, including colleagues from Harvard and Georgia Tech, Wenhao of MIT, myself, Yusuf of Stanford, and our mentors Connor, Jure, Jimeng, Danica, and Marinka, jointly launched Therapeutics Data Commons (TDC), the first large-scale collection of ML datasets and benchmarks in biomedicine.


TDC Overview

In the first release, we compiled more than 20 highly meaningful ML tasks in biomedicine and more than 70 datasets, covering target discovery, pharmacokinetics, safety, and drug manufacturing, and spanning not only small molecules but also antibodies, vaccines, miRNAs, and more. Beyond the data themselves, we noticed that research in this area repeatedly relies on a set of common data utilities, so we also provide many helper functions to support ML research in biomedicine. All ML-ready data and data functions can be obtained with only three lines of code!


TDC Vision

Through TDC, our ultimate goal is to serve as a connection point: people in the biomedical field identify meaningful domain problems, TDC formulates them as ML tasks and processes them into ML-ready data, and ML researchers can then quickly use TDC to design cutting-edge methods. In this way, we hope to help the ML community focus on solving practical and valuable biomedical problems.


TDC Modular Design

The structure of TDC

Next, let me introduce TDC in detail. TDC aims to cover a wide variety of tasks, each with a different data structure. We therefore propose a three-tier hierarchy, which we call the TDC "Central Dogma". As far as we know, this is the first attempt at a systematic framework for evaluating machine learning across the entire biomedical field.

The first layer is the Problem. We group all tasks into three broad ML problem types:

  • Single-instance prediction: predict properties of a single entity (such as a molecule or protein).
  • Multi-instance prediction: predict properties of interactions between multiple entities (such as reaction types).
  • Generation: given a set of entities, generate new entities with desired properties (such as optimized molecules).

The second layer is the learning Task. Each task belongs to one of the problem types and is defined from a biomedical perspective. Applications range from designing new antibodies and identifying personalized combination therapies to improving disease diagnosis and finding new treatments for emerging diseases.

Finally, in the third layer of TDC, each task is instantiated by multiple Datasets.

To sum up, there are three problem types, each with many learning tasks, and each learning task has many datasets. This three-tier structure lets us organize and navigate TDC clearly.
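
To make the hierarchy concrete, TDC also ships small browsing utilities. The snippet below is a minimal sketch, assuming the retrieve_dataset_names helper in tdc.utils works as I recall; check the documentation if the import path differs in your version:

from tdc.utils import retrieve_dataset_names

# List every dataset that instantiates the ADME learning task
# (the third layer of the hierarchy, under the single-instance prediction problem).
adme_datasets = retrieve_dataset_names('ADME')
print(adme_datasets)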

TDC's programming framework

In the TDC programming framework, we design a base data loader class for each Problem, and each Task inherits from that base class. So suppose you want to retrieve dataset "X" for learning task "Y" under problem "Z"; you just need to enter:

from tdc.Z import Y          # problem "Z", learning task "Y"
data = Y(name = 'X')         # dataset "X"
splits = data.get_split()    # train/valid/test split

For example, suppose you want to obtain the bioavailability dataset for the drug ADME prediction task; you can directly enter:

from tdc.single_pred import ADME
data = ADME(name = 'Bioavailability_Ma')
split = data.get_split()

This split is a dictionary of three Pandas DataFrames (train, validation, and test); each row contains a drug molecule as a SMILES string, the most common input format for drug molecules (it can be understood as a character representation of the molecular graph), together with its bioavailability label.
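
For orientation, here is a minimal sketch of inspecting that split; the column names ('Drug_ID', 'Drug', 'Y') are how I recall TDC formats single-instance prediction data and may vary by version:

from tdc.single_pred import ADME

data = ADME(name = 'Bioavailability_Ma')
split = data.get_split()

# split is a dict with 'train', 'valid', and 'test' DataFrames.
train_df = split['train']
print(train_df.columns)  # expected: Drug_ID, Drug (SMILES string), Y (label)
print(train_df.head())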

TDC data


TDC Datasets Snapshot.

As I said at the beginning, those three lines of code are all you need to obtain more than 70 meaningful datasets across more than 20 important biomedical tasks! Many of these are brand-new tasks or brand-new datasets from the perspective of ML method research, largely unexplored territory. To give a few examples:

  • ADMET prediction: ADMET covers a series of important indicators that measure whether a drug molecule can safely and effectively reach its intended target after oral administration. Predicting these indicators accurately can save pharmaceutical companies enormous resources. Previously, many web servers offered ADMET prediction, but their data were not public. TDC has collected, from many small databases, scattered journal articles, and other sources, indicators actually used in more than 20 pharmaceutical companies, and all of the data are open source!
  • Precision drug combinations: There are two major trends here. First, the same drug can have different effects in different patients, especially for cancer drugs, so using ML to predict drug response from a patient's gene expression is very important. TDC processes a large dataset from GDSC [3], where each data point pairs a drug molecule with the gene expression of a cell line and records the response between them. Second, combinations of drug molecules can work better than any single molecule (drug synergy) and can save a lot of development time, so predicting whether two drugs have a combined effect is very meaningful. TDC processed two large datasets (from Merck [4] and NCI [5]); each data point contains two drug molecular structures and a cell line expression profile, together with their synergy score. A short loading sketch follows this list.
  • Biologics: In recent years ML has produced a lot of good work on small molecules, but much less on large-molecule biologics. TDC therefore includes six biologics tasks, such as antibody-antigen affinity prediction, peptide-MHC affinity prediction, and miRNA-target interaction prediction.
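
As a minimal loading sketch for these multi-instance datasets (the class and dataset names below are my best recollection of the TDC catalog and may differ slightly from the released identifiers; the website lists the exact names):

from tdc.multi_pred import DrugRes, DrugSyn

# Drug response: drug structure plus cell-line gene expression -> sensitivity value.
response = DrugRes(name = 'GDSC1')

# Drug synergy: two drug structures plus cell-line expression -> synergy score.
synergy = DrugSyn(name = 'OncoPolyPharmacology')

split = synergy.get_split()
print(split['train'].head())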

We are also preparing to include 3D drug molecule and protein tasks, CRISPR gene editing tasks (off-target prediction, repair outcome), and clinical trial tasks. If you have a new and interesting task, please contact us too!

TDC Leaderboard


Absorption Category in ADMET Benchmark Group

On top of these datasets, TDC also provides a variety of leaderboards so that ML researchers can compare model performance. Every TDC dataset can serve as a benchmark, but we have observed that for an ML model to be genuinely useful for a biomedical problem, it must perform well across a whole series of datasets and tasks. We therefore merge related sub-benchmarks into a benchmark group (Benchmark Group). All sub-benchmarks in a group revolve around one meaningful biomedical problem, and the evaluation metrics and train/test splits are designed to simulate the corresponding real-world application. For example, our first benchmark group is ADMET property prediction: a good ML model must predict not just one ADMET property well, but all of them. This group therefore contains 22 ADMET property prediction benchmarks, and we use scaffold split to simulate the deployment scenario of an actual pharmaceutical company.

For a Benchmark Group, TDC provides a programming framework that lets ML researchers build and evaluate models quickly. Using the ADMET group as an example:

from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
predictions = {}

for benchmark in group:
    name = benchmark['name']
    train, valid, test = benchmark['train'], benchmark['valid'], benchmark['test']
    ## --- train your model --- ##
    predictions[name] = y_pred

group.evaluate(predictions)
# {'caco2_wang': {'mae': 0.234}, 'hia_hou': {'roc-auc': 0.786}, ...}

You can also get each sub-benchmark:

benchmark = group.get('Caco2_Wang')
predictions = {}

name = benchmark['name']
train, valid, test = benchmark['train'], benchmark['valid'], benchmark['test']
## --- train your model --- ##
predictions[name] = y_pred

group.evaluate(predictions)
# {'caco2_wang': {'mae': 0.234}}

In addition to the standard train/valid/test split, you can also obtain multiple alternative train/valid splits to test the robustness of your model:

out = group.get_auxiliary_train_valid_split(seed = 42, benchmark = 'Caco2_Wang')
train, valid = out['train'], out['valid']
## --- train your model --- ##
group.evaluate(y_pred_val, y_true_val, benchmark = 'Caco2_Wang')
# {'mae': 0.234}

TDC's first leaderboard, for ADMET, has already been released! ADMET is very important, and it is a very suitable entry point for ML researchers without any biomedical background. Everyone is welcome to submit results; there is more information on our website.

TDC Data Processing Functions


TDC Data Functions

In addition to the core datasets and leaderboards, TDC also contains a variety of data functions. There are four main blocks at the moment:

  • Model evaluation: TDC provides evaluators that score predictions for any TDC task in just three lines of code (a short sketch follows the oracle example below).
  • Data splits: train/test split strategies that simulate real biomedical scenarios, such as scaffold split.
  • Data processing: helpers such as visualization, label conversion, and binarization.
  • Molecular generation oracles: in molecular generation tasks, the general goal is to generate new drug molecules with better properties, where the property is scored by a gold-label function called an oracle. An oracle therefore defines a molecular generation task. TDC has collected more than 20 meaningful oracles, each of which requires only three lines of code. For example:
from tdc import Oracle
oracle = Oracle(name = 'GSK3B')
oracle(['CC(C)(C)[C@H]1CCc2c(sc(NC(=O)COc3ccc(Cl)cc3)c2C(N)=O)C1', \
        'CCNC(=O)c1ccc(NC(=O)N2CC[C@H](C)[C@H](O)C2)c(C)c1', \
        'C[C@@H]1CCN(C(=O)CCCc2ccccc2)C[C@@H]1O'])
# [0.03, 0.0, 0.0]
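
As mentioned in the model evaluation bullet above, here is a minimal sketch of the evaluator and split utilities. The metric name 'ROC-AUC' and the method = 'scaffold' argument reflect my understanding of the TDC API; please check the documentation if your version differs.

from tdc import Evaluator
from tdc.single_pred import ADME

# Model evaluation: score predictions against ground truth with a named metric.
evaluator = Evaluator(name = 'ROC-AUC')
y_true = [0, 1, 1, 0]            # toy labels for illustration
y_pred = [0.1, 0.8, 0.6, 0.3]    # toy model scores
print(evaluator(y_true, y_pred))

# Data split: scaffold split to test generalization to new chemical scaffolds.
data = ADME(name = 'Bioavailability_Ma')
split = data.get_split(method = 'scaffold', seed = 1, frac = [0.7, 0.1, 0.2])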

Later we will gradually add more meaningful data functions, such as conversion between various molecular formats and more practical oracles (for example, docking scores and the number of retrosynthesis steps).

Install TDC

The core of TDC has minimal environment requirements; you can install it with pip:

pip install PyTDC

Some of the more specialized data functions require additional dependencies; for those, you can install TDC via conda-forge instead:

conda install -c conda-forge pytdc

TDC is an open source community

The original goal of TDC is to connect researchers in biomedicine and ML and to accelerate the application of ML in biomedicine. Anyone interested in this area is very welcome to join us and contribute to TDC in any form.

TDC website and GitHub


TDC Website

All detailed information is on the TDC website: zitniklab.hms.harvard.edu

You are also welcome to star TDC's GitHub repository at github.com/mims-harvard and check out the latest news from TDC!

Reference

[1] AlphaFold: a solution to a 50-year-old grand challenge in biology.

[2] Stokes, Jonathan M., et al. "A deep learning approach to antibiotic discovery." Cell. 180.4 (2020): 688-702.

[3] Yang, Wanjuan, et al. "Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells." Nucleic acids research 41.D1 (2012): D955-D961.

[4] O'Neil, Jennifer, et al. “An unbiased oncology compound screen to identify novel combination strategies.” Molecular cancer therapeutics 15.6 (2016): 1155-1162.

[5] Zagidullin, Bulat, et al. “DrugComb: an integrative cancer drug combination data portal.” Nucleic acids research 47.W1 (2019): W43-W51.
