ml dl el学习
Applying machine learning and deep learning to drug discovery, genomics, microscopy and quantum chemistry can have a radical impact and has the potential to significantly accelerate medical research and vaccine development, which is a necessity during any pandemic like Covid19.
Before we begin: this article is a very high-level one, specifically targeted at data scientists and ML researchers interested in drug discovery, especially during an ongoing pandemic like Covid19. If you are someone with a strong background in bioinformatics or cheminformatics who wants to venture into the world of data science for these use cases, please reach out to me through any of the options mentioned here, and we can discuss a few interesting opportunities for the greater good of mankind.
DeepChem is an open source framework, built internally on TensorFlow, that has been specifically designed to simplify the creation of deep learning models for various life science applications.
In this tutorial, we will see how to set up DeepChem and how to use it for:
1. training a model that can predict the toxicity of molecules
2. training a model to predict the solubility of molecules
3. using SMARTS strings to query molecular structures.
Setting Up DeepChem
In multiple sources I have seen users express concern about setting up DeepChem in Windows, Linux and Mac environments, but I found it quite easy to do using the pip installer.
The DeepChem development team is very active and provides daily builds, so I would suggest that everyone take a look at their PyPI page, https://pypi.org/project/deepchem/#history, and install a suitable earlier version in case the latest one has any issues. A simple pip install deepchem installs the very latest version.
Next, along with DeepChem, you need TensorFlow and RDKit, an open source cheminformatics software package. I installed the latest version of TensorFlow using pip install tensorflow. For RDKit on Windows, I did not find any reliable pip installer, so I installed it from https://anaconda.org/rdkit/rdkit using the conda installer: conda install -c rdkit rdkit
Once these three packages are installed, we are ready to start our experiments.
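Before starting, it can help to confirm that the three packages are actually importable in the current environment. The following is a minimal sketch using only the Python standard library; the helper name find_missing is hypothetical, and the import names (deepchem, tensorflow, rdkit) are assumed to match the packages installed above.

```python
import importlib.util


def find_missing(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names assumed for the three packages used in this tutorial.
required = ["deepchem", "tensorflow", "rdkit"]
missing = find_missing(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that each package can be located, not that the installed versions are compatible with each other.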
Predicting toxicity of molecules
Molecular toxicity can be defined as the sum of the adverse effects a substance exhibits on an organism. Computational methods can estimate the toxicity of a given compound from the chemical and structural properties of the molecule. Molecular featurization using molecular descriptors (Dong et al., 2015) and fingerprints (Xue and Bajorath, 2000) can effectively extract the chemical and structural information inherent in a molecule for prediction-based approaches.
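To make the idea of a fingerprint concrete, here is a deliberately simplified sketch: it hashes short substrings of a SMILES string into a fixed-length bit vector. This is a toy illustration only. Real fingerprints are computed from the molecular graph rather than from the text of the SMILES string, and the function name, the 1024-bit length and the substring length are all assumptions made for this example.

```python
import zlib


def toy_fingerprint(smiles, n_bits=1024, max_len=3):
    """Hash every substring of up to max_len characters into a bit vector."""
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            index = zlib.crc32(fragment.encode("utf-8")) % n_bits
            bits[index] = 1
    return bits


fp = toy_fingerprint("CCOC(=O)CC")
print(len(fp), "bits,", sum(fp), "set")
```

The point of the sketch is the shape of the output: every molecule, whatever its size, becomes a vector of the same length that a model can consume.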
To predict toxicity, we will use the Tox21 toxicity dataset from MoleculeNet, and we will use DeepChem to load it.
import numpy as np
import deepchem as dc
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21()
After this, we can see all the toxicity tasks just by printing tox21_tasks:
['NR-AR',
'NR-AR-LBD',
'NR-AhR',
'NR-Aromatase',
'NR-ER',
'NR-ER-LBD',
'NR-PPAR-gamma',
'SR-ARE',
'SR-ATAD5',
'SR-HSE',
'SR-MMP',
'SR-p53']
We can unpack the dataset into training, validation and test sets with:
train_dataset, valid_dataset, test_dataset = tox21_datasets
If we check the distribution of the dataset, we will see that it is not balanced, so we need to balance it, since we are trying to solve a multi-task classification problem (one binary label for each of the twelve tasks above). If the dataset is not balanced, the majority class will bias the classifier and skew the results. For this reason, the transformer applied by default is a balancing transformer.
print(transformers)
[<deepchem.trans.transformers.BalancingTransformer at 0x26b5642dc88>]
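The idea behind balancing can be sketched in a few lines of plain Python: give each sample a weight so that the positive and negative classes of a task carry the same total weight. This is an illustration of the concept under that assumption, not DeepChem's actual implementation, and the function name and labels are hypothetical.

```python
def balancing_weights(labels):
    """Weight samples so both classes of one task carry equal total weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        # Only one class is present, so there is nothing to balance.
        return [1.0] * len(labels)
    # Scale positives so their total weight equals the negatives' total.
    pos_weight = n_neg / n_pos
    return [pos_weight if y == 1 else 1.0 for y in labels]


labels = [1, 0, 0, 0]  # hypothetical labels for a single task
weights = balancing_weights(labels)
print(weights)  # [3.0, 1.0, 1.0, 1.0]
```

With these weights, the single positive sample counts as much in the loss as the three negatives combined, which keeps the majority class from dominating training.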
Now, for the training part:
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
train_scores = model.evaluate(train_dataset, [metric], transformers)
test_scores = model.evaluate(test_dataset, [metric], transformers)
DeepChem’s dc.models submodule contains a variety of different life science-specific models.
Finally, we see that the AUC-ROC scores are:
{'training mean-roc_auc_score': 0.9556297601807405}
{'testing mean-roc_auc_score': 0.7802496964641786}
This shows that there is some overfitting in the model, as the test-set score is much lower than the training-set score. Nevertheless, we now have a model that can predict the toxicity of molecules!
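For readers unfamiliar with the metric, the mean ROC AUC can be sketched in plain Python. ROC AUC for one task equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half, and the reported number is the mean over tasks. The function names and toy data below are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of correctly ranked (positive, negative) pairs."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half a correct pair
    return correct / (len(pos) * len(neg))


def mean_roc_auc(tasks):
    """Average the per-task ROC AUC over (labels, scores) pairs."""
    return sum(roc_auc(y, s) for y, s in tasks) / len(tasks)


# Two hypothetical tasks with made-up labels and scores.
toy_tasks = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),  # AUC 0.75
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),   # AUC 1.0
]
print(mean_roc_auc(toy_tasks))  # 0.875
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a large gap between the training and test values points to overfitting.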
Predicting solubility of molecules
Solubility is a measure of how easily a molecule dissolves in water. In drug discovery, it is very important to check the solubility of a compound, as the drug must dissolve into the patient’s bloodstream to have the required therapeutic effect. Medicinal chemists usually spend a lot of time modifying molecules to increase their solubility. In this section we will use DeepChem to predict the solubility of molecules.
We will use the Delaney dataset from MoleculeNet, which is also available in DeepChem, to predict molecular solubility.
# Load the featurized data
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
# Split into train, validation and test datasets
train_dataset, valid_dataset, test_dataset = datasets
# Fit the model
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)
model.fit(train_dataset, nb_epoch=100)
# Use the squared Pearson correlation (r2) as the model evaluation metric
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(model.evaluate(train_dataset, [metric], transformers))
print(model.evaluate(test_dataset, [metric], transformers))
Even in this first pass, the evaluation results show some overfitting:
{'training pearson_r2_score': 0.9203419837932797}
{'testing pearson_r2_score': 0.7529095508565846}
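The metric used here is the square of the Pearson correlation between true and predicted values. A minimal pure-Python sketch follows; the function name and the toy numbers are hypothetical.

```python
def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return (cov * cov) / (var_t * var_p)


# Hypothetical values: the predictions swap the order of the last two samples.
print(pearson_r2([1.0, 2.0, 3.0], [1.0, 3.0, 2.0]))  # 0.25

# Predictions that are perfectly correlated but shifted and scaled.
shifted = pearson_r2([1.0, 2.0, 3.0], [12.0, 14.0, 16.0])
print(round(shifted, 6))  # 1.0
```

The second example shows a limitation worth keeping in mind: this score measures how well predictions track the true values linearly, not how close they are in absolute terms.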
Let’s see how to predict solubility for a new set of molecules:
smiles = ['COC(C)(C)CCCC(C)CC=CC(C)=CC(=O)OC(C)C',
'CCOC(=O)CC',
'CSc1nc(NC(C)C)nc(NC(C)C)n1',
'CC(C#C)N(C)C(=O)Nc1ccc(Cl)cc1',
'Cc1cc2ccccc2cc1C']
Next, we need to featurize this new set of molecules from their SMILES strings:
from rdkit import Chem
mols = [Chem.MolFromSmiles(s) for s in smiles]
featurizer = dc.feat.ConvMolFeaturizer()
x = featurizer.featurize(mols)
predicted_solubility = model.predict_on_batch(x)
predicted_solubility
And thus, we can see the predicted solubility values:
array([[-0.45654652],
[ 1.5316172 ],
[ 0.19090167],
[ 0.44833142],
[-0.32875094]], dtype=float32)
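One caveat when reading these numbers: if the dataset loader applied a normalization transformer to the targets (an assumption here; printing transformers will show which ones were used), the predictions are in normalized units rather than raw solubility. Undoing a standard z-score normalization is a one-line calculation, sketched below with a hypothetical helper and made-up mean and standard deviation.

```python
def denormalize(values, mean, std):
    """Map z-score normalized values back to the original scale."""
    return [v * std + mean for v in values]


# Placeholder statistics for illustration only, not taken from the dataset.
hypothetical_mean = -3.0
hypothetical_std = 2.0

restored = denormalize([0.0, 1.0, -0.5], hypothetical_mean, hypothetical_std)
print(restored)  # [-3.0, -1.0, -4.0]
```

In practice the mean and standard deviation would come from the transformer fitted on the training data, not from hand-picked constants.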
As these two use cases show, DeepChem makes quick work of problems that could take a human chemist a lot of time to solve!
For the final part, we will look at a few visualization and querying techniques available in RDKit, which are very useful when working on such use cases.
SMARTS strings to query molecular structures
SMARTS is an extension of the SMILES language used above that can be used to create queries.
# To gain a visual understanding of compounds in our dataset, let's draw them using RDKit.
# We define a couple of helper functions to get started.
import tempfile
from rdkit import Chem
from rdkit.Chem import Draw
from itertools import islice
from IPython.display import Image, display

def display_images(filenames):
    """Helper to pretty-print images."""
    for file in filenames:
        display(Image(file))

def mols_to_pngs(mols, basename="test"):
    """Helper to write RDKit mols to png files."""
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol, filename)
        filenames.append(filename)
    return filenames
Now, let’s take a few sample SMILES strings and visualize their molecular structures.
from rdkit import Chem
from rdkit.Chem.Draw import MolsToGridImage
smiles_list = ["CCCCC","CCOCC","CCNCC","CCSCC"]
mol_list = [Chem.MolFromSmiles(x) for x in smiles_list]
display_images(mols_to_pngs(mol_list))
This is how the molecular structures are drawn from the SMILES strings.
Now, let’s say we want to find the molecules that contain three adjacent carbons.
query = Chem.MolFromSmarts("CCC")
match_list = [mol.GetSubstructMatch(query) for mol in
mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4,
highlightAtomLists=match_list)
We see that the highlighted atoms mark the three adjacent carbons in the matching compound.
Similarly, let’s see some wildcard queries and other substructure query options.
query = Chem.MolFromSmarts("C*C")
match_list = [mol.GetSubstructMatch(query) for mol in
mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4,
highlightAtomLists=match_list)
query = Chem.MolFromSmarts("C[C,N,O]C")
match_list = [mol.GetSubstructMatch(query) for mol in
mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4,
highlightAtomLists=match_list)
Thus, we can see that selective substructure queries can also be handled easily.
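Since the rendered images are not reproduced here, the semantics of the three queries can be illustrated with a toy matcher in plain Python. It is not a SMARTS parser and does not use RDKit. It only handles the four unbranched chains from smiles_list, treats each molecule as a sequence of atom symbols, and encodes each query by hand as a list of allowed atom sets, with None standing for the * wildcard. All names below are hypothetical.

```python
def first_match(atoms, pattern):
    """Return the indices of the first window of atoms matching pattern, or ()."""
    size = len(pattern)
    for start in range(len(atoms) - size + 1):
        window = atoms[start:start + size]
        if all(allowed is None or atom in allowed
               for atom, allowed in zip(window, pattern)):
            return tuple(range(start, start + size))
    return ()


# The four linear molecules used above, one character per atom.
chains = {s: list(s) for s in ["CCCCC", "CCOCC", "CCNCC", "CCSCC"]}

# Hand-written equivalents of the three queries (None means "any atom").
patterns = {
    "CCC": [{"C"}, {"C"}, {"C"}],
    "C*C": [{"C"}, None, {"C"}],
    "C[C,N,O]C": [{"C"}, {"C", "N", "O"}, {"C"}],
}

for query, pattern in patterns.items():
    results = {name: first_match(atoms, pattern) for name, atoms in chains.items()}
    print(query, results)
```

In this toy model, CCC matches only the pentane chain, C*C matches all four molecules, and C[C,N,O]C matches every molecule except the sulfur-containing one. The exact atom indices RDKit reports may differ from the ones printed here.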
This brings us to the end of this article. I know it was very high level and specifically targeted at data scientists and ML researchers interested in drug discovery, especially during an ongoing pandemic like Covid19. Hope I was able to help! If you are someone with a strong background in bioinformatics or cheminformatics who wants to venture into the world of data science, please reach out to me through any of the options mentioned here. Keep following https://medium.com/@adib0073 and my website, https://www.aditya-bhattacharya.net/, for more!
Translated from: https://towardsdatascience.com/deepchem-a-framework-for-using-ml-and-dl-for-life-science-and-chemoinformatics-92cddd56a037