DeepChem: A Framework for Using ML and DL in Life Science and Chemoinformatics

Applying machine learning and deep learning to drug discovery, genomics, microscopy and quantum chemistry can have a radical impact and has the potential to significantly accelerate medical research and vaccine development, which is a necessity during any pandemic such as COVID-19.

Before we begin: this is a very high-level article, aimed specifically at data scientists and ML researchers interested in drug discovery, especially during an ongoing pandemic such as COVID-19. If you have a strong background in bioinformatics or cheminformatics and want to venture into data science for these use cases, please reach out to me through any of the options mentioned here, and we can discuss a few interesting opportunities for the greater good of mankind.

DeepChem is an open-source framework, built on top of TensorFlow, that has been specifically designed to simplify the creation of deep learning models for various life science applications.

In this tutorial, we will see how to set up DeepChem and how to use it for:

1. training a model that can predict the toxicity of molecules

2. training a model to predict the solubility of molecules

3. using SMARTS strings to query molecular structures.

Setting Up DeepChem

Although I have seen users in multiple sources express concern about setting up DeepChem on Windows, Linux and Mac, I found it quite easy to do using the pip installer.

The DeepChem development team is very active and provides daily builds, so I recommend taking a look at the project's PyPI page, https://pypi.org/project/deepchem/#history, and installing a suitable earlier version if the latest one has any issues. A simple pip install deepchem installs the very latest version.

Next, along with DeepChem, you need TensorFlow and RDKit, an open-source cheminformatics software package. I installed the latest version of TensorFlow using pip install tensorflow. For RDKit on Windows, I did not find a reliable pip installer, so I installed it from https://anaconda.org/rdkit/rdkit using the conda installer: conda install -c rdkit rdkit

Once these three packages are installed, we are ready to start our experiments.
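Before moving on, it can help to confirm that all three packages are visible to the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `check_packages` is an assumption made for this example, and a package installed through conda may be importable even when no pip metadata is found under the same name.

```python
import importlib.util
from importlib import metadata

# The three packages named in the setup steps above.
REQUIRED = ["deepchem", "tensorflow", "rdkit"]


def check_packages(names):
    """Map each package name to (importable, version-or-None)."""
    report = {}
    for name in names:
        # find_spec locates a top-level package without importing it.
        importable = importlib.util.find_spec(name) is not None
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable packages can still lack metadata under this exact name.
            version = None
        report[name] = (importable, version)
    return report


report = check_packages(REQUIRED)
for name, (importable, version) in report.items():
    status = "ok" if importable else "MISSING"
    print(f"{name:12s} {status:8s} version={version}")
```

If a package is reported as missing, installing an earlier release from the PyPI history page mentioned above is one option to try.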

Predicting toxicity of molecules

Molecular toxicity can be defined as the sum of the adverse effects a substance exhibits on an organism. Computational methods can estimate the toxicity of a given compound from the chemical and structural properties of the molecule. Molecular featurization using molecular descriptors (Dong et al., 2015) and fingerprints (Xue and Bajorath, 2000) can effectively extract the chemical and structural information inherent in a molecule for prediction-based approaches.
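To make the idea of a fingerprint concrete, here is a toy sketch of a hashed bit-vector fingerprint. It hashes short character fragments of the SMILES string, which is a simplification chosen purely for illustration: real fingerprints hash substructures of the molecular graph, and this is not how DeepChem or RDKit compute them. The names `toy_fingerprint` and `tanimoto` and the default size of 1024 bits are assumptions for this example.

```python
import hashlib


def toy_fingerprint(smiles, n_bits=1024, max_len=3):
    """Hash every substring of length 1..max_len into a fixed-size bit vector."""
    bits = [0] * n_bits
    for length in range(1, max_len + 1):
        for start in range(len(smiles) - length + 1):
            fragment = smiles[start:start + length]
            # hashlib gives a hash that is stable across Python runs.
            digest = hashlib.sha256(fragment.encode("utf-8")).hexdigest()
            bits[int(digest, 16) % n_bits] = 1
    return bits


def tanimoto(a, b):
    """Tanimoto similarity: shared on-bits divided by all distinct on-bits."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


fp_ether = toy_fingerprint("CCOCC")
fp_amine = toy_fingerprint("CCNCC")
print(sum(fp_ether), "bits set")
print("similarity:", round(tanimoto(fp_ether, fp_amine), 3))
```

The point of the sketch is only that every molecule, whatever its size, ends up as a vector of the same fixed length, which is what a feed-forward model needs as input.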

For predicting toxicity, we will use the Tox21 toxicity dataset from MoleculeNet, and we will use DeepChem to load it.

import numpy as np
import deepchem as dc
tox21_tasks, tox21_datasets, transformers = dc.molnet.load_tox21()

Next, we can list all the toxicity tasks simply by printing tox21_tasks:

['NR-AR',
'NR-AR-LBD',
'NR-AhR',
'NR-Aromatase',
'NR-ER',
'NR-ER-LBD',
'NR-PPAR-gamma',
'SR-ARE',
'SR-ATAD5',
'SR-HSE',
'SR-MMP',
'SR-p53']

The loader returns the dataset already divided into training, validation and test sets, which we unpack with:

train_dataset, valid_dataset, test_dataset = tox21_datasets

If we check the distribution of the labels, we will see that the dataset is not balanced: for each task, toxic (positive) molecules are far fewer than non-toxic ones. This matters because we are solving a multitask binary classification problem, and on unbalanced data the majority class biases the classifier and skews the results. For this reason, the transformer applied by default is a balancing transformer, which reweights the samples so that both classes carry equal total weight.

print(transformers)
[<deepchem.trans.transformers.BalancingTransformer at 0x26b5642dc88>]
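The idea behind balancing by reweighting can be sketched in a few lines of plain Python. This is an illustration of the principle for a single binary task, not DeepChem's implementation; the function name `balancing_weights` and the label values are hypothetical.

```python
def balancing_weights(labels):
    """Return one weight per sample so both classes have the same total weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        # Only one class present: nothing to balance.
        return [1.0] * len(labels)
    # Keep negatives at weight 1 and scale positives up to match their total.
    pos_weight = n_neg / n_pos
    return [pos_weight if y == 1 else 1.0 for y in labels]


labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # hypothetical task: 8 inactive, 2 active
weights = balancing_weights(labels)
pos_total = sum(w for w, y in zip(weights, labels) if y == 1)
neg_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(weights)
print(pos_total, neg_total)  # both totals are 8.0
```

With these weights, a misclassified active molecule costs the model as much as four misclassified inactive ones, so the classifier can no longer score well by always predicting the majority class.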

Now, for the training part:

model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
train_scores = model.evaluate(train_dataset, [metric], transformers)
test_scores = model.evaluate(test_dataset, [metric], transformers)

DeepChem's dc.models submodule contains a variety of life science-specific models; the MultitaskClassifier used above is one of them.

Finally, the AUC-ROC scores are:

{'training mean-roc_auc_score': 0.9556297601807405}
{'testing mean-roc_auc_score': 0.7802496964641786}

This shows that the model is overfitting to some extent, as the score on the test set is much lower than on the training set. Nevertheless, we now have a model that can predict the toxicity of molecules!
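For readers unfamiliar with the metric, the mean ROC AUC reported above can be understood through a small pure-Python sketch. It computes the ROC AUC of one task as the fraction of (positive, negative) pairs that the model ranks correctly, then averages over tasks. The function names and the example labels and scores are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(pos) * len(neg))


def mean_roc_auc(tasks):
    """Average the per-task ROC AUC, mirroring the np.mean in the metric above."""
    scores = [roc_auc(y_true, y_score) for y_true, y_score in tasks]
    return sum(scores) / len(scores)


# Hypothetical labels and predicted probabilities for two tasks.
task_a = ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
task_b = ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7])
print(roc_auc(*task_a))                # 0.75
print(mean_roc_auc([task_a, task_b]))  # 0.875
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a drop from about 0.96 on the training set to about 0.78 on the test set points to overfitting.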

Predicting solubility of molecules

Solubility is a measure of how easily a molecule dissolves in water. In drug discovery, it is very important to check the solubility of a compound, as a drug must dissolve into the patient's bloodstream to have the required therapeutic effect. Medicinal chemists usually spend a lot of time modifying molecules to increase their solubility. In this section, we will use DeepChem to predict the solubility of molecules.

We will use the Delaney dataset from MoleculeNet, which is also available in DeepChem, for predicting molecular solubility.

# Load the featurized data
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
# Split into train, validation and test datasets
train_dataset, valid_dataset, test_dataset = datasets
# Fit the model
model = dc.models.GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)
model.fit(train_dataset, nb_epoch=100)
# Use r2 score as model evaluation metric
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(model.evaluate(train_dataset, [metric], transformers))
print(model.evaluate(test_dataset, [metric], transformers))

Even on this first pass, the evaluation results show some overfitting.

{'training pearson_r2_score': 0.9203419837932797}
{'testing pearson_r2_score': 0.7529095508565846}
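The score used here is the squared Pearson correlation between the true and predicted values. As a rough pure-Python sketch of what is being measured (the function name `pearson_r2` and the sample values are hypothetical):

```python
def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return (cov * cov) / (var_t * var_p)


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 4.0, 6.0, 8.0]  # every prediction is off by a factor of two
print(pearson_r2(y_true, y_pred))  # 1.0
```

The example shows a property worth keeping in mind: this score measures how linearly related the predictions are to the true values, not how close they are, so it is sensible to look at an error metric as well before trusting a model.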

Let's see how to predict the solubility of a new set of molecules:

smiles = ['COC(C)(C)CCCC(C)CC=CC(C)=CC(=O)OC(C)C',
'CCOC(=O)CC',
'CSc1nc(NC(C)C)nc(NC(C)C)n1',
'CC(C#C)N(C)C(=O)Nc1ccc(Cl)cc1',
'Cc1cc2ccccc2cc1C']

Next, we need to featurize this new set of molecules from their SMILES format:

from rdkit import Chem
mols = [Chem.MolFromSmiles(s) for s in smiles]
featurizer = dc.feat.ConvMolFeaturizer()
x = featurizer.featurize(mols)
predicted_solubility = model.predict_on_batch(x)
predicted_solubility

And thus, we can see the predicted solubility values:

array([[-0.45654652],
[ 1.5316172 ],
[ 0.19090167],
[ 0.44833142],
[-0.32875094]], dtype=float32)
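One caveat when reading these numbers: if the loader normalized the labels before training (an assumption worth checking by printing `transformers`), the raw batch predictions are in normalized units rather than log-solubility. The sketch below shows how such a z-score normalization would be undone. The mean and standard deviation are placeholder values invented for this example, not statistics of the Delaney dataset.

```python
def undo_normalization(z_scores, mean, std):
    """Map z-scored values back to the original scale: y = z * std + mean."""
    return [z * std + mean for z in z_scores]


# Raw predictions from the output above.
predicted = [-0.45654652, 1.5316172, 0.19090167, 0.44833142, -0.32875094]

# HYPOTHETICAL training-set statistics, chosen only to make the example run.
train_mean, train_std = -3.0, 2.0

log_solubility = undo_normalization(predicted, train_mean, train_std)
for z, y in zip(predicted, log_solubility):
    print(f"normalized={z:+.3f}  rescaled={y:+.3f}")
```

In practice, the transformer objects returned by the loader hold the real statistics, which is why `transformers` is passed to `model.evaluate` in the code above.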

We saw how easily DeepChem handles the above two use cases, which could take a human chemist a lot of time to solve!

For the final part, we will look at a few of the visualization and querying techniques available in RDKit, which are essential when working on such use cases.

SMARTS strings to query molecular structures

SMARTS is an extension of the SMILES language that can be used to create substructure queries.

# To gain a visual understanding of compounds in our dataset, let's draw them using rdkit.
# We define a couple of helper functions to get started.
import tempfile
from rdkit import Chem
from rdkit.Chem import Draw
from itertools import islice
from IPython.display import Image, display

def display_images(filenames):
    """Helper to pretty-print images."""
    for file in filenames:
        display(Image(file))

def mols_to_pngs(mols, basename="test"):
    """Helper to write RDKit mols to png files."""
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol, filename)
        filenames.append(filename)
    return filenames

Now, let's take a few sample SMILES strings and visualize their molecular structures.

from rdkit import Chem
from rdkit.Chem.Draw import MolsToGridImage
smiles_list = ["CCCCC","CCOCC","CCNCC","CCSCC"]
mol_list = [Chem.MolFromSmiles(x) for x in smiles_list]
display_images(mols_to_pngs(mol_list))

This is how the structures are drawn from the SMILES strings.

Now, let's say we want to find the molecules whose structure contains three adjacent carbons.

query = Chem.MolFromSmarts("CCC")
match_list = [mol.GetSubstructMatch(query) for mol in mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4, highlightAtomLists=match_list)

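The highlighted atoms are the ones matched by the query, that is, three adjacent carbons. Only the first molecule, CCCCC, contains such a substructure, so GetSubstructMatch returns an empty match for the other three.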

Similarly, let's look at a wildcard query and other substructure query options.

query = Chem.MolFromSmarts("C*C")
match_list = [mol.GetSubstructMatch(query) for mol in mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4, highlightAtomLists=match_list)

query = Chem.MolFromSmarts("C[C,N,O]C")
match_list = [mol.GetSubstructMatch(query) for mol in mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4, highlightAtomLists=match_list)

Thus, we can see that selective substructure queries are also easily handled.
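To see what the three query forms mean without needing RDKit, here is a toy matcher written for this article's four example molecules. It is a deliberately limited sketch: it only handles unbranched chains written with single-letter atoms and only the three pattern forms used above (a plain atom, the `*` wildcard, and a bracketed list such as `[C,N,O]`). RDKit performs full graph matching on real molecules; the function names here are hypothetical and the returned indices are for illustration only.

```python
def parse_chain_pattern(pattern):
    """Turn a toy pattern like 'C[C,N,O]C' into per-position sets of allowed atoms.

    A position of None stands for the '*' wildcard (any atom).
    """
    positions = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "*":
            positions.append(None)
            i += 1
        elif ch == "[":
            end = pattern.index("]", i)
            positions.append(set(pattern[i + 1:end].split(",")))
            i = end + 1
        else:
            positions.append({ch})
            i += 1
    return positions


def match_chain(chain, pattern):
    """Return the indices of the first matching window, or () if none matches."""
    positions = parse_chain_pattern(pattern)
    width = len(positions)
    for start in range(len(chain) - width + 1):
        window = chain[start:start + width]
        if all(allowed is None or atom in allowed
               for atom, allowed in zip(window, positions)):
            return tuple(range(start, start + width))
    return ()


smiles_list = ["CCCCC", "CCOCC", "CCNCC", "CCSCC"]
for pattern in ["CCC", "C*C", "C[C,N,O]C"]:
    print(pattern, [match_chain(s, pattern) for s in smiles_list])
```

Running it shows the progression: `CCC` matches only the all-carbon chain, `C*C` matches all four molecules because the middle atom can be anything, and `C[C,N,O]C` matches every molecule except the sulfur-containing one.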

This brings us to the end of this article. I know it was very high level and specifically targeted at data scientists and ML researchers interested in drug discovery, especially during an ongoing pandemic such as COVID-19. I hope I was able to help! If you have a strong background in bioinformatics or cheminformatics and want to venture into the world of data science, please reach out to me through any of the options mentioned here. Keep following https://medium.com/@adib0073 and my website https://www.aditya-bhattacharya.net/ for more!

Translated from: https://towardsdatascience.com/deepchem-a-framework-for-using-ml-and-dl-for-life-science-and-chemoinformatics-92cddd56a037
