Drug Discovery with Graph Neural Networks: Part 1

Predicting Solubility

Related Material

  • Jupyter Notebook for the article

  • Drug Discovery with Graph Neural Networks — part 2

  • Introduction to Cheminformatics

  • Deep learning on graphs: successes, challenges, and next steps (article by prof Michael Bronstein)

  • Towards Explainable Graph Neural Networks

Table of Contents

  • Introduction
  • A Special Chemistry Between Drug Development and Machine Learning
  • Why Molecular Solubility is Important
  • Approaching the Problem with Graph Neural Networks
  • Hands-on Part with Deepchem
  • About Me

Introduction

This article mixes the theory behind drug discovery and graph neural networks with a practical part that uses the Deepchem library. The first part discusses potential applications of machine learning in drug development and then explains which molecular features might prove useful for a graph neural network model. We then dive into the coding part and create a GNN model that can predict the solubility of a molecule. Let’s get started!

A Special Chemistry between Drug Development and Machine Learning

Photo by Denise Johnson on Unsplash

Drug development is a time-consuming process: it might take decades before the final version of a drug is approved [1]. It starts with the initial stage of drug discovery, which identifies groups of molecules that are likely to become a drug. The candidates then go through several steps that eliminate unsuitable molecules, and the remaining ones are finally tested in real life. Important features that we look at during the drug discovery stage are the ADME (Absorption, Distribution, Metabolism, and Excretion) properties. We can say that drug discovery is an optimization problem in which we predict the ADME properties and choose the molecules that might increase the likelihood of developing a safe drug [2]. Highly efficient computational methods that find molecules with desirable properties speed up the drug development process and give a competitive advantage over other R&D companies.

It was only a matter of time before machine learning was applied to drug discovery. It made it possible to process molecular datasets with a speed and precision that had not been seen before [3]. However, to make molecular structures usable for machine learning, many complicated preprocessing steps have to be performed, such as converting 3D molecular structures to 1D fingerprint vectors or extracting numerical features from specific atoms in a molecule.

Why Molecular Solubility is Important

One of the ADME properties, absorption, determines whether the drug can efficiently reach the patient’s bloodstream. One of the factors behind absorption is aqueous solubility, i.e. how well a substance dissolves in water. If we are able to predict the solubility, we also get a good indication of the absorption property of the drug.

Approaching the Problem with Graph Neural Networks

To apply GNNs to molecular structures, we must transform the molecule into a numerical representation that the model can understand. This is a rather complicated step, and it varies with the specific architecture of the GNN model. Fortunately, most of that preprocessing is covered by external libraries such as Deepchem or RDKit.

Here, I will quickly explain the most common approaches to preprocessing a molecular structure.

SMILES

SMILES is a string representation of the 2D structure of a molecule. It maps a molecule to a special string that is (usually) unique and can be mapped back to the 2D structure. Sometimes, different molecules are mapped to the same SMILES string (for example, stereoisomers when the stereochemistry is not encoded), which might decrease the performance of the model.

Fingerprints

[Source]

A fingerprint is a binary vector in which each bit represents whether a certain substructure of the molecule is present or not. It is usually quite long and might fail to incorporate some structural information such as chirality.
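To make the idea of a binary fingerprint concrete, here is a toy sketch in plain Python. It is not how RDKit or Deepchem compute fingerprints: the substructure labels for ethanol are written by hand (a real toolkit would enumerate them from the molecular graph), and the function name `toy_fingerprint` and the 16-bit length are assumptions made for illustration only.

```python
import hashlib


def toy_fingerprint(substructures, n_bits=16):
    """Fold a collection of substructure labels into a fixed-length bit vector."""
    bits = [0] * n_bits
    for label in substructures:
        # A stable hash (unlike the built-in hash()) gives the same bit on every run.
        digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits


# Hypothetical, hand-written substructure labels for ethanol (SMILES: CCO).
ethanol = ["C-C", "C-O", "O-H"]
fingerprint = toy_fingerprint(ethanol)
print(fingerprint)
```

Two different substructures can land on the same bit, and the vector records only presence or absence, which is one way to see why a fingerprint can lose structural information.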

Adjacency Matrix and Feature Vectors

Another way to preprocess a molecular structure is to create an adjacency matrix. The adjacency matrix contains information about the connectivity of atoms, where “1” means that there is a connection between two atoms and “0” that there is none. The adjacency matrix is sparse and often quite big, which can make it inefficient to work with.

C is connected to itself and all other H atoms (first row of the adjacency matrix). An individual feature vector, say v0, contains information about a specific atom. An individual feature pair vector contains information about two neighbouring atoms, and it is often a function (sum, average, etc.) of the two feature vectors of these individual atoms.
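The adjacency matrix described above can be sketched in a few lines of plain Python. The molecule is assumed to be methane-like (one carbon bonded to four hydrogens), chosen here because it matches the caption; the atom ordering, the `adjacency_matrix` helper, and the choice to mark each atom as connected to itself are illustrative assumptions.

```python
# Atom 0 is the carbon, atoms 1-4 are the hydrogens.
atoms = ["C", "H", "H", "H", "H"]
bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]


def adjacency_matrix(n_atoms, bonds, self_loops=True):
    """Return a symmetric 0/1 matrix describing which atoms are connected."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    if self_loops:
        # Mark every atom as connected to itself, as in the caption.
        for i in range(n_atoms):
            matrix[i][i] = 1
    return matrix


adjacency = adjacency_matrix(len(atoms), bonds)
for symbol, row in zip(atoms, adjacency):
    print(symbol, row)
```

The first row (the carbon) contains only ones, while each hydrogen row has a one for itself and for the carbon. For larger molecules most entries are zero, which is the sparsity mentioned in the text.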

Together with this matrix, we can provide the GNN model with information about each individual atom and about neighbouring atoms in the form of vectors. The feature vector of each atom can contain information such as the atomic number, the number of valence electrons, or the number of single bonds. There are of course many more features, and fortunately they can be generated by RDKit and Deepchem.

A feature vector usually contains information about a specific atom. This vector is often generated by using functionality from the RDKit or Deepchem package.
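As a rough illustration of atom and pair feature vectors, the sketch below builds them by hand for a methane-like molecule. The three features (atomic number, valence electrons, number of single bonds) are the ones named in the text; the lookup table `ATOM_FEATURES` is hand-written for this one molecule, and summing is just one of the possible pair functions. In practice RDKit or Deepchem would generate much richer vectors.

```python
# Per-atom features for this molecule only:
# [atomic number, valence electrons, number of single bonds]
ATOM_FEATURES = {"C": [6, 4, 4], "H": [1, 1, 1]}

atoms = ["C", "H", "H", "H", "H"]
bonds = [(0, 1), (0, 2), (0, 3), (0, 4)]


def atom_feature_vector(symbol):
    """Look up the hand-written feature vector of a single atom."""
    return list(ATOM_FEATURES[symbol])


def pair_feature_vector(v_i, v_j):
    """Combine two atom feature vectors by element-wise sum."""
    return [a + b for a, b in zip(v_i, v_j)]


node_features = [atom_feature_vector(symbol) for symbol in atoms]
pair_features = {
    (i, j): pair_feature_vector(node_features[i], node_features[j])
    for i, j in bonds
}
print(node_features[0])
print(pair_features[(0, 1)])
```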

Solubility

The variable that we are going to predict is called cLogP, also known as the octanol-water partition coefficient. Basically, the lower the value, the more soluble the molecule is in water. cLogP is a log ratio, and its values typically range from -3 to 7 [6].

There is also a more general equation describing the solubility logS:
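The equation image is missing from this copy. Judging by the symbols in its caption, it is presumably the general solubility equation, which is commonly written as follows (reproduced here as an assumption about what the figure showed):

```latex
\log S = 0.5 - 0.01\,(MP - 25) - \log K_{ow}
```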

MP is the melting point (in degrees Celsius). logKow is the octanol-water partition coefficient, a.k.a. cLogP.

The problem with that equation is that MP is very difficult to predict from the chemical structure of the molecule [7]. All available solubility datasets contain only the cLogP value, and this is the value that we are going to predict as well.

Hands-on Part with Deepchem

A Colab notebook that you can run yourself is here.

Deepchem is a deep learning library for the life sciences that is built on top of a few packages such as Tensorflow, Numpy, and RDKit. For molecular data, it provides convenient functionality such as data loaders, data splitters, featurizers, metrics, and GNN models. In my experience, it is quite troublesome to set up, so I would recommend running it in the Colab notebook that I’ve provided. Let’s get started!

First, we download the Delaney dataset, which is considered a benchmark for the solubility prediction task. We then load the dataset using the CSVLoader class and specify the column with the cLogP data, which is passed into the tasks argument. In smiles_field, the name of the column with the SMILES strings has to be specified. We choose a ConvMolFeaturizer, which creates input features in the format required by the GNN model that we are going to use.

# Getting the Delaney dataset (run in a notebook cell)
!wget https://raw.githubusercontent.com/deepchem/deepchem/master/datasets/delaney-processed.csv
import deepchem

dataset_file = "delaney-processed.csv"
# Loading the data from the CSV file
loader = deepchem.data.CSVLoader(
    tasks=["ESOL predicted log solubility in mols per litre"],
    smiles_field="smiles",
    featurizer=deepchem.feat.ConvMolFeaturizer(),
)
# Featurizing the dataset with ConvMolFeaturizer
dataset = loader.featurize(dataset_file)

Next, we split the dataset using RandomSplitter and divide the data into training and validation sets. We also normalize the y values so that they have zero mean and unit standard deviation.

# Splitter splits the dataset
# In this case it is an equivalent of train_test_split from sklearn
splitter = deepchem.splits.RandomSplitter()
# frac_test is 0.01 because we only use train and valid as an example
train, valid, _ = splitter.train_valid_test_split(dataset, frac_train=0.7, frac_valid=0.29, frac_test=0.01)
# Normalizer will normalize y values in the dataset
normalizer = deepchem.trans.NormalizationTransformer(transform_y=True, dataset=train, move_mean=True)
train = normalizer.transform(train)
# Keep the name valid, because the evaluation step later uses valid
valid = normalizer.transform(valid)
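To show what the split and the y normalization do conceptually, here is a minimal sketch in plain Python that does not use Deepchem. The helper names `random_split` and `YNormalizer` and the target values are made up for illustration. The point is that the mean and standard deviation come from the training set only, and that the same statistics are used to map values back to the original range.

```python
import random
import statistics


def random_split(items, frac_train=0.7, seed=0):
    """Shuffle a copy of the items and cut it into a train and a valid part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac_train)
    return shuffled[:cut], shuffled[cut:]


class YNormalizer:
    """Scale targets to zero mean and unit standard deviation."""

    def __init__(self, y_train):
        # Statistics are estimated on the training targets only.
        self.mean = statistics.fmean(y_train)
        self.std = statistics.pstdev(y_train)

    def transform(self, y):
        return [(value - self.mean) / self.std for value in y]

    def untransform(self, y):
        return [value * self.std + self.mean for value in y]


# Synthetic target values, not taken from the Delaney dataset.
y = [float(value) for value in range(-8, 2)]
y_train, y_valid = random_split(y)

y_normalizer = YNormalizer(y_train)
y_train_norm = y_normalizer.transform(y_train)
y_valid_norm = y_normalizer.transform(y_valid)
y_valid_back = y_normalizer.untransform(y_valid_norm)
print(statistics.fmean(y_train_norm), statistics.pstdev(y_train_norm))
```

The validation targets are transformed with the training statistics, so their mean is generally not exactly zero. That is expected and avoids leaking information from the validation set.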

In this example, we will use GraphConvModel as our GNN model. It’s an architecture that was created by Duvenaud et al. You can find their paper here. There are other GNN models in the Deepchem package, such as WeaveModel or DAGModel. You can find a full list of the models with their required featurizers here.

In this code snippet, a Pearson R2 score is also defined. Simply speaking, the closer this value is to 1, the better the model.

# GraphConvModel is a GNN model based on 
# Duvenaud, David K., et al. "Convolutional networks on graphs for
# learning molecular fingerprints."
from deepchem.models import GraphConvModel
graph_conv = GraphConvModel(1,batch_size=50,mode="regression")
# Defining metric. Closer to 1 is better
metric = deepchem.metrics.Metric(deepchem.metrics.pearson_r2_score)
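For intuition about the metric, the sketch below computes a squared Pearson correlation by hand in plain Python. It assumes that this is what the Pearson R2 score measures; the function name `pearson_r2` and the example values are made up for illustration and are not Deepchem code.

```python
def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    covariance = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return covariance ** 2 / (var_true * var_pred)


y_true = [1.0, 2.0, 3.0, 4.0]
r2_scaled = pearson_r2(y_true, [2.0, 4.0, 6.0, 8.0])    # perfectly correlated
r2_shuffled = pearson_r2(y_true, [1.0, 3.0, 2.0, 4.0])  # partly correlated
print(r2_scaled, r2_shuffled)
```

Note that the first set of predictions is off by a factor of two and still gets a perfect score: this metric rewards linear correlation, not small absolute errors, so it is worth checking an error metric as well.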

Deepchem models use the Keras API. The graph_conv model is trained with the fit() function, where you can also specify the number of epochs. We get the scores with the evaluate() function. The normalizer has to be passed here because the y values need to be mapped back to their original range before the metric score is computed.

# Fitting the model
graph_conv.fit(train, nb_epoch=10)
# Reversing the transformation and getting the metric scores on 2 datasets
train_scores = graph_conv.evaluate(train, [metric], [normalizer])
valid_scores = graph_conv.evaluate(valid, [metric], [normalizer])

And that’s all! You can do much more interesting stuff with Deepchem. Its developers have created some tutorials that show what else you can do with it. I highly suggest looking them over. You can find them here.

Thank you for reading the article. I hope it was useful for you!

About Me

I am an MSc Artificial Intelligence student at the University of Amsterdam. In my spare time, you can find me fiddling with data or debugging my deep learning model (I swear it worked!). I also like hiking :)

Here are my social media profiles, if you want to stay in touch with my latest articles and other useful content:

  • Medium
  • Linkedin
  • Github
  • Personal Website

Translated from: https://towardsdatascience.com/drug-discovery-with-graph-neural-networks-part-1-1011713185eb
