PSP - 提取 UniRef 数据库搜索的 MSA 序列物种 (Species) 信息

欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://spike.blog.csdn.net/article/details/135702185

UniRef

UniRef库:UniProt参考聚类(UniRef)的简称,提供了从UniProt知识库(包括异构体)和选定的UniParc记录中聚类得到的一组蛋白质序列,以便在不同的分辨率下覆盖整个序列空间,同时隐藏冗余的序列。UniRef库有4种分辨率:UniRef100、UniRef90、UniRef50、UniRef30,分别表示聚类的序列之间的相似度至少为100%、90%、50%、30%。UniRef库可以显著减少数据库的大小,提高序列相似性搜索的速度。UniRef库的代表性序列是根据以下优先级选择的:条目的质量、注释分数、物种、序列的长度。UniRef库是蛋白质序列计算领域的重要资源,可以用于蛋白质功能注释、进化分析、同源建模等应用。

UniRef序列中的物种(Species)信息:指每个聚类的代表性序列所属的物种,可以帮助用户了解聚类的生物学背景和多样性。物种信息是根据代表性序列的 UniProtKB 或 UniParc 条目中的物种注释获取的。如果一个聚类包含多个物种,那么物种信息会显示为“多物种聚类”。

其中,AlphaFold2 - Multimer 中,仅提供 uniprot_hits.sto 文件物种信息的提取,本文增加 uniref90_hits.stobfd_uniclust_hits.a3m 物种信息的提取,部分参考 DeepMSA2 的源码。

测试样本:8gmn_A276_B273_C264_D264_1077/msas/A,已搜索的 MSA 文件,如下:

642K		bfd_uniclust_hits.a3m
26M			mgnify_hits.sto
900M		uniprot_hits.sto
701M		uniref90_hits.sto

其中 mgnify_hits.sto 中,不包含物种信息,只有 UniRef 的搜索文件,才包括物种 (Species) 信息。

序列:

>chain_A
SIENGKYTCEEPYYYMEGVNVLGPKCVPVCGVPREPFEEKIIGGSDADIKNFPWQVFFDNPWAGGALINEYWVLTAAHVVEGNREPTMYVGSTSVQTSRLAKMLTPEHVFIHPGWKLLAVPEGRTNFDNDIALVRLKDPVKMGPTVSPICLPGTSSDYNLMDGDLGLISGWGRTEKRDRAVRLKAARLPVAPLRKCKEVKVEAYVFTPNMICAGGEKGMDSCKGDSGGAFAVQDPNDKTKFYAAGLVSWGPQCGTYGLYTRVKNYVDWIMKTMQEN

在 AF2 的默认配置,只使用 uniprot_hits.sto 计算物种信息,进行 MSA Pairing,可以扩展至全部的 MSA 文件:

MSAuniprot_hitsuniref90_hitsbfd_uniclust_hits
格式stostomsa
文件900M701M642K
深度124700979581863
物种数量1714/1247004721/97958164/351
物种示例CHACNScophthalmus maximus (物种ID 52904)HAELO

uniref90 物种信息,即,与 uniprot_hits 类似,物种 CHACN

#=GS UniRef90_A0A6J2VG70/25-267       DE [subseq from] Uncharacterized protein LOC115812568 n=1 Tax=Chanos chanos TaxID=29144 RepID=A0A6J2VG70_CHACN

建议使用 TaxID 的 ID 作为物种的唯一标识。

其中 uniprotuniclust 的物种信息,使用 正则表达式 提取:

# Sequences coming from UniProtKB database come in the
# `db|UniqueIdentifier|EntryName` format, e.g. `tr|A0A146SKV9|A0A146SKV9_FUNHE`
# or `sp|P0C2L1|A3X1_LOXLA` (for TREMBL/Swiss-Prot respectively).
UNIPROT_PATTERN = re.compile(r"""^# UniProtKB/TrEMBL or UniProtKB/Swiss-Prot(?:tr|sp)\|# A primary accession number of the UniProtKB entry.(?P<AccessionIdentifier>[A-Za-z0-9]{6,10})# Occasionally there is a _0 or _1 isoform suffix, which we ignore.(?:_\d)?\|# TREMBL repeats the accession ID here. Swiss-Prot has a mnemonic# protein ID code.(?:[A-Za-z0-9]+)_# A mnemonic species identification code.(?P<SpeciesIdentifier>([A-Za-z0-9]){1,5})# Small BFD uses a final value after an underscore, which we ignore.(?:_\d+)?$""",re.VERBOSE)# Sequences coming from Uniref database come in the
# `RepID=xxx_xxx` format, e.g. `RepID=A0A660T7U3_9SPIR`
# or `RepID=UPI00068451ED`
UNIREF_PATTERN = re.compile(r"""^# Uniref(RepID=)# A primary accession number of the UniProtKB entry.(?P<AccessionIdentifier>[A-Za-z0-9]{6,10})_# A mnemonic species identification code.(?P<SpeciesIdentifier>([A-Za-z0-9]){1,5})$""",re.VERBOSE)

源码:

#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2024/1/15
"""import collections
import os
import re
import sysp = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if p not in sys.path:sys.path.append(p)from root_dir import DATA_DIR# Sequences coming from UniProtKB database come in the
# `db|UniqueIdentifier|EntryName` format, e.g. `tr|A0A146SKV9|A0A146SKV9_FUNHE`
# or `sp|P0C2L1|A3X1_LOXLA` (for TREMBL/Swiss-Prot respectively).
UNIPROT_PATTERN = re.compile(r"""^# UniProtKB/TrEMBL or UniProtKB/Swiss-Prot(?:tr|sp)\|# A primary accession number of the UniProtKB entry.(?P<AccessionIdentifier>[A-Za-z0-9]{6,10})# Occasionally there is a _0 or _1 isoform suffix, which we ignore.(?:_\d)?\|# TREMBL repeats the accession ID here. Swiss-Prot has a mnemonic# protein ID code.(?:[A-Za-z0-9]+)_# A mnemonic species identification code.(?P<SpeciesIdentifier>([A-Za-z0-9]){1,5})# Small BFD uses a final value after an underscore, which we ignore.(?:_\d+)?$""",re.VERBOSE)# Sequences coming from Uniref database come in the
# `RepID=xxx_xxx` format, e.g. `RepID=A0A660T7U3_9SPIR`
# or `RepID=UPI00068451ED`
UNIREF_PATTERN = re.compile(r"""^# Uniref(RepID=)# A primary accession number of the UniProtKB entry.(?P<AccessionIdentifier>[A-Za-z0-9]{6,10})_# A mnemonic species identification code.(?P<SpeciesIdentifier>([A-Za-z0-9]){1,5})$""",re.VERBOSE)class SequencePairingUtils(object):"""MSA Pairing 与 提取蛋白质序列的物种信息"""def __init__(self):pass@staticmethoddef parse_sequence_specie(msa_sequence_identifier: str, dbtype: str = 'uniprot'):"""Gets species from a msa sequence identifier.The sequence identifier has the format specified by_UNIPROT_TREMBL_ENTRY_NAME_PATTERN or _UNIPROT_SWISSPROT_ENTRY_NAME_PATTERN.An example of a sequence identifier: `tr|A0A146SKV9|A0A146SKV9_FUNHE`Args:msa_sequence_identifier: a sequence identifier.dbtype: msa file type.Returns:An `Identifiers` instance with species_id. Thesecan be empty in the case where no identifier was found."""if dbtype == 'uniref':matches = re.search(UNIREF_PATTERN, msa_sequence_identifier.strip())else:matches = re.search(UNIPROT_PATTERN, msa_sequence_identifier.strip())species_id = ''if matches:species_id = matches.group('SpeciesIdentifier')return species_id@staticmethoddef extract_sequence_ids(id: str):"""Extracts sequence identifier from description. Returns None if no match."""split_id = id.split()if split_id:if split_id[0].lower().__contains__('uniref'):# return split_id[0].partition('/')[0],'uniprot'  ### only return uniprot IDreturn split_id[-1].partition('/')[0], 'uniref'else:return split_id[0].partition('/')[0], 'uniprot'else:return None, None@classmethoddef get_species(cls, id: str, def_dbtype: str = ""):"""get species from id"""sequence_identifier, dbtype = cls.extract_sequence_ids(id)# print(sequence_identifier,dbtype)if def_dbtype:dbtype = def_dbtypeif sequence_identifier is None:return ''else:# print(parse_sequence_specie(sequence_identifier,dbtype=dbtype))return cls.parse_sequence_specie(sequence_identifier, dbtype=dbtype)@classmethoddef parse_a3m(cls, a3m_path):"""解析 a3m 文件,同时,生成物种信息,主要针对于 uniref 库的搜索结果:param a3m_path: a3m 文件:return query_seq: 查询序列, str:return seq_names: list, msa seq 的 描述信息:return name_to_species: dict, {描述信息: 物种}:return name_to_sequence: dict, {描述信息: 序列}"""seq_names = []name_to_species = {}  # (seq_id, specie)name_to_sequence = {}  # (seq_id, sequence)in_seqs = []with open(a3m_path, 'r') as f:a3m_txt = f.read()a3m_blocks = ('\n' + a3m_txt).split('\n>')[1:]for a3m_block in a3m_blocks:a3m_strs = a3m_block.split('\n')seq_id = a3m_strs[0]a3m_seq = ''for seq in a3m_strs[1:]:a3m_seq += seqseq_names.append(seq_id)name_to_sequence[seq_id] = a3m_seqin_seqs.append(a3m_seq)specie = cls.get_species(seq_id)if specie != "":  # 确保包含物种信息name_to_species[seq_id] = speciequery_seq = in_seqs[0]# query_seq: 查询序列, str# seq_names: list, msa seq 的 描述信息# name_to_species: dict, {描述信息: 物种}# name_to_sequence: dict, {描述信息: 序列}return query_seq, seq_names, name_to_species, name_to_sequence@classmethoddef parse_sto_for_uniprot(cls, sto_path):"""解析 sto 文件,支持 uniprot,生成物种信息:param sto_path: msa 文件:return query_seq: 查询序列, str:return seq_names: list, msa seq 的 描述信息:return name_to_species: dict, {描述信息: 物种}:return name_to_sequence: dict, {描述信息: 序列}"""with open(sto_path) as f:sto_string = f.read()name_to_sequence = collections.OrderedDict()for line in sto_string.splitlines():line = line.strip()if not line or line.startswith(('#', '//')):continuename, sequence = line.split()if name not in name_to_sequence:name_to_sequence[name] = ''name_to_sequence[name] += sequencemsa = []keep_columns = []query_seq = ""for seq_index, sequence in enumerate(name_to_sequence.values()):if seq_index == 0:# Gather the columns with gaps from the queryquery = sequencekeep_columns = [i for i, res in enumerate(query) if res != '-']query_seq = ''.join([sequence[c] for c in keep_columns])# Remove the columns with gaps in the query from all sequences.aligned_sequence = ''.join([sequence[c] for c in keep_columns])msa.append(aligned_sequence)seq_names = list(name_to_sequence.keys())[1:]msa = msa[1:]name_to_species = dict()for desc in seq_names:s = cls.get_species(desc, def_dbtype="uniprot")if s != "":  # 确保包含物种信息name_to_species[desc] = sname_to_sequence = dict()for desc, seq in zip(seq_names, msa):name_to_sequence[desc] = seq# query_seq: 查询序列, str# seq_names: list, msa seq 的 描述信息# name_to_species: dict, {描述信息: 物种}# name_to_sequence: dict, {描述信息: 序列}return query_seq, seq_names, name_to_species, name_to_sequence@classmethoddef parse_sto_for_uniref(cls, sto_path):"""解析 sto 文件,支持 uniref,生成物种信息:param sto_path: msa 文件:return query_seq: 查询序列, str:return seq_names: list, msa seq 的 描述信息:return name_to_species: dict, {描述信息: 物种}:return name_to_sequence: dict, {描述信息: 序列}"""with open(sto_path) as f:sto_string = f.read()name_to_sequence = collections.OrderedDict()for line in sto_string.splitlines():line = line.strip()if not line or line.startswith(('#', '//')):continuename, sequence = line.split()if name not in name_to_sequence:name_to_sequence[name] = ''name_to_sequence[name] += sequencemsa = []keep_columns = []query_seq = ""for seq_index, sequence in enumerate(name_to_sequence.values()):if seq_index == 0:# Gather the columns with gaps from the queryquery = sequencekeep_columns = [i for i, res in enumerate(query) if res != '-']query_seq = ''.join([sequence[c] for c in keep_columns])# Remove the columns with gaps in the query from all sequences.aligned_sequence = ''.join([sequence[c] for c in keep_columns])msa.append(aligned_sequence)# 解析物种name_to_species = collections.OrderedDict()for line in sto_string.splitlines():line = line.strip()if not line.startswith("#=GS"):  # 只处理序列信息continueitems = line.split()name = items[1]  # i.e. UniRef90_UPI001FA938EF/25-266# TaxID=52904for item in items:if item.startswith("TaxID"):species = item.split("=")[1]name_to_species[name] = speciesbreakseq_names = list(name_to_sequence.keys())[1:]msa = msa[1:]name_to_sequence = dict()for desc, seq in zip(seq_names, msa):name_to_sequence[desc] = seq# query_seq: 查询序列, str# seq_names: list, msa seq 的 描述信息# name_to_species: dict, {描述信息: 物种}# name_to_sequence: dict, {描述信息: 序列}return query_seq, seq_names, name_to_species, name_to_sequencedef process(self, input_path):print(f"[Info] input_path: {input_path}")assert os.path.isfile(input_path)if input_path.endswith("a3m"):query_seq, seq_names, name_to_species, name_to_sequence = self.parse_a3m(input_path)elif input_path.endswith("sto"):file_name = os.path.basename(input_path)if "uniprot" in file_name:query_seq, seq_names, name_to_species, name_to_sequence = self.parse_sto_for_uniprot(input_path)else:query_seq, seq_names, name_to_species, name_to_sequence = self.parse_sto_for_uniref(input_path)else:raise Exception(f"[Error] The input_path must be a3m or sto file ! {input_path}")print(f"[Info] species_set: {len(set(list(name_to_species.values())))}")print(f"[Info] query_seq: {query_seq}")print(f"[Info] seq_names: {len(seq_names)}, {seq_names[1]}")print(f"[Info] name_to_species: {len(name_to_species)}")  # 字典 (key序列):(value物种)print(f"[Info] name_to_sequence: {len(name_to_sequence)}")for k in name_to_species.keys():s = name_to_species[k]if s:print(f"[Info] s: {s}, k: {k}")breakprint(f"[Info] 计算完成!")def main():spu = SequencePairingUtils()# name_to_species: 1863# species_set: 165, s: HAELO, tr|O96088|O96088_HAELO Serin proteinase 1# input_path = os.path.join(DATA_DIR, "templates", "msa", "bfd_uniclust_hits.a3m")# species_set: 1714, name_to_species: 124700# specie_set: 1715, s: CHACN, k: tr|A0A6J2VG70|A0A6J2VG70_CHACN/25-267# input_path = os.path.join(DATA_DIR, "templates", "msa", "uniprot_hits.sto")# species_set: 4721, seq_names: 97958input_path = os.path.join(DATA_DIR, "templates", "msa", "uniref90_hits.sto")spu.process(input_path)if __name__ == '__main__':main()

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/634753.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

[力扣 Hot100]Day7 接雨水

题目描述 给定 n 个非负整数表示每个宽度为 1 的柱子的高度图&#xff0c;计算按此排列的柱子&#xff0c;下雨之后能接多少雨水。 出处 思路 就是寻找“凹”形区间&#xff0c;找使得左右两端点为最大的两个值的最长区间。这里我分了两种情况&#xff0c;右边大于等于左边…

MySQL深度分页优化问题

☆* o(≧▽≦)o *☆嗨~我是小奥&#x1f379; &#x1f4c4;&#x1f4c4;&#x1f4c4;个人博客&#xff1a;小奥的博客 &#x1f4c4;&#x1f4c4;&#x1f4c4;CSDN&#xff1a;个人CSDN &#x1f4d9;&#x1f4d9;&#x1f4d9;Github&#xff1a;传送门 &#x1f4c5;&a…

RenderDoc 增加 DXBC to HLSL 的 shader viewer

目的 便于后续抓帧出来的 DXBC 转为 HLSL&#xff0c;提高可读性 原因 编写的原因&#xff0c;因为按照网上的大佬的BLOG&#xff0c;发现某个 etnlGD/HLSLDecompiler 上的工具使用上是有问题的 &#xff08;有可能是以前的 render doc 版本没有问题&#xff0c;而我现在是在…

KubeSphere平台使用

KubeSphere官网地址&#xff1a;https://kubesphere.io/zh/ KubeKey一键部署K8S集群&#xff1a;https://kubesphere.io/zh/docs/v3.4/installing-on-linux/introduction/multioverview/ 一台master node&#xff08;初始化主节点&#xff09;、两台 work node&#xff08; joi…

SCI好看的配图-汇总

文章目录 图源&#xff1a;Sustainable Cities and Society【期刊】条形图2热力图-地图 图源&#xff1a;Sustainable Cities and Society【期刊】 引自&#xff1a;A machine learning-driven spatio-temporal vulnerability appraisal based on socio-economic data for COV…

如何编写一个好的测试用例?才能防止背黑锅

如何编写一个好的测试用例&#xff1f;才能防止背黑锅 什么是测试用例&#xff1f;一个好的测试用例包含什么&#xff1f;测试用例的编写思路总结 什么是测试用例&#xff1f; 在这之前&#xff0c;思考一个问题&#xff0c;下面这个简单的QQ登录页面&#xff0c;一共又多少条…

关于运维·关于Zabbix监控平台的面试点

目录 引言&#xff1a;明人不说暗话&#xff0c;今天分享几个在面试的时候常被问到关于Zabbix监控平台的面试点 1、zabbix的优点 2、zabbix的缺点 3、zabbix的监控模式 4、zabbix自定义监控怎么做 5、zabbix的自动发现功能 6、zabbix分布式监控有什么特点 引言&#xff1…

专业130+总分380+哈尔滨工程大学810信号与系统考研经验水声电子信息与通信

今年专业课810信号与系统130&#xff0c;总分380顺利考上哈尔滨工程大学&#xff0c;一年的努力终于换来最后的录取&#xff0c;期中复习有得有失&#xff0c;以下总结一下自己的复习经历&#xff0c;希望对大家有帮助&#xff0c;天道酬勤&#xff0c;加油&#xff01;专业课&…

入门设计者不容错过!5款网页原型设计工具推荐!

即时设计 即时设计是一种支持团队合作的原型设计工具&#xff0c;不限于设备和人群的使用&#xff0c;浏览器可以打开和使用。在即时设计中&#xff0c;您可以从0到1创建一个Web页面原型&#xff0c;具有钢笔、矩形、矢量编辑、轮廓、文本、色彩填充等设计功能&#xff0c;足以…

达梦数据库入门语法:从基础到进阶的指南

目录 博客前言&#xff1a; 达梦数据库语法介绍 一.创建表空间 1.图形化创建 2.语法创建 ​编辑​编辑 3.修改表空间参数 图形化修改 ​编辑​编辑 语法修改 4.设置加密算法、密码 二.创建用户 1.图形化 2.sql执行 ​编辑 3.授予权限 授予用户 DBA 权限 授予用户…

三、RHCE--时间服务器

三、RHCE--时间服务器 一、简介二、软件安装三、配置时间服务器客户端四、配置时间服务器服务端五、示例&#xff1a; 一、简介 NTP 是网络时间协议&#xff08;Network Time Protocol&#xff09;的简称&#xff0c;通过 udp 123 端口进行网络时钟同步。 Chrony是一个开源自由…

k8s---ingress对外服务(traefik)

目录 ingress的证书访问 traefik traefik的部署方式&#xff1a; deamonset deployment nginx-ingress与traefix-ingress相比较 nginx-ingress-controller ui访问 deployment部署 ingress的证书访问 ingress实现https代理访问: 需要证书和密钥 创建证书 密钥 secre…

将 SQL Server 2022 数据库备份到 MinIO

Microsoft 在将 S3 连接器和 Polybase 添加到 SQL Server 2022 时取得了重大飞跃。因此&#xff0c;企业可以利用他们保存到对象存储中的大量数据&#xff0c;并使用它来丰富 SQL Server 表。他们还可以利用对象存储来备份 SQL Server&#xff0c;这是开放性和云原生灵活性的又…

UE4 添加按键输入事件 并在蓝图中使用按键输入节点

绑定按键 选择Edit/ProjectSettings/Engine/Input 在bindings中可以选择添加ActionMappings或则AxisMappings ActionMappings:按键事件&#xff0c;有按下和抬起两个事件&#xff0c;需要分别用两个键触发AxisMappings:输入事件&#xff0c;返回值为float&#xff0c;对于键盘…

每日OJ题_算法_滑动窗口⑤_力扣904水果成篮

目录 力扣904. 水果成篮 解析及代码1&#xff08;使用容器&#xff09; 解析及代码2&#xff08;开数组&#xff09; 力扣904. 水果成篮 904. 水果成篮 - 力扣&#xff08;LeetCode&#xff09; 难度 中等 你正在探访一家农场&#xff0c;农场从左到右种植了一排果树。这…

Elastic Stack(1):Elastic Stack简介

1 简介 ELK是一个免费开源的日志分析架构技术栈总称&#xff0c;官网https://www.elastic.co/cn。包含三大基础组件&#xff0c;分别是Elasticsearch、Logstash、Kibana。但实际上ELK不仅仅适用于日志分析&#xff0c;它还可以支持其它任何数据搜索、分析和收集的场景&#xf…

GLM-4多模态重磅更新!摸着OpenAI过河!

智谱CEO张鹏说&#xff1a;OpenAI摸着石头过河&#xff0c;我们摸着OpenAI过河。 摸来摸去摸了一年&#xff0c;以每3-4个月升级一次基座模型的速度&#xff0c;智谱摸着OpenAI过河的最新成绩到底怎么样&#xff1f;真如所说吗&#xff1f; 听到GLM-4发布的当天&#xff0c;我就…

C++深入之虚函数、虚继承与带虚函数的多基派生问题

基础 在讲解带虚函数的多基派生问题时&#xff0c;我们要先弄清楚不带虚函数的多基派生存在什么样的问题&#xff0c;这样才好弄明白带虚函数的多基派生问题。 多基派生的二义性问题 一般来说&#xff0c;在派生类中对基类成员的访问应当具有唯一性&#xff0c;但在多基继承…

Docker(二)安装指南:主要介绍在 Linux 、Windows 10 和 macOS 上的安装

作者主页&#xff1a; 正函数的个人主页 文章收录专栏&#xff1a; Docker 欢迎大家点赞 &#x1f44d; 收藏 ⭐ 加关注哦&#xff01; 安装 Docker Docker 分为 stable test 和 nightly 三个更新频道。 官方网站上有各种环境下的 安装指南&#xff0c;这里主要介绍 Docker 在…

DAZ to maxon 实时面捕52个blendshapes 表情模板基本形中英文对照表

一、转自&#xff1a; DAZ to maxon 实时面捕52个blendshapes 表情模板基本形中英文对照表 - 哔哩哔哩 很多学员反映实时表情怎么就不同步呢&#xff1f;这个问题其实很常见。 第一&#xff1a;表情模板的顺序弄错&#xff0c;导致表情错乱。 第二&#xff1a;表情模板不标准…