python运行hhsearch二进制命令的包装器类

hhsearch 是 HMM-HMM(Hidden Markov Model to Hidden Markov Model)比对方法的一部分,属于 HMMER 软件套件。它用于进行蛋白质序列的高效比对,特别适用于检测远缘同源性。

以下是 hhsearch 的一些主要特点和用途:

  1. HMM-HMM比对: hhsearch 使用隐藏马尔可夫模型(HMM)来表示蛋白质家族的模型。与传统的序列-序列比对方法不同,HMM-HMM比对考虑了氨基酸残基的多序列信息,使得在比对中能够更好地捕捉蛋白质家族的模式和结构。

  2. 检测远缘同源性: hhsearch 的一个主要优势是其能够检测到相对远离的同源关系。它在比对中引入了更多的信息,从而提高了对远缘同源蛋白的发现能力。

  3. 灵敏度和特异性: hhsearch 的设计旨在在维持高灵敏度的同时,减少假阳性的比对。这使得它在寻找结构和功能相似性时更为可靠。

  4. 数据库搜索: 用户可以使用 hhsearch 在大型蛋白质数据库中搜索与给定蛋白质序列相似的蛋白质。

"""Library to run HHsearch from Python."""import glob
import os
import subprocess
from typing import Sequence, Optional, List, Iterable
from absl import logging
import contextlib
import tempfile
import dataclasses
import contextlib
import time
import shutil
import re@contextlib.contextmanager
def timing(msg: str):logging.info('Started %s', msg)tic = time.time()yieldtoc = time.time()logging.info('Finished %s in %.3f seconds', msg, toc - tic)@dataclasses.dataclass(frozen=True)
class TemplateHit:"""Class representing a template hit."""index: intname: straligned_cols: intsum_probs: Optional[float]query: strhit_sequence: strindices_query: List[int]indices_hit: List[int]@contextlib.contextmanager
def tmpdir_manager(base_dir: Optional[str] = None):"""Context manager that deletes a temporary directory on exit."""tmpdir = tempfile.mkdtemp(dir=base_dir)try:yield tmpdirfinally:shutil.rmtree(tmpdir, ignore_errors=True)def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:"""Parses the content of an entire HHR file."""lines = hhr_string.splitlines()# Each .hhr file starts with a results table, then has a sequence of hit# "paragraphs", each paragraph starting with a line 'No <hit number>'. We# iterate through each paragraph to parse each hit.block_starts = [i for i, line in enumerate(lines) if line.startswith('No ')]hits = []if block_starts:block_starts.append(len(lines))  # Add the end of the final block.for i in range(len(block_starts) - 1):hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))return hitsdef _parse_hhr_hit(detailed_lines: Sequence[str]) -> TemplateHit:"""Parses the detailed HMM HMM comparison section for a single Hit.This works on .hhr files generated from both HHBlits and HHSearch.Args:detailed_lines: A list of lines from a single comparison section between 2sequences (which each have their own HMM's)Returns:A dictionary with the information from that detailed comparison sectionRaises:RuntimeError: If a certain line cannot be processed"""# Parse first 2 lines.number_of_hit = int(detailed_lines[0].split()[-1])name_hit = detailed_lines[1][1:]# Parse the summary line.pattern = ('Probab=(.*)[\t ]*E-value=(.*)[\t ]*Score=(.*)[\t ]*Aligned_cols=(.*)[\t'' ]*Identities=(.*)%[\t ]*Similarity=(.*)[\t ]*Sum_probs=(.*)[\t '']*Template_Neff=(.*)')match = re.match(pattern, detailed_lines[2])if match is None:raise RuntimeError('Could not parse section: %s. Expected this: \n%s to contain summary.' %(detailed_lines, detailed_lines[2]))(_, _, _, aligned_cols, _, _, sum_probs, _) = [float(x)for x in match.groups()]# The next section reads the detailed comparisons. These are in a 'human# readable' format which has a fixed length. The strategy employed is to# assume that each block starts with the query sequence line, and to parse# that with a regexp in order to deduce the fixed length used for that block.query = ''hit_sequence = ''indices_query = []indices_hit = []length_block = Nonefor line in detailed_lines[3:]:# Parse the query sequence lineif (line.startswith('Q ') and not line.startswith('Q ss_dssp') andnot line.startswith('Q ss_pred') andnot line.startswith('Q Consensus')):# Thus the first 17 characters must be 'Q <query_name> ', and we can parse# everything after that.#              start    sequence       end       total_sequence_lengthpatt = r'[\t ]*([0-9]*) ([A-Z-]*)[\t ]*([0-9]*) \([0-9]*\)'groups = _get_hhr_line_regex_groups(patt, line[17:])# Get the length of the parsed block using the start and finish indices,# and ensure it is the same as the actual block length.start = int(groups[0]) - 1  # Make index zero based.delta_query = groups[1]end = int(groups[2])num_insertions = len([x for x in delta_query if x == '-'])length_block = end - start + num_insertionsassert length_block == len(delta_query)# Update the query sequence and indices list.query += delta_query_update_hhr_residue_indices_list(delta_query, start, indices_query)elif line.startswith('T '):# Parse the hit sequence.if (not line.startswith('T ss_dssp') andnot line.startswith('T ss_pred') andnot line.startswith('T Consensus')):# Thus the first 17 characters must be 'T <hit_name> ', and we can# parse everything after that.#              start    sequence       end     total_sequence_lengthpatt = r'[\t ]*([0-9]*) ([A-Z-]*)[\t ]*[0-9]* \([0-9]*\)'groups = _get_hhr_line_regex_groups(patt, line[17:])start = int(groups[0]) - 1  # Make index zero based.delta_hit_sequence = groups[1]assert length_block == len(delta_hit_sequence)# Update the hit sequence and indices list.hit_sequence += delta_hit_sequence_update_hhr_residue_indices_list(delta_hit_sequence, start, indices_hit)return TemplateHit(index=number_of_hit,name=name_hit,aligned_cols=int(aligned_cols),sum_probs=sum_probs,query=query,hit_sequence=hit_sequence,indices_query=indices_query,indices_hit=indices_hit,)def _get_hhr_line_regex_groups(regex_pattern: str, line: str) -> Sequence[Optional[str]]:match = re.match(regex_pattern, line)if match is None:raise RuntimeError(f'Could not parse query line {line}')return match.groups()def _update_hhr_residue_indices_list(sequence: str, start_index: int, indices_list: List[int]):"""Computes the relative indices for each residue with respect to the original sequence."""counter = start_indexfor symbol in sequence:if symbol == '-':indices_list.append(-1)else:indices_list.append(counter)counter += 1class HHSearch:"""Python wrapper of the HHsearch binary."""def __init__(self,*,binary_path: str,databases: Sequence[str],maxseq: int = 1_000_000):"""Initializes the Python HHsearch wrapper.Args:binary_path: The path to the HHsearch executable.databases: A sequence of HHsearch database paths. This should be thecommon prefix for the database files (i.e. up to but not including_hhm.ffindex etc.)maxseq: The maximum number of rows in an input alignment. Note that thisparameter is only supported in HHBlits version 3.1 and higher.Raises:RuntimeError: If HHsearch binary not found within the path."""self.binary_path = binary_pathself.databases = databasesself.maxseq = maxseq#for database_path in self.databases:#  if not glob.glob(database_path + '_*'):#    logging.error('Could not find HHsearch database %s', database_path)#    raise ValueError(f'Could not find HHsearch database {database_path}')@propertydef output_format(self) -> str:return 'hhr'@propertydef input_format(self) -> str:return 'a3m'def query(self, a3m: str) -> str:"""Queries the database using HHsearch using a given a3m."""with tmpdir_manager() as query_tmp_dir:input_path = os.path.join(query_tmp_dir, 'query.a3m')hhr_path = os.path.join(query_tmp_dir, 'output.hhr')with open(input_path, 'w') as f:f.write(a3m)db_cmd = []for db_path in self.databases:db_cmd.append('-d')db_cmd.append(db_path)cmd = [self.binary_path,'-i', input_path,'-o', hhr_path,'-maxseq', str(self.maxseq)] + db_cmdprint("cmd:",cmd)logging.info('Launching subprocess "%s"', ' '.join(cmd))process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)with timing('HHsearch query'):stdout, stderr = process.communicate()retcode = process.wait()if retcode:# Stderr is truncated to prevent proto size errors in Beam.raise RuntimeError('HHSearch failed:\nstdout:\n%s\n\nstderr:\n%s\n' % (stdout.decode('utf-8'), stderr[:100_000].decode('utf-8')))with open(hhr_path) as f:hhr = f.read()return hhrdef get_template_hits(self,output_string: str,input_sequence: str) -> Sequence[TemplateHit]:"""Gets parsed template hits from the raw string output by the tool."""del input_sequence  # Used by hmmseach but not needed for hhsearch.return parse_hhr(output_string)def convert_stockholm_to_a3m (stockholm_format: str,max_sequences: Optional[int] = None,remove_first_row_gaps: bool = True) -> str:"""Converts MSA in Stockholm format to the A3M format."""descriptions = {}sequences = {}reached_max_sequences = Falsefor line in stockholm_format.splitlines():reached_max_sequences = max_sequences and len(sequences) >= max_sequencesif line.strip() and not line.startswith(('#', '//')):# Ignore blank lines, markup and end symbols - remainder are alignment# sequence parts.seqname, aligned_seq = line.split(maxsplit=1)if seqname not in sequences:if reached_max_sequences:continuesequences[seqname] = ''sequences[seqname] += aligned_seqfor line in stockholm_format.splitlines():if line[:4] == '#=GS':# Description row - example format is:# #=GS UniRef90_Q9H5Z4/4-78            DE [subseq from] cDNA: FLJ22755 ...columns = line.split(maxsplit=3)seqname, feature = columns[1:3]value = columns[3] if len(columns) == 4 else ''if feature != 'DE':continueif reached_max_sequences and seqname not in sequences:continuedescriptions[seqname] = valueif len(descriptions) == len(sequences):break# Convert sto format to a3m line by linea3m_sequences = {}if remove_first_row_gaps:# query_sequence is assumed to be the first sequencequery_sequence = next(iter(sequences.values()))query_non_gaps = [res != '-' for res in query_sequence]for seqname, sto_sequence in sequences.items():# Dots are optional in a3m format and are commonly removed.out_sequence = sto_sequence.replace('.', '')if remove_first_row_gaps:out_sequence = ''.join(_convert_sto_seq_to_a3m(query_non_gaps, out_sequence))a3m_sequences[seqname] = out_sequencefasta_chunks = (f">{k} {descriptions.get(k, '')}\n{a3m_sequences[k]}"for k in a3m_sequences)return '\n'.join(fasta_chunks) + '\n'  # Include terminating newlinedef _convert_sto_seq_to_a3m(query_non_gaps: Sequence[bool], sto_seq: str) -> Iterable[str]:for is_query_res_non_gap, sequence_res in zip(query_non_gaps, sto_seq):if is_query_res_non_gap:yield sequence_reselif sequence_res != '-':yield sequence_res.lower()if __name__ == "__main__":### 1. 准备输入数据## 输入序列先通过Jackhmmer多次迭代从uniref90,MGnify数据库搜索同源序列,输出的多序列比对文件(如globins4.sto),转化为a3m格式后,再通过hhsearch从pdb数据库中找到同源序列input_fasta_file = '/home/zheng/test/Q94K49.fasta'## input_sequencewith open(input_fasta_file) as f:input_sequence = f.read()test_templates_sto_file = "/home/zheng/test/Q94K49_aln.sto"with open(test_templates_sto_file) as f:test_templates_sto = f.read()## sto格式转a3m格式()test_templates_a3m = convert_stockholm_to_a3m(test_templates_sto)hhsearch_binary_path = "/home/zheng/software/hhsuite-3.3.0-SSE2-Linux/bin/hhsearch"### 2.类实例化# scop70_1.75文件名前缀scop70_database_path = "/home/zheng/database/scop70_1.75_hhsuite3/scop70_1.75"pdb70_database_path = "/home/zheng/database/pdb70_from_mmcif_latest/pdb70"#hhsuite数据库下载地址:https://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/  ## 单一数据库#template_searcher = HHSearch(binary_path = hhsearch_binary_path,#                             databases = [scop70_database_path])## 多个数据库database_lst = [scop70_database_path, pdb70_database_path]template_searcher = HHSearch(binary_path = hhsearch_binary_path,databases = database_lst) ### 3. 同源序列搜索## 搜索结果返回.hhr文件字符串templates_result = template_searcher.query(test_templates_a3m)print(templates_result)## pdb序列信息列表template_hits = template_searcher.get_template_hits(output_string=templates_result, input_sequence=input_sequence)print(template_hits)

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/157654.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

vue2.0中使用v-if/v-show切换后echarts不显示和宽高问题

vue2.0中使用v-if/v-show切换后echarts不显示和宽高问题 需求描述问题描述问题解析 解决方案使用v-show替代&#xff08;不推荐&#xff09;v-if使用$nextTick&#xff08;推荐&#xff09; 需求描述 使用ehcarts时&#xff0c;请求数据时加loading,请求结束后取消loading并显示…

redis之高可用

&#xff08;一&#xff09;redis之高可用 1、在集群当中有一个非常重要的指标&#xff0c;提供正常服务的时间的百分比&#xff08;365天&#xff09;99.9% 2、redis的高可用的含义更加广泛&#xff0c;正常服务是指标之一&#xff0c;数据容量的扩展、数据的安全性 3、在r…

项目架构的发展

项目架构的发展 1. 单体架构 单体架构指的是将整个应用程序构建为单一的、独立的单元。在软件开发中&#xff0c;单体架构通常指的是将一个应用程序作为一个整体来开发、部署和管理&#xff0c;所有的功能模块都打包在一起&#xff0c;共享同一个数据库和代码库。 在单体架构…

存储日志数据并满足安全要求

日志数据是包含有关网络中发生的事件的记录的重要信息&#xff0c;日志数据对于监控网络和了解网络活动、用户操作及其动机至关重要。 由于网络中的每个设备都会生成日志&#xff0c;因此收集的数据量巨大&#xff0c;管理和存储所有这些数据成为一项挑战&#xff0c;日志归档…

python基本语法

基本的 Python 语法&#xff1a; 变量和数据类型&#xff1a; # 定义变量 x 5# 不需要显式声明数据类型&#xff0c;Python 会自动推断 name "John"# 常见的数据类型包括整数、浮点数、字符串、列表、字典等 my_list [1, 2, 3] my_dict {key: value}条件语句&…

【C语言】数据结构——栈和队列实例探究

&#x1f497;个人主页&#x1f497; ⭐个人专栏——数据结构学习⭐ &#x1f4ab;点击关注&#x1f929;一起学习C语言&#x1f4af;&#x1f4ab; 目录 导读&#xff1a;一、 栈1. 栈的概念及结构2. 栈的实现3. 实现代码3.1 定义结构体3.2 初始化栈3.3 销毁栈3.4 入栈3.5 出栈…

【Java 进阶篇】深入理解 Jackson:Java 对象转 JSON 的艺术

嗨&#xff0c;亲爱的小白们&#xff01;欢迎来到这篇关于 Jackson JSON 解析器中 Java 对象转 JSON 的详细解析指南。JSON&#xff08;JavaScript Object Notation&#xff09;是一种轻量级的数据交换格式&#xff0c;而 Jackson 作为一个强大的 JSON 解析库&#xff0c;能够帮…

贪心:leetcode2216 美化数组的最少删除数

2216. 美化数组的最少删除数 给你一个下标从 0 开始的整数数组 nums &#xff0c;如果满足下述条件&#xff0c;则认为数组 nums 是一个 美丽数组 &#xff1a; nums.length 为偶数对所有满足 i % 2 0 的下标 i &#xff0c;nums[i] ! nums[i 1] 均成立 注意&#xff0c;空…

Unity PlayerPrefs相关应用

PlayerPrefs是Unity游戏引擎中的一个类&#xff0c;用于在游戏中存储和访问玩家的偏好设置和数据。它可以用来保存玩家的游戏进度、设置选项、最高分数等信息。PlayerPrefs将数据存储在本地文件中&#xff0c;因此可以在游戏重新启动时保持数据的持久性。 //PlayerPrefs的数据…

基于SVM的车牌识别算法

基于SVM的车牌识别系统&#xff08;Python代码实现&#xff09; 车牌识别系统是智能交通系统的重要组成部分&#xff0c;有着广泛的应用。车牌识别系统主要有车牌定位、字符分割和字符识别三部分组成&#xff0c;本文的研究重点是车牌字符识别这部分&#xff0c;本文提出了一种…

RT-Thread Hoist_Motor PID

本节介绍的是一个举升电机&#xff0c;顾名思义&#xff0c;通过转轴控制物体升降&#xff0c;为双通道磁性译码器&#xff0c;利用电调进行操控&#xff0c;具体驱动类似于大学期间最大众的SG180舵机&#xff0c;在一定的频率下&#xff0c;通过调制脉宽进行控制。 设备介绍…

数据结构 图

树是无环连通图&#xff0c;是一种特殊的图。 分类 图分为有向图[边是有方向的]和无向图[边是无方向的]。 无向图(a—b)&#xff0c;建立两条有向图(a—>b&#xff0c;b—>a)&#xff0c;无向图是一种特殊的有向图。 存储有向图 邻接矩阵 ——用于存储比较稠密的图【…

MyBatis的xml实现

1.下载插件MyBatisX 2.添加依赖 <!--Mybatis 依赖包--><dependency><groupId>org.mybatis.spring.boot</groupId><artifactId>mybatis-spring-boot-starter</artifactId><version>2.3.1</version></dependency><!--…

Rust错误处理机制:优雅地管理错误

大家好&#xff01;我是lincyang。 今天&#xff0c;我们要探讨的是Rust语言中的错误处理机制。 Rust作为一种系统编程语言&#xff0c;对错误处理的重视程度是非常高的。它提供了一套既安全又灵活的机制来处理可能出现的错误。 Rust错误处理的两大类别 在Rust中&#xff0…

Pandas-pd.to_numeric函数知识点总结

前言 本文是该专栏的第38篇,后面会持续分享python数据分析的干货知识,记得关注。 我们在处理数据分析项目的时候,通常会需要处理各种类型的数据,比如说“时间日期,字符串,布尔值”等等类型。有的时候,恰巧需要用Pandas将这些数据转换为数值类型,以便于后期进行统计或计…

JavaScript中的假值对象是什么?

JavaScript是一种非常灵活且强大的编程语言&#xff0c;但有时候它的一些特性可能会让人感到困惑。其中一个常见的问题就是假值对象。在本文中&#xff0c;我们将探讨什么是假值对象&#xff0c;并通过代码示例来解释这个概念。 什么是假值对象&#xff1f; 在JavaScript中&am…

Java 类之 java.lang.reflect.Field

Java 类之 java.lang.reflect.Field 文章目录 Java 类之 java.lang.reflect.Field一、概述1、java.lang.Class 类获取字段的方法获取全部公有字段&#xff08;含继承的&#xff0c;不含私有的&#xff09;获取本类的所有字段&#xff08;不含继承的&#xff0c;含私有的&#x…

去掉表格里某一列单元格的所有后缀

根据代码 import pandas as pd# 加载数据 file_path your_data_file.csv # 替换为您的文件路径 data pd.read_csv(file_path)# 去掉 name 列中所有单元格的 .jpg 后缀 data[name] data[name].str.replace(.jpg, , regexFalse)# 显示修改后的前几行数据 print(data.head())#…

vue下载xlsx表格

vue下载xlsx表格 // 导入依赖库 import XLSX from xlsx; import FileSaver from file-saver; methods:{btn(){let date new Date()let Y date.getFullYear() -let M (date.getMonth() 1 < 10 ? 0 (date.getMonth() 1) : date.getMonth() 1) -let D (date.getDat…

【设备树添加节点】

节点结束位置都需要加分号 of_iomap 完成映射 of_property_read_u32_array of_property_read_string of_fine_node_by_path