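Basic Algorithms for Bioinformatics Sequence Operations
Best practiced in Jupyter, with Python 3.9.
1. Implementing an algorithm that finds the overlap between two sequences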
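# min_length: minimum number of overlapping bases (default 3)
def getOverlapIndexAndSequence(a, b, min_length=3):
    """Find the longest suffix of 'a' that matches a prefix of 'b' and is
    at least 'min_length' characters long. Return a tuple of
    (overlap length, overlap sequence). If no such overlap exists, return 0.
    """
    # Position in a at which the search starts
    start = 0
    while True:
        # Look in a for the first min_length bases (the prefix) of b
        start = a.find(b[:min_length], start)
        # No match: there is no overlap, so return 0
        if start == -1:
            return 0
        # If the suffix of a that begins at start is also a prefix of b,
        # an overlap exists: return its length and the overlap sequence
        if b.startswith(a[start:]):
            return len(a) - start, a[start:]
        # Otherwise shift right by one base and search again
        start += 1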
2. Testing the algorithm
getOverlapIndexAndSequence('TTACGT', 'CGTGTGC')
# (3, 'CGT')  overlap length (here it happens to equal the start index in a, which is also 3) and the overlapping bases
getOverlapIndexAndSequence('TTACGT', 'GTGTGC')
# 0
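The two results above can be cross-checked against a brute-force search. The following is only a minimal sketch, not part of the original code. `overlap_find` restates the find-based loop from section 1 but, as an assumption made here for easier comparison, always returns a tuple, using `(0, '')` when there is no overlap. `overlap_naive` is a hypothetical helper that simply tries every suffix length from longest to shortest. Both assume that each sequence is at least `min_length` bases long.

```python
def overlap_find(a, b, min_length=3):
    """find-based search; returns (overlap length, overlap sequence),
    or (0, '') when no overlap of at least min_length exists."""
    start = 0
    while True:
        # Locate the next occurrence of b's prefix in a
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0, ''
        # The suffix of a beginning at start must be a prefix of b
        if b.startswith(a[start:]):
            return len(a) - start, a[start:]
        start += 1


def overlap_naive(a, b, min_length=3):
    """Brute force: compare suffixes of a with prefixes of b,
    from the longest possible length down to min_length."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        suffix = a[len(a) - k:]
        if suffix == b[:k]:
            return k, suffix
    return 0, ''


# Both approaches should agree on the test sequences from section 2
for a, b in [('TTACGT', 'CGTGTGC'), ('TTACGT', 'GTGTGC')]:
    assert overlap_find(a, b) == overlap_naive(a, b)
    print(a, b, overlap_find(a, b))
```

If the start position of the overlap in `a` is needed, it can be recovered from the returned length as `len(a) - length`.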