A Simple Introduction to Read Simulators

Read simulators are widely used in the research community to create synthetic and mock datasets for analysis. In this article, I will introduce some recently proposed and commonly used read simulators.

DNA Sequencing and Reads

If you have come across my previous article on DNA Sequence Data Analysis, you may have read about DNA sequencing. Sequencing is the process that determines the precise order of nucleotides of a given DNA molecule. We can determine the order of the four bases adenine, guanine, cytosine and thymine, in a strand of DNA. DNA sequencing is used to determine the sequence of individual genes, full chromosomes or entire genomes of an organism.

Special machines known as sequencing machines are used to extract short random DNA sequences from the genome we wish to determine (the target genome). Current DNA sequencing technologies cannot read a whole genome at once. They read small pieces of between 100 and 30,000 bases, depending on the technology used. These short pieces are called reads.

Read Simulators

Sequencing machines may not always be available, and we may not be able to get hold of real-world samples to sequence. This is where read simulators come in handy for research purposes. Read simulators mimic sequencing machines to produce simulated reads. They ship with pre-defined statistical models that reproduce the error rates of particular sequencing machines. Furthermore, we can provide our own error models (different rates of insertions, deletions and substitutions).
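
To make the idea of an error model concrete, here is a toy sketch in Python. It is not how any of the simulators discussed in this article work internally; it only assumes a uniform per-base chance of substitution, insertion and deletion, and the function names and default rates are made up for illustration.

```python
import random

BASES = "ACGT"


def simulate_read(genome, read_length, sub_rate, ins_rate, del_rate, rng):
    """Sample one read from a random position and apply simple per-base errors."""
    start = rng.randrange(len(genome) - read_length + 1)
    fragment = genome[start:start + read_length]
    read = []
    for base in fragment:
        r = rng.random()
        if r < del_rate:
            # Deletion: the true base is dropped from the read.
            continue
        if r < del_rate + ins_rate:
            # Insertion: a random extra base appears before the true base.
            read.append(rng.choice(BASES))
            read.append(base)
        elif r < del_rate + ins_rate + sub_rate:
            # Substitution: the true base is replaced by a different one.
            read.append(rng.choice([b for b in BASES if b != base]))
        else:
            read.append(base)
    return "".join(read)


def simulate_reads(genome, n_reads, read_length,
                   sub_rate=0.01, ins_rate=0.0, del_rate=0.0, seed=0):
    """Simulate n_reads reads; the seed keeps the toy example reproducible."""
    rng = random.Random(seed)
    return [
        simulate_read(genome, read_length, sub_rate, ins_rate, del_rate, rng)
        for _ in range(n_reads)
    ]
```

Real simulators replace the uniform rates with models learned from actual sequencing runs, for example error rates that depend on the position in the read, and they also emit quality scores.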

Estimating Sequencing Coverage

Sequencing coverage is defined as the average number of reads that cover each base of the reference genome. Estimating the sequencing coverage is very important when you are simulating datasets. The coverage equation is defined as follows.

C = LN / G

  • C is the sequencing coverage
  • G is the length of the genome
  • L is the read length
  • N is the number of reads

For example, if you have a genome of length 5Mbp and you simulate 1,000,000 HiSeq 2000 reads (read length is 100bp), then we will get a sequencing coverage of 20x as follows.

C = LN / G = 100 * 1,000,000 / 5,000,000 = 20x

Here, each position of the reference genome is covered by 20 reads on average. Because reads are sampled at random, individual positions may be covered by more or fewer reads.
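
The same arithmetic can be written as a small Python sketch. The helper names are my own; the second function simply rearranges C = LN / G to find the number of reads needed for a target coverage.

```python
import math


def coverage(read_length, n_reads, genome_length):
    """Average sequencing coverage: C = L * N / G."""
    return read_length * n_reads / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Number of reads for a target coverage: N = C * G / L, rounded up."""
    return math.ceil(target_coverage * genome_length / read_length)


# The example from the text: 5 Mbp genome, 1,000,000 reads of 100 bp.
print(coverage(100, 1_000_000, 5_000_000))   # 20.0
print(reads_needed(20, 100, 5_000_000))      # 1000000
```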

Estimating Abundance

The abundance of a species in a dataset is the fraction of reads that belong to that species. For example, if a dataset has 10,000,000 reads and 1,000,000 of them belong to E. coli, then the abundance of E. coli is 0.1.

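Note that coverage and abundance are not the same. Abundance depends only on the share of reads that belong to a species, whereas coverage also depends on the length of its genome.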
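
The difference is easy to see in a short Python sketch. The helper names and the genome lengths in the example are hypothetical; the second function applies C = LN / G per species, assuming every read has the same length.

```python
def abundance(reads_per_species):
    """Fraction of all reads that belong to each species."""
    total = sum(reads_per_species.values())
    return {name: n / total for name, n in reads_per_species.items()}


def coverage_from_abundance(abundances, genome_lengths, total_reads, read_length):
    """Per-species coverage, using C = L * N / G with N = total_reads * abundance."""
    return {
        name: read_length * total_reads * a / genome_lengths[name]
        for name, a in abundances.items()
    }


# The example from the text: 1,000,000 of 10,000,000 reads are E. coli.
fractions = abundance({"E. coli": 1_000_000, "other": 9_000_000})
print(fractions["E. coli"])

# Two species with equal abundance but different (made-up) genome lengths
# end up with different coverage.
cov = coverage_from_abundance(
    {"species_a": 0.5, "species_b": 0.5},
    {"species_a": 5_000_000, "species_b": 2_500_000},
    total_reads=1_000_000,
    read_length=100,
)
print(cov)
```

With equal abundance, the species with the shorter genome receives twice the coverage, which is why the two quantities should not be used interchangeably.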

Short Read Simulators

With the popularity of next-generation sequencing (NGS) technologies, many NGS read simulators have been developed. Most of the popular short read simulators are designed to mimic reads from the Illumina, 454 and SOLiD platforms. Listed below are some popular short read simulators.

  1. MetaSim
  2. wgsim
  3. SimNGS
  4. ArtificialFastqGenerator
  5. InSilicoSeq

Long Read Simulators

With the advancements in sequencing technologies, scientists have shown an increasing interest in third-generation sequencing (TGS) technologies. Most of the popular long read simulators are designed to mimic reads from the two main TGS technologies: (1) Pacific Biosciences (PacBio) and (2) Oxford Nanopore (ONT). Listed below are some of the popular and recently introduced PacBio and ONT simulators.

PacBio Simulators

  1. PBSIM
  2. LongISLND
  3. SimLoRD
  4. NPBSS
  5. PaSS

ONT Simulators

  1. NanoSim
  2. Nanopore SimulatION
  3. DeepSimulator
  4. DeepSimulator1.5

InSilicoSeq

I have been using InSilicoSeq in my work a lot and I find it very intuitive and easy to use. I will walk you through some sample commands to simulate reads. You can easily install InSilicoSeq using conda or pip.

conda install -c bioconda insilicoseq
OR
pip install InSilicoSeq

Simulate reads by providing the number of reads

Assume that you have a single reference genome and you want to simulate 1 million Illumina MiSeq reads. Given below is a sample command you can run using InSilicoSeq.

iss generate --model miseq --genomes ref.fasta --n_reads 1M --cpus 8 --output reads

Simulate reads by providing the coverage

Assume that you have two reference genome files ref1.fasta and ref2.fasta. You want to simulate 30x coverage from ref1 and 10x coverage from ref2. You will need to create a tab-separated file named coverages.tsv and add the coverage details as follows.

ref1_id	30
ref2_id	10

ref1_id and ref2_id refer to the identifiers of the sequences in ref1.fasta and ref2.fasta. If you download the reference genomes from NCBI, the identifiers will consist of letters and numbers and may look something like NC_007712.1 or CP001844.2. These identifiers are the NCBI accession numbers provided for each reference genome.
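
If you would rather not copy the identifiers by hand, a small Python sketch like the one below can build the file. It assumes that the identifier is the first whitespace-delimited token of the FASTA header line and that each file holds a single sequence; the helper names first_fasta_id and write_coverage_file are hypothetical and not part of InSilicoSeq.

```python
def first_fasta_id(path):
    """Return the identifier of the first record in a FASTA file.

    Assumption: the identifier is the first token after '>' on the header line.
    """
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                return line[1:].split()[0]
    raise ValueError(f"no FASTA header found in {path}")


def write_coverage_file(fasta_to_coverage, out_path):
    """Write one 'identifier<TAB>coverage' line per reference FASTA file."""
    with open(out_path, "w") as out:
        for fasta_path, cov in fasta_to_coverage.items():
            out.write(f"{first_fasta_id(fasta_path)}\t{cov}\n")


# Example usage with the file names from the text:
# write_coverage_file({"ref1.fasta": 30, "ref2.fasta": 10}, "coverages.tsv")
```

A FASTA file with several records (for example a chromosome plus plasmids) has several identifiers, so this sketch would need to be extended to list each of them.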

Now you can simulate the reads using the following command.

iss generate --model miseq --genomes ref1.fasta ref2.fasta --coverage_file coverages.tsv --cpus 8 --output reads

Simulate reads by providing the abundance

Assume that you have two reference genome files ref1.fasta and ref2.fasta. You want to simulate 0.4 abundance from ref1 and 0.6 abundance from ref2. Note that the sum of all the abundance values should be 1.0. Similar to coverage, you will need to create a tab-separated file named abundance.tsv and add the abundance details as follows.

ref1_id	0.4
ref2_id	0.6

Now you can simulate the reads using the following command.

iss generate --model miseq --genomes ref1.fasta ref2.fasta --abundance_file abundance.tsv --cpus 8 --output reads
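
Since the abundance values should sum to 1.0, it can be worth checking the file before starting a long simulation. Below is a minimal Python sketch of such a check; the function name and the tolerance are my own choices, not something InSilicoSeq provides.

```python
import math


def read_abundance_file(path, tolerance=1e-6):
    """Parse an 'identifier<whitespace>abundance' file and check the sum is 1.0."""
    abundances = {}
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            genome_id, value = line.split()
            abundances[genome_id] = float(value)
    total = sum(abundances.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"abundances sum to {total}, expected 1.0")
    return abundances
```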

You can read more details from the InSilicoSeq documentation.

PBSIM

PBSIM is a PacBio read simulator that provides both sampling-based and model-based simulation. I will walk you through some sample commands to simulate reads using PBSIM.

Model-based simulation

For model-based simulation, you can run the following command.

pbsim --data-type CLR --depth 100 --length-min 10000 --length-max 20000 --prefix test --model_qc data/model_qc_clr ref.fasta

The model can be found in the PBSIM folder at PBSIM-PacBio-Simulator/data/model_qc_clr. The data type CLR refers to Continuous Long Reads, which are long and have high error rates. The other data type, CCS, refers to Circular Consensus reads, which are shorter and have lower error rates.

Sampling-based simulation

For sampling-based simulation, you can run the following command.

pbsim --data-type CLR --depth 100 --sample-fastq sample/sample.fastq sample/sample.fasta

The sample FASTQ file can be found in the PBSIM folder PBSIM-PacBio-Simulator/sample/sample.fastq. You can use your own FASTQ file as well.

You can read more details from the PBSIM documentation.

SimLoRD

SimLoRD is a TGS read simulator based on the Pacific Biosciences SMRT error model. I have frequently used SimLoRD to simulate PacBio datasets for my work. I will walk you through some sample commands to simulate reads using SimLoRD.

Simulate fixed-length reads by providing the coverage

Assume that you have a reference genome and you want to simulate fixed-length reads with 60x coverage. Given below is a sample command you can run using SimLoRD.

simlord --read-reference ref.fasta --coverage 60 --fixed-readlength 5000 output_prefix

Simulate fixed-length reads by providing the number of reads

Assume that you have a reference genome and you want to simulate 2000 fixed-length reads. Given below is a sample command you can run using SimLoRD.

simlord --read-reference ref.fasta --num-reads 2000 --fixed-readlength 5000 output_prefix

You can also set a minimum length for the reads using the --min-readlength parameter during the simulation. You can read more from the SimLoRD documentation.

Final Thoughts

Read simulators have given us the opportunity to simulate reads ranging from zero errors to very high error rates. Also, they have allowed us to create synthetic and mock datasets mimicking different sequencing machines and different species compositions.

I hope you found this article useful and informative as a starting point for using read simulators. Feel free to use these tools in your projects and research work, as they are freely available.

Cheers, and stay safe!

You can read my previous articles related to bioinformatics and DNA analysis.

Translated from: https://medium.com/computational-biology/a-simple-introduction-to-read-simulators-bbeff4f0c0c6
