NCBI Virus Help Documentation

What is NCBI Virus?


  1. Compare your sequence to those in the NCBI Virus database using the NCBI BLAST algorithm.
  2. Search, view and download nucleotide and protein sequences using virus name or taxonomy group.
  3. Quickly access common data sets for all viruses, all human viruses, bacteriophages, or sequences released in the past month.
  4. Explore the massive, normalized datasets and identify data trends.

Ways to access NCBI Virus data

Select one of the three options to access NCBI Virus data.
Option 1:
From the navigation menu, open the Find data tab and select one of the drop-down links:
Search by sequence, to use the virus-specific NCBI BLAST tool.
Search by virus, to search virus sequences by virus name or taxonomy.
All viruses, Human viruses, Bacteriophages, New sequences (past one month), or Available SARS-CoV-2 sequences, to view preselected data sets.
Option 2:
The same functions are available through the Search by sequence and Search by virus buttons on the NCBI Virus home page.
The results can be viewed in the Results Table, and further refined by utilizing the sequence attributes (metadata) in the Refine Results panel located on the right side of the table. Additionally, you can download the results, conduct multiple sequence alignments, and generate phylogenetic trees using the selected results.


Option 3:
Through the NCBI Visual Data Dashboard, via the statistics buttons in the top row of the dashboard.

NCBI Virus BLAST™ tool

The NCBI Virus BLAST™ tool provides rapid insight into query sequences by presenting BLASTn and BLASTp results alongside normalized metadata, when available. These attributes include isolation source, host, country, collection and release dates, as well as taxonomy and genetic attributes such as completeness and, when applicable, segment or protein names. The normalized metadata is generated by an internal, curator-guided data-processing pipeline that maps sequence-record attributes to standardized vocabularies to provide a user-friendly view of the data.

Compare your sequence to those in the NCBI Virus database using the BLAST algorithm

Click the Search by sequence button (or select this option from the Find data navigation tab at the top of the page).
Select the Nucleotide or Protein tab. The Nucleotide tab performs a BLASTn search against all NCBI Virus nucleotide sequences. The Protein tab performs a BLASTp search against all NCBI Virus protein sequences. Read more about BLAST™ searches in the NCBI BLAST Guide.
In the NCBI Virus Search by sequence input form, enter an NCBI sequence accession, or a sequence in plain text or FASTA format, and click Start search.
The BLAST search results will open in a new window, presented in a tabulated format (the Results Table).

Compare your sequences to the sequences in the up-to-date Betacoronavirus database

In response to the SARS-CoV-2 outbreak, the Betacoronavirus BLAST database was created. It is updated regularly and includes all sequences from the genus Betacoronavirus. To search your sequence against the Betacoronavirus database using BLAST:
Click the Search by sequence button (or select this option from the Find data navigation tab at the top of the page).
Select the Nucleotide or Protein tab.
In the NCBI Virus Search by sequence input form, enter an NCBI sequence accession, or a sequence in plain text or FASTA format, and click the Search up-to-date Betacoronavirus DB button.
The BLAST search results will open in a separate window in a tabular format (the Results Table).

Compare BLAST results in the Results Table

The Nucleotide tab performs a BLASTN search against all NCBI Virus nucleotide sequences, using Megablast (optimized for highly similar sequences).
The Protein tab performs a BLASTP search against all NCBI Virus protein sequences. Read more about BLAST algorithms in the NCBI BLAST help documentation.
In the BLAST search Results Table, you can compare search results in a tabular display using the following sortable default columns:
Accession - the NCBI accession number of the NCBI Virus database sequence. Reference sequence accessions are marked with the label “RefSeq”.
Coverage - query coverage.
Identity - the highest percent identity of all query-subject alignments.

Submitters - the authors who submitted the sequence. Only the first submitter’s name is displayed in the column (for example, Baranov,P.V., et al.). To obtain the full list of submitters, click the sequence accession number to open the details panel, then click the accession number in the details panel to open the GenBank Entrez page with all information available for the selected sequence. Alternatively, use the Download button with the CSV format option. The “Submitters” column in the downloaded table contains the names of all authors who submitted each sequence.

Release date - the date when the sequence was released (publicly appeared) in GenBank or other INSDC databases.

Isolate - the individual isolate from which the sequence was obtained, typically an alphanumeric sample ID. The isolate name is parsed from the “/isolate” field of the GenBank record. SARS-CoV-2 isolate names are formatted according to the definitions of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV).
For example, “isolate: Han” indicates that the sample came from a specific population, and “isolate: Prostate Cancer Cell Line” indicates that the sample came from a specific cell type.

Species - virus species name.

Molecule type - viral nucleic acid type. The molecule type is provided by the International Committee on Taxonomy of Viruses (ICTV) in the Master Species List and maintained in the NCBI Taxonomy database. RefSeqs with an “Unknown” molecule type belong to taxonomic groups that are not yet recognized by the ICTV.

Length - sequence length.

Geo Location - country/region of virus specimen collection. May contain additional geographic information, for example, a US state.

BLAST results can be customized by adding or removing columns in the Results Table through the Select columns drop-down menu.

Additional columns include:

USA. If the sample was collected in the United States, the column shows the state abbreviation.

Host - virus isolation host, that is, the scientific (Latin) name of the natural (non-laboratory) host species from which the sample was derived (read more about isolation host vocabulary mapping). If the isolation host is unknown (the /host field of the GenBank record is empty) but a laboratory host is present (as indicated in the /lab_host field of the GenBank record), the laboratory host is shown in the host column of the Results Table. If both the isolation host and the laboratory host can be mapped, only the isolation host is shown in the host column.

Collection Date – virus specimen collection date.

SRA accession - NCBI Sequence Read Archive (SRA) accession number.

Score - the total alignment scores (Total score) from all alignment segments.

Family.
Sequence type - complete/partial/proviral/refseq (read more about sequence type here).
Nuc completeness - nucleotide completeness (note: this is preliminary data and is not always accurate).
Genotype.
Segment - segment name, for segmented viruses.
Publications - links to publications in PubMed associated with the sequences.
Country - country of specimen collection (country only, with no additional information).
Isolation source - sequence isolation source (read more about isolation source here).
BioSample - NCBI BioSample accession number.
BioProject - NCBI BioProject accession number.
GenBank title.
The default number of rows displayed in the Results Table is 200. You can change the number of rows by selecting the number of results per page (200, 100, 50 or 25) in the Select Columns menu.

View BLAST Alignment of selected sequences

To compare search results in pair-wise alignment:
Select sequences to display.
Click on View BLAST Alignment of selected sequences link displayed in the center of the Info panel located above the Results Table.
The new page will show a graphical view of pairwise alignments between selected BLAST results and the query, along with a feature map (if available) of the query at the top of the view.

To learn how to use the alignment viewer, please refer to the NCBI Multiple Sequence Alignment Viewer documentation.

Build multiple sequence alignment of selected BLAST results

To build a multiple sequence alignment from selected BLAST results:
Select the sequences that you want to align.
Click the Align button on the right above the Results Table.

The multiple sequence alignment will open on a new page. Multiple sequence alignments are calculated using MUSCLE.

To learn how to use the alignment viewer, please refer to the NCBI Multiple Sequence Alignment Viewer documentation.

Build phylogenetic tree of selected BLAST results

To build a phylogenetic tree to see the relationships of selected sequences:
Select sequences to display.
Click the Build Phylogenetic Tree button on the right above the Results Table.
The tree will be calculated and displayed in the tree viewer on a separate page.
For more about Tree Viewer and how to use it, please refer to the NCBI Tree Viewer help documentation.

Refine tabular BLAST results via filters

1. Virus name or taxonomy

To restrict search results to a particular virus group:
On the BLAST results page, in the Refine Results panel (upper left corner), click Virus.
In the text box, paste or start typing a single virus taxonomy name or taxid (only the top 5 taxa will be shown).
Select your taxid (NCBI Taxonomy database ID) from the fly-out menu.
The filtered results will be presented in the Results Table with the following default sortable columns: accession, coverage, identity, species, country, host, collection date. Additional columns displaying related metadata can be added via the Customize Table menu. The query sequence will be highlighted in the first row of the table.

2. Accession

You can search for specific accessions in the Results Table by entering them in the search form under the Accession filter. The results in the table will be limited to the entered accession numbers.

3. Sequence length

To restrict your results to a particular sequence length range, enter the minimum and maximum length in nucleotides (for a nucleotide search) or amino acids (for a protein search).

4. Ambiguous Characters


5. Sequence type

All sequences (Nucleotide or Protein) available in the NCBI Virus resource can be filtered by the following sequence types: GenBank and RefSeq.
GenBank sequences include all sequences available in GenBank, except RefSeqs.
RefSeq-filtered nucleotide sequences include all reference sequences for the selected virus. Note that a few RefSeqs are partial genomes, based on the International Committee on Taxonomy of Viruses (ICTV) proposal.

6. RefSeq genome completeness

Complete or partial RefSeq genomes - filter for all complete (or partial) genomes, reference records (RefSeqs), and proteins from these RefSeqs. For segmented viruses, complete genomes contain all genome segments. Most RefSeq records are complete, but a few RefSeqs are partial, based on the International Committee on Taxonomy of Viruses (ICTV) proposal.

7. Nucleotide completeness

Complete nucleotide sequences - filter for all NCBI viral nucleotide sequences whose GenBank ASN.1 record contains the descriptor descr/molinfo/completeness=complete, or whose definition line (defline) contains the word ‘complete’. This also includes complete reference records (RefSeqs).

Partial nucleotide sequence - filter for sequences that are not complete according to the definition above.

If the Protein tab is selected and the complete nucleotide sequence filter is applied, the results will include all proteins from complete genomes or, for segmented viruses, from individual complete segments.
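
The completeness rule above can be sketched in a few lines of Python. This is only an illustration, not NCBI's implementation: the function name and its two plain-string arguments (stand-ins for the descr/molinfo/completeness descriptor and the defline of an already parsed record) are assumptions, and the defline check is simplified to a case-insensitive whole-word match.

```python
import re


def nucleotide_completeness(molinfo_completeness, defline):
    """Return "complete" or "partial" following the rule described above.

    molinfo_completeness: value of descr/molinfo/completeness, or None if absent
    defline: the record's definition line
    Both parameters are hypothetical stand-ins for fields of a parsed record.
    """
    if molinfo_completeness == "complete":
        return "complete"
    # Whole-word match, so that "incomplete" does not count as "complete".
    if re.search(r"\bcomplete\b", defline or "", flags=re.IGNORECASE):
        return "complete"
    return "partial"
```

Under this simplified rule, any defline that contains the word "complete" is classified as complete, which may be one reason the Nuc completeness column is described as preliminary.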

8. Isolate

Isolate - the individual isolate from which the sequence was obtained, typically an alphanumeric sample ID. The isolate name is parsed from the “/isolate” field of the GenBank record. SARS-CoV-2 isolate names are formatted according to the definitions of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV).

9. Proteins

The protein name is parsed from the “/product=” field of GenBank nucleotide and protein records.

10. Provirus

Provirus sequences - filter for sequences that have “/proviral” source qualifier in the GenBank record.

11. Geographic region

The Geographic region filter allows you to type your country of interest in the text box or select the continent(s) of interest. Selecting a continent also selects all the countries within that continent automatically.
Clicking the arrow next to a continent’s name opens a secondary selection menu where you can select or deselect the countries belonging to that continent. The selected countries are listed below the continent name.
If an entire continent is selected, the continent’s name is shown in a pillbox below, indicating that all countries in the continent are selected. If one or more individual countries are selected instead, the continent pillbox is no longer displayed and a pillbox for each selected country is shown below the associated continent. Each continent behaves independently of the other continents.
Selections can be removed by clicking the pillboxes, and multiple concurrent selections are supported.

12. Isolation host or taxonomy

Enter a host name or taxid in the text box, and several host terms will be suggested (only the top 20 taxids will be shown). Select the desired host term and press Enter. The results will be restricted to sequences in the database with the indicated host term. Multiple hosts can be filtered on simultaneously by adding more host terms to the filter.
The terms for isolation host are parsed from the source/host field in a sequence’s GenBank record. Parsed terms are mapped to a standardized vocabulary, which curators derived by aggregating the variety of terms found in GenBank files. Common misspellings are also included in this mapping strategy. For example, “Accipter cooperii” is mapped to “Accipiter cooperii”.
The terms for isolation hosts are displayed in the host column of the Results Table. If the isolation host is unknown but a laboratory host is present (as indicated in the /lab_host field of the GenBank record), the laboratory host is shown in the host column of the Results Table. If both the isolation host and the laboratory host can be mapped, only the isolation host is shown in the host column.
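
The mapping and precedence rules for the host column can be illustrated with a short Python sketch. It is not the curated NCBI pipeline: the vocabulary below is a tiny hypothetical excerpt (only the “Accipter cooperii” entry comes from the text above), and the function names are assumptions made for this example.

```python
# Hypothetical excerpt of a curated vocabulary: raw GenBank term -> standardized term.
# Only the "Accipter cooperii" misspelling is taken from the text; the rest is illustrative.
HOST_VOCABULARY = {
    "accipiter cooperii": "Accipiter cooperii",
    "accipter cooperii": "Accipiter cooperii",  # common misspelling
    "homo sapiens": "Homo sapiens",
    "mus musculus": "Mus musculus",
}


def map_host(raw_term):
    """Map a raw host string to the standardized vocabulary, or None if unmapped."""
    if not raw_term:
        return None
    return HOST_VOCABULARY.get(raw_term.strip().lower())


def host_column(host, lab_host):
    """Value shown in the host column: the isolation host wins, the lab host is the fallback."""
    return map_host(host) or map_host(lab_host)
```

For instance, `host_column("Accipter cooperii", None)` resolves the misspelling to the standardized term, and `host_column("Homo sapiens", "Mus musculus")` keeps only the isolation host.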

13. Submitters

To search for sequences submitted by particular authors, enter the authors’ last names with or without initials.
The following formats are supported:
Chiang,T.Y. Forsyth,K.A. Knittig,L.C. Lim,O.P.
Chiang,T.Y., Forsyth,K.A., Knittig,L.C., Lim,O.P.
Chiang Forsyth Knittig Lim
Chiang, Forsyth, Knittig, Lim
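
To make the supported formats concrete, here is a minimal Python sketch that normalizes any of them into (last name, initials) pairs. The function is hypothetical and only mirrors the examples listed above; how NCBI Virus actually parses the Submitters field is not documented here, and last names containing spaces are not handled.

```python
def parse_submitters(query):
    """Split a submitter query into (last_name, initials) pairs.

    Names may be separated by spaces or by ", ", and each name may carry
    initials attached after a comma (for example "Chiang,T.Y.").
    """
    pairs = []
    for token in query.split():
        token = token.rstrip(",")  # drop a separator comma, keep "Chiang,T.Y."
        if not token:
            continue
        last_name, _, initials = token.partition(",")
        pairs.append((last_name, initials))
    return pairs
```

All four example strings yield the same four last names; the initials are empty strings when they are omitted.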

14. Isolation source

The terms for isolation source are parsed from the isolation source field in a sequence’s GenBank record. Examples of parsed terms are serum and plasma, which are both mapped to the standardized vocabulary term blood.

Common misspellings and regional spelling differences are included in the mapping strategy. Multiple terms can be selected.

15. Sample collection date

Collection date (From, To) - the collection date of the sample from which the sequence was derived.
By default, the To: date is set to the current date.
Use the mm/dd/yyyy or yyyy format, or click the calendar icon and select dates.
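
The two accepted date formats can be handled as in the following Python sketch, which is useful when filtering a downloaded table by the same criteria. The expansion of a bare year is an assumption (January 1 for a From value, December 31 for a To value); the text above does not say how NCBI Virus interprets a year-only entry, and the function names are hypothetical.

```python
from datetime import date, datetime


def parse_filter_date(text, is_end=False):
    """Parse "mm/dd/yyyy" or "yyyy" into a datetime.date.

    Assumption (not stated in the help text): a bare year expands to
    January 1 for a From value and December 31 for a To value.
    """
    text = text.strip()
    if len(text) == 4 and text.isdigit():
        year = int(text)
        return date(year, 12, 31) if is_end else date(year, 1, 1)
    return datetime.strptime(text, "%m/%d/%Y").date()


def in_range(collected, start, end):
    """True if a collection date falls inside the inclusive From/To range."""
    return parse_filter_date(start) <= collected <= parse_filter_date(end, is_end=True)
```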

16. Sequence release date

Release date (From, To) - the date when the sequence was released (publicly appeared) in GenBank or another INSDC database.
By default, the To: date is set to the current date.
Use the mm/dd/yyyy or yyyy format, or click the calendar icon and select dates.

17. Environmental source

The Environmental source filter lets you select virus sequences isolated from environmental sources. Generally, environmental isolates are identified by searching the /isolation_source and /note fields of GenBank records for key terms, such as sewage or ocean water, when the /host field is empty.
Select Include - to include all sequences isolated from environmental sources in the Results Table.
Select Exclude - to exclude all sequences isolated from environmental sources from the Results Table.
Select Only - to view only sequences isolated from environmental sources.
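
The Include / Exclude / Only choices behave like a three-way switch over a yes/no flag on each record. The Python sketch below illustrates that logic together with a simplified version of the environmental check described above. The record layout (a dict of GenBank qualifiers), the function names, and the key-term list are assumptions; only “sewage” and “ocean water” are named in the text, and the real term list is presumably longer.

```python
# Illustrative key terms; only "sewage" and "ocean water" are named in the text above.
ENVIRONMENTAL_TERMS = ("sewage", "ocean water")


def is_environmental(record):
    """Flag a record (a dict of GenBank qualifiers) as environmental.

    Follows the rule described above: /host must be empty and a key term
    must appear in /isolation_source or /note.
    """
    if record.get("host"):
        return False
    text = " ".join(
        record.get(field) or "" for field in ("isolation_source", "note")
    ).lower()
    return any(term in text for term in ENVIRONMENTAL_TERMS)


def apply_flag_filter(records, mode, is_flagged):
    """Apply an Include / Exclude / Only filter using the predicate is_flagged."""
    if mode == "Include":
        return list(records)
    if mode == "Exclude":
        return [r for r in records if not is_flagged(r)]
    if mode == "Only":
        return [r for r in records if is_flagged(r)]
    raise ValueError(f"unknown mode: {mode!r}")
```

The same `apply_flag_filter` helper would serve any other Include / Exclude / Only filter by passing a different predicate.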

18. Laboratory samples

The Lab host filter lets you view laboratory-isolated virus sequences. The lab host is identified by searching for the lab host name in the /lab_host field of the GenBank record. Additionally (for bacteriophages only), if the /host and /lab_host fields are empty, the lab host is identified by parsing the lab host name from the bacteriophage organism name in the GenBank record.
Select Include - to include all laboratory-isolated virus sequences in the Results Table.
Select Exclude - to exclude all laboratory-isolated virus sequences from the Results Table.
Select Only - to view only laboratory-isolated virus sequences.

Note: the lab host name can be viewed in the Results Table (in the host column) only when the isolation host cannot be identified (the /host field of the GenBank record is empty).

19. Vaccine strain

The Vaccine strain filter lets you find virus vaccine strain sequences. Vaccine strains are identified by searching for vaccine strain terms in the /isolation_source, /note, and /host fields and the definition line of the GenBank record.
Select Include - to include all virus vaccine strain sequences in the Results Table.
Select Exclude - to exclude all virus vaccine strain sequences from the Results Table.
Select Only - to view only virus vaccine strain sequences.

Search for sequences by virus name or taxonomy group

Find your virus sequence(s)
Option 1:
Select the Search by virus drop-down option from the Find Data tab of the navigation menu on any NCBI Virus page. This will open the selection menu.
Start typing in the text box, then select your taxid (NCBI Taxonomy database ID). To select all viral sequences, enter and then select the term viruses.
The results will be shown in the table.

Note: Please view a list of all viral taxonomy terms using the NCBI taxonomy pages.

Option 2:
Click the Search by virus button in the central part of the NCBI Virus home page.
Start typing in the text box, then select your taxid (NCBI Taxonomy database ID).
This will open the tabular interface with sequences from the selected taxonomy group.

Compare results in the Results Table

Click the Nucleotide tab to access genomic sequences, the Protein tab to access amino acid sequences for individual proteins, or the RefSeq Genome tab to access RefSeq genomes. For segmented viruses, each RefSeq genome includes all segments of the virus.

In the virus search Results Table, you can compare search results in a tabular display using the following sortable default columns:

==Accession== - the NCBI accession number of the NCBI Virus database sequence.
==Submitters== - the authors who submitted the sequence. Only the first submitter's name is displayed in the column (for example, Baranov,P.V., et al.). To obtain the full list of submitters, click the sequence accession number to open the details panel, then click the accession number in the details panel to open the GenBank Entrez page with all information available for the selected sequence. Alternatively, use the Download button with the CSV format option. The "Submitters" column in the downloaded table contains the names of all authors who submitted each sequence.
==Release date== - the date when the sequence was released (publicly appeared) in GenBank or other INSDC databases.
==Isolate== - the individual isolate from which the sequence was obtained, typically an alphanumeric sample ID. The isolate name is parsed from the "/isolate" field of the GenBank record. SARS-CoV-2 isolate names are formatted according to the definitions of the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (ICTV).
==Species== - virus species name.
==Molecule type== - viral nucleic acid type. The molecule type is provided by the International Committee on Taxonomy of Viruses (ICTV) in the Master Species List and maintained in the NCBI Taxonomy database. RefSeqs with an "Unknown" molecule type belong to taxonomic groups that are not yet recognized by the ICTV.
==Length== - sequence length.
==Geo Location== - country/region of virus specimen collection.
==USA.== If the sample was collected in the United States, the column shows the state abbreviation.
==Host== - virus isolation host (read more about isolation source vocabulary mapping here). If the isolation host is unknown (the /host field of the GenBank record is empty) but a laboratory host is present (as indicated in the /lab_host field of the GenBank record), the laboratory host is shown in the host column of the Results Table. If both the isolation host and the laboratory host can be mapped, only the isolation host is shown in the host column.

Search results can be customized by adding or removing columns in the Results Table through the Select Columns drop-down menu.

Additional columns include:

Isolation source – sequence isolation source (read more about isolation source here).
Collection Date – virus specimen collection date.
SRA accession - NCBI Sequence Read Archive (SRA) accession number.
Sequence type – complete/partial/refseq (read more about sequence type here).
Nuc completeness - nucleotide completeness (note: this is preliminary data and is not always accurate).
Segment - segment name, for segmented viruses.
Publications - links to publications in PubMed associated with the sequences.
BioSample – NCBI BioSample accession number.
BioProject – NCBI BioProject accession number.
GenBank title.

The default number of rows displayed in the Results Table is 200. You can change the number of rows by selecting the number of results per page (200, 100, 50 or 25) in the Select Columns menu.

Build multiple sequence alignment of selected results

Please refer to the section “Build multiple sequence alignment of selected BLAST results”; the functionality is the same.

Build phylogenetic tree of selected results

Please refer to the section “Build phylogenetic tree of selected BLAST results”; the functionality is the same.

Refine tabular results via filters

Please refer to the section “Refine tabular BLAST results via filters”; the functionality is the same.

How to find, view and download SARS-CoV-2 sequences and related metadata?

To provide free and easy access to SARS-CoV-2 genome and protein sequences and their associated metadata, we created a dedicated Severe acute respiratory syndrome coronavirus 2 data hub.

You can access the Results Table on the SARS-CoV-2 data hub by clicking the “RefSeq genomes”, “nucleotide” or “protein” links on the announcement banner on the NCBI home page, through the “Find data” navigation menu, or by using the “Up-to-date SARS-CoV-2” shortcut button in the “Search by virus” form.

The SARS-CoV-2 data hub lets you search, retrieve, analyze, and visualize SARS-CoV-2 data available in GenBank. The page also provides links to Betacoronavirus BLAST, SARS-CoV-2 articles in PubMed, SRA data, NCBI SARS-CoV-2 resources, Data Sets command line, and CDC outbreak information.

The SARS-CoV-2 data hub Results Table has a “Pangolin” column, which is specific to SARS-CoV-2 data. Pango lineages are determined by Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages). All SARS-CoV-2 GenBank records are reprocessed nightly by the Pangolin pipeline using the UShER pipeline. The field is empty if the sequence was released after that day’s Pangolin run. The field shows unclassifiable if the sequence does not meet the requirements for processing, and unassigned if the Pangolin tool cannot determine the lineage of the sequence. You can view the Pango version by downloading the results in CSV format; the version strings appear in the Pango Versions column. Each string lists the versions of the following components: pangolin/pangolin-data/constellations/scorpio. For example, 4.0.6/1.8/v0.1.8/0.3.17.
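
When working with a downloaded CSV, a version string from the Pango Versions column can be split into its four components as in this small Python sketch. The function and constant names are hypothetical; the component order follows the pangolin/pangolin-data/constellations/scorpio layout stated above.

```python
PANGO_VERSION_FIELDS = ("pangolin", "pangolin-data", "constellations", "scorpio")


def parse_pango_versions(version_string):
    """Split a Pango Versions string into a dict keyed by component name."""
    parts = version_string.strip().split("/")
    if len(parts) != len(PANGO_VERSION_FIELDS):
        raise ValueError(
            f"expected {len(PANGO_VERSION_FIELDS)} '/'-separated versions, "
            f"got {version_string!r}"
        )
    return dict(zip(PANGO_VERSION_FIELDS, parts))
```

Applied to the example string 4.0.6/1.8/v0.1.8/0.3.17, the result maps pangolin to 4.0.6, pangolin-data to 1.8, constellations to v0.1.8, and scorpio to 0.3.17.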

There are two filters in the “Refine Results” panel that are specific to SARS-CoV-2 data:

Pango lineage - lets you filter for sequences assigned to a particular Pango lineage. Pango lineages are determined by Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages). All SARS-CoV-2 GenBank records are reprocessed nightly by the Pangolin pipeline using the UShER pipeline. The field is empty if the sequence is unclassifiable or if it was released after that day’s UShER run. You can view the Pango version by downloading the results in CSV format; the version strings appear in the Pango Versions column. Each string lists the versions of the following components: pangolin/pangolin-data/constellations/scorpio. For example, 4.0.6/1.8/v0.1.8/0.3.17.

Random sampling - lets you filter for sequences that were collected randomly for the purpose of baseline surveillance. For example, this filter can be helpful if you would like to know which lineages are increasing in frequency, or are looking for a rough estimate of the infection rate in geographical regions where that data isn’t available yet. Random sampling (e.g., sampling that is not targeted at vaccine breakthrough or localized outbreak investigation) improves these estimates.

NCBI Virus scans SARS-CoV-2 GenBank records and any linked BioSample records. If either of the following field/value pairs is found, the sequence is included in our “random sampling” filter:

GenBank: KEYWORDS - purposeofsampling:baselinesurveillance
BioSample: purpose of sequencing - Baseline surveillance
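
The check for these two field/value pairs might look like the following Python sketch. The input shapes (a list of GenBank keywords and a dict of BioSample attributes) and the function names are assumptions, as is the decision to compare values case-insensitively and without whitespace; the text above does not say how strictly NCBI Virus matches them.

```python
def _normalize(value):
    """Lowercase and drop whitespace so "Baseline surveillance" matches "baselinesurveillance"."""
    return "".join(value.lower().split())


def is_random_sampling(genbank_keywords, biosample_attributes):
    """Check the two field/value pairs listed above.

    genbank_keywords: list of strings from the GenBank KEYWORDS field
    biosample_attributes: dict of attribute name -> value from the linked BioSample
    Both input shapes are assumptions made for this sketch.
    """
    for keyword in genbank_keywords or []:
        if _normalize(keyword) == "purposeofsampling:baselinesurveillance":
            return True
    for name, value in (biosample_attributes or {}).items():
        if (_normalize(name) == "purposeofsequencing"
                and _normalize(value) == "baselinesurveillance"):
            return True
    return False
```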

Select Include - to include all randomly sampled SARS-CoV-2 sequences in the Results Table.

Select Exclude - to exclude all randomly sampled SARS-CoV-2 sequences from the Results Table.

Select Only - to view only randomly sampled SARS-CoV-2 sequences.

For descriptions of the other filters, please refer to the section “Refine tabular BLAST results via filters”; the functionality is the same.

By clicking the “SARS-CoV-2 interactive dashboard” link on the announcement banner on the NCBI home page, you can access geographic and time distribution graphs. You can also access the dashboard through the SARS-CoV-2 data hub.

Where can I find SARS-CoV-2 lineage-related information?

You can explore lineage geo-temporal and mutation data using the interactive SARS-CoV-2 Variants Overview dashboard, which can be accessed through the announcement banner on the NCBI home page.
Learn more in the SARS-CoV-2 Variants Overview help center.

View and download specific virus sequence sets

Find specific data sets

Option 1:
From the Find data tab of the navigation menu, select the desired group of viruses: All viruses, Human viruses, Bacteriophages, New sequences (past one month), or Available SARS-CoV-2 sequences, to view preselected data sets.
Bacteriophages include virus groups with the following NCBI Taxonomy IDs: 10472, 10656, 10659, 10841, 10860, 10877, 11989, 28883, 1714270, 12333, 79205, 2136181
You can also access selected virus groups through the “Popular Searchers” panel located on the Results Table. The following virus groups can be accessed:
Influenza virus - provides access to data for the following genera: Alphainfluenzavirus, Betainfluenzavirus, Gammainfluenzavirus and Deltainfluenzavirus. Capital letters A, B, C and D in brackets indicate the predominant species in each genus.
Rotavirus
Dengue virus
West Nile virus
Zika virus
MERS coronavirus
Ebolavirus
SARS-CoV-2 coronavirus

Option 2:
Click the Search by sequence button in the central part of the NCBI Virus home page.
Select the desired popular virus search group button beneath the text box.

Both options will open the tabular display with information about viruses from the selected group.
Learn more about how to compare results in the tabular display, build a multiple sequence alignment of selected results, build a phylogenetic tree of selected results, or refine the Results Table via filters.

Option 3:
Use NCBI Visual Data Dashboard to explore, view and download the massive, normalized datasets. Learn more.

Download sequences

To download sequences in a variety of formats (FASTA, accession list, the Results Table as CSV or XML), choose Nucleotide, Protein, or RefSeq Genomes tab and optionally select individual sequences to download.

You can also specify if you want to download a randomized or stratified randomized sequence set.

Download a randomized sequence set

Disclaimers
Please note that our current platform cannot generate repeatable randomized searches. We realize the importance of repeatability in the scientific community and are working diligently to include this feature in upcoming updates.
Downloading randomized subsets in either FASTA format or accession list is currently available for nucleotide, protein, and assembly records. We are working to make them available for coding region records in the future.

A randomized subset of sequences (also referred to as ‘downsampling’) lets a user work with a smaller subset of sequences selected at random from a larger dataset, as an approximation of the full dataset.

A smaller, representative sequence set can make downstream analysis faster and less computationally intensive while still allowing interpretation of the larger collection. When downloading a randomized subset, the file name will include the date of download and the randomization seed used.
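The idea of downsampling can also be illustrated locally. The following is a minimal sketch, assuming you already hold a list of accessions in memory; the function name `downsample`, the placeholder accessions, and the seed value are hypothetical, and this is not how NCBI Virus itself performs the randomization.

```python
import random

def downsample(accessions, n, seed):
    """Return up to n accessions chosen uniformly at random.

    Using a dedicated Random instance with a fixed seed makes the
    local selection repeatable for the same input order.
    """
    rng = random.Random(seed)
    items = list(accessions)
    if n >= len(items):
        return items
    return rng.sample(items, n)

# Hypothetical placeholder accessions, for illustration only.
accessions = [f"ACC{i:05d}" for i in range(10000)]
subset = downsample(accessions, 2000, seed=42)
print(len(subset))
```

Recording the seed alongside the subset, as the download file name does, is what allows a locally drawn sample like this one to be regenerated later.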
Filters can be applied prior to downsampling as described here. After you click the Download button, a menu lets you select the download format; the second step then offers an option to download a randomized subset of all the records in your filtered dataset. Randomized sequence sets can be downloaded in a variety of formats (FASTA, accession list, or the Results Table in CSV or XML format).

Before opening the “Download” menu, please make sure to select the tab above the Results Table that corresponds to the data type you want to download. The tab determines the formats available for randomized data:
Nucleotide tab - FASTA Nucleotide, Nucleotide Accession List, XML, and CSV formats only.
Protein tab - FASTA Protein, Protein Accession List, XML, and CSV formats only.
RefSeq Genomes tab - Assembly Accession List, XML, and CSV formats only.

Download a stratified randomized sequence set

Randomized subsets of sequences can be stratified, meaning equally distributed over the categories of a field (also referred to as ‘stratified downsampling’). This lets a user work with a subset that contains equal numbers of sequences from each value of a selected category, as an approximation of the larger sequence collection. The fields currently available for stratification are Country and Host.

Before opening the “Download” menu, please make sure to select the tab above the Results Table that corresponds to the data type you want to download. The tab determines the formats available for randomized data:
Nucleotide tab - FASTA Nucleotide, Nucleotide Accession List, XML, and CSV formats only.
Protein tab - FASTA Protein, Protein Accession List, XML, and CSV formats only.
RefSeq Genomes tab - Assembly Accession List, XML, and CSV formats only.
When downloading a stratified randomized subset, the file name will include the date of download and the randomization seed used.
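Stratified downsampling can be sketched in the same spirit. The example below assumes records are plain dictionaries with a `host` key; the function name `stratified_downsample`, the record layout, and the sample data are all hypothetical and are not taken from the NCBI Virus implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_downsample(records, field, per_value, seed):
    """Pick up to per_value records at random for each value of field."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[field]].append(rec)
    subset = []
    # Iterate over values in sorted order so the result is repeatable.
    for value in sorted(groups):
        members = groups[value]
        k = min(per_value, len(members))
        subset.extend(rng.sample(members, k))
    return subset

# Made-up records, for illustration only.
hosts = ["Homo sapiens"] * 50 + ["Sus scrofa"] * 30 + ["Gallus gallus"] * 5
records = [{"accession": f"ACC{i:04d}", "host": h} for i, h in enumerate(hosts)]

subset = stratified_downsample(records, "host", per_value=20, seed=7)
counts = Counter(rec["host"] for rec in subset)
print(counts)
```

A value with fewer records than the requested number simply contributes all of its records, which is one reason the counts per value are only roughly equal.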
Step-by-step instructions for downloading sequences
Click the Download button on the upper left side of the NCBI Virus Results Table page.
This will open the download menu consisting of 3 steps.
Step 1: Select Data Type.
Nucleotide, protein, or coding region sequence (CDS) in FASTA format. Please note that randomized subsets are currently not available for coding region sequence (CDS) FASTA files.
Accession list for nucleotide, protein, or assembly records. Please note that randomized subsets are currently not available for coding region sequence (CDS) accession lists.
Results Table – the contents of the Results Table, including the metadata, in CSV format (comma separated values table format) or in XML format.
Step 2: Select Records.

Select which records you would like to download:
only selected records (those selected using the checkboxes in the Results Table),
all records in the Results Table,
a randomized subset of up to 2,000 records in the Results Table (for Nucleotide FASTA, Protein FASTA, Nucleotide Accession List, Protein Accession List, Assembly Accession List, CSV, and XML formats only).
Randomized subsets contain a limited number of sequences randomly selected from all the available sequences in the Results Table. As an option, you can choose to stratify your subset by a field (up to 20 records per country or per host), meaning that a roughly equal number of sequences will be randomly selected for each value of that field.
To use the options for randomized subsets, select ‘Download a randomized subset of records (up to 2,000)’ and then select either a fully randomized subset or a stratified subset. Enter the number of randomly sorted records that you want to download (up to 2,000 for a randomized subset and up to 20 records per value for a stratified subset) into the input box, and select the category that you want to stratify across from the dropdown.
The fields currently available for stratification are Country and Host.
Click “Next” and follow the prompts on the 3rd step in the menu to begin your download.


Step 3.
If in Step 1 you selected Sequence Data (FASTA format), in Step 3 you can select the FASTA definition line (defline) for the sequences that you are going to download.
If nucleotide or protein sequence data were selected in Step 1, the default FASTA definition line is presented in the format (accession) | (GenBank title) and includes the GenBank sequence accession number and GenBank title:

AAO17794 |VP4 spike protein[Human rotavirus A].
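For downstream scripts, a default defline of this shape can be split at the first vertical bar. This is a minimal sketch that assumes the (accession) | (GenBank title) layout shown above; the function name `parse_defline` is hypothetical, and custom sequence titles built in Step 3 would need different handling.

```python
def parse_defline(line):
    """Split a default defline into (accession, title).

    A leading '>' (as found in FASTA files) is removed first, then the
    text is split at the first '|' and both parts are trimmed.
    """
    text = line.lstrip(">").strip()
    accession, _, title = text.partition("|")
    return accession.strip(), title.strip()

accession, title = parse_defline("AAO17794 |VP4 spike protein[Human rotavirus A].")
print(accession)
print(title)
```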

If the coding region option was selected, the default definition line format is (nucleotide accession)(cds coordinates)| (GenBank title) and includes the related GenBank nucleotide sequence accession number, the indication that this is a coding region (cds), the related GenBank protein accession number, and the related protein GenBank title:

NC_045425.1:319…1659 |replication endonuclease [Thermus phage phiOH3].

You can change this default defline to fit your own needs by selecting the Build custom sequence title option. Here you can select the following options (columns):

SRA accession
Release date
Random Sampling
Molecule type
Sequence type
Nucleotide Completeness
Geo Location
Host isolation source
Collection date

You can view a description of each option in the description of the Results Table columns.
If in Step 1 you selected the Accession list, you can download nucleotide, protein, and RefSeq genome assembly accession numbers with or without the version number. For example: NC_045512 (without version) or NC_045512.2 (with version).
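If a list was downloaded with version numbers and you later need the unversioned form, the suffix can be removed locally. A minimal sketch, assuming the version is the numeric part after the last dot; the function name `strip_version` is hypothetical.

```python
def strip_version(accession):
    """Return the accession without a trailing '.<digits>' version suffix."""
    base, dot, version = accession.rpartition(".")
    if dot and version.isdigit():
        return base
    return accession

print(strip_version("NC_045512.2"))
print(strip_version("NC_045512"))
```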

If in Step 1 you selected the Results Table in CSV format, the downloaded results will contain the data for all selected columns. You can modify the selection and choose the columns you need in Step 3: Select columns to include in results set. You can also choose whether to include accession numbers with or without the version number.
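A downloaded CSV file can be trimmed further after the fact with standard tooling. The sketch below uses Python's `csv` module on an in-memory string; the column names (`Accession`, `Host`, `Country`) and the rows are made up for illustration and may not match the headers in an actual download.

```python
import csv
import io

def select_columns(csv_text, columns):
    """Return one dict per row containing only the requested columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in columns} for row in reader]

# Made-up rows with hypothetical column names, for illustration only.
sample = (
    "Accession,Host,Country\n"
    "ACC00001.1,Homo sapiens,USA\n"
    "ACC00002.1,Sus scrofa,Brazil\n"
)
rows = select_columns(sample, ["Accession", "Host"])
for row in rows:
    print(row)
```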

NCBI Visual Data Dashboards

NCBI Virus visual data dashboards support data exploration and discovery across our normalized datasets. They can be used to identify trends in data and to select specific subsets based on those trends.
The visual dashboards in NCBI Virus comprise:

  1. Dashboard located on the NCBI Virus Home page, which provides virus sequence statistics, a Virus Taxonomy Sunburst Chart, and a Host Distribution Bar Chart.
  2. Dashboard “Visual Filters for GenBank Sequences”, which displays data for specific viral taxa and includes Sequence Type links with calculated virus sequence statistics, a Geographic Distribution choropleth that shows the geographic distribution of sequence records based on collection locations, and time sliders for Collection and Release Date that dynamically show the number of sequences for each time interval.

1: Home Page Dashboard

Access sequence data via buttons located in the top row for the following statistics:
RefSeq Nucleotides - all viral nucleotide reference sequences available at NCBI (find more about reference sequences here).
All Proteins - all NCBI viral protein sequences, including RefSeq proteins.
All Nucleotides – all viral nucleotide records available at NCBI, including RefSeqs.
RefSeq Proteins - all viral protein reference sequences available at NCBI.
Complete Nucleotides – all NCBI viral nucleotide sequences, where GenBank ASN.1 format contains the following descriptors: descr/molinfo/completeness=complete or there is a word ‘complete’ present in the record’s definition line (defline). It also includes complete reference records (RefSeqs).
Clicking on each button will show a results table with the corresponding sequences. Those results can be further refined by using filters for various sequence attributes (metadata) located on the left side of the Results Table page (learn more here).
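The Complete Nucleotides criterion described above can be read as a simple either/or rule. The following is a rough sketch of that rule as a predicate, assuming the completeness value and the defline have already been extracted into plain strings; the function name `is_complete` and the whole-word, case-insensitive match are assumptions, not NCBI's implementation.

```python
import re

def is_complete(completeness, defline):
    """True if the completeness descriptor is 'complete' or the
    defline contains the word 'complete' (whole word, any case)."""
    if completeness == "complete":
        return True
    return re.search(r"\bcomplete\b", defline, re.IGNORECASE) is not None

print(is_complete("complete", "example title"))
print(is_complete("", "Zika virus, complete genome"))
print(is_complete("partial", "polymerase gene, partial cds"))
```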

Explore virus taxonomy hierarchy using sunburst chart

Virus taxonomy can be explored via an interactive sunburst chart. The default view represents the classification for all available NCBI viral taxa. The inner layer (ring) represents four non-taxonomic groups of viruses: RNA viruses, DNA viruses, DNA/RNA viruses (which includes reverse-transcribing viruses), and Unclassified viruses. Only 4 levels of the whole hierarchy are visible on the plot at a given time.
To explore virus taxonomy, click on any slice (section) of any layer on the sunburst chart. This will trigger the plot to zoom into the selected taxon and display any additional taxa below the selection. Each viral taxon name is displayed on the corresponding slice or can be viewed in the hover-over tooltip by placing your cursor over the slice. Dynamic breadcrumbs with viral taxa names are located above the sunburst plot. Breadcrumbs also serve as a secondary navigation system: they show the location of the taxon in the hierarchy, and clicking on one will refocus the plot on the selected taxon. You can also see breadcrumbs by hovering over any slice in the sunburst. Clicking on the center of the sunburst chart will return you to the parent taxon.

Select a specific virus taxonomy group and view statistics for specific sequence sets with quick links to download them

After selecting a specific taxonomy group on the sunburst chart, you can view and explore the updated statistics in the top row of the dashboard.

Select a host term from the Host Distribution bar chart and see the distribution of that host among the various viral taxa

The interactive Host Distribution chart shows the distribution of virus host species. Each host bar is proportional to the number of virus sequences isolated from this host. The total number of virus sequences for each bar can be viewed by hovering over the bar.
To select a host species, click on a bar or on the corresponding host name. This will highlight the selected host, as well as all virus taxonomy groups containing sequences isolated from that host. Only one host can be selected at a time. Clicking on the selected host a second time will de-select it, or you can use the Reset option available in the top right corner of the host chart. The statistics in the top row of the dashboard will be updated based on the selected host.
You can search for a host species by scrolling the scrollbar on the Host Distribution chart, or by using the keyboard combination “CTRL+F”.

You can reset the Host Distribution chart to the original view by pressing the “Reset” button in the upper right corner of the chart.

Explore viral taxonomy hierarchy within a given taxon highlighted by the host selection

By clicking on a highlighted taxonomy group, you can further explore the viral taxonomy hierarchy on the sunburst chart. The lower layers that include taxa with sequences from the selected host will be highlighted. As you zoom in, taxa that do not include sequences from the selected host will not be highlighted.

2: “Visual Filters for GenBank Sequences” Dashboard
“Visual Filters for GenBank Sequences” is a dashboard which enables filtering of your virus search results based on important attributes, like geographic location, collection, and release date, using visualized, graphical filters.
How to access “Visual Filters for GenBank Sequences”?
There are several ways to access Visual Filters for GenBank Sequences.

  1. From NCBI Virus home page follow the steps below:
    Select ‘Search by Virus’.
    Type virus name, then select an option from the autocomplete list.
    View the results table for your virus of interest.
    Find a tab named “Visual Filters for GenBank Sequences” above the results table.
    Click on the tab “Visual Filters for GenBank Sequences” to switch to visual filtering.

  2. From the Results Table page access the “Visual Filters for GenBank Sequences” tab in the header above the results table.
    Please note, if any filters were applied on the results table, switching to the “Visual Filters for GenBank Sequences” dashboard will reset all the filters except for the virus name.

3. By adding NCBI Virus “taxid” number directly to the page URL:

For example, for Zika virus (taxid=64320), enter the following URL:

How to use Visual Filters for GenBank Sequences?

Visual filters allow you to filter your search by geographic location, collection time, and release time. Each filtering feature on the dashboard is interactive and connected to the others, so when a filter is applied in one feature, it is also reflected in the other features. When using these filters, the top summary section is automatically updated to reflect the number of records in the NCBI RefSeq, Nucleotide, and Protein sets in the NCBI Virus database that fit the combined conditions of your search.
The Geographic Distribution choropleth map lets you select sequence records by the location where they were collected.
Click on a geographic location to filter sequences by collection location.
The map allows you to select multiple international locations or multiple locations in the USA. The selections will reset if you change between the International and USA maps.
To select a single location, start typing the name of the region and select it from the dropdown list.

Please note that color shades on the map are based on nucleotide record numbers for the virus: darker shades correspond to higher numbers, and lighter shades to lower numbers.
By using the Collection Time and Release Time sliders, you can view a histogram of the distribution of nucleotide record numbers over different time intervals.
Use the sliders or click the date columns to select records by sample collection date or GenBank release date. Weekly, monthly, and yearly time intervals can be selected.
Collection Time graph:
Select the collection date range of the samples by either selecting one time interval bar or dragging the ends of the sliders.
The slider displays data from the earliest collection year for this virus to the current year.
If the collection time for a record is incomplete, it is collapsed as follows: if the record only has a year, it is shown as Jan 1 of that year; if the record only has a year and month, it is shown on the first day of that month.
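The collapsing rule for incomplete collection dates can be expressed in a few lines. This is a minimal sketch, assuming dates arrive as `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` strings; both that input format and the function name `collapse_collection_date` are assumptions made for illustration.

```python
from datetime import date

def collapse_collection_date(value):
    """Map a possibly incomplete date string to a full calendar date.

    A missing month defaults to January and a missing day defaults to
    the first of the month, mirroring the rule described above.
    """
    parts = [int(p) for p in value.split("-")]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1
    day = parts[2] if len(parts) > 2 else 1
    return date(year, month, day)

print(collapse_collection_date("2020"))
print(collapse_collection_date("2020-03"))
print(collapse_collection_date("2020-03-15"))
```

One consequence of this rule is that records with year-only dates accumulate on January 1, so a spike in the first interval of a year in a weekly or monthly view may reflect incomplete dates rather than a burst of sampling.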
Release Time graph:
Select the release date range of the samples by either selecting one bar or dragging the ends of the sliders.
The slider displays data from the year in which data for this virus were first released to the current year.
You can also select different bi-yearly intervals, which will show you the portion of the graph for that time frame. However, you still have to click on a bar or select the time interval with the sliders to apply filtering.
The top header of the Dashboard includes a link back to the Results Table page where you can review your results in tabular format, apply more filters, and download FASTA sequences, an accession list, or the table itself.

Note that all filters applied in the graphical view will remain in effect on the Results Table page. However, if you switch from the Results Table page back to the visual filters, all applied filters will be lost, except for the selected virus name.

How to find, view and download HIV-1 sequences and related metadata?

Public HIV-1 nucleotide and protein sequence data are displayed in the HIV-1 data hub.
The HIV-1 data hub can be accessed by typing and selecting HIV-1 in the Search by virus name or taxonomy input form.
Alternatively, it can be accessed from the NCBI home page by typing HIV-1 in the search window. This will open another page with HIV-1 virus genome assembly information. Press the NCBI Virus button to access the HIV-1 data hub.
These are early days for HIV-1 data support in NCBI Virus. Please stay tuned for updates and further details relevant to HIV-1.



