Tag: Sequences

Monkeypox virus: Complete genome from the current outbreak now available in GenBank

The first complete genome sequence of the current monkeypox virus (MPXV) outbreak (isolate name MPXV_USA_2022_MA001) is now available with accession ON563414 in GenBank, a public database of DNA sequences hosted by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM).

Several cases of monkeypox have been identified in geographically widespread countries. Monkeypox is classified as a zoonotic disease, in which transmission of the virus is usually due to animal-human contact. Genetically, monkeypox viruses cluster into two groups: the Congo Basin clade and the West African clade. This outbreak has been attributed to a virus from the West African clade, which is often associated with milder disease; in this case, human-to-human spread is suspected.
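The record announced above (accession ON563414) can be retrieved programmatically through E-utilities. The following is a minimal sketch, assuming the standard EFetch endpoint and the `nuccore` database name; the helper names `efetch_url` and `fetch_record` are illustrative and not part of any NCBI library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# EFetch endpoint of NCBI's E-utilities.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build an EFetch URL for a single nucleotide record (hypothetical helper)."""
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,      # accession from the announcement
        "rettype": rettype,   # "fasta" for sequence, "gb" for the GenBank flatfile
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_record(accession, rettype="fasta"):
    """Download the record as text. Requires network access."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


url = efetch_url("ON563414")
print(url)
```

Calling `fetch_record("ON563414")` would download the FASTA text; the sketch only prints the URL so that it runs without a network connection.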

RefSeq release 209 is available

RefSeq release 209 is now available online, from the FTP site, and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of November 1, 2021, and contains 296,293,486 records, including 215,655,378 proteins, 41,751,205 RNAs, and sequences from 114,396 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

NCBI will assign 64-bit numeric GIs by November 15th. Update affected software!

As announced last month, NCBI will begin assigning larger (64-bit) numeric ‘GIs’ to the remaining sequence types that still receive these identifiers. This change is expected as soon as Nov. 15th, 2021, but could occur earlier if data submission volumes are unexpectedly high. This is a reminder that all organizations and developers using our products should review their software for any remaining reliance on GIs and for compatibility with these larger identifiers.

How do you know if your software or organization may be impacted?

If you have built custom software that interfaces with NCBI data and consumes a sequence database UID (i.e., a GI), processes the GI from an ASN.1 or XML product, or processes the GI from any tabular product on FTP, you should review all code to ensure that the new, longer, 64-bit GIs will be handled properly. To ensure a smooth transition and the best overall experience, please update to the latest versions of NCBI-provided programmatic and command-line tools. Alternatively, you could update your code to use accession.version identifiers instead of GIs.
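As one way to picture the kind of review described above, the sketch below shows how a GI just past the signed 32-bit limit (2,147,483,647) fails to serialize into a 32-bit field but fits a 64-bit one. It is an illustration only: `is_32bit_gi`, `pack_gi`, and the example GI value are hypothetical and do not come from any NCBI tool.

```python
import struct

# Largest value a signed 32-bit integer can hold: 2,147,483,647.
INT32_MAX = 2**31 - 1


def is_32bit_gi(gi):
    """Return True if a positive GI still fits in a signed 32-bit integer."""
    return 0 < gi <= INT32_MAX


def pack_gi(gi):
    """Serialize a GI as a signed 64-bit little-endian integer."""
    return struct.pack("<q", gi)


# Hypothetical GI one past the 32-bit limit.
large_gi = INT32_MAX + 1

# Code that stores GIs in a 32-bit field fails on the larger value.
try:
    struct.pack("<i", large_gi)
    overflowed = False
except struct.error:
    overflowed = True

# The 64-bit encoding round-trips without loss.
round_trip = struct.unpack("<q", pack_gi(large_gi))[0]
print(overflowed, round_trip == large_gi)
```

The same concern applies to database column types, binary file formats, and fixed-width integer types in compiled languages, which is where a silent overflow is more likely than in Python itself.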

NCBI is here to help the community as we make this change. Stay tuned here or follow NCBI on Twitter, where we will share updates and additional information, such as a final confirmation of the projected cutover date.

Please contact [email protected] with any questions about this change or to determine if any software you are using is affected.

NCBI’s GI sequence identifiers will soon exceed 32-bit numbers. Are you and your software ready?

In 2016, NCBI announced that it was curtailing its display of its numeric ‘GI’ in popular sequence data formats such as FASTA and GenBank flatfiles. Due to the continued growth of GenBank, NCBI will soon begin assigning GIs exceeding the signed 32-bit threshold of 2,147,483,647 for those remaining sequence types that still receive these identifiers.

NCBI has updated products including the Entrez system, GenBank (Nucleotide), BLAST™, and the C++ Toolkit to prepare for that moment by upgrading GI-related code and APIs to accept 64-bit integers. This changeover is projected for late 2021. Stay tuned for additional communications from NCBI, and take note of the following information if you think you may be impacted.

For a seamless transition, all organizations and developers using our products should review their software for any remaining reliance on GIs and for compatibility with these larger identifiers. Note that this update requires no changes to submission procedures or assignment of accessions.

RefSeq release 201 is public

RefSeq release 201 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of July 6, 2020, and contains 246,016,651 records, including 178,304,046 proteins, 32,462,009 RNAs, and sequences from 103,293 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Updated human genome Annotation Release 109.20200522
Updated Annotation Release 109.20200522 is an update of NCBI Homo sapiens Annotation Release 109. The annotation report for 109.20200522 is available here. The annotation products are available in the sequence databases and on the FTP site.

Updated mouse genome Annotation Release 108.20200622
Updated Annotation Release 108.20200622 is an update of NCBI Mus musculus Annotation Release 108. The annotation report for 108.20200622 is available here. The annotation products are available in the sequence databases and on the FTP site.

This update precedes the expected release of a full assembly update for the mouse GRCm38.p6 reference assembly by the GRC in 2020. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly later this year, for either RefSeq FTP Release 202 or 203.

Download high-quality graphics from the NCBI Multiple Sequence Alignment Viewer (MSAV)

You can now download publication-quality graphic images of the alignment displayed in the NCBI Multiple Sequence Alignment Viewer (Figure 1). Load sequence alignments into the viewer from BLAST or COBALT results or upload alignment files directly. Once you have the alignment set in the viewer, choose the “Printer-friendly PDF/SVG” option in the Download menu on the toolbar to save the image. The PDF and SVG files contain vector graphics suitable for presentation and publication.

Figure 1. The image download options in the MSAV. You can adjust the desired coordinate range and choose to download a PDF or SVG image. You can also preview the PDF download. Choose simplified color shading to improve compatibility with some graphics programs.

The downloaded image will show the coordinate range you requested and will include all the rows in the alignment.

Please contact us through the Feedback link on the MSA Viewer or write to the NCBI Help Desk to provide feedback and let us know how we can make the NCBI Multiple Sequence Alignment Viewer work better for you.

RefSeq release 200 is public

RefSeq release 200 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of May 4, 2020, and contains 237,381,664 records, including 171,643,729 proteins, 31,244,247 RNAs, and sequences from 100,605 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Other announcements:

The number of organisms in RefSeq crosses 100,000!
The current RefSeq release contains 100,605 distinct species or taxa, with a net increase of 763 species since Release 99. This milestone coincides with the 100th release, though the current release number is 200 (see below). Note that there is a decrease in the number of species for prokaryotes (bacteria and archaea) due to a clean-up that mainly removed unclassified bacteria and assemblies from Metagenome-Assembled Genomes (MAGs).

The FTP release number has skipped to 200
As previously announced, NCBI’s Reference Sequence (RefSeq) FTP release number has incremented to 200 for this release, skipping over the numbers 100-199. The previous release, in March 2020, was release 99. This change avoids overlap with the release numbers of the independently numbered RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109 (for example, Mus musculus Annotation Release 108).

NCBI Protein Families
A new release of the NCBI protein families profiles used by PGAP (the Prokaryotic Genome Annotation Pipeline) is now available. You can search this collection of Hidden Markov models (HMM) against your favorite prokaryotic proteins to identify their function using hmmer.

Recalculation of Prokaryotic Reference and Representative Genome Assemblies
We have updated the collection of reference and representative assemblies for Bacteria and Archaea to better reflect the taxonomic breadth of the prokaryotes in RefSeq. We have selected one reference or representative assembly for every species based on several criteria including contiguity, completeness, and whether the assembly is from type material.

Future change: Mouse Reference Assembly Update
A full assembly update for the mouse GRCm38.p6 reference assembly is expected to be released in 2020 by the GRC. We anticipate updating the mouse RefSeq annotation to the new GRCm39 assembly this summer, for either RefSeq FTP Release 201 or 202.

 

The next RefSeq FTP release number will skip to 200

NCBI’s Reference Sequence (RefSeq) FTP release numbers will increment to 200 for the next release, skipping over the numbers 100-199. The current release, from March 2020, is release 99. The next bi-monthly release, in May 2020, will be release 200. This change avoids overlap with the release numbers of the completely independent RefSeq annotation releases for the eukaryotic genomes we annotate, which are currently in the range 100-109 (for example, Mus musculus Annotation Release 108).

RefSeq Release 99 is public

RefSeq release 99 is accessible online, via FTP and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 2, 2020, and contains 231,402,293 records, including 167,278,920 proteins, 29,869,155 RNAs, and sequences from 99,842 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.


Improving the Display of Type Material in the NCBI TaxBrowser

Have you ever been confused by multiple taxonomic names for a single organism? You’re not alone! It’s one of the challenges in maintaining any biological database. Recently we updated the NCBI TaxBrowser to assist with this.

Let’s start with a brief word about how investigators name species in the first place. For any new species, the reporting author declares a “type.” They then deposit a specimen, or “type material,” in a publicly available biorepository. This type material is tied to the new species name and serves as a reference for future comparisons. Researchers can then use DNA sequences obtained from type material to identify other samples from the same species. NCBI currently uses such an approach to verify the taxonomic assignment of prokaryotic genomes.

Our Taxonomy group has been curating type material records in the Taxonomy database since 2013 using a common vocabulary accepted by our international partners (the INSDC). For example, the Entrez query “type material[prop]” in the Taxonomy database will return all type material at NCBI.
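The same query can also be issued through E-utilities rather than the web interface. The sketch below assumes the standard ESearch endpoint and the `taxonomy` database name; `esearch_url` and the `retmax` default of 20 are illustrative choices, not an NCBI-provided helper.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI's E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez query (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes the space and square brackets in the query term.
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url("taxonomy", "type material[prop]")
print(url)
```

Requesting the printed URL returns the matching Taxonomy UIDs; the sketch stops at building the URL so that it runs without network access.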

So what are the improvements to the TaxBrowser?
