Tag: Hidden Markov Models (HMM)

New PGAP release: Structural and functional annotation improvements

A new version of the Prokaryotic Genome Annotation Pipeline (PGAP) is available on GitHub. With this release, you can expect:

  • Incremental improvements in structural annotation, driven by increased weight of GeneMarkS2+ ab initio models at loci with only weak evidence, such as low-identity and low-coverage protein alignments or partial HMM signatures.
  • Better structural annotation and more specific functional annotation, resulting from the incorporation of Pfam 34 and extensive curation of HMMs, BlastRules, and Conserved Domain architectures by NCBI experts.
  • Fewer overly stringent calls by the taxonomy verification module for several species, including the human pathogens Listeria monocytogenes, Campylobacter lari, and Vibrio vulnificus. This is a result of manual review and adjustment of the minimum percent identity thresholds used by the Average Nucleotide Identity tool.
  • Multiple bug fixes. Notably, users of Azure Debian 10 machines can now run PGAP successfully, as we have incorporated GeneMarkS2+ compiled under Linux kernel 3 into the PGAP image.

Please try this release and send us your feedback!

New models added to the NCBI Hidden Markov models (HMM) collection with release 7.0

Release 7.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
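As a rough illustration of the search described above, the sketch below assembles an `hmmsearch` command line from Python. The file names (`ncbi_hmms.LIB`, `my_proteins.faa`, `hits.tbl`) and the E-value cutoff are placeholders chosen for this example, not values taken from the release; it assumes HMMER 3 is installed and that the downloaded collection is a single file of profile HMMs.

```python
import shlex
import subprocess


def build_hmmsearch_cmd(hmm_lib, proteins_fasta, tblout, evalue=1e-5, cpus=2):
    """Build the argument list for one hmmsearch run (nothing is executed here)."""
    return [
        "hmmsearch",
        "--tblout", tblout,      # per-sequence hits as a whitespace-delimited table
        "-E", str(evalue),       # reporting threshold for per-sequence E-values
        "--cpu", str(cpus),      # number of worker threads
        hmm_lib,                 # query: file holding one or more profile HMMs
        proteins_fasta,          # target: protein sequences in FASTA format
    ]


def run_hmmsearch(cmd):
    """Run the command; raises CalledProcessError if hmmsearch exits non-zero."""
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)


# Hypothetical file names, for illustration only.
cmd = build_hmmsearch_cmd("ncbi_hmms.LIB", "my_proteins.faa", "hits.tbl")
print(shlex.join(cmd))
```

Calling `run_hmmsearch(cmd)` would then write the hit table to `hits.tbl`. The cutoff of 1e-5 is an arbitrary starting point; a stricter or looser threshold may suit your data better.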

Figure 1. Recently added HMM-based Protein Family Model for the histidine-histamine antiporter family (NF040512), with GO terms (framed in red).

RefSeq release 208 is available!

RefSeq release 208 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of September 7, 2021, and contains 288,903,207 records, including 210,703,648 proteins, 40,213,945 RNAs, and sequences from 113,002 organisms. The release is provided in several directories, both as a complete dataset and divided into logical groupings.
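For readers who want to reach RefSeq records through E-utilities, here is a minimal sketch that builds an EFetch request URL without sending it. The accessions `ACC_1` and `ACC_2` are placeholders, not real records, and the helper name `efetch_url` is made up for this example.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Return an EFetch URL for the given Entrez database and record IDs."""
    params = {
        "db": db,               # Entrez database name, e.g. "protein"
        "id": ",".join(ids),    # comma-separated accessions or UIDs
        "rettype": rettype,     # record format requested
        "retmode": retmode,     # plain text rather than XML
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Placeholder accessions, for illustration only.
url = efetch_url("protein", ["ACC_1", "ACC_2"])
print(url)
```

The URL can be opened with any HTTP client. For anything beyond a handful of requests, check NCBI's usage guidelines on request rates and API keys first.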

Search the NCBI Hidden Markov models collection against your favorite prokaryotic proteins

The NCBI Hidden Markov models (HMM) 6.0 release, available on our FTP site, has 15,247 models supported at NCBI. We created 80 new HMMs and consolidated the collection by removing 2,151 HMMs that were nearly identical to another model. Release 6.0 also incorporates 12,656 Pfam models from release 34 that apply to prokaryotic proteins. You can use the HMMER sequence analysis package to search the collection against your favorite prokaryotic proteins to identify their function. We have also added more specific names or associated EC numbers, gene symbols, and publications to over 500 HMMs.
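Once a HMMER search has produced a per-sequence table (the `--tblout` output), one simple way to suggest a function for each protein is to keep its highest-scoring model. The sketch below does that on a small made-up table; the protein names, model names, accessions, and scores are invented for illustration, and picking the top bit score is a simplification rather than the rule PGAP itself applies.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "protein model_name model_acc evalue score")

# Made-up rows laid out like HMMER's --tblout table (18 columns + description).
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias ...
prot_A - model_X MX0001.1 1e-50 170.2 0.1 2e-50 169.0 0.1 1.0 1 0 0 1 1 1 1 made-up protein A
prot_A - model_Y MX0002.1 1e-80 260.5 0.2 2e-80 259.0 0.2 1.0 1 0 0 1 1 1 1 made-up protein A
prot_B - model_X MX0001.1 3e-12 45.3 0.0 5e-12 44.6 0.0 1.1 1 0 0 1 1 1 1 made-up protein B
"""


def parse_tblout(text):
    """Parse tblout text into Hit records, skipping comments and blank lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Split on whitespace; the last field keeps the free-text description.
        f = line.split(maxsplit=18)
        hits.append(Hit(f[0], f[2], f[3], float(f[4]), float(f[5])))
    return hits


def best_hit_per_protein(hits):
    """Keep, for each protein, the hit with the highest full-sequence bit score."""
    best = {}
    for h in hits:
        if h.protein not in best or h.score > best[h.protein].score:
            best[h.protein] = h
    return best


hits = parse_tblout(SAMPLE)
best = best_hit_per_protein(hits)
for protein, hit in sorted(best.items()):
    print(protein, hit.model_name, hit.model_acc, hit.score)
```

In a real run you would read the table from the file written by `hmmsearch` instead of `SAMPLE`, then look up each winning model's name, EC number, or gene symbol in the collection's metadata.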

Gene Ontology (GO) term attributes are now available for 20% of HMMs (see Figure 1 below). We added most of these based on existing mappings, but our experts are working on creating more associations. In the fall, we’ll start propagating GO terms from HMMs to annotated genomes and proteins!

Figure 1. Example Protein Family Model, TIGR03697.1 for the global nitrogen regulator NtcA protein family, with newly shown GO terms (framed in red).

NCBI hidden Markov models (HMM) release 4.0 now available!

Release 4.0 of the NCBI hidden Markov models (HMM) used by the Prokaryotic Genome Annotation Pipeline (PGAP) is now available from our FTP site. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.

This release contains 17,443 models, including 94 new models since the last release. We have also updated names and added EC numbers and gene symbols to over 100 models. You can search and view the details of these HMMs in the newly deployed Protein Family Model collection, which also includes conserved domain architectures and BlastRules and allows you to find all RefSeq proteins named by these profiles. See our recent post for more details.

The Protein Family Model resource is now available!

The new Protein Family Model resource (Figure 1) provides a way for you to search across the evidence used by the NCBI annotation pipelines to name and classify proteins. You can find protein families by gene symbol, protein function, and many other terms. You have access to related proteins in the family and publications describing members. The Protein Family Models collection includes protein profile hidden Markov models (HMMs) and BlastRules for prokaryotes, and conserved domain architectures for prokaryotes and eukaryotes. The HMMs in the collection include Pfam models and TIGRFAMs, as well as models developed at NCBI either de novo or from NCBI protein clusters. Each of the BlastRules (PMCID: 5753331) consists of one or more model proteins of known biological function with BLAST identity and coverage cutoffs. The conserved domain architectures are based on BLAST-compatible Position Specific Score Matrices (PSSMs) that constitute the NCBI Conserved Domain database.

Figure 1. Protein Family Model resource pages. Top panel, home page. Middle panel, selected results summaries from a fielded search for the DnaK gene product (DnaK[Gene Symbol]). Bottom panel, a portion of an HMM record for DnaK derived from NCBI Protein Clusters (NF009946). The record also includes PubMed citations and HMMER analyses showing the RefSeq proteins named by this method.