
Genomics and Phylogenetic Studies

Description

This cluster of papers focuses on the statistical analysis, alignment, and annotation of RNA sequencing data, including methods for transcript quantification, phylogenetic analysis, genome assembly, and quality assessment in functional genomics. It also covers tools for sequence variation analysis and metagenomics assembly.

Keywords

RNA-seq; sequence alignment; phylogenetic analysis; genome annotation; transcript quantification; metagenomics assembly; sequence variation; phylogenetic tree; functional genomics; quality assessment

RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
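The handling of reads that map to several isoforms, described in the abstract above, can be illustrated with a toy expectation-maximization loop. This Python sketch is not RSEM's implementation: it assumes each read is simply listed with the transcripts it is compatible with, and it ignores transcript length, fragment length, and sequencing error. The function name `em_abundances` is hypothetical.

```python
def em_abundances(read_compat, n_transcripts, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    read_compat: list of lists; each inner list holds the indices of the
    transcripts a read maps to (one index for a uniquely mapping read).
    """
    # Start from a uniform guess.
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            if total == 0.0:
                continue
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: renormalize the expected read counts.
        n_assigned = sum(expected)
        theta = [c / n_assigned for c in expected]
    return theta


# Six reads unique to transcript 0, two unique to transcript 1,
# and two reads that map to both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 2
theta = em_abundances(reads, n_transcripts=2)
```

In this toy case the ambiguous reads are shared 3:1, following the evidence from the uniquely mapping reads, so the estimates converge to 0.75 and 0.25.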
SILVA (from Latin silva, forest, http://www.arb-silva.de) is a comprehensive web resource for up to date, quality-controlled databases of aligned ribosomal RNA (rRNA) gene sequences from the Bacteria, Archaea and Eukaryota domains and supplementary online services. The referred database release 111 (July 2012) contains 3 194 778 small subunit and 288 717 large subunit rRNA gene sequences. Since the initial description of the project, substantial new features have been introduced, including advanced quality control procedures, an improved rRNA gene aligner, online tools for probe and primer evaluation and optimized browsing, searching and downloading on the website. Furthermore, the extensively curated SILVA taxonomy and the new non-redundant SILVA datasets provide an ideal reference for high-throughput classification of data from next-generation sequencing approaches.
When small RNA is sequenced on current sequencing machines, the resulting reads are usually longer than the RNA and therefore contain parts of the 3' adapter. That adapter must be found and removed error-tolerantly from each read before read mapping. Previous solutions are either hard to use or do not offer required features, in particular support for color space data. As an easy to use alternative, we developed the command-line tool cutadapt, which supports 454, Illumina and SOLiD (color space) data, offers two adapter trimming algorithms, and has other useful features. Cutadapt, including its MIT-licensed source code, is available for download at http://code.google.com/p/cutadapt/
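Error-tolerant removal of a 3' adapter, as described in the abstract above, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts mismatches (no insertions or deletions), it is not one of cutadapt's two trimming algorithms, and the function name and the parameters `max_error_rate` and `min_overlap` are hypothetical.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Return the read with a (possibly partial) 3' adapter removed.

    The adapter may be cut off at the end of the read, so at each start
    position only the overlapping prefix of the adapter is compared.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter) if a != b)
        # Tolerate a number of mismatches proportional to the overlap.
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    # No acceptable adapter match: leave the read untouched.
    return read


insert = "AAAAAAAAAA"
adapter = "CCGGTTCCGG"
trimmed_partial = trim_3prime_adapter(insert + "CCGGTT", adapter)
trimmed_with_error = trim_3prime_adapter(insert + "CCGGTTCCGA", adapter)
```

Both calls return the 10-base insert: the first read ends in a truncated adapter, the second in a full adapter with one sequencing error.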
The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data. The MODELTEST package, including the source code and some documentation, is available at http://bioag.byu.edu/zoology/crandall_lab/modeltest.html.
The multiplex capability and high yield of current day DNA-sequencing instruments have made bacterial whole genome sequencing a routine affair. The subsequent de novo assembly of reads into contigs has been well addressed. The final step of annotating all relevant genomic features on those contigs can be achieved slowly using existing web- and email-based systems, but these are not applicable for sensitive data or integrating into computational pipelines. Here we introduce Prokka, a command line software tool to fully annotate a draft bacterial genome in about 10 min on a typical desktop computer. It produces standards-compliant output files for further analysis or viewing in genome browsers. Prokka is implemented in Perl and is freely available under an open source GPLv2 license from http://vicbioinformatics.com/.
CLUSTAL X is a new windows interface for the widely-used progressive multiple sequence alignment program CLUSTAL W. The new system is easy to use, providing an integrated system for performing multiple sequence and profile alignments and analysing the results. CLUSTAL X displays the sequence alignment in a window on the screen. A versatile sequence colouring scheme allows the user to highlight conserved features in the alignment. Pull-down menus provide all the options required for traditional multiple sequence and profile alignment. New features include: the ability to cut-and-paste sequences to change the order of the alignment, selection of a subset of the sequences to be realigned, and selection of a sub-range of the alignment to be realigned and inserted back into the original alignment. Alignment quality analysis can be performed and low-scoring segments or exceptional residues can be highlighted. Quality analysis and realignment of selected residue ranges provide the user with a powerful tool to improve and refine difficult alignments and to trap errors in input sequences. CLUSTAL X has been compiled on SUN Solaris, IRIX5.3 on Silicon Graphics, Digital UNIX on DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for x86 PCs, and Macintosh PowerMac.
Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/ Supplementary data are available at Bioinformatics online.
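The basic operation behind searching for overlaps between two feature sets can be shown with a minimal Python sketch. It assumes BED-style coordinates (0-based start, end not included) and uses a naive all-against-all comparison, which is far slower than a production tool and is not meant to reflect how BEDTools is implemented. The function name `intersect` and the tuple layout are illustrative choices.

```python
def intersect(a_features, b_features):
    """Return the overlapping segments between two lists of features.

    Each feature is a (chrom, start, end) tuple with a 0-based start
    and an exclusive end, as in the BED format.
    """
    hits = []
    for chrom_a, start_a, end_a in a_features:
        for chrom_b, start_b, end_b in b_features:
            # Two half-open intervals overlap when each one starts
            # before the other one ends.
            if chrom_a == chrom_b and start_a < end_b and start_b < end_a:
                hits.append((chrom_a, max(start_a, start_b), min(end_a, end_b)))
    return hits


peaks = [("chr1", 100, 200), ("chr2", 50, 60)]
genes = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 10, 50)]
overlaps = intersect(peaks, genes)  # [("chr1", 150, 200)]
```

Features that merely touch, such as chr1:100-200 and chr1:200-300, share no base under half-open coordinates and are therefore not reported.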
Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies calls for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ∼10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net
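Backward search with the Burrows–Wheeler Transform, the idea named in the abstract above, can be demonstrated on a toy scale in Python. The sketch below handles exact matches only, whereas BWA also allows mismatches and gaps, and it builds the index by naively sorting suffixes and counting characters, which would not scale to a genome. The function names `build_index` and `backward_search` are hypothetical.

```python
def build_index(text):
    """Build a suffix array, the BWT string, and the C table for text."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    # c_table[ch] = number of characters in text smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    return suffix_array, bwt, c_table


def backward_search(pattern, suffix_array, bwt, c_table):
    """Return the sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(bwt)  # half-open range of suffix array rows
    # Extend the match one character at a time, from the last to the first.
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + bwt[:lo].count(ch)
        hi = c_table[ch] + bwt[:hi].count(ch)
        if lo >= hi:
            return []
    return sorted(suffix_array[row] for row in range(lo, hi))


index = build_index("GATTACATTAC")
positions = backward_search("TTAC", *index)  # [2, 7]
```

A real index would replace the repeated `bwt[:i].count(ch)` calls with precomputed occurrence tables, so that each pattern character costs constant time regardless of reference length.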
The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/.
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W, which is freely available.
Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net
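To make the SAM format more concrete, the following Python sketch parses the eleven mandatory tab-separated fields of one alignment line and decodes two of the FLAG bits (0x4 for an unmapped read, 0x10 for a read on the reverse strand). It is a minimal illustration rather than a replacement for SAMtools; the class and function names are hypothetical, and optional tags after the eleventh field are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse the mandatory fields of a single SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(record):
    return bool(record.flag & 0x4)


def is_reverse(record):
    return bool(record.flag & 0x10)


record = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
```

Here `record.pos` is 100 and `is_reverse(record)` is true, because FLAG 16 sets only the reverse-strand bit.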
PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696–704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira–Hasegawa–like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phyml/.
Large phylogenomics data sets require fast tree inference methods, especially for maximum-likelihood (ML) phylogenies. Fast programs exist, but due to inherent heuristics to find optimal trees, it is not clear whether the best tree is found. Thus, there is need for additional approaches that employ different search strategies to find ML trees and that are at the same time as fast as currently available ML programs. We show that a combination of hill-climbing approaches and a stochastic perturbation method can be time-efficiently implemented. If we allow the same CPU time as RAxML and PhyML, then our software IQ-TREE found higher likelihoods between 62.2% and 87.1% of the studied alignments, thus efficiently exploring the tree-space. If we use the IQ-TREE stopping rule, RAxML and PhyML are faster in 75.7% and 47.1% of the DNA alignments and 42.2% and 100% of the protein alignments, respectively. However, the range of obtaining higher likelihoods with IQ-TREE improves to 73.3–97.1%. IQ-TREE is freely available at http://www.cibiv.at/software/iqtree.
The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies. SPAdes is available online (http://bioinf.spbau.ru/spades). It is distributed as open source software.
Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Supplementary information: Supplementary data are available at Bioinformatics online.
We announce the release of the fourth version of MEGA software, which expands on the existing facilities for editing DNA sequence data from autosequencers, mining Web-databases, performing automatic and manual sequence alignment, analyzing sequence alignments to estimate evolutionary distances, inferring phylogenetic trees, and testing evolutionary hypotheses. Version 4 includes a unique facility to generate captions, written in figure legend format, in order to provide natural language descriptions of the models and methods used in the analyses. This facility aims to promote a better understanding of the underlying assumptions used in analyses, and of the results generated. Another new feature is the Maximum Composite Likelihood (MCL) method for estimating evolutionary distances between all pairs of sequences simultaneously, with and without incorporating rate variation among sites and substitution pattern heterogeneities among lineages. This MCL method also can be used to estimate transition/transversion bias and nucleotide substitution pattern without knowledge of the phylogenetic tree. This new version is a native 32-bit Windows application with multi-threading and multi-user supports, and it is also available to run in a Linux desktop environment (via the Wine compatibility layer) and on Intel-based Macintosh computers under the Parallels program. The current version of MEGA is available free of charge at http://www.megasoftware.net.
A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ∼12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT‐NS‐2) and the iterative refinement method (FFT‐NS‐i), are implemented in MAFFT. The performances of FFT‐NS‐2 and FFT‐NS‐i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT‐NS‐2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT‐NS‐i is over 100 times faster than T‐COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.
Motivation: Although many next-generation sequencing (NGS) read preprocessing tools already existed, we could not find any tool or combination of tools that met our requirements in terms of flexibility, correct handling of paired-end data and high performance. We have developed Trimmomatic as a more flexible and efficient preprocessing tool, which could correctly handle paired-end data.
Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. Here, we announce the release of Molecular Evolutionary Genetics Analysis version 5 (MEGA5), which is a user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and using methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity driven to make it easier for the use of both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from http://www.megasoftware.net.
We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
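Fast distance estimation by k-mer counting, the first element listed in the abstract above, avoids aligning the sequences at all. The Python sketch below shows one simple way to turn shared k-mer counts into a distance between 0 and 1. The exact formula used by MUSCLE is not given in the abstract, so the normalization here, the default `k=3`, and the function name `kmer_distance` are assumptions made for illustration.

```python
from collections import Counter


def kmer_distance(seq_a, seq_b, k=3):
    """Alignment-free distance based on the fraction of shared k-mers."""
    n_kmers_shorter = min(len(seq_a), len(seq_b)) - k + 1
    if n_kmers_shorter <= 0:
        raise ValueError("both sequences must contain at least one k-mer")
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    # Multiset intersection: each k-mer counts as often as it occurs in both.
    shared = sum((counts_a & counts_b).values())
    return 1.0 - shared / n_kmers_shorter


same = kmer_distance("MKVLAAGIV", "MKVLAAGIV")       # 0.0
unrelated = kmer_distance("MKVLAAGIV", "WWWWWWWWW")  # 1.0
```

Because only counting is involved, a full matrix of such distances can be computed quickly enough to build a guide tree for thousands of sequences before any alignment is attempted.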
Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as an open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq.
Motivation: Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments. Results: We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer. Availability: Source, binaries and data: http://drive5.com/uchime. Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: The Clustal W and Clustal X multiple sequence alignment programs have been completely rewritten in C++. This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems. Availability: The programs can be run on-line from the EBI web server: http://www.ebi.ac.uk/tools/clustalw2. The source code and executables for Windows, Linux and Macintosh computers are available from the EBI ftp site ftp://ftp.ebi.ac.uk/pub/software/clustalw2/.
Motivation: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. Results: We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. Availability and implementation: featureCounts is available under the GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.
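To make the idea of read summarization concrete, the following is a minimal sketch in Python. It is not featureCounts' implementation: the fixed-width bin index, the `BIN_SIZE` value, the function names and the `__no_feature` / `__ambiguous` tallies are all assumptions chosen for illustration, and coordinates are assumed to be 0-based, half-open intervals.

```python
from collections import defaultdict

BIN_SIZE = 1000  # assumed bin width for this illustrative index


def build_index(features):
    """Index features by (chromosome, bin).

    features: iterable of (gene_id, chrom, start, end), 0-based half-open.
    """
    index = defaultdict(list)
    for gene_id, chrom, start, end in features:
        for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            index[(chrom, b)].append((start, end, gene_id))
    return index


def summarize(reads, index):
    """Count reads per gene.

    reads: iterable of (chrom, start, end), 0-based half-open.
    A read is assigned only if it overlaps exactly one gene; otherwise it
    is tallied as unassigned or ambiguous (hypothetical category names).
    """
    counts = defaultdict(int)
    for chrom, start, end in reads:
        hits = set()
        for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            for f_start, f_end, gene_id in index.get((chrom, b), ()):
                if start < f_end and f_start < end:  # intervals overlap
                    hits.add(gene_id)
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return dict(counts)


# Toy data (made up for the example)
features = [
    ("geneA", "chr1", 100, 500),
    ("geneB", "chr1", 450, 900),
    ("geneC", "chr2", 0, 2500),
]
reads = [
    ("chr1", 120, 170),   # overlaps geneA only
    ("chr1", 460, 490),   # overlaps geneA and geneB
    ("chr1", 950, 1000),  # overlaps nothing
    ("chr2", 990, 1040),  # overlaps geneC, spans two bins
]
result = summarize(reads, build_index(features))
```

Binning by chromosome and position means each read is compared only against features in the bins it touches rather than against every feature, which is the general motivation for hashing and blocking schemes of this kind.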
Motivation: Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate ever-growing input datasets and to serve the needs of the user community. Results: I present some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code and a plethora of operations for conducting post-analyses on sets of trees. In addition, an up-to-date 50-page user manual covering all new RAxML options is available. Availability and implementation: The code is available under GNU GPL at https://github.com/stamatak/standard-RAxML. Supplementary information: Supplementary data are available at Bioinformatics online.
Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.
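The query-chunking idea can be illustrated with a short Python sketch. The abstract does not give chunk sizes or say how chunk boundaries are handled, so the `chunk_size` and `overlap` parameters here are purely hypothetical; the overlap is one plausible way to avoid losing alignments that straddle a boundary, not a description of what BLAST actually does.

```python
def split_query(length, chunk_size, overlap):
    """Return (start, end) half-open intervals covering a query of `length`.

    Consecutive chunks share `overlap` positions. Both parameters are
    illustrative assumptions, not values taken from the BLAST software.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if length <= 0:
        return []
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, length)
        chunks.append((start, end))
        if end == length:
            break
        start = end - overlap  # step back so neighbouring chunks overlap
    return chunks


example_chunks = split_query(25, chunk_size=10, overlap=2)
# example_chunks == [(0, 10), (8, 18), (16, 25)]
```

Each chunk can then be searched independently, and hits are mapped back to full-query coordinates by adding the chunk's start offset.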
MrBayes 3 performs Bayesian phylogenetic analysis combining information from different data partitions or subsets evolving under different stochastic evolutionary models. This allows the user to analyze heterogeneous data sets consisting of different data types (e.g. morphological, nucleotide, and protein) and to explore a wide variety of structured models mixing partition-unique and shared parameters. The program employs MPI to parallelize Metropolis coupling on Macintosh or UNIX clusters.
Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.
Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API. Availability: http://vcftools.sourceforge.net
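As a rough illustration of what a VCF record holds, the sketch below parses the eight fixed, tab-separated columns of a single data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) in plain Python. It is a minimal sketch, not VCFtools or its Perl API: the function names and the returned dictionary layout are assumptions, and header lines, genotype columns and type conversion of INFO values are deliberately left out.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # Keys without "=" are flags; store them as True.
            info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),                # one entry per alternate allele
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


def in_region(record, chrom, start, end):
    """True if the record's position lies in chrom:start-end (1-based, inclusive)."""
    return record["chrom"] == chrom and start <= record["pos"] <= end


example_line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\n"
record = parse_vcf_line(example_line)
```

A linear scan with `in_region` stands in for the indexed range retrieval mentioned in the abstract; real files are usually compressed and indexed so that such queries do not require reading the whole file.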
We announce the release of an advanced version of the Molecular Evolutionary Genetics Analysis (MEGA) software, which currently contains facilities for building sequence alignments, inferring phylogenetic histories, and conducting molecular evolutionary analysis. In version 6.0, MEGA now enables the inference of timetrees, as it implements the RelTime method for estimating divergence times for all branching points in a phylogeny. A new Timetree Wizard in MEGA6 facilitates this timetree inference by providing a graphical user interface (GUI) to specify the phylogeny and calibration constraints step-by-step. This version also contains enhanced algorithms to search for the optimal trees under evolutionary criteria and implements more advanced memory management that can double the size of sequence data sets to which MEGA can be applied. Both GUI and command-line versions of MEGA6 can be downloaded from www.megasoftware.net free of charge.
The Ribosomal Database Project (RDP) Classifier, a naïve Bayesian classifier, can rapidly and accurately classify bacterial 16S rRNA sequences into the new higher-order taxonomy proposed in Bergey's Taxonomic Outline of the Prokaryotes (2nd ed., release 5.0, Springer-Verlag, New York, NY, 2004). It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. The majority of classifications (98%) were of high estimated confidence (≥95%) and high accuracy (98%). In addition to being tested with the corpus of 5,014 type strain sequences from Bergey's outline, the RDP Classifier was tested with a corpus of 23,095 rRNA sequences as assigned by the NCBI into their alternative higher-order taxonomy. The results from leave-one-out testing on both corpora show that the overall accuracies at all levels of confidence for near-full-length and 400-base segments were 89% or above down to the genus level, and the majority of the classification errors appear to be due to anomalies in the current taxonomies. For shorter rRNA segments, such as those that might be generated by pyrosequencing, the error rate varied greatly over the length of the 16S rRNA gene, with segments around the V2 and V4 variable regions giving the lowest error rates. The RDP Classifier is suitable both for the analysis of single rRNA sequences and for the analysis of libraries of thousands of sequences. Another related tool, RDP Library Compare, was developed to facilitate microbial-community comparison based on 16S rRNA gene sequence libraries. It combines the RDP Classifier with a statistical test to flag taxa differentially represented between samples.
The RDP Classifier and RDP Library Compare are available online at http://rdp.cme.msu.edu/.
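The following Python sketch shows how a naive Bayesian classifier over sequence words can assign a genus. It is a toy illustration under stated assumptions, not the RDP Classifier itself: the word length `K = 4`, the simple `(count + 0.5) / (n + 1)` smoothing, the function names and the training sequences are all made up for the example, and the confidence estimates mentioned in the abstract are not implemented.

```python
import math
from collections import defaultdict

K = 4  # assumed word length, kept short for the toy example


def words(seq, k=K):
    """Set of distinct overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(labelled):
    """labelled: list of (genus, sequence).

    Returns (number of training sequences per genus,
             per-genus count of sequences containing each word).
    """
    n_seqs = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for genus, seq in labelled:
        n_seqs[genus] += 1
        for w in words(seq):
            word_counts[genus][w] += 1
    return n_seqs, word_counts


def classify(seq, model):
    """Return the genus with the highest summed log word probability."""
    n_seqs, word_counts = model
    best_genus, best_score = None, -math.inf
    for genus, n in n_seqs.items():
        score = sum(
            math.log((word_counts[genus].get(w, 0) + 0.5) / (n + 1))
            for w in words(seq)
        )
        if score > best_score:
            best_genus, best_score = genus, score
    return best_genus


# Made-up training data with hypothetical genus labels
training = [
    ("Alpha", "ACGTACGTACGT"),
    ("Alpha", "ACGTACGAACGT"),
    ("Beta", "TTGGCCTTGGCC"),
    ("Beta", "TTGGCCATGGCC"),
]
model = train(training)
call_1 = classify("ACGTACGT", model)
call_2 = classify("GGCCTTGG", model)
```

The "naive" part is the assumption that words occur independently given the genus, which is what allows the per-word log probabilities to be summed.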
DnaSP is a software package for a comprehensive analysis of DNA polymorphism data. Version 5 implements a number of new features and analytical methods allowing extensive DNA polymorphism analyses on large datasets. Among other features, the newly implemented methods allow for: (i) analyses on multiple data files; (ii) haplotype phasing; (iii) analyses on insertion/deletion polymorphism data; (iv) visualizing sliding window results integrated with available genome annotations in the UCSC browser. DnaSP is freely available to academic users from http://www.ub.edu/dnasp.
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
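A position-specific score matrix can be sketched in a few lines of Python. This is a simplified log-odds construction under stated assumptions, not PSI-BLAST's actual procedure: the uniform background frequencies, the flat `pseudocount`, the function names and the toy alignment are all illustrative, and refinements such as sequence weighting and gap handling are omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed uniform background; real tools use empirical residue frequencies.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def build_pssm(aligned, background=BACKGROUND, pseudocount=1.0):
    """Build a log-odds matrix (one dict per column) from equal-length,
    gap-free aligned sequences."""
    alphabet = sorted(background)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
            for aa in alphabet
        })
    return pssm


def score(pssm, segment):
    """Sum of column scores for a segment as long as the matrix."""
    return sum(column[res] for column, res in zip(pssm, segment))


# Toy alignment (made up): column 2 is fully conserved as C
alignment = ["ACDE", "ACDF", "ACEE", "GCDE"]
pssm = build_pssm(alignment)
consensus_score = score(pssm, "ACDE")
unrelated_score = score(pssm, "WWWW")
```

Because each column has its own scores, a conserved position rewards a match far more than a variable one, which is the property that lets a profile detect weaker similarities than a single fixed substitution matrix.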
We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.
Summary: The program MRBAYES performs Bayesian inference of phylogeny using a variant of Markov chain Monte Carlo. Availability: MRBAYES, including the source code, documentation, sample data files, and an executable, is available at http://brahms.biology.rochester.edu/software.html.
Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2–3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80–90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
Molecular genetic maps are commonly constructed by analyzing the segregation of restriction fragment length polymorphisms (RFLPs) among the progeny of a sexual cross. Here we describe a new DNA polymorphism assay based on the amplification of random DNA segments with single primers of arbitrary nucleotide sequence. These polymorphisms, simply detected as DNA segments which amplify from one parent but not the other, are inherited in a Mendelian fashion and can be used to construct genetic maps in a variety of species. We suggest that these polymorphisms be called RAPD markers, after Random Amplified Polymorphic DNA.
We present the latest version of the Molecular Evolutionary Genetics Analysis (MEGA) software, which contains many sophisticated methods and tools for phylogenomics and phylomedicine. In this major upgrade, MEGA has been optimized for use on 64-bit computing systems for analyzing larger datasets. Researchers can now explore and analyze tens of thousands of sequences in MEGA. The new version also provides an advanced wizard for building timetrees and includes a new functionality to automatically predict gene duplication events in gene family trees. The 64-bit MEGA is made available in two interfaces: graphical and command line. The graphical user interface (GUI) is a native Microsoft Windows application that can also be used on Mac OS X. The command line MEGA is available as native applications for Windows, Linux, and Mac OS X. They are intended for use in high-throughput and scripted analysis. Both versions are available from www.megasoftware.net free of charge.
The Molecular Evolutionary Genetics Analysis (MEGA) software implements many analytical methods and tools for phylogenomics and phylomedicine. Here, we report a transformation of MEGA to enable cross-platform use on Microsoft Windows and Linux operating systems. MEGA X does not require virtualization or emulation software and provides a uniform user experience across platforms. MEGA X has additionally been upgraded to use multiple computing cores for many molecular evolutionary analyses. MEGA X is available in two interfaces (graphical and command line) and can be downloaded from www.megasoftware.net free of charge.
The Molecular Evolutionary Genetics Analysis (MEGA) software has matured to contain a large collection of methods and tools of computational molecular evolution. Here, we describe new additions that make MEGA a more comprehensive tool for building timetrees of species, pathogens, and gene families using rapid relaxed-clock methods. Methods for estimating divergence times and confidence intervals are implemented to use probability densities for calibration constraints for node-dating and sequence sampling dates for tip-dating analyses. They are supported by new options for tagging sequences with spatiotemporal sampling information, an expanded interactive Node Calibrations Editor, and an extended Tree Explorer to display timetrees. Also added is a Bayesian method for estimating neutral evolutionary probabilities of alleles in a species using multispecies sequence alignments and a machine learning method to test for the autocorrelation of evolutionary rates in phylogenies. The computer memory requirements for the maximum likelihood analysis are reduced significantly through reprogramming, and the graphical user interface has been made more responsive and interactive for very big data sets. These enhancements will improve the user experience, quality of results, and the pace of biological discovery. Natively compiled graphical user interface and command-line versions of MEGA11 are available for Microsoft Windows, Linux, and macOS from www.megasoftware.net.
DNA N6-methyladenine (6mA) plays a significant role in various biological processes. In the rice genome, 6mA is involved in important processes such as growth and development, influencing gene expression. Therefore, identifying the 6mA locus in rice is crucial for understanding its complex gene expression regulatory system. Although several useful prediction models have been proposed, there is still room for improvement. To address this, we propose an architecture named iRice6mA-LMXGB that integrates a fine-tuned large language model to identify the 6mA locus in rice. Specifically, our method consists of two main components: (1) a BERT model for feature extraction and (2) an XGBoost module for 6mA classification. We utilize a pre-trained DNABERT-2 model to initialize the parameters of the BERT component. Through transfer learning, we fine-tune the model on the rice 6mA recognition task, converting raw DNA sequences into high-dimensional feature vectors. These features are then processed by an XGBoost algorithm to generate predictions. To further validate the effectiveness of our fine-tuning strategy, we employ UMAP (Uniform Manifold Approximation and Projection) visualization. Our approach achieves a validation accuracy of 0.9903 in a five-fold cross-validation setting and produces a receiver operating characteristic (ROC) curve with an area under the curve (AUC) of 0.9994. Compared to existing predictors trained on the same dataset, our method demonstrates superior performance. This study provides a powerful tool for advancing research in rice 6mA epigenetics.
Eriochloa villosa (Thunb.) Kunth, 1829 (Woolly cupgrass), an invasive annual weed in the Poaceae family, poses a significant threat to corn and soybean crop production, resulting in substantial yield reduction. Despite its agronomic importance, genomic resources for this species remain limited. In this study, we report the first complete chloroplast genome assembly of a hexaploid E. villosa, which spans 139,777 base pairs and contains 127 annotated genes. This comprehensive chloroplast genomic characterization provides essential foundational resources for further genetic and evolutionary studies within the Eriochloa genus.
A novel actinobacterium, designated strain 12N6T, was isolated from tropical peat swamp forest soil in Rayong Province, Thailand. The taxonomic position was determined using a polyphasic approach. Phylogenetic analysis based on 16S rRNA gene sequences revealed that strain 12N6T was classified within the genus Planosporangium and showed the highest percentage similarity to Planosporangium flavigriseum YIM 46034T (98.7%), followed by Planosporangium thailandense HSS8-18T (98.0%) and Planosporangium mesophilum YIM 48875T (97.9%). Strain 12N6T produced a single globose spore with a spiny surface on short sporophores of the substrate mycelia. The approximate genome size and DNA G+C content of strain 12N6T were 7.47 Mbp and 71.6 mol%, respectively. The highest average nucleotide identity and digital DNA–DNA hybridization values of genome sequences of 12N6T compared with closest species type strains (83.0% and 27.8%, respectively) are well below the thresholds for species delineation. The whole-cell hydrolysates of strain 12N6T contained meso-diaminopimelic acid, glucose, mannose, arabinose, xylose and ribose. The polar lipid profile comprised diphosphatidylglycerol, phosphatidylethanolamine and phosphatidylglycerol. The major cellular fatty acids were anteiso-C17:0, iso-C16:0 and iso-C15:0. The analysis of phylogenetic, genomic, phenotypic and chemotaxonomic characteristics revealed that strain 12N6T is considered to represent a novel species of the genus Planosporangium, for which the name Planosporangium spinosum sp. nov. is proposed. The type strain is 12N6T (=TBRC 19149T =NBRC 117121T).
Background: With the increased use of shotgun metagenome and metatranscriptome sequences in characterizing the microbiome, accurate taxonomic classification of sequencing reads is essential for interpreting microbial community composition and revealing differential microbial signatures between groups. K-mer based classifiers such as Kraken2 provide high speed and sensitivity, and are commonly used for low microbial biomass samples. However, their performance can be compromised by specific sources of error without proper parameter settings and incorporation of controls. Methods: In this study, we analyzed six sequencing datasets of human tumor biopsies with Kraken2 and investigated how shared compact hash codes (i.e., identical hash codes across different k-mers), hash collision and the structure of reference databases can contribute to false positive taxonomic assignments in low biomass samples. Results: We demonstrated that in samples with high non-microbial DNA noise, the taxa classified by Kraken2 in sequencing reads are significantly correlated with those of shuffled sequences using the default setting. These taxa showed a similar distribution as those overrepresented in the hash table construction of the reference database. Incorporation of controls using shuffled reads can separate significant taxa with more robust differences from those more affected by background noise. Although the confidence thresholds needed to minimize noise varied with taxa, a minimum value of 0.2 can also help reduce misclassifications. Conclusion: Our findings highlighted the need for caution when interpreting low-abundance or unexpected taxa in sequencing datasets of low microbial biomass samples.
This work contributes to a more comprehensive understanding of the limitations of k-mer based classification tools and provides practical guidance for improving accuracy in microbiome research.
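A shuffled-read control of the kind described above can be generated with a few lines of Python. This is only a sketch of one plausible approach: the abstract does not say how the reads were shuffled, so per-read permutation of bases, the function names and the fixed `seed` are assumptions made for illustration.

```python
import random


def shuffle_read(seq, rng):
    """Return a random permutation of the bases in one read.

    Length and base composition are preserved; k-mer content is not.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def shuffled_control(reads, seed=0):
    """Shuffle every read independently, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [shuffle_read(read, rng) for read in reads]


# Made-up reads for the example
reads = ["ACGTACGTTTGACCA", "GGGCATTTACGATCG"]
control = shuffled_control(reads, seed=1)
```

Classifying `control` with the same database and settings as the real reads gives a background profile: taxa that appear at similar levels in both are more likely to reflect noise than true signal.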
Peiying Huang, Chao Yang, Mei‐Fang Lin +3 more | Frontiers in Cellular and Infection Microbiology
Wohlfahrtiimonas is an infrequently encountered Gram-negative bacterium capable of infecting humans and, in severe instances, precipitating sepsis. Presently, three species within the Wohlfahrtiimonas genus have been identified, with Wohlfahrtiimonas chitiniclastica being the sole species implicated in human infections. To date, there has been only one documented case of human infection with W. chitiniclastica in China. In this study, we present an additional case of human infection with a Wohlfahrtiimonas species. Notably, through 16S rRNA gene sequencing and whole-genome sequencing, the strain was identified as an unclassified species closely related to W. chitiniclastica DSM 18708T.
Bifidobacterium species are well-established members of the human gut microbiome, particularly prominent during infancy, contributing to host health. Within this genus, Bifidobacterium longum (BL.) is a widespread species found in both infant and adult guts, known for its complexity and functional diversity among its known subspecies: BL. longum, BL. infantis and BL. suis. Here, using genomic and phylogenetic tools, we propose a novel subspecies within the BL. species, Bifidobacterium longum subsp. nexus subsp. nov. We analyzed 435 BL. genomes using a polyphasic taxonomic approach comprising average nucleotide identity (ANI), digital DNA–DNA hybridization (dDDH), and pangenome analysis. We identified nine BL. strains, isolated from human infant and adult stool samples, as members of a distinct lineage within the BL. species. The type strain, LL6991, was isolated from the stool of a two-week-old Dutch infant in the Lifelines NEXT birth cohort. Phenotypically, BL. nexus exhibits a distinct morphological pattern, predominantly forming rod-shaped cells, often in chains with visible septa, contrasting with the Y-shaped morphology commonly observed for other BL. subspecies. Furthermore, BL. nexus demonstrates unique metabolic capabilities, including efficient utilization of fructose and starch, carbohydrates not metabolized well by other tested BL. subspecies. This ability may be attributed to specific genes, such as a gene predicted to encode an extracellular amylopullulanase. This characterization expands the known diversity within the BL. species and provides insights into BL. nexus's unique adaptations and potential ecological roles within the human gut, especially in infants.
Based on the consistent results from genotypic, phylogenetic, and phenotypic analyses, a novel subspecies with the name Bifidobacterium longum subsp. nexus, with type strain LL6991, is proposed.
Complete chloroplast genome sequences are widely used in the analyses of phylogenetic relationships among angiosperms. Saxifraga L. is a species-rich genus whose centers of species diversity include mountainous regions of Eurasia, such as the Alps and the Qinghai–Tibetan Plateau (QTP) sensu lato. However, to date, chloroplast genome datasets for Saxifraga have been concentrated on the QTP species; those from the European Alps are largely unavailable, which hinders comprehensive comparative and evolutionary analyses of chloroplast genomes in this genus. Here, complete chloroplast genomes of 19 Saxifraga species were de novo sequenced, assembled and annotated; of these, 15 species from the Alps are reported for the first time. Subsequent comparative analysis and phylogenetic reconstruction were also conducted. Chloroplast genome lengths of the 19 Saxifraga species range from 149,217 bp to 152,282 bp, with a typical quadripartite structure. Each chloroplast genome included in this study contains 113 unique genes, including 79 protein-coding genes, four rRNAs and 30 tRNAs. The IR boundaries remain relatively conserved, with minor expansion in S. consanguinea. mVISTA analysis and identification of polymorphic loci for molecular markers show that six intergenic regions (ndhC-trnV, psbE-petL, rpl32-trnL, rps16-trnQ, trnF-ndhJ, trnS-trnG) can be selected as potential DNA barcodes. A total of 1204 SSRs, 433 tandem repeats and 534 large sequence repeats were identified in the 19 Saxifraga chloroplast genomes. The codon usage analysis revealed that codons in Saxifraga chloroplast genomes preferentially end in A/T.
Phylogenetic reconstruction of 33 species (31 Saxifraga species included) based on 75 common protein coding genes received high bootstrap support values for nearly all identified nodes, and revealed a tree topology similar to previous studies.
Here, we present the complete genome sequence of Saccharolobus solfataricus strain S441, isolated from Devil's Kitchen, Lassen Volcanic National Park in Northern California, U.S.A. The genome for this strain is 2,766,550 base pairs, with a GC content of 35.7% and 3,031 genes.
A new member affiliated with the genus Desulfovibrio, designated as strain TCA, was isolated from a dechlorinating enrichment culture originating from a freshwater river sediment. Its genome includes a 3.5 Mb chromosome with a G + C content of 66.2% and a 3.2 kb plasmid with a G + C content of 59.9%.
The predictability of biochemical evolution remains highly debated, as it is unknown if biological function can be predicted a priori. Here, we explore predictability of function not in individual reactions, but in ensembles. Sampling from 13,777 bacterial, 371 archaeal, and 203 eukaryotic genomes, we use Enzyme Commission (EC) classification to hierarchically group enzyme-catalyzed reactions that perform similar transformations into functional equivalence classes. We show that organisms partition their reaction diversity among these functional equivalence classes in a predictable way, identifying over 120 new, system-size dependent functional scaling relationships. We find distantly related lineages, acting on distinct molecular substrates, can display similar functional scaling, demonstrating that convergence is not driven by phylogeny or common reactions. We demonstrate how transitions in functional scaling can be identified with physiological shifts, using as case studies O2-utilizing oxidoreductases and hexosyltransferases. Taken together, our results open novel avenues for predicting global features of evolving enzyme populations, independent of protein structures. These functional constraints may have broad implications, including for predicting biochemical diversity, designing synthetic organisms, and modeling the evolution of reaction diversity in cases where the exact identity of catalysts is not known, such as during the emergence of life and for potential alternative forms of biochemistry.
This study presents the first complete mitochondrial genome characterization of Elops machnata (Teleostei: Elopiformes: Elopidae), a basal teleost lineage critical for understanding early actinopterygian evolution. The assembled mitogenome, deposited under GenBank accession number PV294982, spans 16,712 bp and exhibits the canonical vertebrate mitochondrial gene organization, comprising 13 protein-coding genes, 22 tRNA genes, 2 rRNA genes, and a control region. Base composition analysis revealed 22.71% A, 17.36% C, 29.82% T, and 30.11% G, with a slight AT bias (A + T = 52.53%). Codon usage analysis of the 13 protein-coding genes identified CUA (L), CGA (R), GCC (A), and GGA (G) as the most frequent codons, with a pronounced preference for adenine at the third codon position. Amino acid composition analysis across 23 Elopomorpha species revealed consistently high leucine contents, and tRNA secondary structure prediction showed 21 tRNAs forming typical cloverleaf structures, except for trnS1(gct), which lacks the dihydrouridine (DHU) arm. Phylogenetic reconstruction using maximum likelihood and Bayesian inference methods, based on concatenated mitochondrial protein-coding genes from 23 Elopomorpha species, placed E. machnata in a well-supported clade with Elops hawaiensis, confirming their close evolutionary relationship. This study not only provides essential genomic resources for E. machnata but also resolves key gaps in the mitochondrial genome and improves phylogenetic understanding of Elopomorpha.
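The base-composition figures reported above (per-base percentages and the A + T total) are simple counts over the assembled sequence. A minimal Python sketch of that calculation follows; the function names and the toy sequence are illustrative assumptions, not taken from the study, and ambiguous bases such as N are simply ignored here.

```python
from collections import Counter


def base_composition(seq: str) -> dict[str, float]:
    """Return the percentage of A, C, G and T among the unambiguous bases."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


def at_content(seq: str) -> float:
    """Return the combined A + T percentage of the sequence."""
    composition = base_composition(seq)
    return composition["A"] + composition["T"]


# Hypothetical toy sequence, used only to exercise the functions.
toy_sequence = "ATGCTTAACCGGTTAATGCA"
print(base_composition(toy_sequence))
print(at_content(toy_sequence))
```

Applied to the reported values, the A + T total is just the sum of the A and T percentages (22.71 + 29.82 = 52.53).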
Oysters are a group of bivalves forming the family Ostreidae. The identification of oysters at species level is sometimes difficult. The use of molecular data has drastically improved the reliability of species identification and our understanding of their phylogenetic relationships. Markers obtained from the mitochondrial genome have played and continue to play a key role in this process. Complete mitogenomes are still unavailable for many oyster species. We sequenced three complete mitogenomes of the dwarf oyster Ostrea stentina. We performed a comparative and evolutionary mitogenomic study of the new sequences combined with all available ones for the Ostreinae. The mitogenome of O. stentina exhibited the standard gene order of Ostreinae, which is different from those observed in other subfamilies of Ostreidae. The study of these mitogenomic arrangements identified gene blocks that were present in the mitogenome of the last common ancestor of the Ostreidae. The comparative analysis identified peculiar features of the mitogenomes of Ostreinae as well as of their protein-coding genes, tRNA genes, rRNA genes, and control regions. The genus Ostrea was recovered as polyphyletic in the mito-phylogenomic analysis. The stems and loops of several tRNAs contained short DNA motifs useful for identifying single species or groups of species. Short sequences that act as molecular signatures characterizing a single taxon or a group of species were also identified in the intergenic spacers. The identification of these taxonomic and phylogenetic markers reinforces the crucial role of mitogenomes in elucidating the evolutionary history of oysters.
This study provides the first comprehensive analysis of the chloroplast genome of Cordia subcordata, a protected species in China, native to Malesia and widely distributed along the Pacific and Indian Oceans. The genome is 154,811 bp long, circular, with a 38.01% GC content, encoding 133 genes: 88 protein-coding, 37 tRNA, and 8 rRNA genes. Phylogenetic analysis reveals it clusters with other Cordia species and is distinct from genera like Ehretia and Heliotropium, highlighting significant diversification within the Boraginaceae family.
Parupeneus biaculeatus, also known as pointed goatfish, belongs to the family Mullidae and is distinguished by its unique hyoid barbels containing sensory organs and a specialized foraging strategy, setting it apart from other fish species and making it an ideal model for studying biological adaptations and evolutionary processes. In this study, we present a high-quality chromosome-level genome assembly for P. biaculeatus using HiFi long reads and Hi-C data. The assembled genome has a total length of 657.58 Mb with a contig N50 of 9.35 Mb, organized into 22 chromosomes covering 99.34% of the genome. A total of 22,490 protein-coding genes were predicted, of which 98.37% were functionally annotated. Repeat analysis revealed that 34.83% of the genome consists of repetitive sequences. The genome assembly achieved an estimated completeness of 99.30% according to BUSCO analysis. This genomic resource provides new opportunities for understanding the biological traits, adaptive mechanisms, and evolutionary history of P. biaculeatus, and lays a foundation for further genomic studies within the family Mullidae.
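For readers unfamiliar with the contig N50 statistic quoted above, it is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below illustrates the standard definition in Python; the function name and the toy lengths are assumptions for illustration and have nothing to do with the assembly pipeline used in the study.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length first reaches at least half
    of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running total always reaches the total")


# Hypothetical toy assembly with five contigs.
print(n50([10, 8, 6, 4, 2]))
```

In the toy example the total is 30, and the two longest contigs (10 + 8 = 18) already cover half of it, so the N50 is 8.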
Protein sequence similarity search is fundamental to biology research, but current methods are typically not able to consider crucial genomic context information indicative of protein function, especially in microbial systems. Here, we present Gaia (Genomic AI Annotator), a sequence annotation platform that enables rapid, context-aware protein sequence search across genomic datasets. Gaia leverages gLM2, a mixed-modality genomic language model trained on both amino acid sequences and their genomic neighborhoods to generate embeddings that integrate sequence-structure-context information. This approach allows for the identification of functionally and/or evolutionarily related genes that are found in conserved genomic contexts, which may be missed by traditional sequence- or structure-based search alone. Gaia enables real-time search of a curated database comprising more than 85 million protein clusters from 131,744 microbial genomes. We compare the homolog retrieval performance of Gaia search against other embedding and alignment-based approaches. We provide Gaia as a web-based, freely available tool.
In January 2020, the World Health Organization officially recognized the novel coronavirus outbreak as a worldwide public health crisis, marking the onset of what would become the COVID-19 pandemic. Since then, extensive research efforts have been initiated to describe the virus, understand mutation patterns and transmission dynamics, and develop vaccines. Many of these studies require the classification of various virus strains, which is crucial for accurately characterizing the variants that emerged during the pandemic. However, classifying these strains requires methods for comparing genomic sequences, typically involving sequence alignment, a time-consuming process. In our study, we focused on assessing the accuracy and time efficiency of the k-mer method, which does not rely on sequence alignment but can enhance genomic comparisons. Using data from the National Center for Biotechnology Information SARS-CoV-2 website, we classified 17 complete genomes from different groups detected or emerging in Brazil, employing both alignment-based and k-mer approaches. An iterative prototype was developed in Shiny for classification analysis of SARS-CoV-2 virus sequences. Both methods yielded identical classifications, but the k-mer method was markedly more time-efficient, running 97% faster. Therefore, we advocate for the use of the k-mer method in viral genome analysis, particularly during emerging pandemics. Its combination of speed and accuracy can greatly expedite responses to new viral threats.
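To make the alignment-free idea concrete, the sketch below shows one common form of k-mer comparison in Python: each sequence is reduced to a table of k-mer counts, and a query is assigned to the reference whose count profile is closest. The abstract does not state the value of k, the distance measure, or the implementation (the authors mention a Shiny prototype), so the cosine distance, k = 3, the function names and the toy sequences here are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers that contain only A, C, G or T."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= set("ACGT"))


def cosine_distance(p: Counter, q: Counter) -> float:
    """Return 1 - cosine similarity between two k-mer count profiles."""
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # an empty profile shares nothing with any other profile
    return 1.0 - dot / (norm_p * norm_q)


def classify(query: str, references: dict[str, str], k: int = 3) -> str:
    """Return the label of the reference with the closest k-mer profile."""
    query_profile = kmer_profile(query, k)
    return min(
        references,
        key=lambda label: cosine_distance(query_profile, kmer_profile(references[label], k)),
    )


# Hypothetical toy references; real genomes would be read from FASTA files.
toy_references = {
    "lineage_A": "ACGTACGTACGTTTGACC",
    "lineage_B": "TTTTGGGGCCCCAAAATTGG",
}
print(classify("ACGTACGTACGATTGACC", toy_references))
```

Because no alignment is computed, the cost is dominated by a single pass over each sequence, which is the source of the speed advantage that k-mer methods generally offer.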
Background: Long-read RNA sequencing technologies can produce complete or near-complete transcript sequences. Recently introduced methods for direct RNA and cDNA sequencing can provide a high-throughput strategy for the discovery of novel and rare gene isoforms. However, the high error rates in ONT sequences limit the ability to exactly pinpoint splice site boundaries when aligning reads to the genome. Methods: In this paper, we present a novel tool called NIFFLR (Novel IsoForm Finder using Long Reads) that identifies and quantifies both known and novel isoforms using long-read RNA sequencing data. NIFFLR recovers known transcripts and assembles novel transcripts present in the data by aligning exons from a reference annotation to the long reads. Results: NIFFLR effectively recovers correct transcripts from simulated reads based on known transcript annotations, achieving higher sensitivity and precision compared to several previously published tools. On real data, NIFFLR shows high accuracy as measured by concordance of isoform counts with the counts computed from Illumina data for the same sample. We applied NIFFLR to a set of 92 GTEx long-read samples and produced transcript counts for both novel and known isoforms. In total, we identified and quantified 121,155 isoforms present in the RefSeq annotation of GRCh38 and 106,667 high-confidence novel isoforms across 32,875 genes present in two or more samples in these data, more than previous studies identified in this data set. Conclusions: NIFFLR is an effective tool for the assembly and quantification of transcripts present in long, high-error transcriptome reads.
NIFFLR is released under an open-source license (GPL 3.0) and is available on GitHub at https://github.com/alguoo314/NIFFLR/releases.
A novel bacterium, designated 1RM2T, was isolated from Xinjiang Province, north-west PR China. This strain could grow under conditions of 20–45 °C, pH 5.0–10.0 and 0–10% (w/v) NaCl. The species with the highest 16S rRNA gene sequence similarity to strain 1RM2T were Chryseobacterium bernardetii NCTC 13530T (97.1%) and Chryseobacterium daecheongense DSM 15235T (96.9%). The G+C content of the draft genome sequence of strain 1RM2T was 39.5 mol%. The average nucleotide identity values between strain 1RM2T and the two closest neighbours were 78.6% and 77.6%, and the DNA–DNA hybridization values were 21.8% and 21.2%, respectively. The main fatty acids of strain 1RM2T were iso-C15:0, iso-C17:0 3-OH, summed feature 3 (C16:1 ω6c and/or C16:1 ω7c) and summed feature 9 (C16:0 10-methyl and/or iso-C17:1 ω9c). The main isoprenoid quinone was menaquinone-6 and polar lipids were phosphatidylethanolamine, unidentified amino phospholipids and unidentified lipids. Based on phenotypic characteristics and genotype analysis, strain 1RM2T is a new species of the genus Chryseobacterium and is proposed to be named Chryseobacterium gossypii sp. nov. (=GDMCC 1.4437T =KCTC 102275T).
Four Gram-stain-positive, facultative anaerobic, yellow-pigmented and short rod-shaped strains, designated zg-Y625T, zg-Y843, zg-Y1090T and zg-Y1211, were isolated from the intestinal contents of Marmota himalayana in Qinghai Province, PR China. Strains zg-Y625T and zg-Y1090T showed the highest 16S rRNA gene sequence similarities of 99.7% and 99.9% to Microbacterium pullorum DSM 112390T, respectively, followed by 99.2% and 99.5% to Microbacterium oleivorans JCM 14341T and 99.1% and 99.0% to Microbacterium paulum LMG 32277T. Phylogenetic analyses based on 16S rRNA gene sequences and phylogenomic analyses using whole-genome sequences revealed that these four strains belong to the genus Microbacterium, forming two separate clades distinct from all other known Microbacterium species. The genome sizes of strains zg-Y625T and zg-Y1090T were 3.26 and 3.07 Mb, respectively, with DNA G+C contents of 70.5 and 70.9 mol%. The average nucleotide identity and digital DNA–DNA hybridization values between each of the novel strains and the available members of the genus Microbacterium were all below the species thresholds. Both type strains contained diphosphatidylglycerol and phosphatidylglycerol as predominant polar lipids with one unidentified glycolipid for zg-Y625T and two unidentified glycolipids for zg-Y1090T. The predominant respiratory quinone in zg-Y625T was MK-13, whilst in zg-Y1090T, both MK-11 and MK-13 were identified as the major quinones. The major fatty acids (>10%) in strains zg-Y625T and zg-Y843 were anteiso-C15:0, anteiso-C17:0 and iso-C16:0, whereas for zg-Y1090T and zg-Y1211, the predominant fatty acids were anteiso-C15:0 and anteiso-C17:0.
Based on phenotypic, phylogenetic, genomic and chemotaxonomic data, two novel species in the genus Microbacterium are proposed, namely, Microbacterium jiangjiandongii sp. nov. (zg-Y625T =GDMCC 1.3931T =JCM 36203T) and Microbacterium wangruii sp. nov. (zg-Y1090T =GDMCC 1.3930T =JCM 36205T).
The genus Moraxella (Moraxellaceae, Pseudomonadales) comprises a diverse group of bacteria inhabiting human and animal mucosa, as well as environmental niches such as water, soil and food. While some species are clinically significant pathogens, others play ecological or biotechnological roles. Despite previous taxonomic revisions, Moraxella remains polyphyletic, necessitating a refined classification. In this study, we conducted comprehensive taxogenomic analyses integrating core protein phylogeny, average amino acid identity and the percentage of conserved proteins, along with 16S rRNA similarities and inferred phylogeny. Our results revealed three distinct phylogenetic clusters within Moraxella. The core group, comprising Moraxella lacunata NBRC 102154T and closely related species, exhibited strong genomic cohesion. A second cluster, consisting of Moraxella boevrei DSM 14165T, Moraxella osloensis CCUG 350T and Moraxella atlantae NBRC 14588T, showed greater genetic affinity to Faucicola mancuniensis GVCNT2T, supporting their reassignment to the genus Faucicola. The third group, represented by Moraxella lincolnii CCUG 9405T, was phylogenetically distinct, occupying a basal position relative to Psychrobacter, indicating the need for its classification in a novel genus within the family Moraxellaceae, for which we propose the name Lwoffella lincolnii gen. nov., comb. nov. Phenotypic data compiled from the original published descriptions of the respective species, including fatty acid composition and enzymatic profiles, further corroborated these genomic findings. This study refines Moraxella taxonomy by clarifying genus boundaries and evolutionary relationships, with implications for ecology and clinical microbiology.
Thermus thermophilus is one of the most studied thermophilic bacteria. In this study, we report the draft genomes of four strains newly isolated from Japanese hot springs. The phylogenetic analysis of T. thermophilus strains isolated from diverse environments suggests a link between geographical distribution and genomic diversity.
Perkinsiodendron macgregorii, an endangered Chinese endemic tree with high ornamental and ecological value, faces extinction threats due to its poor natural regeneration and habitat degradation. Despite the urgent need for its conservation, the genetic architecture and population differentiation mechanisms of this taxon remain poorly understood, hindering science-based protection strategies. We conducted comprehensive chloroplast genomic analyses of 134 individuals from 13 natural populations to inform science-based conservation. The chloroplast genome (158,538–158,641 bp) exhibited conserved quadripartite organization, with 113 functional genes and elevated GC contents in IR regions (42.99–43.02%). Population-level screening identified 741 SNPs and 678 indels, predominantly in non-coding regions (89.8%), with three distinct phylogeographic clades revealing north-to-south genetic stratification. The northern clade (Clade A) demonstrates the highest haplotype diversity and nucleotide diversity, followed by the southern clade (Clade C), while the central clade (Clade B) exhibits signals of genetic erosion (Tajima’s D > 3.43). Based on the genetic diversity distribution and phylogenetic tree of extant P. macgregorii, we inferred that the northern populations represent ancestral groups, while the Wuyi Mountains region and Nanling Mountains region served as glacial refugia. It is imperative to implement in situ conservation in these two regions. Additionally, ex situ conservation should involve collecting seed from representative populations across all three clades and establishing isolated cultivation lines for each clade. These findings establish a genomic framework for conserving endangered plants.
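The haplotype diversity and nucleotide diversity compared across clades above are standard population-genetic summaries. As a hedged illustration, the Python sketch below uses the textbook estimators: haplotype diversity as n/(n − 1) · (1 − Σ p_i²) over haplotype frequencies p_i, and nucleotide diversity as the mean number of pairwise differences per site. The abstract does not say which software or estimator variants the authors used, and the function names and toy alignment are assumptions; gaps and missing data are not handled.

```python
from collections import Counter
from itertools import combinations


def haplotype_diversity(haplotypes: list[str]) -> float:
    """Return n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sequences are required")
    frequencies = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1.0 - sum(f * f for f in frequencies))


def nucleotide_diversity(aligned: list[str]) -> float:
    """Return the mean number of pairwise differences per aligned site."""
    if len(aligned) < 2:
        raise ValueError("at least two sequences are required")
    length = len(aligned[0])
    if length == 0 or any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be non-empty and equally long")
    pairs = list(combinations(aligned, 2))
    differences = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return differences / (len(pairs) * length)


# Hypothetical toy alignment of four individuals and four sites.
toy_alignment = ["AAAA", "AAAT", "AAAT", "AATT"]
print(haplotype_diversity(toy_alignment))
print(nucleotide_diversity(toy_alignment))
```

In the toy alignment there are three haplotypes with frequencies 0.25, 0.5 and 0.25, and six pairwise differences spread over six pairs and four sites, giving a nucleotide diversity of 0.25.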
Dendrobatid poison frogs have become well established as model systems in several fields of biology. Nevertheless, the development of molecular and genetic resources for these frogs has been hindered by their large, highly repetitive genomes, which have proven difficult to assemble. Here we present a draft assembly for Phyllobates terribilis (12.6 Gb), generated using a combination of sequencing platforms and bioinformatic approaches. Similar to other poison frog sequencing efforts, we recovered a highly fragmented assembly, likely due to the genome's large size and very high repeat content, which we estimated to be approximately 88%. Despite the assembly's low contiguity, we were able to annotate multiple members of three gene sets of interest (voltage-gated sodium channels and Notch and Wnt signaling pathways), demonstrating the usefulness of our assembly to the amphibian research community.
Introduction: The domestication of dogs is regarded as an evolutionary adaptation influenced by artificial selective pressures, leading to the emergence of diverse canine breeds across regions. Indigenous breeds, developed in tandem with local environments, display unique conformations and disease resistance, yet many remain understudied at the molecular level. The Gaddi dog, originating in the northern parts of India and used by local tribes for livestock guarding, exemplifies such a breed with potential for transcriptomic research. Despite its vital role, it remains unrecognized by the National Bureau of Animal Genetic Resources (NBAGR). This study addresses the gaps in understanding the genetics and immune responses of Indigenous breeds, emphasizing their importance as holders of unique genetic heritage. This study explores the molecular profiles of Indigenous Gaddi dogs and exotic Labrador retrievers, focusing on their immune responses to TLR ligand-induced infections. Methods: mRNA and miRNA sequencing were performed separately on the Illumina NovaSeq 6000 platform (150 bp). The study involved comparing the Control group (i.e., without treatment with any TLR ligand) with each of the poly I:C, LPS, and CpG ODN-treated groups for Labrador and Gaddi dogs. Functional enrichment analysis of differentially expressed genes (DEGs) (fold change >3 and <−3, p < 0.05) was conducted to identify enriched pathways in each breed. Results: The analysis revealed that Labrador dogs had more DEGs across all treatment groups than Gaddi dogs. The enriched pathways in Labradors included Th1, Th2, Th17 cell differentiation, and T-cell receptor signaling.
In contrast, Wnt signaling, T cell activation, and immune regulation pathways were significantly enriched in Gaddi dogs. The differential expression (DE) analysis of miRNA-Seq results indicated that Labradors had more DE miRNAs (with expression-level changes >1.5 and <−1.5), such as miR-204, miR-206, miR-106a, miR-132, miR-335, and miR-676, which help regulate inflammation, autophagy, and immune responses. Gaddi dogs had unique miRNAs (miR-551 and miR-1249) associated with tumor suppression and inflammation. Discussion: The study highlights distinct immunological profiles between Labrador and Gaddi dogs, with no shared genes responding to TLR-ligand stimulation. The functional enrichment of miRNA targets demonstrated consistent regulatory patterns at both the mRNA and miRNA levels. These findings emphasize the importance of preserving the genetic diversity of indigenous Gaddi dogs and utilizing advanced sequencing techniques to explore immunological diversity for disease resistance and the selection of breeding individuals.
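The DEG criterion stated in the Methods (signed fold change above 3 or below −3, with p < 0.05) amounts to a two-condition filter over a results table. The Python sketch below shows that filter under stated assumptions: the record layout, the function name and the gene identifiers are hypothetical, and the abstract does not say whether the p-values were adjusted for multiple testing or which tool produced them.

```python
from typing import NamedTuple


class ExpressionResult(NamedTuple):
    gene: str                  # hypothetical gene identifier
    signed_fold_change: float  # positive = up-regulated, negative = down-regulated
    p_value: float


def select_degs(
    results: list[ExpressionResult],
    fold_change_cutoff: float = 3.0,
    alpha: float = 0.05,
) -> list[ExpressionResult]:
    """Keep genes whose signed fold change is > cutoff or < -cutoff with p < alpha."""
    return [
        r for r in results
        if abs(r.signed_fold_change) > fold_change_cutoff and r.p_value < alpha
    ]


# Hypothetical toy results for a treated-versus-control comparison.
toy_results = [
    ExpressionResult("gene_a", 4.2, 0.001),   # kept: strongly up, significant
    ExpressionResult("gene_b", -3.5, 0.020),  # kept: strongly down, significant
    ExpressionResult("gene_c", 2.9, 0.001),   # dropped: fold change too small
    ExpressionResult("gene_d", 5.0, 0.200),   # dropped: not significant
]
print([r.gene for r in select_degs(toy_results)])
```

Using strict inequalities mirrors the wording of the abstract; a gene with a fold change of exactly 3 would not pass this sketch's filter.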
Bacterial genomes exhibit significant variation in gene content and sequence identity. Pangenome analyses explore this diversity by classifying genes into core and accessory clusters of orthologous groups (COGs). However, strict sequence identity cutoffs can misclassify divergent alleles as different genes, inflating accessory gene counts. CLARC (Connected Linkage and Alignment Redefinition of COGs) (https://github.com/IndraGonz/CLARC) improves pangenome analyses by condensing accessory COGs using functional annotation and linkage information. Through this approach, orthologous groups are consolidated into more practical units of selection. Analyzing 8000+ Streptococcus pneumoniae genomes, CLARC reduced accessory gene estimates by >30% and improved evolutionary predictions based on accessory gene frequencies. CLARC is effective across different bacterial species, making it a broadly applicable tool for comparative genomics. By refining COG definitions, CLARC offers critical insights into bacterial evolution, aiding genetic studies across diverse populations.
Millipedes (Diplopoda) are crucial decomposers in soil ecosystems, as they play a vital role in organic matter degradation while also holding potential as bioindicators of environmental health. This study deciphered the complete mitogenomes of four millipede species (Diplopoda: Spirostreptida and Spirobolida) using next-generation sequencing technology, thus revealing evolutionary relationships among diplopod taxa and characterizing mitochondrial genomic features. The full mitochondrial sequences of Agaricogonopus acrotrifoliolatus, Bilingulus sinicus, Paraspirobolus lucifugus, and Trigoniulus corallinus ranged in size from 14,906 to 15,879 bp, with each containing 37 typical genes and one D-loop region. Notably, the D-loop regions of A. acrotrifoliolatus and B. sinicus were positioned atypically, thus indicating structural rearrangements. A nucleotide composition analysis revealed pronounced AT-skews, with tRNA sequences exhibiting the highest A+T content. Ka/Ks ratios demonstrated that the ND5 gene experienced the weakest purifying selection pressure, thus suggesting its potential role in adaptive evolution. The phylogenetic analysis recovered the relationship ((Julida, Spirostreptida), Spirobolida) among the three orders, which is inconsistent with the arrangement previously obtained from morphological studies: ((Julida, Spirobolida), Spirostreptida). These findings highlight the role of the mitochondrial genome in resolving phylogenetic conflicts and provide important insights for further studies on millipedes.
Eugene W. Myers, Richard Durbin, Chenxi Zhou | bioRxiv (Cold Spring Harbor Laboratory)
FastGA finds alignments between two genome sequences more than an order of magnitude faster than previous methods that have comparable sensitivity. Its speed is due to (a) a carefully engineered architecture involving only cache-coherent MSD radix sorts and merges, (b) a novel algorithm for finding adaptive seed hits in a linear merge of sorted k-mer tables, and (c) a variant of the Myers adaptive wave algorithm [1] to find alignments around a chain of seed hits that detects alignments with up to 25-30% variation. It further does not require pre-masking of repetitive sequence, and stores millions of alignments in a fraction of the space of a conventional CIGAR-string [2] using a trace-point encoding that is further compressed by the ONEcode data system [3] introduced here. As an example, two bat genomes of size 2.2Gbp and 2.5Gbp can be compared in a little over 2 minutes using 8 threads on an Apple M4 Max laptop using 5.7GB of memory and producing 1.05 million alignments totaling 1.63Gbp of aligned sequence that cover about 60% of each genome. The output “ALN”-formatted file occupies 66MB. This file can be converted to a PAF file with CIGAR strings in 6 seconds, where the PAF representation is a significantly larger 1.03GB file. FastGA is freely available on GitHub at http://www.github.com/thegenemyers/FASTGA along with utilities for viewing inputs, intermediate files, and outputs and transforming outputs into other common formats. Specifically, FastGA can, in addition to its highly efficient ONEcode representation, output PSL-formatted alignments, or PAF-formatted alignments with or without CIGAR strings explicitly encoding the alignments.
There is also a utility to chain FastGA's alignments and display them in a dot-plot-like view in PostScript files, and an interactive viewer is in development.
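For readers unfamiliar with the CIGAR strings mentioned in the abstract above, the sketch below shows how the query and target spans of an alignment can be recovered from one. It follows the operation semantics of the SAM specification. It is not FastGA code and does not reproduce FastGA's trace-point or ONEcode encodings; the function name and the example string are illustrative assumptions.

```python
import re

# Which operations advance along the query or the target, per the SAM specification.
QUERY_OPS = set("MIS=X")
TARGET_OPS = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_spans(cigar: str) -> tuple[int, int]:
    """Return (query_bases, target_bases) consumed by a CIGAR string."""
    # Anything left over after removing valid length/op pairs is malformed input.
    if not cigar or _CIGAR_RE.sub("", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    query = target = 0
    for length, op in _CIGAR_RE.findall(cigar):
        n = int(length)
        if op in QUERY_OPS:
            query += n
        if op in TARGET_OPS:
            target += n
    return query, target


if __name__ == "__main__":
    # Made-up example: 10 matches, 2 inserted bases, 5 deleted bases, 3 matches.
    print(cigar_spans("10M2I5D3M"))
```

Because every alignment column is spelled out, a CIGAR string grows with alignment length, which is consistent with the abstract's observation that the PAF file with CIGAR strings is much larger than the compressed output.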
The effects of sample multiplexing on the detection sensitivity of antimicrobial resistance genes (ARGs) and pathogenic bacteria in metagenomic sequencing remain underexplored in newer sequencing technologies such as Oxford Nanopore Technologies (ONT), despite their critical importance for surveillance applications. Here, we evaluate how different multiplexing levels (four and eight samples per flowcell) on two ONT platforms, GridION and PromethION, influence the detection of ARGs, bacterial taxa and pathogens. While overall resistome and bacterial community profiles remained comparable across multiplexing levels, ARG detection was more comprehensive in the four-plex setting for low-abundance genes. Similarly, pathogen detection was more sensitive in the four-plex, identifying a broader range of low-abundance bacterial taxa compared to the eight-plex. However, triplicate sequencing of the same microbiomes revealed that these differences were primarily due to sequencing variability rather than multiplexing itself, as similar inconsistencies were observed across replicates. Given that eight-plex sequencing is more cost-effective while still capturing the overall resistome and bacterial community composition, it may be the preferred option for general surveillance. Lower multiplexing levels may be advantageous for applications requiring enhanced sensitivity, such as detailed pathogen research. These findings highlight the trade-off between multiplexing efficiency, sequencing depth, and cost in metagenomic studies.
Introduction: Mitochondrial genomes (mitogenomes) in Pinaceae are notable for their large size and complexity. This study investigates the mitogenome of the critically endangered Cathaya argyrophylla to understand the drivers of its exceptional genome expansion. Methods: We sequenced, assembled, and annotated the C. argyrophylla mitogenome. Comparative analyses were performed against other Pinaceae species and gymnosperms, examining repeat sequences, transposable elements (LINEs, LTRs), RNA editing events, chloroplast-derived sequence transfers (mtpts), and nuclear genome homology. Results: The C. argyrophylla mitogenome is a record-breaking 18.99 Mb. While C. argyrophylla and other extremely large Pinaceae mitogenomes possess substantial repeats and elevated transposon activity, these factors alone do not explain their size. Significant incorporation of mtpts was observed. Additionally, large mitogenomes exhibited distinct RNA editing patterns and reduced nuclear homology compared to smaller genomes. Discussion: Massive Pinaceae mitogenomes are characterized by a combination of features: substantial repeat content, elevated transposon activity, extensive plastid sequence integration, and distinct RNA editing and nuclear homology patterns. This comprehensive analysis enhances our understanding of plant mitogenome evolution and provides a genomic foundation for C. argyrophylla conservation and potential applications.
This study investigated thermophilic bacterial communities from two Algerian hot springs: Hammam Debagh (94-98 °C), recognized as the second hottest spring in the world, and Hammam Bouhadjar (61-72 °C), one of the hottest in northwest Algeria. Thirty isolates were obtained, able to grow between 45 °C and 80 °C, tolerating pH 5.0-12.0 and NaCl concentrations up to 3%. Colonies displayed diverse morphologies, from circular and smooth to star-shaped and Saturn-like forms. All isolates were characterized as Gram-positive, catalase-positive rods or filamentous bacteria. Identification by MALDI-TOF, rep-PCR and 16S rRNA sequencing classified them mainly within Bacillus, Brevibacillus, Aneurinibacillus, Geobacillus, and Aeribacillus, with Geobacillus predominating. Rep-PCR provided higher resolution, revealing intra-species diversity overlooked by MALDI-TOF MS and 16S rRNA. A subset of six isolates, mainly Geobacillus spp., was selected based on phenotypic and genotypic diversity and tested for antimicrobial activity against thermophilic target isolates from the same hot spring environments. Strong inhibition zones (~24 mm) were observed, with Geobacillus thermoleovorans B8 displaying the highest activity. Optimization on Modified Nutrient Agar medium with Gelrite enhanced antimicrobial production and inhibition clarity. These findings highlight the ecological and biotechnological significance of thermophilic bacteria from Algerian geothermal ecosystems. While this study focused on microbial interactions within thermophilic communities, the promising inhibitory profiles reported here provide a foundation for future research targeting foodborne and antibiotic-resistant pathogens, as part of broader efforts in biopreservation and sustainable antimicrobial development.
A novel Gram-negative, oxidase- and catalase-positive, rod-shaped bacterium, designated strain KX21116 T, was isolated from the mussel Gigantidas platifrons collected from a cold seep field in the South China Sea. Strain KX21116 T grew optimally at 28 °C, pH 6.0 with 3% (w/v) NaCl, under aerobic and microaerobic conditions. Its genome size was 3.16 Mb, with a G+C content of 28.4 mol%. The 16S rRNA sequences revealed that strain KX21116 T was closely related to Arcobacter nitrofigilis DSM 7299 T (98.77% gene sequence similarity) and Arcobacter acticola AR-13 T (95.58%). Phylogenetic and phylogenomic analysis revealed that strain KX21116 T clustered with the type species of the genus Arcobacter, with A. nitrofigilis DSM 7299 T as its nearest neighbour. The genomic average nucleotide identity (orthoANI) value between strain KX21116 T and A. nitrofigilis DSM 7299 T was 92.74%, while the in silico DNA–DNA hybridization (GGDC) value for the two strains was 48.8%. The predominant fatty acids are C 16:0, C 16:1 ω7c/C 16:1 ω6c and C 18:1 ω7c/C 18:1 ω6c. Based on a comparative analysis of phylogenetic, phylogenomic, phenotypic and chemotaxonomic characteristics, strain KX21116 T represents a novel species of the genus Arcobacter, for which the name Arcobacter iocasae sp. nov. is proposed. The type strain is KX21116 T (=MCCC 1K08505 T =KCTC 92900 T =JCM 35939 T).
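The genome-based reasoning in the abstract above can be illustrated with a small check against species-delineation cutoffs. The abstract does not state which cutoffs were applied. The defaults below (95% ANI, 70% dDDH) are the commonly cited values in prokaryotic taxonomy and are an assumption here, as is the function name. This is a sketch of the comparison only, not of how ANI or dDDH are computed.

```python
def below_species_cutoffs(
    ani: float,
    dddh: float,
    ani_cutoff: float = 95.0,   # assumed cutoff, not stated in the abstract
    dddh_cutoff: float = 70.0,  # assumed cutoff, not stated in the abstract
) -> bool:
    """True when both ANI and dDDH (in percent) fall below the given cutoffs."""
    return ani < ani_cutoff and dddh < dddh_cutoff


if __name__ == "__main__":
    # Values reported above for strain KX21116 T versus A. nitrofigilis DSM 7299 T.
    print(below_species_cutoffs(ani=92.74, dddh=48.8))
```

Requiring both measures to fall below their cutoffs is a conservative choice made for this sketch; published descriptions also weigh phenotypic and chemotaxonomic evidence, as the abstract notes.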
Two novel bacteria (designated N501-2 T and N40-8-2 T) belonging to the genera Lacticaseibacillus and Levilactobacillus were isolated from traditional Chinese pickle (‘Suan cai’) and identified. Strain N501-2 T was phylogenetically related to the type strains of Lacticaseibacillus baoqingensis, Lacticaseibacillus porcinae, Lacticaseibacillus manihotivorans and Lacticaseibacillus jixiensis, having 98.7–99.4% 16S rRNA gene sequence similarities. Strain N40-8-2 T was phylogenetically related to the type strains of Levilactobacillus fuyuanensis, Levilactobacillus parabrevis, Levilactobacillus tujiorum, Levilactobacillus hammesii, Levilactobacillus senmaizukei and Levilactobacillus tangyuanensis, having 98.4–99.4% 16S rRNA gene sequence similarities. Strain N501-2 T had 74.0–84.2% ANI (average nucleotide identity), 20.8–28.4% dDDH (digital DNA–DNA hybridization) and 73.5–86.8% AAI (average amino acid identity) values with L. manihotivorans DSM 13343 T, L. porcinae JCM 19617 T, L. jixiensis N163-3-2 T and L. baoqingensis 47-3 T. Analyses based on whole-genome sequences indicated that strain N40-8-2 T was most closely related to the type strains of L. fuyuanensis, L. parabrevis, L. tujiorum and L. hammesii, having less than 87.4% ANI, 33.6% dDDH and 93.2% AAI values. Based upon the data obtained in the present study, two novel species, Lacticaseibacillus salsurae sp. nov. and Levilactobacillus muriae sp. nov., are proposed, and the type strains are N501-2 T (=CCTCC AB 2024124 T =JCM 37150 T =LMG 33663 T) and N40-8-2 T (=CCTCC AB 2024128 T =JCM 37001 T =LMG 33659 T), respectively.
Summary: Linseed (Linum usitatissimum L.), a member of the Linaceae family, is a versatile crop valued for its oil, fibre, nutritional and medicinal applications. Recognized as a superfood, linseed is rich in omega-3 fatty acid (~55%), lignans, high-quality proteins, dietary fibre and bioactive secondary metabolites. Previously published genome assemblies of linseed are quite fragmented and non-contiguous. In this study, we present a telomere-to-telomere (T2T) chromosome-scale genome assembly of the Indian linseed variety T397 using advanced sequencing approaches. The assembly comprises ~595 Mb of genomic sequences, with a scaffold N50 of 32.86 Mb, spanning 15 chromosomes, including 29 telomeres and 15 centromeres. A total of 34 572 protein-encoding genes were predicted with an average length of 2980.7 bp and an average of 5.0 exons per gene. Gene family analysis identified a considerable number of unique genes in linseed and indicated its close relationship with Manihot esculenta and Ricinus communis. The higher expression of oleosin and FAD3 genes in linseed highlights their roles in oil accumulation and enrichment for omega-3 fatty acid. The metabolites found in the seeds were enriched for the biosynthesis of unsaturated fatty acids. Various potential key structural genes and transcription factors that regulate oil metabolism, especially unsaturated fatty acid biosynthesis, have been identified. Overall, the present study provides genomic resources for accelerated genetic studies and improvement of linseed.
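The scaffold N50 quoted in the summary above is a contiguity statistic: the length L such that scaffolds of length L or longer together contain at least half of the assembled bases. A minimal Python sketch of that definition follows. The function name and the toy lengths are illustrative and are not the linseed assembly's actual scaffold lengths.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up scaffold lengths
    print(n50(toy_lengths))
```

Comparing `2 * running` with `total` keeps the arithmetic in integers and avoids rounding issues when the total is odd.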
Species within Corydalis are valued for medicinal and ornamental uses, but taxonomic uncertainties persist due to limited genomic data. Here, we present the complete chloroplast genome of C. wilsonii, with a quadripartite structure of 191,388 bp and 140 functional genes. Phylogenetic analysis robustly resolves C. wilsonii within a monophyletic clade alongside C. saxicola, C. tomentella, and C. fangshanensis, all nested within sect. Thalictrifoliae (bootstrap support = 100%). This study expands the chloroplast genomic resources for Corydalis and establishes a taxonomic framework to refine species identification and resolve evolutionary relationships within this ecologically and economically vital genus.
Two Gram-stain-negative, aerobic, non-motile, non-gliding, rod-shaped bacterial strains, designated as TBRC 19031 T and TBRC 19032, were isolated from water samples collected from the Mekong River, Thailand. Strain TBRC 19031 T was obtained from Chiang Saen in the upstream section near the borders with China and Myanmar, while TBRC 19032 originated from Khong Chiam, in the downstream section where the river exits Thailand. Colonies of both strains were circular, smooth and deep yellow on Reasoner's 2A agar and did not produce flexirubin-type pigments. Phylogenetic analysis with 16S rRNA gene sequences placed both strains within the genus Flavobacterium, showing the highest sequence similarity to Flavobacterium cheonhonense ARSA-15 T (98.29% for TBRC 19031 T and 98.22% for TBRC 19032). However, whole-genome comparisons between the strains and F. cheonhonense ARSA-15 T revealed average nucleotide identity (89.39% and 89.29%), average amino acid identity (92.84% and 92.95%) and digital DNA–DNA hybridization (35.00% and 34.70%) values. The predominant fatty acids were iso-C 15:1, iso-C 15:0 and iso-C 15:0 3-OH, and menaquinone MK-6 was the major respiratory quinone. The major polar lipids of both strains included phosphatidylethanolamine, steryl ester and diacylglycerol. The genome sizes were 3.02 and 3.04 Mbp, with G+C contents of 38.3% and 38.2% for TBRC 19031 T and TBRC 19032, respectively. Comparative genomic analyses revealed the absence of genes involved in sulphate reduction and denitrification pathways and the presence of a gene encoding phosphatidylinositol synthase, distinguishing them from other Flavobacterium within the clade.
Ecological profiling using public metagenomic datasets showed that both strains were associated with lotic freshwater environments. This study not only introduces Flavobacterium mekongense sp. nov. as a new species but also provides broader insights into the ecology, metabolism and environmental distribution of freshwater Flavobacterium. The genomic features identified here offer promising leads for future studies in microbial ecology, comparative genomics and functional gene mining in aquatic ecosystems. The type strain is TBRC 19031 T (TBRC 19031 T =NBRC 117006 T).
Nightjars (Aves: Caprimulgidae) are a species-rich family of birds, with the “eared nightjars” (Eurostopodinae) being an early-branching group endemic to the Indo-Pacific. While much research has focused on species-rich nightjar genera and their higher-level relationships, the evolutionary history of Eurostopodinae (Eurostopodus, Lyncornis) remains understudied. We generated a genome-scale dataset to produce the first fully sampled phylogeny of all Eurostopodus and one Lyncornis species, including sequencing two type specimens of critically endangered and extinct species. Tree-building methods inferred concordant, well-resolved topologies that reveal intriguing biogeographic patterns within Eurostopodus. Our results show Eurostopodus as sister to all other nightjars, while Lyncornis, previously considered related, is more closely allied with other caprimulgids. We propose that the term “eared nightjars” should apply only to the two Lyncornis species, which should be classified within the subfamily Caprimulginae. Accordingly, since only Eurostopodus species remain in Eurostopodinae, we recommend renaming this subfamily “Indo-Pacific nightjars” to reflect their geographic distribution in this significant region.