Machine Learning in Bioinformatics

Description

This cluster of papers focuses on the prediction of protein subcellular localization using various computational methods such as amino acid composition, machine learning algorithms like support vector machines, and the analysis of signal peptides and transmembrane topology. The research aims to improve the accuracy and reliability of predicting the subcellular location of proteins, which has significant implications for understanding protein function and cellular processes.

Keywords

Subcellular Localization; Protein; Prediction; Amino Acid Composition; Machine Learning; Support Vector Machines; Signal Peptides; Transmembrane Topology; Enzyme Subfamily Classes; Bioinformatics

This chapter contains sections titled: Historical Introduction; The Chou and Fasman Predictive Method; Definition of Conformational Regions; Refinement of Conformational Parameters; Application of Chou-Fasman Method; Comparison of Predictive Methods; Computerized Chou-Fasman Method; Future Directions.
Because a protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836-2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences, with some data sets comprising sequences of up to 80-90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two-level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vector derived from sequences; the second-level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization.
Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches, especially those relying on homology search or sequence annotations. Our two-level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two-level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization.
Motivation: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. Results: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of ~10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. Availability: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. Supplementary information: Supplementary data are available at Bioinformatics online.
Vaccine development in the post-genomic era often begins with the in silico screening of genome information, with the most probable protective antigens being predicted rather than requiring causative microorganisms to be grown. Despite the obvious advantages of this approach – such as speed and cost efficiency – its success remains dependent on the accuracy of antigen prediction. Most approaches use sequence alignment to identify antigens. This is problematic for several reasons. Some proteins lack obvious sequence similarity, although they may share similar structures and biological properties. The antigenicity of a sequence may be encoded in a subtle and recondite manner not amenable to direct identification by sequence alignment. The discovery of truly novel antigens will be frustrated by their lack of similarity to antigens of known provenance. To overcome the limitations of alignment-dependent methods, we propose a new alignment-free approach for antigen prediction, which is based on auto cross covariance (ACC) transformation of protein sequences into uniform vectors of principal amino acid properties. Bacterial, viral and tumour protein datasets were used to derive models for prediction of whole protein antigenicity. Every set consisted of 100 known antigens and 100 non-antigens. The derived models were tested by internal leave-one-out cross-validation and external validation using test sets. An additional five training sets for each class of antigens were used to test the stability of the discrimination between antigens and non-antigens. The models performed well in both validations, showing prediction accuracy of 70% to 89%. The models were implemented in a server, which we call VaxiJen. VaxiJen is the first server for alignment-independent prediction of protective antigens. It was developed to allow antigen classification solely based on the physicochemical properties of proteins without recourse to sequence alignment.
The server can be used on its own or in combination with alignment-based prediction methods. It is freely available online at http://www.jenner.ac.uk/VaxiJen.
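The ACC idea described above, turning sequences of different lengths into vectors of one fixed length, can be sketched as follows. This is a minimal illustration and not the VaxiJen implementation: the function name `acc_transform`, the two-letter toy alphabet and the descriptor values in `TOY_DESCRIPTORS` are all assumptions made for the example, and the covariance terms are the plain lagged products averaged over the sequence.

```python
def acc_transform(sequence, descriptors, max_lag):
    """Return auto and cross covariance terms for lags 1..max_lag.

    descriptors maps each residue to a tuple of numeric properties.
    The output length is max_lag * d * d (d = number of properties),
    so it does not depend on the sequence length.
    """
    n = len(sequence)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    z = [descriptors[residue] for residue in sequence]
    n_desc = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):        # property at position i
            for k in range(n_desc):    # property at position i + lag
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


# Hypothetical descriptor table (illustration only, not real property scales).
TOY_DESCRIPTORS = {"A": (1.0, 0.0), "G": (0.0, 1.0)}

short = acc_transform("AGAG", TOY_DESCRIPTORS, max_lag=1)
longer = acc_transform("AGGAGAAG", TOY_DESCRIPTORS, max_lag=1)
print(len(short), len(longer))  # both vectors have the same length
```

Terms with j equal to k are the auto covariances and the remaining terms are the cross covariances; a real application would plug in a table of measured amino acid properties for all 20 residues.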
Post-translational modifications (PTMs) occur on almost all proteins analyzed to date. The function of a modified protein is often strongly affected by these modifications, and therefore increased knowledge about the potential PTMs of a target protein may increase our understanding of the molecular processes in which it takes part. High-throughput methods for the identification of PTMs are being developed, in particular within the fields of proteomics and mass spectrometry. However, these methods are still in their early stages, and it is indeed advantageous to cut down on the number of experimental steps by integrating computational approaches into the validation procedures. Many advanced methods for the prediction of PTMs exist and many are made publicly available. We describe our experiences with the development of prediction methods for phosphorylation and glycosylation sites and the development of PTM-specific databases. In addition, we discuss novel ideas for PTM visualization (exemplified by kinase landscapes) and improvements for prediction specificity (by using ESS, evolutionary stable sites). As an example, we present a new method for kinase-specific prediction of phosphorylation sites, NetPhosK, which extends our earlier and more general tool, NetPhos. The new server, NetPhosK, is made publicly available at the URL http://www.cbs.dtu.dk/services/NetPhosK/. The issues of underestimation, over-prediction and strategies for improving prediction specificity are also discussed.
We provide a unified overview of methods that currently are widely used to assess the accuracy of prediction algorithms, from raw percentages, quadratic error measures and other distances, and correlation coefficients to information-theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadvantages of each approach. For classification tasks, we derive new learning algorithms for the design of prediction systems by directly optimising the correlation coefficient. We observe and prove several results relating sensitivity and specificity of optimal systems. While the principles are general, we illustrate the applicability on specific problems such as protein secondary structure and signal peptide prediction. Author note: also at the Department of Biological Sciences, University of California, Irvine, USA, to whom all correspondence should be addressed.
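For a two-class problem, several of the measures named above can be computed directly from the four confusion counts. The sketch below is an illustration rather than code from the paper; the function names are assumptions, and specificity is taken here as TN / (TN + FP), which is one of several conventions in use (some authors use TP / (TP + FP) under the same name).

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of real positives that were predicted positive."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn, fp):
    """Fraction of real negatives that were predicted negative."""
    return tn / (tn + fp) if tn + fp else 0.0


def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp),
      correlation_coefficient(tp, tn, fp, fn))  # 0.75 0.75 0.5
```

Unlike the raw percentage of correct predictions, the correlation coefficient stays informative when one class is much larger than the other, which is common in tasks such as signal peptide prediction.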
A novel method to model and predict the location and orientation of alpha helices in membrane-spanning proteins is presented. It is based on a hidden Markov model (HMM) with an architecture that corresponds closely to the biological system. The model is cyclic with 7 types of states for helix core, helix caps on either side, loop on the cytoplasmic side, two loops for the non-cytoplasmic side, and a globular domain state in the middle of each loop. The two loop paths on the non-cytoplasmic side are used to model short and long loops separately, which corresponds biologically to the two known different membrane insertion mechanisms. The close mapping between the biological and computational states allows us to infer which parts of the model architecture are important to capture the information that encodes the membrane topology, and to gain a better understanding of the mechanisms and constraints involved. Models were estimated both by maximum likelihood and a discriminative method, and a method for reassignment of the membrane helix boundaries was developed. In a cross-validated test on single sequences, our transmembrane HMM, TMHMM, correctly predicts the entire topology for 77% of the sequences in a standard dataset of 83 proteins with known topology. The same accuracy was achieved on a larger dataset of 160 proteins. These results compare favourably with existing methods.
 The PROSITE database consists of biologically significant patterns and profiles formulated in such a way that with appropriate computational tools it can help to determine to which known family of protein (if any) a new sequence belongs, or which known domain(s) it contains.
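As an illustration of how a PROSITE-style pattern can be applied with ordinary tools, the sketch below translates the pattern syntax into a Python regular expression. It is an assumption-laden simplification: the helper names `prosite_to_regex` and `find_matches` are invented for this example, only the common syntax elements are handled (`x`, `[..]`, `{..}`, repeat counts and the `<` / `>` anchors at the ends), and profiles are not covered at all.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern string into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(sequence, pattern):
    """Return (start, matched_text) for non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern, applied to a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))          # N[^P][ST][^P]
print(find_matches("AANGSAA", "N-{P}-[ST]-{P}."))   # [(2, 'NGSA')]
```

Note that `finditer` reports non-overlapping matches only, so a scanner that needs every overlapping occurrence would have to restart the search one position after each hit.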
Summary: The system SOSUI for the discrimination of membrane proteins and soluble ones together with the prediction of transmembrane helices was developed, in which the accuracy of the classification of proteins was 99% and the corresponding value for the transmembrane helix prediction was 97%. Availability: The system SOSUI is available through internet access: http://www.tuat.ac.jp/mitaku/sosui/.
 When using conventional transmembrane topology and signal peptide predictors, such as TMHMM and SignalP, there is a substantial overlap between these two types of predictions. Applying these methods to five complete proteomes, we found that 30-65% of all predicted signal peptides and 25-35% of all predicted transmembrane topologies overlap. This impairs predictions of 5-10% of the proteome, hence this is an important issue in protein annotation. To address this problem, we previously designed a hidden Markov model, Phobius, that combines transmembrane topology and signal peptide predictions. The method makes an optimal choice between transmembrane segments and signal peptides, and also allows constrained and homology-enriched predictions. We here present a web interface (http://phobius.cgb.ki.se and http://phobius.binf.ku.dk) to access Phobius.
 Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
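The Markov cluster (MCL) algorithm mentioned above alternates two matrix operations on a column-normalised similarity graph: expansion (matrix squaring) and inflation (element-wise powering followed by renormalisation). The sketch below shows that loop on a small dense matrix. It is not the TRIBE-MCL code; the function names, the added self-loops, the inflation value of 2.0 and the way clusters are read off the converged matrix are assumptions chosen for a compact illustration.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalise columns."""
    return normalize_columns([[v ** power for v in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=50, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix; returns index lists."""
    n = len(similarity)
    # Add self-loops before normalising (a common choice, assumed here).
    m = normalize_columns([[similarity[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        change = max(abs(new[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = new
        if change < tol:
            break
    # Each row that keeps non-zero entries groups the columns it attracts.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Toy input: two groups of three mutually similar sequences, unrelated to
# each other (similarity values are made up).
toy = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
print(mcl(toy))  # [[0, 1, 2], [3, 4, 5]]
```

In practice the similarity matrix would hold precomputed pairwise scores (for example from an all-against-all sequence search) and would be stored sparsely, since a dense matrix does not scale to large databases.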
The GLIMMER system for microbial gene identification finds ~97–98% of all genes in a genome when compared with published annotation. This paper reports on two new results: (i) significant technical improvements to GLIMMER that improve its accuracy still further, and (ii) a comprehensive evaluation that demonstrates that the accuracy of the system is likely to be higher than previously recognized. A significant proportion of the genes missed by the system appear to be hypothetical proteins whose existence is only supported by the predictions of other programs. When the analysis is restricted to genes that have significant homology to genes in other organisms, GLIMMER misses <1% of known genes.
 Most of the proteins that are used in mitochondria are imported through the double membrane of the organelle. The information that guides the protein to mitochondria is contained in its sequence and structure, although no direct evidence can be obtained. In this article, discriminant analysis has been performed with 47 parameters and a large set of mitochondrial proteins extracted from the SwissProt database. A computational method that facilitates the analysis and objective prediction of mitochondrially imported proteins has been developed. If only the amino acid sequence is considered, 75–97% of the mitochondrial proteins studied have been predicted to be imported into mitochondria. Moreover, the existence of mitochondrial‐targeting sequences is predicted in 76–94% of the analyzed mitochondrial precursor proteins. As a practical application, the number of unknown yeast open reading frames that might be mitochondrial proteins has been predicted, which revealed that many of them are clustered.
A new method for predicting signal sequence cleavage sites. Gunnar von Heijne, Research Group for Theoretical Biophysics, Department of Theoretical Physics, Royal Institute of Technology, S-100 44 Stockholm, Sweden. Nucleic Acids Research, Volume 14, Issue 11, 11 June 1986, Pages 4683–4690, https://doi.org/10.1093/nar/14.11.4683. Received: 05 March 1986; revision received: 05 May 1986; accepted: 05 May 1986; published: 11 June 1986.
The cellular attributes of a protein, such as which compartment of a cell it belongs to and how it is associated with the lipid bilayer of an organelle, are closely correlated with its biological functions. The success of the Human Genome Project and the rapid increase in the number of protein sequences entering data banks have stimulated a challenging frontier: how to develop a fast and accurate method to predict the cellular attributes of a protein based on its amino acid sequence? The existing algorithms for predicting these attributes were all based on the amino acid composition, in which no sequence order effect was taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns for protein sequences is extremely large, which has posed a formidable difficulty for realizing this goal. To deal with this difficulty, the pseudo‐amino acid composition is introduced. It is a combination of a set of discrete sequence correlation factors and the 20 components of the conventional amino acid composition. A remarkable improvement in prediction quality has been observed by using the pseudo‐amino acid composition. The success rates of prediction thus obtained are so far the highest for the same classification schemes and same data sets. It has not escaped our notice that the concept of pseudo‐amino acid composition as well as its mathematical framework and biochemical implication may also have a notable impact on improving the prediction quality of other protein features. Proteins 2001;43:246–255. © 2001 Wiley‐Liss, Inc.
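The construction described above, 20 composition components followed by a set of sequence correlation factors, can be sketched as below. This is a simplified illustration and not the exact published formulation: it uses a single residue property (the Kyte-Doolittle hydropathy scale, standardised) where several properties could be combined, and the function name `pseudo_aac`, the number of correlation tiers `lam` and the weight of 0.05 are assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used here as the only residue property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}


def standardize(scale):
    """Shift and scale a property table to zero mean and unit variance."""
    mean = sum(scale.values()) / len(scale)
    sd = math.sqrt(sum((v - mean) ** 2 for v in scale.values()) / len(scale))
    return {aa: (v - mean) / sd for aa, v in scale.items()}


def pseudo_aac(sequence, lam=3, weight=0.05, scale=HYDROPATHY):
    """Return 20 composition terms followed by lam correlation terms."""
    n = len(sequence)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    h = standardize(scale)
    freqs = [sequence.count(aa) / n for aa in AMINO_ACIDS]
    # Tier-k factor: mean squared property difference between residues
    # that are k positions apart, which carries sequence-order information.
    thetas = [
        sum((h[sequence[i]] - h[sequence[i + k]]) ** 2
            for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pseudo_aac("MKTAYIAKQRQISFVKSHFSRQ", lam=3)  # made-up sequence
print(len(vector))  # 23 components: 20 composition + 3 correlation
```

Two sequences with the same composition but a different residue order share the first 20 components up to normalisation yet differ in the correlation terms, which is the order effect the plain composition cannot capture.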
 Prediction of 3-dimensional protein structures from amino acid sequences represents one of the most important problems in computational structural biology. The community-wide Critical Assessment of Structure Prediction (CASP) experiments have been designed to obtain an objective assessment of the state-of-the-art of the field, where I-TASSER was ranked as the best method in the server section of the recent 7th CASP experiment. Our laboratory has since then received numerous requests about the public availability of the I-TASSER algorithm and the usage of the I-TASSER predictions. An on-line version of I-TASSER is developed at the KU Center for Bioinformatics which has generated protein structure predictions for thousands of modeling requests from more than 35 countries. A scoring function (C-score) based on the relative clustering structural density and the consensus significance score of multiple threading templates is introduced to estimate the accuracy of the I-TASSER predictions. A large-scale benchmark test demonstrates a strong correlation between the C-score and the TM-score (a structural similarity measurement with values in [0, 1]) of the first models with a correlation coefficient of 0.91. Using a C-score cutoff > -1.5 for the models of correct topology, both false positive and false negative rates are below 0.1. Combining C-score and protein length, the accuracy of the I-TASSER models can be predicted with an average error of 0.08 for TM-score and 2 Å for RMSD. The I-TASSER server has been developed to generate automated full-length 3D protein structural predictions where the benchmarked scoring system helps users to obtain quantitative assessments of the I-TASSER models. The output of the I-TASSER server for each query includes up to five full-length models, the confidence score, the estimated TM-score and RMSD, and the standard deviation of the estimations. 
The I-TASSER server is freely available to the academic community at http://zhang.bioinformatics.ku.edu/I-TASSER.
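For readers unfamiliar with the TM-score referred to above, the sketch below evaluates the commonly quoted formula for one fixed superposition: each aligned residue pair contributes 1 / (1 + (d / d0)^2), with the length-dependent scale d0 = 1.24 * (L - 15)^(1/3) - 1.8. It is an illustration only. The function names are assumptions, the input is a ready-made list of residue-pair distances, and the search over superpositions that the full measure requires is not attempted.

```python
def tm_d0(target_length):
    """Length-dependent distance scale used in the TM-score formula."""
    if target_length < 19:
        # Below this length the formula gives a non-positive d0;
        # this sketch simply refuses such short targets.
        raise ValueError("target_length must be at least 19 in this sketch")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """Score one superposition from per-residue distances (in angstroms).

    distances holds one value per aligned residue pair; residues of the
    target that are not aligned contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


perfect = tm_score([0.0] * 100, target_length=100)
half_covered = tm_score([tm_d0(100)] * 50, target_length=100)
print(perfect, half_covered)  # 1.0 0.25
```

Because the sum is divided by the target length, a model that covers only part of the target cannot reach a score of 1, and the score stays within [0, 1] as stated above.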
In 2001 and 2002, we published two papers (Bioinformatics, 17, 282-283; Bioinformatics, 18, 77-82) describing an ultrafast protein sequence clustering program called cd-hit. This program can efficiently cluster a huge protein database with millions of sequences. However, the applications of the underlying algorithm are not limited to protein sequence clustering; here we present several new programs using the same algorithm, including cd-hit-2d, cd-hit-est and cd-hit-est-2d. Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database and cd-hit-est-2d compares two nucleotide datasets. All these programs can handle huge datasets with millions of sequences and can be hundreds of times faster than methods based on the popular sequence comparison and database search tools, such as BLAST.
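The general shape of a greedy incremental clustering of this kind can be sketched as follows: sequences are processed from longest to shortest, each one either joins the first existing cluster whose representative is similar enough or becomes the representative of a new cluster. The sketch is a rough stand-in, not the cd-hit algorithm itself. In particular, the fraction of shared short words is used here directly as the similarity measure, which is an assumption made to keep the example short; the function names, word length and threshold are likewise illustrative.

```python
def words(sequence, k):
    """Set of all overlapping words of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def greedy_cluster(sequences, threshold=0.9, k=3):
    """Cluster sequences greedily; returns lists of members per cluster.

    The first member of each list is the cluster representative.
    """
    clusters = []
    for seq in sorted(sequences, key=len, reverse=True):
        seq_words = words(seq, k)
        for cluster in clusters:
            # Fraction of this sequence's words found in the representative.
            shared = len(seq_words & cluster["words"]) / max(1, len(seq_words))
            if shared >= threshold:
                cluster["members"].append(seq)
                break
        else:
            clusters.append({"words": seq_words, "members": [seq]})
    return [c["members"] for c in clusters]


# Made-up sequences: a protein, a truncated copy of it, and an unrelated one.
toy = ["MKTAYIAKQRQISFVK", "MKTAYIAKQRQISF", "GGGGSGGGGSGGGGS"]
print(greedy_cluster(toy))
# [['MKTAYIAKQRQISFVK', 'MKTAYIAKQRQISF'], ['GGGGSGGGGSGGGGS']]
```

Comparing each new sequence only against cluster representatives, and using cheap word counts before any costly comparison, is what keeps this style of clustering fast on large databases.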
WoLF PSORT is an extension of the PSORT II program for protein subcellular location prediction. WoLF PSORT converts protein amino acid sequences into numerical localization features based on sorting signals, amino acid composition and functional motifs such as DNA-binding motifs. After conversion, a simple k-nearest neighbor classifier is used for prediction. Using HTML, the evidence for each prediction is shown in two ways: (i) a list of proteins of known localization with the most similar localization features to the query, and (ii) tables with detailed information about individual localization features. For convenience, sequence alignments of the query to similar proteins and links to UniProt and Gene Ontology are provided. Taken together, this information allows a user to understand the evidence (or lack thereof) behind the predictions made for particular proteins. WoLF PSORT is available at wolfpsort.org.
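The two-stage idea described above, converting a sequence into numeric features and then classifying with k-nearest neighbors, can be sketched in a few lines. This is not the WoLF PSORT feature set: only amino acid composition is used, the distance is plain Euclidean, and the training sequences and labels are invented for the example. Returning the neighbors alongside the label mirrors the point that the most similar known proteins serve as evidence for a prediction.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """20-dimensional amino acid composition (fractions summing to 1)."""
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


def knn_predict(query, training, k=3):
    """Predict a label by majority vote among the k nearest neighbors.

    training is a list of (sequence, label) pairs. Returns the predicted
    label and the neighbors used, as (distance, label, sequence) tuples.
    """
    q = composition(query)
    scored = sorted(
        (math.dist(q, composition(seq)), label, seq)
        for seq, label in training
    )
    neighbors = scored[:k]
    votes = Counter(label for _, label, _ in neighbors)
    return votes.most_common(1)[0][0], neighbors


# Made-up training set: lysine/arginine-rich versus hydrophobic sequences.
training = [
    ("KKKRRKKR", "nuclear"),
    ("KRKRKKRR", "nuclear"),
    ("LLLVVILL", "membrane"),
    ("LVILLVVL", "membrane"),
    ("LLIVLLIV", "membrane"),
]
label, evidence = knn_predict("KKRRKRKK", training, k=3)
print(label)  # nuclear
```

A real predictor would add features for sorting signals and functional motifs and would weight or select features, but the classification step itself stays this simple.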
Motivation: PSORTb has remained the most precise bacterial protein subcellular localization (SCL) predictor since it was first made available in 2003. However, the recall needs to be improved and no accurate SCL predictors yet make predictions for archaea, nor differentiate important localization subcategories, such as proteins targeted to a host cell or bacterial hyperstructures/organelles. Such improvements should preferably be encompassed in a freely available web-based predictor that can also be used as a standalone program. Results: We developed PSORTb version 3.0 with improved recall, higher proteome-scale prediction coverage, and new refined localization subcategories. It is the first SCL predictor specifically geared for all prokaryotes, including archaea and bacteria with atypical membrane/cell wall topologies. It features an improved standalone program, with a new batch results delivery system complementing its web interface. We evaluated the most accurate SCL predictors using 5-fold cross-validation and performed an independent proteomics analysis, showing that PSORTb 3.0 is the most accurate but can benefit from being complemented by Proteome Analyst predictions. Availability: http://www.psort.org/psortb (download open source software or use the web interface). Supplementary Information: Supplementary data are available at Bioinformatics online.
We have developed a new method for the identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server.
The PSIPRED Workbench is a web server that has offered a range of predictive methods to the bioscience community for 20 years. Here, we present the work we have completed to update the PSIPRED Protein Analysis Workbench and make it ready for the next 20 years. The main focus of our recent website upgrade has been the acceleration of analyses in the face of increasing protein sequence database size. We additionally discuss new software, the new hardware infrastructure, our web services and website. Lastly, we survey updates to some of the key predictive algorithms available through our website.
Summary: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds. KofamKOALA is faster than existing KO assignment tools, with accuracy comparable to the best performing tools. Function annotation by KofamKOALA helps link genes to KEGG resources such as the KEGG pathway maps and facilitates molecular network reconstruction. Availability and implementation: KofamKOALA, KofamScan and KOfam are freely available from GenomeNet (https://www.genome.jp/tools/kofamkoala/). Supplementary information: Supplementary data are available at Bioinformatics online.
SMART (Simple Modular Architecture Research Tool) is a web resource (https://smart.embl.de) for the identification and annotation of protein domains and the analysis of protein domain architectures. SMART version 9 contains manually curated models for more than 1300 protein domains, with a topical set of 68 new models added since our last update article (1). All the new models are for diverse recombinase families and subfamilies, and as a set they provide a comprehensive overview of mobile element recombinases, namely transposase, integrase, relaxase, resolvase, cas1 casposase and Xer-like cellular recombinase. Further updates include the synchronization of the underlying protein databases with UniProt (2), Ensembl (3) and STRING (4), greatly increasing the total number of annotated domains and other protein features available in architecture analysis mode. Furthermore, SMART's vector-based protein display engine has been extended and updated to use the latest web technologies, and the domain architecture analysis components have been optimized to handle the increased number of protein features available.
Significance: Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
Abstract Protein structures can provide invaluable information, both for reasoning about biological processes and for enabling interventions such as structure-based drug development or targeted mutagenesis. After decades of effort, 17% of the total residues in human protein sequences are covered by an experimentally determined structure (1). Here we markedly expand the structural coverage of the proteome by applying the state-of-the-art machine learning method, AlphaFold (2), at a scale that covers almost the entire human proteome (98.5% of human proteins). The resulting dataset covers 58% of residues with a confident prediction, of which a subset (36% of all residues) have very high confidence. We introduce several metrics developed by building on the AlphaFold model and use them to interpret the dataset, identifying strong multi-domain predictions as well as regions that are likely to be disordered. Finally, we provide some case studies to illustrate how high-quality predictions could be used to generate biological hypotheses. We are making our predictions freely available to the community and anticipate that routine large-scale and high-accuracy structure prediction will become an important tool that will allow new questions to be addressed from a structural perspective.
 Abstract Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.
Viruses are transmitted through multiple routes and can cause a wide range of diseases. Antiviral peptides (AVPs) have emerged as a cost-effective and low-side-effect strategy for combating viral infections. However, identifying antiviral peptides experimentally is both resource-intensive and time-consuming. With the advancement of artificial intelligence, accurately predicting antiviral peptide sequences has become increasingly critical to accelerate discovery efforts. In this study, we constructed a novel benchmark data set by integrating publicly available databases and literature resources. We developed an antiviral peptide prediction model named iAVP-RFVOT, which employs the BLOSUM62 matrix as the initial feature for peptide sequences and applies uniform manifold approximation and projection (UMAP) embedding learning and Kozachenko-Leonenko estimator-based differential entropy calculation to extract derivative features. Following rigorous feature engineering, data rebalancing to address class imbalance, and optimization of an ensemble random forest classifier, we achieved a 5-fold cross-validation accuracy of 87.6% and a Matthews correlation coefficient of 0.753. On our independently constructed test set, the iAVP-RFVOT model achieved a predictive accuracy of 85.8% and a Matthews correlation coefficient of 0.519, which substantially surpasses the performance of conventional state-of-the-art (SOTA) models.
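The abstract above reports accuracy alongside the Matthews correlation coefficient (MCC), and the two can diverge noticeably when classes are imbalanced. The following is a minimal sketch of how both are computed from binary confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the iAVP-RFVOT study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical imbalanced test set: 100 positives and 900 negatives.
counts = dict(tp=60, tn=800, fp=100, fn=40)
print("accuracy:", round(accuracy(**counts), 3))
print("MCC:", round(matthews_corrcoef(**counts), 3))
```

Because MCC uses all four cells of the confusion matrix, it penalizes a classifier that scores well mainly by predicting the majority class, which is why it is often reported next to accuracy.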
 Network-based GWAS (NetWAS) has advanced brain imaging research by identifying genetic modules associated with brain alterations. However, how imaging risk genes exert functions in brain diseases, particularly their mediation through imaging quantitative traits (iQTs), remains underexplored. We propose a module-level polygenic risk score (MPRS)-based NetWAS framework to uncover genetic modules associated with Alzheimer’s disease (AD) through the mediation of an iQT, using amygdala density as a case study. Our framework integrates genotype data, brain imaging phenotypes, clinical diagnosis of AD, and protein–protein interaction (PPI) networks to identify AD-relevant modules (ADMs) influenced by iQT-associated genetic variants. Specifically, we conducted a genome-wide association study (GWAS) of amygdala density (N=1515) to identify variants associated with iQT. These variants were mapped onto a PPI network and network propagation was performed to prompt amygdala modules. The meta-GWAS of AD (N1=63,926; N2=455,267) was used to calculate MPRS to further identify AD-relevant modules (ADMs). Four modules that showed significant differences in MPRS between AD and controls were identified as ADM. Post-hoc analyses revealed that these ADMs demonstrated strong modularity, showed increased sensitivity to early stages of AD, and significantly mediated the link between ADMs and AD progression through the amygdala. Furthermore, these modules exhibited high tissue specificity within the amygdala and were enriched in AD-related biological pathways. Our MPRS-based framework bridges genetics, intermediate traits, and clinical outcomes and can be adapted for broader biomedical applications.
 This study proposes a deep learning framework for Protein Secondary Structure Prediction (PSSP) that prioritizes computational efficiency while preserving classification accuracy. Leveraging ProtBERT-derived embeddings, we apply autoencoder-based dimensionality reduction to compress high-dimensional sequence representations. These are segmented into fixed-length subsequences, enabling efficient input formatting for a Bi-LSTM-based classifier. Our experiments, conducted on a curated PISCES-based dataset, reveal that reducing input dimensions from 1024 to 256 preserves over 99% of predictive performance (Q3 F1 score: 0.8049 → 0.8023) while reducing GPU memory usage by 67% and training time by 43%. Moreover, subsequence lengths of 50 residues provide an optimal trade-off between contextual learning and training stability. Compared to baseline configurations, the proposed framework reduces training overhead substantially without compromising structural accuracy in both the Q3 and Q8 classification schemes. These findings offer a practical pathway for scalable protein structure prediction, particularly in resource-constrained environments.
 The fast-advancing mass spectrometry and related technologies have greatly extended the depth of coverage in large-scale proteomics studies, including single-cell applications. As sample numbers grow rapidly, it is often challenging to interpret the proteins with missing values that are often presented as "NA" (not available). It could be the evidence of no expression, low expression below the detection threshold, or false negative detection due to technical issues. Existing methods for missing values imputation, while generally useful, rarely consider the non-random NA values that inform biological significance. In the current study, we developed Biologically Informative NA Deconvolution (BIND) that applies an adaptive neighborhood-based modeling to deconvolve the nature of NAs as "biological" (low/no expression) or technical (experimental errors). Applying to multiple cell line datasets and human tissue extracellular vesicle datasets, BIND excavated the NAs that indicated "hallmark absence" of unique proteins. This led to improvements in protein-protein interaction analysis and the identification of novel disease biomarkers. To facilitate its public accessibility, we compiled BIND into a web server that features functional online operations and interactive visualizations. Furthermore, we demonstrated that the BIND server could deconvolve the NAs and improve the analyses of single-cell proteomics datasets. Overall, BIND delineates the biological significance of missing values rather than treating them as a burden, providing a critical perspective for understanding the complex proteome in various biological contexts.
Abstract Accurate prediction of protein function is fundamental to understanding biological processes, with computational methods becoming increasingly essential as experimental methods struggle to keep pace with the rate of newly discovered proteins. Despite significant advances in machine learning approaches, existing methods often fail to capture the complex relationships between protein structure, evolution, and function, leading to limited prediction accuracy. The challenge lies in effectively integrating diverse biological data types while maintaining computational efficiency. Here, we show that GOBeacon, a novel ensemble model integrating structure‐aware protein language model embeddings with protein–protein interaction networks, achieves high accuracy in protein function prediction. By employing a contrastive learning framework, GOBeacon demonstrates superior performance on the sequence‐based CAFA3 benchmark, achieving Fmax scores of 0.561 (BP), 0.583 (MF), and 0.651 (CC), outperforming existing methods including domain‐PFP and DeepGOPlus. The model's effectiveness extends to structure‐based function prediction tasks, where it matches or exceeds the performance of specialized structure‐based tools like HEAL and DeepFRI, while not being explicitly trained on structure. We anticipate that GOBeacon's architecture will serve as a foundation for next‐generation protein analysis tools, while its modular design enables future integration of additional data types and improved prediction capabilities. These advances represent a significant step toward reliable automated protein function annotation, addressing a critical bottleneck in modern biology. GOBeacon is now publicly available: https://github.com/wlin16/GOBeacon.git
 We compared the performance of three widely used protein structure prediction tools - AlphaFold2, ESMFold, and OmegaFold - using a dataset of over 1,300 newly created records from the PDB database. These structures, resolved between July 2022 and July 2024, ensure unbiased evaluation, as they were unavailable during the training of these tools. Using metrics such as root mean square deviation (RMSD), template modeling score (TM-score), and predicted local distance difference test (pLDDT), we found that AlphaFold2 consistently achieves the highest accuracy but depends on high-quality sequence alignments. In contrast, ESMFold and OmegaFold provide faster predictions and excel in challenging cases, such as rapidly evolving or designed proteins with limited sequence homology. Comparing ESMFold and OmegaFold, ESMFold achieves higher confidence scores (pLDDT) and structural similarity (TM-score). OmegaFold is competitive in specific contexts, such as de novo-designed proteins or sequences with limited evolutionary information. Additionally, we demonstrate that machine learning models trained on protein language model embeddings and pLDDT confidence scores can predict potential structure prediction failures, helping to identify challenging cases early in the pipeline.
AlphaFold2 and AlphaFold3 have revolutionized protein structure prediction by enabling high-accuracy tertiary structure predictions for most single-chain proteins (monomers). However, obtaining high-quality predictions for some hard protein targets with shallow or noisy multiple sequence alignments (MSAs) and complicated multi-domain architectures remains challenging. Here, we present MULTICOM4, an integrative protein structure prediction system that uses diverse MSA generation, large-scale model sampling, and an ensemble model quality assessment (QA) strategy that combines individual QA methods to improve model generation and ranking for AlphaFold2 and AlphaFold3. In the 16th Critical Assessment of Techniques for Protein Structure Prediction (CASP16), our predictors built on MULTICOM4 ranked among the top performers out of 120 predictors in tertiary structure prediction and outperformed a standard AlphaFold3 predictor. The average TM-score of the top-1 predictions of our best-performing predictor, MULTICOM, for 84 CASP16 domains is 0.902. In terms of top-1 prediction, it achieved high accuracy (TM-score > 0.9) for 73.8% of the 84 domains and correct fold predictions (TM-score > 0.5) for 97.6% of the domains. In terms of best-of-top-5 prediction, it predicted correct folds for all the domains. The results show that MSA engineering, through the use of different protein sequence databases, alignment tools, and domain segmentation, together with extensive model sampling, is key to generating accurate and correct structural models. Additionally, using multiple complementary QA methods and model clustering can improve the robustness and reliability of model ranking.
The prevalence of leukaemia, a malignant blood cancer that originates from hematopoietic progenitor cells, is increasing in Southeast Asia, with a worrisome fatality rate of 54%. Predicting outcomes in the early stages is vital for improving the chances of patient recovery. The aim of this research is to substantially improve early-stage prediction systems. Using machine learning and data science, we exploit protein sequence data from commonly altered genes, including BCL2, HSP90, PARP, and RB, to make predictions for Chronic Myeloid Leukaemia (CML). Our methodology relies on established feature extraction methods, namely Di-peptide Composition (DPC), Amino Acid Composition (AAC), and Pseudo Amino Acid Composition (Pse-AAC). We also identify and handle outliers, and validate feature selection using the Pearson Correlation Coefficient (PCC). Data augmentation is used to obtain a more comprehensive dataset for analysis. Using several machine learning models, namely Support Vector Machine (SVM), XGBoost, Random Forest (RF), K Nearest Neighbour (KNN), Decision Tree (DT), and Logistic Regression (LR), we achieved accuracy rates ranging from 66% to 94%. These classifiers are evaluated using accuracy, sensitivity, specificity, F1-score, and the confusion matrix. The solution we propose is a user-friendly online dashboard application that can be used for early detection of CML. This tool has significant implications for practitioners and may be used in healthcare institutions and hospitals.
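Two of the sequence features named above, AAC and DPC, are simple frequency vectors. The sketch below shows one common way to compute them, assuming the 20 standard amino acids in alphabetical one-letter order. The function names and the example peptide are illustrative and are not taken from the study.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: fraction of each ordered residue pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    # Count overlapping adjacent pairs, e.g. "ACD" -> "AC", "CD".
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(a + b, 0) / total for a, b in product(AMINO_ACIDS, repeat=2)]


peptide = "MKTAYIAKQR"  # hypothetical example sequence
features = aac(peptide) + dpc(peptide)
print(len(features))
```

Non-standard symbols such as X are ignored by both functions, so the fractions sum to less than 1 for sequences that contain them. Concatenating the two vectors gives a fixed-length input for classifiers such as SVM or Random Forest regardless of sequence length.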
Bitter peptides are short amino acid chains that produce a bitter taste. These peptides are made primarily in food processing through the chemical reduction of peptides. The bitterness arises from the specific sequence of amino acids in peptides, which interact with the bitter taste receptors on the human tongue. These peptides influence nutrition and health, offering insights into protein digestion and bioactive advantages. Hence, correctly identifying bitter peptides is pivotal for revealing the biochemical properties of efficient medication. Computational approaches are well suited to identifying bitterness, yet most studies to date have reported insufficient performance. Therefore, the current study developed an ensemble-based framework called "BitterEN", which integrates the Gradient Boosting (GB) and Multi-layer Perceptron (MLP) methods. Our method improved accuracy by more than 3% compared with state-of-the-art methods, achieving an accuracy of 0.995 on merged features with the Random Forest (RF) feature selection method. We repeated the performance evaluation over 50 iterations to obtain a more reliable estimate of generalization performance. In addition, we provide a convenient GitHub-based version of our bitter peptide identification method, which highlights the practical applicability of these findings. We are optimistic that the proposed approach might benefit many fields, including healthcare development and nutritional science.
 Background Epigenetic modifications play a vital role in the pathogenesis of human diseases, particularly neurodegenerative disorders such as Alzheimer's disease (AD), where dysregulated histone modifications are strongly implicated in disease mechanisms. While recent advances underscore the importance of accurately identifying these modifications to elucidate their contribution to AD pathology, existing computational methods remain limited by their generic approaches that overlook disease-specific epigenetic signatures. Results To bridge this gap, we developed a novel large language model (LLM)-based deep learning framework tailored for disease-contextual prediction of histone modifications and variant effects. Focusing on AD as a case study, we integrated epigenomic data from multiple patient samples to construct a comprehensive, disease-specific histone modification dataset, enabling our model to learn AD-associated molecular signatures. A key innovation of our approach is the incorporation of a Mixture of Experts (MoE) architecture, which effectively distinguishes between disease and healthy epigenetic states, allowing for precise identification of AD-relevant epigenetic modification patterns. Our model demonstrates robust performance in disease-specific histone modification prediction, achieving mean area under receiver-operating curves (AUROCs) ranging from 0.7863 to 0.9142, significantly outperforming existing state-of-the-art methods that lack disease context. Beyond accurate modification site prediction, our framework provides important biological insights by successfully prioritizing AD-associated genetic variants, which show significant enrichment in disease-relevant pathways, supporting their potential functional role in AD pathogenesis. These findings suggest that differential modification loci identified by our model may represent key regulatory elements in AD. 
Conclusions Our framework establishes a powerful new paradigm for epigenetic research that can be extended to other complex diseases, offering both a valuable tool for variant effect interpretation and a promising strategy for uncovering novel disease mechanisms through epigenetic profiling.
Genetic variation in the human genome, much of it in the form of single nucleotide polymorphisms (SNPs), causes considerable variation in phenotype. Determining which SNP in a candidate gene is responsible for a given phenotype is very challenging and can require testing hundreds or thousands of SNPs. SNPs are used to map susceptibility genes involved in complex diseases and to identify the genetic variants that determine an individual's response to different medications. The hardest part of the mapping is deciding which set of SNPs to use: after screening, only SNPs with functional significance should be included in the set selected for a given study. Bioinformatics prediction tools help to distinguish SNPs with functional significance from neutral SNPs. Diabetic nephropathy is one of the microvascular complications of diabetes mellitus, and as it advances patients come to depend on renal replacement therapy. Angiotensin-converting enzyme (ACE) is part of the renin-angiotensin system, which plays an important role in maintaining blood pressure and renal hemodynamics. This study aimed to identify functional SNPs of the ACE isoform 1 precursor gene using bioinformatics tools, to analyze the ACE variant rs267604983 with the SIFT and PROVEAN tools, and to perform HOPE modelling. SIFT and PROVEAN were applied to extract the functional SNPs of the ACE isoform 1 precursor gene. The database yielded about 9,680 SNPs. According to SIFT analysis of the ACE precursor gene, coding variations were 100%. 94% of those projected were met. 30% were destructive, and were tolerated. Merely 6% were synonymous, while the remaining 94% were not. According to PROVEAN, 25% of the samples were harmful and 65% were tolerable. In conclusion, new information about the complexities of diabetic nephropathy may be revealed by combining in silico analysis with wet laboratory research.
For those who are at risk of diabetic nephropathy, customized medicine techniques and targeted medicines may become possible if the predictions made by bioinformatics tools match the results of experiments.
 In recent years, viral diseases have exhibited a significant incidence of infections and fatalities. The analysis of viral genomic sequences can be efficacious in evaluating the present and potentially forthcoming condition of viruses. Considering the importance of the internal structure of the cell and the nucleotide sequences within it, analyzing nucleotide sequences can provide a range of discussable features. On the other hand, it has been demonstrated that the use of graph algorithms and machine learning in the analysis and examination of virus samples and even viral variants can yield beneficial results. This study proposes a novel approach that utilizes complex networks and probabilistic graph modeling methods to analyze viral genomic sequences for feature extraction. The proposed approach, which relies on the PageRank centrality algorithm, operates on codons that are associated with the nucleotide sequences. Experiments with machine learning algorithms were conducted on multiple datasets of viruses and various variants of coronavirus and influenza viruses. The use of a decision tree classifier model on the extracted distinguishing features enabled the differentiation of coronavirus samples from other samples. The high discriminative capability of the graph node centrality feature played a significant role in these experiments, establishing a meaningful connection with genetic concepts as well. The decision tree classifier applied on 173,228 genomic sequence samples originating from 30 distinct virus types, showed a remarkable accuracy rate of 99.73%. The proposed algorithm was successfully tested on several types of viruses, and the interpretability of the extracted features also enabled its structural analysis. The use of a graph-based approach on genetic features containing information about the internal structure of nucleotides yielded results that could be significant for the identification of any type of virus or specific viral variant.
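The abstract above describes extracting features by running PageRank over a graph built from codons. It does not specify how the graph is constructed, so the sketch below makes an explicit assumption: nodes are in-frame codons, and a weighted directed edge u -> v counts how often codon v immediately follows codon u. The power-iteration PageRank is a generic textbook version, and the toy sequence is hypothetical.

```python
def codon_graph(seq):
    """Build a weighted directed graph of consecutive in-frame codons."""
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    edges = {}
    for u, v in zip(codons, codons[1:]):
        edges.setdefault(u, {})
        edges[u][v] = edges[u].get(v, 0) + 1
    return sorted(set(codons)), edges


def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank; rank of dangling nodes is spread uniformly."""
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not edges.get(u))
        new = {u: (1 - damping) / n + damping * dangling / n for u in nodes}
        for u, outs in edges.items():
            total = sum(outs.values())
            for v, weight in outs.items():
                new[v] += damping * rank[u] * weight / total
        rank = new
    return rank


toy_sequence = "ATGGCCATGGCCTAA"  # hypothetical sequence, not real viral data
nodes, edges = codon_graph(toy_sequence)
scores = pagerank(nodes, edges)
for codon in nodes:
    print(codon, round(scores[codon], 3))
```

Under this assumption, the per-codon scores could be laid out in a fixed codon order to form a feature vector for a classifier such as a decision tree.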
Many extracellular components secreted by Group A Streptococcus, also called S. pyogenes, are believed to be virulence factors. Therefore, this study aimed to predict antibiotic target proteins from S. pyogenes, rank the proteins based on their text-mining scores using the network pharmacology approach, and provide an array of genes that could be identified following genome sequencing techniques. Cytoscape software version 3.9.1 was employed in this process. The "STRING: PubMed query" data source was used to import networks from public databases. The species selected was Streptococcus pyogenes, and the PubMed query was Antibiotics Resistance Streptococcus. The confidence score was set at 0.7, while the maximum number of S. pyogenes target proteins to be downloaded was 50. The PubMed query returned 15,614 results, of which 9999 were downloaded, including target proteins from S. pyogenes widely mentioned in published abstracts. Nodes in the networks represent proteins, and edges indicate interactions between the proteins. The outcome displayed target proteins for S. pyogenes, including Pbp1A, gyrA, parC, MefE, gyrB-2, folP, adhP, GlcK, pheS, AKZ49992.1, and scpB, with high text-mining scores arranged in descending order. The text-mining score showed how frequently the target proteins are mentioned in the STRING database, with the most mentioned protein (Pbp1A) having the highest score. Pbp1A, gyrA, parC, MefE, gyrB-2, folP, adhP, GlcK, pheS, AKZ49992.1, and scpB were predicted as proteins with high text-mining evidence in S. pyogenes resistance using the network pharmacology approach.
 Abstract Many antimicrobial peptides (AMPs) function by disrupting the cell membranes of microbes. While this ability is crucial for their efficacy, it also raises questions about their safety. Specifically, the membrane‐disrupting ability could lead to hemolysis. Traditionally, the hemolytic activity of AMPs is evaluated through experiments. To reduce the cost of assessing the safety of an AMP as a drug, we introduce ConsAMPHemo, a two‐stage framework based on deep learning. ConsAMPHemo performs conventional binary classification of the hemolytic activities of AMPs and predicts their hemolysis concentrations through regression. Our model demonstrates excellent classification performance, achieving an accuracy of 99.54%, 82.57%, and 88.04% on three distinct datasets, respectively. Regarding regression prediction, the model achieves a Pearson correlation coefficient of 0.809. Additionally, we identify the correlation between features and hemolytic activity. The insights gained from this work shed light on the underlying physics of the hemolytic nature of an AMP. Therefore, our study contributes to the development of safer AMPs through cost‐effective hemolytic activity prediction and by revealing the design principles for AMPs with low hemolytic toxicity. The codes and datasets of ConsAMPHemo are available at https://github.com/Cpillar/ConsAMPHemo .
 Background: Genetic variation provides a foundation for understanding evolution. With the rise of artificial intelligence, machine learning has emerged as a powerful tool for identifying genomic footprints of evolutionary processes through simulation-based predictive modeling. However, existing approaches require prior knowledge of the factors shaping genetic variation, whereas uncovering anomalous genomic regions regardless of their causes remains an equally important and complementary endeavor. Methods: To address this problem, we introduce ANDES (ANomaly DEtection using Summary statistics), a suite of algorithms that apply statistical techniques to extract features for unsupervised anomaly detection. A key innovation of ANDES is its ability to account for autocovariation due to linkage disequilibrium by fitting curves to contiguous windows and computing their first and second derivatives, thereby capturing the "velocity" and "acceleration" of genetic variation. These features are then used to train models that flag biologically significant or artifactual regions. Results: Application to human genomic data demonstrates that ANDES successfully detects anomalous regions that colocalize with genes under positive or balancing selection. Moreover, these analyses reveal a non-uniform distribution of anomalies, which are enriched in specific autosomes, intergenic regions, introns, and regions with low GC content, repetitive sequences, and poor mappability. Conclusions: ANDES thus offers a novel, model-agnostic framework for uncovering anomalous genomic regions in both model and non-model organisms.
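The "velocity" and "acceleration" features described above come from fitting curves to contiguous windows and differentiating them. The abstract does not state the curve family or window size, so the sketch below assumes a least-squares quadratic fit over a centred window with unit spacing between positions. The function name and the toy series are hypothetical and are not part of ANDES.

```python
def window_derivatives(values, half_width=2):
    """Fit y = a*x^2 + b*x + c by least squares in each centred window.

    Returns two lists (first and second derivative at each window centre),
    assuming unit spacing between consecutive positions.
    """
    xs = list(range(-half_width, half_width + 1))
    n = len(xs)
    s2 = sum(x ** 2 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    first, second = [], []
    for centre in range(half_width, len(values) - half_width):
        ys = values[centre - half_width:centre + half_width + 1]
        sy = sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        sx2y = sum(x * x * y for x, y in zip(xs, ys))
        # Symmetric x values make the odd moments vanish, so the normal
        # equations decouple into closed forms for a and b.
        b = sxy / s2
        a = (n * sx2y - s2 * sy) / (n * s4 - s2 * s2)
        first.append(b)        # slope at the window centre ("velocity")
        second.append(2 * a)   # curvature at the window centre ("acceleration")
    return first, second


# Hypothetical summary statistic along ten consecutive windows: y = x^2.
statistic = [float(x * x) for x in range(10)]
velocity, acceleration = window_derivatives(statistic)
print(velocity)
print(acceleration)
```

Fitting over a window rather than differencing adjacent values smooths out noise, which is one plausible way to respect the correlation between neighbouring windows that the abstract attributes to linkage disequilibrium.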
 Proteins within a family sharing sequence and structure similarity due to a common evolutionary origin often also share functional similarities. Clustering of proteins therefore offers valuable insights, enabling the transfer of features and annotations from well-studied proteins to less-investigated ones. On a local scale, clustering helps identify patterns within specific protein families. On a larger scale, it provides insights into the entire protein universe, showcasing relationships that may not be immediately apparent. Traditionally, this was done at the sequence level or with the use of experimentally resolved protein structures, but the advent of deep learning in protein bioinformatics has brought new options to the table, increasing the breadth, depth, and diversity of similarity metrics and clustering approaches.
Early and accurate detection of Alzheimer's disease (AD) remains a critical challenge for precision health. Traditional cognitive assessments often miss subtle, individualized patterns of decline, while conventional linguistic analyses focus on word-level features that may overlook fine-grained speech disruptions. We test the hypothesis that character-level features in speech transcripts capturing pauses, repetitions, and hesitations at the finest linguistic granularity can serve as novel biomarkers for cognitive decline, revealing personalized linguistic signatures that manifest uniquely in each individual. Our biomarker discovery framework employs symbolic character-level encoding followed by recurrence quantification analysis to transform speech transcripts into visual recurrence plots that reveal temporal speech dynamics. Siamese networks learn embeddings from these plots to capture discriminative patterns at the character level. We validate our hypothesis using the DementiaBank corpus, demonstrating that character-level biomarkers achieve superior discriminative capability compared to conventional word-level approaches (95.9% vs. 87.5% AUC), while providing interpretable recurrence plot visualizations. Our findings establish that character-level linguistic features contain significant biomarker information for cognitive assessment, representing a fundamental shift from word-based to character-based analysis for precision health applications in dementia screening.
 Identifying protective antigens (PAs), i.e., targets for bacterial vaccines, is challenging as conducting in-vivo tests at the proteome scale is impractical. Reverse Vaccinology (RV) aids in narrowing down the pool of candidates through computational screening of proteomes. Within RV, one prominent approach is to train Machine Learning (ML) models to classify PAs. These models can be used to score unseen protein sequences and assist researchers in selecting promising candidates. Traditionally, proteins are fed into these models as vectors of biological and physico-chemical descriptors derived from their residue sequences. However, this method relies on multiple third-party software packages, which may be unreliable, difficult to use, or no longer maintained. Furthermore, selecting descriptors is susceptible to biases. Hence, Protein Sequence Embeddings (PSEs), high-dimensional vector representations of protein sequences obtained from pretrained deep neural networks, have emerged as an alternative to descriptors, offering data-driven feature extraction and a streamlined computational pipeline. We introduce PSEs as a descriptor-free representation of protein sequences for ML in RV. We conducted a thorough comparison of PSE-based and descriptor-based pipelines for PA classification across 10 bacterial species evaluated independently. Our results show that the PSE-based pipeline, which leverages the FAIR ESM-2 protein language model, outperformed the descriptor-based pipeline in 9 out of 10 species, with a mean Area Under the Receiver Operating Characteristic curve (AUROC) of 0.875 versus 0.855. Additionally, it achieved superior performance on the iBPA benchmark (0.86 AUROC vs. 0.82) compared to other methods in the literature. Lastly, we applied the pipeline to rank unseen proteomes based on protective potential to guide candidate selection for pre-clinical testing. Compared to the standard RV practice of ranking candidates according to their biological descriptors, our approach reduces the number of pre-clinical tests needed to identify PAs by up to 83% on average.
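The AUROC figures quoted above can be read as the probability that a randomly chosen protective antigen receives a higher classifier score than a randomly chosen non-antigen. The sketch below computes that quantity directly from pairwise comparisons and then ranks candidates by score. The labels, scores, protein names, and the helper names `auroc` and `rank_candidates` are hypothetical; nothing here reproduces the authors' pipeline or their data.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.
    Tied pairs count as half a win (the Mann-Whitney formulation)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rank_candidates(names: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Order candidate proteins from highest to lowest predicted score."""
    order = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order]


# Hypothetical classifier outputs for four proteins (1 = protective antigen).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_ranking = rank_candidates(["protA", "protB", "protC", "protD"], toy_scores)
```

In this toy case three of the four positive-negative pairs are ordered correctly, so `toy_auroc` is 0.75. A ranking that places true antigens near the top means fewer candidates need to be tested before one is found, which is the sense in which a better-ranked list can reduce pre-clinical testing.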