Survey and Improvement Strategies for Gene Prioritization with Large Language Models

Abstract

Motivation: Rare diseases remain difficult to diagnose due to limited patient data and genetic diversity, with many cases remaining undiagnosed despite advances in variant prioritization tools. While large language models have shown promise in medical applications, their optimal application for trustworthy and accurate gene prioritization downstream of modern prioritization tools has not been systematically evaluated.

Results: We benchmarked various language models for gene prioritization using multi-agent and Human Phenotype Ontology classification approaches to categorize patient cases by phenotype-based solvability levels. To address language model limitations in ranking large gene sets, we implemented a divide-and-conquer strategy with mini-batching and token limiting for improved efficiency. GPT-4 outperformed other language models across all patient datasets, demonstrating superior accuracy in ranking causal genes. Multi-agent and Human Phenotype Ontology classification approaches effectively distinguished between confidently solved and challenging cases. However, we observed bias towards well-studied genes and input order sensitivity as notable language model limitations. Our divide-and-conquer strategy enhanced accuracy, overcoming positional and gene frequency biases in the literature. This framework optimized the overall process for identifying disease-causal genes compared to baseline evaluation, better enabling targeted diagnostic and therapeutic interventions and streamlining diagnosis of rare genetic disorders.

Availability and implementation: Software and additional material is available at https://github.com/LiuzLab/GPT-Diagnosis

Locations

  • Bioinformatics Advances
  • arXiv (Cornell University)

Summary

This work explores the application and optimization of Large Language Models (LLMs) for gene prioritization in the diagnosis of rare genetic diseases, an area historically challenging due to limited patient data and genetic heterogeneity. The paper addresses the urgent need for enhanced diagnostic methodologies by leveraging the advanced reasoning capabilities of LLMs to analyze clinical phenotypes and candidate genes.

The significance of this research lies in demonstrating the considerable potential of LLMs to streamline the diagnostic process for rare genetic disorders, facilitate the reanalysis of previously unsolved cases, and accelerate the discovery of novel disease-associated genes. This advancement is crucial for developing more precise and effective diagnostic and therapeutic interventions. A key contribution is the identification and mitigation of inherent LLM biases, such as favoring well-studied genes and exhibiting sensitivity to input order, which are critical considerations for the reliable deployment of AI in medical diagnostics.

Key innovations introduced include:
1. Comprehensive LLM Benchmarking for Gene Prioritization: The paper provides a systematic evaluation of various proprietary (GPT-4, GPT-3.5) and open-source (Mixtral-8x7B, Llama-2-70B, BioMistral-7B) LLMs on real-world patient data from three distinct clinical cohorts (Baylor Genetics, Undiagnosed Diseases Network, Deciphering Developmental Disorders). This benchmarking establishes GPT-4 as the leading performer, achieving approximately 30% accuracy in vanilla form for top-ranked causal genes.
2. Multi-Agent Classification System: A novel two-step LLM pipeline is developed to categorize patient cases based on the solvability of their phenotype-gene associations. An “evaluator agent” generates an essay assessing gene candidates, which is then summarized by a “summarizer agent” to classify cases as having a direct (‘Yes’) or indirect (‘No’) association. This approach helps differentiate between confidently solvable and challenging cases, allowing for more nuanced analysis of LLM performance.
3. Human Phenotype Ontology (HPO) Phenotype Classification: The study integrates HPO-based analysis, using a dataset specificity index (DsI), to quantify the specificity of patient phenotypes. It demonstrates that cases with “Highly Specific HPO” terms consistently lead to better LLM performance in identifying causal genes, highlighting the importance of rich phenotypic descriptions.
4. Divide-and-Conquer Strategy for Bias Mitigation: To counteract observed biases where LLMs tend to prioritize frequently referenced genes and are sensitive to the input order of gene lists, a novel divide-and-conquer strategy is proposed and validated. This method breaks down the gene prioritization task by splitting large sets of candidate genes into smaller, uniformly sized subsets, estimating in-group probabilities, and averaging these probabilities across multiple samplings. This technique significantly enhances accuracy, particularly for longer gene lists, and effectively mitigates both literature and positional biases (a minimal sketch follows this list).
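The divide-and-conquer step in item 4 can be made concrete with a short sketch. This is a minimal illustration under assumptions, not the paper's implementation: `llm_rank_batch` is a hypothetical stand-in for one LLM call returning per-gene in-group probabilities, the batch size and sampling count are arbitrary defaults, and the paper's token-limiting details are omitted.

```python
import random
from collections import defaultdict

def llm_rank_batch(phenotypes, gene_batch):
    """Placeholder for one LLM call that scores a small gene batch.

    Expected to return a dict mapping each gene in gene_batch to an
    in-group probability of being causal (higher = more likely). A real
    pipeline would parse this from the model's response.
    """
    raise NotImplementedError("wire up an LLM client here")

def divide_and_conquer_rank(phenotypes, genes, batch_size=10, n_samplings=5, seed=0):
    """Average in-group probabilities over repeated random partitions of
    the candidate list, keeping each LLM call small and exposing every
    gene to many different input positions and groupings."""
    rng = random.Random(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_samplings):
        shuffled = genes[:]
        rng.shuffle(shuffled)  # new order and new groupings each round
        for i in range(0, len(shuffled), batch_size):
            batch = shuffled[i:i + batch_size]
            for gene, p in llm_rank_batch(phenotypes, batch).items():
                totals[gene] += p
                counts[gene] += 1
    avg = {g: totals[g] / counts[g] for g in genes}
    return sorted(genes, key=avg.get, reverse=True)
```

Because each gene is re-scored under many random orderings and batch compositions, the averaged score is less sensitive to input position and to which frequently cited genes happen to share its batch.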

The main prior ingredients that enabled this research include:
* Large Language Models (LLMs): The foundational technology, specifically transformer-based architectures, trained on vast textual datasets. Their emergent capabilities in understanding, processing, and generating human-like text, coupled with their exposure to medical literature during pre-training, are indispensable.
* Human Phenotype Ontology (HPO): A crucial standardized vocabulary that provides a structured representation of phenotypic abnormalities, enabling the systematic description of patient symptoms and facilitating the analysis of phenotype-gene associations.
* Next-Generation Sequencing Data (Whole Exome/Genome Sequencing): The availability of clinical sequencing data, which identifies candidate genetic variants, forms the basis for the gene prioritization task.
* Existing Gene Prioritization Methodologies: Prior research and tools in gene prioritization, while often limited by structured database reliance or scalability issues for rare diseases, established the conceptual framework and demonstrated the clinical utility of such approaches.
* Probabilistic Ranking and Log-Likelihood Ratios: Established statistical methods used to quantify the likelihood of gene causality, particularly important for open-source LLMs that can output token probabilities (see the sketch after this list).
* Real-World Clinical Datasets: Access to large, resolved patient datasets from clinical diagnostic labs was critical for robust benchmarking and validating the LLM-based gene prioritization methods in a clinically relevant context.
* Prompt Engineering and LLM Interaction Strategies: General techniques for effectively querying LLMs, including ensemble prompting, mini-batch inference, and output clipping, were utilized to enhance the robustness, efficiency, and control of LLM responses.
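The log-likelihood-ratio ingredient above can be sketched for any open-weight model whose token probabilities are accessible. The model choice, prompt wording, and Yes/No framing below are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any open-weight causal LM with accessible
# logits works the same way.
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def yes_no_log_odds(phenotypes: str, gene: str) -> float:
    """Score one candidate as log P('Yes') - log P('No') for the next
    token: a simple log-likelihood ratio usable for ranking."""
    prompt = (f"Patient phenotypes: {phenotypes}\n"
              f"Is {gene} a plausible causal gene? Answer Yes or No.\nAnswer:")
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]  # next-token logits
    logp = torch.log_softmax(logits, dim=-1)
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (logp[yes_id] - logp[no_id]).item()

# Candidates can then be ranked by descending log-odds:
# ranked = sorted(genes, key=lambda g: yes_no_log_odds(hpo_terms, g), reverse=True)
```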

Rare diseases are challenging to diagnose due to limited patient data and genetic diversity. Despite advances in variant prioritization, many cases remain undiagnosed. While large language models (LLMs) have performed well in medical exams, their effectiveness in diagnosing rare genetic diseases has not been assessed. To identify causal genes, we benchmarked various LLMs for gene prioritization. Using multi-agent and Human Phenotype Ontology (HPO) classification, we categorized patients based on phenotypes and solvability levels. As gene set size increased, LLM performance deteriorated, so we used a divide-and-conquer strategy to break the task into smaller subsets. At baseline, GPT-4 outperformed other LLMs, achieving nearly 30% accuracy in ranking causal genes correctly. The multi-agent and HPO approaches helped distinguish confidently solved cases from challenging ones, highlighting the importance of known gene-phenotype associations and phenotype specificity. We found that cases with specific phenotypes or clear associations were more accurately solved. However, we observed biases toward well-studied genes and input order sensitivity, which hindered gene prioritization. Our divide-and-conquer strategy improved accuracy by overcoming these biases. By utilizing HPO classification, novel multi-agent techniques, and our LLM strategy, we improved causal gene identification accuracy compared to our baseline evaluation. This approach streamlines rare disease diagnosis, facilitates reanalysis of unsolved cases, and accelerates gene discovery, supporting the development of targeted diagnostics and therapies.
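The two-step multi-agent classification mentioned in this abstract (an evaluator agent whose free-text assessment is reduced to a Yes/No label by a summarizer agent) can be sketched as two chained chat calls. The prompts below are paraphrased assumptions, not the paper's wording, and the OpenAI-style client is just one possible backend.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(system: str, user: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def classify_solvability(phenotypes: str, genes: list[str]) -> str:
    # Step 1: the evaluator agent writes a free-text assessment of how
    # well each candidate gene explains the observed phenotypes.
    essay = chat(
        "You are a clinical geneticist evaluating candidate genes.",
        f"Phenotypes: {phenotypes}\nCandidates: {', '.join(genes)}\n"
        "Assess whether any candidate has a direct, well-established "
        "association with these phenotypes.",
    )
    # Step 2: the summarizer agent collapses the essay to a label.
    verdict = chat(
        "You summarize assessments into a single word.",
        f"Assessment:\n{essay}\n\nAnswer strictly 'Yes' if a direct "
        "association was identified, otherwise 'No'.",
    )
    return "Yes" if verdict.strip().lower().startswith("yes") else "No"
```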
Phenotype-driven gene prioritization is a critical process in the diagnosis of rare genetic disorders for identifying and ranking potential disease-causing genes based on observed physical traits or phenotypes. While traditional approaches rely on curated knowledge graphs with phenotype-gene relations, recent advancements in large language models have opened doors to the potential of AI predictions through extensive training on diverse corpora and complex models. This study conducted a comprehensive evaluation of five large language models, including two from the Generative Pre-trained Transformer series and three from the Llama2 series, assessing their performance across three key metrics: task completeness, gene prediction accuracy, and adherence to required output structures. Various experiments explored combinations of models, prompts, input types, and task difficulty levels. Our findings reveal that even the best-performing LLM, GPT-4, achieved an accuracy of 16.0%, which still lags behind traditional bioinformatics tools. Prediction accuracy increased with parameter/model size. A similar increasing trend was observed for the task completion rate, with complicated prompts more likely to increase task completeness in models smaller than GPT-4. However, complicated prompts were also more likely to decrease the structure compliance rate, although no prompt effect was observed for GPT-4. With free-text input, the LLMs also achieved better-than-random prediction accuracy, though slightly lower than with HPO term-based input. Bias analysis showed that certain genes, such as MECP2, CDKL5, and SCN1A, are more likely to be top-ranked, potentially explaining the variances observed across different datasets. This study provides valuable insights into the integration of LLMs within genomic analysis, contributing to the ongoing discussion on the utilization of advanced LLMs in clinical workflows.
The intricate relationship between genetic variation and human diseases has been a focal point of medical research, evidenced by the identification of risk genes for specific diseases. The advent of advanced genome sequencing techniques has significantly improved the efficiency and cost-effectiveness of detecting these genetic markers, playing a crucial role in disease diagnosis and forming the basis for clinical decision-making and early risk assessment. To overcome the limitations of existing databases that record disease-gene associations from existing literature, which often lack real-time updates, we propose a novel framework employing Large Language Models (LLMs) for the discovery of diseases associated with specific genes. This framework aims to automate the labor-intensive process of sifting through medical literature for evidence linking genetic variations to diseases, thereby enhancing the efficiency of disease identification. Our approach involves using LLMs to conduct literature searches, summarize relevant findings, and pinpoint diseases related to specific genes. This paper details the development and application of our LLM-powered framework, demonstrating its potential in streamlining the complex process of literature retrieval and summarization to identify diseases associated with specific genetic variations.
We evaluated the ability of large language models (LLMs) to generate clinically accurate pharmacogenomic (PGx) recommendations aligned with CPIC guidelines. Using a benchmark of 599 curated gene–drug–phenotype scenarios, we compared five leading models, including GPT-4o and fine-tuned LLaMA variants, through both standard lexical metrics and a novel semantic evaluation framework (LLM Score) validated by expert review. General-purpose models frequently produced incomplete or unsafe outputs, while our domain-adapted model achieved superior performance, with an LLM Score of 0.92 and significantly faster inference speed. Results highlight the importance of fine-tuning and structured prompting over model scale alone. This work establishes a robust framework for evaluating PGx-specific LLMs and demonstrates the feasibility of safer, AI-driven personalized medicine.
The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns and predictive insights that traditional methods might miss, thus enhancing the predictive accuracy of genetic variant pathogenicity. This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state-of-the-art tools, especially in handling ambiguous and clinically uncertain variants. The results of this research underline the efficacy of combining multiple modeling approaches to significantly refine the accuracy and reliability of genetic variant classification systems. These findings support the deployment of these advanced computational models in clinical environments, where they can significantly enhance the diagnostic processes for genetic disorders, ultimately pushing the boundaries of personalized medicine by offering more detailed and actionable genetic insights.
Classifying cancer genetic variants based on clinical actionability is crucial yet challenging in precision oncology. Large language models (LLMs) offer potential solutions, but their performance remains underexplored. This study evaluates GPT-4o, Llama 3.1, and Qwen 2.5 in classifying genetic variants from the OncoKB and CIViC databases, as well as a real-world dataset derived from FoundationOne CDx reports. GPT-4o achieved the highest accuracy (0.7318) in distinguishing clinically relevant variants from variants of unknown clinical significance (VUS), outperforming Qwen 2.5 (0.5731) and Llama 3.1 (0.4976). LLMs demonstrated better concordance with expert annotations for variants with strong clinical evidence but exhibited greater inconsistencies for those with weaker evidence. All three models showed a tendency to assign variants to higher evidence levels, suggesting a propensity for overclassification. Prompt engineering significantly improved accuracy, while retrieval-augmented generation (RAG) further enhanced performance. Stability analysis across 100 iterations revealed greater consistency with the CIViC system than with OncoKB. These findings highlight the promise of LLMs in cancer genetic variant classification while underscoring the need for further optimization to improve accuracy, consistency, and clinical applicability.
To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models - PhenoBCBERT and PhenoGPT - for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes, due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models (LLMs) to automate the detection of phenotype terms, including those not in the current HPO. We compared these models to PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also showed strong performance in case studies on biomedical literature. We evaluated the strengths and weaknesses of BERT-based and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.
Background. Large Language Models (LLMs) hold promise for improving genetic variant literature review in clinical testing. We assessed Generative Pretrained Transformer 4's (GPT-4) performance, nondeterminism, and drift to inform its suitability for use in complex clinical processes. Methods. A 2-prompt process for classification of functional evidence was optimized using a development set of 45 articles. The prompts asked GPT-4 to supply all functional data present in an article related to a variant or indicate that no functional evidence is present. For articles indicated as containing functional evidence, a second prompt asked GPT-4 to classify the evidence into pathogenic, benign, or intermediate/inconclusive categories. A final test set of 72 manually classified articles was used to test performance. Results. Over a 2.5-month period (Dec 2023-Feb 2024), we observed substantial differences in intraday (nondeterminism) and across-day (drift) results, which lessened after 1/18/24. This variability is seen within and across models in the GPT-4 series, affecting different performance statistics to different degrees. Twenty runs after 1/18/24 identified articles containing functional evidence with 92.2% sensitivity, 95.6% positive predictive value (PPV) and 86.3% negative predictive value (NPV). The second prompt identified pathogenic functional evidence with 90.0% sensitivity, 74.0% PPV and 95.3% NPV, and benign evidence with 88.0% sensitivity, 76.6% PPV and 96.9% NPV. Conclusion. Nondeterminism and drift within LLMs must be assessed and monitored when introducing LLM-based functionality into clinical workflows. Failing to do this assessment or account for these challenges could lead to incorrect or missing information that is critical for patient care. The performance of our prompts appears adequate to assist in article prioritization but not in automated decision making.
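The 2-prompt process this study describes follows a simple gate-then-classify pattern. A minimal sketch, with `llm` as an assumed text-in/text-out callable and paraphrased prompts rather than the study's exact wording:

```python
def review_article(article_text: str, variant: str, llm) -> str:
    """Return 'none', 'pathogenic', 'benign', or 'intermediate/inconclusive'."""
    # Prompt 1: extract functional evidence, or report that none exists.
    evidence = llm(
        f"List all functional data in this article related to {variant}. "
        f"If there is none, reply 'NO EVIDENCE'.\n\n{article_text}"
    )
    if "NO EVIDENCE" in evidence.upper():
        return "none"
    # Prompt 2: classify only articles that passed the evidence gate.
    label = llm(
        "Classify the following functional evidence as pathogenic, "
        f"benign, or intermediate/inconclusive:\n\n{evidence}"
    )
    return label.strip().lower()
```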
Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes were created from 50 case reports from PubMed Central, incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.
Rare diseases present unique challenges in healthcare, often suffering from delayed diagnosis and fragmented information landscapes. The scarcity of reliable knowledge in these conditions poses a distinct challenge for Large Language Models (LLMs) in supporting clinical management and delivering precise patient information, underscoring the need for focused training on these 'zebra' cases. We present Zebra-Llama, a specialized context-aware language model with high-precision Retrieval Augmented Generation (RAG) capability, focusing on Ehlers-Danlos Syndrome (EDS) as our case study. EDS, affecting 1 in 5,000 individuals, exemplifies the complexities of rare diseases with its diverse symptoms, multiple subtypes, and evolving diagnostic criteria. By implementing a novel context-aware fine-tuning methodology trained on questions derived from medical literature, patient experiences, and clinical resources, along with expertly curated responses, Zebra-Llama demonstrates unprecedented capabilities in handling EDS-related queries. On a test set of real-world questions collected from EDS patients and clinicians, medical experts evaluated the responses generated by both models, revealing Zebra-Llama's substantial improvements over the base model (Llama 3.1-8B-Instruct) in thoroughness (77.5% vs. 70.1%), accuracy (83.0% vs. 78.8%), clarity (74.7% vs. 72.0%) and citation reliability (70.6% vs. 52.3%). Released as an open-source resource, Zebra-Llama not only provides more accessible and reliable EDS information but also establishes a framework for developing specialized AI solutions for other rare conditions. This work represents a crucial step towards democratizing expert-level knowledge in rare disease management, potentially transforming how healthcare providers and patients navigate the complex landscape of rare diseases.
We introduce Phen-Gen, a method which combines a patient's disease symptoms and sequencing data with prior domain knowledge to identify the causative gene(s) for rare disorders. Simulations reveal that the causal variant is ranked first in 88% of cases when it is coding, a 52% advantage over a genotype-only approach, outperforming existing methods by 13-58%. If disease etiology is unknown, the causal variant is assigned top rank in 71% of simulations.
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at https://github.com/Liu-Hy/GenoTex.
Rare variant association analysis, which assesses the aggregate effect of rare damaging variants within a gene, is a powerful strategy for advancing knowledge of human biology. Numerous models have been proposed to identify damaging coding variants, with the most recent ones employing deep learning and large language models (LLM) to predict the impact of changes in coding sequences. Here, we use newly available proteomics data on 2,898 proteins across 46,665 individuals to evaluate and refine LLM predictors of damaging variants. Using one of these refined models, we evaluate association between rare damaging variants and human phenotypes at 241 positive control gene-trait pairs. Among these gene-trait pairs, our proteomics-guided model outperforms an ensemble of conventional approaches including PolyPhen2, Mutation Taster, SIFT, and LRT, as well as newer machine learning approaches for identifying damaging missense variants, such as CADD, ESM-1v, ESM-1b and AlphaMissense. When attempting to recover known associations by correctly separating damaging singleton missense variants from other singleton variants, our approach recapitulates 36.5% of gene-trait pairs with known associations, exceeding all the alternatives we considered. Furthermore, when we apply our model to 10 exemplary traits from the UK Biobank, we identify 177 gene-trait associations - again exceeding all other approaches. Our results demonstrate that summary statistics from large-scale human proteomics data enable evaluation and refinement of coding variant classification LLMs, improving discovery potential in human genetic studies.
Predicting phenotypes with complex genetic bases based on a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are utilized for this task, yet the high dimensional nature of genotype data makes the analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs in feature selection and engineering for tabular genotype data, with a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
This week’s recap highlights a compendium of human gene functions derived from evolutionary modelling from the Gene Ontology Consortium, an AI reasoning model applied to rare disease diagnosis, an agentic AI for scRNA-seq data exploration, and applying FAIR principles to scientific workflows.
Objectives: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of Natural Language Processing (NLP) techniques, particularly Large Language Models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data. Methods: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type. Results: A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility. Discussion: The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.
Term normalization is the process of mapping a term from free text to a standardized concept and its machine-readable code in an ontology. Accurate normalization of terms that capture phenotypic differences between patients and diseases is critical to the success of precision medicine initiatives. A large language model (LLM), such as GPT-4o, can normalize terms to the Human Phenotype Ontology (HPO), but it may retrieve incorrect HPO IDs. Reported accuracy rates for LLMs on these tasks may be inflated due to imbalanced test datasets skewed towards high-frequency terms. In our study, using a comprehensive dataset of 268,776 phenotype annotations for 12,655 diseases from the HPO, GPT-4o achieved an accuracy of 13.1% in normalizing 11,225 unique terms. However, the accuracy was unevenly distributed, with higher-frequency and shorter terms normalized more accurately than lower-frequency and longer terms. Feature importance analysis, using SHAP and permutation methods, identified low term frequency as the most significant predictor of normalization errors. These findings suggest that training and evaluation datasets for LLM-based term normalization should balance low- and high-frequency terms to improve model performance, particularly for infrequent terms critical to precision medicine.
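The frequency-stratified evaluation this abstract argues for is straightforward to reproduce. Below is a minimal sketch with assumed, hypothetical column names ('term', 'predicted_id', 'true_id', 'frequency'), binning terms by their annotation frequency and reporting per-bin accuracy.

```python
import pandas as pd

def accuracy_by_frequency(results: pd.DataFrame, n_bins: int = 4) -> pd.Series:
    """Bin terms into frequency quantiles and report per-bin accuracy,
    exposing whether rare terms are normalized worse than common ones."""
    results = results.assign(correct=results["predicted_id"] == results["true_id"])
    bins = pd.qcut(results["frequency"], q=n_bins, duplicates="drop")
    return results.groupby(bins, observed=True)["correct"].mean()
```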
Objective: ICD codes are commonly used to filter patient cohorts but may not accurately reflect disease presence. Furthermore, many health problems are recorded in unstructured clinical notes, complicating cohort discovery from EHR data. Existing computed phenotyping methods have limitations in identifying evolving disease patterns and incomplete modeling. This study explores the potential of LLMs by evaluating GPT-4o's type II diabetes mellitus (T2DM) phenotyping ability using Retrieval-Augmented Generation (RAG). Methods: A RAG system was built, leveraging the entire notes of 275 patients. We performed a total of 336 experiments to study the sensitivity of RAG to various chunk sizes, numbers of chunks, and prompts across seven embedding models. The effectiveness of GPT-4o in T2DM phenotyping was then assessed using optimized RAG configurations, compared with ICD code and PheNorm phenotype performance. Token usage was also evaluated. Results: The results show that GPT-4o with optimized RAG significantly outperformed ICD-10 and PheNorm in sensitivity, NPV, and F1, although PPV and specificity need improvement. When used with general embedding models or a zero-shot prompt, the results showed better sensitivity, NPV, and F1 scores, while domain-specific models and a few-shot prompt excelled in specificity and PPV. Furthermore, RAG optimization allowed lower-ranked embedding models to achieve reliable performance. Gte-Qwen2-1.5B-instruct and GatorTronS provided the highest performance on specific evaluation metrics at a substantially lower cost. Conclusion: Optimized RAG configurations significantly enhanced key performance metrics compared to existing methods. This study provides valuable insights into optimal configurations and cost-effective embedding model choices, while identifying limitations such as ranking issues and contextual misinterpretation by the LLM.
Identifying disease phenotypes from electronic health records (EHRs) is critical for numerous secondary uses. Manually encoding physician knowledge into rules is particularly challenging for rare diseases due to inadequate EHR coding, necessitating review of clinical notes. Large language models (LLMs) offer promise in text understanding but may not efficiently handle real-world clinical documentation. We propose a zero-shot LLM-based method enriched by retrieval-augmented generation and MapReduce, which pre-identifies disease-related text snippets to be used in parallel as queries for the LLM to establish diagnosis. We show that this method, as applied to pulmonary hypertension (PH), a rare disease characterized by elevated arterial pressures in the lungs, significantly outperforms physician logic rules (F1 score of 0.62 vs. 0.75). This method has the potential to enhance rare disease cohort identification, expanding the scope of robust clinical research and care gap identification.
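The retrieval-plus-MapReduce pattern described here can be sketched as a parallel "map" over pre-identified snippets followed by a "reduce" into a patient-level call. The retriever and per-snippet LLM call below are stubbed assumptions, and the paper's actual reduce step is itself an LLM judgment rather than the simple vote used here.

```python
from concurrent.futures import ThreadPoolExecutor

def retrieve_snippets(note_text: str, disease_terms: list[str]) -> list[str]:
    """Placeholder retriever: keep sentences mentioning any query term.
    A real system would use embedding-based retrieval."""
    return [s for s in note_text.split(".")
            if any(t.lower() in s.lower() for t in disease_terms)]

def llm_snippet_vote(snippet: str) -> bool:
    """Placeholder for one LLM call: does this snippet support the
    diagnosis? Replace with a real client call."""
    raise NotImplementedError("wire up an LLM client here")

def map_reduce_diagnose(note_text: str, disease_terms: list[str]) -> bool:
    snippets = retrieve_snippets(note_text, disease_terms)
    with ThreadPoolExecutor() as pool:   # "map": score snippets in parallel
        votes = list(pool.map(llm_snippet_vote, snippets))
    return any(votes)                    # "reduce": simple stand-in vote
```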