Abstract

Motivation: Rare diseases remain difficult to diagnose due to limited patient data and genetic diversity, with many cases remaining undiagnosed despite advances in variant prioritization tools. While large language models have shown promise in medical applications, their optimal use for trustworthy and accurate gene prioritization downstream of modern prioritization tools has not been systematically evaluated.

Results: We benchmarked various language models for gene prioritization, using multi-agent and Human Phenotype Ontology (HPO) classification approaches to categorize patient cases by phenotype-based solvability levels. To address language model limitations in ranking large gene sets, we implemented a divide-and-conquer strategy with mini-batching and token limiting for improved efficiency. GPT-4 outperformed the other language models across all patient datasets, demonstrating superior accuracy in ranking causal genes. The multi-agent and HPO classification approaches effectively distinguished between confidently solved and challenging cases. However, we observed two notable language model limitations: bias towards well-studied genes and sensitivity to input order. Our divide-and-conquer strategy enhanced accuracy by overcoming positional bias and the literature-frequency bias towards well-studied genes. Compared with baseline evaluation, this framework improved the overall process of identifying disease-causal genes, better enabling targeted diagnostic and therapeutic interventions and streamlining the diagnosis of rare genetic disorders.

Availability and implementation: Software and additional materials are available at: https://github.com/LiuzLab/GPT-Diagnosis
This work explores the application and optimization of Large Language Models (LLMs) for gene prioritization in the diagnosis of rare genetic diseases, a task that has historically been challenging due to limited patient data and genetic heterogeneity. The paper addresses the urgent need for enhanced diagnostic methodologies by leveraging the advanced reasoning capabilities of LLMs to analyze clinical phenotypes and candidate genes.
The significance of this research lies in demonstrating the considerable potential of LLMs to streamline the diagnostic process for rare genetic disorders, facilitate the reanalysis of previously unsolved cases, and accelerate the discovery of novel disease-associated genes. This advancement is crucial for developing more precise and effective diagnostic and therapeutic interventions. A key contribution is the identification and mitigation of inherent LLM biases, such as favoring well-studied genes and exhibiting sensitivity to input order, which are critical considerations for the reliable deployment of AI in medical diagnostics.
Key innovations introduced include:
1. Comprehensive LLM Benchmarking for Gene Prioritization: The paper provides a systematic evaluation of proprietary (GPT-4, GPT-3.5) and open-source (Mixtral-8x7B, Llama-2-70B, BioMistral-7B) LLMs on real-world patient data from three distinct clinical cohorts (Baylor Genetics, Undiagnosed Diseases Network, Deciphering Developmental Disorders). This benchmarking establishes GPT-4 as the leading performer, ranking the causal gene first in approximately 30% of cases when used in vanilla form (a minimal baseline prompt is sketched after this list).
2. Multi-Agent Classification System: A novel two-step LLM pipeline is developed to categorize patient cases based on the solvability of their phenotype-gene associations. An “evaluator agent” generates an essay assessing the gene candidates, which a “summarizer agent” then condenses to classify the case as having a direct (‘Yes’) or indirect (‘No’) association. This approach helps differentiate confidently solvable from challenging cases, allowing a more nuanced analysis of LLM performance (see the pipeline sketch after this list).
3. Human Phenotype Ontology (HPO) Phenotype Classification: The study integrates HPO-based analysis, using a dataset specificity index (DsI) to quantify the specificity of patient phenotypes. It demonstrates that cases with “Highly Specific HPO” terms consistently yield better LLM performance in identifying causal genes, highlighting the importance of rich phenotypic descriptions (a stand-in specificity computation is sketched after this list).
4. Divide-and-Conquer Strategy for Bias Mitigation: To counteract the observed biases, where LLMs tend to prioritize frequently referenced genes and are sensitive to the input order of gene lists, a novel divide-and-conquer strategy is proposed and validated. The method splits large sets of candidate genes into smaller, uniformly sized subsets, estimates in-group causality probabilities within each subset, and averages these probabilities across multiple random samplings. This technique significantly enhances accuracy, particularly for longer gene lists, and mitigates both literature and positional biases (see the sketch following this list).
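As a reference point for the benchmarking in item 1, the sketch below shows what a vanilla gene-prioritization query might look like. The prompt wording, model name, and output parsing are illustrative assumptions, not the paper's exact protocol; it assumes the OpenAI Python client and an `OPENAI_API_KEY` in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rank_genes_vanilla(phenotypes, candidate_genes, model="gpt-4"):
    """Ask the model for a ranked list of candidate genes given HPO phenotypes.

    Prompt text and output format are illustrative, not the paper's prompts.
    """
    prompt = (
        "Patient phenotypes (HPO terms): " + "; ".join(phenotypes) + "\n"
        "Candidate genes: " + ", ".join(candidate_genes) + "\n"
        "Rank the candidate genes from most to least likely to be causal. "
        "Return one gene symbol per line, most likely first."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for benchmarking
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]
```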
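The two-step pipeline from item 2 can be sketched as two chained chat calls: the evaluator agent writes the assessment essay, and the summarizer agent reduces it to a one-word verdict. The agent prompts below paraphrase the described roles and are assumptions, not the paper's exact wording.

```python
from openai import OpenAI

client = OpenAI()

def ask(content, model="gpt-4", **kwargs):
    """Single-turn chat helper."""
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": content}], **kwargs
    )
    return response.choices[0].message.content

def classify_case(phenotypes, candidate_genes):
    # Step 1: the evaluator agent writes a free-text assessment essay.
    essay = ask(
        "You are an evaluator agent. Write a short essay assessing whether any "
        "of these candidate genes is directly associated with the phenotypes.\n"
        f"Phenotypes: {'; '.join(phenotypes)}\n"
        f"Candidates: {', '.join(candidate_genes)}"
    )
    # Step 2: the summarizer agent condenses the essay to a one-word verdict.
    verdict = ask(
        "You are a summarizer agent. Based on the essay below, answer with "
        "exactly one word, Yes or No: is there a direct phenotype-gene "
        f"association?\n\nEssay:\n{essay}",
        max_tokens=2,  # clip the output to the one-word verdict
    )
    return verdict.strip()  # 'Yes' -> confidently solvable; 'No' -> challenging
```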
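The paper's dataset specificity index (DsI) from item 3 is not reproduced here. As a loose stand-in, the sketch below scores a case by the standard information content of its HPO terms, where rarer (more specific) terms carry more information; the annotation-frequency mapping is an assumed precomputed input, and this formula is only a proxy for the paper's actual index.

```python
import math

def phenotype_specificity(case_hpo_terms, term_annotation_freq):
    """Assumed stand-in for a specificity index: mean information content
    IC(t) = -log p(t), where p(t) is the fraction of annotated diseases
    carrying HPO term t. Rarer terms are more specific, so higher scores
    indicate richer, more specific phenotype descriptions."""
    ics = [
        -math.log(term_annotation_freq[t])
        for t in case_hpo_terms
        if term_annotation_freq.get(t, 0.0) > 0.0
    ]
    return sum(ics) / len(ics) if ics else 0.0
```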
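A minimal sketch of the divide-and-conquer strategy in item 4, assuming a scoring hook: the candidate list is repeatedly shuffled, split into uniform mini-batches, scored within each batch, and the per-gene probabilities are averaged across samplings. `score_batch` is a hypothetical callback standing in for an LLM scoring call (e.g., one of the ranking calls above normalized into in-group probabilities).

```python
import random
from collections import defaultdict

def divide_and_conquer_rank(genes, score_batch, batch_size=10, n_samplings=5, seed=0):
    """Average in-group causality probabilities over repeated random batchings.

    score_batch(batch) -> dict mapping each gene in `batch` to an in-group
    probability; it is a hypothetical hook, not the paper's implementation.
    """
    rng = random.Random(seed)
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_samplings):
        shuffled = genes[:]
        rng.shuffle(shuffled)                    # breaks positional bias
        for i in range(0, len(shuffled), batch_size):
            batch = shuffled[i:i + batch_size]   # uniform mini-batches
            for gene, p in score_batch(batch).items():
                totals[gene] += p
                counts[gene] += 1
    # Average across samplings and rank by mean in-group probability.
    return sorted(genes, key=lambda g: totals[g] / counts[g], reverse=True)
```

Averaging over several shufflings is what decouples a gene's final score from where it happened to sit in any single prompt, which is why the strategy addresses positional bias directly.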
The main prior ingredients that enabled this research include:
* Large Language Models (LLMs): The foundational technology, specifically transformer-based architectures, trained on vast textual datasets. Their emergent capabilities in understanding, processing, and generating human-like text, coupled with their exposure to medical literature during pre-training, are indispensable.
* Human Phenotype Ontology (HPO): A crucial standardized vocabulary that provides a structured representation of phenotypic abnormalities, enabling the systematic description of patient symptoms and facilitating the analysis of phenotype-gene associations.
* Next-Generation Sequencing Data (Whole Exome/Genome Sequencing): The availability of clinical sequencing data, which identifies candidate genetic variants, forms the basis for the gene prioritization task.
* Existing Gene Prioritization Methodologies: Prior research and tools in gene prioritization, though often limited by their reliance on structured databases or by scalability issues for rare diseases, established the conceptual framework and demonstrated the clinical utility of such approaches.
* Probabilistic Ranking and Log-Likelihood Ratios: Established statistical methods used to quantify the likelihood of gene causality, particularly important for open-source LLMs that expose token probabilities (a scoring sketch follows this list).
* Real-World Clinical Datasets: Access to large, resolved patient datasets from clinical diagnostic labs was critical for robust benchmarking and validating the LLM-based gene prioritization methods in a clinically relevant context.
* Prompt Engineering and LLM Interaction Strategies: General techniques for effectively querying LLMs, including ensemble prompting, mini-batch inference, and output clipping, were utilized to enhance the robustness, efficiency, and controllability of LLM responses (an ensemble-prompting sketch follows this list).
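For open-source models that expose token probabilities, one way to realize the probabilistic ranking described above is a log-likelihood ratio over ‘Yes’/‘No’ answer tokens. The sketch below, using the Hugging Face transformers API, is an assumed formulation: the prompt wording and the choice of answer tokens are illustrative, not the paper's.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def yes_no_llr(model, tokenizer, phenotypes, gene):
    """Score one gene by log P('Yes') - log P('No') for the next token."""
    prompt = (
        f"Patient phenotypes: {'; '.join(phenotypes)}\n"
        f"Question: Is {gene} a plausible causal gene for these phenotypes? "
        "Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()

# Usage sketch (model name is illustrative):
# tok = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# mdl = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# scores = {g: yes_no_llr(mdl, tok, hpo_terms, g) for g in candidate_genes}
```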
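Finally, a minimal sketch of two of the interaction strategies named in the last bullet, ensemble prompting and output clipping, under assumed parameter choices: several completions are sampled at nonzero temperature, each gene's positions are averaged across samples, and `max_tokens` clips the response to just the ranked list. This is an illustration of the general techniques, not the paper's implementation.

```python
from collections import defaultdict
from openai import OpenAI

client = OpenAI()

def ensemble_rank(prompt, n_samples=5, model="gpt-4"):
    """Sample several completions and rank genes by their average position."""
    positions = defaultdict(list)
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # diversity across ensemble members
            max_tokens=200,   # output clipping: no explanations, just the list
        )
        text = response.choices[0].message.content
        genes = [g.strip() for g in text.splitlines() if g.strip()]
        for rank, gene in enumerate(genes):
            positions[gene].append(rank)
    # Genes with the lowest average position rank highest.
    return sorted(positions, key=lambda g: sum(positions[g]) / len(positions[g]))
```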