Enhancing SNV Prioritization in WGS Cancer Diagnostics
2024 (English) Independent thesis Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Background: Whole-genome sequencing (WGS) has revolutionized cancer diagnostics, enabling the identification of somatic single nucleotide variants (SNVs) critical for precision oncology. However, interpreting these variants remains challenging due to annotation inconsistencies and biases, particularly in tumor-only sequencing workflows. To address these challenges, the Balsamic pipeline integrates a scoring model that ranks variants by their likelihood of pathogenicity, with the aim of prioritizing clinically relevant variants in tumor-only analyses. This thesis evaluates that scoring model. Using an initial dataset of 8,145 variants and an extended dataset derived from ClinVar, the model's ability to distinguish between benign and pathogenic variants was assessed.
Results: In the initial analysis, the model showed high specificity for benign variants, with 98% correctly classified, but struggled with pathogenic variants, achieving a recall of 54% and an F1-score of 0.59 at the optimal threshold. Feature contribution analysis identified "Consequence" (CON) and "Clinical Significance" (CLIN) as key predictors, which motivated reweighting these features to improve separation between the two groups. Using the extended dataset, model performance improved significantly, achieving a precision of 92% and an F1-score of 0.90 at the optimal threshold. This demonstrates the potential of balanced datasets and combined features, such as the newly introduced "COMBINED SCORE", for enhancing classification accuracy.
Conclusions: This work highlights the importance of tailored feature weighting, dataset balance, and innovative feature engineering in improving variant prioritization workflows, and provides a foundation for improving variant interpretation in precision oncology. Future research should focus on integrating real-world clinical data and leveraging machine learning to further refine predictive capabilities.
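The precision, recall, and F1-score figures reported above are all tied to a score threshold. As a minimal sketch of how such a threshold might be selected, the Python below sweeps candidate cut-offs over a set of scored variants and keeps the one with the highest F1-score. The function names (`precision_recall_f1`, `best_f1_threshold`), the toy scores, and the convention that a score at or above the threshold means "predicted pathogenic" are illustrative assumptions; they are not taken from the thesis or from the Balsamic pipeline.

```python
from typing import List, Tuple


def precision_recall_f1(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float, float]:
    """Precision, recall and F1 when score >= threshold counts as pathogenic.

    labels: 1 = pathogenic, 0 = benign (assumed encoding for this sketch).
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1_threshold(scores: List[float], labels: List[int]) -> float:
    """Return the observed score value that maximizes F1 as a cut-off."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Toy data only: hypothetical variant scores with benign (0) / pathogenic (1) labels.
toy_scores = [1, 2, 3, 4, 5, 6, 7, 8]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold = best_f1_threshold(toy_scores, toy_labels)
precision, recall, f1 = precision_recall_f1(toy_scores, toy_labels, threshold)
print(f"threshold={threshold} precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, a threshold chosen this way trades the two off against each other, which is why a model can have high specificity for benign variants and still reach only a moderate F1-score on pathogenic ones.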
Place, publisher, year, edition, pages
2024. , p. 50
National Category
Medical Genetics and Genomics
Identifiers
URN: urn:nbn:se:his:diva-24866 OAI: oai:DiVA.org:his-24866 DiVA, id: diva2:1932387
External cooperation
Clinical Genomics, SciLifeLab
Subject / course
Bioinformatics
Educational program
Bioinformatics - Master’s Programme
Supervisors
Examiners
2025-01-31 2025-01-29 2025-09-29 Bibliographically approved