Independent thesis Advanced level (degree of Master (Two Years)), 30 credits / 45 HE credits
In this work we present a novel method to extract potential hub genes, transcription factors, and regions with densely interconnected protein-protein interaction (PPI) networks from RNA-seq data. To achieve this, we train variational autoencoders, a generative machine learning framework, and extract the gene-wise reconstruction errors. Here, the reconstruction error produced during training is treated as a measure of a gene's impact on the transcriptome.
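As a minimal sketch of what a gene-wise reconstruction error could look like, the following computes a per-gene mean squared error between an expression matrix and its reconstruction, then ranks genes from lowest to highest error. The choice of mean squared error, the function names, and the toy gene labels are assumptions made for illustration; the abstract does not state the exact error measure, and the reconstruction would in practice come from the trained variational autoencoder rather than from hard-coded values.

```python
from statistics import fmean


def gene_wise_reconstruction_error(original, reconstructed, gene_names):
    """Return {gene: mean squared error across samples}.

    original, reconstructed: lists of samples, each a list with one
    expression value per gene (same gene order as gene_names).
    Assumption: mean squared error is used as the error measure.
    """
    errors = {}
    for j, gene in enumerate(gene_names):
        errors[gene] = fmean(
            (x[j] - x_hat[j]) ** 2 for x, x_hat in zip(original, reconstructed)
        )
    return errors


def rank_genes(errors):
    """Sort genes from lowest to highest reconstruction error."""
    return sorted(errors, key=errors.get)


# Toy data: 2 samples x 3 hypothetical genes.
genes = ["GENE_A", "GENE_B", "GENE_C"]
x = [[1.0, 2.0, 3.0], [2.0, 0.0, 1.0]]
# Placeholder for the autoencoder output on the same samples.
x_hat = [[1.0, 1.0, 3.5], [2.0, 1.0, 0.5]]

errors = gene_wise_reconstruction_error(x, x_hat, genes)
ranking = rank_genes(errors)
print(ranking)
```

For real expression matrices an array library would be the more natural choice; the pure-Python version is used here only to keep the sketch dependency-free.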
The method can handle large datasets (3.5 GB and more) in reasonable time on consumer-grade computers without GPU acceleration. This allows users without access to large computational resources to work with large expression datasets.
The final ranking based on reconstruction errors is subject to less bias than most currently available hub gene inference methods. In addition, no prior gene regulatory network inference is required. However, deliberately introducing a bias can help to focus on certain genes of interest. Here we introduced such a bias by using genes present in the STRING database, which also eased the subsequent analysis.
Analysis of the reconstruction errors showed that genes with low reconstruction error tend to be genes of central importance to the dataset used for training. For healthy cells, these were genes associated with housekeeping mechanisms; for breast cancer data, they were genes associated with breast cancer. In breast-cancer-specific data we found, for example, a high frequency of HOX family members linked specifically to breast cancer. For data covering different types of cancer, the picture was broader and covered a wide range of genes associated with different types of cancer.
Transcription factors were also highly enriched among the genes with low reconstruction error. Not only the regions with the lowest reconstruction error show high transcription factor enrichment; other regions of the ranking do as well. Transcription factors from these other regions differ in their correlation patterns.
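One way such region-wise enrichment could be quantified is sketched below: the error-based ranking is cut into non-overlapping windows, and each window is scored with a one-sided hypergeometric test for over-representation of transcription factors. The hypergeometric test, the window scheme, and all names and toy data are assumptions for illustration; the abstract does not specify how enrichment was computed.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are successes."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def window_enrichment(ranking, tf_set, window):
    """Score non-overlapping windows of a gene ranking for TF enrichment.

    Returns a list of (start_index, tf_count, p_value) tuples.
    """
    N = len(ranking)
    K = sum(g in tf_set for g in ranking)  # TFs in the whole ranking
    results = []
    for start in range(0, N - window + 1, window):
        region = ranking[start:start + window]
        k = sum(g in tf_set for g in region)  # TFs in this region
        results.append((start, k, hypergeom_sf(k, N, K, window)))
    return results


# Toy ranking of 20 hypothetical genes, lowest reconstruction error first.
ranking = [f"g{i}" for i in range(20)]
# Hypothetical transcription factor annotation.
tf_set = {"g0", "g1", "g2", "g3"}

results = window_enrichment(ranking, tf_set, window=5)
for start, k, p in results:
    print(start, k, p)
```

With many windows, the resulting p-values would need a multiple-testing correction before any region is called enriched.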
Regions with low reconstruction error and/or high transcription factor enrichment show high PPI enrichment and exhibit densely interconnected networks.
2023, p. 37