AI is THE BIG THING everyone seems to be talking about right now, with dozens of new machine learning models being released every month. I know everyone has their hardcore stances on AI (which is fair, just look at how ridiculous Meta's AI Instagram profiles were), but at least in genetics, there have been some truly exciting developments. Algorithms like DeepVariant, AlphaFold, EVE (Evolutionary Model of Variants), and PolyPhen-2 are driving advances in genome analysis, variant annotation, protein structure prediction, drug discovery, and more. But many of these models are trained on data from species we have already sequenced extensively, like humans, mice, or fruit flies. Things get much harder when you try to train AI for non-model species, where high-quality reference genomes and extensive datasets are often lacking.
The obvious answer, just sequencing these non-model organisms more, is easier said than done. Sequencing non-model organisms comes with significant challenges: high costs, limited availability of samples, and the technical difficulty of assembling and annotating their genomes. Many non-model species have large, highly repetitive genomes or other unusual features that make them more complex to sequence and analyze than model organisms. I understand this all too well; when I was trying to assemble and annotate nudibranch (sea slug) mitochondrial short reads, I ran into several problems caused by their unusual genomes.
So we need other tools to solve this. Luckily, there are already many researchers working on creative solutions to address these challenges. Let's dive right in.
The first of these tools is called transfer learning. Transfer learning is exactly what it sounds like. You take a preexisting model, already trained on an organism with a lot of data, and “fine-tune” it for use with data from your non-model organism. For example, if a machine learning model has been trained to predict gene function in Arabidopsis thaliana, a plant commonly used for genomic studies, the model has already learned a lot about gene structures, promoter regions, and functional annotation. These learned features (or parameters) can then be transferred to a non-model plant like Artemisia annua, a medicinal plant with less extensive genomic data. Essentially, you shrug and just use what little genomic data you have for Artemisia to continue the training process. Often, the learning rate is lowered during this phase to keep the adjustments subtle, so the model adapts to the new species' data without overfitting to it.
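To make that concrete, here's a minimal sketch of what this fine-tuning loop might look like in PyTorch. Everything here is a hypothetical stand-in (the tiny encoder, the class count, the randomly generated "Artemisia" data); the two ideas that matter are freezing the layers that hold the general features learned from the data-rich species, and training what's left at a much lower learning rate.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a model already trained on Arabidopsis: an encoder that
# has learned general sequence features, plus a classification head.
# (In practice you'd load real pretrained weights, e.g. with torch.load.)
encoder = nn.Sequential(nn.Linear(400, 128), nn.ReLU())
head = nn.Linear(128, 10)  # e.g. 10 hypothetical gene-function classes
model = nn.Sequential(encoder, head)

# Freeze the encoder so the general features learned from the
# data-rich species aren't disturbed by the tiny new dataset.
for param in encoder.parameters():
    param.requires_grad = False

# Fine-tune only the head, at a learning rate well below the one
# used for the original training run (say, 1e-3 down to 1e-5).
optimizer = optim.Adam(head.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical Artemisia data: a few hundred encoded sequence windows.
X = torch.randn(300, 400)
y = torch.randint(0, 10, (300,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(3):  # just a few epochs, to avoid overfitting
    for sequences, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(sequences), labels)
        loss.backward()
        optimizer.step()
```

In a real project you'd swap in the actual pretrained weights and properly encoded sequences, and you might unfreeze more layers once you're confident the model isn't overfitting the small dataset.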
There are ways to make this fine-tuning process even better. For instance, phylogenetic insights can help guide the transfer process by incorporating evolutionary relationships into model training. In this scenario, phylogenetic trees provide context that lets the model account for how traits and genomic sequences are conserved across species. In our example above, this means the AI doesn't just rely on the small amount of data from Artemisia but also incorporates evolutionary patterns learned from closely related species, such as other plants within the same genus or family.
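One simple way this could look in code (and I'm sketching an illustration here, not describing any particular published method): weight each donor species' training examples by its distance from the target on the phylogenetic tree, so close relatives influence the loss more than distant ones. The species and distances below are made up.

```python
import math

# Hypothetical patristic distances (summed branch lengths) from
# Artemisia annua to each donor species, read off a phylogenetic tree.
distance_to_target = {
    "Artemisia_annua": 0.00,       # the target itself
    "Artemisia_absinthium": 0.05,  # same genus
    "Tanacetum_parthenium": 0.20,  # same family (Asteraceae)
    "Arabidopsis_thaliana": 0.90,  # distant model organism
}

def sample_weight(species: str, decay: float = 3.0) -> float:
    """Exponentially down-weight examples from species that are
    evolutionarily distant from the target."""
    return math.exp(-decay * distance_to_target[species])

for sp in distance_to_target:
    print(f"{sp}: weight {sample_weight(sp):.2f}")
# Artemisia_annua: 1.00, Artemisia_absinthium: 0.86,
# Tanacetum_parthenium: 0.55, Arabidopsis_thaliana: 0.07
# A weighted loss would then multiply each example's loss term
# by sample_weight(its_species) during fine-tuning.
```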
While these approaches provide a strong foundation for adapting AI models to non-model organisms, limited data availability and the complexity of evolutionary processes still present significant challenges. This is where synthetic data generation enters the picture as the next frontier in improving model performance for non-model species. Synthetic data generation, as the name suggests, involves creating artificial datasets that mimic the characteristics of real genomic data. Generating genomic datasets this way is an emerging field in its earliest stages, but there have been some promising developments. In 2022, Illumina released a case study aimed at a different but related problem: sharing sensitive genetic data is hard because of privacy concerns and regulatory restrictions. Essentially, the researchers took a dataset from a previous paper, which included GWAS analyses of 68 phenotypes in 1,200 mice, and asked whether they could recreate the paper's findings on bone mineral density using their own synthetic data. While the motivation isn't exactly the same, I could easily envision future research applying these techniques to generate data for non-model organisms as well.
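The core "learn the distribution, then sample from it" idea is easy to sketch, even though real generators are far more sophisticated. The toy example below fits per-SNP allele frequencies from a hypothetical genotype matrix and samples new synthetic individuals from them; all the numbers are placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical real data: genotypes for 1,200 mice at 500 SNPs,
# coded 0/1/2 (copies of the alternate allele).
real = rng.integers(0, 3, size=(1200, 500))

# "Train": estimate each SNP's alternate-allele frequency, assuming
# Hardy-Weinberg proportions and (unrealistically) independent SNPs.
allele_freq = real.mean(axis=0) / 2.0

# "Generate": draw each synthetic genotype as Binomial(2, freq).
n_synthetic = 5000
synthetic = rng.binomial(2, allele_freq, size=(n_synthetic, 500))

# Sanity check: synthetic allele frequencies should track real ones.
print(np.abs(synthetic.mean(axis=0) / 2 - allele_freq).max())
```

The reason this crude per-SNP sampling wouldn't actually work for GWAS is linkage disequilibrium: real associations depend on correlations between nearby SNPs, which independent sampling destroys. A serious generator (a GAN, VAE, or similar deep generative model) has to capture that structure, which is what makes the problem hard.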
The results were encouraging. The synthetic model successfully recovered 177 of the 193 SNPs that were statistically significant in the real-world data, meaning the synthetic data reproduced roughly 92% of the most significant genetic variants. However, the study also revealed limitations. The synthetic model introduced a fair number of false-positive GWAS associations, likely due to the small sample size of 1,200 mice used for training. Despite these limitations, I am almost positive that as the field of synthetic genomic data generation continues to advance, it will become an increasingly valuable tool for enhancing AI models and overcoming data gaps in non-model organisms.
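For what it's worth, the scoring itself is just set arithmetic once you have the two GWAS hit lists. A hypothetical sketch (the SNP IDs are placeholders, not the study's actual variants):

```python
# Hypothetical sets of genome-wide-significant SNP IDs from GWAS runs
# on the real and the synthetic datasets.
real_hits = {f"rs{i}" for i in range(193)}  # 193 significant real SNPs
synthetic_hits = {f"rs{i}" for i in range(177)} | {"rs9001", "rs9002"}

recovered = real_hits & synthetic_hits          # true positives
false_positives = synthetic_hits - real_hits    # synthetic-only hits

print(f"Recovered {len(recovered)}/{len(real_hits)} "
      f"({len(recovered) / len(real_hits):.0%}) of real hits")  # 177/193 (92%)
print(f"{len(false_positives)} synthetic-only associations")    # 2 here
```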