Wednesday, January 8, 2025

How Do We Train AI For Non-Model Species?

AI is THE BIG THING everyone seems to be talking about right now, with dozens of new machine learning models being released every month. I know everyone has their hardcore stances on A.I. (which is fair—just look at how ridiculous Meta's AI Instagram profiles were), but at least in genetics, there have been some truly exciting developments. Algorithms like DeepVariant, AlphaFold, EVE (Evolutionary Model of Variants), and PolyPhen-2 are driving advancements in genome analysis, variant annotation, determining protein shape, drug discovery, and more. But a lot of these models are trained on data from species for which we already have abundant sequencing data, like humans, mice, or fruit flies. It gets a lot more difficult when trying to train AI for non-model species, where high-quality reference genomes and extensive datasets are often lacking.

The obvious answer to get around this (just sequence these non-model organisms more) is easier said than done. Sequencing non-model organisms often comes with significant challenges, including high costs, limited availability of samples, and technical difficulties associated with assembling and annotating their genomes. Many non-model species have large, highly repetitive genomes or unique features that make them more complex to sequence and analyze compared to model organisms. I understand this all too well; when I was trying to assemble and annotate nudibranch (sea slug) mitochondrial genomes from short reads, I encountered several challenges due to their unusual genomes.

So we need other tools to solve this. Luckily, there are already many researchers working on creative solutions to address these challenges. Let's dive right in.

The first of these tools is called transfer learning. Transfer learning is exactly what it sounds like. You take your preexisting model, already trained on an organism with a lot of data and “fine-tune” it for use with data from your non-model organism. For example, if a machine learning model has been trained to predict gene function in Arabidopsis thaliana, a plant commonly used for genomic studies, the model has already learned a lot about gene structures, promoter regions, and functional annotation. These learned features (or parameters) can then be transferred to a non-model plant like Artemisia annua, a medicinal plant with less extensive genomic data. Essentially you shrug and just use what little genomic data you may have for Artemisia to continue the training process. Often, the learning rate is lowered during this phase to make the adjustments subtle, so the model doesn't overfit to the new species' data but still adapts well to it.

Artemisia annua
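To make the fine-tuning idea a bit more concrete, here is a minimal sketch in Python. It is a toy, not a genomics model: a one-feature linear model stands in for the big pretrained network, and every number, dataset, and function name in it is made up for illustration. It pretrains on plenty of "model species" data, then continues training on three "non-model species" points with a lower learning rate, and compares that against training from scratch with the same tiny budget.

```python
# Toy transfer learning: a one-feature linear model stands in for a genomics model.
# All data and hyperparameters are invented for illustration only.

def train(data, w, b, lr, steps):
    """Plain gradient descent on mean squared error for y = w*x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(data, w, b):
    """Mean squared error of the model on a dataset."""
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# "Model species": lots of data following y = 2x + 1.
source = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(50)]
# "Non-model species": only three points from a similar rule (y = 2.2x + 0.8).
target = [(0.5, 1.9), (1.5, 4.1), (3.0, 7.4)]

# Step 1: pretrain on the data-rich species.
pre_w, pre_b = train(source, 0.0, 0.0, lr=0.05, steps=2000)
# Step 2: fine-tune on the data-poor species with a lower learning rate.
ft_w, ft_b = train(target, pre_w, pre_b, lr=0.01, steps=20)
# Baseline: same small budget, but starting from nothing.
sc_w, sc_b = train(target, 0.0, 0.0, lr=0.01, steps=20)

loss_pretrained = mse(target, pre_w, pre_b)
loss_finetuned = mse(target, ft_w, ft_b)
loss_scratch = mse(target, sc_w, sc_b)
print("pretrained only:", loss_pretrained)
print("fine-tuned:     ", loss_finetuned)
print("from scratch:   ", loss_scratch)
```

In this toy setup the fine-tuned model ends up with a lower error on the target data than either the untouched pretrained model or the from-scratch model, which is the whole appeal of starting from a well-studied relative.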

There are ways to make this process of fine-tuning even better. For instance, using phylogenetic insights can help guide the transfer process by incorporating evolutionary relationships into model training. In this scenario, phylogenetic trees, which are based on evolutionary relatedness, can provide context that enables the model to understand how traits and genomic sequences are conserved across species. In our example above, this means the AI doesn't just rely on the small amount of data from Artemisia but also incorporates evolutionary patterns learned from closely related species, such as other plants within the same genus or family.

While transfer learning and phylogenetic insights provide a strong foundation for adapting AI models to non-model organisms, the limitations of data availability and the complexity of evolutionary processes still present significant challenges. This is where synthetic data generation enters the picture as the next frontier in improving model performance for non-model species. The concept of synthetic data generation, as the name suggests, involves creating artificial datasets that mimic the characteristics of real genomic data. Right now, generating these genomic datasets is an emerging field in its earliest stages, but there have been some promising developments. In 2022, Illumina released a case study aimed at solving the large issue of sharing sensitive genetic data in the face of privacy concerns and regulatory restrictions. While this isn't exactly the same problem, I could easily envision future research applying these techniques to generate data for non-model organisms as well. Essentially, the researchers used a dataset from a previous paper, which included GWAS analyses of 68 phenotypes in 1,200 mice, with the goal of seeing if they could recreate the paper's findings on bone mineral density using their own synthetic data.

The results were encouraging. The synthetic model successfully recreated 177 out of 193 SNPs that were statistically significant in the real-world data, meaning the AI recovered roughly 92% of the most significant genetic variations. However, the study also revealed limitations. The synthetic model did introduce a fair number of false positive GWAS associations, likely due to the small sample size of 1,200 mice used for training. Despite these limitations, I am almost positive that as the field of synthetic genomic data generation continues to advance, it will become an increasingly valuable tool for enhancing AI models and overcoming data gaps in non-model organisms.

Sunday, May 19, 2024

Science Roundup

It's been a while since I've written something for this blog, but with the first year of my PhD done I can now devote a little more time to it. Anyway, here's a quick roundup of some cool topics I found interesting these past months.

HACE - Targeted mutations 

Geneticists are often interested in how mutations in DNA affect the function of a protein (remember: DNA --> RNA --> Proteins). In a laboratory setting, DNA can be altered in a variety of ways, like exposing cells to UV radiation, and gene function can be probed through knockdown experiments, but creating targeted mutations has proved difficult and complex. In a recent preprint by Dawn Chen et al., the researchers describe a new technology called HACE that makes small nicks in a single DNA strand using a guided version of CRISPR-Cas9. The Cas9 bit has a section of RNA loaded on it that recruits an enzyme, which makes random mutations as it moves along from where the break in the DNA was made. This way, scientists can now make much more specific evaluations of how a mutation at a given location in the DNA may affect its function.

Scale Eater Fish

I recently learned of the Scale Eater Fish (Perissodus microlepis), a species of freshwater fish in Zambia. These fish sneak up on other fish and, like their name suggests, bite off their scales and eat them. What makes scientists interested in them is that they are an unusual case study in natural selection. Scale eaters have mouths that are a bit crooked and bend either to the left or to the right.



Whether a population of scale eaters has more left- or right-bent individuals in a given year depends entirely on how sick of their BS the other fish in the lake are. Fish start to become protective of whichever side is currently more likely to have its scales bitten off. So, when there are more right-mouthed scale eaters, the other fish learn to watch their right sides. This leads to a disadvantage for right-mouthed fish but an advantage for left-mouthed ones. Just as the fish think they've adapted to getting bit on one side, scale eaters that bite the other side suddenly become more common in the lake, with this cycle seeming to occur every five years or so. How genetic and non-genetic cues jointly influence the direction and the degree of mouth bent-ness is still being investigated and is of major interest to population geneticists.
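The back-and-forth described above can be sketched as a tiny simulation. This is only a toy model under assumptions of my own choosing: the selection strength `s`, the prey learning rate `learn`, the starting frequencies, and the fitness formulas are all hypothetical and are not taken from any scale eater study. The variable `guard` is how much the prey currently watch the side that right-mouthed fish attack, and it lags behind the actual frequency of right-mouthed fish.

```python
def simulate(generations=60, p=0.8, guard=0.5, s=0.5, learn=0.3):
    """Track the frequency p of right-mouthed scale eaters over time.

    guard: share of prey vigilance aimed at the side right-mouthed fish attack.
    s:     how strongly prey vigilance cuts a morph's feeding success (assumed).
    learn: how quickly prey vigilance catches up with the current p (assumed).
    """
    history = [p]
    for _ in range(generations):
        w_right = 1 - s * guard        # right-mouthed do worse when their side is guarded
        w_left = 1 - s * (1 - guard)   # left-mouthed do worse when the other side is guarded
        new_p = p * w_right / (p * w_right + (1 - p) * w_left)
        guard += learn * (p - guard)   # prey slowly learn which side to watch
        p = new_p
        history.append(p)
    return history

history = simulate()
# Count how often the majority flips between right- and left-mouthed fish.
flips = sum(1 for a, b in zip(history, history[1:]) if (a - 0.5) * (b - 0.5) < 0)
print("majority flips:", flips)
print("lowest and highest right-mouthed frequency:", min(history), max(history))
```

Because the prey's vigilance always trails the real frequencies, the common morph keeps getting punished a little too late, and the majority swings from one side to the other instead of settling straight away.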

How we Lost our Tails

Humans lost our external tails about 25 million years ago, leaving only the coccyx in its place. A new paper by Xia et al. shows that this loss may have occurred due to a transposable element that inserted itself into an ancestor's gene. Transposable elements are bits of DNA that jump around either by being converted to RNA and then converted back into DNA and placed in a new position, or by producing an enzyme that moves them to a new place in the genome. The most abundant type of transposable element is called an Alu element, which makes up about 10% of our DNA. Most of the time, because most of your DNA is non-functional, their movement doesn't do much. But it seems that one Alu element's movement into the middle of our TBXT gene led to our ancestors losing their tails!


Bichir and the Immune System

This is a project I've been working on at my new lab here at NC State. Bichirs, known scientifically by their order name "Polypteriformes," are the oldest lineage of bony fish. Bichirs have been around for about 300 million years and have changed very little, earning them the nickname "living fossils."

The Bony Fish Lineage - AKA Actinopterygii. The oldest lineage, the bichirs and reedfish, is at the top. The lineage in the middle, the Teleostei, represents the majority of fish species.


Because bichirs are so old, they are one of the only bony fish lineages that did not go through the "Teleost gene duplication event." This event, in which a common ancestor of all Teleostei had its whole genome duplicated, led to a massive explosion in the diversity of fish species. This makes the Polypteriformes interesting for looking into the evolution of all sorts of processes. Specifically, I've been characterizing several genes associated with the immune system to see how they might have looked before the duplication event.

Stonefish Venom Genes

One really cool aspect of our immune system is the membrane-attack complex. This is a structure that our body forms that pokes holes in disease-causing invaders, causing water to rush into their membranes and kill them. 


Stonefish are notorious for being some of the most venomous animals on earth, with many incidents reported yearly of divers accidentally stepping on their sharp protruding spines, leading to immense pain. They are also really ugly (apologies to any stonefish reading this). I was surprised to learn that the protein behind stonefish venom actually works in a similar way to our immune system, because it is an ancient branch of our own membrane-attack complex family. Just like how we use our proteins to form pores in invaders, the stonefish venom, SNTX, pokes a hole in the cells of whatever tissue is unfortunate enough to come into contact with it, leading to cell death.

Monday, January 8, 2024

BWA, Read Mapping, and Indexing

A couple of months ago I was working on a project looking at genetic differences between two populations of a marine worm as a part of one of my lab rotations. What should have taken me a week or two ended up taking four because the first command I had to run to analyze the data I was given, "bwa aln," was not the correct one. After a bit of banging my head against a wall, it turned out that for the specific genetic data I was given I actually had to run "bwa mem." C’est la vie.


Anyway, the reason why I'm telling you this isn't because I wanted to talk about the project I was doing (I actually didn't end up finding a lot of genetic differentiation between the two populations), or because I'm warning you of the pitfalls of using aln vs mem, but rather because I think BWA is a really cool concept. BWA and other read mapping programs like it rely on different variations of the same algorithm and represent a ubiquitous first step in a lot of bioinformatics pipelines. Unless they are mathematicians, most people just run the command without really thinking in much depth about what it's actually doing.

BWA - What is it?


BWA stands for Burrows-Wheeler Aligner. It's based on an algorithm invented by two guys, Michael Burrows and David Wheeler, called the Burrows-Wheeler Transform (BWT), that was originally used as a compression tool for text files. It kind of faded into relative obscurity until it was picked up by geneticists in the late 2000s, who realized that it worked great for DNA sequencing files. In fact, many of the most widely used read mapping tools for genetics use the BWT in some way or another. BWA just happens to be (at the time of writing this) one of the most popular.

It works like this: you take a piece of text (let's use the example below of the word banana), add a $ symbol to mark the end, and write out every rotation of the letters, one per row. So first you would move the letter "b" to the end and shift the "a" up to the front, then in the next row move "a" to the end and shift the "n" up, and so forth.

Fig 1

After you have every rotation of the text, you sort the rows alphabetically (with $ sorting before every letter) and then simplify things by keeping just the last column (highlighted in red in Figure 1). For mathematical reasons, when you do this it tends to make the same letters bunch together. You then add a number before each group of repeated letters designating how many repeats there are.
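If you'd rather see that as code, here is a minimal Python sketch of the transform followed by the "count the repeats" step. The function names `bwt` and `run_length_encode` are my own, and building every rotation like this is only practical for tiny strings; real tools such as BWA use far more memory-efficient constructions.

```python
from itertools import groupby

def bwt(text, end="$"):
    """Burrows-Wheeler Transform via the naive 'list every rotation' method."""
    s = text + end
    # Every rotation of the text, sorted alphabetically ('$' sorts before letters).
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is just the last column of the sorted rotations.
    return "".join(row[-1] for row in rotations)

def run_length_encode(s):
    """Replace each run of identical letters with its length plus the letter."""
    return "".join(f"{len(list(group))}{letter}" for letter, group in groupby(s))

last_column = bwt("banana")
print(last_column)                     # annb$aa
print(run_length_encode(last_column))  # 1a2n1b1$2a
```

Notice how the two n's and two of the a's ended up next to each other, even though they were scattered in the original word.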


Now in this specific instance the actual size of the text did not change, so the file containing the text 'banana' would not be compressed. But you can imagine how bunching repeats in this way can be useful for compressing a 50 GB genomic sequencing file of just A's, T's, C's, and G's. But we can take this a step further.

Map those reads!


See, the reason why BWA is such a ubiquitous first step in a lot of pipelines isn't just its usefulness at compressing files; it's also great for aligning newly sequenced reads to a reference, also known as read mapping.



Read mapping basically allows you to take your fragmented genome sequence pieces and match them to the areas of best fit on a reference so you can reorganize them correctly. 

The way BWA does this is a bit counterintuitive and difficult to explain in paragraph form, but bear with me. First we must create a matrix with four columns. The first column is just a row number, starting at 0 and running up to one less than the number of characters. We label this as i. Then go back to the sorted rotations from before. The second and third columns of our matrix are the first and last columns of letters (highlighted below in red boxes). We label these as "First" and "Last." The fourth column is the most confusing. Using the three columns we have just created in our matrix, we go row by row, take the character in the "Last" column, find where that same occurrence of the character (the first "a", the second "a", and so on) appears in the "First" column, and write down the corresponding number from i. We call this L2F(i). Again, here's what that looks like using 'banana':
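The same table can be built with a few lines of Python. This is a sketch with names I made up (`l2f_table`, `rebuild_text`), and it starts from the last column "annb$aa" that the transform of banana$ produces. The one trick it relies on is that Python's sort is stable, so the first "a" in Last lines up with the first "a" in First, the second with the second, and so on.

```python
def l2f_table(last):
    """Return the First column and the L2F mapping for a BWT 'Last' column."""
    first = "".join(sorted(last))
    # Stable sort of row numbers by their Last character: order[j] is the row in
    # Last whose character occupies row j of First.
    order = sorted(range(len(last)), key=lambda i: last[i])
    mapping = [0] * len(last)
    for first_row, last_row in enumerate(order):
        mapping[last_row] = first_row
    return first, mapping

def rebuild_text(last, mapping):
    """Walk the L2F mapping backwards from the '$' row to recover the text."""
    row, letters = 0, []          # row 0 is the rotation that starts with '$'
    while last[row] != "$":
        letters.append(last[row])
        row = mapping[row]
    return "".join(reversed(letters))

last = "annb$aa"                  # last column for 'banana$'
first, mapping = l2f_table(last)
print(" i  First  Last  L2F(i)")
for i, (f, l, m) in enumerate(zip(first, last, mapping)):
    print(f"{i:2d}  {f:5s}  {l:4s}  {m}")
print(rebuild_text(last, mapping))  # banana
```

The last line is a small sanity check that the mapping really does hold all the information: following L2F from row to row spells out the original word backwards.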


Now what seems like utterly useless playing around with letters is actually genius. Basically, in a very roundabout way, what we've done is make it easier to find where a piece of text might fit best on the reference. For a more detailed explanation of how this occurs, see this fantastic YouTube tutorial by Niema Moshiri. It involves narrowing down the range of rows in which a pattern appears using the L2F(i) column. Luckily, we don't have to think about this too hard since we can have a computer do it for us. Again, you can see how helpful this could be with large DNA files and why I get a little bit geeked out when I talk about it. With BWA done and your short read sequences mapped to a reference in their correct order, the next steps of actually analyzing them have become much easier.
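For the curious, the range-narrowing idea is usually called backward search, and a bare-bones version fits in a short Python function. This is a sketch of the general technique under my own naming (`count_occurrences`), not BWA's actual code: it only handles exact matches and recounts letters on every step, whereas real mappers precompute those counts and tolerate mismatches.

```python
def count_occurrences(last, pattern):
    """Count exact matches of pattern in the original text, using only its BWT."""
    first = sorted(last)
    # starts[c]: the first row of the First column that holds character c.
    starts = {}
    for row, ch in enumerate(first):
        starts.setdefault(ch, row)

    lo, hi = 0, len(last)         # current range of candidate rows: [lo, hi)
    for ch in reversed(pattern):  # read the pattern from its last letter backwards
        if ch not in starts:
            return 0
        # Same idea as L2F: jump from occurrences of ch in Last to their rows in First.
        lo = starts[ch] + last[:lo].count(ch)
        hi = starts[ch] + last[:hi].count(ch)
        if lo >= hi:              # the range is empty, so the pattern never occurs
            return 0
    return hi - lo

last = "annb$aa"                  # last column for 'banana$'
print(count_occurrences(last, "ana"))  # 2
print(count_occurrences(last, "nan"))  # 1
print(count_occurrences(last, "nab"))  # 0
```

Each letter of the pattern shrinks the range of rows that could still match, so the work depends on the length of the read rather than on scanning the whole reference.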

Thursday, July 13, 2023

What Colors Does Your Dog Actually See? And Why Did Color Vision Evolve?

I'm a bit late, but recently there was a TikTok trend where people would apply a filter that supposedly showed you what the world looked like to a dog. Reactions to the filter went semi-viral as some people sobbed that their dog couldn't enjoy the same vibrant colors that we humans do, which was pretty funny. This had me wondering though, how accurate was this filter? If a dog could see the same spectrum of light as us, could it even "appreciate" these colors in the same way we do? What does THAT even mean? So, just like a dog that can't jump very high but wants to escape your backyard, I did a little bit of digging.

I like this picture of this dog staring at me like an 18th-century English orphan

A little bit of background (long groan)...

To begin, we have to understand what color vision is. Anybody that has taken a basic bio course knows about rods and cones, photoreceptor cells in your eyes that absorb light and trigger a complex reaction that allows us to see things. I remember these being described to me as "one helps you see shapes and the other helps you see colors," which is an oversimplification. In fact, both cells help you see, just slightly differently under different conditions.


In the dark, both rods and cones release glutamate, an important chemical that sends signals to neurons called bipolar cells, which act as an in-between to your other neurons and let the brain understand "vision". There are ON bipolar cells, which are called that because they are excited when the light is on, and OFF bipolar cells, which are excited when the light is off. Thus, OFF bipolar cells are excited by glutamate release during the dark and ON bipolar cells are excited by a complex chemical reaction that occurs when light shuts off your body's glutamate valve. The glutamate valve for rod cells is inhibited at lower levels of light, and rod bipolar cells can inhibit cone OFF bipolar cells, overtaking most of the cone pathway for vision at night and becoming our main way of seeing in the dark.

But what about color? That's where cones come in. Humans have four types of light-sensitive proteins called opsins. Cones have three classes of opsins: long (L), medium (M), and short (S), which are excited by the corresponding wavelengths for red, yellow-green, and purple-blue. Rods only have one, which is why in the dark, when rods are dominant, everything appears more muted and grayish.

    

A chart that every physics 101 class has seen


The possession of three cones for color vision is called trichromacy, and among mammals it's actually something that is specific to humans and other closely related primates. Most mammals (including your dog) have dichromatic vision, meaning they have some combination of only two types of cone opsins (usually S with M or L). Birds, amphibians, reptiles, and fish are often tetrachromatic (with 4 cone opsins) and occasionally pentachromatic, although having five types of opsins and being able to distinguish between the colors they provide are two different things. This is why, although mantis shrimp have 12 to 16 different types of photoreceptors, the idea that they can "see more colors" than us is a bit of an oversimplification. A recent study found that mantis shrimp cannot distinguish between wavelengths less than 25 nm apart, whereas most humans can distinguish between wavelengths only 1-5 nm apart. It's thought that this actually helps the shrimp out though. By not worrying themselves with all these colors, they can reduce the amount of time it takes their brains to see contrasts between different organisms, making their responses faster and helping their survival. After all, seeing the world as a kaleidoscope of colors would probably get pretty distracting.


Mantis shrimp are, however, the only organisms known to see circular polarization. What does that look like? I don't know, ask your local shrimp.


So what colors do dogs see? 

The answer is we really will never be able to know for sure (my least favorite answer that scientists love to give). But because they lack an L cone they probably see something like this:

The TikTok filter was right all along!


How did we get here?

The real reason I wrote this article wasn't because I cared too much about what colors dogs can and can't see. Most people already knew somewhat about rods and cones and that their dog is a dingus. The real reason is because I wanted to know more about opsin evolution. 

Like all proteins, opsins have genes that code for them. It used to be that to study these proteins, we would have to isolate them directly from animal retinas, but with modern technologies scientists have opted instead to have cultured cells produce them for us in the lab. This is nice because it allows researchers, who are really just curious children at heart, to play around with the cells' genes and see what happens to the opsins. In addition, we have sequenced the genes for about 1000 opsins, from humans to jellyfish, which has provided even more background on their history.


Scientists theorize that around 500 million years ago, a jawless proto-vertebrate had already developed four opsins homologous to our modern day ones. Scientists dubbed the classes of these opsins SWS1 and SWS2 (homologous to human S opsins), and RH2 and LWS (homologous to human M opsins). At some point, probably around 250 million years ago, mammals became more and more nocturnal in order to escape predators that hunted mostly in the daytime. In response, over time they lost RH2 and SWS2, which is why most mammals are dichromatic, like your dog. From what I've read, it's not clear what the advantage of having SWS1 over SWS2 was, however, since there's really not too much of a difference between the two except a little bit more UV-sensitivity in SWS1.

Case in point, we as humans lost this UV-sensitivity in exchange for seeing blues and purples. Scientists found that exactly seven genetic mutations of the SWS1 gene changed the opsin wavelength sensitivity in primates as they switched from being nocturnal to foraging in the daytime. Seeing blues and purples might have allowed them to see berries and fruits better against green foliage, giving blue-seeing primates a bit of an advantage.

Seeing more colors is useful!

But what about the LWS opsin? Over time LWS slowly shifted its sensitivity to become M, allowing us to see the color green the way we do, and the L gene evolved from that. Again, this is probably because seeing contrasting colors in the daytime is helpful for finding food. There are two hypotheses for what might have happened to create this two-opsin-gene system. The first is that there were two variants of this gene, one that was more sensitive to long wavelengths and one for medium ones. This means that at some point our ancestors were probably running around seeing the world slightly differently from one another. It's theorized that this gene duplicated due to unequal crossing over during meiosis in a female primate. This meant that rather than one gene with two types (M and L), two distinct genes for M and L were created on a single X chromosome. Any children from this primate would now only need a single X chromosome with this mutation to attain trichromatic vision.

The other hypothesis is that the M opsin gene duplicated. This would have allowed mutations to occur on one set of this gene whilst keeping the other intact. So while the original M gene remained, the duplicate could have had multiple mutations acting on it to eventually become L, allowing us to see the vibrant reds we know and love.

Thursday, March 10, 2022

Catching Invasive Species

We all know that invasive species cause a lot of problems. The introduced organisms are often so good at being themselves that, with the help of a lack of native predators, they outcompete already existing organisms. Famous examples of critters like lionfish, Asian shore crabs, zebra mussels and a multitude of others have wreaked havoc on local areas. One study estimated that biological invasions cost North America roughly $26 billion a year. Thus it should be no surprise that finding ways of dealing with invasive species has been a top priority of scientists for decades. Enter environmental DNA.

As I have written about previously, I have had the good fortune to work at an environmental DNA (eDNA) lab and learn quite a bit about the methods and research that are used. eDNA refers to genetic material that is just floating around nature in the form of shed skin or fecal matter. With the rise of more advanced bioinformatic methods, we can scoop this DNA from the environment and compare it to DNA sequences in databases to find out what species it came from.
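To give a feel for what "compare it to DNA sequences in databases" means, here is a deliberately tiny Python sketch. Everything in it is an assumption for illustration: the three reference sequences are invented (they are not real barcodes for these animals), the function names are my own, and scoring by shared 4-letter chunks (k-mers) is a stand-in for the alignment tools and curated databases that real eDNA pipelines rely on.

```python
def kmers(seq, k=4):
    """All overlapping chunks of length k in a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_match(query, references, k=4):
    """Score each reference by the share of the query's k-mers it contains."""
    query_kmers = kmers(query, k)
    scores = {
        name: len(query_kmers & kmers(ref, k)) / len(query_kmers)
        for name, ref in references.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical reference "database": made-up sequences, not real barcodes.
references = {
    "green_crab":   "ATGGCGTACGTTAGCCTAGGATCCGATTACA",
    "lionfish":     "ATGCCCTTTAAAGGGCCCTTTGGGAAACCCT",
    "zebra_mussel": "ATGTTGACCAGTCAGTCAAGGTTCCAAGGTA",
}

# A short fragment "scooped from the water" (here, a piece of the crab sequence).
sample = "ACGTTAGCCTAGGATC"
species, scores = best_match(sample, references)
print(species)
print(scores)
```

In this toy case the fragment matches the made-up green crab reference perfectly and the others only partially, which is the basic logic behind flagging a species from a water sample.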


But how does this help with combatting invasive species? Picture this. You work for a bio-monitoring program, scouting specific areas in a national park for possible abnormalities. One day you find a slew of green crabs on the beaches, and they are already everywhere! It's too late. You now have a possible environmental disaster! eDNA offers a clever alternative: instead of monitoring invasive organisms by physically locating them after they arrive, scientists are now able to detect them before they have a chance to fully establish themselves.

Friday, February 4, 2022

Gene Editing Starts the Eradication of Salmon Viruses

Great news coming from The Roslin Institute in Scotland! Researchers have identified genes associated with resistance to a disease known as Infectious Pancreatic Necrosis (IPN) in Atlantic Salmon. Seeing as how salmon represent 4.6% of the global food supply, with almost all of that coming from aquaculture farms, you can see how this would be a pretty big deal. IPN is one of several diseases that can greatly disrupt aquaculture centers by infecting their salmon and causing high mortality rates. By finding the exact locations in Atlantic Salmon genomes that allow some of them to be naturally resistant to IPN, farmers can more accurately test for and select naturally resistant brood stock: the animals in a farm used for breeding purposes.



A dissected Salmon Parr infected with IPN (top) vs a healthy Parr (bottom)

But how exactly did scientists go about doing this? Well, first they performed what is called a "challenge experiment" in which they infected families of Atlantic Salmon and looked at the tanks that had the fewest dead salmon. The salmon in those tanks were deemed resistant, whereas the other ones were deemed susceptible or intermediate to IPN. They then took two of the intermediate families, tested for their parents' genotypes, and analyzed their gene expression patterns for IPN QTL-linked markers.

QTL = Quantitative Trait Locus. It's a region of a chromosome, detected by statistical analysis, that is significantly associated with variation in a quantitative trait. Oftentimes, to find QTLs, scientists link them to specific genetic markers that exist in two distinguishable forms.

After looking into the QTL pattern differences in the salmon, the authors found a specific gene within this area that was the most differentially expressed: a gene called nae1. They then used CRISPR-Cas9, a widely used method for gene editing, to block nae1 and see if it really caused a major difference in IPN resistance. Their results show that, indeed, blocking nae1 significantly reduced the salmon's ability to resist being infected by the virus.

Monday, January 3, 2022

Guppies and Y Chromosomes

Before this week I could tell you two things about guppies. They are small and they are popular with freshwater aquarium enthusiasts. It wasn't until I came across this paper published in the journal Genome Biology and Evolution about how guppies are being used to investigate sex chromosome differentiation that I realized just how cool they are. This was not a topic I was well versed in, but I thought the paper was interesting, so I went ahead and did some investigating for myself to learn more about this subject.

Poecilia reticulata AKA The Guppy

As many people undoubtedly already know from their high school biology class, sex chromosomes, also referred to as allosomes, are what determine the sex of an organism. In humans, XY typically develops male characteristics and XX develops female characteristics. This is an oversimplified version of the biology behind sex; in fact sex is not binary and biologists have for years viewed it more as a spectrum. Sex determination in humans (whether an embryo develops along the male or female pathway) is sort of understood by scientists through the discovery of the SRY gene. Some animals share similar methods of sex determination with us, whereas others have systems so completely wild that researchers are still scratching their heads about them.

So although we understand a little bit about how sex chromosomes work, scientists are still looking for answers on how this system evolved. Enter the guppy, whose allosomes are similar enough to humans' that they make for a great organism to look into this. The chromosome carrying the sex-determining gene has evolved regions that are non-recombining, meaning that some portions of the DNA do not become rearranged, unlike in regular chromosomes. This recombination suppression allows for the differences between X and Y chromosomes in their shape and size, in addition to making the Y chromosome lose its gene functions, making it "genetically degenerate."

If that was a lot to read, I understand. I spent quite a bit of time making sure I was getting all the details correct.

To simplify:

No genetic recombination --> Y chromosome's genes are functionless

Yes genetic recombination --> Y chromosome's genes have (at least some) function

The key difference that makes the guppy unique for looking into this topic is that, as a result of incomplete recombination suppression, its Y chromosome is not fully degenerated like that of humans. Thus researchers, like those in the paper I mentioned at the start, have used guppy species to analyze which situations accelerate or obstruct chromosome degeneration. Under normal circumstances the Y chromosome has a higher rate of mutation, making it more likely to be evolutionarily disadvantageous. Therefore, having some recombination can lead the Y chromosome to have the same rate of mutations as its X pairing, making it a little more stable.

The researchers of the paper actually go into a lot more detail by comparing three different guppy species' sex chromosomes, which they used to build phylogenetic trees to get a better understanding of when exactly recombination suppression developed. Their main point, besides that having not fully degenerated Y chromosomes can be beneficial for the guppy, is that the process behind recombination suppression is older than evolutionary biologists had once thought. So when exactly did it start? Who knows? Maybe more research into organisms like the guppy can provide us the answers! Overall though, these are important topics. Looking into the evolutionary history of sex chromosomes using species like the guppy can help us better understand our own biology and the complexity behind the development of sexes throughout the animal kingdom.