The Gill Raker

Monday, January 8, 2024

BWA, Read Mapping, and Indexing

A couple of months ago I was working on a project looking at genetic differences between two populations of a marine worm as part of one of my lab rotations. What should have taken me a week or two ended up taking four because the first command I had to run to analyze the data I was given, "bwa aln," was not the right one. After a bit of banging my head against a wall, it turned out that for the specific genetic data I was given I actually had to run "bwa mem." C’est la vie.

Anyways, the reason I'm telling you this isn't because I wanted to talk about the project I was doing (I actually didn't end up finding a lot of genetic differentiation between the two populations), or because I'm warning you of the pitfalls of using aln vs mem, but rather because I think BWA is a really cool concept. BWA and other read mapping programs like it rely on variations of the same algorithm and are a ubiquitous first step in a lot of bioinformatics pipelines. Unless they are mathematicians, most people just run the command without thinking in much depth about what it's actually doing.

BWA - What is it?

BWA stands for Burrows-Wheeler Aligner. It's based on an algorithm invented by two guys, Michael Burrows and David Wheeler, called the Burrows-Wheeler Transform (BWT), that was originally used as a compression tool for text files. It kind of faded into relative obscurity until it was picked up by bioinformaticians in the late 2000s, who realized that it worked great for DNA sequencing files. In fact many of the most popular short read mapping tools use the BWT in some way or another. BWA just happens to be (at the time of writing this) one of the most popular.

It works like this: you take a piece of text (let's use the example below of the word banana), add a $ symbol to mark the end, and write out every rotation of the letters, one per row. So first you would move the letter "b" to the end and shift the "a" up to the front, then in the next row move that "a" to the end and shift the "n" up, and so forth.

Fig 1

After you have every rotation of the text, you sort the rows alphabetically (with $ counting as the smallest character) and then simplify by taking just the last column (highlighted in red in Figure 1). For mathematical reasons, when you do this it tends to make the same letters bunch together. You then add a number before each group of repeated letters designating how many repeats there are.
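To make the rotation-and-last-column idea concrete, here is a minimal Python sketch. It assumes the rows are sorted alphabetically before the last column is read off, with $ sorting ahead of the letters (which is what Python's default string ordering does). The function name bwt is just a label for this example, and writing out every rotation like this is only practical for toy inputs; it is not how BWA itself builds its index.

```python
def bwt(text, end="$"):
    """Return (sorted rotations, last column) for a toy input string."""
    s = text + end
    # Every rotation: move the first i characters to the back.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    # Sort the rows; "$" sorts before the letters in ASCII order.
    rotations.sort()
    # The transform is the last character of each sorted row.
    last_column = "".join(row[-1] for row in rotations)
    return rotations, last_column


rows, last = bwt("banana")
for row in rows:
    print(row)
print("BWT:", last)
```

For banana, the last column comes out as annb$aa: the two n's and two of the a's end up next to each other, which is the bunching effect described above.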

Now in this specific instance the actual size of the text did not change, so the file containing the text 'banana' would not be compressed. But you can imagine how bunching repeats in this way can be useful for compressing a 50 GB genomic sequencing file of just A's, T's, C's, and G's. But we can take this a step further.
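The counting step can be sketched in a few lines of Python as well. This is a toy run-length encoder, assuming a number is only written in front of groups of two or more repeated letters (that convention is my assumption, and it only works for text that contains no digits of its own, like DNA). The string annb$aa is the sorted last column for banana$.

```python
from itertools import groupby


def run_length_encode(s):
    """Collapse runs of repeated characters, e.g. 'nn' -> '2n'."""
    pieces = []
    for char, run in groupby(s):
        count = len(list(run))
        # Only write a count when the character actually repeats.
        pieces.append(f"{count}{char}" if count > 1 else char)
    return "".join(pieces)


last_column = "annb$aa"  # last column of the sorted rotations of banana$
encoded = run_length_encode(last_column)
print(encoded, len(last_column), len(encoded))
```

Here the encoded text is a2nb$2a, which is seven characters, the same as banana$, so nothing is saved. On a long stretch of repetitive DNA the runs get much longer and the savings start to add up.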

Map those reads!

See, the reason why BWA is such a ubiquitous first step in a lot of pipelines isn't just that the BWT is useful for compressing files. It's also great for aligning newly sequenced reads to a reference genome, also known as read mapping.

Read mapping basically allows you to take your fragmented genome sequence pieces and match them to the areas of best fit on a reference so you can reorganize them correctly.

The way BWA does this is a bit counterintuitive and difficult to explain in paragraph form, but bear with me. First we must create a matrix with four columns. The first column is just a number from 0 up to one less than the number of characters. We label this as i. Then go back to the sorted rows of rotations from before. The second and third columns of our matrix are the first and last columns of letters (highlighted below in red boxes). We label these as "First" and "Last." The fourth column is the most confusing. Using the three columns we have just created, we go row by row, take the character in the "Last" column, find where that same occurrence of the character sits in the "First" column (the second "a" in "Last" matches the second "a" in "First," and so on), and write down the corresponding number from i. We call this L2F(i). Again, here's what that looks like using 'banana':
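If the four-column matrix is hard to picture, here is a small Python sketch that builds it. It assumes the "Last" column is the sorted-rotation last column for banana$ (annb$aa), and it pairs the k-th copy of a letter in "Last" with the k-th copy of that letter in "First". The function name last_to_first is made up for this example.

```python
def last_to_first(last):
    """Return (First column, L2F list) for a given Last column."""
    first = sorted(last)
    # Row where each character's block begins in the First column.
    block_start = {}
    for i, char in enumerate(first):
        block_start.setdefault(char, i)
    seen = {}  # how many copies of each character we have passed in Last
    l2f = []
    for char in last:
        rank = seen.get(char, 0)
        # The k-th copy in Last lines up with the k-th copy in First.
        l2f.append(block_start[char] + rank)
        seen[char] = rank + 1
    return "".join(first), l2f


last = "annb$aa"
first, l2f = last_to_first(last)
print("i  First  Last  L2F(i)")
for i in range(len(last)):
    print(f"{i}  {first[i]}      {last[i]}     {l2f[i]}")
```

For banana, the L2F(i) column works out to 1, 5, 6, 4, 0, 2, 3 reading from row 0 down to row 6.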

Now what seems like utterly useless playing around with letters is actually genius. Basically, in a very roundabout way, what we've done is make it easier to find where a piece of text fits best on the reference. For a more detailed explanation of how this works, see this fantastic YouTube tutorial by Niema Moshiri. It involves using the L2F(i) column to narrow down, one character at a time, the range of rows in which a pattern can appear. Luckily, we don't have to think about this too hard since we can have a computer do it for us. Again, you can see how helpful this could be with large DNA files and why I get a little bit geeked out when I talk about it. With BWA done and your short read sequences mapped to a reference in their correct order, the next steps of actually analyzing them become much easier.
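For the curious, the range-narrowing idea can be sketched in Python too. This is a simplified, exact-match-only version that just counts how many times a pattern occurs in the text. It reads the pattern from its last letter to its first and shrinks a range of rows at each step. The function name count_matches and the slow count of letters in a slice are my own shortcuts for a toy example; real mappers like BWA use precomputed tables and also have to cope with mismatches and gaps.

```python
def count_matches(last, pattern):
    """Count exact occurrences of pattern using only the Last column."""
    first = sorted(last)
    # Row where each character's block begins in the First column.
    block_start = {}
    for i, char in enumerate(first):
        block_start.setdefault(char, i)

    def occurrences_before(char, row):
        # Copies of char in Last above the given row (slow but simple).
        return last[:row].count(char)

    low, high = 0, len(last)  # start with every row: [low, high)
    for char in reversed(pattern):
        if char not in block_start:
            return 0
        # Jump from Last to First, keeping only rows that extend the match.
        low = block_start[char] + occurrences_before(char, low)
        high = block_start[char] + occurrences_before(char, high)
        if low >= high:
            return 0
    return high - low


last = "annb$aa"  # Last column for banana$
for pattern in ["ana", "nan", "ban", "nab"]:
    print(pattern, count_matches(last, pattern))
```

With the banana example, "ana" is found twice, "nan" and "ban" once each, and "nab" not at all, all without ever scanning the original text from left to right.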

Thursday, July 13, 2023

What Colors Does Your Dog Actually See? And Why Did Color Vision Evolve?

I'm a bit late, but recently there was a TikTok trend where people would apply a filter that supposedly showed you what the world looked like to a dog. Reactions to the filter went semi-viral as some people sobbed that their dog couldn't enjoy the same vibrant colors that we humans do, which was pretty funny. This had me wondering though, how accurate was this filter? If a dog could see the same spectrum of light as us, could it even "appreciate" these colors in the same way we do? What does THAT even mean? So, just like a dog that can't jump very high but wants to escape your backyard, I did a little bit of digging.

I like this picture of this dog staring at me like an 18th century English orphan

A little bit of background (long groan)...

To begin, we have to understand what color vision is. Anybody that has taken a basic bio course knows about rods and cones, photoreceptor cells in your eyes that absorb light and trigger a complex reaction that allows us to see things. I remember these being described to me as "one helps you see shapes and the other helps you see colors," which is an oversimplification. In fact, both cells help you see, just under different lighting conditions.

In the dark, both rods and cones release glutamate, an important chemical that sends signals to neurons called bipolar cells, which act as a go-between to your other neurons and ultimately let the brain understand "vision". There are ON bipolar cells, which are called that because they are excited when the light is on, and OFF bipolar cells, which are excited when the light is off. Thus, OFF bipolar cells are excited by glutamate release in the dark, and ON bipolar cells are excited by a complex chemical reaction that occurs when light shuts off the photoreceptors' glutamate valve. The glutamate valve for rod cells is inhibited at lower levels of light, and their bipolar cells can inhibit cone OFF bipolar cells, overtaking most of the cone pathway for vision at night and becoming our main way of seeing in the dark.

But what about color? That's where cones come in. Humans have four types of light-sensitive visual proteins called opsins. Cones have three classes of opsins: long (L), medium (M), and short (S), which respond best to the wavelengths we loosely call red, green, and blue. Rods only have one, which is why in the dark, when rods are dominant, everything appears more muted and grayish.

A chart that every physics 101 class has seen

The possession of three cones for color vision is called trichromacy, and among mammals it's actually something that is specific to humans and other closely related primates. Most mammals (including your dog) have dichromatic vision, meaning they have some combination of only two types of cone opsins (usually S with M or L). Birds, amphibians, reptiles, and fish are often tetrachromatic (with 4 opsin proteins) and occasionally pentachromatic, although having five types of opsins and being able to distinguish between the colors they provide are two different things. This is why, although mantis shrimp have 12 to 16 different types of photoreceptors, the idea that they can "see more colors" than us is a bit of an oversimplification. A recent study found that mantis shrimp cannot distinguish between wavelengths less than about 25 nm apart, whereas most humans can distinguish between wavelengths 1-5 nm apart. It's thought that this actually helps the shrimp out, though. By not worrying themselves with all these colors, they can reduce the amount of time it takes their brains to see contrasts between different organisms, making their responses faster and helping their survival. After all, seeing the world as a kaleidoscope of colors would probably get pretty distracting.

Mantis shrimp are, however, the only animals known to see circular polarization. What does that look like? I don't know, ask your local shrimp.

So what colors do dogs see?

The answer is we really will never be able to know for sure (my least favorite answer that scientists love to give). But because they lack an L cone they probably see something like this:

The TikTok filter was right all along!

How did we get here?

The real reason I wrote this article wasn't because I cared too much about what colors dogs can and can't see. Most people already knew somewhat about rods and cones and that their dog is a dingus. The real reason is because I wanted to know more about opsin evolution.

Like all proteins, opsins have genes that code for them. It used to be that to study these proteins, we would have to isolate them directly from animal retinas, but with modern technologies scientists have opted instead to have cultured cells produce them for us in the lab. This is nice because it allows researchers, who are really just curious children at heart, to play around with the cells' genes and see what happens to the opsins. In addition, we have sequenced the genes for about 1000 opsins, from humans to jellyfish, which has provided even more background to their history.

Scientists theorize that around 500 million years ago, a jawless proto-vertebrate had already developed four cone opsins homologous to our modern day ones. Scientists dubbed the classes of these opsins SWS1 and SWS2 (the short-wavelength classes; the human S opsin belongs to SWS1), and RH2 and LWS (the medium- and long-wavelength classes; the human M and L opsins both belong to LWS). At some point, probably around 250 million years ago, mammals became more and more nocturnal in order to escape predators that hunted mostly in the daytime. In response, over time they lost RH2 and SWS2, which is why most mammals are dichromatic, like your dog. From what I've read, it's not clear what the advantage of having SWS1 over SWS2 was, however, since there's really not too much of a difference between the two except a little bit more UV-sensitivity in SWS1.

Case in point, we as humans lost this UV-sensitivity in exchange for seeing blues and purples. Scientists found that exactly seven genetic mutations of the SWS1 gene changed the opsin wavelength sensitivity in primates as they switched from being nocturnal to foraging in the daytime. Seeing blues and purples might have allowed them to see berries and fruits better against green foliage, giving blue-seeing primates a bit of an advantage.

Seeing more colors is useful!

But what about the LWS opsin? Over time LWS slowly shifted its sensitivity to become M, allowing us to see the color green the way we do, and the L gene evolved from that. Again, this is probably because seeing contrasting colors in the daytime is helpful for finding food. There are two hypotheses for what might have happened to create this two-opsin gene system. The first is that there were two variants of this gene, one that was more sensitive to long wavelengths and one to medium ones. This means that at some point our ancestors were probably running around seeing the world slightly differently from one another. It's theorized that this gene duplicated due to unequal crossing over during meiosis in a female primate. This meant that rather than one gene with two types (M and L), two distinct genes for M and L were created on a single X chromosome. Any children of this primate would now only need a single X chromosome with this mutation to attain trichromatic vision.

The other hypothesis is that the M opsin gene duplicated. This would have allowed mutations to occur on one set of this gene whilst keeping the other intact. So while the original M gene remained, the duplicate could have had multiple mutations acting on it to eventually become L, allowing us to see the vibrant reds we know and love.

In the colorful history of our vision, it seems our ancestors were quite the sightseers, observing the world through slightly different lenses. Whether it was a tale of two gene variants, causing them to see greens and mediums with a quirky divergence, or the mischievous duplication of the M opsin gene, our vision has certainly evolved in an interesting manner. Keep those eyes wide open, my trichromats, and go hug your dog.

Sunday, December 4, 2022

How do we connect characteristics to genes? (And vice-versa)

We oftentimes hear about studies saying things like "x gene has been shown to be responsible for y." Like this paper for instance, which found that mutations in the rhodopsin gene can affect vision in mice. That's awesome and cool, but how exactly do we go about connecting those two things? Well, turns out there are lots of ways, all of which depend on what exactly you are trying to find out.

The first method is called a "forward genetic screen." This is basically when you start out knowing what phenotype (the individual characteristic) you want to examine and are trying to find what genes are responsible for it. This involves making random mutations in your test subject, seeing what characteristics changed, and then going back and seeing what specific genes caused that change. One way of inducing mutations is through chemicals that introduce random changes into an organism's DNA, some of which (such as new stop codons) can alter or knock out a gene's function. Another technique involves exploiting RNA interference, in which you introduce short RNA molecules that prompt the cell to degrade matching mRNA, destroying the instructions needed to build certain proteins and thus also affecting the phenotype.

Steps for forward genetic screenings

The second method is called a "reverse genetic screen." This is when you are trying to find the function of a gene by connecting it to a specific phenotype. Similar techniques are used to disrupt DNA function, only this time they are not random.

A couple of handy guides!

It is important to note that oftentimes it is not a single gene but a network of genes that is responsible for a phenotype, which makes this more complicated. I've only scratched the surface regarding methods and technologies that can be used to determine gene functions. As always, reading about this stuff makes me thankful we live in an era in which analyzing genes is easier than ever.

Tuesday, June 28, 2022

Genome Mining and Microbes

No, not that type of mining!

That's more like it!

With the advent of next generation sequencing technologies, thousands of genomes have become available for scientists to use. As bioinformatic methods increase in their ability to sift through massive amounts of genomic data, biologists have begun searching microbes for genes that play key roles in metabolic pathways. These genes often encode secondary metabolites - molecules that are synthesized in response to environmental cues and that provide advantages to organisms. These molecules can help facilitate nutrient acquisition, create defense mechanisms against predatory organisms, and help resist toxic compounds. Often, the discovery of these secondary metabolites, and of the gene clusters that make them, has led to the creation of new life saving drugs!

Until recently, application of these techniques has mostly been focused on culturable microbes. However, with the creation of culture-independent microbiology methods, scientists have begun looking towards the world's largest environment, the oceans! Microalgae, marine protists, marine fungi and other microbial organisms in the oceans have begun to have their genomes sifted through to find genes that code for the synthesis of natural drugs that could be used by the pharmaceutical industry.

Thursday, March 10, 2022

Catching Invasive Species

We all know that invasive species cause a lot of problems. The introduced organisms are often so good at being themselves that, with the help of a lack of native predators, they outcompete already existing organisms. Famous examples of critters like lionfish, Asian shore crabs, zebra mussels and a multitude of others have wreaked havoc on local areas. One study estimated that biological invasions have cost North America roughly $26 billion a year. Thus it should be no surprise that finding ways of dealing with invasive species has been a top priority of scientists for decades. Enter environmental DNA.

As I have written about previously, I have had the fortune to work at an environmental DNA (eDNA) lab and learn quite a bit about the methods and research that are used. eDNA refers to genetic material that is just floating around nature in the form of shed skin or fecal matter. With the increase in more advanced bioinformatic methods we can scoop this DNA from the environment and compare it to DNA sequences in databases to find out what species it came from.

The basic steps of eDNA

But how does this help with combatting invasive species? Picture this. You work for a bio-monitoring program, scouting specific areas in a national park for possible abnormalities. One day you find that the beaches are already overrun with green crabs! It's too late. You now have a possible environmental disaster! eDNA offers a clever alternative: instead of monitoring invasive organisms by physically locating them after they arrive, scientists are now able to detect them before they have a chance to fully establish themselves!

Friday, February 4, 2022

Gene Editing Starts the Eradication of Salmon Viruses

Great news coming from The Roslin Institute in Scotland! Researchers have identified genes associated with resistance to a disease known as Infectious Pancreatic Necrosis (IPN) in Atlantic Salmon. Seeing as how salmon represent 4.6% of the global food supply with almost all of that being from aquaculture farms, you can see how this would be a pretty big deal. IPN is one of several diseases that can greatly disrupt aquaculture centers by infecting their salmon and causing high mortality rates. By finding the exact locations in Atlantic Salmon genomes that allow some of them to be naturally resistant to IPN, farmers can more accurately test for and select naturally resistant brood stock: the animals in a farm used for breeding purposes.

A dissected Salmon Parr infected with IPN (top) vs a healthy Parr (bottom)

But how exactly did scientists go about doing this? Well, first they performed what is called a "challenge experiment" in which they infected families of Atlantic Salmon and looked at the tanks that had the fewest dead salmon. The salmon in those tanks were deemed resistant whereas the other ones were deemed susceptible or intermediate to IPN. They then took two of the intermediate families, tested for their parents' genotypes, and analyzed their gene expression patterns for IPN QTL-linked markers.

QTL = Quantitative Trait Locus. It's a region of a chromosome, detected by statistical analysis, that is significantly associated with variation in a quantitative trait. Oftentimes, to find QTLs, scientists link them to specific genetic markers that exist in two distinguishable forms.

After looking into the QTL pattern differences in the salmon, the authors found a specific gene within this area that was the most differentially expressed: a gene called nae1. They then used CRISPR-Cas9, a widely used method for gene editing, to block nae1 and see if it really caused a major difference in IPN resistance. Their results show that indeed, blocking nae1 significantly reduced the salmon's ability to resist being infected by the virus! Exciting stuff!

Monday, January 3, 2022

Guppies and Y Chromosomes

Before this week I could tell you two things about guppies. They are small and they are popular with freshwater aquarium enthusiasts. It wasn't until I came across this paper published in the journal Genome Biology and Evolution about how guppies are being used to investigate sex chromosome differentiation that I realized just how cool they are. This was not a topic I was well versed in, but I thought the paper was interesting so I went ahead and did some investigating for myself to learn more about this subject.

Poecilia reticulata AKA The Guppy

As many people undoubtedly remember from their high school biology class, sex chromosomes, also referred to as allosomes, are what determine the sex of an organism. In humans typically XY develops male characteristics and XX develops female characteristics. This is an oversimplified version of the biology behind sex; in fact sex is not binary and biologists have for years viewed it more as a spectrum. Sex determination in humans (whether an embryo develops along a male or female pathway) is sort of understood by scientists through the discovery of the SRY gene. Some animals share similar methods of sex determination to us, whereas others have systems so completely wild that researchers are still scratching their heads about them.

Side note: Unsurprisingly, the platypus, much like everything else about it, has one of the strangest sex determination systems

So although we understand a little bit about how sex chromosomes work, scientists are still looking for answers on how this system evolved. Enter the guppy, whose allosomes are similar enough to humans' that it makes for a great organism to look into this. The chromosome carrying the sex determining gene has evolved regions that are non-recombining, meaning that some portions of the DNA do not get rearranged the way they do in regular chromosomes. This recombination suppression allows the X and Y chromosomes to diverge in shape and size, and it also makes the Y chromosome lose its gene functions, making it "genetically degenerate."

If that was a lot to read, I understand. I spent quite a bit of time making sure I was getting all the details correct.

To simplify:

No genetic recombination --> Y chromosome's genes are functionless

Yes genetic recombination --> Y chromosome's genes have (at least some) function

The key difference that makes the guppy unique for looking into this topic is that its Y chromosome is not fully degenerated like that of humans, as a result of incomplete recombination suppression. Thus researchers, like those in the paper I mentioned at the start, have used guppy species to analyze which situations accelerate or obstruct chromosome degeneration. Under normal circumstances the Y chromosome has a higher rate of mutation, making it more likely to pick up evolutionarily disadvantageous changes. Therefore, having some recombination can lead the Y chromosome to have the same rate of mutations as its X pairing, making it a little more stable.

The researchers of the paper actually go into a lot more detail by comparing three different guppy species' sex chromosomes, which they used to build phylogenetic trees to get a better understanding of when exactly recombination suppression developed. Their main point, besides that having not fully degenerated Y chromosomes can be beneficial for the guppy, is that the process behind recombination suppression is older than evolutionary biologists had once thought. So when exactly did it start? Who knows? Maybe more research into organisms like the guppy can provide us the answers! Overall though, these are important topics. Looking into the evolutionary history of sex chromosomes using species like the guppy can help us better understand our own biology and the complexity behind the development of sexes throughout the animal kingdom.