How Much Dirt is Too Much Dirt — Quality Metrics in Gene Expression Analysis

At twoXAR we bring together a lot of disparate data to rapidly identify disease treatments. It’s through these different data that we gain our predictive power. However, more data isn’t always better — not if the new data is of poor quality. In other words, quantity doesn’t trump quality, because of the old data science adage: garbage in, garbage out. Because of this, we check the quality of our input data at multiple levels; some of this is a manual process, but we automate as much as possible.

In July’s post, (ML)²: Myths and Legends of Machine Learning, I touched on the messiness of real world data and mentioned quality control checks; here, I will expand on that with an example of one of the checks we use for gene expression data…


Synergizing against breast cancer

I was about twelve when I found out my grandmother had breast cancer. My parents did a good job of shielding me from the worst of the details, but there is no way to avoid the fear that comes from a loved one being diagnosed with cancer. As a kid, there wasn’t much I could do, but my grandmother loves to tell the story of me trying to comfort her by telling her I was going to do research to help cure her cancer. Little did I know at the time that treating cancer is not as simple as taking a pill once a day and that even identifying the right medicine is akin to finding a needle in a haystack.

Over the next seventeen years, as I pursued undergraduate and graduate studies in biology and genetics, I filled in those knowledge gaps, but felt no closer to changing the status quo of breast cancer…


Night of the Dead’s Living Data

We often speak of our trove of gene expression data: RNA measurements from different human tissues, which allow us to identify genes that are expressed abnormally in disease patients compared to healthy people. By the time it gets to us, that RNA has been converted first into cDNA, then into a microarray or RNA-seq readout, then into a publication, and finally into an entry in a neat public database. But as with babies and sausages, we must eventually pause to consider where this RNA comes from. The answer, especially for brain diseases, is often cadavers (otherwise known as dead people).

Realizing that so much scientific knowledge comes from the dearly departed initially gave me the heebie-jeebies. I knew there were no other options, as brain biopsies are incredibly unpopular among the living. But weren’t readouts from dead tissues vastly different from live ones? My naïve intuition was that biological readouts would be like the electronic displays that report system diagnostics on my motorcycle: once the machine’s been turned off, the measurements become significantly less accurate reflections of the bike’s functioning state.

However, apparently one cannot extrapolate this logic from hogs to humans. It turns out that RNA, particularly in brain tissue, is quite stable post-mortem, and a reliable snapshot of brain function in life. Post-mortem protein measurements can be very robust as well; a recent study of more than 3,600 human cadaver brains has shifted the paradigm on which protein is the primary driver of Alzheimer’s Disease.

In a way, twoXAR’s work corroborates this principle. Our gene expression-based models of Parkinson’s Disease, schizophrenia, and Alzheimer’s Disease yield excellent predictions of known treatments and exciting, sensible repurposing candidates. Thus, I have come to acknowledge that like zombies, “undead data” can be surprisingly powerful.

It’s All About the Gene Expression: How Genes are Turned On and Off in Disease

Hi guys, my name’s Aaron, and as a grad student researcher in Genetics at Stanford, I’m twoXAR’s resident gene expression nerd. And as a co-tinkerer on the twoXAR platform, I focus on finding and incorporating data to continuously improve our disease models, and in turn our algorithm’s predictions. In a previous post, Andrew gave a nice introduction to gene expression measurements and how we use them. Today, I want to give you a little more information on how gene expression — otherwise known as transcription — is biologically controlled and scientifically measured.

As Andrew explained in his excellent cookbook analogy, genes are instructions on how to make proteins, written in DNA. For a cell to execute those instructions, it must make RNA “photocopies” of genes that are relevant to its tasks. Cells therefore select which proteins to make by choosing a set of genes to transcribe from DNA to RNA. The genome has tens of thousands of instructions coding for everything from insulin to dopamine to the stuff that makes your toenails. For a pancreas cell to do its job correctly, it has to pull out the instructions for the first and ignore the latter two. How do cells do this?

One major player in the gene regulation game is the transcription factor. Transcription factors are proteins that bind to specific sequences of DNA and kick off gene expression. You can think of them as “smart bookmarks” that find their way to the words that begin relevant chapters of DNA.  But where do these bookmarks come from, and how’d they get so smart anyways?

It turns out, a lot of them work two shifts, acting as both transcription factors and signaling proteins: molecules that report the signals a cell receives.  So, a transcription factor will hear the hormones in the cell’s environment shouting, “We need more insulin, STAT!”, and hurry to the DNA to open up the insulin chapter (there’s a little pun in there for you signaling geeks). Once the right transcription factor bookmarks arrive at the right gene chapter, other proteins will come to that page and transcribe its DNA instructions into RNA, allowing protein synthesis.

And there’s an even simpler level at which gene expression is regulated: some pages of the DNA book are open and easily flipped through, while other pages can be temporarily glued shut, preventing bookmarks from finding their way to their chapter headings. The varied accessibility of different DNA regions in the genome is referred to as epigenetics, and it’s so neat that I’m doing a whole PhD about it! Those interested in learning more about this hot, up-and-coming field are encouraged to start here.

But how does all this relate to twoXAR? Well, we’re in the business of finding new roles for drugs in human disease. Human disease manifests through changes in gene expression: DNA pages that are supposed to be sealed become opened, transcription factor bookmarks land excessively on some chapters and insufficiently on others, and the selection and number of RNA photocopies get out of whack. At twoXAR, we compare the gene expression profiles of disease patients versus healthy individuals, and identify the proteins that correspond to each gene, which become the starting points for our drug discovery algorithms, as described here. All very well and good, you say, but what the heck’s a gene expression profile, and how do you get your hands on one?
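The comparison of patient and healthy expression profiles can be sketched in a few lines. This is a deliberately simplified illustration — the gene names, values, and the 2-fold cutoff are hypothetical, and real analyses use proper statistical models rather than a bare fold-change threshold:

```python
from math import log2
from statistics import fmean

# Toy per-gene expression values for two groups; all names and numbers
# are invented for illustration.
disease = {"GENE_A": [8.0, 9.0, 8.5], "GENE_B": [2.0, 2.2, 1.9], "GENE_C": [4.0, 4.1, 3.9]}
healthy = {"GENE_A": [2.0, 2.1, 1.9], "GENE_B": [2.1, 2.0, 2.2], "GENE_C": [4.0, 3.9, 4.2]}

def differential_genes(disease, healthy, min_abs_log2fc=1.0):
    """Return genes whose mean expression differs at least 2-fold between groups."""
    hits = {}
    for gene in disease:
        fold_change = log2(fmean(disease[gene]) / fmean(healthy[gene]))
        if abs(fold_change) >= min_abs_log2fc:
            hits[gene] = round(fold_change, 2)
    return hits

print(differential_genes(disease, healthy))  # only GENE_A is over-expressed in disease
```

Genes that pass such a comparison — here, the hypothetical GENE_A — point to proteins whose activity differs in disease, which is exactly the kind of starting point the drug discovery algorithms take as input.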

Some of our gene expression data comes from published databases of federally funded human research. Each dataset indicates the number of RNA photocopies that have been made for thousands of genes in a certain tissue (such as blood, muscle, or brain biopsy samples) from patients and healthy controls. If you’re wondering what kind of healthy people let scientists biopsy their brains, the answer is dead ones; more on that in our next post!

The last thing I want to tell you about is how these RNA measurements are made. We use data collected via two methods: the older (and here we’re still talking only 20 years or so) RNA microarray, and the powerful new kid on the block, RNA sequencing (RNA-seq). The first step in both of these processes is extracting RNA from tissue samples, and immediately converting it back into its more stable cousin, DNA (through a process unsurprisingly called Reverse Transcription); since this DNA is “complementary” to the RNA sequences in each sample, it’s referred to as “cDNA”.

Where these two methods differ is in their mode and range of detection. To run a microarray, you first pick which genes you want to measure, synthesize DNA molecules for each of those genes, and stick thousands of copies of each gene’s molecule at specific locations on a glass slide. You then label your cDNA with fluorescent chemicals, and run it over said glass slide. If one of the genes you put on the chip was expressed in your sample tissue, the fluorescent cDNA for that gene will stick to it, because two complementary pieces of DNA that contact each other will form double helices. The more copies of that gene in your sample, the more fluorescence will accumulate at that spot on the chip, which can be quantified. In contrast, RNA-seq takes a simpler, but more expensive approach: take your whole batch of cDNA, and sequence (i.e. use a machine to ‘read’ the cDNA) the sucker! Rather than picking out individual genes to measure, RNA-seq takes an unbiased approach and measures everything. As the experimental costs of sequencing and the computational costs of analyzing such large data sets are both going down, this next-generation approach is becoming more prevalent in both the research community and in the twoXAR databases.
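For the curious, here is a toy sketch of the very last step of RNA-seq quantification. Once reads are matched to genes, the raw counts are normalized — a simple and common scheme is counts per million (CPM) — so that samples sequenced at different depths can be compared. The gene names and counts below are made up for illustration:

```python
# Hypothetical raw read counts per gene from one RNA-seq sample.
raw_counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 8000}

def counts_per_million(counts):
    """Scale raw read counts so that each sample's counts sum to one million."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

cpm = counts_per_million(raw_counts)
print(cpm["GENE_C"])  # 800000.0 -> GENE_C accounts for 80% of the reads
```

After this normalization, a gene’s value means the same thing in every sample, whether the sequencer produced ten million reads or a hundred million.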

Phew! And there you have it, a handy knowledge dump from your friendly neighborhood geneticist. I hope it helps provide a clearer picture of the methods behind our madness.

Commercial Successes in Translational Bioinformatics

As we discussed in our previous post, our work at twoXAR is best described as translational informatics: using big data methods to bring novel, experimental research closer to the clinic. While this exciting field is still relatively young, significant advances have been made in the last fifteen years. As the life sciences’ capacity to generate large, comprehensive, and unbiased biological data sets (“-omics”) has grown, so has the need for powerful data science to scale their digital mountains of results. In this post, I’d like to highlight how small, innovative data science companies have already begun to contribute to this massive project, and how twoXAR is poised to meet current needs in the field.

In the late 1990s, the progress of the human genome project and the advent of molecular profiling technologies such as DNA microarrays set the stage for the “Cambrian explosion” of big data in biology. Some of the first innovations in the generation and decoding of large molecular profiling data sets came from Rosetta InPharmatics, which was acquired by Merck in 2001 for $620M. Rosetta’s early work established computational pattern recognition techniques that allowed researchers to detect cellular gene expression changes induced by treatment with pharmacological compounds.

This pioneering work paved the way for other translational informatics companies, which are tackling current big data challenges in a variety of ways. For example, Ingenuity Systems, a Redwood City-based bioinformatics company, has developed tools that provide researchers with curated literature summaries and rapid statistical analyses. Their software is currently a well-cited resource, with subscriptions purchased by many academic labs. In 2013, Ingenuity Systems was acquired by the research technology company Qiagen for $105M. Other companies, such as the San Jose and Bangalore-based Cellworks, generate computational simulations of disease states to screen drug candidates for clinical development. Their ordinary differential equation (ODE)-based models have led to successful collaborations with academic scientists, and the advancement of drug candidates to the validation stage. Their efforts have been supported by grants from the Wellcome Trust, and by investments from Artiman Ventures and Sequoia Capital.

These companies, and others like them, demonstrate how diverse data science approaches from small groups of computational innovators can address the challenges of translational informatics in impactful ways. The unique machine learning and data mining techniques we have developed at twoXAR are thus joining a young but powerful arsenal of modern tools for modern biology.