How Much Dirt is Too Much Dirt — Quality Metrics in Gene Expression Analysis

At twoXAR we bring together a lot of disparate data to rapidly identify disease treatments. It’s through these different data that we gain our predictive power. However, more data isn’t always better, especially if the new data is of poor quality. In other words, quantity doesn’t trump quality; as the common data science saying goes, bad data in = bad data out. Because of this, we check the quality of our input data at multiple levels; some of this is a manual process, but we automate as much as possible.

In July’s post, (ML)²: Myths and Legends of Machine Learning, I touched on the messiness of real-world data and mentioned quality control checks; here, I will expand on that with an example of one of the checks we use for gene expression data…


(ML)²: Myths and Legends of Machine Learning

Skepticism is (and should be) a vital part of any science; statistics and data science are no exception. Statistician George Box nicely summed it up when he said, “all models are wrong, but some are useful”. Box reminds us that statistical models are just that: models. A simplified representation of the real world will always have shortcomings. But we shouldn’t forget the last bit of Box’s saying: “some [models] are useful”. Although challenging, carefully constructed statistical models can be extremely…


Synergizing against breast cancer

I was about twelve when I found out my grandmother had breast cancer. My parents did a good job of shielding me from the worst of the details, but there is no way to avoid the fear that comes from a loved one being diagnosed with cancer. As a kid, there wasn’t much I could do, but my grandmother loves to tell the story of me trying to comfort her by telling her I was going to do research to help cure her cancer. Little did I know at the time that treating cancer is not as simple as taking a pill once a day and that even identifying the right medicine is akin to finding a needle in a haystack.

Over the next seventeen years, as I pursued undergraduate and graduate studies in biology and genetics, I filled in those knowledge gaps, but felt no closer to changing the status quo of breast cancer…


It’s All About the Gene Expression: How Genes are Turned On and Off in Disease

Hi guys, my name’s Aaron, and as a grad student researcher in Genetics at Stanford, I’m twoXAR’s resident gene expression nerd. And as a co-tinkerer on the twoXAR platform, I focus on finding and incorporating data to continuously improve our disease models, and in turn our algorithm’s predictions. In a previous post, Andrew gave a nice introduction about gene expression measurements and how we use them. Today, I want to give you a little more information on how gene expression—otherwise known as transcription—is biologically controlled and scientifically measured.

As Andrew explained in his excellent cookbook analogy, genes are instructions on how to make proteins, written in DNA. For a cell to execute those instructions, it must make RNA “photocopies” of genes that are relevant to its tasks. Cells therefore select which proteins to make by choosing a set of genes to transcribe from DNA to RNA. The genome has tens of thousands of instructions coding for everything from insulin to the enzymes that make dopamine to the keratin that makes your toenails. For a pancreas cell to do its job correctly, it has to pull out the instructions for the first and ignore the latter two. How do cells do this?

One major player in the gene regulation game is the transcription factor. Transcription factors are proteins that bind to specific sequences of DNA and kick off (or, in some cases, shut down) gene expression. You can think of them as “smart bookmarks” that find their way to the words that begin relevant chapters of DNA. But where do these bookmarks come from, and how’d they get so smart anyways?

It turns out, a lot of them work two shifts, acting as both transcription factors and signaling proteins: molecules that report the signals a cell receives.  So, a transcription factor will hear the hormones in the cell’s environment shouting, “We need more insulin, STAT!”, and hurry to the DNA to open up the insulin chapter (there’s a little pun in there for you signaling geeks). Once the right transcription factor bookmarks arrive at the right gene chapter, other proteins will come to that page and transcribe its DNA instructions into RNA, allowing protein synthesis.

And there’s an even simpler level at which gene expression is regulated: some pages of the DNA book are open and easily flipped through, while other pages can be temporarily glued shut, preventing bookmarks from finding their way to their chapter headings. The varied accessibility of different DNA regions in the genome is one of the things studied under the banner of epigenetics, and it’s so neat that I’m doing a whole PhD about it! Those interested in learning more about this hot, up-and-coming field are encouraged to start here.

But how does all this relate to twoXAR? Well, we’re in the business of finding new roles for drugs in human disease. Human disease manifests through changes in gene expression: DNA pages that are supposed to be sealed become opened, transcription factor bookmarks land excessively on some chapters and insufficiently on others, and the selection and number of RNA photocopies get out of whack. At twoXAR, we compare the gene expression profiles of disease patients versus healthy individuals, and identify the proteins that correspond to each gene, which become the starting points for our drug discovery algorithms, as described here. All very well and good, you say, but what the heck’s a gene expression profile, and how do you get your hands on one?

Some of our gene expression data comes from published databases of federally funded human research. Each dataset indicates the number of RNA photocopies that have been made for thousands of genes in a certain tissue (such as blood, muscle, or brain biopsy samples) from patients and healthy controls. If you’re wondering what kind of healthy people let scientists biopsy their brains, the answer is dead ones; more on that in our next post!
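To make the patient-versus-control comparison a bit more concrete, here is a minimal Python sketch of one common way to summarize it: a log2 fold change per gene. This is an illustration only, not a description of twoXAR’s actual pipeline. The counts are made up, and the function names (`log2_fold_change`, `expression_changes`) and the pseudocount of 1 are assumptions chosen for the example.

```python
import math
from statistics import mean

# Hypothetical toy data: RNA copy counts per gene, one number per sample.
# The gene symbols are real, but every count here is invented for illustration.
patients = {
    "INS": [120, 95, 110],
    "TP53": [400, 420, 380],
    "KRT5": [50, 55, 45],
}
controls = {
    "INS": [480, 440, 400],
    "TP53": [410, 390, 400],
    "KRT5": [48, 52, 50],
}


def log2_fold_change(disease, healthy, pseudocount=1.0):
    """log2 of (mean disease expression / mean healthy expression).

    The pseudocount (an assumption of this sketch) avoids dividing by zero
    when a gene is not detected at all in one group.
    """
    return math.log2((mean(disease) + pseudocount) / (mean(healthy) + pseudocount))


def expression_changes(patients, controls):
    """Compare every gene measured in both groups."""
    return {
        gene: log2_fold_change(patients[gene], controls[gene])
        for gene in patients
        if gene in controls
    }


changes = expression_changes(patients, controls)

# Print genes with the largest changes (up or down) first.
for gene, lfc in sorted(changes.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{gene}: {lfc:+.2f}")
```

A negative value means fewer RNA photocopies in patients than in controls, and a positive value means more. In this toy data, INS comes out strongly negative while the other two genes barely move. A real analysis would also need a statistical test and many more samples before calling a gene "changed".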

The last thing I want to tell you about is how these RNA measurements are made. We use data collected via two methods: the older (and here we’re still talking only 20 years or so) RNA microarray, and the powerful new kid on the block, RNA sequencing (RNA-seq). The first step in both of these processes is extracting RNA from tissue samples, and immediately converting it back into its more stable cousin, DNA (through a process unsurprisingly called Reverse Transcription); since this DNA is “complementary” to the RNA sequences in each sample, it’s referred to as “cDNA”.
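For the programmers in the audience, the "complementary" part of cDNA can be sketched in a few lines of Python. This only models the base-pairing rule (A pairs with T, U with A, G with C, C with G), not the enzyme chemistry, and the function name `reverse_transcribe` and the example sequence are made up for illustration.

```python
# Base-pairing rules for copying RNA into DNA (RNA base -> DNA base).
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna):
    """Return the cDNA strand complementary to an RNA sequence.

    Both the input and the output are written 5' to 3', which is why the
    sequence is reversed as well as complemented: the two strands of a
    double helix run in opposite directions.
    """
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna.upper()))


rna = "AUGGCC"  # hypothetical six-letter RNA snippet
cdna = reverse_transcribe(rna)
print(cdna)  # prints "GGCCAT"
```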

Where these two methods differ is in their mode and range of detection. To run a microarray, you first pick which genes you want to measure, synthesize DNA molecules for each of those genes, and stick thousands of copies of each gene’s molecule at a specific location on a glass slide. You then label your cDNA with fluorescent chemicals, and run it over said glass slide. If one of the genes you put on the chip was expressed in your sample tissue, the fluorescent cDNA for that gene will stick to it, because two complementary pieces of DNA that contact each other will form a double helix. The more RNA copies of that gene in your sample, the more fluorescence will accumulate at that spot on the chip, which can be quantified.

In contrast, RNA-seq takes a simpler but more expensive approach: take your whole batch of cDNA, and sequence (i.e., use a machine to ‘read’ the cDNA) the sucker! Rather than picking out individual genes to measure, RNA-seq takes an unbiased approach and measures everything. As the experimental costs of sequencing and the computational costs of analyzing such large data sets are both going down, this next-generation approach is becoming more prevalent in both the research community and in the twoXAR databases.
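The difference in range of detection can be sketched with a toy Python example. It is heavily simplified: a real microarray reports fluorescence intensity rather than counts, and real RNA-seq reads have to be matched to genes first. The read list, the probe list, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical sequencing output: each cDNA read already matched to a gene.
# "NOVEL1" is a made-up name standing in for a transcript nobody thought to look for.
reads = ["INS", "GCG", "INS", "ACTB", "INS", "ACTB", "NOVEL1"]


def rnaseq_profile(reads):
    """RNA-seq style: count every gene that shows up in the sample."""
    return Counter(reads)


def microarray_profile(reads, probes):
    """Microarray style: only genes with a probe on the chip give a signal.

    Counts stand in for fluorescence here, which is a simplification.
    """
    counts = Counter(reads)
    return {gene: counts[gene] for gene in probes}


def counts_per_million(counts):
    """Scale raw counts so samples with different read totals are comparable."""
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


seq = rnaseq_profile(reads)
chip = microarray_profile(reads, probes=["INS", "ACTB", "ALB"])

print(sorted(seq))   # every gene seen, including NOVEL1
print(sorted(chip))  # only the genes we chose to put on the chip
```

The chip reports a zero for ALB, since it had a probe but no matching cDNA, and never sees NOVEL1 at all, while the sequencing profile picks up whatever is in the sample.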

Phew! And there you have it, a handy knowledge dump from your friendly neighborhood geneticist. I hope it helps provide a clearer picture of the methods behind our madness.