Synergizing against breast cancer

I was about twelve when I found out my grandmother had breast cancer. My parents did a good job of shielding me from the worst of the details, but there is no way to avoid the fear that comes from a loved one being diagnosed with cancer. As a kid, there wasn’t much I could do, but my grandmother loves to tell the story of me trying to comfort her by telling her I was going to do research to help cure her cancer. Little did I know at the time that treating cancer is not as simple as taking a pill once a day, and that even identifying the right medicine is akin to finding a needle in a haystack.

Over the next seventeen years, as I pursued undergraduate and graduate studies in biology and genetics, I filled in those knowledge gaps, but felt no closer to changing the status quo of breast cancer…


What We Do: The A,B,C’s of twoXAR, Part III

So far, we’ve told you about various pieces of the twoXAR puzzle: our goal of identifying new drugs to treat human disease, our machine learning methods for drug classification, and our overall vision to improve lives more quickly and efficiently through data science. Now, I’d like to connect the dots by walking you through our drug discovery process for one of our major disease targets, Type II Diabetes—a disease that as of 2012 affects one out of every ten people in the United States.

So, where to begin? Step 1 in data science: collect data! Our first task is to identify the molecular differences between health and disease through published gene expression profiling databases. While over 99% of the human genome is shared by all members of the species, everybody’s cells exhibit differences in the expression of this genome: the transcribing of DNA instructions into RNA messengers, which are ultimately translated into protein products that execute biological functions. In other words: if the genome is a giant cookbook, then each cell type opens up a different set of pages to photocopy recipes (RNA), and follows those instructions to create a specific set of foods (proteins), which the cell will then use. Importantly, the particular recipes copied, and the number of copies made, are often changed during disease, leading to aberrant protein accumulation, which disrupts normal cellular function. Gene expression profiling studies identify these differences for thousands of distinct RNA instructions, creating a global picture of the ways in which diseased cells have gone awry.
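To make the "number of copies" idea concrete, here is a minimal sketch of one common way to compare expression between two groups: a log2 fold change per gene. It is not a description of twoXAR's actual pipeline. The gene names, the expression values, the pseudocount, and the threshold are all made up for illustration, and real profiling studies add statistical testing on top of a simple ratio like this.

```python
import math
from statistics import mean

def log2_fold_changes(healthy, disease, pseudocount=1.0):
    """Return {gene: log2(mean disease level / mean healthy level)}.

    `healthy` and `disease` map a gene name to a list of expression
    values, one per sample. The pseudocount (an assumption of this
    sketch) keeps the ratio defined when a gene is barely expressed.
    """
    changes = {}
    for gene in healthy.keys() & disease.keys():
        h = mean(healthy[gene]) + pseudocount
        d = mean(disease[gene]) + pseudocount
        changes[gene] = math.log2(d / h)
    return changes

def altered_genes(changes, threshold=1.0):
    """Keep genes whose expression at least doubled or halved."""
    return {gene for gene, lfc in changes.items() if abs(lfc) >= threshold}

# Hypothetical expression values, invented for this example.
healthy = {"GENE_A": [100, 110, 90], "GENE_B": [50, 55, 45], "GENE_C": [10, 12, 8]}
disease = {"GENE_A": [400, 420, 380], "GENE_B": [52, 48, 50], "GENE_C": [2, 3, 1]}

changes = log2_fold_changes(healthy, disease)
altered = altered_genes(changes)
print(sorted(altered))
```

In this toy data, GENE_A is copied far more often in disease and GENE_C far less often, while GENE_B is unchanged, so only the first two are flagged.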

The proteins that are made using these disease-altered RNA recipes are therefore attractive targets for drug intervention. However, proteins rarely act alone: they team up with a large network of fellow proteins in order to do their jobs in the cell. Thus, drugs that target the “coworkers” of disease-associated proteins may also prove to be effective therapies in the clinic. We therefore made use of a marvelously curated database of known protein-protein interactions to identify the buddies of proteins whose RNA instructions are significantly altered in Type II Diabetes patients, compared to healthy subjects. What’s more, this database also provides lists of drugs that interact with the disease-associated proteins and their friends.
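The "coworkers" step can be sketched in a few lines. This is only an illustration under stated assumptions: the protein and drug identifiers are placeholders, the interaction list stands in for whatever a curated database would provide, and only direct interaction partners are considered. The real database and its format are not described in this post.

```python
def expand_with_partners(altered_proteins, interactions):
    """Add the direct interaction partners of the altered proteins.

    `interactions` is an iterable of (protein, protein) pairs and is
    treated as undirected.
    """
    partners = set()
    for a, b in interactions:
        if a in altered_proteins:
            partners.add(b)
        if b in altered_proteins:
            partners.add(a)
    return set(altered_proteins) | partners

def candidate_drugs(proteins, drug_targets):
    """Return {drug: targets that fall inside the protein set}."""
    hits = {}
    for drug, targets in drug_targets.items():
        overlap = targets & proteins
        if overlap:
            hits[drug] = overlap
    return hits

# Placeholder identifiers, invented for this example.
altered = {"P1"}
interactions = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
drug_targets = {"drug_x": {"P2"}, "drug_y": {"P5"}, "drug_z": {"P1", "P3"}}

neighborhood = expand_with_partners(altered, interactions)
hits = candidate_drugs(neighborhood, drug_targets)
print(hits)
```

Here drug_x is picked up only because it targets a partner (P2) of the altered protein, which is exactly the kind of candidate that would be missed by looking at disease-altered proteins alone.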

Here’s our pot of gold: these drugs have the potential to compensate for or correct the changes caused by disease! The catch is, the gold is actually a massive stack of hay, with a few dozen golden needles buried inside. The human brain alone can’t make sense of thousands of data points from RNA profiles, protein networks and drug interactions. And no pharmaceutical company will ever pour R&D funds into investigating these connections one by one. Enter: the machine.

We developed some super sweet computer algorithms to map and simplify these large, diverse and unbiased datasets into a single biological model of disease. This model is then used to quantify each drug’s relevance to Type II Diabetes through machine learning. Machine learning provides a rigorous, unbiased way to predict which factors are relevant to a disease. As humans, it’s impossible to sort through a pile of data looking for relevant hits without our prior beliefs creeping in. A computer, on the other hand, doesn’t privilege or discriminate against any correlations it finds. Thus, machine learning enables the discovery of drug-disease connections that researchers might never have considered on their own.

After the machine has made its predictions, we then use our human knowledge to verify its performance. Sprinkled throughout the haystack are a few known golden needles: chemical compounds that are already being used or developed to treat Type II Diabetes. What did our brilliant but ignorant machine think of these drugs?

Top 25 treatments for T2D ranked by twoXAR's algorithm

Our model successfully predicted the relevance of more than thirty drugs known to affect Type II Diabetes (see Figure above, blue bars), including both currently used clinical therapies and promising candidates that have shown significant effects in animal studies. For example, Bromocriptine, a top hit identified by our model, is an FDA-approved therapy that improves blood sugar levels and other hallmarks of diabetes. Meanwhile, NADH, another drug that was highly ranked by our algorithm, improves glucose tolerance and insulin sensitivity in both diet- and age-induced models of Type II Diabetes in mice. Our model also highly ranks many common treatments such as insulin therapy.
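One simple way to run this kind of sanity check is to sort the drugs by score and count how many already-known treatments land near the top. The sketch below assumes the model produces one numeric score per drug. The drug names and scores are placeholders, not twoXAR output.

```python
def known_hits_in_top_k(scores, known, k):
    """Return the known treatments found among the k highest-scoring drugs.

    `scores` maps a drug name to a relevance score (higher is better);
    `known` is the set of drugs already used or studied for the disease.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [drug for drug in ranked[:k] if drug in known]

# Placeholder names and scores, invented for this example.
scores = {"drug_a": 0.9, "drug_b": 0.8, "drug_c": 0.4, "drug_d": 0.3, "drug_e": 0.1}
known = {"drug_a", "drug_d"}

top_hits = known_hits_in_top_k(scores, known, k=3)
print(top_hits)
```

If known treatments show up in the top ranks more often than chance would suggest, that is some evidence the ranking is picking up real biology, and the unknown drugs sitting next to them become more interesting.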

These successes validate the predictive power of our methods, and tantalizingly hint at the therapeutic potential of those highly ranked candidates whose effects on Type II Diabetes are currently unknown (depicted, appropriately, as “golden needles” in the chart above). Now it’s time to start sewing…


Commercial Successes in Translational Bioinformatics

As we discussed in our previous post, our work at twoXAR is best described as translational informatics: using big data methods to bring novel, experimental research closer to the clinic. While this exciting field is still relatively young, significant advances have been made in the last fifteen years. As the life sciences’ capacity to generate large, comprehensive, and unbiased biological data sets (“-omics”) has grown, so has the need for powerful data science to scale their digital mountains of results. In this post, I’d like to highlight how small, innovative data science companies have already begun to contribute to this massive project, and how twoXAR is poised to meet current needs in the field.

In the late 1990s, the progress of the Human Genome Project and the advent of molecular profiling technologies such as DNA microarrays set the stage for the “Cambrian explosion” of big data in biology. Some of the first innovations in the generation and decoding of large molecular profiling data sets came from Rosetta Inpharmatics, which was acquired by Merck in 2001 for $620M. Rosetta’s early work established computational pattern recognition techniques that allowed researchers to detect cellular gene expression changes induced by treatment with pharmacological compounds.

This pioneering work paved the way for other translational informatics companies, which are tackling current big data challenges in a variety of ways. For example, Ingenuity Systems, a Redwood City-based bioinformatics company, has developed tools that provide researchers with curated literature summaries and rapid statistical analyses. Their software is currently a well-cited resource, with subscriptions purchased by many academic labs. In 2013, Ingenuity Systems was acquired by the research technology company Qiagen for $105M. Other companies, such as the San Jose and Bangalore-based Cellworks, generate computational simulations of disease states to screen drug candidates for clinical development. Their ordinary differential equation (ODE)-based models have led to successful collaborations with academic scientists, and the advancement of drug candidates to the validation stage. Their efforts have been supported by grants from the Wellcome Trust, and by investments from Artiman Ventures and Sequoia Capital.

These companies, and others like them, demonstrate how diverse data science approaches from small groups of computational innovators can address the challenges of translational informatics in impactful ways. The unique machine learning and data mining techniques we have developed at twoXAR are thus joining a young but powerful arsenal of modern tools for modern biology.

Recreational Protein Folding

A favorite pastime of bio-geeks is to search the amino acid chains of proteins for everyday words or phrases. This is, of course, a silly endeavor, as English words spelled out in the single-letter codes that represent each amino acid might as well be random noise as far as Mother Nature is concerned.

But I couldn’t help looking. Given that my name and my co-founder’s name have been the source of much fun, I went looking for it.

The closest match to ANDREWRADIN in protein language is in an oxidoreductase protein expressed in a genus of bacteria known as Sulfitobacter. This best-match amino acid sequence is AENREWRADI. Hmm, not exactly a perfect fit.
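For the curious, here is a toy version of this kind of search: slide the query along a sequence and keep the window with the fewest mismatched letters. This is an assumption-laden sketch, not the tool that found the match above. Real protein searches (BLAST, for example) also allow gaps and score substitutions by biochemical similarity. The sequence below is an invented string that simply embeds the fragment quoted above. It is not the real oxidoreductase sequence.

```python
def best_window_match(query, sequence):
    """Return (start, window, mismatches) for the closest same-length window.

    Compares the query against every window of the same length in the
    sequence and counts letter-by-letter mismatches (Hamming distance).
    Ties go to the earliest window.
    """
    if len(sequence) < len(query):
        raise ValueError("sequence is shorter than the query")
    best = None
    for start in range(len(sequence) - len(query) + 1):
        window = sequence[start:start + len(query)]
        mismatches = sum(q != w for q, w in zip(query, window))
        if best is None or mismatches < best[2]:
            best = (start, window, mismatches)
    return best

# Invented toy sequence: the quoted fragment with made-up flanking letters.
toy_sequence = "MKTAENREWRADIGG"
result = best_window_match("ANDREWRADI", toy_sequence)
print(result)
```

Against this toy string, the best window is the quoted fragment itself, two letters off from the first ten letters of the name.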

My namefellow decided, after I shared my dissatisfaction with him, to head over to RaptorX and render an imaginary protein composed of CARLFARRINGTANANDREWRADINANDREWRADINHEFFSAN, which represents the names of some of the people who are working with us at twoXAR. With only twenty-two letters of the alphabet available as codes for amino acids, some liberties had to be taken. For your viewing pleasure, an image of the CARLFARRINGTANANDREWRADINANDREWRADINHEFFSAN protein is included with this post. See the pink, swirly pasta noodle thing? I was surprised to see an alpha helix in what are otherwise random characters. Perhaps there is some order in the randomness of our names after all.