So far, we’ve told you about various pieces of the twoXAR puzzle: our goal of identifying new drugs to treat human disease, our machine learning methods for drug classification, and our overall vision to improve lives more quickly and efficiently through data science. Now, I’d like to connect the dots by walking you through our drug discovery process for one of our major disease targets, Type II Diabetes—a disease that as of 2012 affects one out of every ten people in the United States.
So, where to begin? Step 1 in data science: collect data! Our first task is to identify the molecular differences between health and disease through published gene expression profiling databases. While over 99% of the human genome is shared by all members of the species, everybody’s cells exhibit differences in the expression of this genome: the transcribing of DNA instructions into RNA messengers, which are ultimately translated into protein products that execute biological functions. In other words: if the genome is a giant cookbook, then each cell type opens up a different set of pages to photocopy recipes (RNA), and follows those instructions to create a specific set of foods (proteins), which the cell will then use. Importantly, the particular recipes copied, and the number of copies made, are often changed during disease, leading to aberrant protein accumulation, which disrupts normal cellular function. Gene expression profiling studies identify these differences for thousands of distinct RNA instructions, creating a global picture of the ways in which diseased cells have gone awry.
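To make the idea of "changed recipe copies" concrete, here is a minimal Python sketch of one common way to compare expression between two groups: a log2 fold change per gene. It is only an illustration and not the analysis described in this post. The gene names, the numbers, the pseudocount and the threshold are all invented, and real studies typically add statistical testing and multiple-testing correction on top.

```python
import math
from statistics import mean


def log2_fold_changes(disease, healthy, pseudocount=1.0):
    """Return {gene: log2(mean disease level / mean healthy level)}.

    `disease` and `healthy` map a gene name to a list of expression
    values, one per sample. The pseudocount (an assumption of this
    sketch) avoids dividing by zero for genes with no expression.
    """
    changes = {}
    for gene in disease.keys() & healthy.keys():
        d = mean(disease[gene]) + pseudocount
        h = mean(healthy[gene]) + pseudocount
        changes[gene] = math.log2(d / h)
    return changes


def altered_genes(changes, threshold=1.0):
    """Genes whose expression at least doubled or halved (|log2 FC| >= 1)."""
    return {gene for gene, lfc in changes.items() if abs(lfc) >= threshold}


# Toy data: hypothetical genes and made-up expression values.
disease = {"GENE_A": [31.0, 31.0], "GENE_B": [7.0, 7.0], "GENE_C": [3.0, 3.0]}
healthy = {"GENE_A": [7.0, 7.0], "GENE_B": [7.0, 7.0], "GENE_C": [15.0, 15.0]}

changes = log2_fold_changes(disease, healthy)
altered = altered_genes(changes)
```

In this toy example GENE_A is copied more often in disease, GENE_C less often, and GENE_B is unchanged, so only GENE_A and GENE_C end up in `altered`.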
The proteins that are made using these disease-altered RNA recipes are therefore attractive targets for drug intervention. However, proteins rarely act alone: they team up with a large network of fellow proteins in order to do their jobs in the cell. Thus, drugs that target the “coworkers” of disease-associated proteins may also prove to be effective therapies in the clinic. We therefore made use of a marvelously curated database of known protein-protein interactions to identify the buddies of proteins whose RNA instructions are significantly altered in Type II Diabetes patients, compared to healthy subjects. What’s more, this database also provides lists of drugs that interact with the disease-associated proteins and their friends.
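The "coworker" lookup above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the database or code used in this post: the protein names, the interaction pairs and the drug-to-target table are all hypothetical, and only direct interaction partners are considered.

```python
def expand_to_neighbors(seed_proteins, interactions):
    """Return the seed proteins plus their direct interaction partners.

    `interactions` is an iterable of (protein, protein) pairs and is
    treated as undirected.
    """
    neighbors = {}
    for a, b in interactions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    expanded = set(seed_proteins)
    for protein in seed_proteins:
        expanded |= neighbors.get(protein, set())
    return expanded


def candidate_drugs(proteins, drug_targets):
    """Return {drug: targets that fall inside `proteins`}, skipping drugs with none."""
    hits = {}
    for drug, targets in drug_targets.items():
        overlap = targets & proteins
        if overlap:
            hits[drug] = overlap
    return hits


# Toy data: every name below is invented for illustration.
seeds = {"P1"}  # proteins whose RNA is altered in disease
interactions = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
drug_targets = {
    "drug_x": {"P2"},
    "drug_y": {"P3", "P5"},
    "drug_z": {"P1", "P4"},
}

network = expand_to_neighbors(seeds, interactions)
candidates = candidate_drugs(network, drug_targets)
```

Here `drug_x` is picked up only because it hits P2, a partner of the disease-associated protein P1, which is exactly the kind of indirect connection the network step is meant to surface.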
Here’s our pot of gold: these drugs have the potential to compensate or correct for the changes caused by disease! The catch is, the gold is actually a massive stack of hay, with a few dozen golden needles buried inside. The human brain alone can’t make sense of thousands of data points from RNA profiles, protein networks and drug interactions. And no pharmaceutical company will ever pour R&D funds into investigating these connections one by one. Enter: the machine.
We developed some super sweet computer algorithms to map and simplify these large, diverse and unbiased datasets into a single biological model of disease. This model is then used to quantify each drug’s relevance to Type II Diabetes through machine learning. Machine learning provides a rigorous, unbiased way to predict which factors are relevant to a disease. For us humans, it’s impossible to sort through a pile of data looking for relevant hits without our prior beliefs creeping in. A computer, on the other hand, doesn’t privilege or discriminate against any correlations it finds. Thus, machine learning enables the discovery of drug-disease connections that researchers may have never considered on their own.
After the machine has made its predictions, we then use our human knowledge to verify its performance. Sprinkled throughout the haystack are a few known golden needles: chemical compounds that are already being used or developed to treat Type II Diabetes. What did our brilliant but ignorant machine think of these drugs?
Our model successfully predicted the relevance of more than thirty drugs known to affect Type II Diabetes (see Figure above, blue bars), including both currently used clinical therapies and promising candidates that have shown significant effects in animal studies. For example, Bromocriptine, a top hit identified by our model, is an FDA-approved therapy that improves blood sugar levels and other hallmarks of diabetes. Meanwhile, NADH, another drug that was highly ranked by our algorithm, improves glucose tolerance and insulin sensitivity in both diet- and age-induced models of Type II Diabetes in mice. Our model also highly ranks many common treatments such as insulin therapy.
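A check of this kind can be sketched in Python as follows. This is a hedged illustration rather than the evaluation actually used for the figure: the drug names and scores are invented, and the two measures shown (known drugs in the top k, and the chance that a known drug outranks an unlabeled one, often called AUC) are generic choices assumed for this sketch.

```python
def rank_drugs(scores):
    """Drug names sorted from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda drug: (-scores[drug], drug))


def known_in_top_k(scores, known, k):
    """Known treatments that appear among the k highest-scoring drugs."""
    return [drug for drug in rank_drugs(scores)[:k] if drug in known]


def rank_auc(scores, known):
    """Probability that a random known drug outscores a random other drug.

    Ties count as half a win. 0.5 is what random scores would give.
    """
    pos = [s for drug, s in scores.items() if drug in known]
    neg = [s for drug, s in scores.items() if drug not in known]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: hypothetical model scores and a hypothetical set of known treatments.
scores = {"drug_a": 0.9, "drug_b": 0.8, "drug_c": 0.4, "drug_d": 0.3, "drug_e": 0.1}
known = {"drug_a", "drug_d"}

top_hits = known_in_top_k(scores, known, k=2)
auc = rank_auc(scores, known)
```

The highly ranked drugs that are not in the known set (`drug_b` in this toy example) play the role of the candidates worth a closer look.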
These successes validate the predictive power of our methods, and tantalizingly hint at the therapeutic potential of those highly ranked candidates whose effects on Type II Diabetes are currently unknown (depicted, appropriately, as “golden needles” in the chart above). Now it’s time to start sewing…