Skepticism is (and should be) a vital part of any science; statistics and data science are no exception. Statistician George Box nicely summed it up when he said, “all models are wrong, but some are useful”. Box reminds us that statistical models are just that: models. A simplified representation of the real world will always have shortcomings. But we shouldn’t forget the last bit of Box’s saying: “some [models] are useful”. Although challenging, carefully constructed statistical models can be extremely…
- Google Maps now predicts how difficult it will be to park near your destination based on the behavior of other users in the area.
- Airbnb uses attributes of its users and available properties to generate personalized search results that lead to higher booking rates.
- twoXAR is predicting new drug candidates outside of the wet lab using biomedical data.
These three accomplishments are all possible today because of machine learning.
Machine learning continues to disrupt markets and transform peoples’ everyday lives. Yet, the public is far removed from the actual technology that drives these changes. To many, the idea of machine learning may elicit images of complex mathematical formulas and sentient robots. In fact, many of the general ideas behind machine learning are approachable to a wider audience…
Today we announced our collaboration with Santen, a world leader in the development of innovative ophthalmology treatments. Scientists at twoXAR will use our proprietary computational drug discovery platform to discover, screen and prioritize novel drug candidates with potential application in glaucoma. Santen will then develop and commercialize drug candidates arising from the collaboration. This collaboration is an exciting example of how artificial intelligence-driven approaches can move beyond supporting existing hypotheses and lead the discovery of new drugs. Combining twoXAR’s unique capabilities with Santen’s experience in ophthalmic product development and commercialization…
We at twoXAR were very honored to be included this week in The AI 100, CB Insights’ list of top private Artificial Intelligence companies. It’s given me a chance to reflect on how we employ AI relative to others in the industry.
Our focus is on drug development, and as one of the few biopharma companies included in the list, we use AI in a unique way. Where others may be using AI as the sole ingredient…
Guest post by Marina Sirota, PhD, twoXAR Advisor and Assistant Professor, UCSF Institute for Computational Health Sciences
Earlier this month, Andrew A. Radin and I had the opportunity to attend a community outreach meeting at UC Irvine hosted by the NIH Library of Integrated Network-Based Cellular Signatures (LINCS) consortium. It was a great and diverse gathering of drug discovery researchers from academia, biopharma, startups, consulting companies and government funding agencies. For anyone interested in listening to the talks, some of them have been posted on YouTube.
The focus of day one was…
“Any sufficiently advanced technology is indistinguishable from magic.”
Science writer and futurist Arthur C. Clarke’s prescient “third law” only becomes more relevant as technological innovation accelerates and disciplines like computer science, data science and life science converge.
As we have been out in the field demonstrating the power of our technology platform to our collaborators, it has been interesting to hear their reactions when we tell them how it can evaluate tens of thousands of drug candidates and identify their possible MOAs, evaluate chemical similarity, and screen for clinical evidence in minutes. Their responses run the gamut from, “Wow, this is going to revolutionize drug discovery!” to “This is magic, I don’t believe computers can do this…”
However, whether we are talking to the converted or the skeptical, as we get deeper into conversations about how our technology works, we come to agree that using advanced data science techniques to analyze data about drug candidates is not magic. In fact, we’re doing what scientific researchers have always done – analyzing data that arises from experiments. What’s different is that advances in statistical methods, our proprietary algorithms, and secure cloud computing enable us to do this orders of magnitude faster than by hand or with the naked eye.
The speed of our technology, combined with the massive quantities of data it processes, simply enhances the work that our collaborators have been doing in the lab for years. We believe that the most interesting and powerful new discoveries will arise when open-minded life scientists combine their deep expertise with unbiased software.
Technologies like ours are meant to augment* the work of life scientists and help them accelerate drug discovery and fill clinical pipelines while leading society to a more robust and streamlined scientific process. Although DUMA might sound futuristic, today it is enabling therapeutic researchers to better leverage the value of their data and do it more rapidly than ever before.
Don’t believe the magic? Contact me and we’ll get a trial started to show you the science.
*Sidenote: I have been particularly interested in this interaction between humans and machines, which led me to a class at MIT called The [Technological] Singularity and Related Topics. One of the major topics was whether machines (including software) will replace aspects of society. One of my professors, Erik Brynjolfsson, author of The Second Machine Age, stated that “We are racing with machines – let’s augment, not automate.” And we definitely share that view here at twoXAR.
Long before my biomedical informatics studies at Stanford, I learned to recognize the difference between what is computable and what is not. In my past three startups this knowledge has been the key to engineering success.
Recently I made a trip to visit my alma mater, the Rochester Institute of Technology, where I earned my computer science degrees. For those not familiar with RIT, it has been ranked among the top 10 universities in the Northeastern United States. Recently, LinkedIn ranked RIT in the top 25 software development programs in the nation, and #13 for producing software developers specifically for startups.
While on campus, I reconnected with my thesis advisor, Prof. Stanisław P. Radziszowski. Dr. Radziszowski is a highly respected computer scientist and mathematician, best known for his work in Ramsey theory. He has published a number of works in Ramsey theory, block designs, number theory, and computational complexity. Since 1984, Dr. Radziszowski has mentored and trained thousands of students at RIT, and I was not sure if he would remember me and my work. I was pleasantly surprised that he not only remembered me, but had my thesis readily available on his bookshelf. He told me that he had been very excited about my work in combinatorial mathematics, and in the years since has shown it to students to inspire them to work on similar problems in computability theory.
One of the most important things I learned while a student of Dr. Radziszowski was how to skirt the line between what is computable and what is not. It is easy to imagine scenarios that seem like they should be easily solved with a computer; however, many problems turn out to be intractable. For example, let’s say you want to build a chemical storage facility. To minimize construction costs, you want to find the minimum number of buildings to store chemicals in, but you have to make sure no chemicals that react with each other are in the same building. Sounds simple, right? Well, it turns out there is no known algorithm that always finds the optimal answer without, in the worst case, practically infinite computational resources: this is the classic graph coloring problem, and minimizing the number of buildings (colors) is NP-hard.
My thesis at RIT addressed the mathematical problem behind the chemical storage example above. My method struck a balance between computability and complexity, right in the fuzzy center of what can be computed and what cannot. It involved developing a new heuristic which approximated the optimal solution. While my heuristic was not guaranteed to always produce the best solution, it produced a result that was theoretically very close to optimal, and better than other known approximation methods at the time.
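For readers curious what an approximation of this kind looks like in code, here is a minimal greedy-coloring sketch (not the heuristic from my thesis): chemicals are vertices, an edge connects any two chemicals that react, and each building is a “color”. The chemical names and reactions below are invented for illustration.

```python
def greedy_coloring(graph):
    """graph: dict mapping vertex -> set of adjacent vertices.
    Returns dict mapping vertex -> building number (0-based)."""
    assignment = {}
    # process high-degree (most reactive) chemicals first
    for vertex in sorted(graph, key=lambda v: -len(graph[v])):
        used = {assignment[n] for n in graph[vertex] if n in assignment}
        # smallest building number not used by any reactive neighbor
        building = 0
        while building in used:
            building += 1
        assignment[vertex] = building
    return assignment

# Hypothetical example: acid reacts with base and oxidizer; base also reacts with fuel
reactions = {
    "acid": {"base", "oxidizer"},
    "base": {"acid", "fuel"},
    "oxidizer": {"acid"},
    "fuel": {"base"},
}
plan = greedy_coloring(reactions)
print(plan, "buildings needed:", max(plan.values()) + 1)
```

Greedy coloring runs in polynomial time but, unlike an exhaustive search, is not guaranteed to use the minimum number of buildings; that trade-off is exactly the computability/complexity balance discussed above.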
Heuristics like these are the way that computer scientists model extremely complex systems. Today, there is no way to model every small detail of the complex interactions between organisms, disease, and drug treatments. There are too many variables, and computing every possible interaction is intractable. At twoXAR, we have perfected a computational technique that represents complex biological reactions in a way that accurately reflects reality, yet is simple enough to compute on quickly. This technique is what enables us to go from biological data sets to accurate drug-disease prediction within minutes.
It was a pleasure to circle back and talk about twoXAR with the mathematician who inspired in me the principles behind our methodology all those years ago. Students at RIT should be honored to learn from such a talented individual, and I know the next generation of great data scientists are attending Dr. Radziszowski’s lectures today.
In previous posts, we’ve alluded to the ever-expanding wealth of Big Biological Data, and the increasing capacity of biomedical informatics to convert this data into knowledge, cures, and cash. Here, I’d like to clarify the source of this approach’s power. Rather than relying on strong individual signals to reveal the causes and answers to disease, bioinformaticians are unearthing the complex webs of weak associations that underlie biological (mal)function.
The need for such methods is illustrated by the “missing heritability problem”. As Gregor Mendel was lucky enough to find and rigorous enough to observe, many traits such as plant seed color are passed from parent to offspring in a predictable manner. With the advent of molecular biology, it became clear that these traits are determined by variants in parental DNA, called alleles, which are inherited by the cells that make up the next generation. However, many other traits such as height, diabetes and Crohn’s Disease, though clearly heritable, can’t be traced to a single allele and neatly predicted by a high schooler’s Punnett Square. For instance, a casual glance around one’s social network will confirm that parental height often predicts a child’s chance of making the basketball team. Tall parents beget tall children; seems simple enough. Yet height is influenced by at least 40 different genes, which, when combined, still explain only 5% of the height variance across tens of thousands of people! How is it that 40 supposedly clear signals can’t pinpoint inheritance patterns we can plainly see? In the past decade, it’s become clear that most complex traits can’t be understood by finding a few smoking guns, but rather by connecting hundreds of scattered embers. Thus, to understand complex diseases, we must untangle the weak, noisy contributions of many, many genes.
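A toy simulation makes the puzzle concrete: even when dozens of weak alleles genuinely add up, a large environmental component keeps the variance explained by genetics small. All numbers here (effect sizes, noise levels, sample size) are invented for illustration.

```python
import random

random.seed(42)
N_PEOPLE, N_GENES = 5000, 40
EFFECT = 0.3  # hypothetical cm of height added per risk allele

heights, genetic_scores = [], []
for _ in range(N_PEOPLE):
    # each person carries 0, 1, or 2 risk alleles at each of 40 genes
    alleles = sum(random.randint(0, 2) for _ in range(N_GENES))
    g = EFFECT * alleles
    h = 170 + g + random.gauss(0, 6)  # environment swamps any single gene
    genetic_scores.append(g)
    heights.append(h)

def r_squared(g, h):
    """Fraction of height variance explained by the 40-gene score."""
    mg, mh = sum(g) / len(g), sum(h) / len(h)
    cov = sum((a - mg) * (b - mh) for a, b in zip(g, h)) / len(g)
    var_g = sum((a - mg) ** 2 for a in g) / len(g)
    var_h = sum((b - mh) ** 2 for b in h) / len(h)
    return cov ** 2 / (var_g * var_h)

print(f"variance explained by all 40 genes: {r_squared(genetic_scores, heights):.1%}")
```

Every allele in this sketch has a real effect, yet the combined score explains only a small slice of the total variance, which is the “missing heritability” pattern in miniature.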
Believe it or not, this is the type of problem that twoXAR’s software architect Carl worked on at NASA. To study extraterrestrial objects, NASA scientists record their electromagnetic emissions using instruments such as radio telescopes. As these objects are really frickin’ far away, the radio signals they emit are extremely weak and noisy. However, what this data lacks in clarity, it makes up for in abundance. The concept goes like this: if a signal is even slightly more consistent than random noise, then over lots and lots (and lots) of measurements, its pattern will manifest. All you need then are some clever algorithms to detect it. Fortunately, Carl and his ilk are some pretty clever folks.
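The stacking principle fits in a few lines: a signal ten times weaker than the noise is invisible in any single measurement, but emerges once thousands of measurements are averaged. This is a toy illustration of the idea, not NASA’s actual pipeline; the signal and noise levels are made up.

```python
import random

random.seed(1)
TRUE_SIGNAL = 0.1  # weak but consistent signal, far below the noise floor
NOISE_SD = 1.0     # noise is 10x stronger than the signal

# a single measurement: dominated by noise
single = TRUE_SIGNAL + random.gauss(0, NOISE_SD)

# average 10,000 measurements: noise cancels, the signal survives
stacked = sum(TRUE_SIGNAL + random.gauss(0, NOISE_SD)
              for _ in range(10_000)) / 10_000

print(f"one measurement: {single:+.3f}   10,000 stacked: {stacked:+.3f}")
```

Because independent noise averages toward zero while the signal stays put, the error of the stacked estimate shrinks roughly as one over the square root of the number of measurements; the same logic lets many weak genetic associations become detectable in large cohorts.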
When seven leading geneticists were interviewed about how to solve the missing heritability problem, one common theme that emerged was the need for more data, and more different types of it. Here at twoXAR, we’ve taken that concept to heart by querying multiple measurements, databases and tissue types in our search for protein networks linked to disease, and hiring folks like Carl to help build effective telescopes.
So far, we’ve told you about various pieces of the twoXAR puzzle: our goal of identifying new drugs to treat human disease, our machine learning methods for drug classification, and our overall vision to improve lives more quickly and efficiently through data science. Now, I’d like to connect the dots by walking you through our drug discovery process for one of our major disease targets, Type II Diabetes—a disease that as of 2012 affects one out of every ten people in the United States.
So, where to begin? Step 1 in data science: collect data! Our first task is to identify the molecular differences between health and disease through published gene expression profiling databases. While over 99% of the human genome is shared by all members of the species, everybody’s cells exhibit differences in the expression of this genome: the transcribing of DNA instructions into RNA messengers, which are ultimately translated into protein products that execute biological functions. In other words: if the genome is a giant cookbook, then each cell type opens up a different set of pages to photocopy recipes (RNA), and follows those instructions to create a specific set of foods (proteins), which the cell will then use. Importantly, the particular recipes copied, and the number of copies made, are often changed during disease, leading to aberrant protein accumulation, which disrupts normal cellular function. Gene expression profiling studies identify these differences for thousands of distinct RNA instructions, creating a global picture of the ways in which diseased cells have gone awry.
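As a sketch of what “identifying molecular differences” means computationally: for each gene, compare expression between healthy and diseased samples using a fold change and a t-statistic. The gene names and expression values below are hypothetical, and real studies involve thousands of genes plus multiple-testing correction, not this two-gene toy.

```python
import math
import statistics

# gene -> (healthy expression values, diseased expression values); made-up numbers
samples = {
    "INS":  ([9.1, 8.8, 9.3, 9.0], [6.2, 6.5, 6.0, 6.4]),  # down in disease
    "ACTB": ([7.0, 7.2, 6.9, 7.1], [7.1, 7.0, 7.2, 6.9]),  # unchanged
}

def welch_t(a, b):
    """Welch's t-statistic: difference of means scaled by pooled standard error."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

for gene, (healthy, diseased) in samples.items():
    log2_fc = math.log2(statistics.mean(diseased) / statistics.mean(healthy))
    print(f"{gene}: log2 fold change {log2_fc:+.2f}, t = {welch_t(healthy, diseased):+.1f}")
```

A large t-statistic flags a gene whose “recipe copying” differs reliably between the two groups; those genes become the starting points for the network analysis described next.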
The proteins that are made using these disease-altered RNA recipes are therefore attractive targets for drug intervention. However, proteins rarely act alone: they team up with a large network of fellow proteins in order to do their jobs in the cell. Thus, drugs that target the “coworkers” of disease-associated proteins may also prove to be effective therapies in the clinic. We therefore made use of a marvelously curated database of known protein-protein interactions to identify the buddies of proteins whose RNA instructions are significantly altered in Type II Diabetes patients, compared to healthy subjects. What’s more, this database also provides lists of drugs that interact with the disease-associated proteins and their friends.
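The network-expansion step can be sketched like this: start from disease-associated proteins, pull in their first-degree interaction partners, then collect any drug whose targets touch the expanded set. Protein and drug names here are made up, and the actual curated databases are far larger and richer.

```python
# protein -> interaction partners (hypothetical network)
ppi = {
    "P1": {"P2", "P3"},
    "P2": {"P1"},
    "P3": {"P1", "P4"},
    "P4": {"P3"},
}
# drug -> proteins it binds (hypothetical)
drug_targets = {
    "drugA": {"P2"},
    "drugB": {"P4"},
    "drugC": {"P5"},
}
disease_proteins = {"P1"}

# expand the disease set with first-degree "coworkers" of each protein
expanded = set(disease_proteins)
for p in disease_proteins:
    expanded |= ppi.get(p, set())

# keep any drug whose targets intersect the disease neighborhood
candidates = {d for d, targets in drug_targets.items() if targets & expanded}
print(sorted(candidates))
```

Note that drugB misses the cut here because its target is two hops away; how far to walk the network, and how to weight each hop, are exactly the kinds of modeling choices that matter in practice.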
Here’s our pot of gold: these drugs have the potential to compensate or correct for the changes caused by disease! The catch is, the gold is actually a massive stack of hay, with a few dozen golden needles buried inside. The human brain can’t make sense of thousands of data points from RNA profiles, protein networks and drug interactions alone. And no pharmaceutical company will ever pour R&D funds into investigating these connections one by one. Enter: the machine.
We developed some super sweet computer algorithms to map and simplify these large, diverse and unbiased datasets into a single biological model of disease. This model is then used to quantify each drug’s relevance to Type II Diabetes through machine learning. Machine learning provides a rigorous, unbiased way to predict which factors are relevant to a disease. As humans, it’s impossible to sort through a pile of data looking for relevant hits without our prior beliefs creeping up on us. A computer, on the other hand, doesn’t privilege or discriminate against any correlations it finds. Thus, machine learning enables the discovery of drug-disease connections that researchers may have never considered on their own.
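As a simplified stand-in for the scoring step (our production models are proprietary machine-learning methods, not this), here is a similarity-based ranking: each drug is scored by how strongly its protein targets overlap the disease model, using Jaccard similarity. All protein and drug names are hypothetical.

```python
# proteins implicated in the disease model (hypothetical)
disease_model = {"P1", "P2", "P3", "P4", "P5"}
# drug -> protein targets (hypothetical)
drug_targets = {
    "drugA": {"P1", "P2", "P3"},  # hits mostly disease proteins
    "drugB": {"P4", "P9"},        # partial overlap
    "drugC": {"P7", "P8"},        # unrelated targets
}

def relevance(targets, model):
    """Jaccard similarity: shared proteins / all proteins involved."""
    return len(targets & model) / len(targets | model)

ranking = sorted(drug_targets,
                 key=lambda d: relevance(drug_targets[d], disease_model),
                 reverse=True)
for d in ranking:
    print(d, round(relevance(drug_targets[d], disease_model), 2))
```

Any scoring function of this shape lets a computer rank thousands of drugs in seconds, without the prior beliefs that nudge a human reviewer; the learning part comes from tuning how the model weighs each kind of evidence against known outcomes.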
After the machine has made its predictions, we then use our human knowledge to verify its performance. Sprinkled throughout the haystack are a few known golden needles: chemical compounds that are already being used or developed to treat Type II Diabetes. What did our brilliant but ignorant machine think of these drugs?
Our model successfully predicted the relevance of more than thirty drugs known to affect Type II Diabetes (see Figure above, blue bars), including both currently used clinical therapies and promising candidates that have shown significant effects in animal studies. For example, Bromocriptine, a top hit identified by our model, is an FDA-approved therapy that improves blood sugar levels and other hallmarks of diabetes. Meanwhile, NADH, another drug that was highly ranked by our algorithm, improves glucose tolerance and insulin sensitivity in both diet- and age-induced models of Type II Diabetes in mice. Our model also highly ranks many common treatments such as insulin therapy.
These successes validate the predictive power of our methods, and tantalizingly hint at the therapeutic potential of those highly ranked candidates whose effects on Type II Diabetes are currently unknown (depicted, appropriately, as “golden needles” in the chart above). Now it’s time to start sewing…
What We Do: The A,B,C’s of twoXAR, Part III
As you may recall, in a previous post I introduced some of the biomedical informatics concepts that twoXAR uses to find new drug treatments for disease. In this post, I’d like to offer more specifics about how our work fits into the wide, yet subdivided, landscape of biomedical informatics.
What often comes to mind when people think of the use of computers in the medical field is identifying patterns in clinical data. This branch of biomedical informatics, known as clinical informatics, enables researchers, physicians, and policy makers to learn new information from electronic healthcare records, such as predicting disease outbreaks or discovering adverse drug effects. One of the more notable clinical informatics studies revealed that Vioxx was responsible for an increased risk of heart attacks, which subsequently resulted in its being pulled from the market.
Another branch of biomedical informatics is computational molecular biology. In this field, researchers process genomic sequences to help map and understand the underlying structure of our DNA. This knowledge allows other scientists to make associations based on our genes, such as uncovering the links between heredity and disease. The work here includes powerful data processing techniques to coalesce, order, and organize the jumble of data that comes out of gene detection instruments like gene expression microarrays. The most famous computational molecular biology project was the Human Genome Project, which marked the first time we were able to sequence the entire genome of a human being.
The field that best encompasses the twoXAR technology is translational informatics. As with clinical informatics and computational molecular biology, translational informatics uses computer science approaches such as machine learning, data mining and other statistical analysis techniques to gain new insights through computation. The biggest difference between clinical and translational informatics is that clinical informatics is concerned with finding new insights from patient data, while translational informatics is focused on translating new scientific discoveries into functional solutions for humans.
At twoXAR, we are using computational methods to find new drug treatments that have never been identified or used before. After rigorous clinical trials, we will bring these new therapeutics to your doctor so she is able to prescribe a new course of drug therapy that will produce safer and more effective results than existing treatments.
In a subsequent post I’ll talk a little more about how we use machine learning and data mining to discover new drug therapies.