Approved research

Patient stratification using machine learning on literature and clinical datasets

Principal Investigator: Dr Enrique Garcia-Rivera
Approved Research ID: 46421
Approval date: June 18th 2019

Lay summary

The UK Biobank is one of the largest repositories of patient data available. We seek to harness this dataset to develop new ways of understanding patient segments and disease progression. Using a suite of machine learning models, we will attempt to characterize and stratify patients based on the entirety of their available data, that is, each patient's full record and its associated variables. At nference we have developed neural network (NN)-based models to quickly determine features associated with a diseased state. We are a team of highly accomplished computer scientists, biologists, chemists and clinical scientists (a large fraction coming from the Harvard/MIT ecosystem) with a mission to fundamentally change the way biomedical knowledge is queried. Applying our methods to the UK Biobank dataset could uncover previously unknown patterns and feature sets that predict clinically impactful variables, including disease progression, survival and therapeutic response. We have deployed our approach on other publicly available datasets, e.g. The Cancer Genome Atlas (TCGA), and have found distinct patient clusters that are non-obvious to a disease expert. At the same time, we are able to recapitulate well-known segments, e.g. lung cancer patients with mutations in the EGFR gene. This represents an unbiased method of combining molecular data with what is known in the literature about a disease or phenotype. Such models can be used with the UK Biobank data to find associations between a particular phenotype and multiple variables, including clinical observations, molecular signals, clinical images, laboratory results, and many others. We expect to apply these models within a year, with the ultimate goal of identifying feature sets of high clinical relevance. We expect this to support advances in clinical trial design and, most importantly, improvements in therapeutic development and response.