A multitude of genetic factors can influence the development of diseases such as high blood pressure, heart disease, and type 2 diabetes. Understanding how DNA affects the risk of such diseases would allow healthcare systems to be less reactive and more proactive, thereby not only improving patients' quality of life but also reducing costs. However, identifying the links between DNA and disease onset requires statistical models that can reliably process very large datasets from hundreds of thousands of patients.
Matthew Robinson, Assistant Professor at the Institute of Science and Technology (IST) Austria, has now developed a new mathematical model with an international research team that improves the accuracy of predictions based on large amounts of genomic data. This method could help develop personalized predictions about health risks, similar to how a doctor might use a family's medical history.
Human DNA consists of billions of base pairs that encode our biological structure and functions. For their study, the scientists selected several hundred thousand genetic markers—short segments of DNA sequence—as the basis for their model. They then correlated the composition of these markers with the occurrence of high blood pressure, heart disease, or type 2 diabetes in the patients. The researchers were particularly interested in the patients' age at the onset of the disease. With this information, they can then calculate the probabilities of developing these diseases from a certain age onward.
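To illustrate how ages at onset can be turned into a probability of developing a disease by a given age, here is a minimal sketch of a Kaplan-Meier estimate in Python. It is not the Bayesian model used in the study: it ignores the genetic markers entirely and only shows how records with an observed onset and records that are still disease-free at last follow-up (censored) can be combined. The function names and the toy data are hypothetical.

```python
from collections import Counter


def kaplan_meier(ages, observed):
    """Return [(age, S(age))] at each age where at least one onset was observed.

    ages:     age at disease onset, or age at last follow-up if still disease-free
    observed: True if onset was observed, False if the record is censored
    S(age) is the estimated probability of still being disease-free after that age.
    """
    events = Counter(a for a, o in zip(ages, observed) if o)
    exits = Counter(ages)  # every person leaves the risk set at their recorded age
    at_risk = len(ages)
    survival = 1.0
    curve = []
    for age in sorted(exits):
        onsets = events.get(age, 0)
        if onsets:
            survival *= 1.0 - onsets / at_risk
            curve.append((age, survival))
        at_risk -= exits[age]
    return curve


def prob_onset_by(curve, age):
    """Estimated probability that onset has occurred by the given age."""
    survival = 1.0
    for t, s in curve:
        if t > age:
            break
        survival = s
    return 1.0 - survival


# Hypothetical toy data, not taken from the study.
ages = [45, 50, 50, 60, 62, 70]
observed = [True, False, True, True, False, False]

curve = kaplan_meier(ages, observed)
for checkpoint in (40, 55, 65):
    print(checkpoint, round(prob_onset_by(curve, checkpoint), 3))
```

A model like the one described here would go further and let each person's curve depend on their genetic markers, which is what makes the risk estimate personal.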
This statistical model cannot, however, establish direct relationships between specific genes and the onset of a disease; it only provides improved predictions of the probability of onset. In other words, it cannot predict with certainty from a person's genes whether that person will develop a disease. There is also an important difference between the black-box models often used in big-data studies and the method developed by Robinson and his colleagues. Black-box models do provide predictions, but their internal structure involves so many levels of abstraction that humans cannot easily understand it. In contrast, the model developed by Robinson and his colleagues provides verifiable statistical calculations.
The ability to understand the precise structure of a mathematical model used to make predictions about human health is a crucial part of an ethical approach to using large amounts of patient data.
To fully exploit the potential of such preventive methods, both effective models and large genomic datasets are needed. Collecting such data raises important questions about data security and privacy that both researchers and healthcare systems must address.
Strict data security measures must be observed when using patient data. Only with the permission of the respective ethics committees were the researchers able to access anonymized patient data from national biobanks (large collections of genetic patient data) in both the UK and Estonia. They used the UK data to build their model and the Estonian data to test its predictive power. The Estonian data even yielded initial personalized risk assessments for disease onset. These will be shared with patients via the Estonian healthcare system to encourage them to take preventive measures.
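As an illustration of what testing predictive power on a separate cohort can mean, the following Python sketch computes a simplified concordance index: the share of comparable pairs of people in which the person with the higher predicted risk did in fact develop the disease earlier. Whether the study used this exact measure is not stated here, so the metric, the function name, and the toy numbers are all assumptions made for the example.

```python
def concordance_index(risk, ages, observed):
    """Simplified Harrell-style concordance index.

    risk:     predicted risk score per person (higher means earlier onset expected)
    ages:     age at onset, or age at last follow-up if censored
    observed: True if onset was observed, False if censored
    A pair (i, j) is comparable when i has an observed onset at a younger age
    than j's recorded age. Ties in risk count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(ages)
    for i in range(n):
        if not observed[i]:
            continue  # a censored record cannot be the earlier event in a pair
        for j in range(n):
            if ages[i] < ages[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Hypothetical held-out cohort, not taken from the study.
test_ages = [45, 70, 50, 60]
test_observed = [True, False, True, True]
predicted_risk = [0.9, 0.2, 0.7, 0.4]  # scores from a model fitted on another cohort

print(concordance_index(predicted_risk, test_ages, test_observed))
```

A value of 1.0 means the predicted ordering matches the observed ordering of onsets perfectly, while 0.5 is what uninformative scores would achieve.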
The new statistical model by Robinson and colleagues is a first step toward harnessing the full potential of large genomic datasets for preventive healthcare. Fulfilling the promises of personalized medicine will require the models, the data infrastructure of biobanks, and a robust and secure data protection system.
Sven E. Ojavee, Athanasios Kousathanas, Daniel Trejo Banos, Etienne J. Orliac, Marion Patxot, Kristi Läll, Reedik Mägi, Krista Fischer, Zoltan Kutalik, Matthew R. Robinson. 2021. Genomic architecture and prediction of censored time-to-event phenotypes with a Bayesian genome-wide analysis. Nature Communications. DOI: 10.1038/s41467-021-22538-w
This project was funded by an SNSF Eccellenza grant to MRR (PCEGP3-181181) and core funding from the Institute of Science and Technology Austria and the University of Lausanne; KF's work was supported by the Estonian Research Council grant PUT1665. The researchers would like to thank Mike Goddard for his comments, which greatly improved the work, the participants in the cohort studies, and the École Polytechnique Fédérale de Lausanne (EPFL) SCITAS for their excellent computing resources, their generosity of time, and their warm support.