Arab Health Online 2016 is part of the Global Exhibitions Division of Informa PLC


Biomedical Big Data Analytics for Precision Health


Ying Sha, Janani Venugopalan, Hang Wu, Li Tong, May D. Wang
Georgia Institute of Technology and Emory University, Atlanta, USA

 16 March 2017

Precision health is defined as clinical intervention for treating and preventing disease that takes individual variability into account. It is transforming today's reactive healthcare into tomorrow's proactive healthcare and will greatly improve patient outcomes. -Omic data capture molecular information, such as genomic variation, that reflects an individual's intrinsic susceptibility to disease, while electronic health record (EHR) data capture the individual's actual healthcare trajectory. Big -omic and EHR data analytics for precision health requires data integrity, data integration, causal inference, and real-time decision making.

-Omic and EHR Data Quality

The inherent quality of -omic (e.g., sequencing) data is influenced by a combination of biological, experimental, and instrumental factors in the sequencing process, such as GC bias, read errors, sample contamination, batch effects, and inadequate sequencing depth and coverage. Data quality significantly impacts downstream analyses such as genome assembly, single nucleotide polymorphism (SNP) identification, and gene expression studies. Typical sequencing data quality control procedures identify and filter out low-quality and contaminated reads using tools such as FastQC, FASTX-Toolkit, PRINSEQ, and the NGS QC Toolkit. In gene expression studies that compare different patient groups, batch effect removal is also needed to eliminate external factors, such as time and technician, that complicate downstream analysis. A typical -omic expression analysis pipeline consists of aligning short reads to a reference genome, quantifying transcripts, and identifying differentially expressed genes. Each step of the pipeline offers multiple choices of reference genome, aligner, and quantification tool, and choosing a proper analysis pipeline remains complex, with pipelines varying in maturity and applicability. Thus, systematic evaluation of sequencing data analysis pipelines is critical for making sense of -omic data.
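To make the read-filtering step concrete, the following is a minimal sketch of the kind of quality filter that tools like FastQC and PRINSEQ apply at scale. It assumes Phred+33 quality encoding and an illustrative average-quality cutoff; real pipelines use far more sophisticated criteria.

```python
# Minimal sketch of FASTQ read filtering by mean base quality.
# Assumes Phred+33 encoding; the cutoff of 20 is illustrative.

def mean_phred(quality_line, offset=33):
    """Average Phred score of one read's quality string."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)

def filter_reads(records, min_mean_q=20):
    """Keep reads whose mean base quality meets the cutoff.

    `records` is an iterable of (read_id, sequence, quality) tuples.
    """
    return [r for r in records if mean_phred(r[2]) >= min_mean_q]

reads = [
    ("read1", "ACGT", "IIII"),   # 'I' encodes Phred 40: high quality
    ("read2", "ACGT", "!!!!"),   # '!' encodes Phred 0: discarded
]
kept = filter_reads(reads)
```

In practice such per-read filters are combined with adapter trimming, contamination screening, and per-base quality trimming at the read ends.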

The two quality issues for EHR data are diverse data collection frequencies and missing data. While bedside monitoring data such as heart rate and pulse oximetry are captured at frequencies above 100 Hz, laboratory tests may be taken a few times a day, and administrative data such as gender and procedures may have only one or a few records per hospital or clinic visit. Within a specific data type, the sampling frequency is also irregular across variables, depending on factors such as the criticality of the patient, the ease of a measurement, and the rate of change of a clinical variable. For waveform data, imputation with robust parameter extraction and visualization to assist physician decision making have been tried. However, analyzing other clinical data with irregular sampling remains an open research area. The main reason for missing data is that, despite comprehensive record keeping, the set of clinical variables recorded varies with every clinical encounter and depends on the clinical team's assessment of a patient's condition. Missing data undermines the validity of results generated from downstream data mining because:

  1. statistical power may decrease when conducting statistical hypothesis tests
  2. larger variance and bias may occur when estimating model parameters
  3. conclusions drawn from partial data may not be representative of the entire data
  4. downstream analysis may become more complicated when missing data exist.

Current imputation approaches either fill in missing data with population averages (the mean or median of the database) or delete records with missing values. Deletion of records leads to a loss of statistical power, and mean filling introduces errors that distort the underlying disease state. More accepted techniques are model-based approaches such as multiple imputation, expectation maximization, maximum likelihood methods, and hot-deck imputation, but they still do not account for patterns of missingness. Recently, matrix factorization (MF) has been used to find two low-rank matrices, representing patient phenotypes and coefficients, whose product approximates the original matrix and fills in the missing elements. Alternatively, missing data can be categorized into different classes, each handled with a customized filling method.

Data Integration

The first challenge of data integration is incompatibility among EHRs developed by different vendors (such as Epic and Cerner) for hospitals of varying sizes and functions (primary, secondary, tertiary, or super-specialty care). This is further compounded by the lack of standardization in data definitions across vendors and over time. Under the new meaningful use requirements for EHRs, the US government requires healthcare institutions to demonstrate data sharing, interoperability, and clinical decision support. Standards such as the Health Level 7 Clinical Document Architecture (HL7 CDA) do not adequately represent all the health fields needed to incorporate decision support systems, and thus the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) and HL7's Fast Healthcare Interoperability Resources (FHIR) data standard emerged. OMOP is an academia-industry collaboration. FHIR standardizes healthcare communication data elements using a resource-centric (rather than document-centric) approach.
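FHIR's resource-centric approach can be illustrated with a minimal Observation resource for a single heart-rate reading, built here as a Python dictionary and serialized to JSON. The patient reference and the reading itself are made up for the example; real resources carry additional required metadata.

```python
import json

# Sketch of a minimal FHIR Observation resource (a heart-rate reading).
# The patient reference and the value are hypothetical examples.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "8867-4",          # LOINC code for heart rate
            "display": "Heart rate",
        }]
    },
    "subject": {"reference": "Patient/example"},
    "valueQuantity": {
        "value": 72,
        "unit": "beats/minute",
        "system": "http://unitsofmeasure.org",
        "code": "/min",
    },
}
payload = json.dumps(observation)
```

Each reading is a self-contained, typed resource rather than a section buried in a document, which is what makes piecewise exchange and rule-based decision support over individual data elements practical.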

Existing integration efforts include the Electronic Medical Records and Genomics (eMERGE) network and the Alzheimer's disease (AD) initiative. The eMERGE network is a National Human Genome Research Institute (NHGRI)-funded consortium of nine research groups that identifies causal genomic mutations (mostly SNPs) for phenotypic information (e.g., observable phenotypes for genetic disorders, drug responses, childhood obesity, and childhood autism) recorded in the EMR system, and then integrates the identified genotype-phenotype associations. It augments the current EMR system structure to accommodate genomic findings, data security and privacy, and robust data analytics. The AD initiative works on multi-modal analysis by combining various imaging modalities, such as structural MRI (T1-weighted, T2-weighted, DTI, DWI), fMRI, and PET, with clinical factors in the EHR.

Causal Inference

Biomedicine aims to uncover the molecular mechanisms of diseases, the causes of patient symptoms, and the causal relationships between environmental factors and disease, while health policy aims to use causal relationships to optimize strategies for promoting health and preventing disease. To find causality, the "gold-standard" solution is to conduct highly controlled experiments such as randomized controlled trials (RCTs), which assign people randomly to groups that receive one of several clinical interventions. Randomization eliminates confounding factors and elucidates reliable causal relationships. With increasingly abundant data such as -omic and EHR data, there is a need to develop causal inference models that ensure the reliability of causal relationships derived from heterogeneous datasets. There are three categories of automated approaches for causal inference from large observational datasets. Bayesian networks (BNs) infer a set of directed acyclic graphs from data based on three assumptions: the Causal Markov condition (CMC), Causal Faithfulness (CF), and Causal Sufficiency (CS). Nodes and edges in BNs represent variables and their conditional dependence; two nodes without a connecting edge represent independent variables. CMC means that a node is conditionally independent of its non-descendants given its direct causes. CF means that all dependence relationships in the graph hold in the data; faithfulness is often lost with small data. CS means that all common causes of the variables are included in the graph. A variety of algorithms implement BNs to infer causality, differing in how they explore the search space and in the criteria used to evaluate graphs. Granger causality infers causality from time series by evaluating the relationship between two variables at some time lag, rather than finding a set of relationships that best explains the data.
However, two sequential symptoms of one disease do not necessarily mean one symptom causes the other, so Granger causality may be better suited for prediction than for causal inference. Temporal logics are based on the assumption that a cause occurs before its effect and raises the probability of that effect; probabilistic computation tree logic (PCTL) formulas provide an automated way to test causal relationships with properties such as time duration. For example, instead of the simple causal statement "a genetic mutation causes lung cancer," temporal logics enable reasoning such as: the probability that a person who has smoked for 2 to 5 years develops lung cancer after a genetic mutation occurs is 0.2. This approach has been validated on multiple synthetic datasets and has a significantly lower false discovery rate than BNs and Granger causality.
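The Granger idea reduces to a concrete regression comparison: does adding lagged values of x improve prediction of y beyond y's own lags? The toy sketch below simulates a series where x drives y at lag 1 and compares the residual sums of squares of the restricted and unrestricted models; the simulated coefficients and lag order are illustrative.

```python
import numpy as np

# Toy Granger-style comparison: predict y from its own lag (restricted)
# vs. its own lag plus lagged x (unrestricted), then compare fit quality.
# Simulated data: y depends on x at lag 1, so the RSS should drop sharply.

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()

def rss(target, predictors):
    """Residual sum of squares of an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(predictors, target, rcond=None)
    resid = target - predictors @ beta
    return float(resid @ resid)

ones = np.ones(n - 1)
restricted = np.column_stack([ones, y[:-1]])             # y's own lag only
unrestricted = np.column_stack([ones, y[:-1], x[:-1]])   # plus lagged x
rss_r = rss(y[1:], restricted)
rss_u = rss(y[1:], unrestricted)
# A large drop in RSS suggests x "Granger-causes" y; as the text notes,
# this is evidence of predictive value, not of true causation.
```

A full test would turn the RSS drop into an F-statistic with the appropriate degrees of freedom, but the pairwise, lag-based character of the method is already visible here.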

Real-time Decision

With advances in biosensor technologies, a variety of health conditions, such as vital signs from wearable devices and environmental factors, can be monitored and captured in real time as streaming data. Data elements may include body temperature, blood pressure, vital signs, continuous monitoring data from EEG and EMG, and lifestyle-related information such as calories burned, distance walked, steps climbed, and sleep quality, collected at the clinic and at home over wired or wireless (e.g., Bluetooth) connections. Real-time environmental monitoring devices may record health-related factors such as air quality, light, humidity, climate variation, and ozone. These data provide a personalized baseline for a person's health condition and living environment. Analyzing these "personalized" data requires streaming data analytics. Compared with traditional analytics for retrospective data, real-time streaming analytics must learn incrementally, accepting new data continuously and adjusting predictions accordingly. Unsupervised learning is often better suited to this setting than supervised learning, and common prediction models such as support vector machines and neural networks have incremental versions.
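A minimal example of incremental learning on streaming data is maintaining a running personal baseline for a vital sign with Welford's online mean/variance algorithm, flagging readings that deviate sharply from it. The heart-rate values and the z-score threshold below are illustrative.

```python
# Sketch of streaming analytics: a personalized baseline for one vital
# sign, updated one reading at a time via Welford's online algorithm.
# The readings and the 3-sigma threshold are illustrative choices.

class StreamingBaseline:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new reading into the running statistics."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def is_anomalous(self, value, z=3.0):
        """Flag a reading more than z standard deviations from baseline."""
        if self.n < 2:
            return False  # not enough history to judge
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(value - self.mean) > z * std

baseline = StreamingBaseline()
for hr in [70, 72, 71, 69, 73, 70, 72, 71]:   # resting heart rates
    baseline.update(hr)
alert = baseline.is_anomalous(140)  # sudden spike vs. personal baseline
```

Because each update touches only constant state, the baseline can run on-device and absorb readings indefinitely, which is exactly the incremental behavior streaming analytics requires.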

Precision Health

Achieving precision health requires the integration of -omic data and EHR data to disentangle the complex causal relationships between genotypes and phenotypes and to provide a comprehensive view of a person's health. However, building infrastructures to host multiple data types and relating population-level causal relationships to the individual level remain big challenges.

To host multiple data types, growing efforts such as the eMERGE Network consortium attempt to identify causal genomic variants (mostly SNPs) for EMR-based phenotypes and to integrate the identified genotype-phenotype associations into the EMR system by improving the current EMR structure:

  1. storing various genomic variants, such as SNPs, indels, and CNVs, in a structured format
  2. providing interoperability to reduce the burden in data transfer and update within and between healthcare facilities
  3. supporting rule-based decision support engines
  4. containing abundant visualization elements for easier interpretation.

Recently, the HL7 FHIR standard has also expanded to include a standardized genomic data exchange protocol, enabling clinicians to combine -omic information with the EHR to tailor treatment plans for individual patients.

Relating population-level causal rules to individual-level ones also requires additional effort. For example, although smoking causes lung cancer, a physician cannot attribute smoking as the cause of lung cancer in a patient with a one-week smoking history. Various approaches have been proposed, including generalizing individual-level rules into population-level rules, applying population-level rules to individual cases, and treating the two as separate entities. Applications in medicine include medical diagnosis using qualitative simulation and expert systems based on knowledge derived from large databases. Alternatively, inferring causal rules from a subpopulation similar to the target patient is desirable and useful.


In this article, we give an overview of biomedical big data analytics for precision health. Precision health is in its infancy, and more effort is anticipated to improve data quality control, facilitate data integration, advance causal inference from observational studies, and develop real-time decision making. As more data standards are proposed and collaborations among clinical, academic, and industrial institutions take shape, we envision that integrating -omic data into the EHR in clinical settings will be realized in the near future and precision health will become a reality, leading to better well-being and outcomes for patients.