Big Data in Biomedicine

  • Posted on: Sun, 01/22/2017 - 02:48
  • By: OCHIS

“Big Data in Biomedicine” is a project conducted by the Healthcare Engineering Division to give a full picture of how big data technology is being applied in the biomedical field and what impact it is having. 'Big data' is the sort of thing that people talk about every day, but few people actually know how to do it.

Project Team Composition

Zhongxu An (project leader), Shengwen Deng, Harry Qiu, Wenchuan Wu and Qinwei Zhang

Project Status


Delivered Report

Authors: Zhongxu An, Shengwen Deng, Harry Qiu, Wenchuan Wu, Qinwei Zhang


Finished by October 2016

Big Data

Big data refers to large volumes of high-velocity, complex, and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management and analysis of the information. Its typical characteristics are the four Vs: Volume, Velocity, Variety and Veracity [1-2].

However, big data does not have to reach terabyte-level volume; rather, it is a comprehensive data set containing all data that can possibly be collected, which may or may not be directly related to the research topic. One advantage of this is that the same data can be reused for different studies. Compared with a conventional sample study, the cost of conducting a big data study may be higher, but the accuracy and precision of the conclusions would be significantly improved. On the other hand, big data focuses more on correlation among the data than on causality, which may help scientists discover unexpected disease mechanisms.

Traditional biostatistical analysis is applied to designed sample data from a specific population, which makes it impossible to apply precisely to individuals. Applying big data in biomedicine would bridge the gap between conclusions from population studies and individual application for two reasons: 1. The data collected from an individual is already a big data set, including electronic medical records (EMR), sensor data, diagnostic images and genome information. 2. A study conducted at the population level, instead of on a sample of it, would improve the understanding of the population-wide medical situation and incorporate population information into individual analysis to improve the reliability of conclusions.

In addition, medical practice is shifting from subjective decision making to evidence-based medicine and big data analytics is a critical step in this transition.

Where is big data from?

The big data used for biomedicine is collected from a wide range of sources. Typically, it includes 1. human-generated data (e.g. EMR) and 2. biometric data (blood pressure, ECG, MRI, CT, etc.). However, with the help of highly developed information technology, web and social media data can also be used to understand population-wide health trends. For example, it has been reported that using Google Flu Trends to predict flu-related emergency visits gives results one week ahead of the CDC report [3-4].

Currently, big data is being used in areas including, but not limited to, the following:

  1. Clinical operations;
  2. R&D of new drugs;
  3. Public health;
  4. Evidence-based medicine;
  5. Genomic analytics;
  6. Pre-adjudication fraud analysis;
  7. Device/remote monitoring;
  8. Patient profile analytics.

Although the application of big data in biomedicine has become more and more popular, challenges remain to be solved before big data can be applied on a larger scale in standard patient care:

  • Healthcare applications in particular need more efficient ways to combine and convert varieties of data including automating conversion from unstructured to structured data.
  • The conceptual framework for a big data analytics project in healthcare is similar to that of a traditional health informatics or analytics project. The key difference lies in how processing is executed: big data analytics tools are extremely complex, programming intensive, and require the application of a variety of skills.
  • Real-time big data analytics is a key requirement in healthcare. The lag between data collection and processing has to be addressed.

To fully understand how to implement big data technology in biomedicine, the OCHIS Healthcare Engineering Team proposed the following framework, which consists of a data collection part and a data analysis part and addresses how the data is collected, stored, transferred and processed. On the data collection side, we will find out how big data is being acquired, stored and transferred. On the data analysis side, it would be interesting to see how different algorithms and software are used for data analysis at each step in different applications. At the same time, hardware requirements are also a practical concern for researchers.

Fig. 1. Big Data in Biomedicine framework

There are multiple data levels in healthcare. However, how these data are structured for different big data applications is unclear. Therefore, the Big Data Analytics Framework is intended to help in understanding big data analytics in individual precision medicine and population medicine [5]. This would enable a comprehensive understanding of big data and thus help the implementation of big data in biomedicine.

Fig. 2. Big Data Analytics in Medicine framework

1. Bioinformatics - Molecular level data

Bioinformatics research uses molecular-level data (e.g. gene expression data, cheminformatics, high-throughput screening, DNA sequence analysis) to study how the human body works and develops effective methods for handling and analyzing the data. As molecular data is continually growing and can surely be classified as Big Volume, big data methods are needed to process this type of data.

A major challenge of molecular-level data is ‘high dimensionality’, where data has a large number of independent attributes. Feature selection is essential to analyze this type of data.

Examples: 1) using gene expression profiling to categorize leukemia into different subclasses; 2) using gene expression data to predict relapse among patients in the early stage of colorectal cancer.
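The role of feature selection on high-dimensional data can be illustrated with a small filter-style sketch. It is not the method used in the studies cited above; it simply ranks each feature by the absolute Pearson correlation with the class label and keeps the top k. The function names and the toy matrix are assumptions made for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the label
    return cov / sqrt(var_x * var_y)


def select_top_features(rows, labels, k):
    """Return the indices of the k features most correlated with the labels."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data (made up): 4 samples x 3 features, binary class labels.
rows = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
    [4.0, 5.0, 0.8],
]
labels = [0, 0, 1, 1]
selected = select_top_features(rows, labels, k=1)
print(selected)
```

In real molecular data the number of features is far larger than the number of samples, so a univariate filter like this is usually only a first screening step.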

2. Neuroinformatics - Tissue level data

Neuroinformatics concentrates its research on the analysis of brain image data in order to learn how the brain works and to find correlations between information gathered from brain images and medical events.

Major challenges include feature extraction, the management of complex images, and high demand for computational power.

Examples: 1) creating a brain connectivity map (Human Connectome Project); 2) using MRI data for clinical prediction (e.g. using MRI data to determine the degree to which a patient has Alzheimer’s disease).

3. Clinical informatics - Patient level data

Clinical Informatics research involves making predictions that can help physicians make better, faster, more accurate decisions about their patients through analysis of patient data. Due to the use of streaming data (real-time), patient level data has big velocity as well as variety and volume.

Feature selection is still the most important factor for patient-level data: it helps choose the critical features and thereby helps in making quicker clinical decisions.

Examples: prediction of ICU readmission; prediction of patient mortality rate (after ICU discharge and 5 years); making clinical predictions using data stream (predicting patient’s condition in real time).

4. Public health informatics - Population level data (social media)

Public Health Informatics applies data mining and analytics to population data (e.g. Twitter, internet query data like Google search data, message boards, or anywhere else people put information on the internet), in order to gain medical insight.

Data gathered from the population through social media could have low Veracity, leading to low Value, but with techniques for extracting the useful information from social media (such as Twitter posts), this line of data can also have Big Value.

Examples: spatiotemporal information on disease outbreaks, real-time tracking of harmful and infectious diseases, increasing knowledge of the global distribution of various diseases, and creating an extremely accessible way for people to get information about any medical questions they might have.

Big data text mining

1. Natural language data mining.

Text mining deals with the recognition of patterns in natural language texts, most of which contain unstructured information in documents or in unstructured text fields within databases. When accessing this textual information, applications can also benefit from a more detailed linguistic analysis of the text beyond “word-based” analysis. A wide range of techniques from the field of natural language processing can be applied. The text mining workflow consists of information retrieval, information extraction, meaning annotation (or named entity identification), semantic search and association extraction, event/pathway extraction and knowledge discovery. For natural language processing specifically, the engine consists of a concept analyzer, linguistic matching, statistical matching and linguistic resources (multilingual lexicon, synonym database, grammar, etc.).

Example: Text mining can be used to detect gene-disease associations (gene prioritization and gene function prediction) and disease-disease associations, and for toxicity prediction. It can also be used to rapidly mine candidate hypotheses from text that can then be validated against experimental data, for example indomethacin in Alzheimer’s disease.
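One very simple form of association extraction is sentence-level co-occurrence counting, sketched below. This is only an illustration under strong assumptions: the two lexicons are tiny hypothetical stand-ins for curated vocabularies, the sentences are made-up test inputs rather than findings, and a co-occurrence count on its own says nothing about whether an association is real.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical mini-lexicons; a real system would use curated vocabularies.
GENES = {"brca1", "tp53", "apc"}
DISEASES = {"breast cancer", "colorectal cancer", "leukemia"}


def find_terms(sentence, lexicon):
    """Return the lexicon terms found in the sentence (case-insensitive, whole words)."""
    text = sentence.lower()
    return {
        term
        for term in lexicon
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    }


def cooccurrence_counts(sentences):
    """Count how often each (gene, disease) pair appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        genes = find_terms(sentence, GENES)
        diseases = find_terms(sentence, DISEASES)
        for pair in product(sorted(genes), sorted(diseases)):
            counts[pair] += 1
    return counts


# Made-up sentences used only to exercise the code.
sentences = [
    "BRCA1 variants were studied in breast cancer patients.",
    "TP53 and BRCA1 were both discussed in the context of breast cancer.",
    "APC was examined in colorectal cancer.",
    "This sentence mentions leukemia but no gene.",
]
counts = cooccurrence_counts(sentences)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts would be candidates for closer inspection or for validation against experimental data, in the spirit of the hypothesis mining described above.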

2. Multilevel modeling for big data.

After finding the representative topics and conceptual groups in the text data, there is a need to construct a model that represents the topical hierarchies. As medical information is often collected on a network basis with time series and spatial hierarchies, it is often modeled in a multilevel manner.

Example: Teams in Chicago are using multilevel modeling to investigate the why and where of geographic disparities in late-stage cancer diagnoses in the U.S. They specifically focus on breast cancer and colorectal cancer, jointly considering the spatial effects across the countries and spatial heterogeneity in the dynamics related to cancer outcomes. With the model, they can estimate the relative influence of the correlates of cancer outcomes along multilevel, temporal, and urban-rural dimensions.

3. High dimensionality of data.

Different data in text can be regarded as different dimensions. As dimensions accumulate, the space in which individuals are represented becomes sparser, and it is more difficult to find interesting relations between variables. This has been a great concern in text categorization. In the process of knowledge discovery in text mining, it is therefore necessary to reduce the dimensionality via methods like correspondence analysis, latent semantic analysis and projection pursuit. All these methods adopt the basic idea of the “bag of words” assumption: “Each document in a given corpus is thus represented by a histogram containing the occurrence of words. The histogram is modeled by a distribution over a certain number of topics, each of which is a distribution over words in the vocabulary. By learning the distributions, a corresponding low-rank representation of the high dimensional histogram can be obtained for each document”.
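The “histogram containing the occurrence of words” can be sketched in a few lines. The example below covers only this first step, building the document-by-word count matrix; the topic model or low-rank factorization that would follow is not shown. The tokenizer and the two sample documents are illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts word j in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Two made-up documents; each row of the matrix is one document's histogram.
documents = [
    "gene expression predicts relapse",
    "gene expression and gene function",
]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each column is one dimension, so the width of this matrix grows with the vocabulary, which is why dimensionality reduction becomes necessary on real corpora.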

4. Difference between prediction and explanation.

Explanatory research aims to identify risk (or protective) factors that are causally related to an outcome. Predictive research aims to find the combination of factors that best predicts a current diagnosis or future event. This distinction affects every aspect of model building and evaluation. As for potential variables, predictive modeling can draw on a large set of potential predictors, while explanatory modeling only allows prespecified or low-rank representation confounders. Explanatory modeling is a hypothesis-driven method for identifying causal relationships and should not use automated selection procedures. Predictive modeling can use automated selection procedures to determine current diagnoses or future outcomes, but requires extra validation procedures such as calibration, discrimination, reclassification and clinical utility.
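Two of the validation checks named above, discrimination and calibration, can be sketched with simple summary statistics. The example assumes binary outcomes coded 0/1 and predicted probabilities from some already fitted model. It uses the area under the ROC curve (computed by pairwise comparison) for discrimination and "calibration-in-the-large" (mean predicted probability minus observed event rate) as one basic calibration measure. The function names and numbers are illustrative, not taken from the report.

```python
def auc(labels, scores):
    """Probability that a random positive case scores higher than a random negative case.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def calibration_in_the_large(labels, probabilities):
    """Mean predicted probability minus observed event rate (0 means no overall bias)."""
    mean_predicted = sum(probabilities) / len(probabilities)
    observed_rate = sum(labels) / len(labels)
    return mean_predicted - observed_rate


# Made-up outcomes and predicted probabilities for four patients.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, probabilities))
print(calibration_in_the_large(labels, probabilities))
```

The pairwise AUC is quadratic in the number of cases, which is fine for a sketch but would be replaced by a rank-based computation on large data sets.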

Data mining

To some extent, big data analysis means extracting information from large amounts of data to make decisions. Sometimes we need to find meaningful groups in a given collection of data (clustering). In other situations, we need to learn a decision rule from a set of labeled data (classification).

1. Clustering

Cluster analysis, or clustering, is the task of grouping a set of unlabeled data in such a way that data in the same group are more similar to each other than to those in other groups. The groups are called clusters; group membership then serves as a newly assigned label for each data point.

Depending on the properties of the data, multiple clustering algorithms can be used. Different applications focus on different aspects of clustering: in data mining, the resulting groups themselves are of interest, whereas in automatic classification, the discriminative power is of interest.

Example: Clustering can help to define patient groups more accurately, even without knowing in advance how many groups of patients there are. Doctors with this knowledge can provide the most appropriate treatment to patients. The more accurate the clusters, the more personalized the treatment can be.
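Grouping without fixing the number of clusters in advance can be sketched with distance-threshold single-linkage clustering: any two points closer than a chosen threshold end up in the same cluster, and the number of clusters falls out of the data. This is one possible technique among many, not one named in the report. The threshold and the two-dimensional "patient" points are made-up assumptions.

```python
from math import dist


def threshold_clusters(points, max_distance):
    """Single-linkage clustering: link every pair of points within max_distance.

    Returns a sorted list of clusters, each a list of point indices.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= max_distance:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up measurements for five patients (two features each).
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
clusters = threshold_clusters(points, max_distance=1.0)
print(clusters)
```

The trade-off is that the threshold replaces the cluster count as the parameter to choose, and the pairwise loop is quadratic, so large data sets would need an indexed or distributed variant.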

2. Classification

Classification is a technique used to assign data to predetermined classes and to predict the class of new data. Many algorithms are available for classification; widely used ones include Bayesian classification, genetic algorithms and decision tree induction. Choosing the appropriate classification algorithm is important for getting meaningful results. The choice depends on the type of application and the dimensionality of the data, and it is a challenging process: researchers normally need to try many algorithms before finding the optimal one. For big data analysis in health care, the most important factor is the accuracy of the classifier.

Example: C4.5 is an algorithm that generates a decision tree. It was applied to the SEER dataset (the Surveillance, Epidemiology, and End Results Program of the National Cancer Institute) to classify patients as being either in the benign stage or per cancer stage. The reported accuracy was 93%.
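C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the split information of the attribute. The sketch below computes that criterion for a single categorical attribute. It is not a full C4.5 implementation (no tree building, pruning or handling of continuous attributes), and the attribute values and class labels are hypothetical, not drawn from SEER.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the value distribution in a list."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Gain ratio of splitting the labels on one categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - conditional
    split_information = entropy(attribute_values)
    if split_information == 0:
        return 0.0  # the attribute has a single value and cannot split the data
    return information_gain / split_information


# Hypothetical attributes for four cases.
labels = ["benign", "benign", "malignant", "malignant"]
informative = ["small", "small", "large", "large"]  # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]                # unrelated to the class
print(gain_ratio(informative, labels))
print(gain_ratio(uninformative, labels))
```

At each node a C4.5-style learner would evaluate this ratio for every candidate attribute and split on the one with the highest value.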

Fig. 3. Distributed processing


[1] Transforming Health Care Through Big Data. Institute for Health Technology Transformation.

[2] Raghupathi W and Raghupathi V. Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2014, 2:3.

[3] Herland M, Khoshgoftaar TM and Wald R. A review of data mining using big data in health informatics. Journal of Big Data, 2014, 1:2.

[4] Gonzalez GH, Tahsin T, et al. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery. Briefings in Bioinformatics, 2016, 17(1):33–42.

[5] Arah OA. On the relationship between individual and population health. Med Health Care and Philos, 2009, 12:235–244.