COMBINING HUMAN & MACHINE INTELLIGENCE TO SUCCESSFULLY INTEGRATE BIOMEDICAL DATA

TIMOTHY DANFORD | TAMR, INC.

Upload: russelltamr

Post on 07-Aug-2015



THE DATA INTEGRATION PROBLEM

● flat files: every file has its own columns

● bioinformatics: every tool has its own file format

● graph data: RDF, OWL, “knowledge graphs”

● proprietary / legacy formats: SAS, DBF

● relational databases: inconsistent data models

Biomedical Data Integration is a Constantly Moving Target

THE DATA INTEGRATION PROBLEM

● One solution: hire or train data curators who understand the subject area

● Benefits: accuracy

● Problems
  o Low bandwidth
  o Difficult to scale to larger problems
  o Recording decisions
  o Consistency between curators

Data Curation Teams Do Not Scale

THE DATA INTEGRATION PROBLEM

● Build an automated or rules-based system to perform data integration

● Benefits: scale

● Problems
  o Accuracy, edge cases
  o Programmers do not scale
  o Out-of-band communication
  o Expensive to maintain
  o Brittle in the face of new data

Rule-based Integration Is Brittle

TAMR AUTOMATES DATA INTEGRATION

● Solution: combine learning rules with asking experts

● Modern machine learning techniques
  o semi-supervised learning
  o active learning

● Benefits
  o speed of an automated system
  o accuracy of human experts
  o auditability
  o responds well to changing requirements

Use Probabilistic Rules with Active Learning
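
The combination of probabilistic rules and expert questions can be illustrated with a minimal sketch. It is not Tamr's implementation: the string-similarity score stands in for a learned match probability, and the function names and the `accept` / `reject` thresholds are assumptions chosen for illustration. Confident pairs are decided automatically; ambiguous pairs are queued for an expert, most uncertain first (uncertainty sampling, one common active learning strategy).

```python
from difflib import SequenceMatcher


def match_score(a: str, b: str) -> float:
    """Crude stand-in for a learned match probability: similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs, accept=0.85, reject=0.35):
    """Split candidate pairs into accepted, rejected, and expert-review queues.

    The thresholds are hypothetical; a real system would tune them.
    """
    accepted, rejected, ask = [], [], []
    for a, b in pairs:
        p = match_score(a, b)
        if p >= accept:
            accepted.append((a, b, p))
        elif p <= reject:
            rejected.append((a, b, p))
        else:
            ask.append((a, b, p))
    # Uncertainty sampling: pairs with a score closest to 0.5 are asked first.
    ask.sort(key=lambda item: abs(item[2] - 0.5))
    return accepted, rejected, ask


# Hypothetical attribute names from two sources.
pairs = [("patient_id", "patient_id"), ("dose_mg", "dosage_mg"), ("age", "zip")]
accepted, rejected, ask = triage(pairs)
for a, b, p in ask:
    print(f"Ask an expert: does '{a}' match '{b}'? (score {p:.2f})")
```

In a fuller system the expert's answers would be fed back as training labels, so the number of questions shrinks as the model improves.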

TAMR AUTOMATES DATA INTEGRATION

● Build a unified schema and link it to source attributes

● Engage subject matter experts to answer questions

● Automate data transformation

● Eliminate redundant records with de-duplication

Tamr Combines Machine Learning and Expert Feedback

CASE STUDY: CLINICAL STUDY DATA

● Clinical study data integration is motivated by a single schema: CDISC
  o mandated by FDA for data submission
  o common schema for clinical data warehouses

● Mostly performed by SAS scripting today

● Tamr learns attribute mapping and transformations using human feedback

An Example: Clinical Study Data Integration
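
Learning attribute mappings from human feedback might look roughly like the sketch below. It is an illustration under stated assumptions, not Tamr's method: the `MappingSuggester` class is hypothetical, string similarity stands in for a trained model, and the target names (`USUBJID`, `SEX`, `AGE`) are used as examples of CDISC-style variable names that should be checked against the standard itself.

```python
from difflib import SequenceMatcher


def _sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class MappingSuggester:
    """Hypothetical helper: suggests a target attribute for a source column."""

    def __init__(self, targets):
        # Each target starts with its own name as the only known example.
        self.examples = {t: [t] for t in targets}

    def suggest(self, source_name):
        """Return (best_target, score) based on the closest known example."""
        scored = (
            (target, max(_sim(source_name, ex) for ex in names))
            for target, names in self.examples.items()
        )
        return max(scored, key=lambda pair: pair[1])

    def confirm(self, source_name, target):
        """Record an expert's answer so later suggestions can reuse it."""
        if target not in self.examples:
            raise ValueError(f"unknown target attribute: {target}")
        self.examples[target].append(source_name)


suggester = MappingSuggester(["USUBJID", "SEX", "AGE"])
print(suggester.suggest("gender"))   # weak guess before any feedback
suggester.confirm("gender", "SEX")   # an expert answers the question once
print(suggester.suggest("Gender"))   # the same column in a later study
```

The point of the sketch is that one expert answer is reused for every later source that names the column the same way, which hand-maintained scripts do not do without further editing.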

Thank You

THE BIOMEDICAL DATA INTEGRATION PROBLEM

● Fundamentally, many scientific analyses are tabular
  o rows are ‘entities’
  o columns are ‘attributes’

● graphs (paths) and hierarchies (part/whole) are other shapes

● tables emphasize independence of entities and attributes

Tabular Datasets are a Core Data Shape

THE BIOMEDICAL DATA INTEGRATION PROBLEM

● Column-oriented: Find the matching attributes

● Row-oriented: Discover duplicate entities

Data Integration Proceeds In Two Directions
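
The two directions can be sketched on a toy example. Everything here is an assumption made for illustration: the column names, the sample records, and the exact-match rule on normalized keys (a production system would typically use fuzzy or probabilistic matching instead).

```python
def unify_columns(rows, column_map):
    """Column-oriented step: rename source attributes to the unified schema."""
    return [{column_map.get(k, k): v for k, v in row.items()} for row in rows]


def normalize(value):
    """Lowercase and collapse whitespace so trivially different values compare equal."""
    return " ".join(str(value).lower().split())


def deduplicate(rows, key_fields):
    """Row-oriented step: keep the first record seen for each normalized key."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(normalize(row.get(f, "")) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Two hypothetical sources describing overlapping entities.
site_a = [{"gene": "TP53", "tissue": "Liver"}]
site_b = [
    {"gene_symbol": "tp53", "tissue_type": " liver "},
    {"gene_symbol": "BRCA1", "tissue_type": "Breast"},
]

merged = site_a + unify_columns(site_b, {"gene_symbol": "gene", "tissue_type": "tissue"})
print(deduplicate(merged, ["gene", "tissue"]))
```

Column matching has to come first: duplicate entities only become visible once both sources describe them with the same attributes.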

● 80% of clinical data today goes unused

● Clinical Data Warehouses capture legacy data

● Improved analytics = better trials, less $$

Advanced Analytics, Better Clinical Trials

TAMR BUILDS LASTING VALUE

SAS

Faster Regulatory Filings

Better Clinical Analytics

Data Mining for New Indications

Dynamic, Integrated View of 15k Existing and New Sources: Biopharma

Result
• Replaced 10+ man-years of human curation effort with Tamr
• Engaged 600 scientists in data quality ownership

Challenges
• $2B in research and silos of experimental results
• 15,000 sources of experimental results
• Hundreds of decentralized labs
• 1M+ rows with >100k attribute names
• Non-standardized attribute names & measurement units
• Manual curation prohibitively time & cost intensive

Solution
• Integrate data to find similar experiments
• Scale data curation to incorporate all sources at reasonable cost
• Engage owners of data sources in improving quality of data

15k sources integrated into one view

Tamr Output

TACKLING THE ENTERPRISE DATA SILO PROBLEM

All are necessary but not sufficient to truly address next-gen challenges

● Democratized visualization and modeling - radical consumption heterogeneity

● SemanticWeb/LinkedData - radical source heterogeneity

● Provenance for data to improve reliability

● Rapid iteration/change requires reproducibility from source

● Desire for longitudinal data across many entities

● Need for automated data quality / assurance

Traditional approaches...

● Standardization - worth trying

● Aggregation - yes - but actually makes the problem worse

● Top-down modeling (MDM/ETL) - ok for app-specific or well-defined data