Solving the Disconnected Data Problem in Healthcare Using MongoDB
DESCRIPTION
Data diversity in healthcare and life sciences is exploding, and the market is fundamentally changing as a result of healthcare reform. The result is more and more data, but it is compartmentalized and disconnected. At Zephyr Health, we have developed a data platform that provides connectivity between thousands of healthcare data assets using an ontology-driven approach, storing data in MongoDB. This session will show how we break down this very challenging problem and how some of MongoDB's more recent features have been used to do so.
TRANSCRIPT
Sven Junkergård - CTO
Solving the Disconnected Data
Problem in Healthcare Using
MongoDB
A MongoSF talk – December 3rd 2014
• MSc Computer Science and Engineering – Chalmers University
of Technology in Gothenburg
• AMS, Capgemini
• Cake Financial – aggregating retail investor portfolios and
generating investment insights from the best of the best
• Billfloat – novel financial credit product with highly differentiated
underwriting method
• Zephyr Health – built out technology and engineering team to
deliver on a big vision – integrate disconnected data in
healthcare and solve real problems. Now CTO.
ME
I am a reformed consultant who used to do architecture consulting…
WHO I WORK FOR – ZEPHYR HEALTH
ORGANIZATIONAL EXPERTISE
• Life Sciences
• Brand Management
• Big Data
• Applied Mathematics
• Algorithms
• IaaS | SaaS | PaaS
• Machine Learning
• Artificial Intelligence
• Statistics & Modeling
• Data Science
• Visualization
• App Development
OFFICE LOCATIONS
• San Francisco
• London
• India
CURRENT CLIENTS
Include members of:
• Global top 5 biopharm
• Global top 5 pharm
• Global top 5 medical devices
OUR FOCUS
• Organize disconnected data in healthcare and life science
• Visualize the combination of heterogeneous data sources in analytical problems
• Solve important and challenging problems for our customers
SOLVING THE VARIETY PROBLEM
Data trends and healthcare examples
• Volume: Genomic sequencing
• Velocity: Streaming device data
• Variety: Understanding healthcare landscape and treatment effectiveness (Internal, Vendor, Public)
• Visualization: Providing relevant and powerful visualizations that provide real insights
• Image sources: illumina and iRhythm
WHY HEALTHCARE DATA IS A DIFFERENT WORLD ENTIRELY
Loan application decision
• Applicant demographics
• Bank account
• Credit report
• Identity check
• Income verification
Common key: SSN
Clinical trial investigator decision
• Research
• Published trials
• Current sponsored trials
• Prescriptions
• Claims
• Funding
• Network leadership
• Site profile
• Site certification
• Site statistics
Investigator, Site, Patients
Inconsistent or missing keys
THE TYPES OF PROBLEMS THAT CAN BE SOLVED
WITH INTEGRATED DISPARATE DATA
Problem: What is it?
• Site selection: Finding the right locations to house clinical trials
• Trial outcomes: Visualizing data from different sources within clinical trials
• Medical expertise communication: Identifying the healthcare professionals with the right expertise
• Scoring and ranking: Finding the top ranking healthcare professionals or institutions for a particular purpose
• Network leadership analysis: Understanding who is connected to whom and how information is disseminated
• Care delivery effectiveness: Identifying areas of great or poor performance and the underlying reason
• Patient outcomes: Relating patient outcomes to specific market activities
• Health economics: Understanding the financial effectiveness of an intervention or of introducing a new standard of care
DATA CATEGORIES AND EXAMPLES
• Internal: Sales, Speakers, Partners, CRM, Payments, Trials
  Keys: Controlled. Formats: Spreadsheets (structured)
• Vendors: Rx, Claims, Primary research, Consulting, Referral patterns
  Keys: Vendor specific. Formats: Flat files
• Public: Providers, Grants, Public trials, Research
  Keys: Anything and nothing. Formats: Anything
Managing variety is the key to solving the problem
Creating a complete picture requires combining disconnected data from
an enormous variety of sources
Managing data variety is the key to solving the problem
A DIFFERENT PROBLEM REQUIRES A DIFFERENT SOLUTION
Instead…
• A different data model based on
descriptive meta data
• A non-traditional data store
• Something other than Informatica
• Automated intelligent algorithms
• A few special tricks
• An API
• Some really great applications...
ETL, DW, DM, OLAP Cube, BI, Insight
ENTITY CENTRIC DATA MODEL
[Figure: traditional, relational model versus entity centric model. Labels: Entity table; Data source 1, Data source 2, Data source n; Entity, Attributes; Meta data]
ONTOLOGY-BASED DEVELOPMENT
Requirements
• Flexible
• Extensible and adaptive
• Easy to maintain
Solution
• Ontology: Used to formally represent knowledge within a domain
• Vocabulary: Collection of entities, attributes, relationships that provides context within the domain
• Taxonomy (Classification): A hierarchical collection of controlled terms from vocabulary
VOCABULARY
• Entities: Real world things or events. E.g. institution, patient, sales, potential, etc.
• Organic Attributes: Data points coming from datasets. E.g. first_name, age, revenue, date, etc.
• Derived Attributes: Processed key-value pairs from existing organic and/or derived attributes
• Entity Relationships: Relationships between different entities
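The four vocabulary building blocks on this slide can be pictured as a small in-memory model. This is a minimal sketch only, not Zephyr's schema: the class names, the `kind` field, the `revenue_band` attribute and the `treated_at` label are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribute:
    name: str
    kind: str  # "organic" (straight from a dataset) or "derived" (computed)


@dataclass
class Entity:
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)


@dataclass
class Relationship:
    source: str
    target: str
    label: str


@dataclass
class Vocabulary:
    entities: Dict[str, Entity] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_attribute(self, entity: str, name: str, kind: str) -> None:
        if kind not in ("organic", "derived"):
            raise ValueError(f"unknown attribute kind: {kind}")
        # Entities are created on first use, so the vocabulary grows with the data.
        owner = self.entities.setdefault(entity, Entity(entity))
        owner.attributes[name] = Attribute(name, kind)

    def relate(self, source: str, target: str, label: str) -> None:
        for name in (source, target):
            if name not in self.entities:
                raise KeyError(f"unknown entity: {name}")
        self.relationships.append(Relationship(source, target, label))


vocab = Vocabulary()
vocab.add_attribute("patient", "first_name", "organic")
vocab.add_attribute("patient", "age", "organic")
vocab.add_attribute("institution", "revenue", "organic")
vocab.add_attribute("institution", "revenue_band", "derived")  # hypothetical derived attribute
vocab.relate("patient", "institution", "treated_at")  # hypothetical relationship label
```

Because entities and attributes are plain data rather than fixed columns, a new dataset only adds entries to the vocabulary instead of forcing a schema change.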
WHY MONGODB?
Our requirements
• Extremely flexible data storage
• Low cost of evolving schema
• Highly performant for complex joins, recursive queries, etc.
• Scalable to large volumes of connected information
MongoDB
• Document store is a great fit for storing arbitrary information
• Key-value pairs in JSON format (allowed for both adding data traceability and cheap data evolution)
• Secondary indexes and strict consistency
• Map-reduce functionality
Challenges
• Queries are powerful but not easy to write
• We needed complex joins across arbitrary information (how do you create an index on something when you don't know ahead of time what it is?)
DATA ORGANIZATION
• Full Profile
• Main Profile
• Entity Relationships
• Attribute References
• Identity Section
• Attributes (Organic + Derived)
• dataset
• dataset_records
• File Info
• Raw Data
• Geo locations
DATA INTEGRATION
{
first_name: Charles
last_name: Morris
street: 200 First St.
city: Rochester
state: MN
zip: 55905
phone: 802-555-1234
email: [email protected]
headshot: <AF6713…>
thought_leader_score: 8
pub_count: 203
}
DISPARATE SOURCES OF INFORMATION
STRUCTURED PROFILE
APPLICATION REPRESENTATION
All enabled through a series of data integration algorithms
ALGORITHM EXAMPLES
• Disambiguation: Automatically choosing the most authoritative version of an attribute
• Dataset identification: Maximizing re-use of meta data describing imported data sets
• Clustering: Pre-calculating clusters in weakly attributed data
• Record linkage (same person?):
  C Morris, Heart and Vascular Center, 123 Main St, Rochester, MN 55903, 802-555-9988
  Charles “Chuck” Morris, Cardiologist, 200 First St., Rochester, MN 55905, 802-555-1234
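To make the record linkage question concrete, here is a minimal sketch that scores how likely two records are to describe the same person. It is not the algorithm used in the platform: the field weights, the `link_score` helper and the unrelated comparison record are assumptions, and string similarity comes from Python's standard `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; a real system would tune or learn these.
WEIGHTS = {"name": 0.5, "city": 0.2, "state": 0.1, "zip": 0.2}


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_score(rec_a, rec_b):
    """Weighted average of per-field similarities."""
    return sum(w * similarity(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())


# The two records from the slide.
c_morris = {"name": "C Morris", "city": "Rochester", "state": "MN", "zip": "55903"}
charles_morris = {"name": "Charles Morris", "city": "Rochester", "state": "MN", "zip": "55905"}
# An unrelated, made-up record for comparison.
other = {"name": "Tom Smith", "city": "San Francisco", "state": "CA", "zip": "94143"}

same_person_score = link_score(c_morris, charles_morris)
different_person_score = link_score(c_morris, other)
```

The two Morris records score well above the unrelated pair even though name, street and phone differ, which is the kind of signal a linkage step can then compare against a threshold.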
ILLUSTRATIVE MONGODB PROFILE
{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {
    "npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"
  },
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"}
  }
}
NPI FirstName LastName Specialty
1 Tom Smith Cardiologist
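The step from the flat row to the profile document can be sketched as below. This is an illustration under assumptions, not the platform's ingestion code: the slide abbreviates each attribute value as `{…}`, and the sketch assumes that shorthand stands for a list of values that each remember their source dataset. The `build_profile` helper and the `provider_list` dataset name are hypothetical.

```python
def build_profile(row, identity_keys, source):
    """Turn one flat dataset row into a profile-shaped dict."""
    profile = {"identity": {}, "attributes": {}}
    for key, value in row.items():
        if key in identity_keys:
            profile["identity"][key] = value
        # Every value keeps a pointer back to the dataset it came from (traceability).
        profile["attributes"].setdefault(key, []).append(
            {"value": value, "source": source}
        )
    return profile


row = {"npi": "1", "first_name": "Tom", "last_name": "Smith", "specialty": "Cardiologist"}
profile = build_profile(
    row,
    identity_keys={"npi", "first_name", "last_name", "specialty"},
    source="provider_list",  # hypothetical dataset name
)
```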
ADDING ADDITIONAL ATTRIBUTES
{
  "_id": "53bcf9cae4b03f352d4b47c7",
  "identity": {
    "npi": "1",
    "specialty": ["Cardiologist"],
    "first_name": "Tom",
    "last_name": "Smith"
  },
  "attributes": {
    "npi": {1},
    "first_name": {"Tom"},
    "last_name": {"Smith"},
    "specialty": {"Cardiologist"},
    "institution": {"UCSF Medical Center"},
    "clinical_trial": {"Heart Valve Clinical Trial"},
    "start_date": {"01/01/2011"},
    "end_date": {"03/25/2013"}
  }
}
NPI | FirstName | LastName | Specialty
1 | Tom | Smith | Cardiologist
NPI | Institution | ClinicalTrial Name | Start Date | End Date
1 | UCSF Medical Center | Heart Valve Clinical Trial | 01/01/2011 | 03/25/2013
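Adding the second dataset is then a merge on the shared key rather than a schema change. A minimal sketch, assuming the same value-plus-source layout as an illustration of the `{…}` shorthand; the `add_attributes` helper and the dataset names are hypothetical.

```python
def add_attributes(profile, row, key_field, source):
    """Merge a row from another dataset into an existing profile, matched on key_field."""
    if row[key_field] != profile["identity"][key_field]:
        raise ValueError("row does not belong to this profile")
    for name, value in row.items():
        if name == key_field:
            continue  # the key is only used for matching
        entry = {"value": value, "source": source}
        entries = profile["attributes"].setdefault(name, [])
        if entry not in entries:  # re-importing the same row is harmless
            entries.append(entry)
    return profile


profile = {
    "identity": {"npi": "1", "first_name": "Tom", "last_name": "Smith"},
    "attributes": {
        "specialty": [{"value": "Cardiologist", "source": "provider_list"}],
    },
}
trial_row = {
    "npi": "1",
    "institution": "UCSF Medical Center",
    "clinical_trial": "Heart Valve Clinical Trial",
    "start_date": "01/01/2011",
    "end_date": "03/25/2013",
}
add_attributes(profile, trial_row, key_field="npi", source="trial_registry")
add_attributes(profile, trial_row, key_field="npi", source="trial_registry")  # second import adds nothing
```

New attribute names simply appear as new keys, which is the cheap schema evolution the document model is meant to provide.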
TRICKS TO TAME THE WILD DATA
• Ontology – how we keep track of all ingested information
• Vocabulary – bringing structure to large variety of information
• Derived attributes – encapsulate complexity
• GIS transformations – practical integration of geo data
• Indexing – fast access to complex information in MongoDB
DERIVED ATTRIBUTES
What’s the problem?
• Data is rarely clean and business rules are complex
What are we doing about it?
• Use existing (organic) attributes and apply rules to generate new (derived) attributes
• Derived attributes generated through queries or map-reduce jobs
Why it matters
• Too complex and expensive to consider all business rules at run-time with every query
• Hides the complexity and introduces uniformity
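The idea of computing derived attributes ahead of query time can be sketched as an ordered list of rules applied to a profile's existing attributes. This is an illustration only: the rule names, the thresholds and the `derive_attributes` helper are invented, and the talk describes the real derivations as queries or map-reduce jobs rather than Python functions.

```python
def derive_attributes(attributes, rules):
    """Apply each rule in order and return only the newly derived attributes."""
    derived = {}
    for name, rule in rules:
        # A rule sees the existing attributes plus anything derived so far.
        value = rule({**attributes, **derived})
        if value is not None:
            derived[name] = value
    return derived


# Hypothetical business rules; names and thresholds are made up for illustration.
RULES = [
    ("full_name", lambda a: f"{a['first_name']} {a['last_name']}"),
    ("prolific_author", lambda a: a.get("pub_count", 0) >= 100),
    (
        "outreach_priority",
        lambda a: "high"
        if a["prolific_author"] and a.get("thought_leader_score", 0) >= 7
        else "normal",
    ),
]

attributes = {
    "first_name": "Charles",
    "last_name": "Morris",
    "pub_count": 203,
    "thought_leader_score": 8,
}
derived = derive_attributes(attributes, RULES)
```

The third rule reads a value produced by the second, mirroring the point that derived attributes can be built from organic and/or other derived attributes, and queries then only need to read the stored result.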
GEOSPATIAL MAPPING APPROACH FOR
AWKWARD GEO DATA
Using traditional method
Reporting unit, Postal codes (70173), Mapping + calculations, Stuttgart District, State Baden-Württemberg
• Additional challenges with mismatches between reporting unit postal codes and mapping postal codes
• Have to compensate for missing postal codes
• Split patients or metrics across multiple regions when reporting unit spans multiple regions
Using geospatial method
Geocoded reporting unit, Mapping + calculation, Stuttgart District, State Baden-Württemberg
• Requires determining a single central point for each reporting unit
• Uses no mapping documents
• No compensatory calculations required
• Overall accuracy increases
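The geospatial method boils down to a point-in-polygon test: each reporting unit is reduced to one central point, and that point is looked up against region boundaries. Below is a minimal sketch using the standard ray casting test. The region names and square boundaries are toy values, not real geography, and a production system would more likely rely on the database's geospatial queries than on hand-written geometry.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_region(point, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Toy boundaries: two adjacent squares standing in for districts.
REGIONS = {
    "district_a": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "district_b": [(4, 0), (8, 0), (8, 4), (4, 4)],
}

reporting_unit_center = (1.5, 2.0)  # the single central point of a reporting unit
region = assign_region(reporting_unit_center, REGIONS)
```

Because each reporting unit maps to exactly one region, no postal code mapping documents or split calculations are needed, at the cost of choosing a sensible central point.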
INDEXING
Why MongoDB alone does not get it done
• Cross collection queries required for a large number of scenarios
• Indexing challenges when dealing with unknown information
What we did
• Graph based index
• Entities and attributes are nodes
• Entity – attribute ownership and entity to entity relationships are edges
How we use it
• zQueries allow us to do complex queries from web front ends
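A graph index of this shape can be sketched with a plain adjacency map: entity nodes and attribute-value nodes, connected by ownership and relationship edges. This is a toy in-memory illustration, not the platform's index and not zQuery syntax; the `GraphIndex` class, its method names and the `jane_doe` profile are assumptions.

```python
from collections import defaultdict


class GraphIndex:
    """Entities and attribute values are nodes; ownership and relationships are edges."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def add_attribute(self, entity_id, name, value):
        self._link(("entity", entity_id), ("attr", name, value))

    def relate(self, entity_a, entity_b):
        self._link(("entity", entity_a), ("entity", entity_b))

    def entities_with(self, name, value):
        """All entities owning the given attribute value."""
        return {node[1] for node in self.edges.get(("attr", name, value), set())}

    def related_with(self, entity_id, name, value):
        """Entities directly related to entity_id that own the given attribute value."""
        neighbours = {
            node[1]
            for node in self.edges.get(("entity", entity_id), set())
            if node[0] == "entity"
        }
        return neighbours & self.entities_with(name, value)


index = GraphIndex()
index.add_attribute("tom_smith", "specialty", "Cardiologist")
index.add_attribute("jane_doe", "specialty", "Cardiologist")  # hypothetical second profile
index.add_attribute("ucsf", "name", "UCSF Medical Center")
index.relate("tom_smith", "ucsf")

cardiologists = index.entities_with("specialty", "Cardiologist")
cardiologists_at_ucsf = index.related_with("ucsf", "specialty", "Cardiologist")
```

Attribute names never have to be declared up front: any new name or value becomes another node, which sidesteps the problem of indexing fields that are unknown ahead of time.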
THE ZEPHYR PLATFORM
Disconnected Data Apps for Life Sciences
Algorithm Driven Data Ingestion
Synchronization
Proprietary REST API
zQuery
Internal, Vendor, Public
Data Organized in Connected Profile Documents
Graph Based Materialized Query Index
Ontology Driven Data Tier
100,000,000+ data points ingested and indexed each year
CONSUMING INTEGRATED DISPARATE DATA
Analytical applications use the zAPI and the ontology to produce applications that adapt to changing data
Zephyr Platform
• Ontology Driven Data Store
API
• REST API: Exposes both data and the ontology
• zQueries: JSON based query language for queries against dynamic and connected data
Analytical Apps
• Functional Focus: Solving specific business problems with focused apps
• Design: Single page apps with targeted data visualizations
TARGETED ANALYTICAL APPLICATIONS
Apps for real business problems leveraged by everyday business users
Illuminate, Voyager, Kaleidoscope, Lighthouse
A BRIEF DEMO
LEARNINGS
• There was no one technology or one database that provided a complete solution: embrace diversity
• Create a generic platform, pour effort into specialized algorithms to populate data intelligently
• Ontology driven development can be very powerful, but data organization is still a challenge
• Indexing on a priori unknown attributes is challenging
• Data modeling is always important; large profiles had to be broken down
SUMMARY
Wrapping it all up in five points
1. Healthcare is different and has lots of critical data that is disconnected
2. Generic, MongoDB-based data storage model using meta-data
3. Data integration powered by algorithms
4. Document profiles for facts, graph for querying
5. Diverse set of end user analytical applications powered by the generic data
platform
Why this matters
• Standards are really important, but slow to develop
• Huge amount of change occurring in our healthcare system
• We need to make decisions today based on available data sets despite existing
challenges
THANK YOU!
Brian Roy – Strategy and architecture
Mahesh Chaudhari – Database architecture
Cesar Arevalo – Data integration implementation
The guys that made all of it come together!
CONTACT INFORMATION
Sven Junkergård, CTO
+1.415.503.7412
Zephyr Health
450 Mission St. Suite 201
San Francisco, California 94105
+1.415.529.7649
zephyrhealth.com
BACKUP SCREEN SHOTS
ILLUMINATE – LANDING PAGE
ILLUMINATE – ALL CASES VIEW
ILLUMINATE – GRID VIEW
ILLUMINATE – GRAPH VIEW
ILLUMINATE – PROFILE VIEW