Ingesting Healthcare Data, Micah Whitacre

Posted on 23-Jan-2017


Category: Engineering


TRANSCRIPT

INGESTING COMPLEX HEALTHCARE DATA WITH APACHE KAFKA

Micah Whitacre (@mkwhit)

#kafkasummit

Leader in Healthcare IT

~30% of all US Healthcare Data in a Cerner Solution

[Diagram: data from a Doctor's Office, Minute Clinic, ER/Hospital, and Specialist flows through ambulatory feeds (<2 seconds) into Sepsis Alerting (minutes); table .NOTIFY events land in a Google Percolator-style NoSQL store, which a Collector reads over HTTP]

Was successful… for a while

Progressed from minutes to seconds

Hit a wall preventing going faster (missed SLAs)

[Diagram: Solutions A, B, and C each run their own NoSQL cluster with dedicated Collectors and Crawlers: cluster sprawl]

Use the right tool for the job!

NoSQL != Distributed Queue

Anti-patterns apply to everyone eventually

Our scalability should not impact crawlers

Cluster sprawl should be avoided

Reduce the number of copies

[Diagram: table .NOTIFY rows written to NoSQL are replaced by table changes published to a Kafka topic]

Kafka Base Notifications

● Kafka topic per listener
● Small Google Protobuf payloads
  ○ Gzip based compression for higher compression
● Could minimize to fewer listeners
  ○ Single topic and partition vs 100s of NoSQL rows
● Able to give up fairness concerns in favor of speed
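The notification scheme above can be sketched as a small, compressed message per change. This is a minimal sketch: JSON stands in for the Google Protobuf encoding the talk uses, and the field names (`table`, `id`) are illustrative, not Cerner's.

```python
import gzip
import json


def encode_notification(table: str, row_id: str) -> bytes:
    """Serialize a small change notification and gzip it.

    Gzip is used here, as in the talk, to squeeze the payload further;
    JSON is only a stand-in for the real Protobuf encoding.
    """
    payload = json.dumps({"table": table, "id": row_id}).encode("utf-8")
    return gzip.compress(payload)


def decode_notification(blob: bytes) -> dict:
    """Reverse the encoding: decompress, then parse."""
    return json.loads(gzip.decompress(blob))


# Round trip: a notification survives encode/decode unchanged.
msg = encode_notification("PERSON", "12345")
assert decode_notification(msg) == {"table": "PERSON", "id": "12345"}
```

Because each notification is tiny and fairness is sacrificed for speed, many of these can share a single topic and partition instead of hundreds of NoSQL rows.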

[Diagram: a single Collector publishes notifications through Kafka to the Crawlers, with the NoSQL clusters remaining the data stores]

Kafka Staging Area
● Single location for one copy of the data
● Consumption based on type and source of data
  ○ ~500 types and 100-1000 sources
  ○ Chose source-based topics to cut down on topic count
  ○ Default to 8 partitions
● Snappy compression for low latency
● Huge variation in data sizes and frequency
  ○ Infrequent MB-GB file uploads (daily, weekly, monthly, yearly)
  ○ Streaming uploads of 100 B-10 MB
● Time-based retention to prevent data loss
  ○ Ambitiously set to 30 days but lowered to 7 days
  ○ Archive data to HDFS for reprocessing or lagging/offline consumers
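The source-based topic choice above can be sketched as a naming helper. The `staging.<source>` convention is hypothetical; the talk only says topics are keyed per source, rather than per (source, type) pair, to keep the topic count manageable.

```python
DEFAULT_PARTITIONS = 8  # the talk's default partition count per topic


def staging_topic(source: str) -> str:
    """Map a data source to its staging topic.

    One topic per source means ~440 topics for ~440 sources,
    instead of 440 * ~500 topics if topics were per (source, type).
    The 'staging.' prefix is an illustrative convention.
    """
    return f"staging.{source.lower()}"


topics = {staging_topic(s) for s in ("Clinic_A", "Hospital_B")}
assert topics == {"staging.clinic_a", "staging.hospital_b"}
```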

Kafka Payloads And Delivery

● Avro schema to wrap ingested data
  ○ Source, Type, Id, Version, Value (byte[]), Metadata (byte[]), Properties
  ○ Common payload regardless of actual byte[]
● Set threshold for payloads stored in Kafka
  ○ Store 95-98% of data in Kafka
  ○ Data larger than 50 MB stored in HDFS, with the path stored in the Avro wrapper
● Rate of ingestion changes with Kafka
  ○ Lack of backpressure can increase rate of ingestion
  ○ Capacity and retention planning could end up inaccurate
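The 50 MB threshold routing above can be sketched as follows. The `Wrapper` dataclass mirrors the Avro wrapper's fields (Source, Type, Id, Version, Value, Metadata, Properties); the `hdfs.path` property key and the path layout are stand-ins, not from the talk.

```python
from dataclasses import dataclass, field

THRESHOLD = 50 * 1024 * 1024  # 50 MB, the talk's inline-storage cutoff


@dataclass
class Wrapper:
    """Stand-in for the common Avro wrapper record."""
    source: str
    type: str
    id: str
    version: int
    value: bytes                 # inline payload, or empty if offloaded
    metadata: bytes = b""
    properties: dict = field(default_factory=dict)


def wrap(source, type_, id_, version, payload: bytes, hdfs_dir="/archive"):
    """Keep payloads at or under the threshold inline in Kafka;
    offload larger ones to HDFS and carry only the path."""
    if len(payload) <= THRESHOLD:
        return Wrapper(source, type_, id_, version, payload)
    path = f"{hdfs_dir}/{source}/{type_}/{id_}-v{version}"
    return Wrapper(source, type_, id_, version, b"",
                   properties={"hdfs.path": path})
```

Consumers then see one common record shape regardless of whether the actual bytes ride along in Kafka or sit in HDFS.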

Most Surprising Lesson Learned

[Chart: Initial Crawl - NoSQL: messages/sec over a span of weeks, first crawling all historical data, then only recent changes]

Total Storage Needed in Kafka = Rate of Data Ingested Per Day By Source × Number of Sources × Number of Days to Keep in Kafka

[Chart: Initial Crawl - Kafka: the same crawl pattern over days instead of weeks; crawls went from weeks to days, a 10-30x improvement]
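The storage-planning formula above is just a multiplication per source. The per-source rates below are made-up illustrative numbers; the talk reports ~1.2 TB/day of raw data across ~440 sources.

```python
def kafka_storage_bytes(rate_per_day_by_source: dict, retention_days: int) -> int:
    """Total Kafka storage needed: for each source, its daily ingest
    rate times the retention window, summed over all sources."""
    return sum(rate * retention_days
               for rate in rate_per_day_by_source.values())


GB = 1024 ** 3
rates = {"clinic_a": 2 * GB, "hospital_b": 5 * GB}  # hypothetical sources
# (2 GB + 5 GB) per day * 7 days retention = 49 GB
assert kafka_storage_bytes(rates, 7) == 49 * GB
```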

Kafka Storage Woes

● Monitor ALL THE THINGS
  ○ Broker free space
  ○ Disk usage per topic
  ○ Consumer lag in message count and max latency
  ○ Rate of data per source to detect anomalies vs steady state
● Re-evaluate default retention with more evidence
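Consumer lag in message count, one of the metrics above, is the gap between a partition's log-end offset and the consumer's committed offset. A minimal sketch, assuming the offsets have already been fetched (in practice they come from Kafka's consumer/admin APIs):

```python
def consumer_lag(latest_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag in messages: how far the committed offset
    trails the log-end offset. An uncommitted partition counts as
    lagging by its full log length."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}


lag = consumer_lag({0: 1200, 1: 900}, {0: 1150, 1: 900})
assert lag == {0: 50, 1: 0}
assert max(lag.values()) == 50  # alert on the worst partition
```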

Kafka Storage Woes Solution

● When storage gets tight, know your options
  ○ Automate building new servers
  ○ Adjust retention policy for a topic(s)
● Balancing partitions is hard to do by hand
  ○ Balance in small batches
  ○ Automate, Automate, Automate
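The small-batch balancing advice can be sketched as a toy planner: pick the most-loaded broker and move only a few of its partitions at a time. Real reassignment goes through Kafka's partition-reassignment tooling; batching simply limits how much data is in flight at once. The greedy heuristic here is illustrative, not the talk's actual automation.

```python
def plan_moves(assignment: dict, new_broker: str, batch_size: int = 2):
    """Plan one small batch of partition moves from the most-loaded
    broker to `new_broker`. `assignment` maps partition -> broker."""
    by_broker = {}
    for partition, broker in assignment.items():
        by_broker.setdefault(broker, []).append(partition)
    overloaded = max(by_broker, key=lambda b: len(by_broker[b]))
    # Move at most `batch_size` partitions, then re-plan after they settle.
    return [(p, overloaded, new_broker)
            for p in by_broker[overloaded][:batch_size]]


moves = plan_moves({"t-0": "b1", "t-1": "b1", "t-2": "b1", "t-3": "b2"}, "b3")
assert len(moves) == 2
assert all(src == "b1" and dst == "b3" for _, src, dst in moves)
```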

[Diagrams: the pipeline (NoSQL clusters, Collector, Crawlers) evolving to route through Kafka, and finally spanning DataCenter A and DataCenter B via Kafka-to-Kafka replication]

Current Stats
● Deployed in 3 (soon to be 4) data centers
● 440 sources currently (⅓ of all clients)
● Ingesting 2 billion messages per day
  ○ Spiked as high as 6 billion
● Ingest 1.2 TB/day of raw data
● Archive job runs hourly and takes ~10 mins to pull ~50 GB of data
● Latency
  ○ NoSQL: 2-3 seconds (subset of data)
  ○ Replication (Kafka to Kafka): 700 milliseconds (all the data)

http://engineering.cerner.com/
