i-sense: early-warning sensing systems for infectious diseases
Jens Kristian Geyti, Software Engineer
University College [email protected]
The promise of Big Data
“Dense” (clinical and admin) data
- Patient Journals
- Lab reports
- HES data
- Claims
“Sparse” (health-relevant) data
- News
- Social Media messages
- Sensors (Fitbit etc.)
- Phone logs
Personalised course of action
- Diagnosis
- Treatment and prescription
Preventative course of action
- Prevent infections
- Predict resource usage
- “Nowcasting” flu levels in geographical regions
Data Flow
Step        Old Data (OLTP)                “Big Data”
Acquire     Databases (RDBMS) and files    Specialised data storage
Organise    ETL                            Distributed computing
Analyse     Data Warehousing               ML/Graph Algorithms
Decide      BI and reporting               Answer to question
Our Data Problem - No standards/Secondary use
MySQL dumps
“Realtime” data, Batch reports, Unstructured data
Data Updates
JSON, CSV, TSV
HTML
ZIP, GZIP, LZO
Parquet, Raw
Excel, PDF
HTTP, WebSockets
Email, FTP
XML
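With this many formats arriving over this many channels, one way to keep the mess contained is a single decoding step that turns every payload into the same row shape. The following is only a minimal sketch covering gzip, JSON, CSV and TSV; the function name `decode_payload`, the file-name-based dispatch and the sample payload are assumptions made for illustration, not part of the talk.

```python
import csv
import gzip
import io
import json


def decode_payload(name: str, payload: bytes) -> list[dict]:
    """Turn one downloaded payload into a list of row dicts.

    `name` is the (hypothetical) file name the payload arrived under.
    """
    # Peel off gzip compression first; gzip streams start with bytes 1f 8b.
    if payload[:2] == b"\x1f\x8b":
        payload = gzip.decompress(payload)
        name = name.removesuffix(".gz")

    text = payload.decode("utf-8")
    if name.endswith(".json"):
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if name.endswith((".csv", ".tsv")):
        delimiter = "\t" if name.endswith(".tsv") else ","
        return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    # Anything else (Excel, PDF, HTML, ...) needs its own decoder.
    raise ValueError(f"no decoder registered for {name!r}")


# Made-up sample payload: a gzip-compressed TSV file.
rows = decode_payload("flu.tsv.gz", gzip.compress(b"region\tcases\nLondon\t12\n"))
print(rows)
```

Formats without a decoder fail loudly instead of being silently skipped, which makes gaps in coverage visible early.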
Our Data Processing Problem
Download
Cleanse and munge
python/matlab magic
Validate
Publish
Experimental results / paper
Concern: Solve the problem
Contract: Experimental results / paper
Contracts and Separation of concerns
Stage              Concern                                     Contract (output)
Acquire/Collect    Collecting data with failover               Raw data
Organise           Presenting processable data                 High-throughput data access layer
Processing         Distributed data manipulation               Model output
Validate           Detecting errors                            Validated model output
Publish            Publish to externally accessible service    Website / CSV / streaming data
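The idea of one concern per stage, with a contract between stages, can be sketched in code as small functions whose input and output types are the contracts. This is a toy illustration only: the record types, the comma-separated raw format and the per-region mean standing in for a "model" are all assumptions, not the project's actual pipeline.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RawRecord:        # contract: raw data
    source: str
    body: str


@dataclass(frozen=True)
class Observation:      # contract: processable data
    region: str
    count: int


@dataclass(frozen=True)
class ModelOutput:      # contract: model output
    region: str
    estimate: float


def organise(raw: Iterable[RawRecord]) -> list[Observation]:
    # Concern: presenting processable data (assumed "region, count" bodies).
    out = []
    for record in raw:
        region, count = record.body.split(",")
        out.append(Observation(region.strip(), int(count)))
    return out


def process(observations: Iterable[Observation]) -> list[ModelOutput]:
    # Concern: data manipulation. A per-region mean stands in for a real model.
    counts: dict[str, list[int]] = {}
    for obs in observations:
        counts.setdefault(obs.region, []).append(obs.count)
    return [ModelOutput(r, sum(c) / len(c)) for r, c in sorted(counts.items())]


def validate(outputs: list[ModelOutput]) -> list[ModelOutput]:
    # Concern: detecting errors. Negative estimates are rejected outright.
    bad = [o for o in outputs if o.estimate < 0]
    if bad:
        raise ValueError(f"invalid model output: {bad}")
    return outputs


def publish(outputs: list[ModelOutput]) -> str:
    # Concern: publishing. Here the published contract is CSV text.
    lines = ["region,estimate"] + [f"{o.region},{o.estimate:.1f}" for o in outputs]
    return "\n".join(lines)


# Made-up raw records for illustration.
raw = [
    RawRecord("feed-a", "London, 10"),
    RawRecord("feed-b", "London, 14"),
    RawRecord("feed-a", "Leeds, 3"),
]
csv_text = publish(validate(process(organise(raw))))
print(csv_text)
```

Because each stage only depends on the type it receives, any one stage can be replaced (for example, swapping the toy mean for a distributed job) without touching the others.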
Challenges and questions
- Parallelisation/distribution
- Downtime
- Data retention
- Data consistency
- Data replays/duplicates
- Deadlines (data delays and latency)
- Monitoring and alerting
- License terms/data storage agreements
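Of the challenges listed, replays and duplicates are perhaps the easiest to illustrate. A common approach, sketched here as an assumption and not taken from the talk, is to fingerprint each message and drop any fingerprint already seen. The class name `Deduplicator` and the sample messages are hypothetical.

```python
import hashlib


class Deduplicator:
    """Remembers content hashes so replayed messages can be dropped."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, payload: bytes) -> bool:
        # Identical bytes always hash to the same digest, so a replayed
        # message is recognised regardless of when it arrives.
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedup = Deduplicator()
messages = [b"tweet-1", b"tweet-2", b"tweet-1"]  # the third is a replay
unique = [m for m in messages if dedup.is_new(m)]
print(unique)
```

The set grows without bound in this sketch; a long-running consumer would need to expire old digests, for example in line with the data retention window.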
Hardware
UCL
- Cheap (free)
- High-performance hardware
- Familiar concepts
- Inflexible hardware allocation
- Support latency
- No uptime guarantees
- Network shares
Cloud hosting (IaaS)
- Unsafe data storage
- Made to fail
Cloud hosting & IaaS
DigitalOcean/Linode/(AWS EC2)
- Just a Linux VM
- Snapshot support
- API support
You provide
- Monitoring
- Error recovery
- Stable storage/data queue
Our solution (just an example!!)
- Collect data on cloud VM instances (Python/PhantomJS)
- At-least-once delivery
- Delete after 24/48/72 hours
- Send to Kinesis (stream processing)
- Spill to S3 (stable storage)
[Diagram: four Data Collectors, Stream 1 (Kinesis), Stream 2 (Kinesis), S3, Lambda, CloudWatch]
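At-least-once delivery on a collector can be sketched as "write locally first, mark as sent only after the send succeeds, clean up later". The sketch below is an assumption about how such a collector might look, not the project's code: the `Spool` class, the `.pending`/`.sent` file suffixes and the reading of "delete after 24/48/72 hours" as a local retention window are all hypothetical. The `send` callable is injected so the example runs without AWS; in practice it could wrap a Kinesis client call.

```python
import tempfile
import time
import uuid
from pathlib import Path
from typing import Callable, Optional


class Spool:
    """Local on-disk buffer giving at-least-once delivery to a stream."""

    def __init__(self, directory: Path, retention_hours: float = 48) -> None:
        self.directory = directory
        self.retention_seconds = retention_hours * 3600
        directory.mkdir(parents=True, exist_ok=True)

    def collect(self, payload: bytes) -> Path:
        # Persist before sending, so a crash never loses collected data.
        path = self.directory / f"{uuid.uuid4().hex}.pending"
        path.write_bytes(payload)
        return path

    def flush(self, send: Callable[[bytes], object]) -> int:
        # Try every pending file; a failed send leaves the file in place
        # for the next flush. A crash between send and rename causes a
        # resend, hence "at least once" and possible duplicates downstream.
        sent = 0
        for path in sorted(self.directory.glob("*.pending")):
            try:
                send(path.read_bytes())
            except Exception:
                continue
            path.rename(path.with_suffix(".sent"))
            sent += 1
        return sent

    def expire(self, now: Optional[float] = None) -> int:
        # Delete delivered files once they are older than the retention window.
        now = time.time() if now is None else now
        removed = 0
        for path in self.directory.glob("*.sent"):
            if now - path.stat().st_mtime > self.retention_seconds:
                path.unlink()
                removed += 1
        return removed


with tempfile.TemporaryDirectory() as tmp:
    spool = Spool(Path(tmp) / "spool", retention_hours=24)
    spool.collect(b"reading-1")
    spool.collect(b"reading-2")
    delivered: list[bytes] = []
    first_flush = spool.flush(delivered.append)  # stand-in for a stream client
    expired = spool.expire(now=time.time() + 25 * 3600)  # pretend 25 h passed
```

Keeping delivered files for a day or more gives a window in which data can be replayed if something downstream turns out to have failed.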
Solution
- Streaming ✓ Kinesis
- File access ✓ S3
- Monitoring and alarm ✓ CloudWatch
Pricing
- Stream: $11 per month per 1 MB/sec shard
- Storage: $30/month per TB
- Monitoring/alarms: $5/month
- Download: $90 per TB
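To see what these prices add up to, the arithmetic can be written out. The unit prices below are the ones quoted on the slide; the workload (two shards, 1 TB stored, 0.5 TB downloaded) is a made-up example, and the function name is hypothetical.

```python
# Unit prices as quoted on the slide (USD).
STREAM_PER_SHARD_MONTH = 11.0   # one 1 MB/sec shard for a month
STORAGE_PER_TB_MONTH = 30.0
MONITORING_PER_MONTH = 5.0
DOWNLOAD_PER_TB = 90.0


def monthly_cost(shards: int, stored_tb: float, downloaded_tb: float) -> float:
    """Estimated monthly bill in USD for a given (assumed) workload."""
    return (
        shards * STREAM_PER_SHARD_MONTH
        + stored_tb * STORAGE_PER_TB_MONTH
        + MONITORING_PER_MONTH
        + downloaded_tb * DOWNLOAD_PER_TB
    )


# Hypothetical workload: 2 shards, 1 TB stored, 0.5 TB downloaded.
# 2 * 11 + 1 * 30 + 5 + 0.5 * 90 = 102
example = monthly_cost(shards=2, stored_tb=1.0, downloaded_tb=0.5)
print(f"${example:.2f} per month")
```

Download is the most expensive line per TB, which is one reason to keep heavy processing inside the provider and only pull out results.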
Organisation and processing
S3
Hourly/nightly jobs
UCL
SQLite, CSV files, etc.
Hadoop (etc)
S3
Amazon AWS walled garden
The internet and beyond
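One possible shape for an hourly or nightly job is loading a CSV export into a local SQLite file in a way that is safe to rerun. This is a sketch under assumptions: the table name `estimates`, its columns and the sample rows are invented for illustration, and the slide does not say which direction the data actually flows.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn: sqlite3.Connection, csv_text: str) -> int:
    """Load a CSV export into SQLite; rerunning the job does not duplicate rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS estimates ("
        "day TEXT, region TEXT, estimate REAL, PRIMARY KEY (day, region))"
    )
    rows = [
        (r["day"], r["region"], float(r["estimate"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT OR REPLACE INTO estimates VALUES (?, ?, ?)", rows)
    return len(rows)


# Made-up export for illustration only.
export = "day,region,estimate\n2015-01-01,London,12.5\n2015-01-01,Leeds,3.0\n"

conn = sqlite3.connect(":memory:")
load_csv_into_sqlite(conn, export)
load_csv_into_sqlite(conn, export)  # a rerun overwrites instead of duplicating
count = conn.execute("SELECT COUNT(*) FROM estimates").fetchone()[0]
print(count)
```

The primary key on (day, region) is what makes the job idempotent, which matters when a scheduled run is retried after a failure.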
Thank you
Slides available at goo.gl/f9ta7f after the talk.
Jens Kristian Geyti, Software Engineer
University College [email protected]