i-Sense: an early-warning sensing system for infectious diseases

Jens Kristian Geyti, Software Engineer, University College London
[email protected]
www.i-sense.org.uk

Upload: london-school-of-hygiene-and-tropical-medicine

Post on 15-Apr-2017


BIG DATA!

The promise of Big Data

“Dense” (clinical and admin) data

- Patient journals
- Lab reports
- HES data
- Claims

“Sparse” (health-relevant) data

- News
- Social media messages
- Sensors (Fitbit etc.)
- Phone logs

Personalised course of action

- Diagnosis
- Treatment and prescription

Preventative course of action

- Prevent infections
- Predict resource usage
- “Nowcasting” flu levels in geographical regions

Data Flow

          Old Data (OLTP)               “Big Data”
Acquire   Databases (RDBMS) and files   Specialised data storage
Organise  ETL                           Distributed computing
Analyse   Data Warehousing              ML/Graph Algorithms
Decide    BI and reporting              Answer to question

Our Data Problem - No standards/Secondary use

- MySQL dumps
- “Realtime” data
- Batch reports
- Unstructured data
- Data updates
- JSON, CSV, TSV
- HTML
- XML
- ZIP, GZIP, LZO
- Parquet, raw
- Excel, PDF
- HTTP, WebSockets
- Email, FTP

Our Data Processing Problem

Download

Cleanse and munge

python/matlab magic

Validate

Publish

Experimental results / paper

Concern: Solve the problem

Contract: Experimental results / paper

Contracts and Separation of concerns

Acquire/Collect

- Concern: Collecting data with failover
- Contract: Raw data

Organise

- Concern: Presenting processable data
- Contract: High-throughput data access layer

Processing

- Concern: Distributed data manipulation
- Contract: Model output

Validate

- Concern: Detecting errors
- Contract: Validated model output

Publish

- Concern: Publish to externally accessible service
- Contract: Website / CSV / streaming data
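One way to read the stage/contract split is as a chain of functions, where each stage owns one concern and the type it returns is its contract with the next stage. The sketch below is only an illustration: the record types, the `region,count` line format, and the summing "model" are all made up for the example and are not taken from the i-Sense code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RawRecord:
    """Contract of Acquire/Collect: bytes as received, plus where they came from."""
    source: str
    payload: bytes


@dataclass(frozen=True)
class Row:
    """Contract of Organise: a parsed record that later stages can process."""
    source: str
    region: str
    count: int


@dataclass(frozen=True)
class ModelOutput:
    """Contract of Processing (and, once checked, of Validate)."""
    region: str
    total: int


def organise(raw: Iterable[RawRecord]) -> List[Row]:
    """Concern: presenting processable data. Here: parse 'region,count' lines."""
    rows = []
    for record in raw:
        region, count = record.payload.decode("utf-8").strip().split(",")
        rows.append(Row(record.source, region, int(count)))
    return rows


def process(rows: Iterable[Row]) -> List[ModelOutput]:
    """Concern: data manipulation. Here: a stand-in 'model' that sums per region."""
    totals = {}
    for row in rows:
        totals[row.region] = totals.get(row.region, 0) + row.count
    return [ModelOutput(region, total) for region, total in sorted(totals.items())]


def validate(outputs: Iterable[ModelOutput]) -> List[ModelOutput]:
    """Concern: detecting errors. Here: reject negative totals."""
    checked = list(outputs)
    for out in checked:
        if out.total < 0:
            raise ValueError(f"negative total for {out.region}")
    return checked


def publish(outputs: Iterable[ModelOutput]) -> str:
    """Concern: publishing. Here: render the validated output as CSV text."""
    lines = ["region,total"] + [f"{o.region},{o.total}" for o in outputs]
    return "\n".join(lines)


def run_pipeline(raw: Iterable[RawRecord]) -> str:
    """Each stage only sees the contract produced by the stage before it."""
    return publish(validate(process(organise(raw))))
```

Because every stage depends only on the previous contract, any one of them can be swapped (for example, a distributed implementation of `process`) without touching the others.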

Contracts and Separation of concerns: Acquire/Collect

Contract: Raw data

Concern: Collecting data with failover

Challenges and questions

- Parallelisation/distribution
- Downtime
- Data retention
- Data consistency
- Data replays/duplicates
- Deadlines (data delays and latency)
- Monitoring and alerting
- License terms/data storage agreements

Hardware

UCL

- Cheap (free)
- High-performance hardware
- Familiar concepts
- Inflexible hardware allocation
- Support latency
- No uptime guarantees
- Network shares

Cloud hosting (IaaS)

- Unsafe data storage
- Made to fail

Cloud hosting & IaaS

DigitalOcean/Linode/(AWS EC2)

- Just a Linux VM
- Snapshot support
- API support

You provide

- Monitoring
- Error recovery
- Stable storage/data queue

Our solution (just an example!!)

- Collect data on cloud VM instances (python/phantomjs)
- at-least-once delivery
- delete after 24/48/72 hours
- Send to Kinesis (stream processing)
- Spill to S3 (stable storage)

Diagram: Data Collector (x4), Stream 1 (Kinesis), Stream 2 (Kinesis), S3, Lambda, CloudWatch
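The at-least-once idea on this slide can be sketched as a collector that stores each record before trying to send it, and only forgets the record once the send has succeeded. This is a minimal sketch under assumptions, not the i-Sense collector: `Collector`, `send`, and `retention_seconds` are hypothetical names, and dropping records older than the retention window is just one reading of "delete after 24/48/72 hours". In practice `send` might wrap a call such as boto3's Kinesis `put_record`, but here it is any callable that raises on failure.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class Collector:
    """Buffers records locally and retries until the stream accepts them."""

    def __init__(self, send: Callable[[bytes], None], retention_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._send = send                    # hypothetical sender: raises on failure
        self._retention = retention_seconds  # e.g. 24 * 3600 for a 24-hour window
        self._clock = clock
        self._pending: Deque[Tuple[float, bytes]] = deque()

    def collect(self, record: bytes) -> None:
        """Store first, so a failed send never loses the record."""
        self._pending.append((self._clock(), record))

    def flush(self) -> int:
        """Try to deliver everything pending; return how many were delivered."""
        delivered = 0
        now = self._clock()
        while self._pending:
            collected_at, record = self._pending[0]
            if now - collected_at > self._retention:
                self._pending.popleft()      # past the retention window: drop it
                continue
            try:
                self._send(record)
            except Exception:
                break                        # keep the record; retry on next flush
            # Forget the record only after a successful send. A crash between the
            # send and this line would re-send it later: at-least-once, not
            # exactly-once, so downstream consumers must tolerate duplicates.
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        return len(self._pending)
```

The trade-off is the one listed under "Data replays/duplicates": nothing is lost while the stream is down, but the same record can arrive more than once.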

Solution

- Streaming ✓ Kinesis
- File access ✓ S3
- Monitoring and alarm ✓ CloudWatch

Pricing

- Stream: $11 per month per 1 MB/sec shard
- Storage: $30/month per TB
- Monitoring/alarms: $5/month
- Download: $90 per TB
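The pricing figures can be combined into a rough monthly estimate. The sketch below uses only the numbers quoted on the slide, which should be treated as a snapshot from the time of the talk rather than current AWS rates; the function name and the rule of rounding throughput up to whole shards are assumptions made for the example.

```python
import math

# Prices as quoted on the slide (USD). A snapshot, not current AWS rates.
SHARD_PER_MONTH = 11.0         # one 1 MB/sec stream shard
STORAGE_PER_TB_MONTH = 30.0
MONITORING_PER_MONTH = 5.0
DOWNLOAD_PER_TB = 90.0


def monthly_cost(ingest_mb_per_sec: float, stored_tb: float, downloaded_tb: float) -> float:
    """Rough monthly bill for a given ingest rate, stored volume and download volume."""
    shards = max(1, math.ceil(ingest_mb_per_sec))  # whole shards, at least one
    return (shards * SHARD_PER_MONTH
            + stored_tb * STORAGE_PER_TB_MONTH
            + MONITORING_PER_MONTH
            + downloaded_tb * DOWNLOAD_PER_TB)
```

For example, `monthly_cost(1.5, 2, 0.5)` needs two shards and comes to 22 + 60 + 5 + 45 = 132 dollars; the download charge is what grows fastest once data leaves AWS.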

Contracts and Separation of concerns: Organise

Contract: High-throughput data access layer

Concern: Presenting processable data

Organisation and processing

Diagram labels: S3, Hourly/nightly jobs, UCL (SQLite, CSV files, etc.), Hadoop (etc.), S3

Diagram regions: Amazon AWS walled garden; The internet and beyond
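An hourly or nightly job of the kind shown here could load each batch into SQLite so that re-running it, or receiving the same record twice, does no harm. This is a sketch under assumptions: the batch is taken to be CSV text that has already been fetched from S3, and the `observations` table and its `record_id`, `region`, `value` columns are hypothetical.

```python
import csv
import io
import sqlite3


def load_batch(conn: sqlite3.Connection, csv_text: str) -> int:
    """Insert the rows of one CSV batch; return how many rows were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        " record_id TEXT PRIMARY KEY, region TEXT NOT NULL, value REAL NOT NULL)"
    )
    before = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
    rows = [(r["record_id"], r["region"], float(r["value"]))
            for r in csv.DictReader(io.StringIO(csv_text))]
    with conn:  # commit on success, roll back on error
        # OR IGNORE skips rows whose record_id is already stored, so replayed
        # or duplicated records do not create duplicate rows.
        conn.executemany("INSERT OR IGNORE INTO observations VALUES (?, ?, ?)", rows)
    after = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
    return after - before
```

Keying on a record identifier makes the job idempotent, which fits a collector that promises at-least-once rather than exactly-once delivery.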

Thank you

Slides available at goo.gl/f9ta7f after the talk.
