MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data presented by Robert Stewart,...


Upload: mongodb

Post on 14-Dec-2014


DESCRIPTION

This talk covers the MongoDB deployment architecture used at Castlight Health to support very low latency spatial searches against our database of hundreds of millions of healthcare prices. The Geo haystack index in MongoDB and SSDs turned out to be the perfect solution for our problem. A strategy of replica set flipping also enables Castlight to swap in very large changes to the pricing data with no impact to the running application.

TRANSCRIPT

Page 1: MongoDB San Francisco 2013:Geo Searches for Healthcare Pricing Data presented by Robert Stewart, Castlight Health


Geo Searches for Health Care Pricing Data

Robert Stewart

Senior Architect, Castlight Health

[email protected]

@wombatnation


Page 2

Castlight Health

The Business and Technical Problems

Initial Solution

MongoDB, Geo Haystack Index and SSDs

Replica Set Flipping


Page 3

Castlight Health

Hosted web and mobile applications providing unbiased information on health care cost and quality

Customers are employers and health plans

Founded in 2008, raised $181 million in VC funding

#1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011

Hiring!

Page 4

Home Page

Page 5

Search Results

Page 6

Business Problem

Support searches for

Prices for a procedure performed by any in-network provider in a geographical area

Prices for all procedures performed by a single provider

Sub-second response, even if returning data on thousands of prices

Page 7

Technical Problems

Need a very fast geo index

Rate count doubled in last 3 months to 600 million

Major rate updates monthly

Difficult to index data to ensure sequential reads

Sometimes lots of random reads

Page 8

Pricing Retrieval Architecture

Page 9

Initial Solution

Store pricing data in MySQL

When Pricing Service starts, create two in-memory indexes and cache most of the rates

55 GB JVM Heap with lots of GC tuning

20-minute service startup time to build indexes

3 hours for background caching of most rates

Trouble Brewing:

Total rates growing quickly

Rolling restart becoming unacceptably slow

If rates not in Java or MySQL cache, retrieval was very slow

Page 10

Enter the Mongo


Page 11

Geo Indexes

Tried standard geo 2D indexes in MongoDB

Too slow for my use case

Geo Haystack index

Conceptually similar

From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”

Page 12

Mercator Projection with 10 degree grid
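The grid on this slide is a handy mental model for how a bucketed geo index groups nearby points. The following is a rough sketch of that idea only, not MongoDB's internal implementation: each point is assigned to a square cell by dividing its coordinates by the bucket size, and a search visits every cell that overlaps the bounding box of the search radius. The function names `bucketKey` and `bucketsToScan` are hypothetical.

```javascript
// Simplified illustration of grid bucketing. This is NOT MongoDB's internal code.
// bucketSize and maxDistance are in the same units as the coordinates (degrees here).
function bucketKey(lon, lat, bucketSize) {
  const col = Math.floor(lon / bucketSize);
  const row = Math.floor(lat / bucketSize);
  return col + ":" + row;
}

// Lists every cell that overlaps the bounding box around the search point.
function bucketsToScan(lon, lat, maxDistance, bucketSize) {
  const minCol = Math.floor((lon - maxDistance) / bucketSize);
  const maxCol = Math.floor((lon + maxDistance) / bucketSize);
  const minRow = Math.floor((lat - maxDistance) / bucketSize);
  const maxRow = Math.floor((lat + maxDistance) / bucketSize);
  const keys = [];
  for (let col = minCol; col <= maxCol; col++) {
    for (let row = minRow; row <= maxRow; row++) {
      keys.push(col + ":" + row);
    }
  }
  return keys;
}

// With a 10 degree grid, San Francisco falls in one coarse cell.
console.log(bucketKey(-122.4, 37.79, 10));
// With 0.5 degree cells and a 0.5 degree radius, only a handful of cells are scanned.
console.log(bucketsToScan(-122.4, 37.79, 0.5, 0.5).length);
```

When the bucket size is close to the typical search radius, a query touches only a few cells, which is one plausible reading of why a small, fixed search area suits this kind of index.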

Page 13

Geo Haystack

We chose degrees long-lat for x-y coordinate system

25 miles is our default search radius

Roughly 0.5 degrees in middle of the US

// Haystack index on the location plus one extra field (pm), with 0.5 degree buckets
db.priceables_1.ensureIndex(
  { loc: "geoHaystack", pm: 1 },
  { bucketSize: 0.5 })

// Search within 0.5 degrees of [longitude, latitude], filtered on pm
db.runCommand(
  { geoSearch: "priceables_1",
    near: [-122.4, 37.79],
    maxDistance: 0.5,
    search: { pm: 6757 },
    limit: 50000 })

maxDistance calculated using great circle algorithm
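The "roughly 0.5 degrees" figure can be checked with a little arithmetic. The sketch below assumes a spherical Earth with a mean radius of 3958.8 miles and uses 39°N as a stand-in for "middle of the US". Both values and all function names are assumptions for illustration, not taken from the talk.

```javascript
const EARTH_RADIUS_MILES = 3958.8; // assumption: mean radius, spherical Earth

// Length of one degree of latitude, in miles (constant on a sphere).
function milesPerDegreeLat() {
  return (Math.PI / 180) * EARTH_RADIUS_MILES;
}

// Length of one degree of longitude shrinks with the cosine of the latitude.
function milesPerDegreeLon(latDeg) {
  return milesPerDegreeLat() * Math.cos((latDeg * Math.PI) / 180);
}

function milesToDegreesLat(miles) {
  return miles / milesPerDegreeLat();
}

function milesToDegreesLon(miles, latDeg) {
  return miles / milesPerDegreeLon(latDeg);
}

const radiusMiles = 25; // default search radius from the slide
const assumedLat = 39;  // assumption: a latitude near the middle of the US
console.log("degrees of longitude:", milesToDegreesLon(radiusMiles, assumedLat).toFixed(2));
console.log("degrees of latitude:", milesToDegreesLat(radiusMiles).toFixed(2));
```

Under these assumptions, 25 miles comes to about 0.47 degrees of longitude and about 0.36 degrees of latitude, so a flat 0.5 degree radius is a slightly generous approximation that covers more ground north-south than east-west.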

Page 14

Geo Haystack Pros

Very fast when retrieving many documents in a relatively small search radius

Great when you also need to apply a secondary filter

Compound 2dsphere index in Mongo 2.4 has even better support

Page 15

Geo Haystack Cons

Supports only one extra filter in index (SERVER-2979)

Bug: an unindexed query on only the second part of the key fails with an assertion (SERVER-8645)

> db.priceables_1.find({pm: 6757})

error: { "$err" : "assertion src/mongo/db/geo/haystack.cpp:178" }

Second part of index can’t have an array value

Location part of key can’t be null
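The last two restrictions can be caught before a document ever reaches the database. A minimal sketch of such a pre-insert check, assuming documents store the location as a two-element `[longitude, latitude]` array in `loc` and the secondary filter in `pm`, as in the index definition shown earlier in the deck. The function name `haystackProblems` is hypothetical and is not part of any MongoDB API.

```javascript
// Hypothetical pre-insert check for documents going into a haystack-indexed collection.
// Returns a list of human-readable problems; an empty list means the document looks safe.
function haystackProblems(doc) {
  const problems = [];
  const loc = doc.loc;
  if (loc === null || loc === undefined) {
    // The location part of the key can't be null.
    problems.push("loc is missing or null");
  } else if (!Array.isArray(loc) || loc.length !== 2 || !loc.every(Number.isFinite)) {
    // Assumption: locations are stored as [longitude, latitude] number pairs.
    problems.push("loc must be a [longitude, latitude] pair of numbers");
  }
  if (Array.isArray(doc.pm)) {
    // The second part of the index can't have an array value.
    problems.push("pm must not be an array");
  }
  return problems;
}

console.log(haystackProblems({ loc: [-122.4, 37.79], pm: 6757 }));
console.log(haystackProblems({ loc: null, pm: [6757, 6758] }));
```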

Page 16

SSDs

For uncached data on HDD, Geo Haystack was twice as fast as custom Java geo index and MySQL

Still close to 1 minute for big queries with full data set

Death by random read

Tested with a $200 Samsung SSD

Typical query dropped to 20 ms

Big query only about 150 ms

Page 17

Mongoperf on SSDs

Random 4k block reads, 5 GB file, 16 threads

Env     SSD                          Read Ops/s   Read MB/s
Prod    Samsung 200GB SLC            74k          288
QA VM   Samsung 200GB SLC            30k          117
Dev     Samsung 830 256GB SATA MLC   47k          183

Sequential write of the 5 GB file

Env     SSD                          Write Ops/s  Write MB/s
Prod    Samsung 200GB SLC            1074         289
QA VM   Samsung 200GB SLC            405          196
Dev     Samsung 830 256GB SATA MLC   438          210
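The read figures in the Mongoperf results are consistent with each other: multiplying the read operations per second by the 4k block size gives roughly the reported throughput. A short check, assuming that "MB" means 2^20 bytes and that the operation counts were rounded to the nearest thousand (both are assumptions, as are the variable names):

```javascript
const BLOCK_BYTES = 4 * 1024;     // 4k blocks, per the slide
const BYTES_PER_MB = 1024 * 1024; // assumption: MB means 2^20 bytes

// Throughput implied by a given rate of 4k random reads.
function readMBPerSec(opsPerSec) {
  return (opsPerSec * BLOCK_BYTES) / BYTES_PER_MB;
}

// Read results as reported on the slide.
const reportedReads = [
  { env: "Prod", opsPerSec: 74000, mbPerSec: 288 },
  { env: "QA VM", opsPerSec: 30000, mbPerSec: 117 },
  { env: "Dev", opsPerSec: 47000, mbPerSec: 183 },
];

for (const row of reportedReads) {
  const derived = readMBPerSec(row.opsPerSec);
  console.log(`${row.env}: derived ${derived.toFixed(1)} MB/s, reported ${row.mbPerSec} MB/s`);
}
```

The derived values land within about 1 MB/s of the reported ones, which the rounding of the operation counts would explain.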

Page 18

Low Impact Pricing Updates

Requirements

Major price updates monthly

Minor updates more frequently

Huge bulk loads with no impact on active replica set

I/O bound, not CPU bound

Page 19

Replica Set Flipping Solution

Two replica sets

Lowered cost with two SSDs on each pricing server

scp compressed files from QA to passive replica set

Protip: use pigz to compress and uncompress

tar cvf - pricing | pigz > ~/pricing.tgz

pigz -dc pricing.tgz | tar xvf -

Page in index and data

db.runCommand({ touch: "priceables_1", index: true, data: true })

Pricing Service operation to atomically flip
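The slides do not show how the Pricing Service performs the flip. One way such an application-level switch could look is sketched below: the service keeps a client for each replica set plus a single pointer to the active one, and the flip is one assignment. The talk's Pricing Service runs on the JVM, so this JavaScript version only illustrates the idea. The class name `ReplicaSetSwitch` and the stand-in client objects are hypothetical.

```javascript
// Hypothetical application-level switch between two replica sets.
// The client objects are stand-ins for real database connections.
class ReplicaSetSwitch {
  constructor(clientsByName, initialActive) {
    if (!(initialActive in clientsByName)) {
      throw new Error("unknown replica set: " + initialActive);
    }
    this.clientsByName = clientsByName;
    this.activeName = initialActive;
  }

  // Each request should call this once and keep the returned client,
  // so a flip never changes the data source in the middle of a request.
  activeClient() {
    return this.clientsByName[this.activeName];
  }

  passiveName() {
    return Object.keys(this.clientsByName).find((name) => name !== this.activeName);
  }

  // The flip is a single assignment, so no request sees a half-switched state.
  flip() {
    const next = this.passiveName();
    if (next === undefined) {
      throw new Error("no passive replica set configured");
    }
    this.activeName = next;
    return next;
  }
}

const store = new ReplicaSetSwitch(
  { prodpricing1: { name: "prodpricing1" }, prodpricing2: { name: "prodpricing2" } },
  "prodpricing1"
);
const inFlight = store.activeClient(); // a request that started before the flip
store.flip();
console.log(inFlight.name, "->", store.activeClient().name);
```

In this sketch the passive set would be loaded and paged in first, and `flip()` would be called only once it is ready to serve.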

Page 20

Replica Set Architecture

Replica sets: prodpricing1, prodpricing2

Physical servers:

Server pricing1: mongod 28001 primary, mongod 28002 secondary

Server pricing2: mongod 28001 secondary, mongod 28002 primary

Server db1: mongod 28001 arbiter

Server db2: mongod 28002 arbiter
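The layout above can be written down as two replica set configuration documents in the general shape that `rs.initiate()` accepts. This is a sketch under stated assumptions: the slide does not say which replica set name goes with which port, so pairing prodpricing1 with 28001 and prodpricing2 with 28002 is a guess, and the `priority` values used to prefer one primary per server are illustrative. The helper name `replicaSetConfig` is hypothetical.

```javascript
// Builds a replica set configuration document: two data-bearing members and one arbiter.
// Assumption: a higher priority on one member makes it the preferred primary.
function replicaSetConfig(name, port, preferredPrimary, otherDataHost, arbiterHost) {
  return {
    _id: name,
    members: [
      { _id: 0, host: `${preferredPrimary}:${port}`, priority: 2 },
      { _id: 1, host: `${otherDataHost}:${port}`, priority: 1 },
      { _id: 2, host: `${arbiterHost}:${port}`, arbiterOnly: true },
    ],
  };
}

// Assumption: prodpricing1 uses port 28001 and prodpricing2 uses port 28002.
const prodpricing1 = replicaSetConfig("prodpricing1", 28001, "pricing1", "pricing2", "db1");
const prodpricing2 = replicaSetConfig("prodpricing2", 28002, "pricing2", "pricing1", "db2");

console.log(JSON.stringify(prodpricing1, null, 2));
console.log(JSON.stringify(prodpricing2, null, 2));
```

Each physical pricing server hosts the primary of one set and the secondary of the other, so both servers carry a full copy of both the active and the passive data.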

Page 21

Replica Set Flipping Drawbacks

Obviously, increased cost, but only for SSDs

Recently added caching of remote pricing lookups

TTL collections

Cache is lost during a flip

But we usually flip late at night

Cache eviction time is only a few hours

Page 22

Overall Results

Geo search speed with cold cache acceptable

Geo search speed with warm cache awesome

Pricing Service startup down to a few seconds

No production impact for major rate updates

Lowered risk for minor rate updates

Page 23

Summary

Geo Haystack Index great for …

Retrieving lots of documents in a constrained search area

Geo searches with a secondary filter

SSDs great for …

Random reads

Reducing need for lots of complex indexes

Replica set flipping great for …

Instant swap of large amounts of data

Primarily, if not solely, read only

Trading cost for operational flexibility

Page 24

Q & A
