Sharing a Startup’s Big Data Lessons Experiences with non-RDBMS solutions at

Post on 21-Oct-2014

Page 1: Sharing a Startup’s Big Data Lessons

Sharing a Startup’s Big Data Lessons

Experiences with non-RDBMS solutions at

Page 2: Sharing a Startup’s Big Data Lessons

Who we are

• A search engine
• A people search engine
• An influencer search engine
• Subscription-based

Page 3: Sharing a Startup’s Big Data Lessons

George Stathis

VP Engineering

14+ years of experience building full-stack web software systems with a past focus on e-commerce and publishing. Currently responsible for building engineering capability to enable Traackr's growth goals.

Page 4: Sharing a Startup’s Big Data Lessons

What’s this talk about?

• Share what we know about Big Data/NoSQL: what’s behind the buzz words?

• Our reasons and method for picking a NoSQL database

• Share the lessons we learned going through the process

Page 5: Sharing a Startup’s Big Data Lessons

Big Data/NoSQL: behind the buzz words

Page 6: Sharing a Startup’s Big Data Lessons

What is Big Data?

• 3 Vs:
  – Volume
  – Velocity
  – Variety

Page 7: Sharing a Startup’s Big Data Lessons

What is Big Data? Volume + Velocity

• Data sets too large or coming in at too high a velocity to process using traditional databases or desktop tools. E.g.

big science, web logs, RFID, sensor networks, social networks, social data, internet text and documents, internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, military surveillance, medical records, photography archives, video archives, large-scale e-commerce

Page 8: Sharing a Startup’s Big Data Lessons

What is Big Data? Variety

• Big Data is varied and unstructured

Traditional static reports

Analytics, exploration & experimentation

Page 9: Sharing a Startup’s Big Data Lessons

What is Big Data?

• Scaling data processing cost-effectively

Page 10: Sharing a Startup’s Big Data Lessons

What is NoSQL?

• NoSQL ≠ No SQL
• NoSQL ≈ Not Only SQL
• NoSQL addresses RDBMS limitations, it’s not about the SQL language
• RDBMS = static schema
• NoSQL = schema flexibility; don’t have to know exact structure before storing

Page 11: Sharing a Startup’s Big Data Lessons

What is Distributed Computing?

• Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers

• Allows computations to be accomplished in acceptable timeframes

• Distributed computation approaches were developed to leverage multiple machines: MapReduce

• With MapReduce, the program goes to the data since the data is too big to move
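The map and reduce steps described above can be sketched in a few lines of Python. This is a single-process illustration only, not how Hadoop implements it: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    # On a real cluster each map call would run on the node that already
    # holds the document; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


word_counts = map_reduce(["big data big buzz", "data moves slowly"])
print(word_counts)
```

Because each map call only needs its own document, the work can be split across machines, which is the "program goes to the data" idea.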

Page 12: Sharing a Startup’s Big Data Lessons

What is MapReduce?

Source: developer.yahoo.com

Page 13: Sharing a Startup’s Big Data Lessons

What is MapReduce?

• MapReduce = batch processing = analytical
• MapReduce ≠ interactive
• Therefore many NoSQL solutions don’t outright replace warehouse solutions, they complement them
• RDBMS is still safe

Page 14: Sharing a Startup’s Big Data Lessons

What is Big Data? Velocity

• In some instances, being able to process large amounts of data in real time can yield a competitive advantage. E.g.
  – Online retailers leveraging buying history and click-through data for real-time recommendations
• No time to wait for MapReduce jobs to finish
• Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
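As a rough illustration of the pre-computing option, the sketch below keeps counts up to date as each event arrives, so a read never waits for a batch job. The class name `RunningCounts` and the click events are hypothetical.

```python
from collections import Counter


class RunningCounts:
    """Keeps per-key counts current as events arrive (no batch job needed)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, key):
        # Work is done at write time, one small increment per event.
        self._counts[key] += 1

    def top(self, n):
        # Reads are cheap because the aggregate already exists.
        return self._counts.most_common(n)


clicks = RunningCounts()
for product in ["book", "lamp", "book", "desk", "book", "lamp"]:
    clicks.record(product)

top_two = clicks.top(2)
print(top_two)
```

The trade-off is that only the aggregates chosen in advance are available instantly; new questions still need a batch pass over the raw data.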

Page 15: Sharing a Startup’s Big Data Lessons

What is Big Data? Data Science

• Emergence of Data Science
• Data Scientist ≈ Statistician
• Possess scientific discipline & expertise
• Formulate and test hypotheses
• Understand the math behind the algorithms so they can tweak when they don’t work
• Can distill the results into an easy to understand story
• Help businesses gain actionable insights

Page 16: Sharing a Startup’s Big Data Lessons

Big Data Landscape

Source: capgemini.com

Page 17: Sharing a Startup’s Big Data Lessons

Big Data Landscape

Source: capgemini.com

Traackr consumes these

Traackr uses these for storage

Traackr adds value in these verticals through our custom tools

Page 18: Sharing a Startup’s Big Data Lessons

Big Data Landscape

Source: capgemini.com

Traackr re-sells

Page 19: Sharing a Startup’s Big Data Lessons

So what’s Traackr and why did we need a NoSQL DB?

Page 20: Sharing a Startup’s Big Data Lessons

Traackr: context

• A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web that can help bring visibility to the new product? How does it find the opinion leaders, the people that matter?

Page 21: Sharing a Startup’s Big Data Lessons

Traackr: a people search engine

Up to 50 keywords per search!

Page 22: Sharing a Startup’s Big Data Lessons

Traackr: a people search engine

People as search results

Content aggregated by author

Proprietary 3-scale ranking

Page 23: Sharing a Startup’s Big Data Lessons

Traackr: 30,000 feet

Acquisition Processing Storage & Indexing Services Applications

Page 24: Sharing a Startup’s Big Data Lessons

NoSQL is usually associated with “Web Scale” (Volume & Velocity)

Page 25: Sharing a Startup’s Big Data Lessons

• In terms of users/traffic?

Do we fit the “Web scale” profile?

Page 26: Sharing a Startup’s Big Data Lessons

Source: compete.com

Page 27: Sharing a Startup’s Big Data Lessons

Source: compete.com

Page 28: Sharing a Startup’s Big Data Lessons

Source: compete.com

Page 29: Sharing a Startup’s Big Data Lessons

Source: compete.com

Page 30: Sharing a Startup’s Big Data Lessons
Page 31: Sharing a Startup’s Big Data Lessons

• In terms of users/traffic?

• In terms of the amount of data?

Do we fit the “Web scale” profile?

Page 32: Sharing a Startup’s Big Data Lessons

PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}

That’s a quarter of a terabyte …
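The "quarter of a terabyte" figure can be checked directly from the byte counts in the `db.stats()` output. The short sketch below only does unit conversion on the numbers shown on the slide; the helper name `to_gb` is an assumption for the example.

```python
# Byte counts copied from the db.stats() output on the slide.
stats = {
    "objects": 68226121,
    "avgObjSize": 2972.0800625760330,
    "dataSize": 202773493971,
    "storageSize": 221491429671,
    "indexSize": 27472394891,
    "fileSize": 266623699968,
}


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


sizes_gb = {
    name: round(to_gb(stats[name]), 1)
    for name in ("dataSize", "storageSize", "indexSize", "fileSize")
}
file_size_tb = stats["fileSize"] / 1e12

print(sizes_gb)
print(f"files on disk: {file_size_tb:.2f} TB")
```

The data itself is about 203 GB, and the files on disk come to roughly 0.27 TB, which matches the slide's rounding.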

Page 33: Sharing a Startup’s Big Data Lessons

Wait! What? My Synology NAS at home can hold 2TB!

Page 34: Sharing a Startup’s Big Data Lessons

No need for us to track the entire web

Web Content

Influencer Content

Not at scale :-)

Page 35: Sharing a Startup’s Big Data Lessons

• In terms of users/traffic?

• In terms of the amount of data?

Do we fit the “Web scale” profile?

Page 36: Sharing a Startup’s Big Data Lessons

Variety view of “Web Scale”

Web data is:

Heterogeneous

Unstructured (text)

Page 37: Sharing a Startup’s Big Data Lessons

Source: http://www.opte.org/

Visualization of the Internet, Nov. 23rd 2003

Page 38: Sharing a Startup’s Big Data Lessons

Data sources are isolated islands of rich data with loose links to one another

Page 39: Sharing a Startup’s Big Data Lessons

How do we build a database that models all possible entities found on the web?

Page 40: Sharing a Startup’s Big Data Lessons

Modeling the web: the RDBMS way

Page 41: Sharing a Startup’s Big Data Lessons

Source: socialbutterflyclt.com

Page 42: Sharing a Startup’s Big Data Lessons

or

Page 43: Sharing a Startup’s Big Data Lessons
Page 44: Sharing a Startup’s Big Data Lessons

{
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    {
      "siteUrl": "http://twitter.com/dchancogne",
      "metrics": [
        { "value": 216, "name": "twitter_followers_count" },
        { "value": 2107, "name": "twitter_statuses_count" }
      ]
    },
    {
      "siteUrl": "http://traackr.com/blog/author/david",
      "metrics": [
        { "value": 21, "name": "google_inbound_links" }
      ]
    }
  ]
}

Influencer data as JSON

Page 45: Sharing a Startup’s Big Data Lessons

NoSQL = schema flexibility
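A small sketch of what schema flexibility means in practice: two records with different shapes live side by side, and the reading code tolerates missing fields. The second record and the helper `metric` are hypothetical, added only for illustration.

```python
# Two documents in the same collection; the second one is a made-up record
# that has no "title" and a different set of metrics.
influencers = [
    {
        "realName": "David Chancogne",
        "title": "CTO",
        "siteReferences": [
            {
                "siteUrl": "http://twitter.com/dchancogne",
                "metrics": [{"name": "twitter_followers_count", "value": 216}],
            }
        ],
    },
    {
        "realName": "Example Blogger",
        "siteReferences": [
            {
                "siteUrl": "http://example.com/blog",
                "metrics": [{"name": "google_inbound_links", "value": 21}],
            }
        ],
    },
]


def metric(influencer, metric_name):
    """Return the first metric with this name, or None if the document lacks it."""
    for site in influencer.get("siteReferences", []):
        for entry in site.get("metrics", []):
            if entry["name"] == metric_name:
                return entry["value"]
    return None


followers = [metric(person, "twitter_followers_count") for person in influencers]
print(followers)
```

No schema migration was needed to store the second record; the cost is that every reader has to handle absent fields.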

Page 46: Sharing a Startup’s Big Data Lessons

• In terms of users/traffic?

• In terms of the amount of data?

Do we fit the “Web scale” profile?

Page 47: Sharing a Startup’s Big Data Lessons

• In terms of users/traffic?

• In terms of the amount of data?

• In terms of the variety of the data

Do we fit the “Web scale” profile?

Page 48: Sharing a Startup’s Big Data Lessons

Traackr’s Datastore Requirements

• Schema flexibility

• Good at storing lots of variable length text

• Batch processing options

Page 49: Sharing a Startup’s Big Data Lessons

Requirement: text storage

Variable text length, with big variance: from <140-character tweets to multi-page blog posts

Page 50: Sharing a Startup’s Big Data Lessons

Requirement: text storage

RDBMS’ answer to variable text length:

Plan ahead for largest value

CLOB/BLOB

Page 51: Sharing a Startup’s Big Data Lessons

Requirement: text storage

Issues with CLOB/BLOB for us:

No clue what largest value is

CLOB/BLOB for tweets = wasted space

Page 52: Sharing a Startup’s Big Data Lessons

Requirement: text storage

NoSQL solutions are great for text:

No length requirements (automated chunking)

Limited space overhead

Page 53: Sharing a Startup’s Big Data Lessons


Page 54: Sharing a Startup’s Big Data Lessons

Requirement: batch processing

Some NoSQL solutions come with MapReduce

Source: http://code.google.com/

Page 55: Sharing a Startup’s Big Data Lessons

Requirement: batch processing

MapReduce + RDBMS:

Possible but proprietary solutions

Usually involves exporting data from RDBMS into a NoSQL system anyway.

Defeats data locality benefit of MR

Page 56: Sharing a Startup’s Big Data Lessons

Traackr’s Datastore Requirements

• Schema flexibility

• Good at storing lots of variable length text

• Batch processing options

A NoSQL option is the right fit

Page 57: Sharing a Startup’s Big Data Lessons

How did we pick a NoSQL DB?

Page 58: Sharing a Startup’s Big Data Lessons

Bewildering number of options (early 2010)

Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent

Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families

Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema

Graph Databases
• Graph Theory G=(E,V)
• Great for modeling networks
• Great for graph-based query algorithms
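To make the four families concrete, here is the same small fact ("influencer 1234 writes for traackr.com") modeled four ways with plain Python structures. This is a conceptual sketch: the key formats and field names are assumptions, and no real database API is involved.

```python
# Key/value: opaque keys, one value per key; the application encodes structure in the key.
key_value = {
    "influencer:1234:name": "David Chancogne",
    "influencer:1234:site": "traackr.com",
}

# Column store: row id -> column family -> column -> value.
column_store = {
    "influencer_1234": {
        "attributes": {"name": "David Chancogne"},
        "sites": {"traackr.com": "writes for"},
    }
}

# Document store: one self-describing, nested document per entity.
document = {
    "_id": "1234",
    "name": "David Chancogne",
    "sites": [{"url": "traackr.com", "relation": "writes for"}],
}

# Graph: vertices plus labeled edges between them.
vertices = {"influencer_1234", "traackr.com"}
edges = [("influencer_1234", "writes for", "traackr.com")]

# Reading the site back from each model.
sites_found = {
    "key_value": key_value["influencer:1234:site"],
    "column": next(iter(column_store["influencer_1234"]["sites"])),
    "document": document["sites"][0]["url"],
    "graph": next(dst for src, label, dst in edges if src == "influencer_1234"),
}
print(sites_found)
```

All four can hold the data; they differ in which queries are natural, which is what the trimming that follows is about.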

Page 59: Sharing a Startup’s Big Data Lessons


Page 60: Sharing a Startup’s Big Data Lessons

Trimming options

Graph Databases: while we can model our domain as a graph, we don’t want to pigeonhole ourselves into this structure. We’d rather use these tools for specialized data analysis but not as the main data store.

Page 61: Sharing a Startup’s Big Data Lessons

Trimming options

Memcache: memory-based; we need true persistence

Page 62: Sharing a Startup’s Big Data Lessons

Trimming options

Amazon SimpleDB: not willing to store our data in a proprietary datastore.

Page 63: Sharing a Startup’s Big Data Lessons

Trimming options

Redis and LinkedIn’s Project Voldemort: no query filters, better used as queues or distributed caches

Page 64: Sharing a Startup’s Big Data Lessons

Trimming options

CouchDB: no ad-hoc queries; maturity in early 2010 made us shy away although we did try early prototypes.

Page 65: Sharing a Startup’s Big Data Lessons

Trimming options

Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (came later on).

Page 66: Sharing a Startup’s Big Data Lessons

Trimming options

MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.

Page 67: Sharing a Startup’s Big Data Lessons

Trimming options

Riak: very close but in early 2010, we had adoption questions.

Page 68: Sharing a Startup’s Big Data Lessons

Trimming options

HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib and support for batch processing using Hadoop/MR.

Page 69: Sharing a Startup’s Big Data Lessons

Lessons Learned

Challenges

- Complexity

- Missing Features

- Problem solution fit

- Resources

Rewards

- Choices

- Empowering

- Community

- Cost

Page 70: Sharing a Startup’s Big Data Lessons

Rewards: Choices

Key/Value Databases, Column Databases, Document Databases, Graph Databases

Page 71: Sharing a Startup’s Big Data Lessons

Rewards: Choices

Source: capgemini.com

Page 72: Sharing a Startup’s Big Data Lessons


Page 73: Sharing a Startup’s Big Data Lessons

When Big-Data = Big Architectures

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Must have a Hadoop HDFS cluster of at least 2x replication factor nodes

Must have an odd number of ZooKeeper quorum nodes

Then you can run your HBase nodes, but it’s recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.

Master/slave architecture means a single point of failure, so you need to protect your master.

And then we also have to manage the MapReduce processes and resources in the Hadoop layer.

Page 74: Sharing a Startup’s Big Data Lessons

Source: socialbutterflyclt.com

Page 75: Sharing a Startup’s Big Data Lessons

Jokes aside, no one said open source was easy to use

Page 76: Sharing a Startup’s Big Data Lessons

To be expected

• Hadoop/HBase are designed to move mountains

• If you want to move big stuff, be prepared to sometimes use big equipment

Page 77: Sharing a Startup’s Big Data Lessons

What it means to a startup

Development capacity before

Development capacity after

Congrats, you are now a sysadmin…

Page 78: Sharing a Startup’s Big Data Lessons


Page 79: Sharing a Startup’s Big Data Lessons

Mapping a saved search to a column store

Name

Ranks References to influencer records

Page 80: Sharing a Startup’s Big Data Lessons

Mapping a saved search to a column store

Unique key

“attributes” column family for general attributes

“influencerId” column family for influencer ranks and foreign keys

Page 81: Sharing a Startup’s Big Data Lessons

Mapping a saved search to a column store

“name” attribute

Influencer ranks can be attribute names as well

Page 82: Sharing a Startup’s Big Data Lessons

Mapping a saved search to a column store

Can get pretty long so needs indexing and pagination
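The mapping described on these slides can be sketched with nested dictionaries: a unique row key, an "attributes" family for general attributes, and an "influencerId" family whose column names are the ranks. The row key, the search name, the zero-padded rank format and the helper `page_of_influencers` are all assumptions for the example; this is not HBase client code.

```python
# One saved search as a column-store row. Ranks are used as column names
# and zero-padded so that string order matches numeric order.
saved_search_row = {
    "row_key": "search_0001",
    "attributes": {"name": "cloud computing bloggers"},
    "influencerId": {
        f"{rank:06d}": f"influencer_id_{1000 + rank}" for rank in range(1, 26)
    },
}


def page_of_influencers(row, page, page_size):
    """Return one page of (rank, influencer id) pairs, ordered by rank."""
    ranks = sorted(row["influencerId"])
    start = page * page_size
    return [(int(r), row["influencerId"][r]) for r in ranks[start:start + page_size]]


second_page = page_of_influencers(saved_search_row, page=1, page_size=10)
last_page = page_of_influencers(saved_search_row, page=2, page_size=10)
print(second_page[0], len(last_page))
```

With hundreds of ranked influencers per search, this slicing is exactly the logic that has to exist somewhere, either in the datastore or in application code.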

Page 83: Sharing a Startup’s Big Data Lessons

Problem: no out-of-the-box row-based indexing and pagination

Page 84: Sharing a Startup’s Big Data Lessons

Jumping right into the code

Page 85: Sharing a Startup’s Big Data Lessons


Page 86: Sharing a Startup’s Big Data Lessons

a few months later…

Page 87: Sharing a Startup’s Big Data Lessons

Need to upgrade to Hbase 0.90

• Making sure to remain on recent code base

• Performance improvements

• Mostly to get the latest bug fixes

No thanks!

Page 88: Sharing a Startup’s Big Data Lessons

Looks like something is missing

Page 89: Sharing a Startup’s Big Data Lessons
Page 90: Sharing a Startup’s Big Data Lessons

Our DB indexes depend on this!

Page 91: Sharing a Startup’s Big Data Lessons

Let’s get this straight

• HBase no longer comes with secondary indexing out-of-the-box

• It’s been moved out of the trunk to GitHub

• Where only one other company besides us seems to care about it

Page 92: Sharing a Startup’s Big Data Lessons

Only one other maintainer besides us

Page 93: Sharing a Startup’s Big Data Lessons

What it means to a startup

Development capacity

Congrats, you are now an HBase contrib maintainer…

Page 94: Sharing a Startup’s Big Data Lessons

Source: socialbutterflyclt.com

Page 95: Sharing a Startup’s Big Data Lessons


Page 96: Sharing a Startup’s Big Data Lessons

Homegrown Hbase Indexes

Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters

Row ids for Posts

Page 97: Sharing a Startup’s Big Data Lessons

Homegrown Hbase Indexes

Find posts for influencer_id_1234

Row ids for Posts

Page 98: Sharing a Startup’s Big Data Lessons

Homegrown Hbase Indexes

Find posts for influencer_id_5678

Row ids for Posts
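The prefix-scan idea can be illustrated without HBase: row ids are kept in sorted order, and a start row and stop row bound the range to read. The row-id format below is hypothetical (the slides do not show the real one), and `scan` and `posts_for` are names made up for this sketch.

```python
from bisect import bisect_left

# Sorted row ids, as a column store keeps them. The format
# "<influencer id>_post_<n>" is an assumption for illustration.
row_ids = sorted([
    "influencer_id_1234_post_0001",
    "influencer_id_1234_post_0002",
    "influencer_id_5678_post_0001",
    "influencer_id_5678_post_0002",
    "influencer_id_5678_post_0003",
    "influencer_id_9012_post_0001",
])


def scan(sorted_rows, start_row, stop_row):
    """Return rows with start_row <= row < stop_row using binary search."""
    lo = bisect_left(sorted_rows, start_row)
    hi = bisect_left(sorted_rows, stop_row)
    return sorted_rows[lo:hi]


def posts_for(sorted_rows, influencer_id):
    """Scan every row whose id starts with the influencer's prefix."""
    prefix = influencer_id + "_"
    # The stop row is the prefix with its last character bumped by one,
    # which is the smallest string greater than every key with that prefix.
    stop_row = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return scan(sorted_rows, prefix, stop_row)


posts_5678 = posts_for(row_ids, "influencer_id_5678")
print(posts_5678)
```

The lookup touches only the contiguous block of matching rows, which is why encoding the lookup field into the row id works as a poor man's index.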

Page 99: Sharing a Startup’s Big Data Lessons

Homegrown Hbase Indexes

• No longer depending on unmaintained code

• Work with out-of-the-box HBase installation

Page 100: Sharing a Startup’s Big Data Lessons

What it means to a startup

Development capacity

You are back but you still need to maintain indexing logic

Page 101: Sharing a Startup’s Big Data Lessons

a few months later…

Page 102: Sharing a Startup’s Big Data Lessons

Cracks in the data model

[Diagram: the site huffingtonpost.com appears twice. Edge labels, shown once for each of two authors: "writes for", "authored by", "published under".]

Posts shown for the first author:
http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html

Posts shown for the second author:
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html

Page 103: Sharing a Startup’s Big Data Lessons

Cracks in the data model

Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties

Page 104: Sharing a Startup’s Big Data Lessons

Cracks in the data model

Content attribution logic could sometimes mis-attribute posts because of the duplicated data.

Page 105: Sharing a Startup’s Big Data Lessons

Cracks in the data model

Exacerbated when we started tracking people’s content on a daily basis in mid-2011

Page 106: Sharing a Startup’s Big Data Lessons

Fixing the cracks in the data model

Normalize the sites

[Diagram: the same posts and edge labels ("writes for", "authored by", "published under"), with huffingtonpost.com now appearing only once.]

Page 107: Sharing a Startup’s Big Data Lessons

Fixing the cracks in the data model

• Normalization requires stronger secondary indexing

• Our application layer indexing would need revisiting…again!
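Why normalization leans on secondary indexing can be shown with a toy model: once a site is stored once and referenced by id, answering "who writes for this site?" needs a lookup from site id back to influencers. The ids and the helper `build_site_index` are hypothetical; this is what application-layer indexing amounts to, not a specific datastore feature.

```python
from collections import defaultdict

# Normalized model: one record per site, influencers refer to it by id.
sites = {"site_hp": {"url": "huffingtonpost.com"}}
influencers = {
    "inf_arianna": {"siteReferences": [{"siteId": "site_hp"}]},
    "inf_shaun": {"siteReferences": [{"siteId": "site_hp"}]},
}


def build_site_index(all_influencers):
    """Secondary index: site id -> ids of influencers who reference it."""
    index = defaultdict(set)
    for influencer_id, record in all_influencers.items():
        for reference in record["siteReferences"]:
            index[reference["siteId"]].add(influencer_id)
    return index


site_index = build_site_index(influencers)
print(sorted(site_index["site_hp"]))
```

Without such an index every reverse lookup is a full scan of the influencers, and keeping the index correct on every write is the maintenance burden the slide complains about.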

Page 108: Sharing a Startup’s Big Data Lessons

What it means to a startup

Development capacity

Psych! You are back to writing indexing code.

Page 109: Sharing a Startup’s Big Data Lessons

Source: socialbutterflyclt.com

Page 110: Sharing a Startup’s Big Data Lessons


Page 111: Sharing a Startup’s Big Data Lessons

Traackr’s Datastore Requirements (Revisited)

• Schema flexibility

• Good at storing lots of variable length text

• Out-of-the-box SECONDARY INDEX support!

• Simple to use and administer

Page 112: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)


Page 113: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)


Nope!

Page 114: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)

Graph Databases: we looked at Neo4j a bit closer but passed again for the same reasons as before.

Page 115: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)


Memcache: still no

Page 116: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)


Amazon SimpleDB: still no.

Page 117: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)

Redis and LinkedIn’s Project Voldemort: still no

Page 118: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)


CouchDB: more mature but still no ad-hoc queries.

Page 119: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)

Cassandra: matured quite a bit, added secondary indexes and batch processing options but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.

Page 120: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)


Riak: strong contender still but adoption questions remained.

Page 121: Sharing a Startup’s Big Data Lessons

NoSQL picking – Round 2 (mid 2011)

MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out of the box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.

Page 122: Sharing a Startup’s Big Data Lessons

Lessons Learned

Challenges

- Complexity

- Missing Features

- Problem solution fit

- Resources

Rewards

- Choices

- Empowering

- Community

- Cost

Page 123: Sharing a Startup’s Big Data Lessons

Immediate Benefits

• No more maintaining custom application-layer secondary indexing code

Page 124: Sharing a Startup’s Big Data Lessons

What it means to a startup

Development capacity

Yay! I’m back!

Page 125: Sharing a Startup’s Big Data Lessons

Immediate Benefits

• No more maintaining custom application-layer secondary indexing code

• Single binary installation greatly simplifies administration

Page 126: Sharing a Startup’s Big Data Lessons

What it means to a startup

Development capacity

Honestly, I thought I’d never see you guys again!

Page 127: Sharing a Startup’s Big Data Lessons

Immediate Benefits

• No more maintaining custom application-layer secondary indexing code

• Single binary installation greatly simplifies administration

• Our NoSQL could now support our domain model
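
To make the first bullet concrete, here is a minimal sketch in plain JavaScript of the kind of application-layer secondary index that has to be maintained by hand when a store only supports lookups by primary key. The `KeyValueStore` class, the `saveInfluencer` and `findBySite` helpers, and the sample ids are all hypothetical illustrations; they are not Traackr's code or any database's API.

```javascript
// Hypothetical in-memory stand-in for a store that only offers get/put by primary key.
class KeyValueStore {
  constructor() {
    this.rows = new Map();
  }
  get(key) {
    return this.rows.get(key);
  }
  put(key, value) {
    this.rows.set(key, value);
  }
}

const influencerTable = new KeyValueStore(); // primary data: influencer id -> document
const siteIndexTable = new KeyValueStore();  // hand-maintained index: siteId -> [influencer id]

// Every write has to update the index as well as the primary record.
function saveInfluencer(doc) {
  const previous = influencerTable.get(doc.id);
  if (previous) {
    // Remove stale index entries left over from the previous version of the document.
    for (const siteId of previous.siteIds) {
      const remaining = (siteIndexTable.get(siteId) || []).filter((id) => id !== doc.id);
      siteIndexTable.put(siteId, remaining);
    }
  }
  for (const siteId of doc.siteIds) {
    const ids = siteIndexTable.get(siteId) || [];
    if (!ids.includes(doc.id)) ids.push(doc.id);
    siteIndexTable.put(siteId, ids);
  }
  influencerTable.put(doc.id, doc);
}

// Reads go through the index first, then fetch each document by primary key.
function findBySite(siteId) {
  return (siteIndexTable.get(siteId) || []).map((id) => influencerTable.get(id));
}

saveInfluencer({ id: "inf-1", siteIds: ["site-a", "site-b"] });
saveInfluencer({ id: "inf-2", siteIds: ["site-b"] });
saveInfluencer({ id: "inf-1", siteIds: ["site-b"] }); // inf-1 no longer references site-a

console.log(findBySite("site-b").map((doc) => doc.id));
console.log(findBySite("site-a").map((doc) => doc.id));
```

Even in this toy form, every update path has to remember to clean up stale entries; a database with built-in secondary indexes takes over that bookkeeping.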

Page 128: Sharing a Startup’s Big Data Lessons

many-to-many relationship

Page 129: Sharing a Startup’s Big Data Lessons

Modeling an influencer

Embedded list of references to sites augmented with influencer-specific site attributes (e.g. percent contribution to content)

{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}

Page 130: Sharing a Startup’s Big Data Lessons

Modeling an influencer

siteId indexed for “find influencers connected to site X”

> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});

{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}
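
Conceptually, an index on a field inside an embedded array holds one entry per array element, so a single influencer document can be reached from each of its site ids. The plain-JavaScript sketch below imitates that idea in memory; it is an illustration only, not how MongoDB implements its indexes. The helper names `buildSiteIdIndex` and `findInfluencersBySite` and the second sample document are assumptions made for the example.

```javascript
// Sample documents shaped like the influencer above. The second document
// (its _id and contribution value) is made up for illustration.
const sampleDocs = [
  {
    _id: "770cf5c54492344ad5e45fb791ae5d52",
    siteReferences: [
      { siteId: "b31236da306270dc2b5db34e943af88d", contribution: 0.25 },
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 1.0 },
    ],
  },
  {
    _id: "hypothetical-second-influencer",
    siteReferences: [
      { siteId: "602dc370945d3b3480fff4f2a541227c", contribution: 0.5 },
    ],
  },
];

// Build siteId -> [document _id], adding one index entry per array element.
function buildSiteIdIndex(documents) {
  const index = new Map();
  for (const doc of documents) {
    for (const ref of doc.siteReferences || []) {
      if (!index.has(ref.siteId)) index.set(ref.siteId, []);
      index.get(ref.siteId).push(doc._id);
    }
  }
  return index;
}

// "Find influencers connected to site X": look up the ids, then return the documents.
function findInfluencersBySite(index, documents, siteId) {
  const ids = new Set(index.get(siteId) || []);
  return documents.filter((doc) => ids.has(doc._id));
}

const siteIdIndex = buildSiteIdIndex(sampleDocs);
const connected = findInfluencersBySite(
  siteIdIndex,
  sampleDocs,
  "602dc370945d3b3480fff4f2a541227c"
);
console.log(connected.map((doc) => doc._id));
```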

Page 131: Sharing a Startup’s Big Data Lessons

Other Benefits

• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write map reduce code to extract the data in a usable form, as was needed with HBase.

• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.

• Great documentation

• Great adoption and community
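
As an illustration of the first bullet, an ad hoc report such as "how many influencers reference each site, and with what total contribution" fits in a few lines of JavaScript. The sketch below runs over a plain array so that it is self-contained; the sample data and the `contributionReport` helper are hypothetical, and in the mongo shell a similar loop could be driven from a query cursor instead.

```javascript
// Hypothetical sample data shaped like the influencer documents shown earlier.
const sampleInfluencers = [
  {
    realName: "A",
    siteReferences: [
      { siteId: "s1", contribution: 0.25 },
      { siteId: "s2", contribution: 1.0 },
    ],
  },
  { realName: "B", siteReferences: [{ siteId: "s2", contribution: 0.5 }] },
  { realName: "C", siteReferences: [] },
];

// Ad hoc report: number of influencers and summed contribution per site.
function contributionReport(influencers) {
  const report = {};
  for (const influencer of influencers) {
    for (const ref of influencer.siteReferences || []) {
      if (!report[ref.siteId]) {
        report[ref.siteId] = { influencers: 0, totalContribution: 0 };
      }
      report[ref.siteId].influencers += 1;
      report[ref.siteId].totalContribution += ref.contribution;
    }
  }
  return report;
}

const siteReport = contributionReport(sampleInfluencers);
console.log(siteReport);
```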

Page 132: Sharing a Startup’s Big Data Lessons

Looks like we found the right fit!

Page 133: Sharing a Startup’s Big Data Lessons

We have more of this

Development capacity

Page 134: Sharing a Startup’s Big Data Lessons

And less of this

Source: socialbutterflyclt.com

Page 135: Sharing a Startup’s Big Data Lessons

Recap & Final Thoughts

• 3 Vs of Big Data:
– Volume
– Velocity
– Variety (Traackr)

• Big Data technologies are complementary to SQL and RDBMS

• Until machines can think for themselves, Data Science will be increasingly important

Page 136: Sharing a Startup’s Big Data Lessons

Recap & Final Thoughts

• Be prepared to deal with less mature tech

• Be as flexible as the data => fearless refactoring

• Importance of ease of use and administration cannot be overstated for a small startup

Page 137: Sharing a Startup’s Big Data Lessons

Q&A