Sharing a Startup’s Big Data Lessons
Experiences with non-RDBMS solutions at Traackr
Who we are
• A search engine
• A people search engine
• An influencer search engine
• Subscription-based
George Stathis
VP Engineering
14+ years of experience building full-stack web software systems, with a past focus on e-commerce and publishing. Currently responsible for building the engineering capability to enable Traackr's growth goals.
What’s this talk about?
• Share what we know about Big Data/NoSQL:
what’s behind the buzz words?
• Our reasons and method for picking a NoSQL
database
• Share the lessons we learned going through
the process
Big Data/NoSQL: behind the buzz words
What is Big Data?
• 3 Vs:
  – Volume
  – Velocity
  – Variety
What is Big Data? Volume + Velocity
• Data sets too large, or arriving at too high a velocity, to process using traditional databases or desktop tools. E.g.: big science, web logs, RFID, sensor networks, social networks, social data, internet text and documents, internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical data, military surveillance, medical records, photography archives, video archives, large-scale e-commerce
Traditional static reports
What is Big Data? Variety
• Big Data is varied and unstructured
Analytics, exploration & experimentation
What is Big Data?
• Scaling data processing cost effectively
What is NoSQL?
• NoSQL ≠ No SQL
• NoSQL ≈ Not Only SQL
• NoSQL addresses RDBMS limitations; it's not about the SQL language
• RDBMS = static schema
• NoSQL = schema flexibility; don't have to know the exact structure before storing
What is Distributed Computing?
• Sharing the workload: divide a problem into many tasks, each of which can be solved by one or more computers
• Allows computations to be accomplished in acceptable timeframes
• Distributed computation approaches were developed to leverage multiple machines: MapReduce
• With MapReduce, the program goes to the data since the data is too big to move
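The model can be sketched in a few lines. This is the canonical word-count illustration of MapReduce, not anything Traackr-specific: map emits key/value pairs, the framework groups them by key (the "shuffle"), and reduce aggregates each group.

```python
# Minimal sketch of the MapReduce model (hypothetical word-count example):
# map emits (key, value) pairs, the framework groups by key, reduce aggregates.
from collections import defaultdict

def map_phase(doc):
    # Emit (word, 1) for every word in the input split.
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Aggregate all values emitted for one key.
    return (word, sum(counts))

def map_reduce(docs):
    grouped = defaultdict(list)          # the "shuffle" step
    for doc in docs:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

counts = map_reduce(["big data big lessons", "big data"])
# counts == {"big": 3, "data": 2, "lessons": 1}
```

In a real Hadoop job the map and reduce functions run on the nodes that already hold the data splits, which is the "program goes to the data" point above.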
What is MapReduce?
Source: developer.yahoo.com
What is MapReduce?
• MapReduce = batch processing = analytical
• MapReduce ≠ interactive
• Therefore many NoSQL solutions don't outright replace warehouse solutions; they complement them
• RDBMS is still safe
What is Big Data? Velocity
• In some instances, being able to process large amounts of data in real time can yield a competitive advantage. E.g.:
  – Online retailers leveraging buying history and click-through data for real-time recommendations
• No time to wait for MapReduce jobs to finish
• Solutions: stream processing (e.g. Twitter Storm), pre-computing (e.g. aggregate and count analytics as data arrives), quick-to-read key/value stores (e.g. distributed hashes)
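The pre-computing idea can be sketched as follows; the event shape and names here are hypothetical, but the point is to update aggregates at write time so reads need no batch job at all.

```python
# Sketch of "pre-compute as data arrives": instead of running a batch job
# over raw click events later, keep running aggregates in a key/value
# structure so reads are instant. Event fields are hypothetical.
from collections import Counter

class ClickCounter:
    def __init__(self):
        self.clicks_by_product = Counter()

    def ingest(self, event):
        # Aggregate at write time; no MapReduce job needed later.
        self.clicks_by_product[event["product_id"]] += 1

    def top(self, n):
        # Instant read: the answer is already computed.
        return self.clicks_by_product.most_common(n)

stream = ClickCounter()
for e in [{"product_id": "A"}, {"product_id": "B"}, {"product_id": "A"}]:
    stream.ingest(e)
# stream.top(1) == [("A", 2)]
```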
What is Big Data? Data Science
• Emergence of Data Science
• Data Scientist ≈ Statistician
• Possess scientific discipline & expertise
• Formulate and test hypotheses
• Understand the math behind the algorithms so they can tweak when they don't work
• Can distill the results into an easy-to-understand story
• Help businesses gain actionable insights
Big Data Landscape
Source: capgemini.com
Big Data Landscape
Source: capgemini.com
Traackr consumes
these
Traackr uses these for storage
Traackr adds value in these verticals through our
custom tools
Big Data Landscape
Source: capgemini.com
Traackr re-sells
So what’s Traackr and why did we need a NoSQL DB?
Traackr: context
• A cloud computing company is about to launch a new platform; how does it find the most influential IT bloggers on the web who can help bring visibility to the new product? How does it find the opinion leaders, the people who matter?
Traackr: a people search engine
Up to 50 keywords per search!
Traackr: a people search engine
People as search results
Content aggregated by author
Proprietary 3-scale ranking
Traackr: 30,000 feet
Acquisition Processing Storage & Indexing Services Applications
NoSQL is usually associated with“Web Scale” (Volume & Velocity)
• In terms of users/traffic?
Do we fit the “Web scale” profile?
Source: compete.com
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
PRIMARY> use traackr
switched to db traackr
PRIMARY> db.stats()
{
    "db" : "traackr",
    "collections" : 12,
    "objects" : 68226121,
    "avgObjSize" : 2972.0800625760330,
    "dataSize" : 202773493971,
    "storageSize" : 221491429671,
    "numExtents" : 199,
    "indexes" : 33,
    "indexSize" : 27472394891,
    "fileSize" : 266623699968,
    "nsSizeMB" : 16,
    "ok" : 1
}
That’s a quarter of a terabyte …
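As a quick sanity check, the db.stats() figures above are internally consistent: avgObjSize is dataSize divided by objects, and fileSize really is about a quarter of a terabyte.

```python
# Cross-checking the db.stats() output shown above.
data_size = 202_773_493_971      # "dataSize" in bytes
objects = 68_226_121             # "objects"
file_size = 266_623_699_968      # "fileSize" in bytes

avg_obj_size = data_size / objects   # ≈ 2972.08, matching the reported "avgObjSize"
file_size_tib = file_size / 2**40    # ≈ 0.24 TiB: "a quarter of a terabyte"
```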
Wait! What? My Synology NAS at home can hold 2TB!
No need for us to track the entire web
Web Content
Influencer Content
Not at scale :-)
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
Variety view of “Web Scale”
Web data is:
Heterogeneous
Unstructured (text)
Source: http://www.opte.org/
Visualization of the Internet, Nov. 23rd 2003
Data sources are isolated islands of rich data with loose links to one another
How do we build a database that models all possible entities found on the web?
Modeling the web: the RDBMS way
Source: socialbutterflyclt.com
or
{
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    {
      "siteUrl": "http://twitter.com/dchancogne",
      "metrics": [
        { "value": 216, "name": "twitter_followers_count" },
        { "value": 2107, "name": "twitter_statuses_count" }
      ]
    },
    {
      "siteUrl": "http://traackr.com/blog/author/david",
      "metrics": [
        { "value": 21, "name": "google_inbound_links" }
      ]
    }
  ]
}
Influencer data as JSON
NoSQL = schema flexibility
• In terms of users/traffic?
• In terms of the amount of data?
Do we fit the “Web scale” profile?
• In terms of users/traffic?
• In terms of the amount of data?
• In terms of the variety of the data
Do we fit the “Web scale” profile?
✓
Traackr’s Datastore Requirements
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options
✓
Requirement: text storage
Variable text length: big variance, from 140-character tweets to multi-page blog posts
Requirement: text storage
RDBMS’ answer to variable text length:
Plan ahead for largest value
CLOB/BLOB
Requirement: text storage
Issues with CLOB/BLOB for us:
No clue what largest value is
CLOB/BLOB for tweets = wasted space
Requirement: text storage
NoSQL solutions are great for text:
No length requirements (automated chunking)
Limited space overhead
Traackr’s Datastore Requirements
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options
✓
✓
Requirement: batch processing
Some NoSQL solutions come with MapReduce
Source: http://code.google.com/
Requirement: batch processing
MapReduce + RDBMS:
Possible but proprietary solutions
Usually involves exporting data from the RDBMS into a NoSQL system anyway, which defeats the data-locality benefit of MR
Traackr’s Datastore Requirements
• Schema flexibility
• Good at storing lots of variable length text
• Batch processing options
✓
✓
A NoSQL option is the right fit
✓
How did we pick a NoSQL DB?
Bewildering number of options (early 2010)
Key/Value Databases
• Distributed hashtables
• Designed for high load
• In-memory or on-disk
• Eventually consistent

Column Databases
• Spreadsheet-like
• Key is a row id
• Attributes are columns
• Columns can be grouped into families

Document Databases
• Like Key/Value
• Value = Document
• Document = JSON/BSON
• JSON = Flexible Schema

Graph Databases
• Graph Theory G=(E,V)
• Great for modeling networks
• Great for graph-based query algorithms
Trimming options
Graph Databases: while we can model our domain as a graph, we don't want to pigeonhole ourselves into this structure. We'd rather use these tools for specialized data analysis, but not as the main data store.
Trimming options
Memcache: memory-based; we need true persistence
Trimming options
Amazon SimpleDB: not willing to store our data in a proprietary datastore.
Trimming options
Redis and LinkedIn's Project Voldemort: no query filters; better used as queues or distributed caches
Trimming options
CouchDB: no ad-hoc queries; its maturity in early 2010 made us shy away, although we did try early prototypes.
Trimming options
Cassandra: in early 2010, maturity questions, no secondary indexes and no batch processing options (those came later).
Trimming options
MongoDB: in early 2010, maturity questions, adoption questions and no batch processing options.
Trimming options
Riak: very close, but in early 2010 we had adoption questions.
Trimming options
HBase: came across as the most mature at the time, with several deployments, a healthy community, "out-of-the-box" secondary indexes through a contrib, and support for batch processing using Hadoop/MR.
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
Rewards: Choices
Rewards: Choices
Source: capgemini.com
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
When Big-Data = Big Architectures
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
Must have a Hadoop HDFS cluster of at least 2x replication factor nodes.
Must have an odd number of ZooKeeper quorum nodes.
Then you can run your HBase nodes, but it's recommended to co-locate regionservers with Hadoop datanodes, so you have to manage resources.
Master/slave architecture means a single point of failure, so you need to protect your master.
And then we also have to manage the MapReduce processes and resources in the Hadoop layer.
Source: socialbutterflyclt.com
Jokes aside, no one said open source was easy to use
To be expected
• Hadoop/HBase are designed to move mountains
• If you want to move big stuff, be prepared to sometimes use big equipment
What it means to a startup
Development capacity before
Development capacity after
Congrats, you are now a sysadmin…
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
Mapping a saved search to a column store
Name
Ranks References to influencer records
Unique key
“attributes” column family
for general attributes
“influencerId” column family for influencer ranks and foreign keys
Mapping a saved search to a column store
Mapping a saved search to a column store
“name” attribute
Influencer ranks can be attribute names as well
Mapping a saved search to a column store
Can get pretty long, so it needs indexing and pagination
Problem: no out-of-the-box row-based indexing and pagination
Jumping right into the code
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
a few months later…
Need to upgrade to HBase 0.90
• Making sure to remain on recent code base
• Performance improvements
• Mostly to get the latest bug fixes
No thanks!
Looks like something is missing
Our DB indexes depend on this!
Let’s get this straight
• HBase no longer comes with secondary indexing out-of-the-box
• It's been moved out of the trunk to GitHub
• Where only one other company besides us seems to care about it
Only one other maintainer besides us
What it means to a startup
Development capacity
Congrats, you are now an HBase contrib maintainer…
Source: socialbutterflyclt.com
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
Homegrown Hbase Indexes
Rows have id prefixes that can be efficiently scanned using STARTROW and STOPROW filters
Row ids for Posts
Homegrown Hbase Indexes
Find posts for influencer_id_1234
Row ids for Posts
Homegrown Hbase Indexes
Find posts for influencer_id_5678
Row ids for Posts
Homegrown Hbase Indexes
• No longer depending on unmaintained code
• Works with an out-of-the-box HBase installation
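The scheme can be sketched with a sorted list standing in for an HBase table (the ids here are hypothetical): because HBase stores rows sorted by key, prefixing post row ids with the influencer id turns "find posts for influencer X" into a bounded range scan between STARTROW and STOPROW.

```python
# Sketch of the homegrown index: a sorted list stands in for the HBase table.
import bisect

rows = sorted([
    "influencer_id_1234/post_a",
    "influencer_id_1234/post_b",
    "influencer_id_5678/post_c",
])

def prefix_scan(rows, prefix):
    # STARTROW = prefix; STOPROW = prefix with its last byte incremented,
    # the usual trick for bounding a prefix scan over sorted row keys.
    start = bisect.bisect_left(rows, prefix)
    stop_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    stop = bisect.bisect_left(rows, stop_key)
    return rows[start:stop]

# prefix_scan(rows, "influencer_id_1234/") returns only that influencer's posts
```

The scan touches only the contiguous key range for one influencer, which is why no separate secondary-index structure is needed.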
What it means to a startup
Development capacity
You are back, but you still need to maintain indexing logic
a few months later…
Cracks in the data model

[Diagram: two duplicate “huffingtonpost.com” site records; each author “writes for” his or her own copy, and each post is “authored by” its author and “published under” that copy]

http://www.huffingtonpost.com/arianna-huffington/post_1.html
http://www.huffingtonpost.com/arianna-huffington/post_2.html
http://www.huffingtonpost.com/arianna-huffington/post_3.html
http://www.huffingtonpost.com/shaun-donovan/post1.html
http://www.huffingtonpost.com/shaun-donovan/post2.html
http://www.huffingtonpost.com/shaun-donovan/post3.html
Cracks in the data model
Denormalized/duplicated for fast runtime access and storage of influencer-to-site relationship properties
Cracks in the data model
Content attribution logic could sometimes mis-attribute posts because of the duplicated data.
Cracks in the data model
Exacerbated when we started tracking people's content on a daily basis in mid-2011
Fixing the cracks in the data model
Normalize the sites
Fixing the cracks in the data model
• Normalization requires stronger secondary indexing
• Our application-layer indexing would need revisiting… again!
What it means to a startup
Development capacity
Psych! You are back to writing indexing code.
Source: socialbutterflyclt.com
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
Traackr’s Datastore Requirements (Revisited)
• Schema flexibility
• Good at storing lots of variable length text
• Out-of-the-box SECONDARY INDEX support!
• Simple to use and administer
NoSQL picking – Round 2 (mid 2011)
NoSQL picking – Round 2 (mid 2011)
Nope!
NoSQL picking – Round 2 (mid 2011)
Graph Databases: we looked at Neo4J a bit closer but passed again for the same reasons as before.
NoSQL picking – Round 2 (mid 2011)
Memcache: still no
NoSQL picking – Round 2 (mid 2011)
Amazon SimpleDB: still no.
NoSQL picking – Round 2 (mid 2011)
Redis and LinkedIn's Project Voldemort: still no
NoSQL picking – Round 2 (mid 2011)
CouchDB: more mature but still no ad-hoc queries.
NoSQL picking – Round 2 (mid 2011)
Cassandra: matured quite a bit, added secondary indexes and batch processing options, but more restrictive in its use than other solutions. After the HBase lesson, simplicity of use was now more important.
NoSQL picking – Round 2 (mid 2011)
Riak: still a strong contender, but adoption questions remained.
NoSQL picking – Round 2 (mid 2011)
MongoDB: matured by leaps and bounds, increased adoption, support from 10gen, advanced indexing out-of-the-box as well as some batch processing options, a breeze to use, well documented, and fit into our existing code base very nicely.
Lessons Learned
Challenges
- Complexity
- Missing Features
- Problem solution fit
- Resources
Rewards
- Choices
- Empowering
- Community
- Cost
Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
What it means to a startup
Development capacity
Yay! I’m back!
Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
• Single binary installation greatly simplifies administration
What it means to a startup
Development capacity
Honestly, I thought I'd never see you guys again!
Immediate Benefits
• No more maintaining custom application-layer secondary indexing code
• Single binary installation greatly simplifies administration
• Our NoSQL could now support our domain model
many-to-many relationship
Modeling an influencer
Embedded list of references to sites, augmented with influencer-specific site attributes (e.g. percent contribution to content)
{
  "_id": "770cf5c54492344ad5e45fb791ae5d52",
  "realName": "David Chancogne",
  "title": "CTO",
  "description": "Web. Geek.\r\nTraackr: http://traackr.com\r\nPropz: http://propz.me",
  "primaryAffiliation": "Traackr",
  "email": "[email protected]",
  "location": "Cambridge, MA, United States",
  "siteReferences": [
    { "siteId": "b31236da306270dc2b5db34e943af88d", "contribution": 0.25 },
    { "siteId": "602dc370945d3b3480fff4f2a541227c", "contribution": 1.0 }
  ]
}
Modeling an influencer
siteId indexed for “find influencers connected to site X”
> db.influencers.ensureIndex({"siteReferences.siteId": 1});
> db.influencers.find({"siteReferences.siteId": "602dc370945d3b3480fff4f2a541227c"});
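What the multikey index on "siteReferences.siteId" effectively maintains can be sketched as an inverted map from siteId to influencer documents, so "find influencers connected to site X" becomes a lookup instead of a collection scan. The ids below are abbreviated from the slide's document for readability.

```python
# Sketch of a multikey secondary index: every embedded siteId maps back to
# the influencer documents that reference it. Ids abbreviated from the slide.
from collections import defaultdict

influencers = [
    {"_id": "770cf5c5", "realName": "David Chancogne",
     "siteReferences": [{"siteId": "b31236da", "contribution": 0.25},
                        {"siteId": "602dc370", "contribution": 1.0}]},
]

index = defaultdict(list)            # siteId -> list of influencer _ids
for doc in influencers:
    for ref in doc["siteReferences"]:
        index[ref["siteId"]].append(doc["_id"])

# index["602dc370"] == ["770cf5c5"]
```

MongoDB builds and maintains this structure itself once the index is declared, which is exactly the application-layer code we no longer had to write.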
Other Benefits
• Ad hoc queries and reports became easier to write with JavaScript: no need for a Java developer to write MapReduce code to extract the data in a usable form, as was needed with HBase.
• Simpler backups: HBase mostly relied on HDFS redundancy; intra-cluster replication is available but experimental and a lot more involved to set up.
• Great documentation
• Great adoption and community
looks like we found the right fit!
We have more of this
Development capacity
And less of this
Source: socialbutterflyclt.com
Recap & Final Thoughts
• 3 Vs of Big Data:
  – Volume
  – Velocity
  – Variety ✓ (Traackr's fit)
• Big Data technologies are complementary to SQL and RDBMS
• Until machines can think for themselves, Data Science will be increasingly important
Recap & Final Thoughts
• Be prepared to deal with less mature tech
• Be as flexible as the data => fearless refactoring
The importance of ease of use and administration cannot be overstated for a small startup
Q&A