big data @ bodensee barcamp 2010
Tuesday, June 8, 2010
BIG DATA
The rise of the data scientist
http://flowingdata.com/2009/06/04/rise-of-the-data-scientist/
Holidaycheck
Travel platform: review + book
12+ countries (.de ... .cn)
30% growth / year, profitable
Almost 1.5 million hotel reviews
1.6+ million pictures
Data @ HC
Internet-driven company
Traditional: MVC/3-Tier/RDBMS/caching
50+ Apache instances
15 GB operational data
12 GB logs / day
5 searches / second
My scientist friend: “That’s neat, but it’s not data science.”
The I/O Bottleneck
“The problem is simple: Memory, Disk size and CPU and even network performance continue to grow much faster than disk I/O performance.”
2004 to 2009
CPU: still following Moore's Law (transistor x2 every 18 months)
Memory Bandwidth (Intel): 9.3x
Disk Density (SATA): 8x
Disk I/O: 0.8x
Network speed: routers can easily saturate the fastest hard drives
http://blogs.cisco.com/datacenter/comments/networking_delivering_more_by_exceeding_the_law_of_moore/
I/O Repercussions
Turn to memcache
Try out SSD
Try out asynchronous writes (e.g. message queues)
Try to solve/hack the I/O problem: Sharding, in-memory DB
Our problems seem big, but are they really?
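As an illustration of the sharding idea on this slide, here is a minimal sketch of hash-based shard routing. The function name `shard_for`, the shard count, and the hotel IDs are hypothetical and not taken from the talk; a production setup would also need a plan for resharding.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    # hashlib yields the same digest in every process, unlike the
    # built-in hash(), which is randomized per interpreter run for str.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical example: spread hotel records over 4 database shards.
NUM_SHARDS = 4
shards = {i: [] for i in range(NUM_SHARDS)}
for hotel_id in ["hotel-1", "hotel-2", "hotel-3", "hotel-4", "hotel-5"]:
    shards[shard_for(hotel_id, NUM_SHARDS)].append(hotel_id)
```

Each shard then handles only its own slice of reads and writes, which spreads the disk I/O load across machines.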
So what is Big Data anyway?
“The term Big data from software engineering and computer science describes datasets that grow so large that they become awkward to work with using on-hand database management tools”
kilo to mega to giga to tera to peta to exa to zetta to yotta
NoSQL = Not Only SQL
Trade-offs, e.g. transactions, data loss
e.g. Document Stores (MongoDB)
e.g. Key-Value Stores (MemcacheDB)
e.g. Graph Databases (Neo4j)
Map/Reduce algorithm
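To make the Map/Reduce item concrete, here is a single-process sketch of the three phases (map, shuffle, reduce) applied to a word count. The function names and the sample input are hypothetical; a real framework would run the same phases across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single total."""
    return key, sum(values)

def map_reduce(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: two short review snippets.
counts = map_reduce(["hotel beach hotel", "beach pool"])
```

Because each map call sees only one line and each reduce call only one key, both phases can be parallelized without shared state.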
Medium Data
“With yesterday's scientific technology most businesses should be able to handle their data analysis needs.”
HC: 12 GB logfiles / day = medium data problem
(2006) Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
(2004) MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
Solved (?) with: RDBMS + NoSQL
3 sexy skills of data geeks
“The sexy job in the next ten years will be statisticians… The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it.” Hal Varian (Google)
http://dataspora.com/blog/sexy-data-geeks/
3 skills: statistics
sentiment analysis
natural language processing
machine learning
good old-fashioned regression
recommendation engines
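As a small example of the "good old-fashioned regression" item, here is a sketch of ordinary least squares for a straight line in plain Python. The function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```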
3 skills: visualization
The Oatmeal vs. Edward Tufte, Ben Fry
Q: Are you hiring statisticians, visualization experts & data plumbers?
3 skills: data plumbing
Glue languages: Python, Perl, regex, XSLT
Admin: setting up, maintaining clusters
Affinity with OSS & *nix
NoSQL = NoSchema = Transform Data
/^([\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+\.)*[\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(\:\d{1,5})?)$/i
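The regex above validates e-mail addresses. The same glue technique turns schemaless text into structured records, for example when parsing web server logs. The sketch below assumes the logs use the Apache Common Log Format; the pattern name, the function name, and the sample line are hypothetical.

```python
import re

# Assumes Apache Common Log Format:
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample line, not taken from real logs.
sample = '192.0.2.1 - - [08/Jun/2010:10:00:00 +0200] "GET /hotel/123 HTTP/1.1" 200 5120'
record = parse_line(sample)
```

Returning None for lines that do not match lets the caller count and inspect bad input instead of silently discarding it.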
More Data beats smart algorithms
spelling correction machine translation
face recognition
http://videos.syntience.com/ai-meetups/peternorvig.html
http://dataspora.com/blog/tipping-points-and-big-data/
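To illustrate how data can stand in for a clever algorithm in spelling correction, here is a simplified frequency-based corrector. It is a sketch and not code from the linked talk. It tries every string one edit away from the input and keeps the candidate seen most often in a corpus. The corpus here is a tiny made-up string, and in practice a larger corpus is what improves the corrections.

```python
import re
from collections import Counter

# Tiny made-up corpus; a larger one would give better word counts.
CORPUS = "the hotel was great the beach was great the pool was small"
WORD_COUNTS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return all strings one delete, swap, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit of `word`."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # Nothing known nearby; leave the input unchanged.
    return max(candidates, key=WORD_COUNTS.get)
```

The algorithm stays fixed while the corpus grows, so the quality of the corrections depends mainly on how much text the counts are built from.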
Ethics of data
Black Hat vs. White Hat <=> Black Data vs. White Data
White: Amazon free public datasets (e.g. human genome)
Black: Scientific climate data (or the lack of PUBLIC data)
Just like money, information flows to the least taxed location in a global world.
Take-Away & Discuss
“Don't throw away data if you don’t have to, because unlike material goods, data becomes more valuable the more of it is created. As a society, I don't think we understand this completely yet.”
q: Who is using a NoSQL DB? Share stories?
q: Do you hire statisticians?
q: Do you hire visualization experts?
q: Do you know how much data you are throwing away?
q: Share: how big is your data?
q: Do you own your customer data or does Facebook?
q: Do you own your content or does Google?
q: How are you exploiting asynchronicity?
q: Any tips on introducing NoSQL in companies?
q: Do you own your analytics data?
q: Should information be regulated (privacy)? Can it?