Big Data vs Data Warehousing
![Page 1: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/1.jpg)
Thomas Kejser
http://blog.kejser.org
@thomaskejser
Bigdata vs. Data Warehousing
Synergy or Conflict?
![Page 2: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/2.jpg)
Who is this Guy?
• Formerly: Lead SQLCAT EMEA
• Now: CTO FusionIo EMEA
• 15 years of database experience
• Performance Tuner
![Page 3: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/3.jpg)
[Chart: world population in billions of humans (y-axis, 5 to 10) by year (x-axis, 2000 to 2250)]
Source: United Nations Projections
Human Consciousness Doesn’t Scale
![Page 4: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/4.jpg)
Text Messages in a Table
CREATE TABLE AllTexts (
  Sender BIGINT              -- 8B
, Receiver BIGINT            -- 8B
, SenderLocation BIGINT      -- 8B
, ReceiverLocation BIGINT    -- 8B
, Time DATETIME              -- 8B
, SMS VARCHAR(140)           -- 140B
)
= 180 bytes per row
![Page 5: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/5.jpg)
How much do we text?
• World Average
  • 6.1 Trillion Text Messages / year
  • About 80% cell phone coverage
  • 7 billion people
  • 3 messages/day/person
• But:
  • Teenagers: 50 messages/day
Source: Pew Internet Research 2010 & ITU
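The per-person figure can be cross-checked from the other numbers on this slide. The following is a back-of-the-envelope sketch in Python; the variable names are illustrative, and it assumes the 80% coverage applies to the full 7 billion population.

```python
# Back-of-the-envelope check of the "3 messages/day/person" figure.
# All inputs come from the slide; the variable names are illustrative.
messages_per_year = 6.1e12   # 6.1 trillion text messages per year
world_population = 7e9       # 7 billion people
coverage = 0.80              # about 80% cell phone coverage

phone_users = world_population * coverage
messages_per_user_per_day = messages_per_year / phone_users / 365

print(f"{phone_users / 1e9:.1f} billion phone users")
print(f"{messages_per_user_per_day:.1f} messages/day/person")
```

The result lands very close to three messages per phone user per day, which matches the slide.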
![Page 6: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/6.jpg)
How much will we EVER text?
• 9B people acting like teenagers (in 2050)
  • 50 texts/day
• That’s 450 billion texts/day
  • 164 Trillion texts/year (more than 20x today)
  • 180 bytes each
  • Assume x3 compression
• Approximation: 10 Petabytes/year in 2050
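The chain of multiplications behind the 10 Petabyte estimate can be written out explicitly. This is a rough Python sketch using only the slide's inputs; it assumes decimal petabytes (10^15 bytes) and a 365-day year, neither of which the slide states.

```python
# Rough storage estimate for the "everyone texts like a teenager" scenario.
people = 9e9                    # 9 billion people in 2050
texts_per_person_per_day = 50   # teenager rate
bytes_per_text = 180            # row size of the AllTexts table
compression_ratio = 3           # the slide's assumed x3 compression

PB = 1e15  # ASSUMPTION: decimal petabyte

texts_per_day = people * texts_per_person_per_day
texts_per_year = texts_per_day * 365
raw_bytes_per_year = texts_per_year * bytes_per_text
stored_bytes_per_year = raw_bytes_per_year / compression_ratio

print(f"{texts_per_day / 1e9:.0f} billion texts/day")
print(f"{texts_per_year / 1e12:.0f} trillion texts/year")
print(f"{stored_bytes_per_year / PB:.1f} PB/year after compression")
```

Before compression the raw volume is roughly three times larger, so the compression assumption carries much of the estimate.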
![Page 7: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/7.jpg)
Can it be done?
Moore’s Hard Drives
[Chart: hard drive capacity in GB (log scale) by year]
![Page 8: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/8.jpg)
How Large is this/year?
Hard Disk (4TB) : 2.5”
About 1500 Wine Bottles
Wine Bottle (75cl): 4.0”
![Page 9: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/9.jpg)
In the Data Center
• Calculating:
  • 2U Storage = 24 Disks (includes compute)
  • 4TB per Disk
  • 100TB in 2U (a bit less)
  • 10PB = 200U storage
• About six racks
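The rack calculation can be reproduced in a few lines. This is a sketch under stated assumptions: the 42U rack height is not on the slide and is assumed here, and the sketch uses the exact 96 TB per enclosure rather than the slide's rounded 100 TB.

```python
import math

# Inputs from the slide.
target_tb = 10_000      # 10 PB per year, expressed in decimal TB
disk_tb = 4             # 4 TB per disk
disks_per_2u = 24       # one 2U enclosure holds 24 disks

# ASSUMPTION: a 42U rack; the slide does not state the rack height.
rack_units = 42

disks_needed = math.ceil(target_tb / disk_tb)
tb_per_2u = disks_per_2u * disk_tb              # "a bit less" than 100 TB
enclosures = math.ceil(target_tb / tb_per_2u)
storage_u = enclosures * 2
racks_storage_only = math.ceil(storage_u / rack_units)

print(disks_needed, "disks")
print(tb_per_2u, "TB per 2U enclosure")
print(storage_u, "U of storage")
print(racks_storage_only, "racks (storage only, 42U assumed)")
```

With the exact per-enclosure capacity the total comes out slightly above the slide's rounded 200U. Under the 42U assumption the storage alone fills fewer racks than the slide's "about six"; the source does not say what the remaining space is reserved for.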
![Page 10: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/10.jpg)
Warehouses Serve us Well..
![Page 11: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/11.jpg)
… And it is Becoming a Commodity
• Good Management Interfaces
• Standard SQL
  • with a few extensions
• Appliances
  • Support system
  • Homogeneous HW
• In chunks
![Page 12: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/12.jpg)
vs.
![Page 13: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/13.jpg)
PDW vs. Hive – Scan/seek
Query 1: SELECT count(*) FROM lineitem
Query 2: SELECT max(l_quantity) FROM lineitem WHERE l_orderkey > 1000 AND l_orderkey < 100000 GROUP BY l_linestatus
[Chart: run time in seconds (y-axis, 0 to 1400) for Query 1 and Query 2, series Hive and PDW]
![Page 14: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/14.jpg)
PDW vs. Hive - Joins
SELECT max(l_orderkey) FROM orders JOIN lineitem ON l_orderkey = o_orderkey
PDW-U:
• orders partitioned on o_custkey
• lineitem partitioned on l_partkey
PDW-P:
• orders partitioned on o_orderkey
• lineitem partitioned on l_orderkey
[Chart: run time in seconds (y-axis, 0 to 4000) for Hive, PDW-U and PDW-P]
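One plausible reading of the PDW-U versus PDW-P setups is co-location: when both tables are partitioned on the join key, matching rows already sit on the same node and nothing has to move. The following is a toy Python sketch of that idea, not a model of PDW or Hive. The node count, the table sizes, the modulo placement, and the `node_for` and `rows_to_move` helpers are all invented for illustration.

```python
import random

NODES = 8  # hypothetical cluster size

def node_for(key: int) -> int:
    """Toy partitioning function: place a row on a node by key modulo."""
    return key % NODES

random.seed(42)
# Hypothetical miniature tables, as (orderkey, other_column) tuples:
# orders = (o_orderkey, o_custkey), lineitem = (l_orderkey, l_partkey).
orders = [(orderkey, random.randrange(1000)) for orderkey in range(10_000)]
lineitem = [(random.randrange(10_000), random.randrange(2000)) for _ in range(40_000)]

def rows_to_move(lineitem_col: int, orders_col: int) -> int:
    """Count lineitem rows stored on a different node than the order they join to."""
    order_home = {o[0]: node_for(o[orders_col]) for o in orders}
    moved = 0
    for row in lineitem:
        if node_for(row[lineitem_col]) != order_home[row[0]]:
            moved += 1
    return moved

co_located = rows_to_move(0, 0)       # both partitioned on the order key
not_co_located = rows_to_move(1, 1)   # partitioned on non-join columns

print("rows to move, partitioned on join key:", co_located)
print("rows to move, partitioned elsewhere:  ", not_co_located)
```

In the co-located layout no row needs to travel, while in the other layout most rows land on the wrong node and would have to be shipped across the network before the join can run.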
![Page 15: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/15.jpg)
What does Big Data need to Catch up?
• Thread startup times
• Co-location awareness
• Files vs. optimized DB memory structures
• Column stores and other DB tech
Generic is good…
… but when there is structure, make use of it!
![Page 16: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/16.jpg)
• What is Bigdata?
Very Unstructured Data
![Page 17: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/17.jpg)
How many Pictures of Cats?
• Flickr Today:
  • 300MB/month
  • 2GB/year
  • 51M users (too small?)
• Estimate: 102 PB / year
• 10 x text messages
Source: Wikipedia
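The photo estimate is a single multiplication, shown here as a small Python sketch. It assumes decimal units (1 PB = 1,000,000 GB) and reuses the 10 PB/year text message estimate from earlier in the deck; the variable names are illustrative.

```python
# Photo storage estimate from the slide's Flickr figures.
users = 51e6                  # 51 million users
gb_per_user_per_year = 2      # 2 GB/year per user
text_pb_per_year = 10         # text message estimate from earlier in the deck

GB_PER_PB = 1e6  # ASSUMPTION: decimal units

photos_pb_per_year = users * gb_per_user_per_year / GB_PER_PB
ratio_to_text = photos_pb_per_year / text_pb_per_year

print(f"{photos_pb_per_year:.0f} PB/year of photos")
print(f"about {ratio_to_text:.0f}x the text message volume")
```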
![Page 18: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/18.jpg)
How big is this in wine bottles?
![Page 19: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/19.jpg)
We have learned how to store it!
![Page 20: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/20.jpg)
What is HDFS?
• Distributed File System
• Open Source
• No more SAN
• The Failure Unit is the Server
![Page 21: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/21.jpg)
Fully unstructured data is boring
…Unless you get money for storing it
![Page 22: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/22.jpg)
Acquiring Personal Information
Your Semi-structured Data, the Old Fashioned Way
![Page 23: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/23.jpg)
The Social Angle
Who do you talk to and how often?
![Page 24: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/24.jpg)
The Reasons
Why do you own a cell phone?
![Page 25: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/25.jpg)
Your Semi-structured Data, For Free
- at The Pub
Saturday, 1:39am
![Page 26: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/26.jpg)
Big Value
Extraction of meaning and insight
from semi-structured data
![Page 27: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/27.jpg)
Extracting Meaning from Humans
| Method | Examples |
| --- | --- |
| Turn semi-structure to structure | Image recognition, network proximity and super nodes, social media |
| Needle in a haystack | Extract outliers, Fraud |
| Herd behaviors | Clustering, Pattern Recognition, “Customers who bought this also bought” |
| Text classification and search | Text indexes, syntactic counting, pagerank |
| Text to structure | Semantic analysis, loose structure into structure |
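As a small illustration of the herd behavior row, “Customers who bought this also bought” can be approximated by counting which items share a basket. This is a minimal Python sketch of plain co-occurrence counting, not any particular retailer's method; the baskets and the `also_bought` helper are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "charger"},
    {"laptop", "mouse"},
]

# Count how often each unordered pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def also_bought(item: str, top_n: int = 3) -> list[tuple[str, int]]:
    """Return the items most often seen in the same basket as `item`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return scores.most_common(top_n)

print(also_bought("phone"))
print(also_bought("laptop"))
```

At real scale the pair counting is the expensive part, which is where distributed processing over many cores comes in.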
![Page 28: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/28.jpg)
Find New Customers
“Michael, who is respected among his peers, often talks about his new, cool gadgets”
Michael
Thomas
Tommy
![Page 29: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/29.jpg)
Cross Sell
“Families who own an Aston Martin will often buy a Mini Cooper too”
![Page 30: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/30.jpg)
Free Information
![Page 31: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/31.jpg)
Need: Lots of CPU Cores!
![Page 32: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/32.jpg)
Need: Data Centers!
![Page 33: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/33.jpg)
Provisioning has to be REALLY fast
![Page 34: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/34.jpg)
Things to Learn for the Future
• Get good at
  • Statistics (again)
  • Distributed Algorithms
  • Tuning
• Understand Physical Constraints
• Acquire deep domain knowledge
![Page 35: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/35.jpg)
Something is Changing
|  | Today | Tomorrow |
| --- | --- | --- |
| You | CAPEX Hardware | OPEX Hardware |
![Page 36: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/36.jpg)
The Mother of All Stovepipes
![Page 37: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/37.jpg)
Data you are afraid to lose
Big Data / Staging (No Model)
Delivery (Model)
Data you actually need
![Page 38: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/38.jpg)
Synergy
Create Structure for me
Here is a table
Warehouse
![Page 39: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/39.jpg)
Applying Social Media to Structure
![Page 40: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/40.jpg)
Summary
Data Warehouse
• There is a model
• Seek Co-location
• Respond in seconds
• Calculate first, query after
• Expensive HW
• Optimise for target HW
• Homogeneous HW
• Pay vendor, expect optimised
Big Data
• Don’t bother modeling!
• Optional Co-Location
• Respond in minutes
• Calculate while querying
• Cheap HW
• Good enough on all HW
• Heterogeneous HW
• Free license, optimise yourself
![Page 41: Big Data vs Data Warehousing](https://reader034.vdocuments.net/reader034/viewer/2022051412/54876f5eb479590f0d8b540e/html5/thumbnails/41.jpg)
Q & A