Analyzing Real-World Data with Apache Drill
© 2014 MapR Technologies
Analyzing Real-World Data with Apache Drill
Tomer Shiran
VP Product Management, MapR Technologies
Co-Founder, PMC Member and Committer, Apache Drill
November 20, 2014
IDC estimates that in 2020, there will be 44 zettabytes of data in the world
• 2011: 1.8 zettabytes
• 2013: 4.4 zettabytes
• 2020: 44 zettabytes
Source: IDC Digital Universe
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, structured vs. unstructured data, 1980-2020]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
NoSchema Datastores are Capturing this Data
[Timeline axis: 1980, 1990, 2000, 2010, 2020]
Relational databases vs. "NoSchema" datastores:
• Volume: MBs-GBs vs. TBs-PBs
• Database: fixed schema, DBA controls structure vs. dynamic schema (schema-free), application controls structure
• Structure: structured vs. structured, semi-structured and unstructured
• Development: planned (release cycle = months-years) vs. iterative (release cycle = days-weeks)
SQL in the Big Data World
WANT
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
DON’T WANT
• Create and maintain schemas on:
  – HDFS (Parquet, JSON, etc.)
  – HBase
  – MongoDB
• Transform or copy data
We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Point-and-query vs. schema-first
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
40+ contributors
150+ years of experience building databases and distributed systems
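Evolution Towards Self-Service Data Exploration
• Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
• Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
• SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
• Self-Service Data Exploration: data modeling and transformation not needed; data visualization self-service (zero-day analytics)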
Drill’s Data Model is Flexible
[Diagram: data formats placed along two flexibility axes, flat to complex and fixed schema to schema-less: CSV, TSV, Parquet, Avro, JSON, BSON, HBase]

RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3

Apache Drill table:
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
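Reading the schema from self-describing data can be illustrated with a small Python sketch. It is a toy under stated assumptions, not how Drill is implemented: `discover_schema` is a hypothetical helper, and the input is the pair of records from the "Apache Drill table" example rewritten as valid JSON.

```python
import json

# The two self-describing records from the "Apache Drill table" example,
# written here as valid JSON (one record per line).
RECORDS = """
{"name": {"first": "Michael", "last": "Smith"}, "hobbies": ["ski", "soccer"], "district": "Los Altos"}
{"name": {"first": "Jennifer", "last": "Gates"}, "hobbies": ["sing"], "preschool": "CCLC"}
"""


def discover_schema(lines):
    """Return {field path: set of type names} observed across all records.

    Hypothetical helper: the schema is derived from the data itself,
    with no table definition declared in advance.
    """
    schema = {}

    def visit(prefix, value):
        if isinstance(value, dict):
            # Descend into nested maps, building dotted paths like name.first.
            for key, child in value.items():
                visit(f"{prefix}.{key}" if prefix else key, child)
        else:
            schema.setdefault(prefix, set()).add(type(value).__name__)

    for line in lines:
        if line.strip():
            visit("", json.loads(line))
    return schema


schema = discover_schema(RECORDS.splitlines())
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

Fields that occur in only one record (`district`, `preschool`) simply show up as additional paths, which is the point of the schema-on-the-fly column above.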
Native JSON
JSON query with Oracle:
SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER) FROM j_purchaseorder;
JSON query with Drill:
SELECT po_document.AllowPartialShipment FROM j_purchaseorder;
Relational databases cannot provide true schema-free JSON support.
High Level Architecture
• Cluster of commodity servers
  – Daemon (drillbit) on each node
• No dependency on other execution engines (MapReduce, Spark, Tez)
  – Better performance and manageability
• ZooKeeper maintains ephemeral cluster membership information
  – drillbit uses ZooKeeper to find other drillbits in the cluster
  – Client uses ZooKeeper to find drillbits
• Data processing unit is columnar record batches
  – Enables schema flexibility with negligible performance impact
Drill Maximizes Data Locality
Data Source        Best Practice
HDFS or MapR-FS    drillbit on each DataNode
HBase or MapR-DB   drillbit on each RegionServer
MongoDB            drillbit on each mongod node (when using replicas, run it on the replica node)
[Diagram: a drillbit co-located with each DataNode/RegionServer/mongod, alongside a ZooKeeper ensemble]
SELECT* Query Execution
1. Find drillbits (once per session)
2. Submit query to drillbit
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster (completely distributed execution)
5. Return results to client
[Diagram: client (JDBC, ODBC, REST), ZooKeeper ensemble, and drillbits]
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
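For a REST client, steps 2 and 5 boil down to one HTTP round trip. The following Python sketch builds such a request without sending it. It rests on an assumption: the `/query.json` endpoint and the `queryType`/`query` fields follow the REST API documented for later Drill releases and may not match version 0.7, so check them against the release you run. `build_drill_query_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_drill_query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) an HTTP request that submits a SQL query.

    Assumption: endpoint path and payload fields are taken from the REST
    API of later Drill releases; they are not shown in the slides.
    """
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_query_request("SELECT * FROM mongo.yelp.users LIMIT 1")
print(request.get_method(), request.full_url)

# To run the query against a live drillbit (not executed here):
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
```

The request goes to a single drillbit; planning and distributed execution (steps 3 and 4) happen on the cluster side and are invisible to the client.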
Core Modules within drillbit
[Diagram: RPC Endpoint, SQL Parser, Optimizer, Logical Plan, Physical Plan, Execution, Distributed Cache, Storage Plugins (DFS, HBase, Hive, MongoDB)]
Demo Plan
1. Run Drill
2. Configure DFS and MongoDB storage plugins
3. Explore the data
  – Basics
  – Complex data
  – Views
Run Drill in Embedded Mode (sqlline)
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
$ bin/sqlline -u jdbc:drill:zk=local
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json` LIMIT 1;
+---------------+---------------------------------+--------------+------+------------------------+
| yelping_since | votes                           | review_count | name | user_id                |
+---------------+---------------------------------+--------------+------+------------------------+
| 2012-02       | {"funny":1,"useful":5,"cool":0} | 6            | Lee  | qtrmBGNqCvupHMHL_bKFgQ |
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode (hence zk=local)
• Can’t use BI clients (JDBC/ODBC) in embedded mode
You can now access the Web UI: http://localhost:8047
Or Run Drill in Distributed Mode…
• Make sure ZooKeeper (zkServer) is running:
  $ zkServer start
  – Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
• Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
• Start drillbit:
  $ bin/drillbit.sh start
• Access the Web UI: http://localhost:8047
• Connect a client to the cluster (e.g., sqlline):
  $ bin/sqlline -u jdbc:drill:zk=localhost:2181
  – Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
  – If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
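The connection strings on these two slides follow one pattern, which a short Python sketch can make explicit. `drill_jdbc_url` is a hypothetical helper, the cluster name `demo` is made up, and joining several ZooKeeper addresses with commas is an assumption based on the usual ZooKeeper connect-string form rather than something the slides show.

```python
def drill_jdbc_url(zk_hosts=None, cluster=None):
    """Compose a Drill JDBC URL in the form used on the slides.

    zk_hosts: list of "host:port" ZooKeeper addresses, or None/empty for
              embedded mode (zk=local).
    cluster:  optional cluster name, appended as /drill/<clustername>.
    """
    if not zk_hosts:
        # Embedded mode: no ZooKeeper involved.
        return "jdbc:drill:zk=local"
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster:
        url += f"/drill/{cluster}"
    return url


print(drill_jdbc_url())
print(drill_jdbc_url(["localhost:2181"]))
print(drill_jdbc_url(["localhost:2181"], cluster="demo"))
```

The resulting string is what gets passed to `bin/sqlline -u` or to any other JDBC client.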
Inventory: DFS Files
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
Inventory: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin  (empty)
local  0.078GB
yelp   0.453GB
> use yelp
> db.users.findOne()
{
  "_id" : ObjectId("54566cdf3237149de181a92a"),
  "yelping_since" : "2012-02",
  "votes" : { "funny" : 1, "useful" : 5, "cool" : 0 },
  "review_count" : 6,
  "name" : "Lee",
  "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
  "friends" : [ ]
}
Let’s Go!
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
  WHERE stars = 1 LIMIT 1;
+------------+------------+------------+------------+------------+------------+------------+-------------+
| votes | user_id | review_id | stars | date | text | type | business_id |
+------------+------------+------------+------------+------------+------------+------------+-------------+
| {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
+------------+------------+------------+------------+------------+------------+------------+-------------+
Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
In dfs.demo.`yelp/review.json`, dfs is the storage plugin, demo is the workspace, and `yelp/review.json` is the path relative to the workspace.
Storage Plugin   Workspace   Table
dfs              Path        Path relative to workspace
mongo            Database    Collection
hive             Database    Table
hbase            Namespace   Table
Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
  FROM mongo.yelp.users
  GROUP BY name
  ORDER BY users DESC
  LIMIT 10;
+------------+------------+
| name       | users      |
+------------+------------+
| David      | 2453       |
| John       | 2378       |
| Michael    | 2322       |
| Chris      | 2202       |
| Mike       | 2037       |
| Jennifer   | 1867       |
| Jessica    | 1463       |
| Jason      | 1457       |
| Michelle   | 1439       |
| Brian      | 1436       |
+------------+------------+
Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
  FROM dfs.demo.`/yelp/business.json`
  GROUP BY state, city
  ORDER BY businesses DESC
  LIMIT 10;
+------------+------------+-------------+
| state      | city       | businesses  |
+------------+------------+-------------+
| NV         | Las Vegas  | 12021       |
| AZ         | Phoenix    | 7499        |
| AZ         | Scottsdale | 3605        |
| EDH        | Edinburgh  | 2804        |
| AZ         | Mesa       | 2041        |
| AZ         | Tempe      | 2025        |
| NV         | Henderson  | 1914        |
| AZ         | Chandler   | 1637        |
| WI         | Madison    | 1630        |
| AZ         | Glendale   | 1196        |
+------------+------------+-------------+
business.json (1)
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
business.json (2)
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true, "intimate": false, "touristy": false, "hipster": false,
      "classy": true, "trendy": false, "casual": false
    },
    "Good For": {
      "dessert": false, "latenight": false, "lunch": false,
      "dinner": true, "breakfast": false, "brunch": false
    }
  }
}
Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
  FROM dfs.demo.`yelp/business.json` b
  WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00'
  LIMIT 2;
+------------+------------+
| name | hours |
+------------+------------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
+------------+------------+
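The WHERE clause compares 'HH:MM' strings directly. That works because zero-padded 24-hour times sort lexicographically in chronological order, but it has one caveat: a place that closes at midnight ("00:00", as Mon Ami Gabi does on Saturday in the business.json sample) fails the `close > '22:00'` test. The Python sketch below shows both behaviours. `is_open` and its wrap-around rule are an illustrative refinement and an assumption of this example, not something Drill provides.

```python
def is_open_naive(open_time, close_time, now):
    """Same predicate as the SQL WHERE clause: plain string comparison."""
    return open_time < now and close_time > now


def is_open(open_time, close_time, now):
    """Variant that also handles hours that wrap past midnight.

    Hypothetical refinement: when close <= open, the closing time is
    treated as falling on the next day. The opening minute counts as open.
    """
    if close_time <= open_time:
        return now >= open_time or now < close_time
    return open_time <= now < close_time


# Saturday hours taken from the slides.
chang_jiang = is_open_naive("11:00", "22:30", "22:00")
mon_ami_gabi_naive = is_open_naive("07:00", "00:00", "22:00")
mon_ami_gabi = is_open("07:00", "00:00", "22:00")
print(chang_jiang, mon_ami_gabi_naive, mon_ami_gabi)
```

In SQL the same refinement would need an extra OR branch for the case where the closing time is not greater than the opening time.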
It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, stars, b.hours.Friday, categories
  FROM dfs.demo.`yelp/business.json` b
  WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00'
    AND REPEATED_CONTAINS(categories, 'Mediterranean')
    AND city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+------------+------------+------------+------------+
| name | stars | EXPR$2 | categories |
+------------+------------+------------+------------+
| Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------------+------------+------------+------------+
Flatten Repeated Values
> SELECT name, categories
  FROM dfs.demo.`yelp/business.json`
  LIMIT 3;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | ["Doctors","Health & Medical"] |
| Pine Cone Restaurant | ["Restaurants"] |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+------------+------------+
> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.demo.`yelp/business.json`
  LIMIT 5;
+------------+------------+
| name | categories |
+------------+------------+
| Eric Goldberg, MD | Doctors |
| Eric Goldberg, MD | Health & Medical |
| Pine Cone Restaurant | Restaurants |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants |
+------------+------------+
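What FLATTEN does to a repeated field can be mimicked in a few lines of Python. This is only a sketch of the semantics visible in the output above, using the three rows from the slide; the `flatten` generator is a hypothetical stand-in, not Drill code.

```python
def flatten(rows, column):
    """Yield one output row per element of the repeated `column`.

    All other fields of the row are copied unchanged into each output row.
    """
    for row in rows:
        for element in row[column]:
            yield {**row, column: element}


# The three rows returned by the first query on the slide.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flat = list(flatten(businesses, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```

Three input rows with five category values in total become five output rows, matching the second result table.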
Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.demo.`yelp/business.json`) c
  GROUP BY category
  ORDER BY businesses DESC;
+------------+------------+
| category | businesses |
+------------+------------+
| Restaurants | 14303 |
…
| Australian | 1 |
| Boat Dealers | 1 |
| Firewood | 1 |
+------------+------------+
715 rows selected (3.439 seconds)
> SELECT name, categories
  FROM dfs.demo.`yelp/business.json`
  WHERE true and REPEATED_CONTAINS(categories, 'Australian');
+------------+------------+
| name | categories |
+------------+------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+------------+------------+
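The two queries above combine flattening with aggregation and then filter on array membership. A Python sketch of the same logic follows, over a small sample whose business names come from the slides but whose category lists are abbreviated for illustration. `repeated_contains` is a hypothetical analogue that only covers exact matches.

```python
from collections import Counter


def repeated_contains(values, wanted):
    """Rough analogue of REPEATED_CONTAINS, exact matches only."""
    return wanted in values


# Small illustrative sample shaped like business.json (abbreviated categories).
businesses = [
    {"name": "The Australian AZ", "categories": ["Bars", "Australian", "Restaurants"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Olives", "categories": ["Mediterranean", "Restaurants"]},
]

# GROUP BY category over the flattened arrays, ORDER BY count DESC.
counts = Counter(c for b in businesses for c in b["categories"])
for category, n in counts.most_common():
    print(category, n)

# WHERE REPEATED_CONTAINS(categories, 'Australian')
matches = [b["name"] for b in businesses
           if repeated_contains(b["categories"], "Australian")]
print(matches)
```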
Create a View for Name-Gender Mapping
[Figure: names.csv sample; columns[0] holds the name and columns[4] the gender]
> CREATE VIEW dfs.tmp.`names` AS
  SELECT columns[0] AS name, columns[4] AS gender
  FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
  SELECT columns[0] AS name, columns[4] AS gender
  FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------------+------------+
| name       | gender     |
+------------+------------+
| John       | Male       |
+------------+------------+
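The view relies on Drill exposing each CSV line as an array called `columns`. The Python sketch below mimics that idea with the standard `csv` module. The file content is a made-up stand-in: the slides only establish that field 0 is the name and field 4 is the gender, so the `x` filler fields and the `names_view` helper are assumptions.

```python
import csv
import io

# Hypothetical stand-in for names.csv. Only positions 0 (name) and
# 4 (gender) are known from the slides; the other fields are filler.
NAMES_CSV = "John,x,x,x,Male\nJennifer,x,x,x,Female\n"


def names_view(text):
    """Mimic the view: expose columns[0] as name and columns[4] as gender."""
    for columns in csv.reader(io.StringIO(text)):
        yield {"name": columns[0], "gender": columns[4]}


rows = list(names_view(NAMES_CSV))
print([r for r in rows if r["name"] == "John"])
```

As with the view, nothing is copied or converted up front; the positional fields are given names at read time.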
Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
  FROM mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name
  GROUP BY u.name, n.gender
  ORDER BY number DESC
  LIMIT 10;
+------------+------------+------------+
| name       | gender     | number     |
+------------+------------+------------+
| David      | Male       | 2453       |
| John       | Male       | 2378       |
| Michael    | Male       | 2322       |
| Chris      | Unknown    | 2202       |
| Mike       | Male       | 2037       |
| Jennifer   | Female     | 1867       |
| Jessica    | Female     | 1463       |
| Jason      | Male       | 1457       |
| Michelle   | Female     | 1439       |
| Brian      | Male       | 1436       |
+------------+------------+------------+
Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
  FROM mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name
  GROUP BY n.gender;
+------------+------------+------------+
| gender     | users      | stars      |
+------------+------------+------------+
| Female     | 103684     | 3.77       |
| Male       | 97430      | 3.696      |
| Unknown    | 18409      | 3.727      |
+------------+------------+------------+
Who Writes More – Men or Women?
It takes a 3-way join to find out…
> SELECT n.gender, round(avg(length(r.text))) AS review_length
  FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name AND r.user_id = u.user_id
  GROUP BY n.gender;
+------------+---------------+
| gender     | review_length |
+------------+---------------+
| Male       | 665           |
| Female     | 730           |
| Unknown    | 711           |
+------------+---------------+
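The shape of this 3-way join (a JSON file, a MongoDB collection, and a CSV-backed view) can be sketched in Python with tiny made-up samples. All data values below are hypothetical, and the dictionary lookups assume each `user_id` and each name occurs once in its lookup table, which a real SQL join does not require.

```python
from collections import defaultdict

# Tiny hypothetical samples standing in for the three sources.
reviews = [  # dfs.demo.`yelp/review.json`
    {"user_id": "u1", "text": "Great hummus."},
    {"user_id": "u1", "text": "Slow service."},
    {"user_id": "u2", "text": "Loved it!"},
]
users = [  # mongo.yelp.users
    {"user_id": "u1", "name": "John"},
    {"user_id": "u2", "name": "Jennifer"},
]
names = [  # dfs.tmp.names
    {"name": "John", "gender": "Male"},
    {"name": "Jennifer", "gender": "Female"},
]

# Lookup tables for the two join keys (a simple hash join).
name_by_user = {u["user_id"]: u["name"] for u in users}
gender_by_name = {n["name"]: n["gender"] for n in names}

lengths = defaultdict(list)
for r in reviews:
    name = name_by_user.get(r["user_id"])
    gender = gender_by_name.get(name)
    if gender is not None:  # inner join: rows without a match are dropped
        lengths[gender].append(len(r["text"]))

# GROUP BY gender with round(avg(length(text))).
review_length = {g: round(sum(v) / len(v)) for g, v in lengths.items()}
print(review_length)
```

In Drill the three inputs stay where they are; the query names them by storage plugin and workspace and the engine performs the join.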
Thank You
• Learn: incubator.apache.org/drill/
• Download: incubator.apache.org/drill/download/
• Ask questions: [email protected]
• Contact me: [email protected]
Tomer Shiran, VP Product Management
MapR on social media: @mapr, maprtech, MapRTechnologies, mapr-technologies