The Economics of SQL on Hadoop

Watch the Recording of this Webinar

View the entire recorded webinar at:

http://info.datameer.com/Slideshare-Economics-SQL-Hadoop.html

About our Speakers

John Myers

John Myers joined Enterprise Management Associates in 2011 as senior analyst of the business intelligence (BI) practice area. John has 10+ years of experience working in areas related to business analytics in professional services consulting and product development roles, as well as helping organizations solve their business analytics problems, whether they relate to operational platforms, such as customer care or billing, or applied analytical applications, such as revenue assurance or fraud management.

Stefan Groschupf

Stefan Groschupf is the co-founder and CEO of Datameer. One of the original contributors to Nutch, the open source predecessor of Hadoop, Stefan has been at the forefront of the Hadoop and Big Data market. Prior to Datameer, Stefan was the co-founder and CEO of Scale Unlimited, which implemented custom Hadoop analytic solutions for HP, Sun, Deutsche Telekom, Nokia and others. Earlier, Stefan was CEO of 101Tec, a supplier of Hadoop and Nutch-based search and text classification software to industry-leading companies such as Apple, DHL and EMI Music. Stefan has also served as CTO at multiple companies, including Sproose, a social search engine company.

Matt Schumpert

Matt has been working in enterprise software for over 10 years in various capacities, including sales engineering, strategic alliances and consulting. Matt currently runs the pre-sales engineering team at Datameer, supporting all technical aspects of customer engagement through roll-out of customers into production. Matt holds a BS in Computer Science from the University of Virginia.

Agenda

▪ EMA on Current State of the Big Data Industry
– Online Archiving in Practice
– SQL on NoSQL: Metadata
– Exploratory Use Cases
– Late Binding Schemas better for Discovery
– Economics of Hadoop

▪ Datameer on how to solve these problems
– Use Case #1: Semi-Structured Data
– Use Case #2: Text Analytics data
– Use Case #3: Path Analysis

▪ Takeaways; and Question and Answer

State of Big Data Industry

Online Archiving is the majority use case for Big Data projects

Moving beyond “select * from tablename”: SQL requires a managed set of metadata

Big Data Platforms have Multiple Uses: Discovery is a significant portion

Late Binding Schemas are good for Discovery

Free as in a free puppy…

Datameer Demos

Use Case #1: Semi-Structured Data

▪ Noisy, log-structured data → signal
▪ Extract, cast, & define fields on demand
▪ Painful/impossible without inspection
▪ “One-offs” are possible with SQL+UDFs
▪ But better to collaborate with shared “views”
▪ Examples:
– “User-agent” string
– URL Parameters
– JSON
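As a rough illustration of "extract, cast, & define fields on demand", the sketch below pulls typed fields out of one raw log line covering the three examples listed (user-agent string, URL parameters, JSON). The log layout, the field names, and the `parse_line` helper are assumptions made for this example and are not taken from the webinar or from Datameer's product.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical tab-separated log line: timestamp, request URL, user-agent, JSON payload.
RAW_LINE = (
    "2014-03-01T12:00:05Z\t"
    "http://example.com/search?q=hadoop&page=2\t"
    "Mozilla/5.0 (Windows NT 6.1) Chrome/33.0\t"
    '{"user_id": "42", "cart": {"items": 3}}'
)

def parse_line(line):
    """Split one raw line and derive typed fields at read time."""
    timestamp, url, user_agent, payload = line.split("\t")
    params = parse_qs(urlparse(url).query)   # URL parameters -> dict of lists
    body = json.loads(payload)               # JSON payload -> nested dict
    return {
        "timestamp": timestamp,
        "query": params.get("q", [None])[0],
        "page": int(params.get("page", ["1"])[0]),   # cast string -> int
        # Deliberately crude user-agent check, for illustration only.
        "browser": "Chrome" if "Chrome/" in user_agent else "other",
        "user_id": int(body["user_id"]),             # cast string -> int
        "cart_items": body.get("cart", {}).get("items", 0),
    }

record = parse_line(RAW_LINE)
print(record)
```

Inspecting the printed record after each change is the kind of frequent inspection the slide calls for: a wrong split character or cast shows up immediately instead of deep inside a later query.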

Use Case #2: Text Analytics

▪ Few/no known fields
▪ Notion of a record is nebulous / fluid
▪ Wrangling and mining
▪ “Bag-of-Words” is a sensible start
▪ Again, frequent inspection is key
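The "Bag-of-Words" starting point can be sketched in a few lines: each text is reduced to token counts, ignoring word order. The sample sentences and the simple tokenizer here are assumptions for illustration; a real pipeline would likely add stop-word removal and stemming.

```python
import re
from collections import Counter

# Hypothetical free-text snippets; there are no predefined fields.
DOCS = [
    "The cluster is slow, and the job failed again.",
    "Job failed: the disk is full.",
]

def bag_of_words(text):
    """Lowercase the text, keep runs of letters, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bags = [bag_of_words(doc) for doc in DOCS]   # one bag per document
total = sum(bags, Counter())                 # corpus-wide counts
print(total.most_common(5))
```

Because the notion of a record is fluid, the same function can be applied per sentence, per document, or per session simply by changing what is passed in as `text`.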

Use Case #3: Path Analysis

▪ Key component of clickstream analysis
▪ Compares each record to the next/previous
▪ Defines/summarizes transitions, not events
▪ Supported by list/array types
▪ Requires multi-pass queries
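The points above can be illustrated with a small two-pass sketch: the first pass groups events into an ordered page list per user (the list/array type), and the second pairs each page with the next one and counts the transitions. The event tuples and function names are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical clickstream events: (user_id, timestamp, page).
EVENTS = [
    ("u1", 3, "checkout"),
    ("u1", 1, "home"),
    ("u2", 1, "home"),
    ("u1", 2, "product"),
    ("u2", 2, "product"),
    ("u2", 3, "home"),
]

def build_paths(events):
    """Pass 1: group by user and order by time, giving one page list per user."""
    paths = defaultdict(list)
    for user, _ts, page in sorted(events, key=lambda e: (e[0], e[1])):
        paths[user].append(page)
    return dict(paths)

def count_transitions(paths):
    """Pass 2: pair each page with the next one and count the transitions."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts

paths = build_paths(EVENTS)
transitions = count_transitions(paths)
print(transitions.most_common())
```

The result describes transitions such as home to product rather than individual page views, which is why a single pass over unordered events is not enough.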

Takeaways

When NOT to use SQL on Hadoop

▪ Structured Schemas or “Schema on Write”
▪ “Realtime” Query SLAs for operational or reporting tasks
▪ Highly detailed SQL query requirements (SQL-2003)

When to use SQL on Hadoop

▪ Unstructured Datasets and “Schema on Read”
▪ Discovery tasks designed to find new connections and new business value
▪ Lower level SQL queries (SQL-99)
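To make "Schema on Read" concrete, here is a minimal sketch in which the same stored lines are read under two different schemas, each chosen when the question is asked rather than when the data is written. The record layout, the schema dictionaries, and `read_with_schema` are assumptions made for this example.

```python
# Hypothetical raw records, stored as-is with no schema applied at write time.
RAW = [
    "2014-03-01,u1,19.99,US",
    "2014-03-02,u2,5.00,DE",
]

# Two schemas bound at read time: field name -> (column index, cast function).
REVENUE_SCHEMA = {"user": (1, str), "amount": (2, float)}
GEO_SCHEMA = {"date": (0, str), "country": (3, str)}

def read_with_schema(lines, schema):
    """Apply a schema while reading; the stored lines are never modified."""
    for line in lines:
        cols = line.split(",")
        yield {name: cast(cols[idx]) for name, (idx, cast) in schema.items()}

revenue = list(read_with_schema(RAW, REVENUE_SCHEMA))
geo = list(read_with_schema(RAW, GEO_SCHEMA))
print(revenue)
print(geo)
```

A discovery task that needs a new field only adds a schema entry; nothing has to be reloaded, which is the trade-off against the "Schema on Write" case in the previous list.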

Summary

▪ EMA on Current State of the Big Data Industry
– Online Archiving in Practice
– SQL on NoSQL: Metadata
– Exploratory Use Cases
– Late Binding Schemas better for Discovery

▪ Datameer on how to solve these problems
– Use Case #1: Semi-Structured Data
– Use Case #2: Text Analytics
– Use Case #3: Path Analysis

Call To Action

■ Visit our website
– www.datameer.com

■ Download our Trial
– http://www.datameer.com/Datameer-trial.html
