Bridging Structured and Unstructured Data with Apache Hadoop and Vertica

Glenn Gebhart, Steve Watt

Upload: steve-watt

Post on 22-Jan-2015

Page 1: Bridging Structured and Unstructured Data with Apache Hadoop and Vertica

Bridging Unstructured & Structured Data with Hadoop and Vertica

Glenn Gebhart

Steve Watt

Page 2

Contents

- Our background with Big Data

- Accelerating and monitoring Apache Hadoop deployments with HP CMU

- I have my Apache Hadoop cluster deployed... now what?

- Sample application scenario with Apache Hadoop and Vertica

Page 3

3 HP Confidential

Cluster Management Utility

Page 4

Managing Scale Out with HP CMU

- Proven cluster deployment and management tool
  - 11 years of experience
  - Proven with clusters of 3500+ nodes

- Deployment and management
  - Clone a node (Hadoop slave) and deploy it to an entire logical group
  - Provision applications and dependencies with parallel distributed copy (pdcp) and parallel distributed shell (pdsh)
  - Command-line or GUI-based cluster-wide configuration
  - Manage a node individually or manage the cluster as a whole

- Monitoring
  - Scalable, non-intrusive monitoring across a wide set of infrastructure metrics
  - Extensible through collectl integration

Page 5

Page 6

Tech Bubble?

What does the Data Say?

Attribution: CC Pascal Terjan via Flickr

Page 7

Page 8

But what if I could turn that into this?

Company      Investor           Amount      Round     Month  Year  Sector
InfoChimps   Stage One Capital  350 000     Angel     09     2010  Enterprise
InfoChimps   DFJ Mercury        1 200 000   Series A  11     2010  Enterprise
Color Labs   Sequoia Capital    41 000 000  Series A  03     2011  Consumer Web

Page 9

And see how the amount invested this year differs from previous years?

Page 10

Where is the money going?

Page 11

What types of startups get the most investment funding?

Page 12

Amount invested in Software Startups by Zip Code

Page 13

How did you do that?

Attribution: CC Colin_K via Flickr

Page 14

Apache Nutch

Identify optimal seed URLs and crawl to a depth of 2

http://www.crunchbase.com/companies?c=a&q=privately_held

Crawl data is stored in segment directories on HDFS

Page 15

Page 16

Making the data STRUCTURED

- Retrieving HTML

- Preliminary filtering on URL

- Company POJO, then tab-delimited (\t) output
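The last two steps on this slide (filter on URL, then fill a company POJO and write it out tab-delimited) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from the talk: the class name FundingRecord, its fields (taken from the table columns on the earlier slide), and the "/company/" URL pattern are all hypothetical.

```java
import java.util.Locale;

/** Hypothetical POJO for one funding round; fields mirror the table columns shown earlier. */
final class FundingRecord {
    final String company, investor, round, sector;
    final long amount;
    final int month, year;

    FundingRecord(String company, String investor, long amount,
                  String round, int month, int year, String sector) {
        this.company = company;
        this.investor = investor;
        this.amount = amount;
        this.round = round;
        this.month = month;
        this.year = year;
        this.sector = sector;
    }

    /** Preliminary URL filter. The "/company/" path segment is an assumption about the crawled site. */
    static boolean isCompanyPage(String url) {
        return url != null && url.contains("crunchbase.com/company/");
    }

    /** Replaces tabs and line breaks inside a field so they cannot break the delimited layout. */
    private static String clean(String field) {
        return field == null ? "" : field.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    /** Emits one tab-delimited line: Company, Investor, Amount, Round, Month, Year, Sector. */
    String toTsv() {
        return String.join("\t",
                clean(company), clean(investor), Long.toString(amount), clean(round),
                String.format(Locale.ROOT, "%02d", month), Integer.toString(year), clean(sector));
    }
}
```

In a MapReduce job, a mapper would typically call isCompanyPage on each crawled URL, parse the HTML of the pages that pass, and emit toTsv() as the output value, so that downstream tools such as Pig can load the result with a tab delimiter.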

Page 17

Aargh!

My viz tool requires zipcodes to plot geospatially!

Page 18

Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica

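-- Load the zip code lookup table (tab-delimited).
ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);

-- Load the structured CrunchBase records (tab-delimited).
CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

-- Join on (City, State) to attach a zip code to each investment record.
CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);

-- Keep one copy of each column and add the zip code, so the fields match the target table.
Projected = FOREACH CrunchBaseZip GENERATE CrunchBase::Company AS Company, CrunchBase::City AS City, CrunchBase::State AS State, CrunchBase::Sector AS Sector, CrunchBase::Round AS Round, CrunchBase::Month AS Month, CrunchBase::Year AS Year, CrunchBase::Investor AS Investor, CrunchBase::Amount AS Amount, ZipCodes::ZipCode AS ZipCode;

-- Write the result to the CrunchBaseZip table in Vertica (column types follow the LOAD schemas above).
STORE Projected INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int, ZipCode int)}' USING com.vertica.pig.VerticaStorer('VerticaServer', 'OSCON', '5433', 'dbadmin', '');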

Page 19

The Story So Far

• Used Nutch to retrieve investment data from the web site.

• Used Hadoop to extract and structure the data.

• Used Pig to add zipcode data.

• End result is a collection of relations describing investment activity.

• We’ve got raw data, now we need to understand it.

Page 20

Why Vertica?

• Vertica and Hadoop are complementary technologies.

• Hadoop’s strengths:

– Analysis of unstructured data (screen scraping, natural language recognition)

– Non-numeric operations (graphics preparation)

• Vertica’s strengths:

– Counting, adding, grouping, sorting, …

– Rich suite of advanced analytic functions

– All at TB+ scales.

Page 21

Built from the Ground Up: The Four C’s of Vertica

• Columnar storage and execution: achieve best data query performance with the unique Vertica column store.

• Clustering: linear scaling by adding more resources on the fly.

• Compression: store more data, provide more views, use less hardware.

• Continuous performance: query and load 24x7 with zero administration.

Page 22

Getting Data From Here To There

Page 23

Connecting Vertica And Hadoop

• Vertica provides connectors for Hadoop 0.20.2 and Pig 0.7.

• Vertica acts as a passive component; Hadoop and Pig connect to Vertica to read and write data.

• Input is retrieved from Vertica using a standard SQL query.

• Output is written to a Vertica table.

Page 24

Vertica As a M/R Data Source

// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);

// Set the input format to retrieve data from Vertica
job.setInputFormatClass(VerticaInputFormat.class);

// Set the query to retrieve data from the Vertica DB
VerticaInputFormat.setInput(job, "SELECT * FROM foo WHERE bar = 'baz'");

Page 25

Vertica As a M/R Data Sink

// Set up the configuration and job objects
Configuration conf = getConf();
Job job = new Job(conf);

// Set the output format to write data to Vertica
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(VerticaRecord.class);
job.setOutputFormatClass(VerticaOutputFormat.class);

// Define the table which will hold the output
VerticaOutputFormat.setOutput(job, <table name>, <truncate table?>,
    <col 1 def>, <col 2 def>, …, <col N def>);

Page 26

Reading Data Via Pig

-- Read some tuples

A = LOAD 'sql://< Your query here >' USING com.vertica.pig.VerticaLoader('server1,server2,server3', '< DB name >', '5433', '< user >', '< password >');

Page 27

Writing Data Via Pig

-- Write some tuples

STORE < some var > INTO '{< table name > (< col 1 def >, < col 2 def >, …)}' USING com.vertica.pig.VerticaStorer('< server >', '< DB >', '5433', '< user >', '< password >');

Page 28

Reporting And Data Visualization

Page 29

Does My Favorite Application Work With Vertica?

• Vertica is an ANSI SQL-99 compliant database.

• Comes with drivers for ODBC, JDBC, and ADO.NET.

• If your tool uses a SQL DB, and speaks one of these protocols, it’ll work just fine.
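As a rough illustration of the JDBC route, the sketch below runs a grouping query against the CrunchBaseZip table written by the Pig script. It is a sketch under assumptions rather than a reference implementation: the class and method names are hypothetical, the host, port, database, and user are the sample values from the Pig slide, the Amount column is assumed to be stored as an integer, and the Vertica JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper showing plain JDBC access to Vertica. */
final class VerticaJdbcSketch {

    /** Builds a Vertica JDBC URL of the form jdbc:vertica://host:port/database. */
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:vertica://" + host + ":" + port + "/" + database;
    }

    /** Sums the invested amount per sector, largest first (assumes Amount is numeric). */
    static Map<String, Long> amountBySector(Connection conn) throws SQLException {
        String sql = "SELECT Sector, SUM(Amount) AS Total "
                   + "FROM CrunchBaseZip GROUP BY Sector ORDER BY Total DESC";
        Map<String, Long> totals = new LinkedHashMap<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                totals.put(rs.getString(1), rs.getLong(2));
            }
        }
        return totals;
    }

    /** Opens a connection with the sample settings from the Pig slide and runs the query. */
    static Map<String, Long> run() throws SQLException {
        String url = jdbcUrl("VerticaServer", 5433, "OSCON");
        try (Connection conn = DriverManager.getConnection(url, "dbadmin", "")) {
            return amountBySector(conn);
        }
    }
}
```

Because the query is ordinary SQL over a standard connection, any reporting tool that speaks JDBC could issue the same statement without Vertica-specific code.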

Page 30

We Support…

Page 31

Traditional Reports

• Integrates smoothly with reporting front ends such as Jasper and Pentaho.

• Scriptable via the vsql command-line tool.

• C/C++ SDK for parallelized, in-DB computation.

• But… you have to know what questions you want to ask.

Page 32

Graphical, Real-Time Data Exploration

Page 33

Wrap-Up

Page 34

In Closing…

• Solutions that combine Vertica with Hadoop can address a wide range of analytical challenges.

• Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.

• Getting them to work together is easy.

Page 35

Questions?