bdm39: hp vertica bi: sub-second big data analytics your users and developers can truly appreciate -...
HP VERTICA BI: SUB-SECOND BIG DATA ANALYTICS YOUR USERS AND DEVELOPERS CAN TRULY APPRECIATE
PRESENTED BY MINA NAGUIB
BIG DATA MONTRÉAL, AUGUST 2015
ABOUT ME
Director, Platform Engineering @AdGear
Background: Software hacker, network enthusiast, web designer, SQL weaver, kernel debugger, PM, RE, SRE, QA, ...
What I do: Hire great people at AdGear; offer technical leadership; get out of their way; observe, optimize, rinse, repeat
ABOUT ADGEAR
AdGear is a digital advertising technology company, providing platforms, ad technology and services to publishers, advertisers, media agencies and ad tech providers.
AdGear delivers a full-stack advertising platform that includes: Demand-Side Platform, Supply-Side Platform, 1st and 3rd Party Ad Server, Attribution and Analytics, and multiple retargeting offerings.
ABOUT ADGEAR
2008 year founded
40 employees
2 offices (514, 416)
~10 billion impressions served per month
0.5 Trillion Bid Requests per month
ADGEAR: DATA
Internet advertising generates lots of data, the majority of which is transactional data that must be accurately accounted for.
If you can't account for it, it didn't happen. The data generated is often more important than the occurrence of the event itself.
ADGEAR: SOME NUMBERS
September 2008 First event served in production
2008 2 events / second
2010 250 events / second
2012 2,500 events / second
2014 5,500 events / second
RTB changed the game:
2012 80,000 events / second
2014 200,000 events / second
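As a rough sanity check on these figures, the monthly volumes quoted on the ABOUT ADGEAR slide can be converted to per-second rates. This is only a back-of-the-envelope sketch: it assumes a 30-day month and perfectly even traffic, neither of which comes from the talk, and "events" on this slide may count more than bid requests or impressions alone.

```python
# Assumption (not from the talk): a 30-day month with evenly spread traffic.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def per_second(per_month: float) -> float:
    """Convert a monthly volume into an average per-second rate."""
    return per_month / SECONDS_PER_MONTH


# Monthly figures as quoted on the ABOUT ADGEAR slide.
bid_requests_per_second = per_second(0.5e12)  # 0.5 trillion bid requests
impressions_per_second = per_second(10e9)     # ~10 billion impressions

print(f"bid requests: ~{bid_requests_per_second:,.0f} / second")
print(f"impressions:  ~{impressions_per_second:,.0f} / second")
```

Under those assumptions the bid-request volume averages a little under 200,000 per second and impressions a few thousand per second, which is the same order of magnitude as the with-RTB and without-RTB event rates shown for 2014.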
ADGEAR: DATA
From Day 1:
Offer customers a self-serve reporting section in the UI to report on what happened
Make it responsive, pivotable, discoverable, useful and insightful
We're competing against dinosaurs with a closed-day banking mentality - go for realtime and semi-realtime
Safe and correct - better to say N/A than offer a partial metric
ADGEAR: DATA
The data architecture plan, circa 2008
Step 1: Log the event locally on the server it occurs on
Step 2: Harvest the events
Step 3: ???? (How hard can this really be?)
Step 4: Profit!
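Steps 1 and 2 of the plan can be sketched in a few lines. This is a minimal illustration only, not AdGear's "Harvester" code: the function names, the one-JSON-object-per-line layout, the file names and the event fields are all assumptions made for the example. The .json.gz extension is the one named on the later slides.

```python
import gzip
import json
import tempfile
from pathlib import Path


def log_event(log_path: Path, event: dict) -> None:
    """Step 1: append one event as a JSON line to a local gzip log."""
    # Mode "at" appends a new gzip member; readers handle concatenated members.
    with gzip.open(log_path, "at", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def harvest(log_dir: Path) -> list:
    """Step 2: collect every event from all local logs in a directory."""
    events = []
    for path in sorted(log_dir.glob("*.json.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            events.extend(json.loads(line) for line in f if line.strip())
    return events


def demo() -> list:
    """Write a few hypothetical events for two servers, then harvest them."""
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        log_event(log_dir / "server-a.json.gz", {"type": "impression", "ad_id": 1})
        log_event(log_dir / "server-a.json.gz", {"type": "click", "ad_id": 1})
        log_event(log_dir / "server-b.json.gz", {"type": "impression", "ad_id": 2})
        return harvest(log_dir)


if __name__ == "__main__":
    for event in demo():
        print(event)
```

Steps 1 and 2 really are this simple. What to do with the harvested events at scale is Step 3, and that is the hard part.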
ADGEAR: DATA
The elusive Step 3
2008 2009 2010
Raw event management: Home-grown "Harvester" library
Raw event warehousing: Single unix filesystem, .json.gz files, .sqlite files
Raw event analysis+aggregation: "Harvester" library streaming abstraction, custom jobs
Aggregate metrics warehousing: PostgreSQL (app-db) tables, key-value design
Reporting: Primary web-based app accessing aggregates key-values table
ADGEAR: DATA
The elusive Step 3
2009 2010 2011 2012
Raw event management: Home-grown "Harvester" library
Raw event warehousing: Single unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis+aggregation: "Harvester" library streaming abstraction, custom jobs
Aggregate metrics warehousing: PostgreSQL (app-db) tables, key-value design
Reporting: Primary web-based app accessing aggregates key-values table
ADGEAR: DATA
The elusive Step 3
2009 2010 2011 2012
Raw event management: Home-grown "Harvester" + "DDAL" libraries
Raw event warehousing: Multiple servers, unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis+aggregation: "Harvester" + "DDAL" libraries streaming abstraction, custom jobs
Aggregate metrics warehousing: PostgreSQL (app-db) tables, key-value design
Reporting: Primary web-based app accessing aggregates key-values table
ADGEAR: DATA
The elusive Step 3
2010 2011 2012 2013
Raw event management: Home-grown "Harvester" + "DDAL" libraries
Raw event warehousing: Multiple servers, unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis+aggregation: "Harvester" + "DDAL" libraries streaming abstraction, custom jobs
Aggregate metrics warehousing: Dedicated MongoDB server, hourly documents
Reporting: Dedicated reporting service abstracting away MongoDB
ADGEAR: DATA
The elusive Step 3
2011 2012 2013 2014
Raw event management: Home-grown "Harvester" + "DDAL" libraries
Raw event warehousing: Multiple servers, unix filesystem, .json.gz files, .sqlite CEROD files
Raw event analysis+aggregation: "Harvester" + "DDAL" libraries streaming abstraction, custom jobs
Aggregate metrics warehousing: Dedicated PostgreSQL reporting DB, star schema
Reporting: Dedicated reporting service abstracting away PG DB
ADGEAR: DATA
The elusive Step 3
2011 2012 2013 2014 2015
Raw event management: Home-grown push mechanism
Raw event warehousing: HDFS, .json.gz files, .avro files
Raw event analysis+aggregation: Hadoop M+R, Pig, Hive
Aggregate metrics warehousing: Dedicated PostgreSQL reporting DB, star schema
Reporting: Dedicated reporting service abstracting away PG DB
ADGEAR: DATA
The elusive Step 3
2012 2013 2014 2015
Raw event management: Home-grown push mechanism
Raw event warehousing: HDFS, .json.gz files, .avro files
Raw event analysis+aggregation: Hadoop M+R, Pig, Hive
Aggregate metrics warehousing: Vertica
Reporting: Dedicated reporting service abstracting away Vertica DB
ADGEAR: DATA
The elusive Step 3
2015
Raw event management: Home-grown push mechanism, Kafka
Raw event warehousing: HDFS, .json.gz files, .avro files
Raw event analysis+aggregation: Hadoop, HP Vertica, Hive
Aggregate metrics warehousing: HP Vertica
Reporting: Dedicated reporting service abstracting away Vertica DB
ADGEAR: DATA
HP Vertica = The "Secret Sauce" *
* Actual unsolicited description used by myself and other Vertica customers
From a dev/ops perspective, Vertica is:
• A columnar database
• Offers a familiar DB/Schema/Table/Row/Column paradigm
• Distributed + Horizontally scalable
• Easily accessible from the CLI and many programming languages
• Extremely fast
• Solid SQL support. Not 100% ANSI SQL-99 compliant, but more than enough for our use cases
• Stable, predictable, easy to administer
• Well documented
• Enterprise-ready, in production at many large companies
At AdGear
Fact Table N
Hour           Dimension1  Dimension2  Dimension3  Dimension...N  Metric1  Metric2  Metric...N
2015-08-05-01  1           55          105         9              1        0        0
2015-08-05-01  1           56          106         9              3551     6        9
2015-08-05-01  1           56          107         9              2382     6        66
2015-08-05-01  2           901         107         33             23       4        0
Growth via Append-Only row insertion
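The fact-table layout on this slide can be tried out with a small sketch. It uses Python's built-in SQLite as a stand-in, which is an assumption made so the example runs anywhere; it is not Vertica and says nothing about Vertica-specific DDL or performance. The table and column names are hypothetical, and the rows are the four shown on the slide.

```python
import sqlite3

# SQLite stands in for the reporting database here; this is not Vertica.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fact_table_n (
        hour        TEXT NOT NULL,   -- hourly bucket, e.g. '2015-08-05-01'
        dimension1  INTEGER,
        dimension2  INTEGER,
        dimension3  INTEGER,
        dimension_n INTEGER,
        metric1     INTEGER,
        metric2     INTEGER,
        metric_n    INTEGER
    )
    """
)

# Append-only growth: new rows are only ever inserted, never updated in place.
rows = [
    ("2015-08-05-01", 1, 55, 105, 9, 1, 0, 0),
    ("2015-08-05-01", 1, 56, 106, 9, 3551, 6, 9),
    ("2015-08-05-01", 1, 56, 107, 9, 2382, 6, 66),
    ("2015-08-05-01", 2, 901, 107, 33, 23, 4, 0),
]
conn.executemany("INSERT INTO fact_table_n VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def totals_by_dimension1(connection, hour):
    """Sum two metrics per dimension1 value for a single hour."""
    return connection.execute(
        "SELECT dimension1, SUM(metric1), SUM(metric2) "
        "FROM fact_table_n WHERE hour = ? "
        "GROUP BY dimension1 ORDER BY dimension1",
        (hour,),
    ).fetchall()


totals = totals_by_dimension1(conn, "2015-08-05-01")
print(totals)  # [(1, 5934, 12), (2, 23, 4)]
```

Because each row is already an aggregate for one hour and one combination of dimensions, a report is a GROUP BY over whichever dimensions the user pivots on.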
At AdGear
Fact Table 1, Fact Table 2, Fact Table 3
Dimension Table 1, Dimension Table 2, Dimension Table 3, Dimension Table 4, Dimension Table 5
Simple SQL joins
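A "simple SQL join" between a fact table and a dimension table might look like the sketch below. Again SQLite is used as a runnable stand-in rather than Vertica, and every table name, column name and value is hypothetical; the slide only names the tables generically.

```python
import sqlite3


def build_db():
    """Create one hypothetical dimension table and one fact table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dimension_table_1 (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_table_1 (hour TEXT, dimension1 INTEGER, metric1 INTEGER);
        INSERT INTO dimension_table_1 VALUES (1, 'Campaign A'), (2, 'Campaign B');
        INSERT INTO fact_table_1 VALUES
            ('2015-08-05-01', 1, 10),
            ('2015-08-05-01', 1, 5),
            ('2015-08-05-02', 2, 7);
        """
    )
    return conn


def metric_by_name(conn):
    """Resolve dimension ids to readable names and total the metric per name."""
    return conn.execute(
        """
        SELECT d.name, SUM(f.metric1)
        FROM fact_table_1 AS f
        JOIN dimension_table_1 AS d ON d.id = f.dimension1
        GROUP BY d.name
        ORDER BY d.name
        """
    ).fetchall()


report = metric_by_name(build_db())
print(report)  # [('Campaign A', 15), ('Campaign B', 7)]
```

The fact tables hold only ids and numbers, and the dimension tables supply the human-readable labels at query time.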
To download and try: https://my.vertica.com/community/
Free, up to 1TB, 3 nodes, no time limit
Get in touch: http://adgear.com/
Mina Naguib
To learn more: http://www.vertica.com/
Thank you