Hive and Hadoop usage at Square

Hadoop and Hive at Square

Nicolas Thiébaud

@nicothieb

nicolas@squareup.com

Data Engineering at Square

July 2014

Square: Make commerce easy

Remove crappy POSes from the counter

Building the best register for small businesses. Started with card processing, now bringing more value to merchants through the point of sale.

Merchant and Buyer facing products

Square Register, Square Cash, Pickup, Feedback

Data products

Merchant Analytics, Capital

Data at Square

Internal Data

Produced on app servers (~200+ services), MySQL or PostgreSQL

Logging and tracing from apps and web to a public endpoint

Example: payment data, user data, ledger entries

External Data

Payment processing partners ship flat files to us

Offline Data usage at Square

BI/Analysis/Reporting: ~200 MySQL users, ~100 Hadoop users

ML: Risk detection, recommendation

Apps: A/B testing, Commercial support, Capital

Data Architecture at Square: Kafka

Historical, most of our users still use this

App DB -> Analytical DB, stripping out PII and using cursoring; looking at binlog replication
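The cursoring and PII stripping mentioned above can be sketched roughly as follows. This is a minimal illustration, not Square's pipeline: the column names in `PII_COLUMNS`, the `id`-based cursor, and the function names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical column names treated as PII for this illustration.
PII_COLUMNS = {"email", "card_number"}


def strip_pii(row: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of the row without the PII columns."""
    return {k: v for k, v in row.items() if k not in PII_COLUMNS}


def copy_batch(
    source_rows: List[Dict[str, object]], cursor: int, batch_size: int = 2
) -> Tuple[List[Dict[str, object]], int]:
    """Read up to batch_size rows with id > cursor and strip PII.

    Returns the cleaned rows and the new cursor (the last id read),
    so the next call resumes where this one stopped.
    """
    pending = sorted(
        (r for r in source_rows if r["id"] > cursor), key=lambda r: r["id"]
    )
    batch = pending[:batch_size]
    new_cursor = batch[-1]["id"] if batch else cursor
    return [strip_pii(r) for r in batch], new_cursor
```

Calling `copy_batch` repeatedly with the returned cursor walks the source table in id order without re-reading rows that were already copied.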

Hadoop: Kafka as a backbone

App DB -> Kafka using cursoring and PII stripping

App Server -> Kafka (eg: tracing) in proto format

Feed consumption -> Kafka

Kafka is written to HDFS using offsets; duplicates are written when the consumer restarts

Raw data is deduped and extracted from protos to RCFiles in daily batches. Everything is exposed in Hive
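One way to picture the dedupe step: if each raw record keeps the Kafka coordinates it was read from, replayed records after a consumer restart share the same (topic, partition, offset) key and can be dropped. The slides do not say how the dedupe is implemented, so the record layout and field names below are assumptions.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def dedupe_by_offset(records: Iterable[Dict[str, object]]) -> Iterator[Dict[str, object]]:
    """Yield each record once, keyed on its Kafka coordinates.

    Assumes every record carries hypothetical 'topic', 'partition'
    and 'offset' fields; the first occurrence of a key wins.
    """
    seen: Set[Tuple[object, object, object]] = set()
    for rec in records:
        key = (rec["topic"], rec["partition"], rec["offset"])
        if key in seen:
            continue  # duplicate written after a consumer restart
        seen.add(key)
        yield rec
```

For a daily batch the set of seen keys stays bounded by one day of data; a larger job would more likely express the same idea as a group-by on the key.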

Most datasets don’t fit in MySQL. Most queries cannot run anymore

Analysts broke down their jobs to run on single-day windows. The query sniper keeps hitting them.
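Breaking a long-range job into single-day windows might look like the sketch below. The query template, table name and column name in the usage note are hypothetical; only the windowing idea comes from the slide.

```python
from datetime import date, timedelta
from typing import Iterator, List, Tuple


def daily_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield half-open one-day windows (day, next_day) covering [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield day, next_day
        day = next_day


def windowed_queries(template: str, start: date, end: date) -> List[str]:
    """Render one query per day from a template with {start} and {end} slots."""
    return [
        template.format(start=s.isoformat(), end=e.isoformat())
        for s, e in daily_windows(start, end)
    ]
```

For example, `windowed_queries("SELECT COUNT(*) FROM payments WHERE created_at >= '{start}' AND created_at < '{end}'", date(2014, 7, 1), date(2014, 7, 4))` yields three queries, one per day, each small enough to run on its own.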

MySQL is no longer supported as the source of truth for offline data. Tables are windowed

We keep revisiting the amount of data stored in MySQL

Everyone must migrate to Hive (users and apps)

MySQL analytical DBs will now be an export location for data reduced in Hadoop

All datasets must be present in Hadoop

Even small ones :)

Transitioning to Hive

Stability

Hive 10 + Hue 2.5 as starting point + many patches -> 2 restarts a day with small load

Decided to go to Hive 12 and patch the bugs affecting us in an internal build

Two major tasks: 10 -> 12 and building Hive internally

Reliability

Sentinel, data validation daemon

Conduit, Hive ETLs

Customer-defined SLAs

Education

Office hours, trainings, mailing list

Project Babar: Building a stable Hive 12

Patch open source Hive to address Square-specific issues

Set up integration tests in Kochiku, no performance tests

HiveServer only, no CLI. Staging and production envs

Push and pull changes to Apache JIRA

Build and deploy Hive artifacts

Makefile

metastore, hiveserver (staging and prod), CLI tools (beeline), hivesandbox

Package configuration

Misc

Hue 3.5

hive-udfs

Internal Hive Build

cdh5-0.12.0_5.0.1 branch + 9 commits

3 test fixes, 2 Square-specific changes (pom + CI)

DATAPLAT-436: Beeline should return non-zero on invalid statements

HIVE-5799: session/operation timeout for hiveserver2

HIVE-5707: Validate values for ConfVar

HIVE-7040: Allow TCP keep alive on Hive Server 2

(merged in cdh5-0.12.0_5.0.1) HIVE-6893: out of sequence error in HiveMetastore

Story of HIVE-7040 + HIVE-5799

HIVE-7040: Allow TCP keep alive on Hive Server 2

F5 stateful firewall kills open connections

HIVE-5799: session/operation timeout for hiveserver2

Beeline interrupt does not close sessions
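To illustrate why TCP keep-alive helps with a stateful firewall: with keep-alive enabled, the operating system sends periodic probes on an otherwise idle connection, so the firewall keeps seeing traffic for it. The snippet below only shows the socket option at the OS level in Python; it is not the HIVE-7040 patch, and the function name is made up for the example.

```python
import socket


def make_keepalive_socket() -> socket.socket:
    """Create a TCP socket with SO_KEEPALIVE turned on.

    Probe timing (idle time, interval, count) is left at the OS
    defaults here; tuning it is platform specific.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock
```

Keep-alive addresses idle connections being dropped; sessions left open after a client interrupt are a separate problem, which is what the session/operation timeout covers.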

Hive Ops trick: ./wait_for_hive_jobs && sudo sv restart /var/service/hiveserver
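The contents of `wait_for_hive_jobs` are not shown in the slides. A plausible shape for such a helper, sketched under that caveat, is a polling loop that blocks until no jobs are running or a timeout expires. The `count_running_jobs` callable is a hypothetical stand-in for however running jobs are actually counted.

```python
import time
from typing import Callable


def wait_until_idle(
    count_running_jobs: Callable[[], int],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no jobs are running.

    Returns True once count_running_jobs() reports zero, or False if
    the timeout elapses first. `sleep` is injectable to ease testing.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if count_running_jobs() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)
```

A wrapper script would exit with status 0 only when this returns True, so that the `&&` in the command above triggers the restart only once the server is idle.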

Next Steps

Figure out the best way to contribute back patches

HIVE-668{3,4}: Beeline comments suck

HIVE-7200: Beeline output displays column heading even if --showHeader=false is set

HIVE-4924: Support JDBC query timeouts

HIVE-5232: Use async interface for jdbc

Hive HA
