
Page 1: Scale-out databases for CERN use cases Strata Hadoop World London 6 th of May,2015 Zbigniew Baranowski, CERN IT-DB

Scale-out databases for CERN use cases

Strata Hadoop World London, 6th of May 2015

Zbigniew Baranowski, CERN IT-DB


About Zbigniew
• Joined CERN in 2009
  • Developer
  • Researcher
  • Database Administrator & Service Manager
• Responsible for
  • Engineering & LHC control database infrastructure
  • Database replication services in the Worldwide LHC Computing Grid


Outline
• About CERN
• The problem we want to tackle
• Why Hadoop? Why Impala?
• Results of Impala evaluation
• Summary & Future plans


About CERN
• CERN - European Laboratory for Particle Physics
• Founded in 1954 by 12 countries for fundamental physics research
• Today 21 member states + world-wide collaborations
  • 10,000 users from 110 countries


LHC is the world’s largest particle accelerator
• LHC = Large Hadron Collider
  • 27 km ring of superconducting magnets; 4 big experiments
  • Produces ~30 Petabytes annually
  • Just restarted after an upgrade – x2 collision energy (13 TeV) is expected by June 2015


Data warehouses at CERN
• More than 50% (~300 TB) of data stored in RDBMS at CERN are time series data!

[Figure: time series plot, values over time]

Date      Time      X   Y
05/03/15  00:00:00  32  43
05/03/15  01:00:00  12  432
05/03/15  02:00:00  43  21
05/03/15  03:00:00  34  21
05/03/15  04:00:00  45  4
05/03/15  05:00:00  32  32
05/03/15  06:00:00  42  12
05/03/15  07:00:00  24  34
05/03/15  08:00:00  12  4
05/03/15  09:00:00  42  32
05/03/15  10:00:00  34  5
05/03/15  11:00:00  45  32
05/03/15  12:00:00  32  32
05/03/15  13:00:00  21  4


Time series in RDBMS@CERN
• Logging systems
  • LHC log data: 50 kHz archiving, 200 TB + 90 TB/year
• Control and data acquisition systems (SCADA)
  • LHC detector controls
  • Quench Protection System: 150 kHz archiving, 2 TB/day
• Grid monitoring and dashboards
• and many others…


Nature of the time series data@CERN
• (signal_id, timestamp, value(s))
• Data structure in RDBMS
  • Partitioned Index Organized Table
  • Index key: (signal_id, time)
  • Partition key: time (daily)

[Figure: daily partitions: Day 1, Day 2, Day 3]


There is a need for data analytics
• Users want to analyze the data stored in RDBMS
  • sliding window aggregations
  • monthly, yearly statistics calculations
  • correlations
  • …
• Requires sequential scanning of the data sets
• Throughput limited to 1 GB/s
  • On currently deployed shared storage RDBMS clusters
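As an illustration of the kind of analysis meant by "sliding window aggregations", here is a minimal Python sketch of a trailing-window mean over (timestamp, value) samples. The function name `sliding_mean`, the window length and the sample values are assumptions made for this example; it is not the code used at CERN. Note that it consumes the samples in one sequential pass, which is why such workloads depend on scan throughput.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def sliding_mean(
    samples: Iterable[Tuple[float, float]], window: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, mean of the values in the interval (ts - window, ts]).

    `samples` must be ordered by timestamp; this is a single sequential pass.
    """
    buf = deque()  # samples currently inside the window
    total = 0.0    # running sum of the values in `buf`
    for ts, value in samples:
        buf.append((ts, value))
        total += value
        # Evict samples that fell out of the trailing window.
        while buf[0][0] <= ts - window:
            _, old = buf.popleft()
            total -= old
        yield ts, total / len(buf)


# Hypothetical hourly samples (timestamps in seconds), 2-hour trailing window.
hourly = [(0, 32.0), (3600, 12.0), (7200, 43.0), (10800, 34.0)]
result = list(sliding_mean(hourly, window=7200))
```

Each output point depends only on the samples inside the window, so the same pass can be run independently per signal or per partition.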


Benefits of Hadoop for data analysis
• It is an open architecture
• Many interfaces to data
  • Declarative -> SQL
  • Imperative -> Java, Python, Scala
• Many ways/formats for storing the data
• Many tools available for the data analytics


Benefits of Hadoop for data analysis
• Shared nothing -> It scales!

[Figure: Sequential scan of data with MapReduce; throughput vs. number of nodes (0 to 50). Data labels: 2247 MB/s, 4014 MB/s, 5374 MB/s, 7053 MB/s, 9069 MB/s]

Hardware used: CPU: 2 x 8 x 2.00 GHz, RAM: 128 GB
Storage: 3 SATA disks 7200 rpm (~120 MB/s per disk)


Why Impala?
• Runs parallel queries directly on Hadoop
• SQL for data exploration – declarative approach
• Non-MapReduce-based implementation -> better performance than Hive
  • C++
• Unified data access protocols (ODBC, JDBC)
  • easy binding of databases with applications


Impala evaluation plan from 2014
• 1st step: data loading
• 2nd step: data querying
• 3rd step: assessment of the results & user acceptance


Data loading
• Uploading data from RDBMS to HDFS
  • Periodic uploading with Apache Sqoop
  • Live streaming from Oracle via GoldenGate (PoC)
• Loading the data into final structures/tables
  • Using Hive/Impala


Different aspects of storing data
• Binary vs text
• Partitioning
  • Vertical
  • Horizontal
• Compression
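To make the "binary vs text" and "compression" aspects concrete, the following Python sketch encodes the same synthetic (signal_id, timestamp, value) records as CSV-like text and as fixed-width binary, then compresses both. Everything here is an assumption for illustration: the record layout, the synthetic values, and the use of the standard-library `bz2` module as a stand-in for the codecs compared in the evaluation (Snappy is not in the Python standard library). The absolute sizes say nothing about the ACCLOG measurements.

```python
import bz2
import math
import struct

# Synthetic records: (signal_id, unix timestamp, value). Purely illustrative.
records = [
    (10000 + i % 100, 1420761600 + i, round(math.sin(i / 100), 3))
    for i in range(10_000)
]

# Text encoding: one CSV-like line per record.
text = "".join(f"{sid},{ts},{val}\n" for sid, ts, val in records).encode("ascii")

# Binary encoding: two unsigned 32-bit integers and one 64-bit float,
# little-endian, no padding -> exactly 16 bytes per record.
binary = b"".join(struct.pack("<IId", sid, ts, val) for sid, ts, val in records)

sizes = {
    "text": len(text),
    "binary": len(binary),
    "text+bz2": len(bz2.compress(text)),
    "binary+bz2": len(bz2.compress(binary)),
}

for name, size in sizes.items():
    print(f"{name:>10}: {size} bytes")
```

The point of the sketch is only that encoding and compression are independent choices, and both change how many bytes a sequential scan has to read.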


Data size comparison – 8 days of ACCLOG

[Figure: bar chart of data volume (0 to 1800 GB) by format (CSV, SequenceFile, Avro, Parquet) and compression (no compression, snappy, bzip2). Data labels as extracted: 1240 GB, 1545 GB, 542 GB, 558 GB, 331 GB, 265 GB, 226 GB, 288 GB, 109 GB, 117 GB, 171 GB]

Size of original data stored in a relational database: 649 GB


Software used: CDH5.2+

Hardware used for testing: 16 ‘old’ machines

CPU: 2 x 4 x 2.00GHz RAM: 24GB

Storage: 12 SATA disks 7200rpm (~120MB/s per disk) per host

Impala sequential scans of 8 days of ACCLOG data

[Figure: bar chart of execution time (0 to 2000 s) by format (CSV, SequenceFile, Avro, Parquet) and compression (no compression, snappy, bzip2). Data labels as extracted: 757 s, 682 s, 216 s, 328 s, 687 s, 572 s, 113 s, 117 s, 1800 s, 118 s]


Scalability test of SQL on Hadoop (parquet)

[Figure: throughput (0.00 to 2.50 GB/s) of Impala and Hive on 4, 8, 12 and 16 nodes]

Hardware used: CPU: 2 x 4 x 2.00GHz RAM: 24GB

Storage: 12 SATA disks 7200rpm (~120MB/s per disk) per host


Querying the time series data
• Two types of data access
  • A) data extractions for a given signal within a time range (with various filters)
  • B) statistics collection, signal correlations and aggregations
• RDBMS
  • For A: index range scans -> fast data access -> 1 day within 5-10 s
  • For B: fast full index scans -> reading entire partitions -> max 1 GB/s
• Impala
  • Similar performance for A and B -> reading entire partitions
  • For A: lack of indexes -> slower than RDBMS in most cases
  • For B: much faster than RDBMS thanks to the shared-nothing, scalable architecture
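The difference between the two access paths for type A queries can be sketched in Python. This is a toy model under stated assumptions, not how Oracle or Impala are implemented: one daily partition is held as a list of (signal_id, time, value) tuples sorted by the index key (signal_id, time); `range_scan` mimics an index range scan with binary search, and `full_scan` mimics reading the entire partition when no index exists. All names and data are hypothetical.

```python
from bisect import bisect_left, bisect_right

# One daily partition, ordered by the index key (signal_id, time).
partition = sorted(
    (signal_id, hour * 3600, float(hour))
    for signal_id in (10000, 10115, 99074)
    for hour in range(24)
)


def range_scan(rows, signal_id, t_start, t_end):
    """Index-style access: binary search for both ends of the key range."""
    lo = bisect_left(rows, (signal_id, t_start))
    hi = bisect_right(rows, (signal_id, t_end, float("inf")))
    return rows[lo:hi]


def full_scan(rows, signal_id, t_start, t_end):
    """Index-less access: every row of the partition is inspected."""
    return [
        row for row in rows
        if row[0] == signal_id and t_start <= row[1] <= t_end
    ]


by_index = range_scan(partition, 10115, 3600, 10800)
by_scan = full_scan(partition, 10115, 3600, 10800)
```

Both functions return the same rows; the range scan only touches the matching slice after two binary searches, while the full scan cost grows with the partition size regardless of how selective the query is.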


Making single signal data retrieval faster with Impala
• Problem: no indexes in Impala – full partition scan needed
  • With daily partitioning we have 40 GB to read
• Possible solution: Fine-grain partitioning
  • (year, month, day, signal id)
• Concern: Number of HDFS objects
  • 365 days * 1M signals = 365M files per year
  • File size: 41 KB only!
• Solution: multiple signals data grouped in a single partition

id, time, value

Bucket 0
10000, 2015-01-09, 17
99000, 2015-01-09, 4
55000, 2015-01-09, 5

Bucket 15
10115, 2015-01-09, 5.6
10715, 2015-01-09, 9.8

Bucket 74
99074, 2015-01-09, 3.3
10074, 2015-01-09, 34
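The concern about the number of HDFS objects follows from simple arithmetic, sketched here in Python. The inputs (365 days, 1M signals, 40 GB per day) are the figures from the slide; treating 1 GB as 1024 * 1024 KB is an assumption, chosen because it reproduces the quoted 41 KB.

```python
DAYS_PER_YEAR = 365
SIGNALS = 1_000_000
DAILY_VOLUME_KB = 40 * 1024 * 1024  # 40 GB per day, binary units assumed

# One partition (at least one file) per signal per day.
files_per_year = DAYS_PER_YEAR * SIGNALS

# Average file size if a day's data is split evenly across all signals.
avg_file_kb = DAILY_VOLUME_KB / SIGNALS

print(f"{files_per_year:,} files per year, ~{avg_file_kb:.1f} KB each")
```

Files of a few tens of KB are far below a typical HDFS block size, which is why grouping several signals per partition is preferred over one partition per signal.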


Bucketing: proof of concept
• Based on mod(signal_id, x) function
  • where x is a tunable number of partitions created per day
  • (year, month, day, mod(signal id, x))
• And it works!
  • 10 partitions per day = 4 GB to read
  • Data retrieval time was reduced 10 times (from 15 s to <2 s)
• We have modified the Impala planner code to apply function-based partition pruning implicitly
  • No need for explicit specification of a grouping function in the ‘where’ clause
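The bucketing idea can be sketched in a few lines of Python. This is a toy in-memory model, not the modified Impala planner: `partition_key`, `load` and `query_signal` are hypothetical names, and x = 100 is an assumption chosen so that the sample ids fall into buckets 15 and 74. The lookup derives the bucket from the signal id itself, which mirrors the implicit partition pruning described above: the caller never has to mention the mod function.

```python
from collections import defaultdict
from datetime import date

NUM_BUCKETS = 100  # the tunable "x"; value assumed for this example


def partition_key(signal_id, day, x=NUM_BUCKETS):
    """Partition key (year, month, day, mod(signal_id, x))."""
    return (day.year, day.month, day.day, signal_id % x)


def load(rows, x=NUM_BUCKETS):
    """Group (signal_id, day, value) rows into bucketed partitions."""
    partitions = defaultdict(list)
    for signal_id, day, value in rows:
        partitions[partition_key(signal_id, day, x)].append((signal_id, day, value))
    return partitions


def query_signal(partitions, signal_id, day, x=NUM_BUCKETS):
    """Read only the one bucket that can contain the signal (pruning)."""
    bucket = partitions.get(partition_key(signal_id, day, x), [])
    return [row for row in bucket if row[0] == signal_id]


day = date(2015, 1, 9)
table = load([
    (10115, day, 5.6),
    (10715, day, 9.8),
    (99074, day, 3.3),
    (10074, day, 34.0),
])
hit = query_signal(table, 10715, day)
```

With x buckets per day, a single-signal query reads roughly 1/x of the daily volume, which matches the slide's figure of 4 GB instead of 40 GB for 10 partitions per day.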


Profiling Impala queries execution (parquet)
• Workload evenly distributed across our test cluster
  • All machines similarly loaded
• Sustained IO load: however storage not pushed to the limits
• Our tests are CPU-bound

[Figure: cluster profiling charts: CPU fully utilised on all cluster nodes; constant IO load]


Benefits from a columnar store when using parquet

• Test done with complex analytic query

• Joining 5 tables with 1400 columns in total (50 used)

[Figure: amount of data read (0 to 120 GB) by data format (Parquet, Avro). Data labels as extracted: 4.4 GB, 96.2 GB, 110.9 GB; execution times: 16 s, 53 s]
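Why a columnar format reads less data when only a few columns are used (50 of 1400 in the test above) can be illustrated with a toy Python model. It is a sketch under simplifying assumptions: "cells touched" stands in for bytes read, the tiny table is made up, and real Parquet files add row groups, encodings and compression that are not modelled here.

```python
def to_columnar(rows, column_names):
    """Convert a list of row tuples into a dict of per-column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def column_sum_row_store(rows, col_index):
    """Row layout: each whole row is read to extract a single column."""
    total, cells_touched = 0, 0
    for row in rows:
        cells_touched += len(row)
        total += row[col_index]
    return total, cells_touched


def column_sum_column_store(columns, name):
    """Columnar layout: only the requested column is read."""
    values = columns[name]
    return sum(values), len(values)


names = ("a", "b", "c", "d")
rows = [(1, 10, 100, 1000), (2, 20, 200, 2000), (3, 30, 300, 3000)]
columns = to_columnar(rows, names)

row_result = column_sum_row_store(rows, names.index("b"))
col_result = column_sum_column_store(columns, "b")
```

Both layouts give the same answer, but the row layout touches every cell of the table while the columnar layout touches only the cells of the column being queried.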


What we like about Impala
• Functionalities
  • SQL for MPP
  • Extensive execution profiles
  • Support of multiple data formats and compressions
  • Easy to integrate with other systems (ODBC, JDBC)
• Performance
  • Scalability
  • Data partitioning
  • Short-circuit reads & data locality


Adoption of SQL on Hadoop
• Plans for the future
  • Bring Impala pilot project to production
  • Develop more solutions for our user community
  • Integration with current systems (Oracle)
• Looking forward to product enhancements
  • For example indexes


Conclusions
• Hadoop is good for data warehousing
  • scalable
  • many interfaces to the data
  • already in use at CERN for dashboards, system log analysis, analytics
• Impala (SQL on Hadoop) performs and scales
  • data format choice is key (Avro, Parquet)
  • good solution for our time series DBs


Acknowledgements
• CERN users community
  • Ch. Roderick, P. Sowinski, J. Wozniak
  • M. Berges, P. Golonka, A. Voitier
• CERN IT-DB
  • M. Grzybek, L. Canali, D. Lanza, E. Grancher, M. Limper, K. Surdy