Post on 16-Apr-2017

Egor Pakhomov
Data Architect, AnchorFree
egor@anchorfree.com

Data infrastructure architecture for a medium size organization: tips for collecting, storing and analysis.

Medium organization (<500 people):
• Data customers: >10
• Data volume: “Big data”
• Data team people resources: enough to integrate and support some open source stack
• Financial resources: enough to buy hardware for a Hadoop cluster

Big organization (>500 people):
• Data customers: >100
• Data volume: “Big data”
• Data team people resources: enough to write our own data tools
• Financial resources: enough to buy some cloud solution (Databricks cloud, Google BigQuery...)

Data infrastructure architecture

HOW TO MANAGE BIG DATA WHEN YOU ARE NOT THAT BIG?

About me

Data architect at AnchorFree
Spark contributor since 0.9
Integrated Spark in Yandex Islands. Worked in Yandex Data Factory.

Participated in “Alpine Data” development, a Spark-based data platform.

Agenda

1. Data Querying: Why SQL is important and how to use it in Hadoop?
   • SQL vs R/Python
   • Impala vs Spark
   • Zeppelin vs SQL desktop client

2. Data Storage: How to store data to query it fast and change it easily?
   • JSON vs Parquet
   • Schema vs schema-less

3. Data Aggregation: How to aggregate your data to work better with BI tools?
   • Aggregate your data!
   • SQL code is code!

1. Data Querying

Why SQL is important and how to use it in Hadoop?

1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client

SQL sits at the center: BI, Analysts, QA, and regular data transformations all depend on it.

What do you need from a SQL engine?

• Fast
• Reliable
• Able to process terabytes of data
• Support for the Hive metastore
• Support for modern SQL statements

Hive metastore role

The Hive Metastore maps table names to the files in HDFS that back them:

table_1 -> file341, file542, file453
table_2 -> file457, file458, file459
table_3 -> file37, file568, file359
table_4 -> file3457, file568, file349
…..

Each SQL engine’s driver asks the metastore which files belong to a table; the executors then read those files from HDFS.
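As an illustration only, the lookup above can be mimicked with a toy mapping in plain Python (table and file names are the ones from the slide; the function name is invented):

```python
# Toy illustration of the Hive metastore's role: it maps logical table
# names to the physical files that back them, so every SQL engine
# (SparkSQL, Impala, ...) resolves the same table to the same data.
metastore = {
    "table_1": ["file341", "file542", "file453"],
    "table_2": ["file457", "file458", "file459"],
}

def files_for(table: str) -> list[str]:
    """What a SQL engine's driver asks the metastore before planning a query."""
    return metastore[table]

print(files_for("table_1"))  # ['file341', 'file542', 'file453']
```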

Which one would you choose? Both!

                                  SPARKSQL   IMPALA
SUPPORT HIVE METASTORE                +         +
FAST                                  -         +
RELIABLE (WORKS NOT ONLY IN RAM)      +         -
JSON SUPPORT                          +         -
HIVE COMPATIBLE SYNTAX                +         -
OUT OF THE BOX YARN SUPPORT           +         -
MORE THAN JUST A SQL FRAMEWORK        +         -

Connect Tableau to Hadoop

Step 1: put an ODBC/JDBC server in front of Hadoop, so Tableau connects as Tableau -> ODBC/JDBC server -> Hadoop.

Give SQL to users

Step 2: give users SQL access through the same ODBC/JDBC server to Hadoop.

Would not work...

1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling

No decent resource scheduling: One user blocks everyone


No decent resource scheduling: Hadoop is good at resource scheduling!
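For illustration, this is roughly what a YARN Fair Scheduler allocation file looks like; the queue names and limits here are hypothetical, not AnchorFree’s actual configuration:

```xml
<!-- Hypothetical fair-scheduler.xml: caps each queue so one
     user or job cannot block everyone else on the cluster. -->
<allocations>
  <queue name="analysts">
    <maxResources>40000 mb, 20 vcores</maxResources>
    <weight>1.0</weight>
  </queue>
  <queue name="etl">
    <maxResources>80000 mb, 40 vcores</maxResources>
    <weight>2.0</weight>
  </queue>
</allocations>
```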

Apache Zeppelin is our solution

1. Web-based

2. Notebook-based

3. Great visualisation

4. Works with both Impala and Spark

5. Has a supported cloud offering: Zeppelin Hub from NFLabs

It’s great!

Apache Zeppelin integration: Zeppelin sits in front of the Hadoop cluster.
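For example, a Zeppelin paragraph using the %sql interpreter might look like the fragment below (the table name follows the datasource used later in this deck; the columns are invented):

```
%sql
-- Zeppelin paragraph: the result renders as a table or chart in the browser
SELECT country, count(*) AS users
FROM logs.parquet_datasource
GROUP BY country
```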

2. Data Storage

How to store data to query it fast and change it easily?

1. JSON vs Parquet
2. Schema vs schema-less

What would you need from data storage?

• Flexible format
• Fast querying
• Access to “raw” data
• Has schema

Can we choose just one data format? We need both!

                        JSON    PARQUET
FLEXIBLE                 +
ACCESS TO “RAW” DATA     +
FAST QUERYING                      +
HAVE SCHEMA                        +
IMPALA SUPPORT                     +

FORMAT    QUERY                                                                         QUERY TIME
Parquet   SELECT sum(some_field) FROM logs.parquet_datasource                           136 sec
JSON      SELECT sum(get_json_object(line, '$.some_field')) FROM logs.json_datasource   764 sec

Parquet is 5 times faster!

But when you only occasionally need the raw data, 5 times slower is not that bad.

Let’s compare elegance and speed: how data in these formats compare.

JSON (schema-less; every record can carry its own fields):

{“First name”: “Mike”, “Last name”: “Smith”, “Gender”: “Male”, “Country”: “US”}
{“First name”: “Anna”, “Last name”: “Smith”, “Age”: “45”, “Country”: “Canada”, “Comments”: “Some additional info”}
...

Parquet (fixed schema; fields a record lacks become NULL):

FIRST NAME   LAST NAME   GENDER   AGE
Mike         Smith       Male     NULL
Anna         Smith       NULL     45
...          ...         ...      ...
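The projection from schema-less JSON onto a fixed schema can be sketched in a few lines of plain Python (the records follow the slide; the schema list is an assumption made for the example):

```python
import json

# Schema-less JSON lines: each record may carry different fields.
raw = [
    '{"First name": "Mike", "Last name": "Smith", "Gender": "Male", "Country": "US"}',
    '{"First name": "Anna", "Last name": "Smith", "Age": "45", "Country": "Canada", '
    '"Comments": "Some additional info"}',
]

# Projecting onto a fixed schema (what a Parquet writer effectively does):
# fields a record lacks become NULL/None.
schema = ["First name", "Last name", "Gender", "Age"]
rows = [tuple(json.loads(line).get(col) for col in schema) for line in raw]

print(rows)
# [('Mike', 'Smith', 'Male', None), ('Anna', 'Smith', None, '45')]
```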

3. Data Aggregation

How to aggregate your data to work better with BI tools?

1. Aggregate your data!
2. SQL code is code!

● “Big data” does not mean you need to query all of the data daily

● BI tools should not do big queries

Aggregate your data!

How aggregation works:

Git with queries -> Query executor -> Aggregated table -> BI tool (“select * from ...”)
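A minimal sketch of this idea, using sqlite3 as a stand-in for the warehouse (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw events: too big for a BI tool to scan on every dashboard refresh.
conn.execute("CREATE TABLE events (day TEXT, country TEXT, bytes INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("2017-04-16", "US", 100), ("2017-04-16", "US", 50), ("2017-04-16", "CA", 70)],
)

# Daily aggregation job: one big query, run once, materialized as a small table.
conn.execute(
    """CREATE TABLE daily_traffic AS
       SELECT day, country, sum(bytes) AS total_bytes
       FROM events GROUP BY day, country"""
)

# The BI tool only ever does cheap "select * from ..." scans of the aggregate.
print(conn.execute("SELECT * FROM daily_traffic ORDER BY country").fetchall())
# [('2017-04-16', 'CA', 70), ('2017-04-16', 'US', 150)]
```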

Report development process:

1. Creating an aggregated table in Zeppelin
2. Creating a BI report based on this table
3. Adding the queries to git to run daily
4. Publishing the report

Data-for-a-report changing process:

1. Change the query in git
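A toy version of the daily job, assuming the convention (invented here) that aggregation queries live as numbered .sql files in the repo and are replayed in name order:

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical layout: each versioned query is a .sql file in a git checkout.
repo = Path(tempfile.mkdtemp())
(repo / "01_daily_traffic.sql").write_text(
    "CREATE TABLE daily_traffic AS "
    "SELECT country, count(*) AS n FROM events GROUP BY country"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT)")
conn.executemany("INSERT INTO events VALUES (?)", [("US",), ("US",), ("CA",)])

# The "query executor": run every versioned query, in order.
for sql_file in sorted(repo.glob("*.sql")):
    conn.executescript(sql_file.read_text())

print(conn.execute("SELECT * FROM daily_traffic ORDER BY country").fetchall())
# [('CA', 1), ('US', 2)]
```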

One more tip: we do not use the Spark that comes with the Hadoop installation.

1. We need to apply our own patches to the source code.
2. We want to move to new versions before any official release.
3. We want to move part of the infrastructure to a new version while the rest remains on the old one.

Questions?

Contact: Egor Pakhomov
egor@anchorfree.com
pahomov.egor@gmail.com
https://www.linkedin.com/in/egor-pakhomov-35179a3a
