Synthetic Data Generation for Realistic Analytics Examples and Testing Ronald J. Nowling Red Hat, Inc. [email protected] http://rnowling.github.io/

Post on 20-May-2020


Page 1: Synthetic Data Generation for Realistic Analytics …rnowling.github.io/static/rnowling_apache_big_data_eu...• Data Science Team, Emerging Technologies – Evaluate open-source Big

Synthetic Data Generation for Realistic Analytics Examples and Testing

Ronald J. Nowling
Red Hat, Inc.
[email protected]
http://rnowling.github.io/


Who Am I?

•  Software Engineer at Red Hat
•  Data Science Team, Emerging Technologies
   – Evaluate open-source Big Data space
   – Ensure software works for Red Hat customers
   – Promote data science internally through consulting projects
•  Apache BigTop PMC


Synthetic Data

•  No licensing, privacy, or intellectual property concerns
•  Scalable: Laptops to Clusters!
•  More reliable than external data sets
•  Enable more realistic example applications
•  Enable more comprehensive testing than wordcount and TeraSort


Data Transformation and Summarization Pipeline

[Pipeline diagram. Stages shown: Raw Daily Page Views, Transform Raw Text, Parse, and Clean & Validate (each drawn three times in parallel); Accounts; Summarize (drawn three times); Aggregate; outputs DailyActivity and CumulativeActivity.]


Timings

•  Data set
   – 1000’s of files
   – 100’s of GBs compressed (gzip)
•  Conversion from .tsv.gz -> Parquet: ~45 min
•  Compute aggregations on Parquet data and write out: ~2 min


Synthetic Data

•  Sensitive Data
   – Real data on cluster for scalability testing and validation
   – Synthetic data for local development and testing
•  Smaller data sets for checking calculations
   – Total aggregation results require re-running old pipeline
   – Extra burden on operations team
   – Delay for development team


Validation with Synthetic Data

[Diagram. Components shown: DataGenerator, Accounts, Raw Daily Page Views, Expected Daily Activity, Expected Cumulative Activity, Transformation and Summarization Pipeline, Daily Activity, Cumulative Activity, ValidationScript.]
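The validation loop named on this slide can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the record layout (one `(account, day)` tuple per page view) and the names `generate`, `run_pipeline`, and `validate` are assumptions. The generator picks the daily totals first and then emits raw records that match them, so the expected daily and cumulative activity are known by construction.

```python
import random
from collections import defaultdict
from itertools import accumulate

def generate(accounts, days, seed=0):
    """Hypothetical generator: raw page views plus the results they must produce."""
    rng = random.Random(seed)
    raw, expected_daily, expected_cumulative = [], {}, {}
    for account in accounts:
        totals = [rng.randint(0, 5) for _ in days]
        for day, total in zip(days, totals):
            expected_daily[(account, day)] = total
            raw.extend((account, day) for _ in range(total))  # one record per view
        for day, running in zip(days, accumulate(totals)):
            expected_cumulative[(account, day)] = running
    rng.shuffle(raw)  # the pipeline must not depend on record order
    return raw, expected_daily, expected_cumulative

def run_pipeline(raw, accounts, days):
    """Stand-in for the real pipeline: count views per day, then accumulate."""
    counts = defaultdict(int)
    for account, day in raw:
        counts[(account, day)] += 1
    daily = {(a, d): counts[(a, d)] for a in accounts for d in days}
    cumulative = {}
    for a in accounts:
        running = 0
        for d in days:
            running += daily[(a, d)]
            cumulative[(a, d)] = running
    return daily, cumulative

def validate(actual, expected):
    """Return the keys whose values differ between actual and expected output."""
    return sorted(k for k in expected.keys() | actual.keys()
                  if actual.get(k) != expected.get(k))

accounts = ["a1", "a2"]
days = ["2015-09-01", "2015-09-02", "2015-09-03"]
raw, exp_daily, exp_cum = generate(accounts, days, seed=42)
daily, cum = run_pipeline(raw, accounts, days)
print(validate(daily, exp_daily), validate(cum, exp_cum))
```

Both lists are empty when the pipeline agrees with the generator; any mismatched `(account, day)` key points directly at the record group to inspect.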


Issues Tackled

•  Error in account validation introduced while refactoring code
•  Usage of the correct join types
•  Validation of date-time operations
•  Correct Output Formats
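The slide does not say which join was at fault. As one illustration of how a join type can silently change a summary, the sketch below (with made-up account and page-view data) compares an inner join with a left outer join: the inner join drops accounts that had no page views, while the left join keeps them with a count of zero.

```python
# Hypothetical data: account a2 exists but had no page views on this day.
accounts = {"a1": "Alice", "a2": "Bob", "a3": "Carol"}
daily_views = {"a1": 4, "a3": 1}

# Inner join: only keys present on both sides survive.
inner = {key: (accounts[key], daily_views[key])
         for key in accounts.keys() & daily_views.keys()}

# Left outer join: every account survives; missing views default to 0.
left = {key: (name, daily_views.get(key, 0))
        for key, name in accounts.items()}

print(sorted(inner))
print(sorted(left))
```

A generator that deliberately includes inactive accounts makes this kind of difference visible in the validation step.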


Gzipped Files

•  Gzip doesn’t support random access
   – Entire file needs to be decompressed sequentially
•  Large files
   – Multiple gigabytes uncompressed
•  Too many files read in parallel -> long GC or OOM errors
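Because a gzip file has to be read front to back, memory stays bounded only if the consumer streams it instead of materializing the whole file. The sketch below shows that pattern with Python's standard `gzip` module; it is not the Spark code from the talk, and the file name, record layout, and helper names are assumptions.

```python
import gzip
import os
import tempfile

def iter_lines(path):
    """Yield decoded lines one at a time; decompression proceeds sequentially."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="\n") as handle:
        for line in handle:
            yield line.rstrip("\n")

def write_sample(path, n):
    """Write n made-up page-view records so the example is self-contained."""
    with gzip.open(path, mode="wt", encoding="utf-8", newline="\n") as handle:
        for i in range(n):
            handle.write(f"account{i % 3}\t2015-09-01\t/page/{i}\n")

path = os.path.join(tempfile.mkdtemp(), "views.tsv.gz")
write_sample(path, 1000)

# Only one line is held in memory at a time, however large the file is.
line_count = sum(1 for _ in iter_lines(path))
print(line_count)
```

The same reasoning explains the parallelism limit on the slide: each file being read occupies one task for its full length, so the number of concurrent readers, not the file count, drives memory pressure.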


(Quirky) TSV Files

•  Tab-separated, no quoting
•  Escaped tabs and newlines within records
   – E.g., \\n or \\t
•  Improperly escaped tabs and newlines
   – E.g., \\\t vs \\\\t
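A parser for this kind of file has to tell a separator tab from an escaped one. The sketch below assumes one plausible convention, which the slides do not spell out: a literal tab separates fields, and a backslash followed by `t`, `n`, or another backslash encodes a tab, a newline, or a backslash. It walks the line once with an iterator rather than chaining `replace` calls. The function name `parse_tsv_line` and the handling of unknown escapes are assumptions.

```python
def parse_tsv_line(line):
    """Split one record on unescaped tabs and decode backslash escapes."""
    escapes = {"t": "\t", "n": "\n", "\\": "\\"}
    fields = []
    current = []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            following = next(chars, None)
            if following is None:
                current.append("\\")           # lone trailing backslash: keep it
            elif following in escapes:
                current.append(escapes[following])
            else:
                # Unknown or improper escape: keep both characters as data.
                current.append("\\")
                current.append(following)
        elif ch == "\t":
            fields.append("".join(current))    # unescaped tab ends the field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

record = parse_tsv_line("id1\tline one\\nline two\tend")
print(record)
```

Under this policy a backslash directly in front of a literal tab is kept as data instead of ending the field; a real data set may call for a different rule, which is exactly the kind of case synthetic records can pin down.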


Solutions

•  Convert to Parquet as quickly as possible
•  Use fewer cores per node
   – More RAM / task (partition)
•  2-phase grouping algorithm
   – Group within partition
   – Group partition ends using shuffle
   – Union
•  Optimized string operations
   – Use iterators instead of concatenation and replace
   – Custom CSV parser implementation
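One reading of the 2-phase grouping bullets is sketched below in plain Python; the talk does not show the implementation, so the details are assumptions. The sketch assumes records with the same key are contiguous in the input and that a key's run can straddle a partition boundary. Phase 1 groups consecutive records inside each partition. Only the first and last group of a partition can be incomplete, so only those "ends" go through phase 2, where a dictionary stands in for the shuffle. The result is the union of both sets.

```python
from itertools import groupby

def group_within_partition(partition):
    """Phase 1: group consecutive (key, value) records that share a key."""
    return [(key, [value for _, value in run])
            for key, run in groupby(partition, key=lambda record: record[0])]

def two_phase_group(partitions):
    complete = []  # interior groups: cannot cross a partition boundary
    ends = {}      # first/last groups per partition, merged by key (phase 2)
    for partition in partitions:
        groups = group_within_partition(partition)
        for index, (key, values) in enumerate(groups):
            if index == 0 or index == len(groups) - 1:
                ends.setdefault(key, []).extend(values)
            else:
                complete.append((key, values))
    # Union of the untouched interior groups and the merged boundary groups.
    return complete + list(ends.items())

partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("b", 4), ("c", 5), ("d", 6)],
    [("d", 7), ("d", 8)],
]
grouped = dict(two_phase_group(partitions))
print(sorted(grouped.items()))
```

The design point is that only a small fraction of the data (at most two groups per partition) needs to move between partitions, whereas a plain group-by-key would move every record.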


Apache BigTop BigPetStore Blueprints

•  Problem domain: Transactions for a fictional chain of pet stores
•  BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data
•  Blueprints for big data ecosystem
   – Hadoop: MapReduce / Pig / Hive / Mahout
   – Spark
   – Flink (in progress)


BigPetStore

[Diagram built up across five slides. Labels that survive: HCFS, Core (RDDs), Spark SQL, MLlib.]


Team Cluster

•  ~10 nodes
•  40 cores, 400GB RAM per node


Potential Issues

•  Infrastructure
•  Storage
•  Software Installation
•  Software Upgrades
•  Spark Configuration Tuning
•  User Management


Real Stories

•  Creating a new user
   – User Gluster permissions incorrect
•  Cluster upgrade
   – Spark upgrade didn’t take because of issue with Ansible role configuration
   – Wiped out our spark.conf
   – master / mesos settings wrong
•  Gluster mount points disappeared on reboot
   – Not set in fstab


k8petstore

[Architecture diagram. Components shown: Users, Public IP Proxy, Web Application, Redis Master, three Redis Slave instances, and three BPS DataGenerator instances.]


Use Cases

•  Configuration
•  Scalability
•  Fault Tolerance


k8petstore

•  OpenContrail networking solution demo [1]
•  Kubernetes JuJu Charm documentation example [2]
•  Kubernetes v1.0 launch talk at OSCON [3]

[1] - https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/
[2] - http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html
[3] - http://www.oscon.com/open-source-2015/public/schedule/detail/45281


APACHE BIGTOP DATA GENERATORS


BigPetStore


BigTop Weatherman


BigTop Bazaar


Vision

•  Encourage synthetic data generation for testing and realistic examples
•  Serve as a resource for the larger Apache and open source communities
•  Emphasis on
   – Flexibility
   – Scalability
   – Realism
•  We look forward to collaborating and getting folks involved!


Resources

http://bigtop.apache.org/

http://github.com/apache/bigtop

http://rnowling.github.io/


Conclusion

•  Synthetic data generators and blueprints are useful!
•  Case studies:
   – Data Processing Pipelines
   – Cluster Deployment
   – Kubernetes
•  BigPetStore and BigTop Data Generators efforts in Apache BigTop
•  Open invitation to get involved and collaborate


QUESTIONS
