![Page 1: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/1.jpg)
Analysis of Large Datasets in the Life Sciences
and Bioinformatics (19403201)
Tim Conrad
Session 15
![Page 2: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/2.jpg)
• Technologies / Frameworks for Big Data analysis
• ETL in the Cloud
![Page 3: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/3.jpg)
Data Streams = continuous flows of data
Example Analyses:
![Page 5: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/5.jpg)
https://databricks.com/session/a-platform-for-large-scale-neuroscience
![Page 8: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/8.jpg)
Neuroscience @ Freeman Lab, Janelia Farm
![Page 12: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/12.jpg)
Analyzing STREAM DATA:
Ingest, Process, Store
![Page 13: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/13.jpg)
http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration/
Common Pipeline: Ingest, Process, Store
Processing Stack
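The three stages of this pipeline can be sketched in plain Python with generators. This is only an illustration of the ingest, process, store pattern and not how any of the systems on the following slides implement it; the `sensor,value` line format, the rolling-mean analysis, and all function names are assumptions made for the example.

```python
from collections import deque
from typing import Iterable, Iterator, List


def ingest(raw_events: Iterable[str]) -> Iterator[dict]:
    """Ingest: parse raw 'sensor,value' lines one at a time, skipping malformed ones."""
    for line in raw_events:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # wrong number of fields
        try:
            yield {"sensor": parts[0], "value": float(parts[1])}
        except ValueError:
            continue  # value is not numeric


def process(events: Iterable[dict], window: int = 3) -> Iterator[dict]:
    """Process: attach the mean of the last `window` values to every event."""
    recent = deque(maxlen=window)  # old values drop out automatically
    for event in events:
        recent.append(event["value"])
        yield {**event, "rolling_mean": sum(recent) / len(recent)}


def store(events: Iterable[dict], sink: List[dict]) -> int:
    """Store: append processed events to a sink (a list standing in for a database)."""
    count = 0
    for event in events:
        sink.append(event)
        count += 1
    return count


def run_pipeline(raw_events: Iterable[str], sink: List[dict], window: int = 3) -> int:
    """Chain the stages; each event flows through all three before the next is read."""
    return store(process(ingest(raw_events), window), sink)
```

Because every stage is a generator, no stage needs the full data set in memory, which is the property that lets the same structure handle an unbounded stream.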
![Page 14: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/14.jpg)
How to process big streaming data
![Page 15: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/15.jpg)
How to process big streaming data
![Page 17: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/17.jpg)
Stream Ingestion Systems
![Page 18: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/18.jpg)
Stream Processing Systems
![Page 19: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/19.jpg)
Stream Storing Systems
![Page 20: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/20.jpg)
Frameworks / Technologies for Big Data Analysis
![Page 30: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/30.jpg)
https://www.elastic.co/de/products/kibana
![Page 31: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/31.jpg)
Technologies "in the field"
![Page 32: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/32.jpg)
Cloud ETL Tools
• AWS Glue: Serverless ETL
• Azure Data Factory: Visual Cloud ETL
• Google Cloud Dataflow: Unified Programming Model for Data Processing
![Page 33: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/33.jpg)
AWS Glue architecture
Source: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
![Page 34: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/34.jpg)
AWS Glue
Components:
• Data Catalog
• Crawlers
• ETL jobs/scripts
• Job scheduler
Useful for…
• …running serverless queries against S3 buckets and relational data
• …creating event-driven ETL pipelines
• …automatically discovering and cataloging your enterprise data assets
![Page 35: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/35.jpg)
Data Catalog - the central component
• Can act as a metadata repository for other Amazon services
• Tables: added to “databases” using the wizard or a crawler
• Crawlers connect to one or more data stores, determine the data structures, and write tables into the Data Catalog
• Data sources: Amazon S3, Redshift, Aurora, Oracle, PostgreSQL, MySQL, MariaDB, MS SQL Server, JDBC, DynamoDB
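The crawler idea (inspect a data store, determine its structure, write a table definition into the catalog) can be illustrated with a toy version in plain Python. This is a conceptual sketch only, not the AWS Glue API: the nested-dict catalog, the function names, and the type labels are hypothetical choices made for the example.

```python
from typing import Dict, Iterable


def infer_type(value: object) -> str:
    """Map a Python value to an illustrative column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def crawl(records: Iterable[dict]) -> Dict[str, str]:
    """Infer a column -> type schema; columns with conflicting types become 'string'."""
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            inferred = infer_type(value)
            if column not in schema:
                schema[column] = inferred
            elif schema[column] != inferred:
                schema[column] = "string"  # fall back to the most general type
    return schema


def register_table(catalog: dict, database: str, table: str, schema: Dict[str, str]) -> None:
    """Write the table definition into the catalog under its database."""
    catalog.setdefault(database, {})[table] = schema
```

Usage: `register_table(catalog, "lab", "samples", crawl(rows))` leaves the inferred schema at `catalog["lab"]["samples"]`, where other tools could look it up instead of re-reading the data.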
![Page 36: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/36.jpg)
Jobs
• PySpark or Scala scripts, generated by AWS Glue
• Visual dataflow can be generated, but not used for development
• Execute ETL using the job scheduler, events, or manual invocation
• Built-in transforms used to process data

Built-in transforms:
• ApplyMapping: maps source and target columns
• Filter: load new DynamicFrame based on filtered records
• SelectFields: output selected fields to new DynamicFrame
• SplitRows: split rows into two new DynamicFrames based on a predicate
• Join: joins two DynamicFrames
• SplitFields: split fields into two new DynamicFrames
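To make the semantics of these six transforms concrete, here is a plain-Python imitation in which a "frame" is simply a list of dicts. It does not use the `awsglue` library and does not reproduce its signatures; the function names, the three-element mapping tuples, and the inner-join behavior are simplifying assumptions for illustration.

```python
from typing import Callable, List, Tuple

Frame = List[dict]  # stand-in for a DynamicFrame


def apply_mapping(rows: Frame, mappings: List[Tuple[str, str, Callable]]) -> Frame:
    """Rename and cast columns; each mapping is (source name, target name, cast)."""
    return [{target: cast(row[source]) for source, target, cast in mappings} for row in rows]


def filter_rows(rows: Frame, predicate: Callable[[dict], bool]) -> Frame:
    """Keep only the records that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def select_fields(rows: Frame, fields: List[str]) -> Frame:
    """Keep only the named fields of every record."""
    return [{name: row[name] for name in fields} for row in rows]


def split_rows(rows: Frame, predicate: Callable[[dict], bool]) -> Tuple[Frame, Frame]:
    """Split into (records matching the predicate, all other records)."""
    matching = [row for row in rows if predicate(row)]
    rest = [row for row in rows if not predicate(row)]
    return matching, rest


def split_fields(rows: Frame, fields: List[str]) -> Tuple[Frame, Frame]:
    """Split into (frame with the named fields, frame with the remaining fields)."""
    chosen = [{k: v for k, v in row.items() if k in fields} for row in rows]
    others = [{k: v for k, v in row.items() if k not in fields} for row in rows]
    return chosen, others


def join(left: Frame, right: Frame, left_key: str, right_key: str) -> Frame:
    """Inner join: merge every left record with each right record sharing its key."""
    index: dict = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]
```

A typical job chains them, for example `apply_mapping` to normalize names and types, then `split_rows` to route records above and below a threshold to different targets.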
![Page 39: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/39.jpg)
Google Cloud Dataflow overview
Unified programming model for batch (historical) and streaming (real-time) data pipelines, and a distributed compute platform
• Reuse code across both batch and streaming pipelines
• Java or Python based
The programming model is an open source project: Apache Beam (https://beam.apache.org/)
• Runs on multiple distributed processing back-ends: Spark, Flink, Cloud Dataflow
Fully managed service
• Automated resource management and scale-out
Source: https://cloud.google.com/dataflow/
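The "reuse code across batch and streaming" point can be illustrated without any framework: one transform is written once and applied either to the whole bounded data set or to each window of a stream. The sketch below is plain Python and not the Apache Beam API; the function names, the `(timestamp, key)` record shape, and the assumption that events arrive in timestamp order are all made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[int, str]  # (timestamp in seconds, key)


def count_by_key(keys: Iterable[str]) -> Dict[str, int]:
    """The shared transform: count how often each key occurs in a bounded collection."""
    return dict(Counter(keys))


def run_batch(records: Iterable[Record]) -> Dict[str, int]:
    """Batch mode: the whole historical data set is treated as one collection."""
    return count_by_key(key for _, key in records)


def run_streaming(records: Iterable[Record], window_seconds: int) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Streaming mode: apply the same transform to each fixed-size time window.

    Assumes in-order timestamps, so a window is complete (and emitted)
    as soon as the first event of a later window arrives.
    """
    current_start = None
    keys = []
    for timestamp, key in records:
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            yield current_start, count_by_key(keys)
            keys = []
        current_start = start
        keys.append(key)
    if current_start is not None:
        yield current_start, count_by_key(keys)  # flush the last open window
```

The analysis logic lives only in `count_by_key`; the two runners differ just in how they cut the input into bounded pieces, which is the separation a unified model provides. Real runners additionally handle late and out-of-order data, which this sketch ignores.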
![Page 40: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/40.jpg)
Dataflow I/O transforms
![Page 42: Analyse von großen Datensätzen in den ...medicalbioinformatics.de/downloads/lectures/BigData_Analysis_in... · Stream Processing Systems. Stream . Storing Systems. Frameworks](https://reader033.vdocuments.net/reader033/viewer/2022042307/5ed3734e87321e22e04826a7/html5/thumbnails/42.jpg)
Azure Data Factory
Build data pipelines using a visual ETL user interface
• Visual Studio Team Services (VSTS) Git integration for collaboration, source control, and versioning
Drag, drop, link activities
• Copy Data: Source to Target
• Transform: Spark, Hive, Pig, streaming on HDInsight, Stored Procedures, ML Batch Execution, etc.
• Control flow: If-then-else, For-each, Lookup, etc.
Source: https://azure.microsoft.com/en-us/blog/continuous-integration-and-deployment-using-data-factory/