![Page 1: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/1.jpg)
MAKING BIG DATA COME ALIVE
Integrating Apache Spark and NiFi for Data Lakes
Ron Bodkin, Founder & President
Scott Reisdorf, R&D Architect
![Page 2: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/2.jpg)
Agenda
• Requirements
• Design
• Demo
![Page 3: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/3.jpg)
Goals for a Data Lake
• A central repository with trusted, consistent data
• Reduce costs by offloading analytical systems and archiving cold data
• Derive value quickly with easier discovery and prototyping
• A laboratory for experimenting with new technologies and data
![Page 4: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/4.jpg)
What’s Needed for a Hadoop Data Lake?
• Automation of pipelines with metadata and performance tracking
• Governance with clear distinction of roles and responsibilities
• SLA tracking with alerts on failures or violations
• Interactive data discovery and experimentation
![Page 5: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/5.jpg)
Example Ingestion Project
• 4000+ unique flat files and RDBMS tables, plus a few streaming data feeds
• Mix of incremental and snapshot data
• Ingest into Hadoop (at minimum, HDFS and Hive tables)
• Cleansing/encryption and data validation
• Metadata capture
Focus shifts over time from data ingestion to transformation, then to analytics
![Page 6: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/6.jpg)
Design
![Page 7: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/7.jpg)
Apache Spark Functions
• Cleanse
• Validate
• Profile
• Wrangle
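To make the first three functions concrete, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The sample rows, field names, and rules are all assumptions made for illustration; they are not taken from the presenters' framework, and wrangling is not shown.

```python
# Hypothetical sample rows; field names and values are illustrative only.
ROWS = [
    {"id": "1", "email": "  Ann@Example.com ", "age": "34"},
    {"id": "2", "email": "bob@example.com", "age": "-5"},
    {"id": "3", "email": "", "age": "abc"},
]

def cleanse(row):
    """Trim whitespace on every field and lower-case the email."""
    out = {key: value.strip() for key, value in row.items()}
    out["email"] = out["email"].lower()
    return out

def validate(row):
    """Return the list of rule violations for one row (empty means valid)."""
    problems = []
    if "@" not in row["email"]:
        problems.append("email: missing or malformed")
    if not row["age"].isdigit():
        problems.append("age: not a non-negative integer")
    return problems

def profile(rows, field):
    """Compute simple column statistics: row count, empty values, distinct values."""
    values = [row[field] for row in rows]
    return {
        "count": len(values),
        "empty": sum(1 for value in values if value == ""),
        "distinct": len(set(values)),
    }

cleansed = [cleanse(row) for row in ROWS]
valid = [row for row in cleansed if not validate(row)]
invalid = [row for row in cleansed if validate(row)]
email_profile = profile(cleansed, "email")
```

In a Spark job the same three stages would typically be expressed as DataFrame transformations, with invalid rows routed to a separate table instead of a Python list.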
![Page 8: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/8.jpg)
© 2016 Think Big, a Teradata Company · 05/03/2023
Pipeline design with Apache NiFi
• Visual drag-and-drop
• Dozens of data connectors
• 150+ pre-built transforms
• Data lineage
• Batch and streaming
• Extensible
![Page 9: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/9.jpg)
Role separation

Apache NiFi
• IT Designers design models in NiFi
• Register with framework
• Integrated development process

Think Big framework
• Users configure new feeds
• Based on common model
• Generated and executed in NiFi

Diagram labels between the two: register, deploy
![Page 10: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/10.jpg)
Design Approach
• User features around org. roles
• Visual design
• Streaming and batch
• Fully governed
• Integrated best practices
• Secure, modern architecture
• Will be open source (Apache license)
![Page 11: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/11.jpg)
Ingest and Prepare
• UI-guided feed creation
• Data protection
• Data cleanse
• Data validation
• Data profiling
• Powered by Apache Spark
![Page 12: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/12.jpg)
Data Ingest Model
Metadata determines behavior of individual components
Adds many Hadoop-specific higher-level NiFi processors

Sources
• Extract Table (JDBC)
• Get File(s) (Filesystem)
• Message (JMS/Kafka)
• Other (HTTP/REST, etc.)

Processing components
• Unpack and/or merge small files
• Put file (HDFS)
• Cleanse/Standardize (Spark)
• Data Profile (Spark)
• Validate (Spark)
• Index Text (Elasticsearch)
• Merge / Dedupe (Hive)
• Compress & Archive Originals (HDFS, S3)

Other diagram labels: Metadata, Data policies
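One way to read "metadata determines behavior of individual components" is that each feed carries a small metadata record, and the pipeline decides which components to run from it. The sketch below illustrates that idea only. The metadata keys, the step names, and the ordering are hypothetical and do not reflect the framework's actual schema or its NiFi processors.

```python
# Hypothetical feed metadata; keys and step names are illustrative,
# not the framework's real schema.
FEED_METADATA = {
    "name": "customer_orders",
    "source": "jdbc",
    "steps": ["cleanse", "validate", "profile"],
    "index_text": False,
    "archive_originals": True,
}

def plan_steps(meta):
    """Derive the list of component names to run from a feed's metadata."""
    # Every feed starts by extracting from its source and landing data in HDFS
    # (an assumed ordering for this sketch).
    plan = ["extract_" + meta["source"], "put_hdfs"]
    # Optional Spark-backed steps are listed explicitly in the metadata.
    plan.extend(meta.get("steps", []))
    # Boolean flags switch optional components on or off.
    if meta.get("index_text"):
        plan.append("index_text")
    if meta.get("archive_originals"):
        plan.append("compress_and_archive")
    return plan

plan = plan_steps(FEED_METADATA)
print(plan)
```

Changing a flag such as `index_text` in the metadata changes the plan without touching the pipeline code, which is the property the slide emphasizes.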
![Page 13: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/13.jpg)
Data self-service and “wrangle”
• Graphical SQL builder
• 100+ transform functions
• Machine learning
• Publish and schedule
• Powered by Apache Spark
![Page 14: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/14.jpg)
Data Discovery
• Google-like searching
• Extensible metadata
• Data profile
• Data sampling
![Page 15: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/15.jpg)
Operations
• Dashboard
• Health monitoring
• Data confidence
• SLA enforcement
• Alerts
• Performance reports
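SLA enforcement with alerts can be pictured as a check on each feed run: did it finish, and did it finish within the agreed time? The following is a minimal sketch under that assumption. The function name, the alert text, and the 30-minute threshold are invented for illustration and are not the framework's API.

```python
from datetime import datetime, timedelta

def check_sla(feed_name, started_at, finished_at, max_duration):
    """Return an alert message if the run failed or exceeded its SLA, else None."""
    if finished_at is None:
        # A run with no finish time is treated as a failure.
        return f"ALERT {feed_name}: run did not finish"
    elapsed = finished_at - started_at
    if elapsed > max_duration:
        return f"ALERT {feed_name}: took {elapsed}, SLA is {max_duration}"
    return None

start = datetime(2016, 1, 1, 2, 0)
sla = timedelta(minutes=30)  # hypothetical SLA for this feed

on_time = check_sla("orders", start, start + timedelta(minutes=20), sla)
late = check_sla("orders", start, start + timedelta(minutes=45), sla)
failed = check_sla("orders", start, None, sla)
```

Here `on_time` is `None`, while `late` and `failed` each carry a message that a dashboard or notification channel could surface.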
![Page 16: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/16.jpg)
Elasticsearch – Full Text Indexing
• Powerful search capabilities for users against data (think Google-like searching)
• NiFi processor extracts source data from a Hadoop table for indexing in Elasticsearch
• Incremental updates during ingest

Diagram labels
• Data Lake query: select id, user, tweet from twitter_feed
• extract JSON
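The "extract JSON" step can be sketched as turning query result rows into the newline-delimited JSON body that the Elasticsearch bulk API accepts. This is only an illustration: the sample rows stand in for the result of the query on the slide, the index name `twitter_feed` is an assumption, and the actual NiFi processor may work differently.

```python
import json

# Hypothetical rows, standing in for the result of the query on the slide.
ROWS = [
    {"id": 1, "user": "alice", "tweet": "Loving the new data lake"},
    {"id": 2, "user": "bob", "tweet": "NiFi flows are running"},
]

def to_bulk_payload(rows, index_name):
    """Build a newline-delimited JSON body in the Elasticsearch bulk format."""
    lines = []
    for row in rows:
        # Action line: index this document under the row's id.
        action = {"index": {"_index": index_name, "_id": str(row["id"])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

payload = to_bulk_payload(ROWS, "twitter_feed")
```

Reusing the row id as the document `_id` means that re-indexing the same row overwrites the earlier document, which suits incremental updates during ingest.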
![Page 17: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/17.jpg)
Demo
![Page 18: Integrating Apache Spark and NiFi for Data Lakes](https://reader033.vdocuments.net/reader033/viewer/2022061610/586e8d1c1a28aba0038b8749/html5/thumbnails/18.jpg)