Destroying Data Silos
Hellmar Becker
Senior IT Specialist
Hadoop Summit 2015, Brussels
Who am I?
Datalake in ING NL
Integrate all data sources
within the bank into
one processing platform
• Batch data streams
• Live transactions
• Model building for customer interaction
Open source software where possible!
Zoom in: Datalake Archive
Today, let’s focus on one specific part of the story:
• Collect data in a unified format
• Store this data secure from manipulation and unauthorized access
• Make data available to analytical applications (Business Intelligence, Data Science)
A Hadoop-based cluster is a good solution to address these targets
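To make the first two targets concrete, here is a minimal sketch of wrapping records from different sources in one common envelope and attaching a digest so that later changes can be noticed. The envelope layout and the function names `wrap_record` and `is_untampered` are assumptions for illustration and are not taken from the talk.

```python
import hashlib
import json


def _digest(source: str, payload: dict) -> str:
    """SHA-256 over a canonical JSON form, so the digest is reproducible."""
    canonical = json.dumps(
        {"source": source, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def wrap_record(source: str, payload: dict) -> dict:
    """Put a source record into a common envelope (hypothetical layout)."""
    return {"source": source, "payload": payload, "sha256": _digest(source, payload)}


def is_untampered(record: dict) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return _digest(record["source"], record["payload"]) == record["sha256"]


if __name__ == "__main__":
    record = wrap_record("payments", {"account": "NL00TEST0123456789", "amount": 10})
    print(is_untampered(record))       # digest matches the stored value
    record["payload"]["amount"] = 11   # simulate a manipulation
    print(is_untampered(record))       # digest no longer matches
```

A digest stored next to the record only reveals changes made by someone who does not also rewrite the digest. A real archive would likely keep digests in a separate, access-controlled place or sign them.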
Old World vs. New World
Circa 2000: Data Warehouse
• Based on relational database technology (Oracle, DB2, …)
• Challenge 1: Data model is difficult to adapt after the fact
• Challenge 2: Resilience and fault tolerance are not built in
• Challenge 3: Scaling proves difficult and expensive (specialized hardware)
• Challenge 4: RDBMS brings a lot of overhead – e.g. referential integrity
Modern data platforms (Hadoop, Spark, Cassandra) address many of these issues
[Diagram: data warehouse architecture with the labels Operational data, Staging files, ETL, Operational data, Data Marts, Metadata, Detail data, Aggregated data, Reporting, Analytics, Predictive Modeling]
Target: Data Lake Architecture
Pick your battles
• Toolset in the bank has grown around RDBMS and mainframe
• We cannot sweep out everything; we have to handle legacy systems
• Plant a seed: Replace one component and connect it to all legacy interfaces
• Grow from there!
[Diagram: data warehouse architecture, same labels as on the "Old world vs. New World" slide]
Challenges
• Zero Touch Deployment
• Risk issues with deployment tools that require admin (root) access to servers
• Policies within the organization
• Example: The unit of consideration is a single server, but we need to look at entire clusters
• Legacy protocols – Mainframe data formats, e.g. character sets
• Security is paramount – protect sensitive data
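The character-set challenge can be illustrated with a short sketch that decodes a fixed-width mainframe record from EBCDIC into Unicode. The code page (`cp037`), the field widths, and the sample data are assumptions chosen for illustration. The talk does not say which encodings or record layouts were involved.

```python
# Assumed code page: 037 (US/Canada EBCDIC). Other mainframes may use
# cp500, cp1140, or a national variant instead.
EBCDIC_CODEC = "cp037"


def decode_record(raw: bytes, field_widths: list[int]) -> list[str]:
    """Split a fixed-width EBCDIC record into stripped Unicode fields."""
    # EBCDIC is a single-byte encoding, so byte offsets equal character offsets.
    text = raw.decode(EBCDIC_CODEC)
    fields = []
    offset = 0
    for width in field_widths:
        fields.append(text[offset:offset + width].strip())
        offset += width
    return fields


if __name__ == "__main__":
    # Build the sample by encoding known text, so the round trip is checkable.
    sample = "JANSEN    AMSTERDAM ".encode(EBCDIC_CODEC)
    print(decode_record(sample, [10, 10]))
```

Decoding at the point of ingestion keeps the rest of the platform working with one text encoding. The original bytes can still be archived alongside if an exact copy is required.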
Security Concept
Authentication Management
• Using Kerberos – proven technology, secure but hard to configure
• Need to align access with HR database – connect to corporate directory
Authorization Management
• Uniform views across all components of a cluster
• Using Ranger to secure all services with a uniform set of policies
Auditing
• Ranger logs all interactions so that threats can be detected and eliminated
Connecting the Pieces
• Sideline challenge: Linux world and Windows world need to be connected
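The idea of one policy set applied across all services, with every decision audited, can be sketched as a toy model. This is a hypothetical illustration only. It does not use Ranger's API or policy format, and the class names, service names, and group names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    service: str          # e.g. "hdfs" or "hive" (illustrative names)
    resource_prefix: str  # resources covered by this policy
    group: str            # directory group that is allowed to act
    actions: frozenset    # permitted actions, e.g. {"read"}


@dataclass
class PolicyEngine:
    policies: list
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, groups, service, resource, action):
        """Check one request against all policies and record the outcome."""
        allowed = any(
            p.service == service
            and resource.startswith(p.resource_prefix)
            and p.group in groups
            and action in p.actions
            for p in self.policies
        )
        # Every decision is logged, whether access was granted or denied.
        self.audit_log.append((user, service, resource, action, allowed))
        return allowed


if __name__ == "__main__":
    engine = PolicyEngine([
        Policy("hdfs", "/archive/", "analysts", frozenset({"read"})),
        Policy("hive", "archive.", "analysts", frozenset({"read"})),
    ])
    print(engine.is_allowed("alice", {"analysts"}, "hdfs", "/archive/2015/a", "read"))
    print(engine.is_allowed("alice", {"analysts"}, "hdfs", "/archive/2015/a", "write"))
    print(engine.audit_log)
```

Keeping group membership outside the policies mirrors the slide's point about aligning access with the corporate directory: the policy names a group, and who belongs to it is decided elsewhere.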
Security Concept
Agile Working
• Setup of this kind of project requires interdisciplinary cooperation
• DevOps teams provide a lot of the required skills with short communication paths
• Cooperation across department boundaries can be a challenge
• Agile delivery vs. expectations and timelines
• Manage external dependencies in a Scrum setting
Shaping the Future
Existing standards do not always fit our goals and tools
Work with interdepartmental teams – DevOps, Infra, DBAs, Business, Risk, Legal
We are influencing the standards that the bank will set for coming systems!
Attributions
• Hellmar in Nîmes / With Python in Mindanao, by the author
• Domtoren in het oranje licht by helena_is_here is licensed under CC BY 2.0
• Data Pipeline, ING OIB Image Bank
• Data Pipeline, ING OIB Image Bank, edited (cropped) by the author
• Baby Elephant with mother by David Rosen is licensed under CC BY 2.0
• Bruarfoss Waterfall in winter, Iceland by Diana Robinson is licensed under CC BY-ND 2.0
• Elephants at Pinnawala by Jan Arendtsz is licensed under CC BY-NC 2.0