![Page 1: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/1.jpg)
Metadata Management in Big Data
Data Management Challenges
Tariq Ezzibdeh (@ezzibdeh)
![Page 2: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/2.jpg)
Aim
• Outline some perspective on metadata management principles that apply in the big data space and beyond
• Provide data governance foundations that should outlast the current technologies and serve the needs of the future
• Discuss technologies and solutions currently on the market
![Page 3: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/3.jpg)
Big Data - Overview
• Big Data 5 V's:
  • Volume
  • Velocity
  • Variety
  • Veracity
  • => Value
• Platform of today – a set of relatively split-up components
  • Data is stored on HDFS – file system
  • Catalogue of data and its schema is maintained in another service – TBC!
  • Query front ends – query engines based on different requirements
![Page 4: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/4.jpg)
Platform Architecture – Modern Architecture
[Architecture diagram; labels recovered from the slide]
• Data Sources
• Acquisition
• Data Systems
• Big Data Platform
  • Staging Zone: ETL and data standardization
  • Pristine Archive: compressed (Gzip etc.)
  • Data Warehouse: immutable data
  • Analytics Zone: allocated data changes
  • Schema Catalogue: well-defined reference to data structures and attributes
  • Data Ledger: track data and its access with lineage and operations
• Data Marts
• UI/API
• Apps
Source: Hortonworks
![Page 5: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/5.jpg)
Why do we need to manage metadata for Big Data platforms?
Large volumes of data landing in Hadoop/Big Data
Growing number of users working with the data
The need for effective control & consumption of data
The implementation needs to:
• Offer good data visibility across your cluster
• Capture data lineage across source systems and in the platform
• Audit and record operations that are performed in the platform
• Enforce policies that are defined by the platform stewards
• Help reduce data redundancy on the platform
Source: Cloudera
![Page 6: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/6.jpg)
Metadata in Action
![Page 7: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/7.jpg)
Metadata – What is it?
Data about Data!
• Business Metadata supplies the business context around data, such as the business term's name, definition, owners or stewards, and associated reference data
• Technical Metadata provides technical information about the data, such as the name of the source table, the source table column name, and the data type (e.g., string, integer)
• Operational Metadata furnishes information about the use of the data, such as date last updated, number of times accessed, or date last accessed
Source: Informatica
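The three metadata types can be sketched as plain records attached to a single column. This is a minimal illustration, not a schema from any particular tool: the class names, field names, and example values below are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class BusinessMetadata:
    """Business context: term name, definition, owners or stewards."""
    term: str
    definition: str
    stewards: List[str] = field(default_factory=list)


@dataclass
class TechnicalMetadata:
    """Physical description: source table, column name, data type."""
    source_table: str
    column_name: str
    data_type: str


@dataclass
class OperationalMetadata:
    """Usage facts: last update, number of accesses, last access."""
    last_updated: Optional[date] = None
    access_count: int = 0
    last_accessed: Optional[date] = None

    def record_access(self, when: date) -> None:
        # Each read bumps the counter and remembers the date.
        self.access_count += 1
        self.last_accessed = when


@dataclass
class ColumnRecord:
    """One column described from all three perspectives."""
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)


# Hypothetical example values, for illustration only.
record = ColumnRecord(
    business=BusinessMetadata("Customer", "A party that buys a product", ["data-steward"]),
    technical=TechnicalMetadata("crm.clients", "client_id", "string"),
)
record.operational.record_access(date(2016, 1, 15))
```

Keeping the three parts separate lets each one be maintained by a different owner: stewards edit the business part, ingestion jobs fill the technical part, and the platform updates the operational part.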
![Page 8: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/8.jpg)
Why do I need all this metadata?
• A data lake will contain all types of data (log streams – Kafka, DBs – Sqoop, …); don't let your lake turn into a swamp!
• Consistency of definitions – to reconcile differences in terminology such as "clients" and "customers," "revenue" and "sales"
• Clarity of data lineage – describes the origins of a data set and can be granular enough to define information at the attribute level, including operations on it
• To understand data usage on your cluster
  • Optimize queries and views
• Compliance and Regulatory
  • Compliance – capture, store and move data – Sarbanes-Oxley, HIPAA, Basel II
  • Security – authorization, authentication – handling sensitive data
  • Auditing – recording every attempt to access
  • Archive & Retention – data life cycle policies
Source: Teradata/Techtarget
![Page 9: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/9.jpg)
Metadata System Architecture
Topologically, a metadata repository architecture follows one of the following three styles:
• Centralized metadata repository
  • Pros: efficient access and adaptability, scalability and high performance
  • Cons: single point of failure and continuous synchronization
• Distributed metadata repository
  • Pros: access to the metadata repo in real time, up-to-date metadata
  • Cons: overhead in maintaining the configuration as source systems change, and HA
• Federated or hybrid metadata repository
  • Central definition storage with references to the proper locations of the accurate definitions
Source: Techtarget
![Page 10: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/10.jpg)
Use-cases for the need for Metadata
![Page 11: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/11.jpg)
Use Cases – Analytics
1. Finding the Data: Data Scientists spend a lot of time finding the correct columns for variable selection
  • Around 80% of the data scientist's time is spent on column investigation with SMEs
2. Profile of Data: Reduce the amount of time spent on data profiling through ad-hoc queries
  • ~78% of the queries run on the cluster are profiling queries
3. Track the transformation: Data Scientists would like to understand how the data sets are derived
  • Not fully tracked except at a high level
Source: Aetna
![Page 12: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/12.jpg)
1. Finding the data: Challenges
• Hive requires relatively manual traversal of the schema to find the table and columns
• HDFS also requires traversal of the directory listing to find a file
• Any documentation (external to the system) becomes outdated and is not always reliable
• No simple way to add business metadata
Source: Aetna
![Page 13: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/13.jpg)
HDFS/Hive Architecture
Source: hadoop.apache.org; Ben Lever (SlideShare)
![Page 14: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/14.jpg)
1. Finding the data: Solutions
• Run-time capture of Hive and HDFS metadata, stored in a repository
• Provide an API to query the metadata and search across it
• Provide an API or other ways to enrich the data with its business context
[Diagram labels: Business Metadata; Technical/Physical Metadata; Hive; HDFS; Ingestion/Sqoop; Apache Atlas]
Source: Aetna
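The capture, search, and enrich steps above can be sketched with a small in-memory repository. This is not the Apache Atlas API; the class `MetadataRepository`, its method names, and the sample entries are hypothetical and only illustrate the shape such a service might take.

```python
from typing import Dict, List


class MetadataRepository:
    """In-memory stand-in for a metadata repository (not the Atlas API)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, str]] = {}

    def capture(self, qualified_name: str, source: str, data_type: str) -> None:
        """Store technical metadata as it is observed from Hive, HDFS, etc."""
        self._entries[qualified_name] = {
            "source": source,
            "data_type": data_type,
            "description": "",
        }

    def enrich(self, qualified_name: str, description: str) -> None:
        """Attach business context to an entry that was already captured."""
        self._entries[qualified_name]["description"] = description

    def search(self, text: str) -> List[str]:
        """Return names whose name or description contains the text."""
        needle = text.lower()
        return sorted(
            name
            for name, entry in self._entries.items()
            if needle in name.lower() or needle in entry["description"].lower()
        )


# Hypothetical entries, for illustration only.
repo = MetadataRepository()
repo.capture("sales.orders.customer_id", source="hive", data_type="string")
repo.capture("/landing/crm/clients.csv", source="hdfs", data_type="file")
repo.enrich("/landing/crm/clients.csv", "Raw customer export from the CRM")

matches = repo.search("customer")
```

Because the search covers both the technical name and the business description, the HDFS file is found under "customer" even though its path only says "clients", which is the terminology gap the earlier slides describe.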
![Page 15: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/15.jpg)
2. Profile of Data: Challenges and Solutions
Challenges
• Access to the Hive metastore will introduce latency in production
• Lack of comprehensive information provided by the Hive metastore
[Chart: Average Daily Query – Profiling 78%, Exploratory 18%, Production 4%]
Solutions
• Provide a system with business and technical data that are cross-referenced
• Have a framework for the data scientist to accommodate additional profiling
Source: Aetna
![Page 16: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/16.jpg)
3. Track the transformation: Challenges and Solutions
Challenges
• Documenting transformations is manual and difficult to scale
• A mechanism for auditing the data pipeline is still lacking
• Tracking data quality and provenance is too manual
Solutions
• Leverage metadata already captured to construct transformations
• Provide an API to query transformations
• Provide a visualization for the transformations
Source: Aetna
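One way to make transformations queryable is to record each step as an edge in a lineage graph and walk it upstream. The sketch below assumes that representation; `LineageGraph`, its methods, and the data set names are hypothetical and not taken from any specific product.

```python
from collections import defaultdict
from typing import Dict, List, Set


class LineageGraph:
    """Records which inputs and which operation produced each data set."""

    def __init__(self) -> None:
        # Maps each derived data set to the inputs it was built from.
        self._parents: Dict[str, Set[str]] = defaultdict(set)
        self._operations: Dict[str, str] = {}

    def record(self, output: str, inputs: List[str], operation: str) -> None:
        self._parents[output].update(inputs)
        self._operations[output] = operation

    def upstream(self, dataset: str) -> List[str]:
        """Return every data set the given one depends on, directly or not."""
        seen: Set[str] = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return sorted(seen)

    def operation(self, dataset: str) -> str:
        """Return the recorded operation, or 'source' if none was recorded."""
        return self._operations.get(dataset, "source")


# Hypothetical pipeline, for illustration only.
lineage = LineageGraph()
lineage.record("staging.orders", ["source.orders_db"], "sqoop import")
lineage.record("warehouse.orders", ["staging.orders"], "ETL standardization")
lineage.record(
    "analytics.revenue",
    ["warehouse.orders", "warehouse.customers"],
    "hive query",
)

origins = lineage.upstream("analytics.revenue")
```

The same edge list could feed a visualization, since each `record` call already carries the nodes and the label needed to draw one arrow.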
![Page 17: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/17.jpg)
What do we need?
1. A searchable platform for business and technical metadata across all data types
2. Data profile store with basic metrics of the data
  • Min
  • Max
  • Column distribution
3. Visual lineage for the data flow from the source system to different components within the platform
  • ETL operations – high-level view
  • Analytics queries
4. Automated, metadata-driven data ingestion and thus management
  • The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake
  • Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility.
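The data profile store in item 2 can be sketched as a function that computes the basic metrics once and saves them, so that later questions do not need a new ad-hoc query. The function name, the store layout, and the sample values are assumptions made for this sketch.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Compute basic metrics for one column, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        # Value frequencies serve as a simple column distribution.
        "distribution": dict(Counter(present)),
    }


# Hypothetical profile store keyed by "table.column".
profile_store: Dict[str, Dict[str, Any]] = {}
profile_store["claims.amount"] = profile_column([120, 40, None, 40, 300])
```

A value-frequency count only suits low-cardinality columns; for continuous values a histogram with fixed buckets would be the more practical distribution to store.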
![Page 18: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/18.jpg)
Solutions for Hadoop
![Page 19: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/19.jpg)
Apache Atlas – deep dive
Apache Atlas Capabilities: Overview
• Data Classification
  • Import or define taxonomy business-oriented annotations for data
  • Define, annotate, and automate capture of relationships between data sets
  • Export metadata to third-party systems
• Centralized Auditing
  • Capture security access information
  • Capture the operational information for execution, steps, and activities
• Search & Lineage (Browse)
  • Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately
  • Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance-related information
• Security & Policy Engine
  • Rationalize compliance policy at runtime based on data classification schemes
Source: Hortonworks
Open-source Incubator project
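The idea of deciding access at runtime from data classifications can be illustrated with a generic check. This sketch does not use or mirror the Apache Atlas policy engine; the tags, roles, data set names, and the `can_read` function are hypothetical.

```python
from typing import Dict, Set

# Hypothetical classifications attached to data sets by stewards.
classifications: Dict[str, Set[str]] = {
    "warehouse.members": {"PII", "HIPAA"},
    "warehouse.page_views": set(),
}

# Hypothetical policy: which classifications each role is cleared to read.
role_clearances: Dict[str, Set[str]] = {
    "analyst": set(),
    "compliance_officer": {"PII", "HIPAA"},
}


def can_read(role: str, dataset: str) -> bool:
    """Allow access only if the role is cleared for every tag on the data set."""
    if dataset not in classifications:
        # Unclassified data sets are denied until a steward reviews them.
        return False
    required = classifications[dataset]
    granted = role_clearances.get(role, set())
    return required <= granted
```

The policy refers to tags rather than table names, so a newly ingested table is covered as soon as it is classified, without editing the policy itself.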
![Page 20: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/20.jpg)
Demo
Apache Atlas in action!
![Page 21: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/21.jpg)
Possible solutions for other platforms
![Page 22: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/22.jpg)
Netflix – Managing Data Platforms
Source: Netflix
![Page 23: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/23.jpg)
Possible Solutions for other Platforms
Metacat
• Apply metadata management on the service layer
• Federated metadata catalog for the whole data platform
• Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization and retention policy …
• Common interface for tools to interact with metadata
Tracking Data Difference
• Apply metadata management on the service layer
• Track the changes to documents/entities
• Custom code tracking through logs collected as Mongo, or use a module called MongoID
Netflix OSS
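A federated catalog that proxies to several metadata sources through one common interface might look like the sketch below. It is not Metacat's actual API: `MetadataSource`, `InMemorySource`, `FederatedCatalog`, the "source/table" naming scheme, and the sample tables are all assumptions for illustration.

```python
from typing import Dict, Optional, Protocol


class MetadataSource(Protocol):
    """Common interface every backend must offer."""

    def get_table(self, name: str) -> Optional[Dict[str, str]]: ...


class InMemorySource:
    """Stand-in for a real backend such as a Hive metastore or an RDBMS."""

    def __init__(self, tables: Dict[str, Dict[str, str]]) -> None:
        self._tables = tables

    def get_table(self, name: str) -> Optional[Dict[str, str]]:
        return self._tables.get(name)


class FederatedCatalog:
    """Routes 'source/table' lookups to the registered backend."""

    def __init__(self) -> None:
        self._sources: Dict[str, MetadataSource] = {}

    def register(self, source_name: str, source: MetadataSource) -> None:
        self._sources[source_name] = source

    def get_table(self, qualified_name: str) -> Optional[Dict[str, str]]:
        source_name, _, table = qualified_name.partition("/")
        source = self._sources.get(source_name)
        if source is None:
            return None
        metadata = source.get_table(table)
        if metadata is None:
            return None
        # Add the origin so callers see one uniform record shape.
        return {"source": source_name, **metadata}


# Hypothetical backends and tables, for illustration only.
catalog = FederatedCatalog()
catalog.register("hive", InMemorySource({"sales.orders": {"owner": "finance"}}))
catalog.register("rds", InMemorySource({"crm.clients": {"owner": "sales"}}))

result = catalog.get_table("hive/sales.orders")
```

The catalog stores no table metadata of its own; it only knows where to ask, which is what distinguishes the federated style from a centralized repository.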
![Page 24: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/24.jpg)
Where Else?

    {
      "Description": "A containerized foobar",
      "Usage": "docker run --rm example/foobar [args]",
      "License": "GPL",
      "Version": "0.0.1-beta",
      "aBoolean": true,
      "aNumber": 0.01234,
      "aNestedArray": ["a", "b", "c"]
    }

    <meta name="description" content="155 characters of message matching text with a call to action goes here">

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.JOSA.Meta</groupId>
      <artifactId>project</artifactId>
      <version>1.0</version>
    </project>
![Page 25: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/25.jpg)
Notes - Summary
• Consider the different types of metadata you need to manage
• Build a robust descriptive dictionary for the data
• Manage metadata as a team effort. It has a lot of benefits, so make it Agile but effective.
Finally… remember that one's Metadata – d/dx – is someone else's Data!
![Page 26: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/26.jpg)
Resources
• HDP 2.3 Preview Sandbox VM (Hortonworks):
  – http://hortonworks.com/hdp/whats-new/
• Apache Atlas:
  – http://atlas.incubator.apache.org/
  – http://incubator.apache.org/projects/atlas.html
  – https://git-wip-us.apache.org/repos/asf/incubator-atlas.gi
• Metadata Management (General):
  – https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/white-paper/metadata-management-data-governance_white-paper_2163.pdf
![Page 27: JOSA TechTalk: Metadata Managementin Big Data](https://reader036.vdocuments.net/reader036/viewer/2022062901/58f1e5b11a28abf8638b45b5/html5/thumbnails/27.jpg)
Tariq Ezzibdeh
Questions?
Contact info: