hadoop 2.0 - solving the data quality challenge
DESCRIPTION
The Briefing Room with Dr. Claudia Imhoff and RedPoint Global Live Webcast on July 22, 2014 Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7bb4cbc33402c3b5f649343052cb9a6d Whether data is big or small, quality remains the critical characteristic. While traditional approaches to cleansing data have made strides, nonetheless, data quality remains a serious hurdle for all organizations. This is especially true for identity resolution in customer data, but also for a range of other data sets, including social, supply chain, financial and other domains. One of the most promising approaches for solving this decades-old challenge incorporates the power of massive parallel processing, a la Hadoop. Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Claudia Imhoff, who will explain how Hadoop 2.0 and its YARN architecture can make a serious impact on the previously intractable problem of data quality. She’ll be briefed by George Corugedo of RedPoint Global, who will show how his company’s platform can serve as a super-charged marshaling area for accessing, cleansing and delivering high-quality data. He’ll explain how RedPoint was one of the first applications to be certified for running on YARN, which is the latest rendition of the now-ubiquitous Hadoop. Visit InsideAnlaysis.com for more information.TRANSCRIPT
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Hadoop 2.0: Solving the Data Quality Challenge
Twitter Tag: #briefr
The Briefing Room
! Reveal the essential characteristics of enterprise software, good and bad
! Provide a forum for detailed analysis of today’s innovative technologies
! Give vendors a chance to explain their product to savvy analysts
! Allow audience members to pose serious questions... and get answers!
Mission
Twitter Tag: #briefr
The Briefing Room
Topics
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: INNOVATIVE TECHNOLOGY
August: BIG DATA ECOSYSTEM
September: INTEGRATION
Twitter Tag: #briefr
The Briefing Room
Analyst: Dr. Claudia Imhoff
Claudia Imhoff is President & Founder of
Intelligent Solutions, Inc.
Twitter Tag: #briefr
The Briefing Room
RedPoint Global
! RedPoint Global is a data management and integrated marketing technology company
! Its Convergent Marketing Platform™ offers products designed for data management, collaboration and architecture integration.
! RedPoint Data Management for Hadoop is YARN-compliant and enables analysts to access and manipulate data directly within the Hadoop cluster.
Twitter Tag: #briefr
The Briefing Room
Guest: George Corugedo
George Corugedo is Chief Technology Officer & Co-Founder at RedPoint Global Inc. A mathematician and seasoned technology executive, George has over 20 years of business and technical expertise. As co-founder and CTO of RedPoint Global, George is responsible for leading the development of the RedPoint Convergent Marketing Platform™. A former math professor, George left academia to co-found Accenture’s Customer Insight Practice, which specialized in strategic data utilization, analytics and customer strategy. Previous positions include director of client delivery at ClarityBlue, Inc., a provider of hosted customer intelligence solutions to enterprise commercial entities, and COO/CIO of Riscuity, a receivables management company specializing in the utilization of analytics to drive collections.
The Neglected Discipline of Data Quality in Hadoop July 2014
11 © RedPoint Global Inc. 2014 Confidential
Overview – Challenges to Adoption
• Severe shortage of MR skilled resources
• Very expensive resources and hard to retain
• Inconsistent skills lead to inconsistent results
• Under uAlizes exisAng resources
• Prevents broad leverage of investments across enterprise
Skills Gap
• A nascent technology ecosystem around Hadoop
• Emerging technologies only address narrow slivers of funcAonality
• New applicaAons are not enterprise class
• Legacy applicaAons have built short term capabiliAes
Maturity & Governance
• Data is not useful in its raw state, it must be turned into informaAon
• Benefit of Hadoop is that same data can be used from many perspecAves
• Analysts must now do the structuring of the data based on intended use of the data
Data Into InformaAon
12 © RedPoint Global Inc. 2014 Confidential
Key Points to Cover Today
! Broad functionality across data processing domains
! Validated ease of use, speed, match quality and party data superiority
! Hadoop 2.0/YARN certified – 1 of first 17 companies to do so
! Not a repackaging of Hadoop 1.0 functionality. RedPoint Data Management is a pure YARN application (1 of only 2 in the initial wave of certifications)
! Building a complex job in RPDM takes a fraction of the time that it takes to write the same job in Map Reduce and none of the coding or java skills.
! Big functional footprint without touching a line of code
! Design model consistent with data flow paradigm
! RPDM has a “Zero-Footprint” install in the Hadoop cluster
! The same interface and functionality is available for both structured and unstructured databases. Thus it is seamless to work across both from a users perspective.
! Data quality done completely within the cluster
13 © RedPoint Global Inc. 2014 Confidential
Key features of RedPoint Data Management
Master Key Management
ETL & ELT Data Quality
Web Services IntegraAon
IntegraAon & Matching
Process AutomaAon & OperaAons
• Profiling, reads/writes, transformaAons
• Single project for all jobs
• Cleanse data • Parsing, correcAon • Geo-‐spaAal analysis
• Grouping • Fuzzy match
• Create keys • Track changes • Maintain matches over Ame
• Consume and publish • HTTP/HTTPS protocols • XML/JSON/SOAP formats
• Job scheduling, monitoring, noAficaAons
• Central point of control
All func(ons can be used on both TRADITIONAL and BIG DATA
Creates clean, integrated, ac/onable data – quickly, reliably and at low cost
14 © RedPoint Global Inc. 2014 Confidential
RedPoint Data Management on Hadoop
ParAAoning AM / Tasks
ExecuAon AM / Tasks Data I/O Key / Split
Analysis
Parallel SecAon (UI)
YARN
MapReduce
15 © RedPoint Global Inc. 2014 Confidential
RedPoint Functional Footprint
Monitoring and Management Tools
AMBARI
MAPREDUCE
REST
DATA REFINEMENT
HIVE PIG
HTTP
STREAM
STRUCTURE
HCATALOG (metadata services)
Query/Visualization/ Reporting/Analytical
Tools and Apps
SOURCE DATA
- Sensor Logs - Clickstream - Flat Files - Unstructured - Sentiment - Customer - Inventory
DBs
JMS Queue’s
Files Fil
es Files
Data Sources
RDBMS
EDW
INTERACTIVE
HIVE Server2
LOAD
SQOOP
WebHDFS
Flume
NFS
LOAD SQOOP/Hive
Web HDFS
YARN
� � � � � � � � � �
� � � � � � � � � � �
� � � � � � � � � � �
� �
� �
� n
HDFS
1 � � � � � � � � � � � �
�
� � � � � � � � � � � � �
� � � � � � � � � � � � �
� � � � � � � � � � � � �
16 © RedPoint Global Inc. 2014 Confidential
RedPoint
Benchmarks – Project Gutenberg
Map Reduce Pig
Sample MapReduce (small subset of the entire code which totals nearly 150 lines): public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> { private final static String delimiters = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿"; private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(WordOffset key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line, delimiters); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }
Sample Pig script without the UDF: SET pig.maxCombinedSplitSize 67108864 SET pig.splitCombination true A = LOAD '/testdata/pg/*/*/*'; B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word; C = FOREACH B GENERATE UPPER(word) AS word; D = GROUP C BY word; E = FOREACH D GENERATE COUNT(C) AS occurrences, group; F = ORDER E BY occurrences DESC; STORE F INTO '/user/cleonardi/pg/pig-count';
>150 Lines of MR Code ~50 Lines of Script Code 0 Lines of Code
6 hours of development 3 hours of development 15 min. of development
6 minutes runtime 15 minutes runtime 3 minutes runtime
Extensive optimization needed User Defined Functions required prior to running script
No tuning or optimization required
17 © RedPoint Global Inc. 2014 Confidential
Attributes of Information
RELEVANT InformaAon must pertain to a specific problem. General data must be connected to reveal relevance of the informaAon.
COMPLETE ParAal informaAon is oaen worse than no informaAon. ParAal informaAon frequently leads to worse conclusions than if no data had been used at all.
ACCURATE This one is obvious. In a context like health care, inaccurate data can be fatal. Precision is required across all applicaAons of informaAon.
CURRENT As data ages, it becomes less accurate. MulAple research studies by Google and others show the decay in the accuracy of analyAcs as data becomes stale.
ECONOMICAL There has to be a clear cost benefit. This requires work to idenAfy the realizable benefit of informaAon but this is also what rives the use if successful
18 © RedPoint Global Inc. 2014 Confidential
Reference Architecture for Matching in Hadoop
Data Sources CRM
ERP
Billing
Subscriber
Product
Network
Weather
Compete
Manuf.
Clickstream
Online Chat
Sensor Data
Social Media
Call Detail Records
FabricaAon Logs
Sales Feedback
Field Feedback
Field Feedback
+
19 © RedPoint Global Inc. 2014 Confidential
Resource Manager
Launches Tasks
Node Manager
DM App Master
DM Task
Node Manager
DM Task
DM Task
Node Manager
DM Task
DM Task
Launches DM App Master
Data Management Designer
DM ExecuAon
Server
Parallel SecAon
Running DM Task
12
3
RedPoint DM for Hadoop: Processing Flow
20 © RedPoint Global Inc. 2014 Confidential
The Data Management designer
21 © RedPoint Global Inc. 2014 Confidential
DM Hadoop Settings
22 © RedPoint Global Inc. 2014 Confidential
DM Parallel Section on Hadoop
23 © RedPoint Global Inc. 2014 Confidential
Who Should Care
! Companies interested in exploring the promise of Big Data Analytics and need an easy way to get started.
! Companies already investing heavily investing in Big
Data Analytics technologies but are stuck due to the shortage of skilled resources
! Large organizations that are focused on “Operational Offloading” and need to achieve it cost effectively
! Companies who recognize that much of the data that lands in Hadoop is external to the organization and need to have Data Quality and proper data governance applied to their Hadoop data.
24 © RedPoint Global Inc. 2014 Confidential
Data Inputs
RedPoint Convergent Marketing Ecosystem
Enh
an
ce
me
nt
SQL
Soc
ial
No
SQ
L
Analytics Machine Learning
GIS
Address Std.
Inbox Analysis Segmentation Attribution
CRM Trigger Audience Offer
Marketing Rules Engine
RedPoint Interaction
Social Mobile Digital Real Time
Cache
Analytics Marketing Operations Hadoop
RedPoint Data Management
Geocoding
Web Services
25 © RedPoint Global Inc. 2014 Confidential
RedPoint real-time decisions: how it works (web site example)
www
profile data context data
real-‐Ame profile
winning content RedPoint Machine Learning
rules
inbound personalizaAons combined with outbound contacts to create cross-‐channel interacAon history
update/ maintain over Ame
web site
REDPOINT EXECUTION ENVIRONMENT
personalizaAon opportunity
API call
personalized content CONTENT NEEDED
content
candidate content with associated eligibility & scoring rules
content stored in RedPoint, or RedPoint points to content in CMS or other system
API Nulla tincidunt dolor sit amet erat. Suspendisse dictum mauris sollicitudin luctus varius. Duis a mauris leo. Aenean vel euismod est. Phasellus pretium, sem id varius viverra, nisl elit commodo orci, vel sollicitudin dolor nibh ut nisl. Sed ut magna a arcu vulputate bibendum.
Duis vehicula tellus commodo mauris consequat rutrum eget sit amet arcu. Sed quis erat leo. Morbi accumsan aliquet tellus, ac consectetur nibh aliquet nec. Vivamus vel lacus ac ipsum ornare rhoncus. Aliquam libero magna, hendrerit vitae cursus vitae, accumsan eu sapien.
1st Party Customer data in database(s) and/or Hadoop
26 © RedPoint Global Inc. 2014 Confidential
RedPoint vs. alternatives
ü û
ü û
ü û
ü û
ü û
ü û
ü û
Pure YARN, no MapReduce Graphical UI, not code-‐based
All DQ/DI funcAons available Executes in Hadoop, no data movement
Zero footprint install, nothing in the cluster Same product for Hadoop and database
Top rated for ease-‐of-‐use
Twitter Tag: #briefr
The Briefing Room
Perceptions & Questions
Analyst: Dr. Claudia Imhoff
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Solve your business puzzles with Intelligent Solutions
SPONSORED BY HOSTED BY
Data Quality in the Hadoop Age
By Claudia Imhoff, PhD Intelligent Solutions, Inc.
Boulder BI Brain Trust [email protected]
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Claudia Imhoff
29
President and Founder Intelligent Solutions, Inc.
A thought leader, visionary, and practitioner, Claudia Imhoff, Ph.D., is an internationally recognized expert on analytics, business intelligence, and the architectures to support these initiatives. Dr. Imhoff has co-authored five books on these subjects and writes articles (totaling more than 150) for technical and business magazines. She is also the Founder of the Boulder BI Brain Trust (BBBT), an international consortium of independent analysts and experts. You can follow them on Twitter at #BBBT or become a subscriber at www.bbbt.us. Email: [email protected]
Phone: 303-444-6650 Twitter: Claudia_Imhoff
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Agenda
§ Extending the Data Warehouse Architecture § Things to Ponder…
30
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Next Generation BI
31Based on a concept by Shree Dandekar of Dell Slide compliments of Colin White – BI Research, Inc.
New business insights
Reduced costs
New technologies
Enhanced data
management
Advanced analytics
New deployment
options
Next generation
BI
DRIVERS
TECHNOLOGIES
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Systems of Record
§ Remember – It all starts here! § Transactional systems generate most of the data used for all other
activities – operational processes, BI & analytical capabilities, etc.
§ The point here is a reminder: § Extend OLTP systems of record as a “key” source of data § Many companies do not (or can not) leverage data they already
have in their operational systems
32
Operational systems
RT BI services
Other internal & external structured & multi-structured data
Real-time streaming data
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Next Generation – Extended Data Warehouse Architecture (XDW)
33
Traditional EDW environment
Investigative computing platform
Data refinery
Data integration platform
Analytic tools & applications
Operational real-time environment
RT analysis platform
Other internal & external structured & multi-structured data
Real-time streaming data Operational systems
RT BI services Slide created by Colin White – BI Research, Inc.
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Use Case: Traditional EDW
34
Most BI environments today: § New technologies can be incorporated
into the EDW environment to improve performance, efficiency & reduce costs
Use cases: § Production reporting (data quality
sensitive) § Historical comparisons § Customer analysis (next best offer,
segmentation, life-time value scores, churn analysis, etc.)
§ KPI calculations § Profitability analysis § Forecasting
Data integration platform
Traditional EDW environment
Analytic tools & applications
Operational systems
RT BI services
real-time models & rules
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Data Quality Needed
§ EDW is now the “production” analytical environment § Produces standard reports, comparisons, and analytics to be used
as final word on situations
§ Data must be integrated as much as possible § Data must be run through data quality grist mill § There must be a full audit trail from source to ultimate
report, analytic, etc.
35
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Use Case: Data Refinery
36
Ingests raw detailed data in batch and/or real-time into managed data store (lake, hub, swamp, dump…)
Distills the data into useful business information and distributes the results to downstream systems
May also directly analyze some data
Employs low-cost hardware and software to enable large amounts of detailed data to be managed cost effectively
Requires (flexible) governance policies to manage data security, privacy, quality, archiving and destruction
Traditional EDW environment
Investigative computing platform
Data refinery
Data integration platform
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Data Quality Needed
§ This is not a data dumping ground! § It should be monitored and assessed as to the data integration and
quality needs
§ Just because you can store massive sets of data doesn’t mean it is ignored or assumed to not need governance
§ Nor does it mean that there is no need for a business case for the massive amount of data § If analytic accuracy is at 99% using 45% of the data, why deal with
all of it?
§ But speed of integration and quality processing is also important
37
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Use Case: Investigative Computing
New technologies used here include: § Hadoop, in-memory computing,
columnar storage, data compression, appliances, etc.
Use cases: § Data mining and predictive modeling
for EDW and real-time environments § Cause and effect analysis § Data exploration (“Did this ever
happen?” “How often?”) § Pattern analysis § General, unplanned investigations
of data
38
Data refinery
Data integration platform
Analytic tools & applications
Operational real-time environment
RT analysis platform
Investigative computing platform
Operational systems
RT BI services
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Data Quality Needed
§ Much more experimental in nature – lots of queries with null results
§ Analytics may be approximations § Data integration may be needed for some data, not for
other § Data quality also varies in terms of what data must go
through DQ process § Difficulty is in determining what get integrated and run
through data quality processing
39
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Use Case: Real Time Operational Environment
Embedded or callable BI services:
§ Real-time fraud detection § Real-time loan risk assessment § Optimizing online promotions § Location-based offers § Contact center optimization § Supply chain optimization
Real-time analysis engine: § Traffic flow optimization § Web event analysis § Natural resource exploration
analysis § Stock trading analysis § Risk analysis § Correlation of unrelated data
streams (e.g., weather effects on product sales)
40Operational real-time environment
RT analysis platform
Other internal & external structured & multi-structured data
Real-time streaming data
Operational systems
RT BI services
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Data Quality Needed
§ Because of operational nature, data must be as good as it can possibly be
§ Data may or may not bee integrated with other operational systems’ data
§ False positives and negatives to models must be reconciled as quickly as possible
§ But speed of integration and quality processing is of the utmost importance!
41
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
All Components Must Work Together
42
analytic models analyses
Analytic tools & apps
Investigative computing platform
Data refinery Operational systems
existing customer
data
next best customer offer
3rd party data location data social data
feedback
RT analysis platform call center dashboard or web event stream
Slide created by Colin White – BI Research, Inc.
Traditional EDW environment
Other internal & external structured & multi-structured data
Real-time streaming data
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Agenda
§ Extending the Data Warehouse Architecture § Things to Ponder…
43
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved 44
What Makes People Think These Have Gone Away?
§ Data Redundancy § Each system, application, and department in enterprise collects
own version of key business entities and attributes § Data Inconsistency
§ Enormous resources (time, money, and people) spent in reconciliation because of fractured data
§ Business Inefficiency § Fractured data generates business inefficiency – low productivity,
inefficient supply chain management, customer dissatisfaction, wasted marketing efforts
§ Business Change § Organizations are constantly changing and these disruptive events
cause a constant stream of changes to data
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved 45
Data Quality Challenges
§ Cultural Hurdles § Generating business case and obtaining
executive backing and funding § Requires a phased approach to quality deployment
§ Overcoming political barriers § E.g., moving from enterprise view to LOB/parochial
view of quality, yet still agreeing on common business definitions
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Data Quality Challenges
§ Technology Challenges § Unusual sources of data § Creating a flexible data governance model § Supporting complex & constantly changing data § Providing a flexible data integration
infrastructure § Wild West mentality…
46
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Data Governance and Data Quality is Changing
§ People using BI must “trust” the data § IT must work with the business to create certified data sets § Note: not all data must be certified but all data usage must be
documented and monitored
§ Governance still has an important role § Determine whether data used is “governed” (e.g., in a data
warehouse or MDM environment) or “ungoverned” (e.g., individual spreadsheets, external source)
§ Difficulty is figuring out differences – hence the need to monitor data usage
§ IT must have monitoring or oversight capability Note: LOB IT or experienced information producers may
have to take on some previously traditional central IT roles 47
Copyright © Intelligent Solutions, Inc. 2014 All Rights Reserved
Questions
§ What are the biggest challenges for data quality in the Hadoop age?
§ How do you justify the need for integration and quality processing in the “age of hurry up and give me the data”?
§ Not all data needs to be cleaned up and integrated but how do people determine what does and doesn’t?
§ What tips can you give us to help get the time, resources and funding for DQ in the refinery?
§ Technologically speaking, what is different about the Hadoop environment versus a traditional RDBMS one?
§ Who sponsors/is responsible for the data quality/integration effort in the age of Hadoop?
48
Twitter Tag: #briefr
The Briefing Room
Twitter Tag: #briefr
The Briefing Room
Upcoming Topics
www.insideanalysis.com
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: INNOVATIVE TECHNOLOGY
August: BIG DATA ECOSYSTEM
September: INTEGRATION
Twitter Tag: #briefr
The Briefing Room
THANK YOU for your
ATTENTION!