Track B-1: Building a Next-Generation Intelligent Data Platform
TRANSCRIPT
Building a Next-Generation Intelligent Data Platform
尹寒柏 Bob Yin, Senior Product Specialist
[Slide: the expanding reach of computing, from business technologies to users to value sources.
• Technologies: Mainframe (1960s-1970s, e.g. OS/360), Client-Server (1980s), Web (1990s), Social (2007), Cloud (2011), Internet of Things (2014)
• Users, growing from roughly 10^2 to 10^11: few employees, many employees, customers/consumers, business ecosystems, communities & society, devices & machines
• Value sources: back-office automation, front-office productivity, e-commerce, line-of-business self-service, social engagement, real-time optimization]
What are your business initiatives related to big data?
• Financial Services: fraud detection, risk & portfolio analysis, investment recommendations
• Retail & Telco: proactive customer engagement, location-based services
• Manufacturing: connected vehicle, predictive maintenance
• Healthcare & Pharma: predicting patient outcomes, total cost of care, drug discovery
• Public Sector: health insurance exchanges, public safety, tax optimization, fraud detection
• Media & Entertainment: online & in-game behavior, customer cross-sell/up-sell
80% of the work in big data projects is data integration and data quality:
"80% of the work in any data project is in cleaning the data."
"70% of my value is an ability to pull the data, 20% of my value is using data-science…"
"I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis."
(InformationWeek 2013 Analytics, Business Intelligence and Information Management Survey of 541 business technology professionals, October 2012)
What are your primary concerns about using big data software?
• Big data expertise is scarce and expensive: 38%
• Data warehouse appliance platforms are expensive: 33%
• We aren't sure how big data analytics will create business opportunities: 31%
• Analytical tools are lacking for big data platforms like Hadoop and NoSQL databases: 22%
• Our data's not accurate: 21%
• Hadoop and NoSQL technologies are hard to learn: 17%
Staff projects with readily available skills: Informatica developers are Hadoop developers.
• A large global bank grew its staff from 2 Java developers hand-coding to 100 Informatica developers after implementing Informatica Big Data Edition.
• CareerBuilder.com found in a survey that there were 27,000 requests for Hadoop skills and only 3,000 resumes with Hadoop skills, whereas there are over 100,000 trained Informatica developers globally.
Increase developer productivity: Informatica developers are up to 5x more productive.
• Hadoop hand-coders vs. Informatica developers: 4 weeks vs. 4 days, with 2x the performance.
• Informatica developers are 5x more productive, based on customer POCs.
Why Informatica for Big Data & Hadoop (Informatica on Hadoop → why customers care):
• Visual development environment → increase productivity up to 5x over hand-coding
• 100K+ trained Informatica developers globally → use existing & readily available skills for big data
• 200+ high-performance connectors (legacy & new) → move all types of customer data into Hadoop faster
• 100+ pre-built transforms for ETL & data quality → the broadest out-of-box transformations on Hadoop
• 100+ pre-built parsers for complex data formats → analyze and integrate all types of data faster
• Vibe "Map Once, Deploy Anywhere" virtual data machine → an insurance policy as new data types and technologies change
• Reference architectures to get started → accelerate customer success with proven solutions
Unleash the power of Hadoop: Informatica developers are now Hadoop developers.
[Architecture diagram: source data (transactions/OLTP/OLAP; relational and mainframe; documents and emails; social media and web logs; machine, device, cloud, and scientific data) is loaded, replicated, and streamed into Hadoop, where it is profiled, parsed, cleansed, matched, transformed (ETL), and archived; results feed the data warehouse, mobile apps, analytics & operational dashboards, and alerts for analytics teams via services, events, and topics.]
Maximize your return on big data.
[Reference architecture: operational systems (OLTP, ODS) and analytical systems (data warehouse, data marts, MDM) supply data assets to Hadoop & other NoSQL stores. The platform stages are access & ingest, parse & prepare, discover & profile, transform & cleanse, and extract & deliver, all managed for security, performance, governance, and collaboration. The resulting data products feed the data warehouse, MDM, and applications. Hadoop complements your existing infrastructure.]
Data ingestion and extraction: moving terabytes of data per hour.
• Ingestion patterns: replicate, streaming, batch load, extract, and archive extract to a low-cost store (a batch-load sketch follows this list)
• Source types: transactions, OLTP, OLAP; social media, web logs; documents, email; industry standards; machine, device, and scientific data
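As a minimal sketch of the batch-load pattern in plain HiveQL, assuming a tab-delimited extract and a hypothetical HDFS path and table (Informatica's own ingestion runs through its connectors, not hand-written statements like these):

-- Expose a landing directory in HDFS as a Hive table (names and path are hypothetical)
CREATE EXTERNAL TABLE web_logs_raw (
  log_ts  STRING,
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/web_logs';

-- Batch-load a new extract file into the landing table
LOAD DATA INPATH '/staging/web_logs_extract.tsv' INTO TABLE web_logs_raw;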
Access all types of data: 200+ high-performance connectors and pre-built parsers for specialized data formats.
• Messaging and web services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
• Packaged applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
• Relational and flat files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
• Mainframe and midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
• Unstructured data and files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
• Industry standards: EDI-X12, EDIFACT, RosettaNet, HL7, HIPAA, AST, FIX, SWIFT, Cargo IMP, MVR
• XML standards: ebXML, HL7 v3.0, ACORD (AL3, XML), XML, LegalXML, IFX, cXML
• SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
• Social media: Facebook, Twitter, LinkedIn, Kapow, Datasift
• MPP appliances: Vertica, Netezza, Teradata, Aster, Pivotal
• A cloud of connectors
Real-time data collection and streaming
Leverage a high-performance messaging infrastructure: publish with Ultra Messaging (a publish/subscribe bus) for global distribution without additional staging or landing.
[Architecture diagram: sources include web servers, operations monitors, rsyslog, log files, JSON, TCP/UDP, HTTP, SLF4J, plus handhelds and smart meters sending discrete data messages over MQTT (Internet of Things, sensor data). Targets include HDFS, NoSQL databases such as a multi-node Cassandra cluster, and PowerCenter Real-Time Edition / RulePoint (CEP). ZooKeeper provides management and monitoring. In-stream transformations: filtering, timestamp, static text, custom.]
Informatica Vibe Data Stream for Machine Data
• High-performance, efficient streaming data collection over LAN/WAN
• A GUI provides ease of configuration, deployment & use
• Continuous ingestion of data generated in real time (sensors, logs, and other machine-generated sources)
• Enables real-time interactions & response
• Real-time delivery directly to multiple targets (batch/stream processing)
• Highly available, efficient, scalable
• Available ecosystem of lightweight agents (sources & targets)
Enables streaming analytics and complex event processing. A sketch of querying streamed events in place follows below.
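Once streamed events have landed in HDFS they can be queried directly. A minimal sketch, assuming newline-delimited JSON events in a hypothetical /data/streams/sensors directory (the field names are invented for illustration):

-- Hive table over the directory the stream writes into (path and fields are hypothetical)
CREATE EXTERNAL TABLE sensor_events (json STRING)
LOCATION '/data/streams/sensors';

-- Extract fields from each JSON event and flag out-of-range readings
SELECT get_json_object(json, '$.device_id')   AS device_id,
       get_json_object(json, '$.temperature') AS temperature
FROM sensor_events
WHERE CAST(get_json_object(json, '$.temperature') AS DOUBLE) > 90.0;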
NoSQL support for HBase
• Read from HBase as a standard source
• Write to HBase as a standard target
• A complete mapping with an HBase source/target can execute on Hadoop
• Sample HBase column families are stored in JSON/complex formats
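For readers who want to experiment outside the Informatica Developer tool, Hive's standard HBase storage handler illustrates the same source/target idea; the table, column family, and columns below are hypothetical:

-- Map a Hive table onto an HBase table (names and column family are hypothetical)
CREATE EXTERNAL TABLE customer_hbase (
  rowkey STRING,
  name   STRING,
  city   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:city')
TBLPROPERTIES ('hbase.table.name' = 'customer');

-- Read HBase as a source...
SELECT rowkey, name FROM customer_hbase WHERE city = 'Taipei';

-- ...or write to HBase as a target
INSERT INTO TABLE customer_hbase SELECT id, name, city FROM customer_staging;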
NoSQL Support for MongoDB
Access, integrate, transform & ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)
Access, integrate, transform, & ingest data into MongoDB
Sampling MongoDB data & flattening it to relational format
Graphical representa.on highligh.ng data, segments, separators, and missing or invalid data
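Informatica does this flattening through its own MongoDB connectivity. As an illustration of the same idea with open-source parts, the mongo-hadoop connector can expose a collection to Hive as flat columns; the connection URI, collection, and fields below are hypothetical:

-- Hive table over a MongoDB collection via the mongo-hadoop storage handler
CREATE EXTERNAL TABLE mongo_customers (
  id   STRING,
  name STRING,
  city STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ('mongo.columns.mapping' = '{"id":"_id","name":"name","city":"address.city"}')
TBLPROPERTIES ('mongo.uri' = 'mongodb://localhost:27017/crm.customers');

-- Nested document fields (e.g. address.city) now read as flat relational columns
SELECT city, COUNT(*) FROM mongo_customers GROUP BY city;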
Big Data Parser: easy deployment of industry standards
• Import pre-built industry libraries and easily customize them for specific needs
• Support for healthcare industry standards and more
• Libraries are constantly maintained to ensure continued compliance
Big Data Parser on Taobao
Hadoop data profiling (example columns: CUSTOMER_ID, COUNTRY CODE)
1. Profiling stats: min/max values, NULLs, inferred data types, etc. These stats identify outliers and anomalies in the data.
2. Value & pattern analysis of Hadoop data: value and pattern frequencies isolate inconsistent/dirty data and unexpected patterns.
3. Drill-down analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates.
Hadoop data profiling results are exposed to anyone in the enterprise via a browser. A sketch of the underlying stats follows below.
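The profiling screens summarize column-level statistics. As a rough sketch of the kind of stats involved, computed here directly in HiveQL over a hypothetical customers table:

-- Column-level profiling stats (table and column are hypothetical)
SELECT MIN(country_code)                                     AS min_value,
       MAX(country_code)                                     AS max_value,
       COUNT(*)                                              AS row_count,
       SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END) AS null_count,
       COUNT(DISTINCT country_code)                          AS distinct_values
FROM customers;

-- Value frequency analysis: surfaces dirty values such as 'TWN' vs. 'TW'
SELECT country_code, COUNT(*) AS freq
FROM customers
GROUP BY country_code
ORDER BY freq DESC;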
Execute data quality on Hadoop: big data cleansing, deduplication, and parsing.
• Address validation: address validation and geocoding enrichment across 260 countries
• Matching: probabilistic or deterministic matching (a sketch follows below)
• Standardization: standardization and reference data management
• Parsing: parsing of unstructured data/text fields for all types of data (customer, product, social, logs)
• DQ logic is pushed down and runs natively on Hadoop
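To make the deterministic vs. probabilistic distinction concrete, here is a minimal sketch in plain HiveQL; Informatica's matching transforms are far richer, the table and column names are hypothetical, and levenshtein() requires Hive 1.2+:

-- Deterministic matching: exact join on a standardized key
SELECT a.id, b.id
FROM customers a JOIN prospects b
  ON a.normalized_phone = b.normalized_phone;

-- Probabilistic (fuzzy) matching: accept near-identical names within an edit-distance threshold
SELECT a.id, b.id, levenshtein(a.full_name, b.full_name) AS dist
FROM customers a JOIN prospects b
  ON a.postal_code = b.postal_code          -- blocking key to limit comparisons
WHERE levenshtein(a.full_name, b.full_name) <= 2;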
Data quality example: Taiwan addresses.
Cross-language matching examples:
Arabic:
Abdulaziz A/Rahman Al Sugair = عبدالعزيز عبدالرحمن الصقير
Abd. A.Rhman Hammed Al-Shuqair = عبدالله عبدالرحمن حمد الشقي
Abdulrahman Abdullah A.Alshegri = عبدالعزيز عبدالله الشقير / عبدالعزيز بن محمد الصق
Japanese:
Toyotomi Hideyoshi = 豊臣秀吉 = トヨトミヒデヨシ = とよとみひでよし
Address variants: 兵庫県 小野市 上本町207 シャトー上本町303 / 上本町207 シャトー上本町303 / 上本町303 シャトー上本町33 / 兵庫県 野市
Chinese: Traditional ↔ Simplified, Simplified ↔ English, Simplified ↔ English (Cantonese)
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
         FROM lineitem GROUP BY L_ORDERKEY ) T1
  JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Data integration & quality on Hadoop:
1. The entire Informatica mapping is translated to Hive Query Language (Hive-QL).
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through user-defined functions (UDFs) using Vibe, as sketched below.
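Vibe's UDF hook is Informatica-internal, but the general Hive mechanism it rides on looks like this; the jar path, class, and function name below are hypothetical stand-ins:

-- Register a custom transformation as a Hive UDF (jar/class/function names are hypothetical)
ADD JAR /opt/infa/lib/custom_transforms.jar;
CREATE TEMPORARY FUNCTION cleanse_name AS 'com.example.hive.CleanseNameUDF';

-- The generated HQL can then call the transformation inline
SELECT cleanse_name(C_NAME) FROM customer;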
Configure the mapping for Hadoop execution: there is no need to redesign mapping logic to execute on either traditional or Hadoop infrastructure. Simply configure where the integration logic should run, Hadoop or native.
Mixed workflow orchestration: one workflow running tasks on Hadoop and in local environments.
[Workflow diagram: Cmd_ChooseLoadPath branches to either MT_Load2Hadoop + Parse (Cmd_Load2Hadoop, MT_Parse) or Cmd_ProfileData, followed by MT_Cleanse, MT_DataAnalysis, and a final Notification task.]
List of workflow variables (Name / Type / Default Value / Description):
• $User.LoadOptionPath / Integer / 2 / Load path for the workflow, depending on the output of the command task
• $User.DataSourceConnection / String / HiveSourceConnection / Source connection object
• $User.ProfileResult / Integer / 100 / Output from the "profiling" command task
Full traceability from workflow to MapReduce jobs
View generated Hive scripts
Unified administration: a single place to manage & monitor.
Map once. Deploy anywhere: on premise, Hadoop, 3rd-party applications, cloud.