Track B-1: Building a Next-Generation Intelligent Data Platform
TRANSCRIPT
Building a Next-Generation Intelligent Data Platform
尹寒柏 Bob Yin, Senior Product Specialist
[Slide: the expanding reach of computing, from business technologies to users to value sources.
• Technologies: Mainframe (1960s-1970s, e.g. OS/360), Client-Server (1980s), Web (1990s), Social (2007), Cloud (2011), Internet of Things (2014)
• Users, growing from roughly 10^2 to 10^11: few employees, many employees, customers/consumers, business ecosystems, communities & society, devices & machines
• Value sources: back-office automation, front-office productivity, e-commerce, line-of-business self-service, social engagement, real-time optimization]
What are your business initiatives related to big data?
• Financial Services: fraud detection, risk & portfolio analysis, investment recommendations
• Retail & Telco: proactive customer engagement, location-based services
• Manufacturing: connected vehicle, predictive maintenance
• Healthcare & Pharma: predicting patient outcomes, total cost of care, drug discovery
• Public Sector: health insurance exchanges, public safety, tax optimization, fraud detection
• Media & Entertainment: online & in-game behavior, customer cross-sell/up-sell
80% of the work in big data projects is data integration and data quality:
"80% of the work in any data project is in cleaning the data."
"70% of my value is an ability to pull the data, 20% of my value is using data-science…"
"I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis."
(InformationWeek 2013 Analytics, Business Intelligence and Information Management Survey of 541 business technology professionals, October 2012)
What are your primary concerns about using big data software?
• Big data expertise is scarce and expensive: 38%
• Data warehouse appliance platforms are expensive: 33%
• We aren't sure how big data analytics will create business opportunities: 31%
• Analytical tools are lacking for big data platforms like Hadoop and NoSQL databases: 22%
• Our data's not accurate: 21%
• Hadoop and NoSQL technologies are hard to learn: 17%
Staff projects with readily available skills: Informatica developers are Hadoop developers.
• A large global bank grew its staff from 2 Java developers hand-coding to 100 Informatica developers after implementing Informatica Big Data Edition.
• CareerBuilder.com found in a survey that there were 27,000 requests for Hadoop skills and only 3,000 resumes with Hadoop skills, whereas there are over 100,000 trained Informatica developers globally.
Increase developer productivity: Informatica developers are up to 5x more productive.
• Hadoop hand-coders vs. Informatica developers: 4 weeks vs. 4 days, with 2x the performance.
• Informatica developers are 5x more productive, based on customer POCs.
Why Informatica for Big Data & Hadoop (Informatica on Hadoop → why customers care):
• Visual development environment → increase productivity up to 5x over hand-coding
• 100K+ trained Informatica developers globally → use existing & readily available skills for big data
• 200+ high-performance connectors (legacy & new) → move all types of customer data into Hadoop faster
• 100+ pre-built transforms for ETL & data quality → the broadest out-of-box transformations on Hadoop
• 100+ pre-built parsers for complex data formats → analyze and integrate all types of data faster
• Vibe "Map Once, Deploy Anywhere" virtual data machine → an insurance policy as new data types and technologies change
• Reference architectures to get started → accelerate customer success with proven solutions
Unleash the power of Hadoop: Informatica developers are now Hadoop developers.
[Architecture diagram: source data (transactions/OLTP/OLAP; relational and mainframe; documents and emails; social media and web logs; machine, device, cloud, and scientific data) is loaded, replicated, and streamed into Hadoop, where it is profiled, parsed, cleansed, matched, transformed (ETL), and archived; results feed the data warehouse, mobile apps, analytics & operational dashboards, and alerts for analytics teams via services, events, and topics.]
Maximize your return on big data.
[Reference architecture: operational systems (OLTP, ODS) and analytical systems (data warehouse, data marts, MDM) supply data assets to Hadoop & other NoSQL stores. The platform stages are access & ingest, parse & prepare, discover & profile, transform & cleanse, and extract & deliver, all managed for security, performance, governance, and collaboration. The resulting data products feed the data warehouse, MDM, and applications. Hadoop complements your existing infrastructure.]
Data ingestion and extraction: moving terabytes of data per hour.
• Ingestion patterns: replicate, streaming, batch load, extract, and archive extract to a low-cost store (a batch-load sketch follows this list)
• Source types: transactions, OLTP, OLAP; social media, web logs; documents, email; industry standards; machine, device, and scientific data
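As a minimal sketch of the batch-load pattern in plain HiveQL, assuming a tab-delimited extract and a hypothetical HDFS path and table (Informatica's own ingestion runs through its connectors, not hand-written statements like these):

-- Expose a landing directory in HDFS as a Hive table (names and path are hypothetical)
CREATE EXTERNAL TABLE web_logs_raw (
  log_ts  STRING,
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/web_logs';

-- Batch-load a new extract file into the landing table
LOAD DATA INPATH '/staging/web_logs_extract.tsv' INTO TABLE web_logs_raw;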
Access all types of data: 200+ high-performance connectors and pre-built parsers for specialized data formats.
• Messaging and web services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
• Packaged applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
• Relational and flat files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC
• Mainframe and midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
• Unstructured data and files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
• Industry standards: EDI-X12, EDIFACT, RosettaNet, HL7, HIPAA, AST, FIX, SWIFT, Cargo IMP, MVR
• XML standards: ebXML, HL7 v3.0, ACORD (AL3, XML), XML, LegalXML, IFX, cXML
• SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
• Social media: Facebook, Twitter, LinkedIn, Kapow, Datasift
• MPP appliances: Vertica, Netezza, Teradata, Aster, Pivotal
• A cloud of connectors
Real-time data collection and streaming
Leverage a high-performance messaging infrastructure: publish with Ultra Messaging (a publish/subscribe bus) for global distribution without additional staging or landing.
[Architecture diagram: sources include web servers, operations monitors, rsyslog, log files, JSON, TCP/UDP, HTTP, SLF4J, plus handhelds and smart meters sending discrete data messages over MQTT (Internet of Things, sensor data). Targets include HDFS, NoSQL databases such as a multi-node Cassandra cluster, and PowerCenter Real-Time Edition / RulePoint (CEP). ZooKeeper provides management and monitoring. In-stream transformations: filtering, timestamp, static text, custom.]
Informatica Vibe Data Stream for Machine Data
• High-performance, efficient streaming data collection over LAN/WAN
• A GUI provides ease of configuration, deployment & use
• Continuous ingestion of data generated in real time (sensors, logs, and other machine-generated sources)
• Enables real-time interactions & response
• Real-time delivery directly to multiple targets (batch/stream processing)
• Highly available, efficient, scalable
• Available ecosystem of lightweight agents (sources & targets)
Enables streaming analytics and complex event processing. A sketch of querying streamed events in place follows below.
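Once streamed events have landed in HDFS they can be queried directly. A minimal sketch, assuming newline-delimited JSON events in a hypothetical /data/streams/sensors directory (the field names are invented for illustration):

-- Hive table over the directory the stream writes into (path and fields are hypothetical)
CREATE EXTERNAL TABLE sensor_events (json STRING)
LOCATION '/data/streams/sensors';

-- Extract fields from each JSON event and flag out-of-range readings
SELECT get_json_object(json, '$.device_id')   AS device_id,
       get_json_object(json, '$.temperature') AS temperature
FROM sensor_events
WHERE CAST(get_json_object(json, '$.temperature') AS DOUBLE) > 90.0;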
NoSQL support for HBase
• Read from HBase as a standard source
• Write to HBase as a standard target
• A complete mapping with an HBase source/target can execute on Hadoop
• Sample HBase column families are stored in JSON/complex formats
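For readers who want to experiment outside the Informatica Developer tool, Hive's standard HBase storage handler illustrates the same source/target idea; the table, column family, and columns below are hypothetical:

-- Map a Hive table onto an HBase table (names and column family are hypothetical)
CREATE EXTERNAL TABLE customer_hbase (
  rowkey STRING,
  name   STRING,
  city   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:city')
TBLPROPERTIES ('hbase.table.name' = 'customer');

-- Read HBase as a source...
SELECT rowkey, name FROM customer_hbase WHERE city = 'Taipei';

-- ...or write to HBase as a target
INSERT INTO TABLE customer_hbase SELECT id, name, city FROM customer_staging;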
NoSQL Support for MongoDB
Access, integrate, transform & ingest MongoDB data into other analytic systems (e.g. Hadoop, data warehouse)
Access, integrate, transform, & ingest data into MongoDB
Sampling MongoDB data & flattening it to relational format
Graphical representa.on highligh.ng data, segments, separators, and missing or invalid data
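Informatica does this flattening through its own MongoDB connectivity. As an illustration of the same idea with open-source parts, the mongo-hadoop connector can expose a collection to Hive as flat columns; the connection URI, collection, and fields below are hypothetical:

-- Hive table over a MongoDB collection via the mongo-hadoop storage handler
CREATE EXTERNAL TABLE mongo_customers (
  id   STRING,
  name STRING,
  city STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ('mongo.columns.mapping' = '{"id":"_id","name":"name","city":"address.city"}')
TBLPROPERTIES ('mongo.uri' = 'mongodb://localhost:27017/crm.customers');

-- Nested document fields (e.g. address.city) now read as flat relational columns
SELECT city, COUNT(*) FROM mongo_customers GROUP BY city;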
Big Data Parser: easy deployment of industry standards
• Import pre-built industry libraries and easily customize them for specific needs
• Support for healthcare industry standards and more
• Libraries are constantly maintained to ensure continued compliance
Big Data Parser on Taobao
Hadoop data profiling (example columns: CUSTOMER_ID, COUNTRY CODE)
1. Profiling stats: min/max values, NULLs, inferred data types, etc. These stats identify outliers and anomalies in the data.
2. Value & pattern analysis of Hadoop data: value and pattern frequencies isolate inconsistent/dirty data and unexpected patterns.
3. Drill-down analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates.
Hadoop data profiling results are exposed to anyone in the enterprise via a browser. A sketch of the underlying stats follows below.
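The profiling screens summarize column-level statistics. As a rough sketch of the kind of stats involved, computed here directly in HiveQL over a hypothetical customers table:

-- Column-level profiling stats (table and column are hypothetical)
SELECT MIN(country_code)                                     AS min_value,
       MAX(country_code)                                     AS max_value,
       COUNT(*)                                              AS row_count,
       SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END) AS null_count,
       COUNT(DISTINCT country_code)                          AS distinct_values
FROM customers;

-- Value frequency analysis: surfaces dirty values such as 'TWN' vs. 'TW'
SELECT country_code, COUNT(*) AS freq
FROM customers
GROUP BY country_code
ORDER BY freq DESC;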
Execute data quality on Hadoop: big data cleansing, deduplication, and parsing.
• Address validation: address validation and geocoding enrichment across 260 countries
• Matching: probabilistic or deterministic matching (a sketch follows below)
• Standardization: standardization and reference data management
• Parsing: parsing of unstructured data/text fields for all types of data (customer, product, social, logs)
• DQ logic is pushed down and runs natively on Hadoop
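To make the deterministic vs. probabilistic distinction concrete, here is a minimal sketch in plain HiveQL; Informatica's matching transforms are far richer, the table and column names are hypothetical, and levenshtein() requires Hive 1.2+:

-- Deterministic matching: exact join on a standardized key
SELECT a.id, b.id
FROM customers a JOIN prospects b
  ON a.normalized_phone = b.normalized_phone;

-- Probabilistic (fuzzy) matching: accept near-identical names within an edit-distance threshold
SELECT a.id, b.id, levenshtein(a.full_name, b.full_name) AS dist
FROM customers a JOIN prospects b
  ON a.postal_code = b.postal_code          -- blocking key to limit comparisons
WHERE levenshtein(a.full_name, b.full_name) <= 2;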
Data quality example: Taiwan addresses.
Cross-language matching examples:
Arabic:
Abdulaziz A/Rahman Al Sugair = عبدالعزيز عبدالرحمن الصقير
Abd. A.Rhman Hammed Al-Shuqair = عبدالله عبدالرحمن حمد الشقي
Abdulrahman Abdullah A.Alshegri = عبدالعزيز عبدالله الشقير / عبدالعزيز بن محمد الصق
Japanese:
Toyotomi Hideyoshi = 豊臣秀吉 = トヨトミヒデヨシ = とよとみひでよし
Address variants: 兵庫県 小野市 上本町207 シャトー上本町303 / 上本町207 シャトー上本町303 / 上本町303 シャトー上本町33 / 兵庫県 野市
Chinese: Traditional ↔ Simplified, Simplified ↔ English, Simplified ↔ English (Cantonese)
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
         customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
         FROM lineitem GROUP BY L_ORDERKEY ) T1
  JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;
Data integration & quality on Hadoop:
1. The entire Informatica mapping is translated to Hive Query Language (Hive-QL).
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through user-defined functions (UDFs) using Vibe, as sketched below.
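Vibe's UDF hook is Informatica-internal, but the general Hive mechanism it rides on looks like this; the jar path, class, and function name below are hypothetical stand-ins:

-- Register a custom transformation as a Hive UDF (jar/class/function names are hypothetical)
ADD JAR /opt/infa/lib/custom_transforms.jar;
CREATE TEMPORARY FUNCTION cleanse_name AS 'com.example.hive.CleanseNameUDF';

-- The generated HQL can then call the transformation inline
SELECT cleanse_name(C_NAME) FROM customer;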
Configure the mapping for Hadoop execution: there is no need to redesign mapping logic to execute on either traditional or Hadoop infrastructure. Simply configure where the integration logic should run, Hadoop or native.
Mixed workflow orchestration: one workflow running tasks on Hadoop and in local environments.
[Workflow diagram: Cmd_ChooseLoadPath branches to either MT_Load2Hadoop + Parse (Cmd_Load2Hadoop, MT_Parse) or Cmd_ProfileData, followed by MT_Cleanse, MT_DataAnalysis, and a final Notification task.]
List of workflow variables (Name / Type / Default Value / Description):
• $User.LoadOptionPath / Integer / 2 / Load path for the workflow, depending on the output of the command task
• $User.DataSourceConnection / String / HiveSourceConnection / Source connection object
• $User.ProfileResult / Integer / 100 / Output from the "profiling" command task
Full traceability from workflow to MapReduce jobs
View generated Hive scripts
Unified administration: a single place to manage & monitor.
Map once. Deploy anywhere: on premise, Hadoop, 3rd-party applications, cloud.