[김유진] Data Science, Big Data, and Analytics of IBM
-
2013 IBM Corporation
Data Science, Big Data, and Analytics of IBM
-
INDEX
Part 1: About IBM, IBM Research & Use Case, Smarter Planet
Part 2: Data, Data Science, Big Data, Analytics
-
Part 1
About IBM, IBM Research & Use Case, Smarter Planet
-
IBM
, IT
,
: 1967 4 25, IBM1401 IBM 100%
: 1,135
( ) 2011 2010 2009
12,061 12,250 12,068
1,304.4 987.9 633.5
Premier Partner/ISV 65()
Advanced Partner 73
Member Partner 1,179
(Distributor) 9
37
2011.03 45 -
2011.01 IT ' (ACO) AA
2009.11 IBM
2008.12 IT '
2008.09
2007.04 IBM 40
2007.03 1
2004 3000
2003 IBM
2002 IBM (SI)
IBM
-
(, , DB)
(GBS)
/
IT (GTS)
(SWG)
Database
Web Application
Groupware
(STG)
Unix Svr
NT
POS
I Series
(GPS)
CRM
(R&D)
Trend
(IGF)
-
GTS SD: GTS Service Delivery
ITS: Information Technology Services
SO Sales: Strategic Outsourcing Sales
SO Client Service: Strategic Outsourcing Client Service
MTS: Maintenance and Technical Support
ST&MA&CM: Strategy & Marketing*CM
OFFERGROUP: Offering Group
ITS COVERAGE
ITS
ITS Delivery
ITS Presale & Delivery
ITS SALES
Opportunity Owner
Operation
Growth Initiative
Large Deal
Consulting
Service
S&T (Strategy & Transformation)
Sector
FSS
AMS & Delivery
AMS
Commercial
Electronics
I&G (Innovation & Growth)
BAO (Business Analytics & Optimization)
EA (Enterprise Applications)
AIS (Application Innovation Service)
Delivery Excellence
Ops & Support
Global Business Services
Global Technology Services
-
IBM ,
IBM , 6
IBM 6
IBM 10 , 12 IBM 16,000
IBM GBS
IT
IBM GTS
IT
IBM GPS
IBM STG
IBM SWG
,
IBM Financing
/
IT
-
IBM Research: 3,000 researchers in 12 labs
Watson
Ireland
2010
Australia
2010
New!
Almaden
1986
1995
Austin
1961
Zurich
1955 1972
Haifa Tokyo
1982
1998
India
2012
Africa Brazil
2010
1995
China
4 labs participated in the Watson project
-
Analytics Enable Better Decisions for Water System Management (Washington D.C. Water and Sewer Authority)
Replacement
What is the state of the water delivery and sewage disposal systems?
What is the best way to allocate capital for infrastructure network upgrades?
Failure Association
How do environmental conditions impact failure?
Does one brand of hydrant fail more frequently than another?
How does the aging process impact asset condition?
PM Optimization
Asset Failure & Risk
Preventive Maintenance
Can I reduce PM cost?
Which failures are driving my water main repair costs?
Which pipes should I replace to prevent challenges next winter?
Failure Prediction
Which hydrant is most likely to fail in the next 6 months?
What type of failure is most likely to happen given the current condition?
How likely is the pipe segment to fail?
Application of these techniques in an engagement with Washington D.C. Water and Sewer Authority resulted in:
- 25% increase in maintenance crew utilization
- 30-50% cost savings on selected inspection and preventive maintenance
- significant revenue increase through loss prevention and differential pricing
-
Preventive Maintenance for Water System (Washington D.C. Water and Sewer Authority)
Min: inspection cost + repair cost + penalty cost for downtime (repair)
s.t.: periodic inspection interval ≤ max allowable periodic inspection interval (364 days)
Optimize preventive maintenance time for each hydrant by considering the following factors:
- Inspection cost for PM before failure
- Repair cost given failure
- Penalty cost during downtime
- Failure risk
Maintenance planning
PM time (days)    # of hydrants
(100, 150]        1436
(150, 200]        2153
(200, 250]        2584
> 250             1005
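The trade-off among inspection cost, repair cost, penalty cost, and failure risk can be illustrated with a small sketch. It is not the model used in the engagement: it assumes a Weibull failure distribution and the classic age-replacement cost rate, and every parameter value and function name below is hypothetical.

```python
import math

def survival(t, shape, scale):
    """Weibull survival probability R(t); the Weibull model is an assumption."""
    return math.exp(-((t / scale) ** shape))

def cost_rate(t, shape, scale, c_inspect, c_repair, c_penalty_per_day, downtime_days):
    """Expected cost per day if PM is performed every t days (age-replacement model)."""
    r_t = survival(t, shape, scale)
    cost_if_failure = c_repair + c_penalty_per_day * downtime_days
    # Pay the inspection cost if the hydrant survives to t, otherwise repair plus penalty.
    expected_cost = c_inspect * r_t + cost_if_failure * (1.0 - r_t)
    # Expected cycle length: integral of R(s) from 0 to t (trapezoid rule, 1-day steps).
    expected_length = sum(
        0.5 * (survival(s, shape, scale) + survival(s + 1, shape, scale))
        for s in range(t)
    )
    return expected_cost / expected_length

def best_pm_interval(max_interval=364, **params):
    """Search all intervals up to the maximum allowable inspection interval."""
    return min(range(1, max_interval + 1), key=lambda t: cost_rate(t, **params))

# Hypothetical parameters for a single hydrant.
params = dict(shape=2.5, scale=400.0, c_inspect=100.0, c_repair=1500.0,
              c_penalty_per_day=200.0, downtime_days=3)
pm_days = best_pm_interval(**params)
print("PM interval (days):", pm_days)
```

Running the search per hydrant with hydrant-specific failure parameters would yield a distribution of PM times like the one tabulated on this slide.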
-
Outage/Damage Prediction and Response Optimization (Utility Company)
1. Customized Weather Forecast
2. Damage Model
3. Outage Prediction
4. Response Plan
5. Data Assimilation
6. Revised Outage
7. Revised Response Plan
8. Execution Report
Optimized Maintenance Plan
Prediction
Optimization
Real-time analytics
-
Predicting Multi-Category Daily Damage Counts (Utility Company)
Objective: Predict the daily multi-category damage counts at the region level based on the weather conditions
Date range: 01/2010~02/2013
Number of records: 52,206 for 34 regions
Response categories: 13 (C1~C13)
Data characteristics:
- target: daily damage counts in multiple categories
- predictors:
  1. cumulative rainfall in the preceding two weeks;
  2. aggregated weather conditions on Day 0, Day -1, and Day -2
Methods: Random Forests Model, Multivariate Poisson Regression Model
[Figure: weather conditions aggregated over 24-hour windows (from 12AM) for Day 0, Day -1, and Day -2, plus cumulative rainfall over a 14-day window, used to predict damages (C1, C2, C3, ...). Weather variables: temperature (min, max), rain rate (max), daily rain (max), monthly rain (max), humidity (max), average wind speed (max), wind gust speed (max), wind gust frequency, pressure (min, max)]
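The predictor construction described on this slide can be sketched as follows. This covers only the feature step, not the Random Forests or Poisson models; the field names `rain` and `wind_gust_max` and the function name are hypothetical, and the input is assumed to be one region's daily records sorted by date.

```python
def build_features(daily, window=14, lags=(0, 1, 2)):
    """Build one predictor row per day from a region's daily weather records.

    daily: list of dicts sorted by date; 'rain' and 'wind_gust_max' are
    hypothetical field names standing in for the aggregated weather variables.
    """
    rows = []
    start = max(window, max(lags))
    for i in range(start, len(daily)):
        # Cumulative rainfall over the preceding two weeks (days i-14 .. i-1).
        row = {"cum_rain_14d": sum(d["rain"] for d in daily[i - window:i])}
        # Aggregated weather conditions on Day 0, Day -1, and Day -2.
        for lag in lags:
            for key in ("rain", "wind_gust_max"):
                row[f"{key}_day_minus_{lag}"] = daily[i - lag][key]
        rows.append(row)
    return rows

# Tiny synthetic example: 20 days of made-up weather.
daily = [{"rain": float(k), "wind_gust_max": 10.0 + k} for k in range(20)]
features = build_features(daily)
print(len(features), features[0])
```

Each row would then be paired with that day's damage counts (C1 to C13) as the multi-category target.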
-
Maintenance Scheduling (Semiconductor Manufacturing Plant)
The scheduling problem for a wafer fab is a complex extension of the Resource-Constrained Project Scheduling Problem that handles planned and unplanned orders.
The objectives are to minimize the sum of the expected WIP in the time periods utilized by maintenance operations, minimize the number of technicians used, avoid performing maintenance early, and satisfy business rules.
The scheduling problem needs to integrate the production schedule with the maintenance schedule so as to avoid maintenance during high demand for a machine.
The system is currently deployed and generating schedules daily at IBM's East Fishkill 300mm semiconductor manufacturing plant.
Maintenance
Scheduling
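To make the stated objectives concrete, here is a deliberately simplified greedy sketch. It is not the deployed scheduler and does not solve the Resource-Constrained Project Scheduling Problem; it assumes each maintenance task takes one period and one technician, and all names and data are hypothetical.

```python
def schedule_maintenance(tasks, expected_wip, technicians_per_period):
    """Greedy heuristic: place each task in its lowest-WIP allowed period.

    tasks: list of (task_id, machine, earliest_period, due_period)
    expected_wip: dict machine -> list of expected WIP per period
    """
    used = {}  # period -> technicians already assigned
    plan = {}
    # Handle the tightest windows first so they are not crowded out.
    for task_id, machine, earliest, due in sorted(tasks, key=lambda t: t[3] - t[2]):
        candidates = [p for p in range(earliest, due + 1)
                      if used.get(p, 0) < technicians_per_period]
        if not candidates:
            raise ValueError(f"no technician available for {task_id}")
        # Lowest expected WIP first; on ties prefer the later period,
        # which avoids performing maintenance early.
        best = min(candidates, key=lambda p: (expected_wip[machine][p], -p))
        plan[task_id] = best
        used[best] = used.get(best, 0) + 1
    return plan

expected_wip = {"etch": [5, 2, 2, 9], "litho": [1, 7, 1, 1]}
tasks = [("pm1", "etch", 0, 3), ("pm2", "litho", 2, 3)]
plan = schedule_maintenance(tasks, expected_wip, technicians_per_period=1)
print(plan)
```

A production system would replace the greedy rule with an optimization or constraint-programming model so that business rules and multi-period tasks can be handled jointly.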
-
Anomaly Detection (Semiconductor Manufacturing) - Integrated Outlier Management in Tracer
[Diagram: two-step flow (Step I, Step II)]
- Information-theoretic outlier detection (entropy based)
- Comparison of the chamber of interest with the control band from the other chambers (Mean ± m*Std)
- Outlier detection for the chamber of interest (CUSUM-based method)
Objective: Exclude spurious values from score calculation.
Method: an integrated methodology consisting of an information-theoretic method and a statistical method, implemented in two steps.
Step I: The calculation is done in the context of the data from one chamber group, one recipe, one SVID, and both time periods.
Step II: The calculation is done in the context of the data from one chamber group, one recipe, one SVID, and a single time period (reference/current).
[Figure: SVID value over time for chamber i, with the UCL and LCL of the control band, the UCL_CUSUM limit, and the flagged outliers]
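The statistical part of the method (control band plus CUSUM) can be sketched as below. This is a generic textbook version, not the Tracer implementation: the entropy-based step is omitted, and the standardized tabular CUSUM with allowance `k` and decision limit `h` is an assumed parameterization.

```python
import statistics

def control_band(reference_values, m=3.0):
    """Return (LCL, UCL) = mean -/+ m * std of the other chambers' values."""
    mu = statistics.mean(reference_values)
    sd = statistics.stdev(reference_values)
    return mu - m * sd, mu + m * sd

def cusum_flags(values, target, sigma, k=0.5, h=5.0):
    """Two-sided tabular CUSUM on standardized values; True marks a flagged point."""
    upper = lower = 0.0
    flags = []
    for x in values:
        z = (x - target) / sigma
        upper = max(0.0, upper + z - k)   # accumulates upward drift
        lower = max(0.0, lower - z - k)   # accumulates downward drift
        flags.append(upper > h or lower > h)
    return flags

# Hypothetical trace values: other chambers define the band, chamber i is monitored.
other_chambers = [9.8, 10.0, 10.2, 10.1, 9.9, 10.0]
lcl, ucl = control_band(other_chambers)
chamber_i = [10.0] * 5 + [10.3] * 10      # small sustained shift after 5 points
flags = cusum_flags(chamber_i, target=10.0, sigma=0.1)
print(lcl, ucl, flags.index(True))
```

Note that the shifted values (10.3) stay inside the mean ± 3*std band, so only the cumulative statistic picks up the sustained shift.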
-
Process Monitoring (Semiconductor Manufacturing) - Hotelling's T-squared Control Chart
Objective: Design Hotelling's T-squared control charts for manufacturing tools.
Method: a complete procedure consisting of Phase-I design (initial study) and Phase-II design (process monitoring)
- Phase I: remove the outliers from the trace data collected from processes under normal conditions and calculate the in-control mean and covariance matrix;
- Phase II: build the control chart using the in-control mean and covariance matrix from Phase I to monitor the current processes.
[Figure: Hotelling's T-squared Control Chart. x-axis: Wafer Label (0 to 250); y-axis: T-squared Value. UCL types: UCL for Phase-I design and UCL for Phase-II design. Regions: Phase-I design, Phase-II design. Out-of-control points are marked.]
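The two phases can be illustrated with a minimal pure-Python sketch of the T-squared statistic, T² = (x − μ)ᵀ S⁻¹ (x − μ). It assumes the Phase-I outliers have already been removed, and it does not compute the UCL, which normally comes from an F or beta distribution; the function names and data are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def covariance_matrix(rows, mean):
    """Sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def t_squared(x, mean, cov):
    d = [xi - mi for xi, mi in zip(x, mean)]
    return sum(di * si for di, si in zip(d, solve(cov, d)))

# Phase I: estimate the in-control mean and covariance (made-up, outlier-free data).
phase1 = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
mu = mean_vector(phase1)
cov = covariance_matrix(phase1, mu)
# Phase II: score new wafers; compare each score with a UCL chosen separately.
along = t_squared((4.0, 4.0), mu, cov)     # follows the learned correlation
against = t_squared((4.0, 1.0), mu, cov)   # violates the learned correlation
print(along, against)
```

Both test points are equally far from the mean in Euclidean terms, but the one that breaks the correlation structure receives the larger T-squared value, which is the reason for using a multivariate chart.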
-
Process Monitoring and Quality Control (Semiconductor Manufacturing) - Motivation for virtual metrology applications
Virtual metrology (VM) generally refers to a model-based prediction of some process outcome when there is no physical measurement of that outcome
Predictive modeling: The underlying models are learned from histories of the actual physical outcomes and process trace data
Benefits:
- Detect faulty wafers early
- Improve process control: from lot-to-lot to wafer-to-wafer level
- Reduce physical measurements for process monitoring and control
Tools publish large amounts of real-time data:
- Throttle valve positions
- Electric bias, impedance, etc.
- Gas flows
- Temperature & pressure
Can we use the data for process control?
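As a toy illustration of the idea, the sketch below learns a one-variable linear model from wafers that were physically measured and then predicts the outcome for an unmeasured wafer. Real VM models use many trace variables and richer learners; the single feature, the data, and the function names here are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def virtual_measurement(model, trace_feature):
    """Predict the process outcome for a wafer that was not physically measured."""
    slope, intercept = model
    return slope * trace_feature + intercept

# History: a trace summary (e.g. mean gas flow) and the measured outcome per wafer.
trace_history = [1.0, 2.0, 3.0, 4.0]
measured_outcome = [3.0, 5.0, 7.0, 9.0]
model = fit_line(trace_history, measured_outcome)
predicted = virtual_measurement(model, 5.0)
print(model, predicted)
```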
-
Process Monitoring and Quality Control (Semiconductor Manufacturing) - Performance of VM-enhanced process control
Simulation results for a given set of parameters:
VM-EWMA: reduces process variance by around 70%
VM-LM: reduces process variance by around 30%
Given a target process variance, e.g. 0.03, we can reduce the measurement frequency
VM-LM: from 1 out of 6 wafers to 1 out of 19 wafers
VM-EWMA: from 1 out of 6 wafers to 1 out of 94 wafers
[Figure: variance of process outcomes (0 to 0.08) vs. wafer index (0 to 150), with curves for LM, VM-LM, and VM-EWMA]
-
Power plant monitoring based on ANACONDA
Business goal: Early anomaly detection to avoid emergency stops of the system
Technical task: Detect anomalously behaving modules by comparing with the previous normally working state
# of sensors ~ 100
Technical hurdle: Naive thresholding of individual sensors is hard because the system frequently changes its operational mode
Result: Detected about 60% of the serious faults that cannot be detected with conventional methods
ANACONDA captures the interdependency pattern between variables and detects deviations from the normal pattern
[Figure: example of detected faults; air flow rate vs. intake pressure, normal and faulty]
-
IBM Anomaly Analyzer for Correlational Data (ANACONDA) leverages a unique dependency-based anomaly detection technology
ANACONDA monitors the dependency among variables
- Setting a fixed threshold on individual variables leads to many false alerts for dynamic systems
ANACONDA computes the anomaly score for individual variables
- Learns dependency patterns from past data under a normal condition
- An alert is raised if the present dependency is significantly different from the normal pattern
[Figure: unusual change in dependency]
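The dependency-based idea can be illustrated with a two-sensor sketch. It is not ANACONDA's algorithm: it assumes a simple linear dependency learned from normal data and scores a new reading by its standardized residual. The sensor names, data, and alert threshold are hypothetical.

```python
import statistics

def learn_dependency(x_normal, y_normal):
    """Fit y = slope * x + intercept on normal data and record the residual spread."""
    mx, my = statistics.mean(x_normal), statistics.mean(y_normal)
    sxx = sum((x - mx) ** 2 for x in x_normal)
    slope = sum((x - mx) * (y - my) for x, y in zip(x_normal, y_normal)) / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(x_normal, y_normal)]
    return slope, intercept, statistics.pstdev(residuals)

def anomaly_score(model, x, y):
    """Deviation from the learned dependency, in units of the normal residual spread."""
    slope, intercept, resid_sd = model
    return abs(y - (slope * x + intercept)) / resid_sd

# Normal operation: intake pressure tracks air flow rate across operating modes.
air_flow = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
intake_pressure = [2.1, 3.9, 6.1, 7.9, 10.1, 11.9]
model = learn_dependency(air_flow, intake_pressure)

normal_score = anomaly_score(model, 3.0, 6.0)
# Each value alone is inside its normal range, but the pair breaks the dependency.
faulty_score = anomaly_score(model, 5.0, 4.0)
ALERT_THRESHOLD = 3.0  # hypothetical
print(normal_score, faulty_score)
```

The faulty reading would pass fixed per-sensor thresholds, since 5.0 and 4.0 both occur during normal operation, yet its dependency score is far above the alert level.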
-
Dependency discovery is a key technology
ANACONDA leverages a sparse structure learning technique for dependency discovery
Automatically discovers important dependencies among sensors
Dependency is identified by building sensor-wise predictive models
[Figure: sensor-wise predictive models over Sensor1 to Sensor6, built for each sensor in turn (Sensor1, Sensor2, ...) from the other sensors; repeated until convergence]
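One common way to build sparse sensor-wise predictive models is L1-penalized regression of each sensor on all the others (lasso neighborhood selection). The sketch below uses that technique as an assumed stand-in; the slides do not state which sparse learner ANACONDA uses. It assumes zero-mean sensor series, and the data and penalty value are hypothetical.

```python
def soft_threshold(value, lam):
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(columns, target, lam, tol=1e-10, max_sweeps=1000):
    """Minimize (1/2n) * ||target - sum_j b_j * columns[j]||^2 + lam * sum_j |b_j|."""
    n = len(target)
    coefs = [0.0] * len(columns)
    for _ in range(max_sweeps):
        largest_change = 0.0
        for j, col in enumerate(columns):
            # Partial residual: target minus the contribution of all other predictors.
            partial = [
                target[i] - sum(coefs[k] * columns[k][i]
                                for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            rho = sum(c * r for c, r in zip(col, partial)) / n
            scale = sum(c * c for c in col) / n
            new = soft_threshold(rho, lam) / scale
            largest_change = max(largest_change, abs(new - coefs[j]))
            coefs[j] = new
        if largest_change < tol:  # sweeps are repeated until convergence
            break
    return coefs

def discover_dependencies(sensors, lam):
    """Fit one sparse model per sensor; nonzero coefficients mark its dependencies."""
    names = list(sensors)
    graph = {}
    for name in names:
        others = [o for o in names if o != name]
        coefs = lasso_coordinate_descent([sensors[o] for o in others], sensors[name], lam)
        graph[name] = {o: c for o, c in zip(others, coefs) if c != 0.0}
    return graph

# Made-up zero-mean series: Sensor2 follows Sensor1, Sensor3 is unrelated.
s1 = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
s3 = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
sensors = {"Sensor1": s1, "Sensor2": [2.0 * v for v in s1], "Sensor3": s3}
graph = discover_dependencies(sensors, lam=0.1)
print(graph)
```

The L1 penalty drives the coefficients of unrelated sensors exactly to zero, so the surviving nonzero entries form a sparse dependency graph.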
-
VoC FAQ
INBOUND
OUTBOUND
IBM
Healthcare advisor Engagement advisor
, Q&A
FAQ
-
http://www.ibm.com/smarterplanet/kr/ko/overview/ideas/index.html
Smarter Planet
-
Part 2
Data, Data Science, Big Data, Analytics
-
Data = Digitalization of All Things
Text
Number
Sound Signal
Image
(, , , ) Amount (, , ), , DNA,
Data Type Form/Meaning
Video
Transformed
SNS, , , WEB, , ,
, , , , , ,
Number
Number
Number
, , , ,
Number
Number + Text WEB LOG, , , , , , Number
, , , ,
, CCTV, UCC,
Feature , ,
, , , ,
Text
Feature , ,
-
Data Science = Handling of Digital Information
-
Data Scientist of Korea = Group of Specialists
IT System
DB
(R)
System
IT Architect, IT Outsourcing
-
Data Scientist of Big Data
-
Big Data
-
Predictive analytics at the heart of the enterprise
[Figure: layered diagram with the following labels]
- LOB 1, LOB 2, LOB 3
- Customer Interactions
- Corporate Goals: Risk, Retain, Grow, Attract, Fraud
- Channels
- Moments of Truth: I buy, I renew, I claim, I mend, I cancel
- Business Processes: Customer Support, Claims Processing, Underwriting, Fraud Management, Sales Effectiveness, Marketing
- Optimized Business Processes: Customer Support, Claims Processing, Underwriting, Fraud Management, Sales Effectiveness, Marketing
- Analytical Foresight: Claims Profile, Fraud Risk, Customer LTV, Retention Risk, Best Offers, Customer Experience, Optimal Campaigns, Risk Assessment
- Platform: Data Mining & Statistics, Decision Optimization, Data Collection, Base Services, Visualization
- Data: Attitudinal Data, Interaction Data, Behavioral Data, Demographic Data, Customer Feedback
-
Visualization & Discovery Integration
Workload Optimization Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime / Scheduler
Advanced Analytic Engines
File System
MapReduce
HDFS
Data Store HBase
(Text Processing Engine & Extractor Library)
BigSheets JDBC
Applications & Development
Text Analytics MapReduce
Pig & Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard & Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit & History
Lineage
R
Guardium
Platform Computing
Cognos
IBM Open Source
Symphony
GPFS FPO
Optional
Symphony AE
The IBM Big Data Platform Big Data
-