MLconf NYC Ted Willke
Post on 17-Oct-2014
![Page 1: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/1.jpg)
CONTEXT
SEMANTICS!
![Page 2: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/2.jpg)
Danny : isBrotherOf : Nezih
food cart : uses : bicycles
Frank : isFriendsWith : Mohit
Frank : isFriendsWith : Ted
Frank : likes : bicycles
Frank : likes : food carts
Ivy : isFriendsWith : Kushal
Ivy : isFriendsWith : Ted
Ivy : likes : bicycles
Ivy : likes : food carts
Kushal : isFriendsWith : Mohit
Kushal : isFriendsWith : Nezih
Nezih : isFriendsWith : Ted
Ted : likes : bicycles
![Page 3: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/3.jpg)
This model... infers this interest.
[Figure: graph of the triples above. Ted, Kushal, Mohit, Danny, Ivy, Frank, and Nezih are joined by "friends" and "brothers" edges; "likes" edges connect people to FoodCart and Bicycles; a "uses" edge joins FoodCart and Bicycles; one "Likes?" edge is left to be inferred.]
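As a rough illustration of how a graph model could infer an interest from these triples, the sketch below scores an unknown "likes" edge by the fraction of a person's friends who already like the item. The scoring rule, the function names, and the choice of Ted and food carts as the open question are assumptions made for the example; the slides do not say which inference method or which edge is meant.

```python
from collections import defaultdict

# Triples copied from the slide as (subject, predicate, object).
TRIPLES = [
    ("Danny", "isBrotherOf", "Nezih"),
    ("food cart", "uses", "bicycles"),
    ("Frank", "isFriendsWith", "Mohit"),
    ("Frank", "isFriendsWith", "Ted"),
    ("Frank", "likes", "bicycles"),
    ("Frank", "likes", "food carts"),
    ("Ivy", "isFriendsWith", "Kushal"),
    ("Ivy", "isFriendsWith", "Ted"),
    ("Ivy", "likes", "bicycles"),
    ("Ivy", "likes", "food carts"),
    ("Kushal", "isFriendsWith", "Mohit"),
    ("Kushal", "isFriendsWith", "Nezih"),
    ("Nezih", "isFriendsWith", "Ted"),
    ("Ted", "likes", "bicycles"),
]


def build_index(triples):
    """Index friendships (treated as symmetric) and likes by person."""
    friends = defaultdict(set)
    likes = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "isFriendsWith":
            friends[subject].add(obj)
            friends[obj].add(subject)
        elif predicate == "likes":
            likes[subject].add(obj)
    return friends, likes


def interest_score(person, item, friends, likes):
    """Hypothetical rule: share of the person's friends who like the item."""
    circle = friends[person]
    if not circle:
        return 0.0
    return sum(1 for friend in circle if item in likes[friend]) / len(circle)


friends, likes = build_index(TRIPLES)
score = interest_score("Ted", "food carts", friends, likes)
print(f"Ted likes food carts? score = {score:.2f}")
```

A friend-vote rule like this ignores the "uses" and "isBrotherOf" edges entirely; a structured-prediction method would be able to use them as well.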
![Page 4: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/4.jpg)
Virtuous cycle of data
CLOUD
Richer data to analyze
CLIENTS
Richer data from devices
Richer user experiences
INTELLIGENT SYSTEMS
![Page 5: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/5.jpg)
SEMANTIC INFORMATION IS FUEL FOR THE CYCLE
![Page 6: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/6.jpg)
1985 1995 2005 2015
enterprise NoSQL
Docs + Semantics
RDF
WIDESPREAD MACHINE LEARNING ON THIS
![Page 7: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/7.jpg)
IMAGINE THE POSSIBILITIES
![Page 8: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/8.jpg)
Graph centrality
Program Importance (Centrality): High to Low
Graph of channel viewing behavior
Current popular surfing patterns
SH002463130000 EP005544723744
Changes in surfing behavior may predict customer churn.
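The slide does not say which centrality measure ranks the programs. As one possible reading, the sketch below runs a plain PageRank power iteration over channel-surfing transitions. The two program IDs come from the slide; the `HYPO_*` IDs, the edge list, and the damping value are made up for illustration.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """PageRank by power iteration over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        next_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node with no outgoing edges spreads its rank evenly.
            targets = out_links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                next_rank[target] += share
        rank = next_rank
    return rank


# Hypothetical surfing transitions: viewer switched from the first program to the second.
SURFING = [
    ("SH002463130000", "EP005544723744"),
    ("HYPO_PROGRAM_A", "EP005544723744"),
    ("HYPO_PROGRAM_B", "EP005544723744"),
    ("EP005544723744", "SH002463130000"),
]

ranks = pagerank(SURFING)
for program, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(program, round(value, 4))
```

Recomputing the ranking over successive time windows and comparing the results is one way the "changes in surfing behavior" mentioned on the slide could be measured.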
![Page 9: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/9.jpg)
Preference and Similarity Recommendations
Vertex types: User, Movie
1.7MM Nodes, 23.9MM Edges
Edge types: prefers, similar cast, similar topic
userId: A0A22A5
title: The Godfather, genre: Crime drama, cast: [M. Brando, Al Pacino]
title: Scarface, genre: Crime drama, cast: [Al Pacino, M. Pfeiffer]
title: The Departed, genre: Crime drama, cast: [L. DiCaprio, M. Damon]
Edge weights shown: 11.8, 0.67, 0.03, 14.98
Min-cost path search
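A min-cost path search over a preference and similarity graph can be sketched with Dijkstra's algorithm. The transcript does not show which weight belongs to which edge, so the assignment of the four slide weights to edges below is hypothetical, as is the `user:` prefix on the user vertex.

```python
import heapq


def min_cost_path(graph, source, target):
    """Dijkstra's algorithm; graph maps a node to a list of (neighbor, cost)."""
    best = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[target], path[::-1]


# Hypothetical edge/weight pairing; only the four numbers appear on the slide.
GRAPH = {
    "user:A0A22A5": [("The Godfather", 0.03)],                       # prefers
    "The Godfather": [("Scarface", 0.67), ("The Departed", 11.8)],   # similar cast / topic
    "Scarface": [("The Departed", 14.98)],                           # similar topic
}

cost, path = min_cost_path(GRAPH, "user:A0A22A5", "The Departed")
print(round(cost, 2), path)
```

Treating low cost as high affinity means the cheapest path from a user to an unseen movie doubles as a recommendation score; whether the slide's weights are costs or affinities is not stated.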
![Page 10: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/10.jpg)
Loopy Belief Propagation on the (semantic) web
URL Ground-Truth Data
IP/Domain Reputations
420MM Records
74.5MM Nodes, 185MM Edges
Vertex types: URL, Domain, IP Address
Steps: Calculation of priors, LBP Messaging
Example vertices: 84.231.82.93, 86.39.155.137, forum.vsichko.com, hermansonskok.se, euskzzbz.nonetheups.com, keesenbep.spaces.live.com
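The two steps named on this slide, calculating priors and passing LBP messages, can be sketched on a toy graph. This is a minimal sum-product loopy belief propagation with two states (0 = benign, 1 = malicious) and a single "neighbors tend to share a label" edge potential. The vertex names, the prior, and the potential value are all hypothetical and say nothing about the hosts listed on the slide.

```python
def loopy_bp(edges, priors, same=0.9, iterations=20):
    """Sum-product loopy BP on an undirected graph with binary states."""
    states = (0, 1)
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Step 1: priors. Vertices without ground truth start uniform.
    prior = {node: priors.get(node, (0.5, 0.5)) for node in neighbors}

    def potential(x, y):
        return same if x == y else 1.0 - same

    # Step 2: messaging. messages[(i, j)] is what i tells j about j's state.
    messages = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        updated = {}
        for i, j in messages:
            outgoing = []
            for xj in states:
                total = 0.0
                for xi in states:
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= messages[(k, i)][xi]
                    total += prior[i][xi] * potential(xi, xj) * incoming
                outgoing.append(total)
            norm = sum(outgoing)
            updated[(i, j)] = (outgoing[0] / norm, outgoing[1] / norm)
        messages = updated  # synchronous update

    beliefs = {}
    for node in neighbors:
        belief = [prior[node][x] for x in states]
        for k in neighbors[node]:
            for x in states:
                belief[x] *= messages[(k, node)][x]
        norm = sum(belief)
        beliefs[node] = (belief[0] / norm, belief[1] / norm)
    return beliefs


# Hypothetical graph with one cycle: two domains that both resolve to two IPs.
EDGES = [
    ("url-1", "domain-a"), ("url-2", "domain-b"),
    ("domain-a", "ip-1"), ("domain-a", "ip-2"),
    ("domain-b", "ip-1"), ("domain-b", "ip-2"),
]
PRIORS = {"url-1": (0.1, 0.9)}  # ground truth says url-1 is probably malicious

beliefs = loopy_bp(EDGES, PRIORS)
for node in sorted(beliefs):
    print(node, round(beliefs[node][1], 3))
```

On graphs with cycles the resulting beliefs are approximations rather than exact marginals, which is the usual trade-off accepted when running belief propagation at this scale.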
![Page 11: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/11.jpg)
Loopy Belief Propagation on the (semantic) web
![Page 12: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/12.jpg)
A yoga ball graph.
Really!?!
![Page 13: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/13.jpg)
You may actually need this
• When the problem is an information network
• When a graph is a natural way of expressing the algorithm
• When you want to study specific relationships
• When you want faster machine learning or solvers on sparse data
shortest path
central influence
sub networks
triangle count
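Of the four graph operations named on this slide, triangle counting is the shortest to write down, so it serves here as a small example of an algorithm that is naturally expressed on a graph. The edge list is hypothetical.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph given as (a, b) edge pairs."""
    neighbors = {}
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    closed_pairs = 0
    for adjacent in neighbors.values():
        # A triangle exists wherever two neighbors of a vertex are also linked.
        for u, v in combinations(adjacent, 2):
            if v in neighbors[u]:
                closed_pairs += 1
    # Each triangle is seen once from each of its three corners.
    return closed_pairs // 3


# Hypothetical friendships: one triangle (a, b, c) plus a pendant vertex d.
EXAMPLE_EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_triangles(EXAMPLE_EDGES))
```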
![Page 14: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/14.jpg)
But there are challenges.
Handling all that data.
Finding people good at both handling all that data and data analysis.
Putting exploratory work into production fast enough to keep up with the competition.
![Page 15: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/15.jpg)
Congratulations! You are a data scientist!
![Page 16: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/16.jpg)
It’s a demanding job
Ingest & Clean
Engineer Features
Structure Model
Train Model
Query & Analyze
Learn
Visualize
Skills shortage at intersection of systems engineering and data analysis
Painful data ingestion and preparation
Workflows that are not designed with loopbacks in mind
Few tools for analyzing semantics at scale
Composing pipeline is DIY
![Page 17: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/17.jpg)
Decomposing the “data scientist”
Source: 2013 Report from Accenture Institute for High Performance
![Page 18: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/18.jpg)
IMAGINE A PLATFORM FOR DATA SCIENTISTS
DOCS + SEMANTICS + MACHINE LEARNING
![Page 19: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/19.jpg)
Ease-of-use: Making big data familiar
[Diagram: client, network, and datacenter/cloud tiers joined by a BIG DATA API]
Client front ends: Python, R, Dataflow GUI, ...
Client functions: Connect, Manage, Secure, Analyze (local), Query
Datacenter / Cloud functions: Connect, Manage, Secure, Analyze (distributed and parallel)
Analyst Skills: Machine Learning & Statistics, Data Wrangling
The Other Skills: Big Data Java/Scala/C++ Computational Frameworks, Big Data Algorithms, Cluster Workload Mgmt, Cluster Storage
![Page 20: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/20.jpg)
Delivering it
[Diagram: Intel Analytics Toolkit stack]
DATA SCIENCE SERVER (Query and Scripting)
BIG DATA API
DATA WRANGLING: Graph Construction Tools, Useful String Manipulation, Useful Math Operators
MACHINE LEARNING AND STATISTICS: Graphical Algorithms, Classical Algorithms
A UNIFIED DOCUMENT + SEMANTIC STORE
APACHE HADOOP, APACHE SPARK
FILESYSTEMS AND NOSQL STORAGE
HW PLATFORM
The Ask
![Page 21: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/21.jpg)
Bringing a full spectrum of possibilities

| Algorithm | Category | Applications/Use Cases |
| --- | --- | --- |
| Loopy Belief Propagation (LBP) | Structured Prediction | Personalized recs, image de-noising |
| Label Propagation | Structured Prediction | Personalized recommendations |
| Alternating Least Squares (ALS) | Collaborative Filtering | Recommenders |
| Conjugate Gradient Descent (CGD) | Collaborative Filtering | Recommenders |
| Connected Components | Graph Analytics | Network manipulation, image analysis |
| Latent Dirichlet Allocation (LDA) | Topic Modeling | Document Clustering |
| Structure Attribute | Clustering | Network analysis, consumer seg |
| K-Truss | Clustering | Social network analysis |
| KNN* | Clustering | Recommenders |
| Logistic Regression* | Classification | Fraud detection |
| Random Forest* | Classification | Fraud detection, consumer seg |
| Generalized Linear Model (Binomial, Poisson) | Non-linear Curve Fitting | Forecasting, pricing, market mix models |
| Association Rule Mining | Data Mining | Market basket analysis, recommenders |
| Frequent Pattern Mining* | Data Mining | Pattern Recognition |

Approach (row-group label): Graph
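To make one row of the algorithm list concrete, here is a minimal sketch of Connected Components using breadth-first search on an in-memory edge list. It only illustrates the idea; nothing here reflects how a distributed toolkit would implement it, and the edge list is hypothetical.

```python
from collections import deque


def connected_components(edges):
    """Return the connected components of an undirected graph as sets."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return components


# Hypothetical network with two separate groups.
NETWORK = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(NETWORK)
print(sorted(sorted(group) for group in components))
```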
![Page 22: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/22.jpg)
Article Tagging Problem
• Articles are tagged by experts with MeSH terms, drawn from a hierarchical controlled vocabulary of 55,000 keywords
• Process is resource-intensive – can we automate it?
• Categorize articles into a hierarchy that matches the same categorization from the MeSH controlled vocabulary
![Page 23: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/23.jpg)
[Chart: Article Count by Hierarchy Level]
![Page 24: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/24.jpg)
Demo: Graph Analytics For Medical Journal Analysis
Workflow stages: INGEST & CLEAN, ENGINEER FEATURES, STRUCTURE GRAPH, QUERY & ANALYZE, LEARN, VISUALIZE

PARSE AND EXTRACT WORDS
• Medline™ XML
• MeSH Ontology XML
• Create list of unique words
• Stemming and lemmatization

CREATE ARTICLE/WORD LIST
• Index word list
• Transform articles into list of article/word pairs

BUILD GRAPH
• Extract vertices
• Assign id columns to vertex property
• Assign year and count edge properties

QUERY/VISUALIZE DATA
• Gremlin query for each visual
• Python web server and other libraries

DETECT CLUSTERS USING LDA
• Select optimization parameters
• Invoke LDA
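The first two demo steps (extract and normalize words, index the word list, and turn articles into article/word pairs) can be sketched in a few lines. The article texts, the function names, and especially the crude suffix-stripping "stemmer" are placeholders; the demo's actual parser and stemming tools are not named in the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(word):
    """Hypothetical stand-in for real stemming/lemmatization."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def article_word_pairs(articles):
    """Return (word_index, pairs); pairs are (article_id, word_id, count)."""
    counts = {
        article_id: Counter(crude_stem(word) for word in tokenize(text))
        for article_id, text in articles.items()
    }
    vocabulary = sorted({word for counter in counts.values() for word in counter})
    word_index = {word: i for i, word in enumerate(vocabulary)}
    pairs = [
        (article_id, word_index[word], count)
        for article_id, counter in sorted(counts.items())
        for word, count in sorted(counter.items())
    ]
    return word_index, pairs


# Hypothetical article snippets keyed by a made-up article id.
ARTICLES = {
    "a1": "Sleep stages and dreams",
    "a2": "Dreaming rats sleep",
}

word_index, pairs = article_word_pairs(ARTICLES)
print(word_index)
print(pairs)
```

Each pair maps directly onto the graph-building step: the article and the word become vertices, and the count becomes an edge property.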
![Page 25: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/25.jpg)
The Playbook?
PARSE AND EXTRACT WORDS
CREATE ARTICLE/WORD LIST
BUILD GRAPH
QUERY/VISUALIZE DATA
DETECT CLUSTERS USING LDA
Parse
Prepare graph data
Basic analysis
Run LDA
INSIGHTFUL RESULT
This never happens!
![Page 26: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/26.jpg)
The Real Playbook
PARSE AND EXTRACT WORDS
CREATE ARTICLE/WORD LIST
BUILD GRAPH
QUERY/VISUALIZE DATA
DETECT CLUSTERS USING LDA
Parse
Correct mistake
Prepare graph data
Correct schema mistake
Correct aggregation mistake
Data validation
Correct dataset mistake
Guess LDA settings
Tune and re-run
Detect bias in dataset
![Page 27: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/27.jpg)
WE NEED THE AGILITY OF INTERACTIVE SCRIPTING AND THE BRAINS AND BRAWN OF SCALABLE GRAPH ANALYTICS
![Page 28: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/28.jpg)
Build Frame
![Page 29: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/29.jpg)
Build Graph
![Page 30: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/30.jpg)
Query Vertices
![Page 31: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/31.jpg)
LDA with 3 Topics
![Page 32: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/32.jpg)
LDA with 5 Topics
![Page 33: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/33.jpg)
LDA with 7 Topics
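Re-running LDA with 3, 5, and 7 topics is the kind of loop the interactive workflow is meant to make cheap. The sketch below uses a small collapsed Gibbs sampler, chosen only because it fits in a few lines; the slides do not say which LDA inference method the toolkit uses, and the documents, hyperparameters, and function name are hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids.

    Returns one topic distribution per document.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            topic = rng.randrange(num_topics)
            row.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(row)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token, resample its topic, then add it back.
                topic = assignments[d][i]
                doc_topic[d][topic] -= 1
                topic_word[topic][word] -= 1
                topic_total[topic] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = topic
                doc_topic[d][topic] += 1
                topic_word[topic][word] += 1
                topic_total[topic] += 1

    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]


# Hypothetical corpus: each document is a list of word ids from a 6-word vocabulary.
DOCS = [[0, 1, 0, 2], [1, 0, 2, 2], [3, 4, 5, 4], [5, 3, 4, 3]]

results = {k: lda_gibbs(DOCS, num_topics=k, vocab_size=6) for k in (3, 5, 7)}
for k, theta in results.items():
    print(k, [round(p, 2) for p in theta[0]])
```

Comparing the per-document topic mixes across the three runs is a simple way to judge whether a larger topic count is adding structure or just splitting existing clusters.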
![Page 34: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/34.jpg)
Query Vertices Again – Now with ML Properties
![Page 35: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/35.jpg)
Following Analysis
[Bar chart: Top MeSH terms that predict which category an article will be assigned; horizontal axis from 0 to 0.08]
Terms shown: Wakefulness, Sleep, Animals, Electroencephalography, Circadian Rhythm, Arousal, Sleep Stages, REM, Mental Recall, Attention, Rats, Child, Evoked Potentials, Aged, Schizophrenia, Ocular, Conditioning, Infant, Psychophysics, Dreams
![Page 36: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/36.jpg)
Reimagining 2014
New partnerships in big data
Contributions to the open source community
The Intel Analytics Toolkit – COMING SOON
SEMANTICS + MACHINE LEARNING
TOGETHER AT LAST!
![Page 37: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/37.jpg)
INTERESTED IN THE INTEL ANALYTICS TOOLKIT? [email protected]
![Page 38: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/38.jpg)
![Page 39: MLconf NYC Ted Willke](https://reader033.vdocuments.net/reader033/viewer/2022061105/54412731afaf9f62208b460b/html5/thumbnails/39.jpg)
Legal Disclaimers
All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor_number
Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM). Functionality, performance or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For more information, visit http://www.intel.com/go/virtualization
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment (MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/security
Intel, Intel Xeon, Intel Atom, Intel Xeon Phi, Intel Itanium, the Intel Itanium logo, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Other names and brands may be claimed as the property of others.
Copyright © 2013, Intel Corporation. All rights reserved.