Hadoop User Group, 29 Jan 2015: Apache Flink / Haven / Capgemini REX
TRANSCRIPT
© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Hadoop User Group, 29 January 2015
Agenda
#1: Processing unstructured data (videos, images, …) with Haven for Hadoop
#2: Apache Flink: Fast and Reliable Large-scale Data Processing
#3: Case study: a Hadoop project in the HR domain with Capgemini.
Document vectorization: making unstructured information comparable, new opportunities for an employment-sector player
21:00: Cocktail reception
Hadoop User Group
Haven: analyzing 100% of information
Frédéric Demongeot – EMEA Subject Matter Expert
29 January 2015
Copyright © 2015 Autonomy Inc., an HP Company. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners.
Big Data landscape
Categories: Human Information, Machine Data, Business Data
Share of information: 10% / 90%
Annual growth: ~100% / ~10%
Haven – Big Data platform
Data sources: Social media, IT/OT, Images, Audio, Video, Transactional data, Mobile, Search engine, Email, Texts, Documents
• Hadoop/HDFS: Catalog massive volumes of distributed data
• Autonomy IDOL: Process and index all information
• Vertica: Analyze at extreme scale in real-time
• Enterprise Security: Collect & unify machine data
• nApps: Powering HP Software + your apps
hp.com/haven
A few words on Vertica
Leverage existing tools in shared Vertica and Hadoop storage environment
Hadoop components: HDFS, Hive, Pig, MapReduce, HBase
Vertica access and integration: ANSI SQL, External Tables, Flex Tables, Hive Integration (HCatalog), webHDFS, webHCAT, Copy, Storage Tiering
Example data: Click Stream, Web Session Data
Vertica SQL on Hadoop
Autonomy Haven Examples
NASCAR
“Autonomy, with the power of its IDOL engine, takes fan data, collects, it stores, and stitches it together… that helps us understand what is being talked about across the ecosystem of the sport.” - Senior Director of IT, NASCAR
HAVEn Solution:
• Autonomy IDOL + Autonomy Explore + HP Enterprise Services
Results:
• Wildly successful NASCAR Fan and Media Engagement Center collects and aggregates fan information
• Understands sentiment, identifies emerging issues, and uncovers trends that help the NASCAR team share and enrich the fan and broadcast experience
Car Manufacturer
Use Case: Brand monitoring
• Have one single Big Data platform they leverage for multiple projects
• Customer brand and partner analytics
• Connected vehicles
• Aggregate multiple data sources and store all on HDFS, such as:
• Social media
• YouTube rich media
• Internal data sources, such as CRM, car logs
• Rich media analysis - logo recognition, face recognition, speech to text, etc.
• Sentiment Analysis
HAVEn technology
• Autonomy
• Hadoop
• nApps – Integration of Haven technologies
BlaBlaCar
Improving a ride-sharing community marketplace
Challenge
• Marketing campaigns and web experience for 10M+
members in 12 countries limited by infrequent and slow
data analysis
Solution
• HP Vertica Analytics Platform
• Cloudera Distributed Hadoop, Tableau, & Data Science
Studio
Result
• Optimized performance of CRM campaigns: program
development improved experience for 2M customers per
month
• Refined targeted marketing by integrating social media
and predicting customer behavior through pattern
recognition
The Challenge of Human Information
What is Human Information
Information that is created by people and understood by people
Why is human information different?
Human Information is made up of ideas, is diverse, and has context.
Strong Information & Weak Information
Key words are small amounts of very strong information without context.
“Mercury”: Is it a planet? Is it an element? Is it a car?
Larger amounts of weaker information are what humans refer to as “context”.
“A heavy element and the only metal that is liquid at standard conditions for temperature and pressure with the symbol Hg and atomic number 80, commonly known as quicksilver”
With high certainty, it’s an element!
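The "Mercury" example can be illustrated with a very small sketch: each candidate sense is scored by how many of its typical context words appear in the surrounding text. This is not how IDOL works internally (the talk only says it uses proprietary probabilistic modeling); the sense profiles and the function name `disambiguate` are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical context-word profiles for each sense of "Mercury".
# These word lists are illustrative assumptions, not taken from the talk.
SENSES = {
    "planet": {"orbit", "sun", "solar", "planet"},
    "element": {"metal", "liquid", "atomic", "symbol", "element", "hg"},
    "car": {"engine", "sedan", "dealer", "car"},
}

def disambiguate(context: str) -> str:
    """Return the sense whose profile shares the most words with the context."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {sense: len(words & profile) for sense, profile in SENSES.items()}
    return max(scores, key=scores.get)
```

Many weak signals ("metal", "liquid", "atomic", ...) add up to a confident decision, which is the point the slide makes about context.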
How does HP IDOL approach Human Information?
Using Adaptive Probabilistic Concept Modeling
• Adaptive: techniques that provide continuous learning based on context.
• Probabilistic: techniques that deal in a scalable manner with the subtlety of the real world.
• Concept Modeling: techniques that inform the importance of patterns found in data.
This proprietary combination of mathematical techniques models Human Information and is:
Automatic, Fast, Data agnostic, Language independent, Scalable, Accurate, Dynamic, Real-time, Voice & Video
IDOL - OS of Human Information
The OS of Human Information
IDOL (Intelligent Data Operating Layer) provides an interface for accessing human information from all sources and of all types, and provides common services that make that information available to the applications that need it, in order to:
• Connect: securely to any source of information through a single cross-enterprise interface layer;
• Understand: the meaning, concepts and key attributes in all types of human-friendly information, including documents, emails, databases, clickstreams, audio, social and rich media, etc.
• Act & Automate: Inquire, Investigate, Interact and Improve quickly, correctly and compliantly based on a holistic view of information, market conditions and social trends.
IDOL: the OS for human information
• Mathematically based
• 15 years and over $280M in R&D
• 170+ Patents
• Language independent
• Built for infrastructure
• All file types, all media types (voice/video)
• Scalable and with security
• Platform/OS /device agnostic
• Managed in place
The World as Services
Inquire: “Search your data”
When you have criteria or an object that form a question, Inquire functions allow you to return results that answer that question. They let you sift through large quantities of data to find specific documents that relate to your question or an area of interest.
Investigate: “Analyze your data”
Investigate functions allow you to use information contained in the results of an inquiry to analyse those results. The analysis might provide insights that allow you to improve your inquiry, or it might provide more general information about your content.
Interact: “Personalize your data”
Interact functions use information with affinity to the user to create conceptual Profiles and Agents that reflect the user’s information needs, which in turn can be used to power other functions.
Improve: “Enhance your data”
Improve functions enhance your information with more details that help with the Inquire and Investigate functions. These functions allow you to add information to data of any type, be it audio, video or text, that makes it easier to search and retrieve information, or to identify key features of your content.
Analyse your Data
What insights does our Human Information hold?
Is there structure that I can use to navigate the data?
Expose Concepts and Patterns
Help me evaluate the information quickly
Intelligent Summarization (simple, concept and context)
Intelligent Highlighting (search terms, phrases, concepts, context, fidelity to query grammar)
Concept Streaming (Real time summaries from Audio, contextual to queries and intent)
Intelligent Results de-duplication including “near” de-duplication
Structured, Semi-structured & XML support
Parametric Searching (unlimited nesting and association support)
Directed Navigation (create compelling navigation for users)
Structured Refinement
Automatic Query Guidance (providing top themes from query results in real time)
Concept Navigation via advanced visualizations (node graphs, theme tracking, broadcast analysis)
Language Independent
Visualization of main topics
Enhance your Data
Human Information is rich in features that can, when identified, enhance our analysis
Automatic Classification or Clustering
• Automatically determine categories based on patterns and relationships in Human Information
• Spot analysis of all themes and grouping within Human Information at any moment in time
• Time sensitive analysis: What’s hot? What’s new?
Supervised Classification
• Create categories using business rules or training and classify information into those categories
Eduction and Entity Extraction
• Extract features and determine characteristics in Human Information
• Names, Addresses, Credit Card Information, Sentiment, Intent…
Audio Analysis
• Extract features of Audio information
• Speaker independent speech to text, speaker identification, audio events, language identification…
Image and Video Analysis
• Extract features from Video information
• Next generation image classification (is this a car? / find more like “this”)
• On-screen OCR, logo detection, intelligent scene analysis, colour and texture analysis, story segmentation…
Eduction
Quickly narrow search results with auto-identified facets and conceptual entities such as employee names from documents
Validate or customize entities
• Is this a valid credit card number?
• What are all docs that contain SSNs?
• If area code is 415, output as Home Office
Pinpoint accuracy for multibyte languages such as CJK, Thai and some European languages
Hundreds of conceptual entities: Names, Places, IP addresses, Companies, Events, Relationships, Medicines, Airports, Cars, Social Security numbers, Phone numbers, Credit cards, Dates, Holidays, Job titles, Currencies… many more
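The validation and customization questions on this slide ("Is this a valid credit card number?", "If area code is 415, output as Home Office") can be sketched as post-extraction rules. The sketch below is an assumption about what such rules might look like, not IDOL Eduction's actual grammar or API: it uses the standard Luhn checksum for card numbers, and the function names and the "Other" label are hypothetical.

```python
import re

def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumption: shorter strings are not card numbers
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def label_phone(phone: str) -> str:
    """Relabel an extracted US phone number based on its area code."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) == 10 and digits.startswith("415"):
        return "Home Office"
    return "Other"  # hypothetical fallback label
```

A pattern match alone finds strings that look like card numbers; the checksum step is what separates a plausible entity from a validated one.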
Eduction
<Organization>
• National Security Agency
<Names>
• President Obama
• Vladimir Putin
• Edward Snowden
<Places>
• Moscow
• St. Petersburg
• Washington
• Syria
• Russia
<Author>
• Carla Anne Robbins
Topical sentiment analysis
Decomposition and classification within a
sentence to pull out specific topics
“I stayed at the Marriott last week, and though the
mattresses were very nice, the service was awful.”
Is this Positive? Negative? Neutral?
How much Positive? How much Negative?
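A minimal sketch of the decomposition idea, assuming a tiny hand-written lexicon: the sentence is split into clauses, each clause gets a score, and the score is attached to the topic mentioned in that clause. The word lists, topic list and function name are hypothetical; the talk does not describe IDOL's actual sentiment model.

```python
import re

# Hypothetical lexicons, for illustration only.
POSITIVE = {"nice", "great", "good"}
NEGATIVE = {"awful", "bad", "poor"}
TOPICS = {"mattresses", "service"}

def topical_sentiment(sentence: str) -> dict:
    """Score each topic by the sentiment words found in the same clause."""
    results = {}
    # Treat commas, semicolons and "but" as clause boundaries.
    for clause in re.split(r",|;|\bbut\b", sentence.lower()):
        words = re.findall(r"[a-z]+", clause)
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        for w in words:
            if w in TOPICS:
                results[w] = score
    return results
```

Scoring per clause rather than per sentence is what lets one review be positive about the mattresses and negative about the service at the same time.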
Search video as easily as text
Transform rich media into intelligent assets
• Live video or playback from archived footage
• On-screen text recognition
• Face identification
• Automatically generated transcript using speech recognition
• Speaker identification
• Timecode synchronization
• Automatic keyframe generation
Automate: Automatically create metadata, keyframes, transcriptions
Understand: Understand video footage and audio streams in real time
Act: Apply advanced analytics such as clustering and categorization, and link with other file types
Most advanced speech technology
Convert spoken words to text
• Acoustic + Language Model
• Speech-to-Text and IDOL’s conceptual understanding
Eliminate manually adding metadata to A/V clips
Phonetic approaches have major problems
• No Conceptual or Contextual Language Understanding
• Keyword-Based
Model of language disambiguates similar terms
• U.S. President “Bush”
• “bush” as in a large plant
Image technology: Text
Document field extraction
<item><price>$6.23</price><date>10/2/2012</date><purpose>Lunch</purpose>…</item>
OCR: Read text from images
• Image artifacts such as wrinkled paper
• Avoid non-text parts of the image
• Column understanding
1D and 2D barcode reading
• ISBN (“9870140189865”)
• PDF-417 (“LASTNAME, FIRSTNAME,…”)
• Data Matrix (“The Future of Ticketing…”)
• Many more (about 20 barcode types)
Image technology: 2D objects
Registered image / Test image
Generic Logo recognition: Registered Logos / Test image
Image technology: Human analysis
Face detection
Face analysis: found “President Obama” face
Per-person annotations shown on the image:
• Primary clothing color = white, not nude
• Primary clothing color = white, not nude
• Primary clothing color = black, not nude
IDOL Architecture
IDOL architecture supporting next gen apps
IDOL: OS for Human Information (Cloud, Enterprise)
Information Types: Social Media, Video, Audio, Email, Texts, Mobile, Transactional Data, Documents, XML, Search Engine, Images
Apps (HP Autonomy IDOL Applications): eDiscovery, Enterprise Search, Media Monitoring, Social Media Analytics, Decision Support, Augmented Reality, HC Analytics, Partner/In-house apps
IDOL Services: Multimedia, Informatics, Enrichment, Capture, Interaction, Analytics, Discovery, Concept Clouds, Active Matching, Visualization
500 Functions: 2D/3D clustering, Acoustic signature, Active matching, Agents, Alerting, Auto language detection, Auto query guidance, Boolean & legacy, Operations, Breaking news clustering, Categorization, Collaboration, Community, Concept highlighting, Concept-query, Summarization, Conceptual retrieval, Context summarization, Cross-modal suggest, Dynamic n-dimensional, Taxonomy generation, Dynamic XML, Consumption, Eduction, Exact phrase matching, Expertise location, Explicit profiling, Face recognition, Field modulation, Frame analysis, Fuzzy matching, Hot clustering, Hyperlinking, Image analysis, Image association, Implicit profiling, Keyword search, Mail object identification, Melody classification, Melody identification, Metadata recognition, Natural language retrieval, Object identification, Object recognition, Ontology generation, Parametric refinement, Phrase spotting, Proper name identification, Query by example, Real-time aggregation, Routing, Scene detection, Script alignment, Sentiment analysis, Soundex matching, Speaker identification, Speaker recognition, Spectrographic analysis, Spell checking, Tag reconciliation, Transcription, Video analysis, Voice printing, Word spotting, Work groups, XML tagging…
Connectors (Autonomy Connectors)
Repositories: SharePoint, Hadoop, Email, ERP, CRM, DB, Data Warehouse, Jive, …; ACA, MediaBin, Connected, LiveVault, TRIM, AeD, Data Protector, WorkSite, DigitalSafe, …
Hadoop Plays
1. KeyView used in Hadoop to extract text and metadata from data objects using MapReduce
2. Connectors used to fill the data-lake from enterprise repositories
3. IDOL (in conjunction with 1 & 2) used to provide deep text analytics on data objects
HP IDOL for Hadoop – Potential use case
• HP Hadoop Connectors can ingest data from other systems into Hadoop
• HP KeyView extracts text and metadata from Hadoop data
• HP IDOL functions can be performed on Hadoop data via SDK
Thank You
What is Flink
Collection programming APIs for batch and real-time
streaming analysis
Backed by a very robust execution backend
• with true streaming capabilities,
• custom memory manager,
• native iteration execution,
• and a cost-based optimizer.
The case for Flink
Performance and ease of use
• Exploits in-memory and pipelining, language-embedded logical APIs
Unified batch and real streaming
• Batch and Stream APIs on top of streaming engine
A runtime that "just works" without tuning
• C++ style memory management inside the JVM
Predictable and dependable execution
• Bird’s-eye view of what runs and how, and what failed and why
Example: WordCount
import org.apache.flink.api.scala._
case class Word (word: String, frequency: Int)
val env = ExecutionEnvironment.getExecutionEnvironment
// Split each line into words, then sum the frequency field per word
env.readTextFile(...)
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency").print()
env.execute()
Flink has mirrored Java and Scala APIs that offer the same
functionality, including by-name addressing.
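For readers without a Flink setup, the same dataflow can be sketched on plain in-memory collections. This is a single-machine illustration of the flatMap / groupBy / sum steps, not Flink code; the function name `word_count` is hypothetical.

```python
from collections import Counter

def word_count(lines):
    """Split each line on single spaces and count occurrences per word."""
    counts = Counter()
    for line in lines:
        # flatMap step: one line becomes many words;
        # Counter.update plays the role of groupBy("word").sum("frequency").
        counts.update(line.split(" "))
    return dict(counts)
```

In Flink the same logical steps are distributed across workers; the program text stays roughly this short because the API is embedded in the host language.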
Example: Window WordCount
case class Word (word: String, frequency: Int)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.fromSocketStream(...)
// Count words over the last 100 elements, re-evaluated every 10 elements
lines.flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Count.of(100)).every(Count.of(10))
  .groupBy("word").sum("frequency").print()
env.execute()
Defining windows
Trigger policy
• When to trigger the computation on current window
Eviction policy
• When data points should leave the window
• Defines window width/size
E.g., count-based policy
• evict when #elements > n
• start a new window every n-th element
Built-in: Count, Time, Delta policies
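The two policies can be sketched with a bounded queue: the queue's maximum length acts as the count-based eviction policy, and a modulo test acts as the count-based trigger policy. This is a minimal single-process illustration; the function name is hypothetical, and the handling of partially filled windows is an assumption that may differ from Flink's behaviour.

```python
from collections import Counter, deque

def sliding_count_windows(words, size, every):
    """Yield word counts over the last `size` elements, once every `every` elements."""
    window = deque(maxlen=size)  # eviction policy: drop the oldest when #elements > size
    for i, word in enumerate(words, start=1):
        window.append(word)
        if i % every == 0:       # trigger policy: fire on every `every`-th element
            yield dict(Counter(window))
```

With `size=100` and `every=10` this mirrors the `window(Count.of(100)).every(Count.of(10))` call in the Window WordCount example.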
Flink API in a nutshell
• map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...
• All Hadoop input formats are supported
• API similar for data sets and data streams with slightly different operator semantics
• Window functions for data streams
• Counters, accumulators, and broadcast variables
Flink stack
APIs: Scala API, Java API, Python API (upcoming), Graph API (Gelly), Apache MRQL
Common API
Flink Optimizer, Flink Stream Builder
Runtimes and environments: Flink Local Runtime, Embedded environment (Java collections), Local Environment (for debugging), Remote environment (regular cluster execution), Apache Tez
Execution: Single node execution; Standalone or YARN cluster
Data storage: HDFS, Files, S3, JDBC, Flume, RabbitMQ, Kafka, HBase, …
Technology inside Flink
Technology inspired by compilers + MPP databases + distributed systems
For ease of use, reliable performance, and scalability
case class Path (from: Long, to: Long)
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") {
      (path, edge) => Path(path.from, edge.to)
    }
    .union(paths)
    .distinct()
  next
}
Components: Cost-based optimizer, Type extraction stack, Memory manager, Out-of-core algos, real-time streaming, Task scheduling, Recovery metadata, Data serialization stack, Streaming network stack, ...
Stages: Pre-flight (client), Master, Workers
Notable runtime features
1. Pipelined data transfers
2. Management of memory
3. Native iterations
4. Program optimization
Pipelined data transfers
Staged (batch) execution
Input: “Romeo, Romeo, where art thou Romeo?”
Operators: Load Log -> Search for str1 / Search for str2 / Search for str3 -> Grep 1 / Grep 2 / Grep 3
Stage 1: Create/cache Log
Subsequent stages: Grep log for matches
Caching in-memory and disk if needed
Pipelined execution
Input: “Romeo, Romeo, where art thou Romeo?”
Operators: Load Log -> Search for str1 / Search for str2 / Search for str3 -> Grep 1 / Grep 2 / Grep 3
Stage 1: Deploy and start operators
Data transfer in-memory and disk if needed
Note: Log DataSet is never “created”!
Pipelining in Flink
Currently the default mode of operation
• Much better performance in many cases – no need to materialize large data sets
• Supports both batch and real-time streaming
In the future pluggable
• Batch will use combination of blocking and pipelining
• Streaming will use pipelining
• Interactive will use blocking
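The difference between staged and pipelined execution can be sketched with Python generators. In the staged variant the whole log is materialized before any grep runs; in the pipelined variant each record flows through to the grep as soon as it is loaded, so the log is never "created" as a complete data set. This is a single-threaded analogy, not Flink's runtime; all function names are hypothetical, and the `trace` list exists only to make the order of operations visible.

```python
def load_log(lines, trace):
    """Source operator: emits one record at a time."""
    for line in lines:
        trace.append(("load", line))
        yield line

def grep(records, pattern, trace):
    """Filter operator: passes on records containing the pattern."""
    for rec in records:
        if pattern in rec:
            trace.append(("match", rec))
            yield rec

def run_staged(lines, pattern):
    trace = []
    log = list(load_log(lines, trace))  # stage 1: materialize the full log
    return list(grep(log, pattern, trace)), trace  # stage 2: grep the cached log

def run_pipelined(lines, pattern):
    trace = []
    # Operators are chained; records stream through without an intermediate list.
    return list(grep(load_log(lines, trace), pattern, trace)), trace
```

Both variants return the same matches; only the interleaving of load and match events differs, which is where the savings from not materializing large data sets come from.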
Memory management
Memory management in Flink
public class WC {
  public String word;
  public int count;
}
Pool of Memory Pages (pages start out empty)
• Managed: Sorting, hashing, caching; Shuffling, broadcasts
• Unmanaged: User code objects
Flink contains its own memory management stack. Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation. To do that, Flink contains its own type extraction and serialization components.
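The idea of a fixed buffer pool plus custom serialization can be sketched as follows: records are written as bytes into pre-allocated pages instead of living as individual objects, and running out of pages is an explicit, recoverable condition. This is a toy model under stated assumptions (the page size, the binary record layout and all names are hypothetical), not Flink's actual memory manager.

```python
import struct

PAGE_SIZE = 64  # bytes; deliberately tiny, a hypothetical value for illustration

class PagePool:
    """A fixed set of pre-allocated pages that are handed out and returned."""
    def __init__(self, num_pages):
        self.free = [bytearray(PAGE_SIZE) for _ in range(num_pages)]

    def acquire(self):
        if not self.free:
            # A real engine would spill to disk at this point instead of failing.
            raise MemoryError("page pool exhausted")
        return self.free.pop()

    def release(self, page):
        self.free.append(page)

def write_records(pool, records):
    """Serialize (word, count) records into pages; returns [(page, used_bytes)]."""
    pages = []
    page, offset = pool.acquire(), 0
    for word, count in records:
        data = word.encode("utf-8")
        # Layout: 2-byte length, 4-byte count, then the UTF-8 bytes of the word.
        rec = struct.pack(f">HI{len(data)}s", len(data), count, data)
        if len(rec) > PAGE_SIZE:
            raise ValueError("record larger than a page")
        if offset + len(rec) > PAGE_SIZE:  # page full: move on to a fresh one
            pages.append((page, offset))
            page, offset = pool.acquire(), 0
        page[offset:offset + len(rec)] = rec
        offset += len(rec)
    pages.append((page, offset))
    return pages

def read_records(pages):
    """Deserialize records back out of the pages."""
    out = []
    for page, used in pages:
        offset = 0
        while offset < used:
            length, count = struct.unpack_from(">HI", page, offset)
            offset += 6
            out.append((bytes(page[offset:offset + length]).decode("utf-8"), count))
            offset += length
    return out
```

Because the engine knows exactly how many pages it holds, it can decide when to go to disk rather than discovering the limit through an out-of-memory error.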
Configuring Flink
Per job
• Parallelism
System config
• Total JVM heap size (-Xmx)
• % of total JVM size for Flink runtime
• Memory for network buffers (soon not needed)
That's all you need. The system will not throw an OOM exception at you.
Benefits of managed memory
More reliable and stable performance (less GC effects, easy to go to disk)
Native iterative processing
Example: Transitive Closure
case class Path (from: Long, to: Long)
val env = ExecutionEnvironment.getExecutionEnvironment
val edges = ...
// 10 iterations: extend known paths by one edge, keep old paths, de-duplicate
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges).where("to").equalTo("from") {
      (path, edge) => Path(path.from, edge.to)
    }
    .union(paths)
    .distinct()
  next
}
tc.print()
env.execute()
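The step function of this example can be sketched on in-memory sets to show what each iteration computes. This is an illustration of the join / union / distinct logic, not Flink code; the function name is hypothetical, and the early exit when nothing changes is an addition that the fixed `iterate(10)` on the slide does not have.

```python
def transitive_closure(edges, max_iterations=10):
    """Repeatedly extend known paths by one edge until nothing new appears."""
    edges = set(edges)
    paths = set(edges)  # initial solution: every edge is a path
    for _ in range(max_iterations):
        # join paths.to == edges.from, producing (path.from, edge.to)
        joined = {(a, d) for (a, b) in paths for (c, d) in edges if b == c}
        next_paths = paths | joined  # union + distinct (sets de-duplicate)
        if next_paths == paths:      # assumption: stop early at a fixpoint
            break
        paths = next_paths
    return paths
```

Each pass feeds the previous partial solution back into the same step function, which is the "replace" pattern of a native bulk iteration.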
Iterate natively
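[Diagram: the initial solution becomes the partial solution; the step function combines the partial solution (X) with other datasets (Y) to produce a new partial solution, which replaces the old one; after the last pass it becomes the iteration result.]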
Iterate natively with deltas
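[Diagram: the initial solution and initial workset feed the loop; the step function combines the partial solution (X), the workset (A) and other datasets (Y) to produce a delta set and a new workset (B); deltas are merged into the partial solution and the workset is replaced; after the last pass the partial solution becomes the iteration result.]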
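A delta iteration can be sketched with connected components: the solution set maps each vertex to a component id, and the workset holds only the vertices whose id changed in the previous pass. The algorithm choice, graph and names below are illustrative assumptions, not taken from the talk; the point is that the number of updated elements shrinks from pass to pass.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id reachable from it."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    solution = {v: v for v in vertices}  # initial solution set
    workset = set(vertices)              # initial workset: everything
    updates_per_iteration = []
    while workset:
        delta = {}
        # Only vertices in the workset propagate their label to neighbors.
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < delta.get(n, solution[n]):
                    delta[n] = solution[v]
        solution.update(delta)           # merge deltas into the solution set
        workset = set(delta)             # replace workset with changed vertices
        updates_per_iteration.append(len(delta))
    return solution, updates_per_iteration
```

A bulk iteration would recompute every vertex on every pass; here the later passes touch only the few elements that are still changing.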
Effect of delta iterations
[Chart: # of elements updated (axis 0 to 45,000,000) per iteration (1 to 61)]
Iteration performance
MapReduce
Closing
Flink roadmap for 2015
Unify batch and streaming
Machine learning library and Mahout
Graph processing library improvements
Interactive programs and Zeppelin
Logical queries and SQL
And many more
Thank you for your invitation
Check out the project website: http://flink.apache.org
[email protected], other mailing lists
Twitter: @ApacheFlink
Feedback & contributions welcome
Flink community
[Chart: # unique contributors by git commits (without manual de-dup), axis 0 to 120, Jul-09 to May-16]
flink.apache.org
@ApacheFlink
Hadoop User Group
29th January 2015 @ HP
Text matching engine
About Capgemini
With more than 130,000 people in over 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2013 global revenues of EUR 10.1 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model. Learn more about us at www.capgemini.com.
Rightshore® is a trademark belonging to Capgemini
© 2015 Capgemini. All rights reserved.
Capgemini Global BIM Service Line
Capgemini’s global reach, with operations in 44 countries, and a focus on BIM with over 9600 BIM practitioners.
A uniquely integrated approach to Information Strategy based around the Capgemini “Intelligence Enterprise”.
Deep industry sector knowledge supported by sector-specific BIM offerings.
Capgemini’s best-in-class Rightshore® capability for development and management of BIM: 4000 BIM experts in the India CoE.
An unmatched (and vendor-independent) depth of technology experience. Capgemini works with all the major BI software vendors to deliver solutions appropriate to the customer’s needs.
850+ M EUR revenue in 2013
Europe: Austria, Finland, France, Italy, Germany, Norway, Netherlands, Poland, Spain, Sweden, Switzerland, UK
Other countries: South Africa, Argentina, Brazil, Mexico, United States, Canada, Saudi Arabia, India, Australia, China, Morocco
Contact information
Please contact:
• Edmond [email protected]
• Mouloud [email protected]
• Nathalie [email protected]
The matching process in 3 steps
Inputs: 6 months of applications (~200 000) and 7 days of offers (~50 000), i.e. ~10 billion possible combinations
1. Cleaning up documents
2. Vectorization of documents
3. Similarity computation
Cleaning up the documents
Cleaning up the documents with Apache Lucene (Hive UDF)
• Removing the useless words with:
  • A French vocabulary analyzer (general stop words: le, la, … articles, ...)
  • A customized dictionary defined by users (specific useless words/regex such as: emails, addresses, numbers/dates, ...)
• Extracting roots of remaining words (stemming):
  • Gets rid of gender and plural problems
  • Corrects some possible misspellings in job applications
Before: Formation en production mecanique bac+4 Experience en management en tant que responsable de service qualite :service de controle, de validation et de production de lavage de pieces clients; Sydeb Renault, GM Strasbourg, ACMS Mecachrome Encadrement et supervision du personnel du service qualite : …
After: form production mecan bac experienc manag tant responsabl servic qualit servic control valid production lavag piec client sydeb renault gm strasbourg acm mecachrom encadr supervision personel servic qualit …
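The two cleaning steps can be sketched without Lucene as a stop-word filter followed by a crude suffix stripper. This is a toy stand-in under stated assumptions: the stop-word list and suffix list are hypothetical and far smaller than what a real French analyzer uses, and the project itself ran Lucene inside a Hive UDF rather than code like this.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lists for illustration only.
STOP_WORDS = {"le", "la", "les", "de", "du", "des", "en", "et", "que"}
SUFFIXES = ("ation", "ement", "ique", "es", "e", "s")  # checked in this order

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def clean(text: str) -> str:
    """Lowercase, drop accents, digits and stop words, then stem each word."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    words = re.findall(r"[a-z]+", ascii_text.lower())
    return " ".join(stem(w) for w in words if w not in STOP_WORDS)
```

Stemming maps variants such as singular/plural or masculine/feminine forms onto one token, so that an application and an offer using different forms of the same word still share a vector coordinate.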
“Vectorization” of documents
Document vectorization transforms text inputs into comparable, quantifiable mathematical objects: vectors
Documents (Applications + Offers)
1 – Corpus dictionary
Key: a.a: Value: 0
Key: a.a.c: Value: 1
…
Key: form: Value: 1474
…
Key: production: Value: 15500
…
Vector basis (~1.2 million words/pairs of words)
2 – Relative weight
TFIDF(word, doc) = TF / DF
• TF (Term Frequency)
• DF (Document Frequency)
Normalized TFIDF vectors
Doc_1: { (w_1; tfidf_1); (w_2; tfidf_2); … }
Doc_2: { (w_1; tfidf_1); (w_2; tfidf_2); … }
…
Vector coordinates (TFIDF):
Documents    Weight word 1    Weight word 2    …    Weight word X
Doc_1        0.53             0.93             …    0
Doc_2        0                0.89             …    0.12
…            …                …                …    …
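The weighting on this slide can be sketched directly: count each word per document (TF), count in how many documents each word occurs (DF), divide, and scale every vector to unit length. The sketch follows the slide's TF / DF definition rather than the more common logarithmic IDF; the function name and the sparse dictionary representation are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: [token, ...]} -> {doc_id: {token: normalized weight}}."""
    # DF: number of documents in which each token appears at least once.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        raw = {w: tf[w] / df[w] for w in tf}  # TFIDF = TF / DF, as on the slide
        norm = math.sqrt(sum(x * x for x in raw.values()))
        vectors[doc_id] = {w: x / norm for w, x in raw.items()} if norm else {}
    return vectors
```

Storing only non-zero coordinates matters here: with a basis of about 1.2 million words, each document touches a tiny fraction of the dimensions.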
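Similarity process
Similarity coefficient = cosine of the angle between 2 TFIDF vectors
Because all TFIDF weights are non-negative, the coefficient lies between 0 and +1:
• 90° (cosine 0): independent information
• 0° (cosine 1): exact same information
Because the vectors are already normalized, the cosine reduces to the sum of the products of the weights on shared words.
Tables, joined on id_word:
• Application (CV): id_cv, id_word, tfidf
• Offer: id_offer, id_word, tfidf
In SQL:
SELECT
id_offer
,id_cv
,SUM(cv.tfidf*offer.tfidf) cos_sim
FROM offer
INNER JOIN cv ON offer.id_word = cv.id_word
GROUP BY id_offer, id_cv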
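The join can be tried on a few rows with an in-memory SQLite database. This is a small-scale sketch of the same query shape, assuming the join key is `id_word` in both tables and that the stored weights are already normalized; the project ran on Hive, and the wrapper function name is hypothetical.

```python
import sqlite3

def cosine_similarities(cv_rows, offer_rows):
    """Rows are (id, id_word, tfidf); returns {(id_offer, id_cv): cos_sim}."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cv (id_cv TEXT, id_word INTEGER, tfidf REAL)")
    conn.execute("CREATE TABLE offer (id_offer TEXT, id_word INTEGER, tfidf REAL)")
    conn.executemany("INSERT INTO cv VALUES (?, ?, ?)", cv_rows)
    conn.executemany("INSERT INTO offer VALUES (?, ?, ?)", offer_rows)
    rows = conn.execute(
        """
        SELECT offer.id_offer, cv.id_cv, SUM(cv.tfidf * offer.tfidf) AS cos_sim
        FROM offer
        INNER JOIN cv ON offer.id_word = cv.id_word
        GROUP BY offer.id_offer, cv.id_cv
        """
    ).fetchall()
    conn.close()
    return {(id_offer, id_cv): sim for id_offer, id_cv, sim in rows}
```

The inner join only produces pairs that share at least one word, so the roughly 10 billion theoretical combinations are never enumerated explicitly; pairs with no common word simply have no row.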
Facts from the field…
JOB OFFER
Indeed offer Operator / Operator of the chemical
manufacturing In a company dedicated to the
production and packaging of chemicals such as
solvents, aerosols, greases for vehicle engines,
you'll be loads of different product mixes
XXXX (Similarity = 17%): Technician manufacturing Chemical industry
AREAS OF EXPERTISE: Monitor compliance on instruments…
YYYY (Similarity = 13%): Operator chemistry
PROFESSIONAL SKILLS: Manipulation measurement tools and controls
ZZZZ (Similarity = 9.7%):
Weighing and metering products ehiniiques manipulation tool controle Ph meter, meter, microscope ...
Procedures for cleaning and disinfection...
Similarity > 12%: High confidence
Similarity 9-12%: Moderate confidence
Similarity 5-9%: Risky match
Similarity < 5%: High risk match
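The bands above translate into a small lookup. The slide does not say which band the exact boundary values (12%, 9%, 5%) belong to, so assigning each boundary to the higher band below is an assumption, as is the function name.

```python
def confidence_band(similarity: float) -> str:
    """Map a cosine similarity (0.0 to 1.0) to the confidence bands from the slide."""
    if similarity > 0.12:
        return "High confidence"
    if similarity >= 0.09:  # assumption: boundary values fall in the higher band
        return "Moderate confidence"
    if similarity >= 0.05:
        return "Risky match"
    return "High risk match"
```

Note how low the absolute values are: with vectors spread over more than a million dimensions, even a 12% cosine already indicates substantial shared vocabulary.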
Thank you
Thank you, 29 January 2015