Hadoop User Group, 29 Jan 2015: Apache Flink / Haven / Capgemini REX

© Copyright 2013 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Hadoop User Group, 29 January 2015

Upload: hadoop-user-group-france

Post on 14-Jul-2015


TRANSCRIPT

Page 1: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Hadoop User Group, 29 January 2015

Page 2: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Agenda

#1: Processing unstructured data (videos, images, …) with Haven for Hadoop

#2: Apache Flink: Fast and Reliable Large-scale Data Processing

#3: Case study: a Hadoop project in the HR domain with Capgemini. Document vectorization: making unstructured information comparable, new opportunities for an employment-sector player

21:00: Cocktail reception

Page 3: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Hadoop User Group

Haven: analyzing 100% of information

Frédéric Demongeot, EMEA Subject Matter Expert

29 January 2015

Page 4: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Copyright © 2015 Autonomy Inc., an HP Company. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners.

Big Data landscape

Diagram labels: Human Information, Machine Data, Business Data

Share of information: 10% of Information, 90% of Information

Annual Growth: ~100%, ~10%

Page 5: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Haven – Big Data platform

Data sources: Social media, IT/OT, Images, Audio, Video, Transactional data, Mobile, Search engine, Email, Texts, Documents

Hadoop/HDFS: Catalog massive volumes of distributed data

Autonomy IDOL: Process and index all information

Vertica: Analyze at extreme scale in real-time

Enterprise Security: Collect & unify machine data

nApps: Powering HP Software + your apps

hp.com/haven

Page 6: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

A few words on Vertica

Leverage existing tools in a shared Vertica and Hadoop storage environment

Diagram labels: Hadoop, HDFS, External Tables, Flex Tables, Click Stream and Web Session Data, Hive Integration (HCatalog), webHDFS, webHCAT, ANSI SQL, Storage Tiering, Hive, Pig, MapReduce, HBase, Copy

Page 7: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Vertica SQL on Hadoop

Page 8: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Autonomy Haven Examples

Page 9: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

NASCAR

“Autonomy, with the power of its IDOL engine, takes fan data, collects it, stores it, and stitches it together… that helps us understand what is being talked about across the ecosystem of the sport.” (Senior Director of IT, NASCAR)

HAVEn Solution:
• Autonomy IDOL + Autonomy Explore + HP Enterprise Services

Results:
• Wildly successful NASCAR Fan and Media Engagement Center collects and aggregates fan information
• Understands sentiment, identifies emerging issues, and uncovers trends that help the NASCAR team share and enrich the fan and broadcast experience

Page 10: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Car Manufacturer

Use Case: Brand monitoring

• Have one single Big Data platform they leverage for multiple projects

• Customer brand and partner analytics

• Connected vehicles

• Aggregate multiple data sources and store all on HDFS, such as:

• Social media

• YouTube rich media

• Internal data sources, such as CRM, car logs

• Rich media analysis - logo recognition, face recognition, speech to text, etc.

• Sentiment Analysis

HAVEn technology

• Autonomy

• Hadoop

• nApps – Integration of Haven technologies

Page 11: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

BlaBlaCar: Improving a ride-sharing community marketplace

Challenge
• Marketing campaigns and web experience for 10M+ members in 12 countries limited by infrequent and slow data analysis

Solution
• HP Vertica Analytics Platform
• Cloudera Distributed Hadoop, Tableau, & Data Science Studio

Result
• Optimized performance of CRM campaigns: program development improved experience for 2M customers per month
• Refined targeted marketing by integrating social media and predicting customer behavior through pattern recognition

Page 12: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


The Challenge of Human Information

Page 13: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Copyright © 2013 Autonomy Inc., an HP Company. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners.

What is Human Information

Information that is created by people and understood by people

Page 14: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Why is human information different?

Human Information is made up of ideas, is diverse, and has context.

Page 15: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Strong Information & Weak Information

Keywords are small amounts of very strong information without context.

Larger amounts of weaker information are what humans refer to as “context”.

“Mercury”: Is it a planet? Is it an element? Is it a car?

“A heavy element and the only metal that is liquid at standard conditions for temperature and pressure with the symbol Hg and atomic number 80, commonly known as quicksilver”

With high certainty, it's an element!
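The “Mercury” slide argues that many weak context words can settle what a single strong keyword cannot. A minimal sketch of that idea follows, assuming hand-written word profiles per sense; the object name, the profiles, and the overlap score are all hypothetical illustrations and say nothing about how IDOL actually models context.

```scala
object ContextSketch {
  // Hypothetical, hand-written sense profiles; a real system would learn these from data.
  val senses: Map[String, Set[String]] = Map(
    "planet"  -> Set("orbit", "sun", "solar", "planet"),
    "element" -> Set("metal", "liquid", "symbol", "atomic", "hg", "quicksilver"),
    "car"     -> Set("engine", "dealer", "sedan", "drive")
  )

  // Lower-case the text and split it on anything that is not a letter or digit.
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).toSet

  // Score each sense by how many of its profile words occur in the context.
  def scores(context: String): Map[String, Int] = {
    val words = tokens(context)
    senses.map { case (sense, profile) => sense -> (profile & words).size }
  }

  // Best-scoring sense, or None when the context gives no evidence at all.
  def bestSense(context: String): Option[String] = {
    val (sense, score) = scores(context).maxBy(_._2)
    if (score > 0) Some(sense) else None
  }
}
```

With the bare keyword, `ContextSketch.bestSense("Mercury")` has nothing to go on and returns `None`; with the descriptive sentence from the slide, the overlap with the "element" profile decides the sense.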

Page 16: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

How does HP IDOL approach Human Information?

Using Adaptive Probabilistic Concept Modeling

Adaptive: techniques that provide continuous learning based on context.

Probabilistic: techniques that deal in a scalable manner with the subtlety of the real world.

Concept Modeling: techniques that inform the importance of patterns found in data.

This proprietary combination of mathematics models Human Information and is: Automatic, Fast, Data agnostic, Language independent, Scalable, Accurate, Dynamic, Real-time, Voice & Video

Page 17: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


IDOL - OS of Human Information

Page 18: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

The OS of Human Information

IDOL (Intelligent Data Operating Layer) provides an interface for accessing human information from all sources and of all types, and provides common services that leverage the information to applications that need to access them to:

Connect: securely to any source of information through a single cross-enterprise interface layer.

Understand: the meaning, concepts and key attributes in all types of human-friendly information including documents, emails, databases, clickstreams, audio, social and rich media, etc.

Act & Automate: Inquire, Investigate, Interact and Improve quickly, correctly and compliantly based on a holistic view of information, market conditions and social trends.

Page 19: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


IDOL: the OS for human information

• Mathematically based

• 15 years and over $280M in R&D

• 170+ Patents

• Language independent

• Built for infrastructure

• All file types, all media types (voice/video)

• Scalable and with security

• Platform/OS /device agnostic

• Managed in place

Page 20: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

The World as Services

Inquire: “Search your data”

When you have criteria or an object that forms a question, Inquire functions allow you to return results that answer that question. They allow you to sift through large quantities of data to find specific documents that relate to your question or an area of interest.

Investigate: “Analyze your data”

Investigate functions allow you to use information contained in the results of an inquiry to analyse those results. The analysis might provide insights that allow you to improve your inquiry, or it might provide more general information about your content.

Interact: “Personalize your data”

Interact functions encompass all the functionality for using information with affinity to the user to create conceptual Profiles and Agents that reflect the user's information needs, which in turn can be used to power other functions.

Improve: “Enhance your data”

Improve functions enhance your information with more details that help with the Inquire and Investigate functions. These functions allow you to add information to data of any type, be it audio, video or text, that makes it easier to search and retrieve information, or to identify key features of your content.

Page 21: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Analyse your Data

What insights does our Human Information hold?

Is there structure that I can use to navigate the data?

Expose Concepts and Patterns

Help me evaluate the information quickly

• Intelligent Summarization (simple, concept and context)
• Intelligent Highlighting (search terms, phrases, concepts, context, fidelity to query grammar)
• Concept Streaming (real-time summaries from audio, contextual to queries and intent)
• Intelligent Results de-duplication including “near” de-duplication
• Structured, Semi-structured & XML support
• Parametric Searching (unlimited nesting and association support)
• Directed Navigation (create compelling navigation for users)
• Structured Refinement
• Automatic Query Guidance (providing top themes from query results in real time)
• Concept Navigation via advanced visualizations (node graphs, theme tracking, broadcast analysis)
• Language Independent

Page 22: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Visualization of main topics

Page 23: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Enhance your Data

Human Information is rich in features that can, when identified, enhance our analysis

Automatic Classification or Clustering
• Automatically determine categories based on patterns and relationships in Human Information
• Spot analysis of all themes and grouping within Human Information at any moment in time
• Time-sensitive analysis: What's hot? What's new?

Supervised Classification
• Create categories using business rules or training and classify information into those categories

Eduction and Entity Extraction
• Extract features and determine characteristics in Human Information
• Names, Addresses, Credit Card Information, Sentiment, Intent, …

Audio Analysis
• Extract features of Audio information
• Speaker-independent speech to text, speaker identification, audio events, language identification, …

Image and Video Analysis
• Extract features from Video information
• Next generation image classification (is this a car? / find more like “this”)
• On-screen OCR, logo detection, intelligent scene analysis, colour and texture analysis, story segmentation, …

Page 24: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Eduction

Hundreds of conceptual entities

Quickly narrow search results with auto-identified facets and conceptual entities such as employee names from documents

Validate or customize entities
• Is this a valid credit card number?
• What are all docs that contain SSNs?
• If area code is 415, output as Home Office
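The first validation question above, whether an extracted digit string is a plausible credit card number, is commonly answered with the Luhn checksum. The sketch below shows that check in plain Scala; the object and method names are hypothetical, the 12 to 19 digit length bound is an assumption, and nothing here is taken from IDOL's Eduction API.

```scala
object LuhnSketch {
  // Luhn checksum: starting from the rightmost digit, double every second digit,
  // subtract 9 from any doubled value above 9, and require the total to be divisible by 10.
  def isValidCardNumber(candidate: String): Boolean = {
    // Tolerate the spaces and dashes that often appear in extracted text.
    val cleaned = candidate.filterNot(c => c == ' ' || c == '-')
    // Assumed plausibility bound: card numbers are typically 12 to 19 digits long.
    val plausible = cleaned.length >= 12 && cleaned.length <= 19 && cleaned.forall(_.isDigit)
    if (!plausible) false
    else {
      val total = cleaned.reverse.map(_.asDigit).zipWithIndex.map {
        case (digit, index) =>
          if (index % 2 == 1) { val doubled = digit * 2; if (doubled > 9) doubled - 9 else doubled }
          else digit
      }.sum
      total % 10 == 0
    }
  }
}
```

A checksum only filters out typos and random digit runs; it cannot tell whether an account exists, so a match is best treated as a candidate entity rather than a confirmed one.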

Pinpoint accuracy for multibyte languages such as CJK, Thai and some European languages

Entity types: Names, Places, IP addresses, Companies, Events, Relationships, Medicines, Airports, Cars, Social Security numbers, Phone numbers, Credit cards, Dates, Holidays, Job titles, Currencies, … many more

Page 25: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Eduction

<Organization>
• National Security Agency

<Names>
• President Obama
• Vladimir Putin
• Edward Snowden

<Places>
• Moscow
• St. Petersburg
• Washington
• Syria
• Russia

<Author>
• Carla Anne Robbins

Page 26: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Topical sentiment analysis

Decomposition and classification within a sentence to pull out specific topics

“I stayed at the Marriott last week, and though the mattresses were very nice, the service was awful.”

Is this Positive? Negative? Neutral?

How much Positive? How much Negative?
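To make the decomposition idea concrete, here is a deliberately naive sketch: split the sentence into clauses at commas and give each known topic the summed polarity of the words in its own clause. The lexicon, the topic list, and the comma-based clause splitting are all hypothetical simplifications, not a description of IDOL's sentiment analysis.

```scala
object TopicalSentimentSketch {
  // Tiny hand-made polarity lexicon and topic list (both hypothetical).
  val polarity: Map[String, Int] = Map("nice" -> 1, "great" -> 1, "awful" -> -1, "bad" -> -1)
  val topics: Set[String] = Set("mattresses", "service", "room", "breakfast")

  // Split into clauses at commas, then pair each known topic with the
  // summed polarity of the words that share its clause.
  def topicSentiment(sentence: String): Map[String, Int] =
    sentence.toLowerCase.split(",").toSeq.flatMap { clause =>
      val words = clause.split("[^a-z]+").filter(_.nonEmpty).toSeq
      val score = words.map(word => polarity.getOrElse(word, 0)).sum
      words.filter(topics.contains).map(topic => topic -> score)
    }.toMap
}
```

Scoring per clause rather than per sentence is what lets the Marriott example come out mixed: "mattresses" is positive and "service" is negative, where a whole-sentence score would cancel to neutral.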


Page 27: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Search video as easily as text

Transform rich media into intelligent assets

Live video or playback from archived footage
• On-screen text recognition
• Face identification
• Automatically generated transcript using speech recognition
• Speaker identification
• Timecode synchronization
• Automatic keyframe generation

Automate: Automatically create metadata, keyframes, transcriptions

Understand: Understand video footage and audio streams in real time

Act: Apply advanced analytics such as clustering and categorization, and link with other file types

Page 28: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Most advanced speech technology

Convert spoken words to text

• Acoustic + Language Model

• Speech-to-Text and IDOL’s conceptual understanding

Eliminate manually adding metadata to A/V clips

Phonetic approaches have major problems

• No Conceptual or Contextual Language Understanding

• Keyword-Based

Model of language disambiguates similar terms

• U.S. President “Bush”

• “bush” as in a large plant


Page 29: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Image technology: Text

OCR: Read text from images
• Image artifacts such as wrinkled paper
• Avoid non-text parts of the image
• Column understanding

Document field extraction

<item><price>$6.23</price><date>10/2/2012</date><purpose>Lunch</purpose>…</item>

1D and 2D barcode reading
• ISBN (“9870140189865”)
• PDF-417 (“LASTNAME, FIRSTNAME,…”)
• Data Matrix (“The Future of Ticketing…”)
• Many more (about 20 barcode types)

Page 30: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Image technology: 2D objects

Image matching: Registered image, Test image

Generic Logo recognition: Registered Logos, Test image

Page 31: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Image technology: Human analysis

Primary clothing color = white, Not nude

Primary clothing color = white, Not nude

Primary clothing color = black, Not nude

Face detection

Face analysis

Found “President Obama” face

Page 32: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


IDOL Architecture

Page 33: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

IDOL architecture supporting next gen apps

IDOL: OS for Human Information

Information Types: Social Media, Video, Audio, Email, Texts, Mobile, Transactional Data, Documents, XML, Search Engine, Images

Repositories: SharePoint, Hadoop, Email, ERP, CRM, DB, Data Warehouse, Jive, …; ACA, MediaBin, Connected, LiveVault, TRIM, AeD, Data Protector, WorkSite, DigitalSafe

Connectors: Autonomy Connectors (Cloud, Enterprise)

IDOL Services: Multimedia, Informatics, Enrichment, Capture, Interaction, Analytics, Discovery, Concept Clouds, Active Matching, Visualization

500 Functions: 2D/3D clustering, Acoustic signature, Active matching, Agents, Alerting, Auto language detection, Auto query guidance, Boolean & legacy, Operations, Breaking news clustering, Categorization, Collaboration, Community, Concept highlighting, Concept-query, Summarization, Conceptual retrieval, Context summarization, Cross-modal suggest, Dynamic n-dimensional, Taxonomy generation, Dynamic XML, Consumption, Eduction, Exact phrase matching, Expertise location, Explicit profiling, Face recognition, Field modulation, Frame analysis, Fuzzy matching, Hot clustering, Hyperlinking, Image analysis, Image association, Implicit profiling, Keyword search, Mail object identification, Melody classification, Melody identification, Metadata recognition, Natural language retrieval, Object identification, Object recognition, Ontology generation, Parametric refinement, Phrase spotting, Proper name identification, Query by example, Real-time aggregation, Routing, Scene detection, Script alignment, Sentiment analysis, Soundex matching, Speaker identification, Speaker recognition, Spectrographic analysis, Spell checking, Tag reconciliation, Transcription, Video analysis, Voice printing, Word spotting, Work groups, XML tagging, …

Apps: HP Autonomy IDOL Applications (eDiscovery, Enterprise Search, Media Monitoring, Social Media Analytics, Decision Support, Augmented Reality, HC Analytics), Partner/In-house apps

Page 34: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Hadoop Plays

1. KeyView used in Hadoop to extract text and metadata from data objects using MapReduce

2. Connectors used to fill the data lake from enterprise repositories

3. IDOL (in conjunction with 1 & 2) used to provide deep text analytics on data objects

Page 35: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

HP IDOL for Hadoop – Potential use case

• HP Hadoop Connectors can ingest data from other systems into Hadoop
• HP KeyView extracts text and metadata from Hadoop data
• HP IDOL functions can be performed on Hadoop data via SDK

Page 36: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Thank You

Page 37: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Apache Flink

Robert Metzger

Flink committer, co-founder, data Artisans

@rmetzger_

[email protected]

Page 38: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

What is Flink

A collection of programming APIs for batch and real-time streaming analysis

Backed by a very robust execution backend
• with true streaming capabilities,
• a custom memory manager,
• native iteration execution,
• and a cost-based optimizer.

Page 39: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

The case for Flink

Performance and ease of use

• Exploits in-memory and pipelining, language-embedded logical APIs

Unified batch and real streaming

• Batch and Stream APIs on top of streaming engine

A runtime that "just works" without tuning

• C++ style memory management inside the JVM

Predictable and dependable execution

• Bird’s-eye view of what runs and how, and what failed and why


Page 40: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Example: WordCount

import org.apache.flink.api.scala._

case class Word(word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment

// Split each line into words, then sum the frequencies per word
env.readTextFile(...)
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()

env.execute()

Flink has mirrored Java and Scala APIs that offer the same functionality, including by-name addressing.
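For readers without a Flink setup, the same computation can be followed on ordinary Scala collections. This is only a local sketch of what the pipeline computes (flatMap to `Word` records, group by the word, sum the frequencies); the object and method names are hypothetical, and it does not use or imitate the Flink API.

```scala
object WordCountSketch {
  case class Word(word: String, frequency: Int)

  // Same shape as the slide's pipeline, on local collections:
  // split lines into words, group by the word, and sum the frequencies.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").filter(_.nonEmpty).map(word => Word(word, 1)))
      .groupBy(_.word)
      .map { case (word, occurrences) => word -> occurrences.map(_.frequency).sum }
}
```

Here `groupBy(_.word)` plays the role of the by-name `groupBy("word")` in the Flink version, and the per-group sum corresponds to `sum("frequency")`.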

Page 41: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Example: Window WordCount

case class Word(word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment

val lines = env.fromSocketStream(...)

// Count words over the last 100 elements, re-evaluated every 10 elements
lines
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .window(Count.of(100)).every(Count.of(10))
  .groupBy("word").sum("frequency")
  .print()

env.execute()

Page 42: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Defining windows

Trigger policy
• When to trigger the computation on the current window

Eviction policy
• When data points should leave the window
• Defines window width/size

E.g., count-based policy
• evict when #elements > n
• start a new window every n-th element

Built-in: Count, Time, Delta policies
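The count-based policy can be sketched on a plain Scala `Iterator` to show how eviction and triggering interact. The names below are hypothetical, and emitting partially filled windows at the start of the stream is an assumption of this sketch rather than documented Flink behaviour.

```scala
object CountWindowSketch {
  // Keep the last `size` elements (eviction policy) and emit the current
  // window after every `slide`-th element (trigger policy).
  def countWindows[A](stream: Iterator[A], size: Int, slide: Int): Iterator[Vector[A]] = {
    require(size > 0 && slide > 0, "size and slide must be positive")
    var window = Vector.empty[A]
    var seen = 0L
    stream.flatMap { element =>
      window = (window :+ element).takeRight(size) // evict when #elements > size
      seen += 1
      if (seen % slide == 0) Iterator.single(window) // trigger every slide-th element
      else Iterator.empty
    }
  }
}
```

With `size = 100` and `slide = 10` this mirrors the shape of `window(Count.of(100)).every(Count.of(10))` from the Window WordCount example: each emitted window holds at most the 100 most recent elements, and a new result appears every 10 elements.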


Page 43: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Flink API in a nutshell

map, flatMap, filter, groupBy, reduce, reduceGroup, aggregate, join, coGroup, cross, project, distinct, union, iterate, iterateDelta, ...

All Hadoop input formats are supported

API similar for data sets and data streams with slightly different operator semantics

Window functions for data streams

Counters, accumulators, and broadcast variables

Page 44: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Flink stack

APIs and libraries: Scala API, Java API, Python API (upcoming), Graph API (Gelly), Apache MRQL

Common API

Flink Optimizer, Flink Stream Builder

Flink Local Runtime

Execution environments: Embedded environment (Java collections), Local environment (for debugging), Remote environment (regular cluster execution), Apache Tez

Deployment: Single node execution, Standalone or YARN cluster

Data storage: HDFS, Files, S3, JDBC, Flume, RabbitMQ, Kafka, HBase, …

Page 45: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Technology inside Flink

Technology inspired by compilers + MPP databases + distributed systems

For ease of use, reliable performance, and scalability

case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) => Path(path.from, edge.to) }
    .union(paths)
    .distinct()
  next
}

Components: Cost-based optimizer, Type extraction stack, Memory manager, Out-of-core algos, real-time streaming, Task scheduling, Recovery metadata, Data serialization stack, Streaming network stack, ...

Where they run: Pre-flight (client), Master, Workers

Page 46: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Notable runtime features

1. Pipelined data transfers

2. Management of memory

3. Native iterations

4. Program optimization


Page 47: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Pipelined data transfers


Page 48: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Staged (batch) execution

Example input: Romeo, Romeo, where art thou Romeo?

Operators: Load Log, Search for str1 (Grep 1), Search for str2 (Grep 2), Search for str3 (Grep 3)

Stage 1: Create/cache Log

Subsequent stages: Grep log for matches

Caching in-memory and disk if needed

Page 49: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Pipelined execution

Example input: Romeo, Romeo, where art thou Romeo?

Operators: Load Log, Search for str1 (Grep 1), Search for str2 (Grep 2), Search for str3 (Grep 3)

Stage 1: Deploy and start operators

Data transfer in-memory and disk if needed

Note: Log DataSet is never “created”!
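The difference between the two execution styles can be imitated with local Scala collections. In the sketch below, `staged` first materializes the whole log and then runs each grep over the cached copy, while `pipelined` hands every line to all greps as it is read, so the full log never exists as a collection. The names are hypothetical and this is an analogy on a single machine, not Flink's runtime.

```scala
object PipeliningSketch {
  // Staged: materialize the log first, then let every grep scan the cached copy.
  def staged(load: () => Iterator[String], patterns: Seq[String]): Map[String, Int] = {
    val log = load().toVector // stage 1: create/cache the log
    patterns.distinct.map(p => p -> log.count(_.contains(p))).toMap // later stages: grep
  }

  // Pipelined: each line flows to all greps as soon as it is read;
  // only the running counters are kept in memory.
  def pipelined(load: () => Iterator[String], patterns: Seq[String]): Map[String, Int] = {
    val counts = scala.collection.mutable.Map(patterns.distinct.map(p => p -> 0): _*)
    for (line <- load(); p <- patterns.distinct if line.contains(p)) counts(p) += 1
    counts.toMap
  }
}
```

Both functions return the same counts of matching lines per pattern; what changes is the peak memory, which for `staged` grows with the size of the log and for `pipelined` stays proportional to the number of patterns.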

Page 50: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Pipelining in Flink

Currently the default mode of operation

• Much better performance in many cases – no need to materialize large data sets

• Supports both batch and real-time streaming

In the future pluggable

• Batch will use combination of blocking and pipelining

• Streaming will use pipelining

• Interactive will use blocking


Page 51: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Memory management


Page 52: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Memory management in Flink

public class WC {
  public String word;
  public int count;
}

Diagram labels: Pool of Memory Pages (empty page); Sorting, hashing, caching; Shuffling, broadcasts; User code objects; Managed; Unmanaged

Flink contains its own memory management stack. Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation. To do that, Flink contains its own type extraction and serialization components.
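The idea of keeping records as bytes in fixed-size pages, instead of as objects on the heap, can be illustrated with `java.nio.ByteBuffer`. The layout below (length-prefixed UTF-8 word followed by a 4-byte count) and all names are assumptions made for this sketch; Flink's actual memory segments and serializers are not shown here.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object ManagedMemorySketch {
  case class WC(word: String, count: Int)

  // Write one record into a page as [length: Int][UTF-8 bytes][count: Int].
  // Returns false, leaving the page untouched, when the record does not fit.
  def write(page: ByteBuffer, record: WC): Boolean = {
    val bytes = record.word.getBytes(UTF_8)
    if (page.remaining() < 4 + bytes.length + 4) false
    else {
      page.putInt(bytes.length).put(bytes).putInt(record.count)
      true
    }
  }

  // Read back every record written to the page so far, without disturbing its write position.
  def readAll(page: ByteBuffer): List[WC] = {
    val view = page.duplicate()
    view.flip()
    val out = List.newBuilder[WC]
    while (view.hasRemaining) {
      val bytes = new Array[Byte](view.getInt())
      view.get(bytes)
      out += WC(new String(bytes, UTF_8), view.getInt())
    }
    out.result()
  }
}
```

Because `write` reports a full page instead of throwing, a caller can react by taking another page from its pool or by spilling the full page to disk, which is the kind of behaviour the slide attributes to managed memory.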

Page 53: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Configuring Flink

Per job

• Parallelism

System config

• Total JVM heap size (-Xmx)

• % of total JVM size for Flink runtime

• Memory for network buffers (soon not needed)

That's all you need. The system will not throw an OOM exception at you.

Page 54: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Benefits of managed memory

More reliable and stable performance (less GC effects, easy to go to disk)


Page 55: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Native iterative processing


Page 56: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Example: Transitive Closure

case class Path(from: Long, to: Long)

val env = ExecutionEnvironment.getExecutionEnvironment

val edges = ...

// Run 10 iterations; each step extends the known paths by one more edge
val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges).where("to").equalTo("from") {
      (path, edge) => Path(path.from, edge.to)
    }
    .union(paths)
    .distinct()
  next
}

tc.print()
env.execute()
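The step function of this example can be traced on local Scala sets. The sketch below performs the same join (path.to == edge.from), union, and de-duplication per iteration; unlike the slide's `iterate(10)`, it also stops early once an iteration adds nothing, which is a convenience of this sketch. All names are hypothetical and no Flink API is involved.

```scala
object TransitiveClosureSketch {
  case class Path(from: Long, to: Long)

  // Bulk iteration: each step joins the current paths with the edges on
  // path.to == edge.from, unions the result with the old paths, and de-duplicates.
  def transitiveClosure(edges: Set[Path], maxIterations: Int): Set[Path] = {
    val edgesByFrom = edges.groupBy(_.from)
    var paths = edges
    var iteration = 0
    var changed = true
    while (iteration < maxIterations && changed) {
      val extended = for {
        path <- paths
        edge <- edgesByFrom.getOrElse(path.to, Set.empty[Path])
      } yield Path(path.from, edge.to)
      val next = paths ++ extended // union + distinct, since these are sets
      changed = next != paths
      paths = next
      iteration += 1
    }
    paths
  }
}
```

For the chain 1 → 2 → 3 → 4, the first iteration adds the two-hop paths and the second adds 1 → 4; the whole partial solution is recomputed in every step, which is the cost that delta iterations aim to avoid.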

Page 57: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Iterate natively

Diagram labels: initial solution, partial solution, step function, other datasets, X, Y, Replace, iteration result

Page 58: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Iterate natively with deltas

Diagram labels: initial workset, initial solution, workset, partial solution, delta set, other datasets, A, B, X, Y, Merge deltas, Replace, iteration result
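A delta iteration keeps a solution set plus a workset of elements that changed in the last step, and only the workset drives the next step. The sketch below applies that pattern to connected components by minimum-label propagation on local Scala collections; the algorithm choice and all names are assumptions made for illustration, not something taken from the slides or from Flink's `iterateDelta` API.

```scala
object DeltaIterationSketch {
  // Connected components by min-label propagation.
  // solution: vertex -> smallest component id seen so far
  // workset:  vertices whose id changed in the previous step
  def connectedComponents(vertices: Set[Long], edges: Set[(Long, Long)]): Map[Long, Long] = {
    val all = vertices ++ edges.flatMap { case (a, b) => Set(a, b) }
    val neighbours: Map[Long, Set[Long]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }
    var solution: Map[Long, Long] = all.map(v => v -> v).toMap
    var workset: Set[Long] = all
    while (workset.nonEmpty) {
      // Step function: only vertices in the workset send their id to their neighbours.
      val candidates: Map[Long, Long] = workset.toSeq
        .flatMap(v => neighbours.getOrElse(v, Set.empty[Long]).toSeq.map(n => n -> solution(v)))
        .groupBy(_._1)
        .map { case (n, ids) => n -> ids.map(_._2).min }
      // Delta set: candidates that actually improve the current solution.
      val delta = candidates.filter { case (n, id) => id < solution(n) }
      solution = solution ++ delta // merge deltas into the partial solution
      workset = delta.keySet       // the changed vertices form the next workset
    }
    solution
  }
}
```

The workset shrinks as components settle, so later iterations touch only a few elements; the loop ends when the workset is empty.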

Page 59: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Effect of delta iterations

[Plot: # of elements updated (y-axis, 0 to 45,000,000) per iteration (x-axis, 1 to 61)]

Page 60: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Iteration performance

[Chart; surviving label: MapReduce]

Page 61: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Closing


Page 62: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Flink roadmap for 2015

Unify batch and streaming

Machine learning library and Mahout

Graph processing library improvements

Interactive programs and Zeppelin

Logical queries and SQL

And many more


Page 63: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Thank you for your invitation

Check out the project website: http://flink.apache.org

[email protected] and other mailing lists

Twitter: @ApacheFlink

Feedback & contributions welcome


Page 64: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Flink community

[Chart: number of unique contributors by git commits (without manual de-dup); y-axis 0 to 120, x-axis Jul-09 to May-16]

Page 65: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

flink.apache.org

@ApacheFlink

Page 66: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Hadoop User Group

29th January 2015 @ HP

Text matching engine

Page 67: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

About Capgemini

With more than 130,000 people in over 40 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2013 global revenues of EUR 10.1 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience™, and draws on Rightshore®, its worldwide delivery model. Learn more about us at www.capgemini.com.

Rightshore® is a trademark belonging to Capgemini


© 2015 Capgemini. All rights reserved.

Page 68: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Capgemini Global BIM Service Line

• Capgemini's global reach, with operations in 44 countries and a focus on BIM with over 9,600 BIM practitioners.

• A uniquely integrated approach to Information Strategy based around the Capgemini “Intelligence Enterprise”.

• Deep industry sector knowledge supported by sector-specific BIM offerings.

• Capgemini's best-in-class Rightshore® capability for the development and management of BIM: 4,000 BIM experts in the India CoE.

• An unmatched (and vendor-independent) depth of technology experience. Capgemini works with all the major BI software vendors to deliver solutions appropriate to the customer's needs.

• 850+ M EUR revenue in 2013

[Map of operations] South Africa, Argentina, Brazil, Mexico, United States, Canada, Saudi Arabia, India, Australia, China, Morocco

Europe: Austria, Finland, France, Italy, Germany, Norway, Netherlands, Poland, Spain, Sweden, Switzerland, UK


Page 69: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Contact information

Please contact:

• Edmond [email protected]

• Mouloud [email protected]

• Nathalie [email protected]


Page 70: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

The matching process in 3 steps

Inputs: 6 months of applications (~200,000) and 7 days of offers (~50,000)

1. Cleaning up documents

2. Vectorization of documents

3. Similarity computation (~10 billion possible combinations)


Page 71: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Cleaning up the documents

Cleaning up the documents with Apache Lucene (Hive UDF)

• Removing useless words with:

  • A French vocabulary analyzer (general words that are useless for the analysis: le, la, … articles, …)

  • A customized dictionary defined by users (specific useless words/regex such as emails, addresses, numbers/dates, …)

• Extracting the roots of the remaining words (stemming):

  • Gets rid of gender and plural problems

  • Corrects some possible misspellings in job applications

Original text:

Formation en production mecanique bac+4 Experience en management en tant que responsable de service qualite : service de controle, de validation et de production de lavage de pieces clients; Sydeb Renault, GM Strasbourg, ACMS Mecachrome Encadrement et supervision du personnel du service qualite : …

After cleaning:

form production mecan bac experienc manag tant responsabl servic qualit servic control valid production lavag piec client sydeb renault gm strasbourg acm mecachrom encadr supervision personel servic qualit …
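The two cleaning stages (stop-word removal, then stemming) can be sketched in a few lines. This is a deliberately crude stand-in, not the Lucene French analyzer used in the project: the stop-word list, the regex patterns, and the suffix list below are all hypothetical and far smaller than a real analyzer's.

```python
import re
import unicodedata

# Hypothetical, tiny stand-ins for the real dictionaries.
STOP_WORDS = {"le", "la", "les", "de", "du", "des", "en", "et", "un", "une", "que"}
CUSTOM_PATTERNS = [re.compile(r"\S+@\S+"), re.compile(r"\d+")]  # emails, numbers
SUFFIXES = ("ations", "ation", "ements", "ement", "ique", "es", "s", "e")


def strip_accents(text):
    """Remove diacritics so that 'mécanique' and 'mecanique' are treated alike."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, drop custom patterns and stop words, then stem what remains."""
    text = strip_accents(text.lower())
    for pattern in CUSTOM_PATTERNS:
        text = pattern.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text)
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean("Formation en production mécanique bac+4"))
```

A real stemmer handles many more suffixes and exceptions; the sketch only shows where each stage sits in the pipeline.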


Page 72: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

“Vectorization” of documents

Document vectorization transforms text inputs into comparable, quantifiable mathematical objects: vectors.

1 – Corpus dictionary

Vector basis (~1.2 million words/pairs of words)

Key: a.a: Value: 0
Key: a.a.c: Value: 1
…
Key: form: Value: 1474
…
Key: production: Value: 15500
…

2 – Relative weight

TFIDF(word, doc) = TF / DF

• TF (Term Frequency)

• DF (Document Frequency)

Normalized TFIDF vectors

Doc_1: { (w_1; tfidf_1); (w_2; tfidf_2); … }
Doc_2: { (w_1; tfidf_1); (w_2; tfidf_2); … }
…

Normalized vectors (vector coordinates: TFIDF)

Documents (Applications + Offers)   Weight word 1   Weight word 2   …   Weight word X
Doc_1                               0.53            0.93            …   0
Doc_2                               0               0.89            …   0.12
…                                   …               …               …   …
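The weighting described on this slide can be sketched as follows. The sketch uses the slide's formula TF / DF as written (many TF-IDF variants use a logarithm of the inverse document frequency instead) and assumes that "normalized" means scaling each vector to unit Euclidean length; that normalization choice and all names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps doc_id -> list of cleaned tokens; returns doc_id -> {word: weight}."""
    # DF: in how many documents does each word appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # TF: occurrences of each word in this document
        raw = {word: tf[word] / df[word] for word in tf}
        # Assumed normalization: unit L2 length, so a dot product gives the cosine.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors[doc_id] = {word: w / norm for word, w in raw.items()} if norm else {}
    return vectors


vectors = tfidf_vectors({
    "cv_1": ["servic", "qualit", "servic"],
    "offer_1": ["servic", "production"],
})
print(vectors)
```

Storing each vector as a sparse word-to-weight mapping matters here: with a basis of about 1.2 million entries, almost every coordinate of a document is zero.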


Page 73: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Similarity process

Similarity coefficient = cosine between 2 vectors: a measure of the angle between the TFIDF vectors, positive, between 0 and +1

90°: independent information … 0°: exact same information

In SQL:

Application (CV): id_CV, id_word, tfidf

Offer: id_offer, id_word, tfidf

Join: Application.id_word = Offer.id_word

SELECT
  offer.id_offer
  ,cv.id_cv
  ,SUM(cv.tfidf * offer.tfidf) AS cos_sim
FROM offer
INNER JOIN cv ON offer.id_word = cv.id_word
GROUP BY offer.id_offer, cv.id_cv
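The query on this slide can be tried on toy data with an in-memory SQLite database. This is only a sketch: the project ran on Hive, the sample rows are made up, and `id_word` is assumed to be the join column in both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cv (id_cv INTEGER, id_word INTEGER, tfidf REAL);
CREATE TABLE offer (id_offer INTEGER, id_word INTEGER, tfidf REAL);
""")

# Made-up unit-length vectors: one row per non-zero coordinate.
conn.executemany("INSERT INTO cv VALUES (?, ?, ?)",
                 [(1, 10, 0.6), (1, 11, 0.8), (2, 12, 1.0)])
conn.executemany("INSERT INTO offer VALUES (?, ?, ?)",
                 [(100, 10, 0.6), (100, 11, 0.8), (101, 11, 1.0)])

# Joining on the word id pairs up shared coordinates; the sum of products is the dot product.
rows = conn.execute("""
SELECT offer.id_offer, cv.id_cv, SUM(cv.tfidf * offer.tfidf) AS cos_sim
FROM offer
INNER JOIN cv ON offer.id_word = cv.id_word
GROUP BY offer.id_offer, cv.id_cv
ORDER BY offer.id_offer, cv.id_cv
""").fetchall()

similarities = {(id_offer, id_cv): cos_sim for id_offer, id_cv, cos_sim in rows}
print(similarities)
```

Because the vectors are already normalized, the dot product equals the cosine. Pairs with no word in common never appear in the join result, which corresponds to a similarity of 0.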


Page 74: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Facts from the field…

JOB OFFER

Indeed offer: Operator / Operator of the chemical manufacturing. In a company dedicated to the production and packaging of chemicals such as solvents, aerosols, greases for vehicle engines, you'll be loads of different product mixes

Similarity = 17%

XXXX: Technician manufacturing, Chemical industry. AREAS OF EXPERTISE: Monitor compliance on instruments…

Similarity = 13%

YYYY: Operator chemistry. PROFESSIONAL SKILLS: Manipulation measurement tools and controls

Similarity = 9.7%

ZZZZ: Weighing and metering products, techniques, manipulation, tool control, pH meter, meter, microscope ... Procedures for cleaning and disinfection...

Similarity > 12%: High confidence

Similarity 9-12%: Moderate confidence

Similarity 5-9%: Risky match

Similarity < 5%: High risk match
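These bands translate directly into a small lookup. In the sketch below the function name is hypothetical, and the handling of values falling exactly on 12%, 9% or 5% is an assumption, since the slide does not say which band a boundary value belongs to.

```python
def confidence_band(similarity_pct):
    """Map a similarity score in percent to the confidence band from the slide."""
    if similarity_pct > 12:
        return "High confidence"
    if similarity_pct >= 9:   # assumption: exactly 12% counts as moderate
        return "Moderate confidence"
    if similarity_pct >= 5:   # assumption: exactly 5% counts as risky, not high risk
        return "Risky match"
    return "High risk match"


for score in (17, 13, 9.7):
    print(score, confidence_band(score))
```

Applied to the three examples on the slide, the first two scores fall in the top band and the third in the moderate band.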


Page 75: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX

Thank you

Page 76: Hadoop User Group 29Jan2015 Apache Flink / Haven / CapGemnini REX


Thank you, 29 January 2015