hadoop at orange - sophiaconf2012

36
SophiaConf2012 [email protected] Distributed Calculus with Hadoop MapReduce inside Orange Search Engine mardi 10 juillet 12

Upload: ovarene

Post on 23-Jan-2015

1.042 views

Category:

Technology


1 download

DESCRIPTION

During SophiaConf2012, presentation of Hadoop at Orange

TRANSCRIPT

Page 1: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Distributed Calculus withHadoop MapReduce inside Orange Search Engine

mardi 10 juillet 12

Page 2: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

What is Big Data ?

mardi 10 juillet 12

Page 3: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

$ 5 billions (2012)

to$ 50 billions

(by 2017)

Forbes

mardi 10 juillet 12

Page 4: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

«Big Data is the new definitive source of competitive advantage across all

industries»

Jeff Kelly

mardi 10 juillet 12

Page 5: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Product success

«The days are over when you build a product once and it just works.

You have to take ideas, test them, iterate them, use data and analytics to understand what works and what doesn’t in order to be successful»

LivingSocial @PCWorld 2012

mardi 10 juillet 12

Page 6: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Big Data

mardi 10 juillet 12

Page 7: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

What about Hadoop ?

mardi 10 juillet 12

Page 8: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Beliefs

« We believe that by 2015, more than half the world’s data will be processed by Apache Hadoop »

HortonWorks @HadoopSummit 2012

mardi 10 juillet 12

Page 9: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Actors

mardi 10 juillet 12

Page 10: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Hadoop eco-system

mardi 10 juillet 12

Page 11: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Context at Orange(more than 2 years ago)

mardi 10 juillet 12

Page 12: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Orange Search Engine

http://www.orange.fr

http://www.lemoteur.fr

http://www.voila.fr

mardi 10 juillet 12

Page 13: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Search Engine Architecture

Internet

IndexationCollecte

du WEB24/24

(Crawl) Pré-calcul de score

Infrastructure de Recherche

750 millions dedocuments Français

5 milliards de documents

PageRank

mardi 10 juillet 12

Page 14: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Main issue

• PageRank calculus on billions nodes and 10s billions edges

• regularly failed ! (hardware ...)

• 4 to 8 weeks calculus

• unscalable

• failure rate aroud 80%

• One person full time to supervise

mardi 10 juillet 12

Page 15: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Answer

mardi 10 juillet 12

Page 16: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

PageRank portable to Hadoop / MapReduce ?• Simple programing model:

Map(in_k,in_v) => list(out_k,intermed_v)Reduce(out_k,intermed_v) => list(out_v)

• Scalable

• Batch Processing

• YES !

mardi 10 juillet 12

Page 17: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Hadoop Axioms

• System shall manage and heal himself

• Performance shall scale linearly

• Compute shall move to data

• Modular and extensible

mardi 10 juillet 12

Page 18: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Our install ?

mardi 10 juillet 12

Page 19: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Our install

HIVE PIG

MapReduce

HDFS

Zoo

Kee

per

Mahout

Oozie

Khi

ops

mardi 10 juillet 12

Page 20: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

HDFS ?

mardi 10 juillet 12

Page 21: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Hadoop - HDFS

0011010101110000111

Master

Nœud données

Nœud données

Nœud données Nœud

données Nœud données

Client

Read/Write Read

replication Bloc ops

COTS - replication - big blocksmaximize throughput - Metadata in RAM

mardi 10 juillet 12

Page 22: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Map Reduce ?

mardi 10 juillet 12

Page 23: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

MapReduce

cat | «your map» | sort -u | «your reduce»

Programming paradigm

mardi 10 juillet 12

Page 24: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

MapReduce

cat | «your map» | sort -u | «your reduce»

FrameWork

mardi 10 juillet 12

Page 25: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

MapReduce

cat | «your map» | sort -u | «your reduce»

your Job

mardi 10 juillet 12

Page 26: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

MapReduce Insides

mardi 10 juillet 12

Page 27: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Interfaces

• Java API

• Pipes

• Streaming (python, perl, C/C++, ...)

mardi 10 juillet 12

Page 28: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

PIG

• High level data analysis script language

• extensible via UDF

• Structure of a Pig script• load• filter• foreach | group by | join | your functions• order• store

mardi 10 juillet 12

Page 29: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

HIVE

• High level SQL-like query and analysis language

• extensible via UDF

• Structure of a Hive script• create table• load data• select ... from ...• insert | group by | join

mardi 10 juillet 12

Page 30: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Application domains ?

mardi 10 juillet 12

Page 31: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Projects

• Scoring

• User profiling

• Log analysis and statistics

• ... and many others to come

mardi 10 juillet 12

Page 32: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

ROI ?

mardi 10 juillet 12

Page 33: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

ROI

• Lines Of Code

• Development Time

• IT cost

less bug, automatic, scalable ...

10X gain

2X gain4X gain

mardi 10 juillet 12

Page 34: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

Perfect World ?★ YES• Run cost• Development cost• Scalable• Stable• Heterogenity

★ NO• SPOF (almost solved)

• Fastidious debugging• Localy non optimum• mono-site

mardi 10 juillet 12

Page 36: Hadoop at Orange - SophiaConf2012

SophiaConf2012 [email protected]

thanks to• Hadoop - Apache (http://hadoop.apache.org/)

• Khiops - Orange (http://www.khiops.com)

• Shauwn Connoly - HortonWorks (http://youtu.be/yPfysFAGv8s)

• Forbes - article(http://www.forbes.com/sites/siliconangle/2012/02/17/big-data-is-big-market-big-business/)

• Living Social (sentence)

• Terradata (Volumetry Graph)

• http://www.wordle.net/ (Words Cloud)

mardi 10 juillet 12