
Context from Big Data: Startup Showcase

IEEE Big Data Conference, November 1, 2015

Santa Clara, CA

Delroy Cameron, Data Scientist

@urxtech | urx.com | [email protected]

People: URX has 40 people (75% product/eng, 25% business).

Customers: URX partners with the world’s top publishers & advertisers.

Funding: URX raised $15M from Accel, Google Ventures, and others.

Who is URX?

URX is a mobile technology platform that focuses on publisher monetization, content distribution, and user engagement.

What problem does URX solve?

URX serves contextually relevant native ads.

URX interprets page context to dynamically determine the best message & action.

How does URX affect the mobile ecosystem?

Volume (apps), Volume (web pages), Variety (entities)

Why is this a Big Data problem?

Rhapsody (Music), Fansided (Sports), Apple (Music, TV, Books)

Source: The Statistics Portal - http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/

1.6M Apps (Android), 1.5M Apps (Apple Store)

How do we collect, store, and process the data needed to build our machine learning models?

1. Data Collection and Parsing
2. Data Storage
   • Persistent Storage
   • Search Index
3. Data Processing
   • Dictionary Building
   • Vectorization (Feature Vector Creation)

Important tasks

• 11GB XML dump (gzip file)
• 15M pages (but only 4M articles)
• Wikitext grammar

Wikipedia Corpus (English)

1. Data collection & parsing

https://dumps.wikimedia.org/enwiki/latest/

<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility"/>
  <revision>
    <id>631144794</id>
    <parentid>381202555</parentid>
    <timestamp>2014-10-26T04:50:23Z</timestamp>
    <contributor>
      <username>Paine Ellsworth</username>
      <id>9092818</id>
    </contributor>
    <comment>add [[WP:RCAT|rcat]]s</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{Redr|move|from CamelCase|up}}</text>
    <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
  </revision>
</page>
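Each <page> element can be pulled from the compressed dump without ever inflating the full 11GB file. Below is a minimal sketch of that streaming approach, assuming the standard dump filename; the strip_ns helper and variable names are ours, not URX's code:

    import gzip
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.gz"  # assumed filename

    def strip_ns(tag):
        # dump elements carry a MediaWiki XML namespace; keep the local name
        return tag.rsplit("}", 1)[-1]

    def iter_pages(path=DUMP):
        # stream (title, redirect, wikitext) tuples, one <page> at a time
        with gzip.open(path, "rb") as f:
            for _, elem in ET.iterparse(f):
                if strip_ns(elem.tag) != "page":
                    continue
                title = redirect = text = None
                for child in elem.iter():
                    name = strip_ns(child.tag)
                    if name == "title":
                        title = child.text
                    elif name == "redirect":
                        redirect = child.get("title")
                    elif name == "text":
                        text = child.text
                yield title, redirect, text
                elem.clear()  # release the subtree as we stream

    # drop the ~11M redirect/meta pages, keep the ~4M real articles
    articles = (p for p in iter_pages() if p[1] is None)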


1. Data collection & parsing

• FullWikiParser (mediawikiparser): sax library, generator; 20 secs/doc, 10 years
• FastWikiParser (mwparserfromhell): sax library, generator; 200 docs/sec, ~21 hours
• HTMLWikiParser (URX Index): hbase, lxml parser; 6 docs/sec, ~one month
• GensimWikiCorpusParser: multithreading, generator; ~3 hours
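The jump from 20 secs/doc to 200 docs/sec comes from swapping the grammar-driven mediawikiparser for mwparserfromhell, whose tokenizer can strip markup directly. A self-contained sketch of that strip step (the sample wikitext is ours):

    import mwparserfromhell

    # toy wikitext standing in for a streamed page body
    wikitext = "'''Taylor Swift''' is an [[United States|American]] singer. {{Infobox person}}"

    # tokenize, then drop templates, links, and formatting markup
    plain = mwparserfromhell.parse(wikitext).strip_code()
    print(plain)  # roughly: "Taylor Swift is an American singer."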

wikipedia-parser:
1. pyspark (64 cores, 8GB RAM)
2. wikihadoop (StreamWikiDumpInputFormat): split the input file
3. mwparserfromhell: parse to raw text
4. ~20 minutes
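Roughly what that job looks like in pyspark, assuming wikihadoop's StreamWikiDumpInputFormat is on the classpath and hands back one <page> fragment per record; the HDFS paths and the regex extraction are our assumptions:

    import re
    import mwparserfromhell
    from pyspark import SparkContext

    sc = SparkContext(appName="wikipedia-parser")

    # wikihadoop splits the monolithic XML dump into per-page records
    pages = sc.newAPIHadoopFile(
        "hdfs:///data/enwiki-latest-pages-articles.xml.bz2",  # assumed path
        "org.wikimedia.wikihadoop.StreamWikiDumpInputFormat",
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.Text")

    def to_plain_text(page_xml):
        # pull the wikitext body out of the page fragment, then strip markup
        m = re.search(r"<text[^>]*>(.*?)</text>", page_xml, re.DOTALL)
        return mwparserfromhell.parse(m.group(1)).strip_code() if m else ""

    pages.map(lambda kv: to_plain_text(kv[1])) \
         .saveAsTextFile("hdfs:///data/enwiki-plaintext")  # assumed output path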

[Diagram: the wikipedia-indexer moves parsed pages from HDFS (Namenode plus datanodes 1..n) into an Elasticsearch index (ClusterNode 1..m).]

2. Data storage
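For the search-index half of storage, parsed pages can be bulk-loaded into Elasticsearch. A minimal sketch with the elasticsearch-py bulk helper; the cluster address, index name, and document shape are all assumptions:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["localhost:9200"])  # assumed cluster address

    # in the real pipeline these would stream from the wikipedia-parser output
    docs = [("Taylor Swift", "Taylor Alison Swift is an American singer ..."),
            ("MacGyver", "MacGyver is an American action-adventure series ...")]

    actions = ({"_index": "wikipedia",  # assumed index name
                "_type": "page",
                "_source": {"title": title, "text": text}}
               for title, text in docs)
    bulk(es, actions)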

wikipedia-parser

Sample dictionary entries (id token):
(0 taylor) (1 alison) (2 swift) (3 born) (4 december) … (1999995 zion) (1999996 dozer) (1999997 tank) (1999998 trinity) (1999999 neo)

3. Data Processor (Dictionary building)

• Pyspark (Gensim): wikihadoop, StreamWikiDumpInputFormat; dictionary, tfidfmodel; ~1 hour
• GensimWikiCorpusParser: multithreading, generator; corpus, dictionary, tfidfmodel; ~6 hours
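With plain text on disk, both the dictionary and the TF-IDF model drop out of two gensim calls. A sketch assuming one whitespace-tokenized document per line of an exported text file; keep_n mirrors the 2M-entry dictionary shown above:

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    def tokenized_docs():
        with open("enwiki-plaintext.txt") as f:  # assumed local export
            for line in f:
                yield line.lower().split()

    dictionary = Dictionary(tokenized_docs())  # token -> integer id
    dictionary.filter_extremes(no_below=20, no_above=0.5,
                               keep_n=2000000)  # cap at 2M entries
    bow = (dictionary.doc2bow(doc) for doc in tokenized_docs())
    tfidf = TfidfModel(bow)  # learns IDF weights over the corpus

    dictionary.save("wiki.dict")
    tfidf.save("wiki.tfidf")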

Alias          Candidate Entity                          f1    f2    …    fn
Taylor Swift   wikipedia:Taylor_Swift                    0.91  0.81  …    0.34
               wikipedia:Taylor_Swift_(album)            0.42  0.10  …    0.42
               wikipedia:1989_(Taylor_Swift_album)       0.71  0.23  …    0.31
               wikipedia:Fearless_(Taylor_Swift_song)    0.13  0.22  …    0.23
               wikipedia:John_Swift                      0.00  0.19  …    0.56

4. Data Processor (Vectorization)
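One way to read the f1..fn columns above: each feature scores how well a candidate entity matches the alias in context, and TF-IDF cosine similarity between the surrounding text and the candidate's article is a natural such feature. A toy, self-contained sketch (the candidate texts are placeholders, not URX's features):

    from gensim.corpora import Dictionary
    from gensim.matutils import cossim
    from gensim.models import TfidfModel

    # toy candidate articles standing in for real Wikipedia page text
    texts = {
        "wikipedia:Taylor_Swift": "taylor alison swift is an american singer songwriter",
        "wikipedia:Taylor_Swift_(album)": "taylor swift is the debut studio album",
        "wikipedia:John_Swift": "john swift was an english footballer",
    }
    tokenized = {k: v.split() for k, v in texts.items()}

    dictionary = Dictionary(tokenized.values())
    tfidf = TfidfModel(dictionary=dictionary)  # IDF straight from the dictionary

    def vectorize(tokens):
        # bag-of-words -> sparse TF-IDF vector
        return tfidf[dictionary.doc2bow(tokens)]

    candidates = {k: vectorize(v) for k, v in tokenized.items()}
    context = vectorize("the singer taylor swift released a new album".split())

    # rank candidate entities by cosine similarity to the alias context
    ranked = sorted(candidates.items(),
                    key=lambda kv: cossim(context, kv[1]), reverse=True)
    print(ranked[0][0])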

• Gensim: ~350ms to predict an entity per alias
• Cython: ~100ms to predict an entity per alias

[System overview diagram, steps 1-7: the WikipediaCorpus, WikilinksCorpus, and XCorpus are fed through the corpus-parser into HDFS (Wikipedia, Wikilinks, X Corpus), then through the corpus-indexer into Elasticsearch nodes 1..n; the Data Processor builds the Dictionary and TF-IDF Model consumed by the Machine Learning Module.]

Demo

Linked Entities:
1. http://en.wikipedia.org/wiki/Macgyver
2. http://en.wikipedia.org/wiki/Neil_deGrasse_Tyson
3. http://en.wikipedia.org/wiki/Richard_Dean_Anderson
4. http://en.wikipedia.org/wiki/Josh_Holloway
5. http://en.wikipedia.org/wiki/NBC
6. http://en.wikipedia.org/wiki/CBS
7. http://en.wikipedia.org/wiki/James_Wan
8. http://en.wikipedia.org/wiki/Netflix
9. http://en.wikipedia.org/wiki/America_America

http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/
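Behind a demo like this, candidates for a surface form such as "MacGyver" come out of the Elasticsearch index from step 2 and are then re-ranked as in step 4. A hedged sketch; the query shape and index name are assumptions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # assumed cluster address

    def candidate_entities(alias, k=5):
        # full-text match on page titles; the top-k hits become candidates
        res = es.search(index="wikipedia",  # assumed index name
                        body={"query": {"match": {"title": alias}}},
                        size=k)
        return [hit["_source"]["title"] for hit in res["hits"]["hits"]]

    print(candidate_entities("MacGyver"))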

● Tuning pyspark jobs (64 cores, 8GB Driver RAM)

● Bringing down the elasticsearch cluster

● Rejoining the union after secession (elasticsearch nodes)

● Text Cleaning (lowercasing, character encoding)

● Merging in Hadoop for dictionary creation

Things to watch out for
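On the tuning point: the 64-core, 8GB-driver setup from the parsing job can be pinned in code instead of rediscovered per run. A sketch with standard Spark config keys (values mirror the slides; note spark.driver.memory only takes effect if set before the driver JVM starts, e.g. via spark-submit):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("wikipedia-parser")
            .set("spark.cores.max", "64")       # cap total executor cores
            .set("spark.driver.memory", "8g"))  # must be set pre-launch
    sc = SparkContext(conf=conf)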

Thank you! [email protected]