TRANSCRIPT
Context from Big Data
Startup Showcase, IEEE Big Data Conference, November 1, 2015
Santa Clara, CA
Delroy Cameron, Data Scientist
@urxtech | urx.com | [email protected]
People: URX has 40 people: 75% product/eng, 25% business.
Customers: URX partners with the world's top publishers & advertisers.
Funding: URX raised $15M from Accel, Google Ventures, and others.
Who is URX?
URX is a mobile technology platform that focuses on publisher monetization, content distribution, and user engagement.
URX serves contextually relevant native ads.
URX interprets page context to dynamically determine the best message & action.
Volume (apps)
Volume (web pages)
Variety (entities)
Why is this a Big Data problem?
Rhapsody (Music)
Fansided (Sports)
Apple (Music, TV, Books)
Source: The Statistics Portal - http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/
1.6M Apps (Android), 1.5M Apps (Apple App Store)
1. Data Collection and Parsing
2. Data Storage
• Persistent Storage
• Search Index
3. Data Processing
• Dictionary Building
• Vectorization (Feature Vector Creation)
Important tasks
11GB XML dump (gzip file)
15M pages (but only 4M articles)
Wikitext grammar
Wikipedia Corpus (English)
1. Data collection & parsing
https://dumps.wikimedia.org/enwiki/latest/
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility"/>
  <revision>
    <id>631144794</id>
    <parentid>381202555</parentid>
    <timestamp>2014-10-26T04:50:23Z</timestamp>
    <contributor>
      <username>Paine Ellsworth</username>
      <id>9092818</id>
    </contributor>
    <comment>add [[WP:RCAT|rcat]]s</comment>
    <model>wikitext</model>
    <format>text/x-wiki</format>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{Redr|move|from CamelCase|up}}</text>
    <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
  </revision>
</page>
1. Data collection & parsing
https://dumps.wikimedia.org/enwiki/latest/
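For orientation, here is a minimal sketch of how a <page> record like the one above can be streamed out of the compressed dump and stripped to plain text. The file name is the standard dump name and mwparserfromhell (used by FastWikiParser on the next slide) is one choice of wikitext stripper; both are assumptions, not the exact URX code.

import gzip
import xml.etree.ElementTree as ET
import mwparserfromhell

def localname(tag):
    return tag.rsplit("}", 1)[-1]        # drop the MediaWiki XML namespace

def iter_articles(path):
    """Yield (title, plain_text) for non-redirect pages, one at a time,
    so the ~11GB dump never has to fit in memory."""
    with gzip.open(path, "rb") as f:     # slide says gzip; swap in bz2 for .bz2 dumps
        title, text, is_redirect = None, "", False
        for _, elem in ET.iterparse(f, events=("end",)):
            name = localname(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "redirect":
                is_redirect = True       # e.g. the AccessibleComputing page above
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                if not is_redirect:
                    # strip_code() removes the wikitext markup
                    yield title, mwparserfromhell.parse(text).strip_code()
                title, text, is_redirect = None, "", False
                elem.clear()             # free the finished <page> subtree

for title, plain in iter_articles("enwiki-latest-pages-articles.xml.gz"):
    print(title, plain[:60])
    break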
1. Data collection & parsing
Parser                             Approach                   Speed         Total time
FullWikiParser (mediawikiparser)   sax library, generator     20 secs/doc   ~10 years
FastWikiParser (mwparserfromhell)  sax library, generator     200 docs/sec  ~21 hours
HTMLWikiParser (URX Index)         hbase, lxml parser         6 docs/sec    ~1 month
GensimWikiCorpusParser             multithreading, generator  (not listed)  ~3 hours
1. pyspark (64 cores, 8GB RAM)
2. wikihadoop (StreamWikiDumpInputFormat): split the input file
3. mwparserfromhell: parse to raw text
4. Total: ~20 minutes
wikipedia-parser
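A sketch of what the wikipedia-parser pyspark job might look like. The input-format class name comes from the wikihadoop project named above, but the key/value types and HDFS paths here are assumptions; check the wikihadoop README before relying on them.

from pyspark import SparkContext
import mwparserfromhell

# Requires the wikihadoop jar on the Spark classpath
# (e.g. spark-submit --jars wikihadoop.jar).
sc = SparkContext(appName="wikipedia-parser")

# StreamWikiDumpInputFormat splits the single compressed dump on <page>
# boundaries so the parse can fan out across the 64 cores.
pages = sc.hadoopFile(
    "hdfs:///wikipedia/enwiki-latest-pages-articles.xml.gz",
    "org.wikimedia.wikihadoop.StreamWikiDumpInputFormat",
    "org.apache.hadoop.io.Text",    # assumed key class: one page's XML
    "org.apache.hadoop.io.Text")    # assumed value class

raw_text = pages.map(lambda kv: mwparserfromhell.parse(kv[0]).strip_code())
raw_text.saveAsTextFile("hdfs:///wikipedia/raw-text")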
[Architecture diagram: wikipedia-indexer reads the parsed corpus from HDFS (Namenode plus datanodes 1..n) and writes it to an Elasticsearch index (ClusterNode 1..m).]
2. Data storage
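The indexing leg that wikipedia-indexer performs can be approximated with elasticsearch-py's bulk helper; the index name, document shape, and node addresses below are illustrative assumptions.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["es-node1:9200", "es-node2:9200"])

# toy stand-in for the (title, plain_text) pairs from wikipedia-parser
parsed_pages = [("Taylor Swift", "Taylor Alison Swift is a singer...")]

def actions():
    for title, text in parsed_pages:
        yield {"_index": "wikipedia", "_id": title,
               "_source": {"title": title, "text": text}}

# streaming_bulk batches the index requests (here 500 docs per request)
for ok, item in streaming_bulk(es, actions(), chunk_size=500):
    if not ok:
        print("failed:", item)   # surface per-doc failures instead of dying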
[Diagram: wikipedia-parser output feeds the dictionary builder. Sample dictionary entries (id, token): (0 taylor), (1 alison), (2 swift), (3 born), (4 december) ... (1999995 zion), (1999996 dozer), (1999997 tank), (1999998 trinity), (1999999 neo).]
3. Data Processor (Dictionary building)
Approach                Tools                                  Outputs                         Time
Pyspark (Gensim)        wikihadoop, StreamWikiDumpInputFormat  dictionary, tfidfmodel          ~1 hour
GensimWikiCorpusParser  multithreading, generator              corpus, dictionary, tfidfmodel  ~6 hours
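Both rows end in the same two artifacts. A toy-sized sketch of building them with gensim, where the token lists stand in for the streamed Wikipedia articles:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

token_stream = [
    ["taylor", "swift", "born", "december"],     # toy docs standing in for
    ["neo", "trinity", "tank", "dozer", "zion"]  # streamed Wikipedia articles
]

dictionary = Dictionary(token_stream)            # the (id, token) map on the slide
bow = [dictionary.doc2bow(doc) for doc in token_stream]
tfidf = TfidfModel(bow)                          # learns IDF weights from the corpus

print(dictionary.token2id["swift"])              # integer id, as in (2 swift)
print(tfidf[bow[0]])                             # sparse TF-IDF vector for doc 0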
Alias         Candidate Entity                         f1    f2    …  fn
Taylor Swift  wikipedia:Taylor_Swift                   0.91  0.81  …  0.34
              wikipedia:Taylor_Swift_(album)           0.42  0.10  …  0.42
              wikipedia:1989_(Taylor_Swift_album)      0.71  0.23  …  0.31
              wikipedia:Fearless_(Taylor_Swift_song)   0.13  0.22  …  0.23
              wikipedia:John_Swift                     0.00  0.19  …  0.56
4. Data Processor (Vectorization)
Gensim: ~350ms to predict an entity per alias
Cython: ~100ms to predict an entity per alias
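The per-alias prediction amounts to scoring each candidate entity's page vector against the context the alias appears in. A toy sketch with gensim's MatrixSimilarity (the candidate texts and the context string are made up for illustration):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity

candidates = {
    "wikipedia:Taylor_Swift": "taylor swift singer album 1989 pop".split(),
    "wikipedia:Taylor_Swift_(album)": "debut album taylor swift country".split(),
    "wikipedia:John_Swift": "john swift english footballer club".split(),
}
names = list(candidates)
dictionary = Dictionary(candidates.values())
bow = [dictionary.doc2bow(candidates[n]) for n in names]
tfidf = TfidfModel(bow)
index = MatrixSimilarity(tfidf[bow], num_features=len(dictionary))

# score candidates for the alias "swift" given its surrounding page context
context = dictionary.doc2bow("swift released a new pop album".split())
scores = index[tfidf[context]]           # cosine similarity per candidate
print(max(zip(scores, names)))           # highest-scoring candidate wins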
[End-to-end architecture diagram: the WikipediaCorpus, WikilinksCorpus, and XCorpus live on HDFS (Wikipedia, Wikilinks, X Corpus); corpus-parser and corpus-indexer feed the Data Processor, which builds the Dictionary and TF-IDF Model for the Machine Learning Module; documents are served from Elasticsearch nodes 1..n.]
[Screenshot: zap2it article with numbered callouts marking the linked entities listed below.]
Linked Entities:
1. http://en.wikipedia.org/wiki/Macgyver
2. http://en.wikipedia.org/wiki/Neil_deGrasse_Tyson
3. http://en.wikipedia.org/wiki/Richard_Dean_Anderson
4. http://en.wikipedia.org/wiki/Josh_Holloway
5. http://en.wikipedia.org/wiki/NBC
6. http://en.wikipedia.org/wiki/CBS
7. http://en.wikipedia.org/wiki/James_Wan
8. http://en.wikipedia.org/wiki/Netflix
9. http://en.wikipedia.org/wiki/America_America
http://zap2it.com/2015/10/5-reasons-cbs-macgyver-reboot-isnt-the-worst-idea-ever/
● Tuning pyspark jobs (64 cores, 8GB Driver RAM)
● Bringing down the elasticsearch cluster
● Rejoining the union after secession (elasticsearch nodes)
● Text cleaning (lowercasing, character encoding; see the sketch after this list)
● Merging in Hadoop for dictionary creation
Things to watch out for
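On the text-cleaning bullet, a minimal sketch of an order of operations that avoids the encoding pitfalls (the function name and normalization policy are illustrative):

import unicodedata

def clean(text):
    """Normalize encoding first, then lowercase, so accented names
    don't split into multiple dictionary entries."""
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")   # tolerate bad bytes
    text = unicodedata.normalize("NFKC", text)          # one canonical form per char
    return text.lower()

print(clean("Beyoncé".encode("utf-8")))   # -> 'beyoncé'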
Getting started is easy.
Sign Up | Download SDK | Start Building
Visit http://urx.com/sign-up for more information.
Thank [email protected]