matthias kricke_martin grimmer_michael schmeißer - building a real time tweet map with flink in six...

22
Building a real time Tweet map with Flink in six weeks OSTMap Fast poc development with flink

Upload: flink-forward

Post on 11-Jan-2017

91 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Building a real time Tweet map with Flink in six weeksOSTMapFast poc development with flink

Page 2: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Proof of concept - an important tool in the industry

• PoC often necessary to show feasibility to customers

• touch several topics:• Scalability• Stream processing• Batch processing• Storage and querying of data

• OSTMap as example PoC

Page 3: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Goals for OSTMap

• Increase trust into big data technologies on customer side• It is easy to build an application

with current technologies• With almost no experience

• Teach students big data technologies• Recruiting

• Bring big data to the university

• Build a real time application to view recent geotagged tweets on a map

• Search for terms and users, show these tweets on a map

• Analytics:• First data science jobs• …

Page 4: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Industry in practice: IT-Ringvorlesung 2016

• A course at the University of Leipzig.• work on projects of local companies• six students• over a period of 6 weeks - no full time

invest• Weekly meetings• Github project: github.com/IIDP/OSTMap

Nico Graebling Vincent Märkl

Hans Dieter Pogrzeba

Christopher SchottChristopher Rost

Kevin Shrestha

Michael Schmeißer

Martin Grimmer

Matthias Kricke

OSTMap

Page 5: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

mgm technology partnersWe bring applications into production!

• Innovative software solution provider with application responsibility

• Specialist for highly scalable, transactional online applications

• Central lines of business: Insurance, E-Commerce, E-Government

• Founded in 1994

• 347 employees, 9 offices (2014)

• Revenue: 43,7 Mio € (2014)

• Part of Allgeier SE

Page 6: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

ScaDSCompetence center for scalable data services and solutions Dresden/Leipzig

• bundled Big Data research expertise of the TU Dresden and Leipzig University

• Drive Big Data innovations• Bring industry and science together• Knowledge exchange and transfer

Page 7: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Walking skeleton“A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel.” - Alistair Cockburn

gif from http://blog.codeclimate.com/blog/2014/03/20/kickstart-your-next-project-with-a-walking-skeleton

Page 8: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Milestone 1read stream, store data as json file, show tweets, read data from json files

Page 9: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Milestone 2write to and read from accumulo, show tweets on map, full table scans, slow visualization

Page 10: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Milestone 3Term index, geotemporal index, ui improvements, clustering, …

Page 11: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

OSTMap – stream, batch, storage and querying

geotagged tweets

webservice

a) stream processing

b) batch processing

c) querying data

Page 12: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Stream processing of incoming data – first version

GeoTweetSource KeyGeneration RawTweetSinkDateExtraction

This enabled us to build a slow term search and a slow map search via full table scans.

time index

data for

Page 13: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Stream processing of incoming data – final version

TermIndexSink

GeoTweetSource KeyGeneration RawTweetSinkDateExtraction

Now we were able to build a faster term and map search and language frequency visualization.

time indexTermExtraction

(tokenizing)

UserExtraction

LanguageFrequencySink

LanguageExtraction

term index

language statistics

GeoTemporalIndexCreation

GeoTemporalIndexSink geotemporal index

data for

1 minute window

sum by language

Page 14: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Batch processing• Initial creation of the term index and

geotemporal index for already processed tweets• Data export• Other statistics like:

• Area/ tweet distance a user covers with his tweets

Page 15: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Storage

Table Row Column Family Column Qualifier Value

RawTweetData (TimeIndex)

timestamp, hash8b + 4b - - raw tweet json

TermIndex term field (user,text) RawTweetData key12b -

LanguageFrequency time bucketYYYYMMDDhhmm language-tag - tweet count

4b

Accumulo table design

Page 16: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Geotemporal Index for OSTMapGeo index

geo data

geohashes usedas row keys in accumulo

…3z6b6c6f6q9p9r9x9zd0d1d2d3d4d5d6…dg

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

partitioned by geohash (z curve)

function from 2d coordinate space to 1d key space

Row CF CQ

geohash RawTweetKey -

Page 17: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Geotemporal Index for OSTMapGeo index – querying?

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

partitioned by geohash

bounding box

calculate coverage of

bounding box

range: [9p]

calculate scan ranges from

coverage

range: [9r]

range: [d0,d1,d2,d3]

…3z6b6c6f6q9p9r9x9zd0d1d2d3d4d5d6…dg

accumulo iteratorsaccumulo iterators

accumulo iterators

result

Row CF CQ

geohash RawTweetKey lat/lon

Page 18: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

Geotemporal Index for OSTMapAdd some time!

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

partitioned by geohash,with timebuckets

…13z16b16c16f16q19p19r19x19z1d01d11d21d31d41d51d6…1dg

day

lon

lat

…23z26b26c26f26q29p29r29x29z2d02d12d22d32d42d52d6…2dg

Row CF CQ

day, geohash

RawTweetKey lat/lon

day 1 day 2 day i …

Page 19: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

Geotemporal Index for OSTMapWhat about Hotspots?

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

partitioned by geohash,with timebuckets

…13z16b16c16f16q19p19r19x19z1d01d11d21d31d41d51d6…1dg

day

lon

lat

…23z26b26c26f26q29p29r29x29z2d02d12d22d32d42d52d6…2dg

Row CF CQ

day, geohash RawTweetKey lat/lon

Page 20: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

Geotemporal Index for OSTMapWhat about Hotspots?

9z db dc df dg

9x d8 d9 dd de

9r d2 d3 d6 d7

9p d0 d1 d4 d5

3z 6b 6c 6f 6g

partitioned by geohash,with timebuckets

day

lon

lat

…12d212d312d4

……

Row CF CQ

sb, day, geohash

RawTweetKey lat/lon

…11d211d311d4

…02d202d302d4

……

…01d201d301d4

…22d222d322d4

……

…21d221d321d4

spreading byte

node 0

node 1

node 2

node n

• spreading byte = hash(tweet) % 255• reproducable• pre table splits in accumulo

Page 21: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

demo

Page 22: Matthias Kricke_Martin Grimmer_Michael Schmeißer - Building a real time Tweet map with Flink in six weeks

Martin Grimmer grimmer[at]informatik.uni-leipzig.deMatthias Kricke kricke[at]informatik.uni-leipzig.de

www.mgm-tp.comwww.scads.de

Thank you

Michael Schmeißer michael.schmeisser[at]mgm-tp.com