Redis for duplicate detection on real time stream
![Page 1: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/1.jpg)
Redis for duplicate detection on real time stream
![Page 2: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/2.jpg)
whoami(1)
15 years of experience, proud to be a programmer
Writes software for information extraction, NLP, opinion mining (@scale), and a lot of other buzzwords
Implements scalable architectures
Member of the JUG-Torino coordination team
github.com/robfrank
twitter.com/robfrankie
linkedin.com/in/robfrank
http://www.celi.it
http://www.blogmeter.it
![Page 3: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/3.jpg)
Agenda
What is it?
Main features
Caching
Counters
Scripting
How we use it
![Page 4: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/4.jpg)
From the site
Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.
![Page 5: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/5.jpg)
Who uses it
Twitter
Github
Youporn
Pinterest
Groupon
...
![Page 6: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/6.jpg)
Ecosystem
Clients in every known language
Articles, books, presentations
On High Scalability every other day
![Page 7: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/7.jpg)
Architecture
Single-threaded server
Yes: single threaded server
Remember that when you need to scale
Single Linux server can handle 500k req/s
![Page 8: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/8.jpg)
Main features
In memory K/V store
But with durable persistence
Master-slave async replica
Transactions
Pub/Sub
Server side LUA scripting
![Page 9: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/9.jpg)
Main features
Keys with TTL
LRU eviction
Keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs
REDIS cluster on the go (3.0.0-rc1)
![Page 10: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/10.jpg)
K/V store
Key-value (KV) stores use the associative array (also known as a map or dictionary) as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection. (wikipedia)
![Page 11: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/11.jpg)
K/V store
Key → value types:
String/blobs/bitmaps: “plain text”
HashTable (objects): name rob, surname frank
Linked lists: A C E D B F
Sets: A B C D E F
![Page 12: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/12.jpg)
Persistence
Configurable, two flavors
RDB: perfect for backup
AOF: append only log, replayed at startup
Use AOF + RDB for rock solid persistence
Automatic cache warm-up at startup!!
Only RAM: switch off persistence
![Page 13: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/13.jpg)
Common use cases
Cache
Queue
Session replication
In memory indexes
Centralized ID generation
![Page 14: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/14.jpg)
Basics
SET user:1 frank
GET user:1 → frank
EXISTS user:1 → 1
EXPIRE user:1 3600
INCR count:1
GET count:1 → 1
![Page 15: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/15.jpg)
Basics
MSET user:1 frank user:2 coder
MGET user:1 user:2 → frank, coder
KEYS user:* → user:1, user:2
HMSET userdetail:3 name rob surname frank
HGETALL userdetail:3 → name: rob, surname: frank
![Page 16: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/16.jpg)
Transactions
MULTI
INCR counter:1
INCR counter:2
EXEC
> 1
> 1
WATCH counter:3
val = GET counter:3
val = val + 1
MULTI
SET counter:3 $val
EXEC
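The WATCH / MULTI / EXEC sequence above is optimistic locking: EXEC applies the queued write only if the watched key did not change after WATCH, otherwise the client has to retry. A minimal in-memory sketch of that check-and-set idea follows. `VersionedStore` and its methods are hypothetical stand-ins for illustration, not a Redis client API.

```python
class VersionedStore:
    """Toy key-value store with a version per key (hypothetical, not Redis)."""

    def __init__(self):
        self._data = {}
        self._versions = {}

    def get(self, key):
        # Missing counters read as 0 in this toy model.
        return self._data.get(key, 0)

    def watch(self, key):
        # Remember which version of the key we saw, like WATCH.
        return self._versions.get(key, 0)

    def set(self, key, value):
        # Unconditional write; bumps the version so watchers notice.
        self._data[key] = value
        self._versions[key] = self._versions.get(key, 0) + 1

    def exec_set(self, key, value, watched_version):
        # Like EXEC after WATCH: write only if nobody touched the key meanwhile.
        if self._versions.get(key, 0) != watched_version:
            return False
        self.set(key, value)
        return True


def increment_with_retry(store, key, max_attempts=10):
    """Read-modify-write that retries when the watched key changed."""
    for _ in range(max_attempts):
        version = store.watch(key)
        new_value = store.get(key) + 1
        if store.exec_set(key, new_value, version):
            return new_value
    raise RuntimeError("too much contention on " + key)
```

For a plain increment, INCR is simpler and already atomic; the retry loop only pays off when the new value depends on logic the server cannot run in a single command.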
![Page 17: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/17.jpg)
Atomic counters
Operators for key increment
INCR counter:1
GET counter:1 → 1
INCRBY counter:1 9
GET counter:1 → 10
![Page 18: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/18.jpg)
LUA scripting
Server side LUA scripting
A “sort of” stored procedure
Scripts are sandboxed
Atomic execution ← bear in mind
![Page 19: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/19.jpg)
LUA scripting
SCRIPT LOAD "return {KEYS[1],KEYS[2]}"
"3905aac1828a8f75707b48e446988eaaeb173f13"
EVALSHA 3905aac1828a8f75707b48e446988eaaeb173f13 2 user:1 user:2
1) "user:1"
2) "user:2"
![Page 20: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/20.jpg)
Caching: server level
Configure REDIS as a cache
maxmemory 1024mb
maxmemory-policy allkeys-lru
When the memory limit is reached, any key can be evicted, chosen by an approximated LRU algorithm
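To make the eviction policy concrete, here is a small Python sketch of a strict LRU cache. It is only an illustration of the idea. Redis limits memory rather than the number of keys and approximates LRU by sampling, so the `LruCache` class and its `max_keys` parameter are assumptions of this sketch.

```python
from collections import OrderedDict


class LruCache:
    """Strict LRU cache bounded by key count (illustrative, not how Redis is built)."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # reading a key makes it most recently used
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.max_keys:
            self._items.popitem(last=False)  # evict the least recently used key
```

With `max_keys=2`, setting `a` and `b`, reading `a`, then setting `c` evicts `b`, because `b` is the key that was touched least recently.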
![Page 21: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/21.jpg)
Caching: TTL on key
Set a timeout on a key
SET doc:1 "mydoc.txt"
EXPIRE doc:1 10
Or
SETEX doc:1 10 "mydoc.txt"
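The effect of a TTL can be sketched in a few lines of Python: each value carries an expiry time and reads treat expired entries as missing. `TtlStore` and its injectable `clock` are hypothetical names for this sketch, which only models the behavior seen by a client.

```python
import time


class TtlStore:
    """Toy store where every key expires after a number of seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the behavior can be checked without sleeping
        self._items = {}      # key -> (value, expiry timestamp)

    def setex(self, key, seconds, value):
        # Same argument order as SETEX: key, timeout, value.
        self._items[key] = (value, self._clock() + seconds)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop the expired entry on access
            return None
        return value
```

After `setex("doc:1", 10, "mydoc.txt")`, a `get("doc:1")` returns the value for the next 10 seconds and `None` afterwards.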
![Page 22: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/22.jpg)
Demo
![Page 23: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/23.jpg)
Caching+
Atomic Counters+
Atomic LUA scripting
![Page 24: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/24.jpg)
Duplicate detection
Real time stream of documents from the Internet
20% to 50% of documents are duplicated
DUPLICATES ARE EVIL
And customers don’t pay for that :(
![Page 25: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/25.jpg)
Basic Scenario
[Diagram: three Producers → Duplicates detector → NLP → Storage; flow labels: 5M, 3M, 3M]
![Page 26: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/26.jpg)
Avoid duplicated documents
Acting on producers was TOO HARD
Filter them out before heavy document analysis (NLP)
![Page 27: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/27.jpg)
Documents
“Documents” are from:
twitter
facebook
gplus
instagram
forums
blogs
![Page 28: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/28.jpg)
Documents
Each kind of document has its own natural id
twitter: status id
facebook: post id
forum: URL
blog: URL
We don’t want these IDs inside our system
![Page 29: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/29.jpg)
Duplicate and id generation
[Diagram: three Producers feed two "Duplicate detector - ID generation" instances, each followed by Analysis, then Storage; flow labels: 5M, 3M, 3M, 2M, 1M, 1M]
![Page 30: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/30.jpg)
Map external keys to internal UID
Generate an ID for each document
IDs are generated using daily named counters:
INCR day:20141028 → 12576
INCR day:20141010 → 23412576
Cache generated ID
tw_1234578688 → day:20141028;12576
![Page 31: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/31.jpg)
Map external keys to internal UID
Documents are internally stored on different storage systems with their generated id
globalId → 20141028:3456789
![Page 32: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/32.jpg)
Operations
Natural keys are cached with TTL
Documents out of time are parked in a staging area
Duplicated documents are usually dropped
![Page 33: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/33.jpg)
LRU cache, counters and LUA
LUA scripts are executed atomically
Wrote a simple script to:
return previous mapped id
or generate id and store key and id in cache
EVALSHA "sha" 2 20141028 tw_1234566 → 20141028:123
GET tw_1234566 → 20141028:123
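The script itself is not shown on the slides, so the following Python sketch only mirrors the logic described above: look up the natural key, return the mapped id if it exists, otherwise increment the daily counter and store the new mapping. The lock stands in for the atomic execution that Redis gives to a script. The class name, the `is_duplicate` flag in the return value and the in-memory dictionaries are assumptions of this sketch, not the talk's code.

```python
import threading


class DuplicateDetector:
    """In-memory model of the get-or-create-id step (hypothetical, not the original script)."""

    def __init__(self):
        self._lock = threading.Lock()  # models atomic script execution
        self._counters = {}            # day -> last counter value issued for that day
        self._ids = {}                 # natural key -> internal id

    def get_or_create_id(self, day, natural_key):
        """Return (internal_id, is_duplicate) for a document's natural key."""
        with self._lock:
            existing = self._ids.get(natural_key)
            if existing is not None:
                # Seen before: hand back the previously mapped id.
                return existing, True
            # First sighting: bump the daily counter, like INCR on the day key.
            counter = self._counters.get(day, 0) + 1
            self._counters[day] = counter
            internal_id = f"{day}:{counter}"
            self._ids[natural_key] = internal_id
            return internal_id, False
```

Calling `get_or_create_id("20141028", "tw_1234566")` twice returns the same id both times, with the second call flagged as a duplicate. Because lookup and creation happen in one atomic step, two workers receiving the same document at the same moment cannot both generate an id for it.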
![Page 34: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/34.jpg)
Demo
![Page 35: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/35.jpg)
Deployment
Pre-production phase
Single server
70M keys in 10GB of RAM
In production with a simple M/S
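The figures above give a rough per-key memory budget. The slide does not say whether GB means 10^9 or 2^30 bytes, so this back-of-the-envelope sketch computes both. The result is an average that includes key, value and server overhead.

```python
KEY_COUNT = 70_000_000


def bytes_per_key(total_bytes, key_count=KEY_COUNT):
    """Average memory per key, including value and server overhead."""
    return total_bytes / key_count


decimal_avg = bytes_per_key(10 * 10**9)  # 10 GB read as decimal gigabytes
binary_avg = bytes_per_key(10 * 2**30)   # 10 GB read as gibibytes

print(f"{decimal_avg:.0f} to {binary_avg:.0f} bytes per key")
```

Either way the budget is roughly 150 bytes per mapping, which is useful when sizing `maxmemory` for a larger key population.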
![Page 36: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/36.jpg)
Alternatives
PostgreSQL
sequence(s)
table OR hstore
Hazelcast (we are Java based)
in memory
write your own persistence
![Page 37: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/37.jpg)
Q/A
![Page 38: Redis for duplicate detection on real time stream](https://reader034.vdocuments.net/reader034/viewer/2022042504/55d578aebb61eba92f8b45de/html5/thumbnails/38.jpg)
References
http://redis.io/
http://redis.io/commands
http://stackoverflow.com/questions/tagged/redis
http://try.redis.io/