datasalt - bbva case study - extracting value from credit card transactions
TRANSCRIPT
Value extraction from BBVA credit card transactions
Case Study
104,000 employees 47 million customers
The idea
Extract value from anonymized credit card transaction data & share it
Always:
✓ Impersonal
✓ Aggregated
✓ Dissociated
✓ Irreversible
Helping consumers and sellers

Consumers: informed decisions
✓ Shop recommendations (by location and by category)
✓ Best time to buy
✓ Activity & fidelity of a shop's customers

Sellers: learning client patterns
✓ Activity & fidelity of the shop's customers
✓ Sex & age & location
✓ Buying patterns

Shop stats for different periods
✓ All, year, quarter, month, week, day

… and much more
The applications
Customers
Internal use
Sellers
The challenges
Company silos
The amount of data
The costs
Security
Development flexibility/agility
Human failures
The platform
✓ S3: data storage
✓ Elastic MapReduce: data processing
✓ EC2: data serving
The architecture
Hadoop
Distributed filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed computing
✓ MapReduce
✓ Batch oriented: input files are processed and converted into output files
✓ Horizontal scalability
Easier Hadoop Java API
✓ But keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance-based configuration
✓ First-class multiple inputs/outputs

A Tuple MapReduce implementation for Hadoop
Tuple MapReduce
Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10–13, 2012.
An evolution of Google's MapReduce
Tuple MapReduce
Example: the sales difference between the top-selling offices for each location
Tuple MapReduce
Main constraint
✓ The group-by clause must be a subset of the sort-by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
Efficiency
http://pangool.net/benchmark.html
Similar efficiency to Hadoop
Voldemort
Distributed key/value store
Voldemort & Hadoop
Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
• Providing agility/flexibility: big development changes are not a pain
• Easier survival of human errors: fix the code and run again
• Easy to set up new clusters with different topologies
Basic statistics
Count, average, min, max, stdev

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics
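The per-group reduce logic can be sketched in plain Java: all five statistics come out of a single pass over the grouped values by keeping running sums. This is an illustrative stand-in for what a reducer would do per group, not Pangool's actual API; the class and method names are made up for the example.

```java
// One-pass computation of count, average, min, max and (population)
// standard deviation over the values of one group, e.g. all the
// transaction amounts of one shop. Illustrative stand-in for the
// reducer-side logic; names are not from the Pangool API.
public class BasicStats {
    public final long count;
    public final double avg, min, max, stdev;

    private BasicStats(long n, double avg, double min, double max, double stdev) {
        this.count = n; this.avg = avg; this.min = min; this.max = max; this.stdev = stdev;
    }

    /** Single pass over a group's values, using running sums only. */
    public static BasicStats of(double[] values) {
        long n = 0;
        double sum = 0, sumSq = 0;
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            n++; sum += v; sumSq += v * v;
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double avg = sum / n;
        // population stdev recovered from the two running sums
        double stdev = Math.sqrt(sumSq / n - avg * avg);
        return new BasicStats(n, avg, min, max, stdev);
    }
}
```

Because only running sums are kept, the same logic also works as a combiner-friendly aggregation: partial (count, sum, sumSq, min, max) tuples can be merged before the final reduce.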
Computing several time periods in the same job
✓ Use the mapper to replicate each datum once per period
✓ Add a period identifier field to the tuple and include it in the group-by clause
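The replication step above can be sketched as a map function that emits one tagged copy of each transaction per period. The record shape (periodId, shop, amount) and the "ALL" sentinel are assumptions for the example, not the platform's actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "replicate per period" map step: each transaction is
// emitted once per time period, tagged with a period identifier that
// becomes part of the group-by key downstream. Illustrative only.
public class PeriodReplicator {
    /** Emits one (periodId, shop, amount) record per period for a
     *  transaction dated "YYYY-MM-DD". */
    public static List<String[]> map(String shop, String date, double amount) {
        String year = date.substring(0, 4);   // e.g. "2012"
        String month = date.substring(0, 7);  // e.g. "2012-11"
        String day = date;                    // e.g. "2012-11-23"
        List<String[]> out = new ArrayList<>();
        for (String period : new String[]{"ALL", year, month, day}) {
            out.add(new String[]{period, shop, String.valueOf(amount)});
        }
        return out;
    }
}
```

Grouping by (periodId, shop) then yields the per-period statistics for every shop in the same job, at the cost of multiplying the mapper output by the number of periods.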
Distinct count
Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct-count on
✓ Detecting changes in that field
Example

Shop    Card
Shop 1  1234   <- change: +1
Shop 1  1234
Shop 1  1234
Shop 1  5678   <- change: +1
Shop 1  5678

2 distinct buyers for shop 1
✓ Group by shop, sort by shop and card
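The change-detection trick in the example reduces to a few lines once the values of one group arrive secondary-sorted: a new card is seen exactly when the current value differs from the previous one. A plain-Java sketch of the reducer-side logic:

```java
// Distinct count via change detection: within one group (one shop),
// cards arrive already sorted (secondary sort), so counting the
// positions where the value changes gives the number of distinct
// cards without holding them all in memory.
public class DistinctCount {
    /** Counts distinct values in a stream already sorted by that value. */
    public static int countSorted(String[] sortedCards) {
        int distinct = 0;
        String prev = null;
        for (String card : sortedCards) {
            if (!card.equals(prev)) { // change detected -> +1
                distinct++;
                prev = card;
            }
        }
        return distinct;
    }
}
```

Only the previous value is kept, which is what makes the single-job, constant-memory distinct count possible.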
Histograms
Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin
Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
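One way to realize a one-pass histogram with a fixed number of adapting bins is a streaming histogram in the style of Ben-Haim & Yom-Tov: keep at most `maxBins` (centroid, count) pairs and, whenever an insertion exceeds the limit, merge the two closest bins. This is a plausible sketch of the idea, not necessarily the exact variant used in the BBVA platform.

```java
import java.util.ArrayList;
import java.util.List;

// One-pass adaptive histogram: at most maxBins (centroid, count)
// pairs; when a new value would exceed the limit, the two closest
// bins are merged into their weighted centroid. Illustrative sketch.
public class AdaptiveHistogram {
    private final int maxBins;
    private final List<double[]> bins = new ArrayList<>(); // {centroid, count}

    public AdaptiveHistogram(int maxBins) { this.maxBins = maxBins; }

    public void add(double value) {
        // insert as its own bin, keeping bins sorted by centroid
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < value) i++;
        bins.add(i, new double[]{value, 1});
        if (bins.size() > maxBins) mergeClosest();
    }

    private void mergeClosest() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        double[] a = bins.get(best), b = bins.remove(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count; // weighted centroid
        a[1] = count;
    }

    public int binCount() { return bins.size(); }
}
```

The bin boundaries are never fixed up front, which is what removes the min/max first pass of the classic two-pass algorithm.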
Optimal histogram
Calculate the histogram that best represents the original one, using a limited number of flexible-width bins
✓ Reduced storage needs
✓ More representative than fixed-width bins -> better visualization
Optimal histogram
Exact algorithm: Petri Kontkanen, Petri Myllymäki, "MDL Histogram Density Estimation", http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until there is no improvement
      1. Move to the next better possible solution
✓ A solution is just a way of grouping the existing bins
✓ From a solution, you can move to some close solutions
✓ Some moves are better: they reduce the representation error
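The procedure above can be sketched concretely. Here a solution is a set of cut points grouping consecutive fine-grained bins into k coarse bins, a move shifts one cut point by one position, and the representation error is the squared deviation of each fine bin count from its coarse bin's mean. The error function and move set are illustrative choices, not necessarily the ones used in production.

```java
import java.util.Random;
import java.util.TreeSet;

// Random-restart hill climbing for an approximate optimal histogram.
// counts = fine-grained equal-width bin counts; a solution is k-1
// sorted cut points; error = squared deviation of each count from
// its coarse bin's mean. Illustrative sketch.
public class HistogramHillClimb {

    public static double error(double[] counts, int[] cuts) {
        double err = 0;
        int start = 0;
        for (int c = 0; c <= cuts.length; c++) {
            int end = (c < cuts.length) ? cuts[c] : counts.length;
            double mean = 0;
            for (int i = start; i < end; i++) mean += counts[i];
            mean /= (end - start);
            for (int i = start; i < end; i++) err += (counts[i] - mean) * (counts[i] - mean);
            start = end;
        }
        return err;
    }

    /** Best cut points found over `restarts` random restarts. */
    public static int[] solve(double[] counts, int k, int restarts, long seed) {
        Random rnd = new Random(seed);
        int[] best = null;
        double bestErr = Double.POSITIVE_INFINITY;
        for (int r = 0; r < restarts; r++) {
            int[] cuts = randomCuts(counts.length, k - 1, rnd);
            double err = error(counts, cuts);
            boolean improved = true;
            while (improved) {                 // climb until no better neighbor
                improved = false;
                for (int i = 0; i < cuts.length; i++) {
                    for (int d : new int[]{-1, 1}) {
                        int old = cuts[i], moved = old + d;
                        int lo = (i == 0) ? 1 : cuts[i - 1] + 1;
                        int hi = (i == cuts.length - 1) ? counts.length - 1 : cuts[i + 1] - 1;
                        if (moved < lo || moved > hi) continue;
                        cuts[i] = moved;
                        double e = error(counts, cuts);
                        if (e < err) { err = e; improved = true; }
                        else cuts[i] = old;    // undo a non-improving move
                    }
                }
            }
            if (err < bestErr) { bestErr = err; best = cuts.clone(); }
        }
        return best;
    }

    private static int[] randomCuts(int n, int numCuts, Random rnd) {
        // distinct sorted cut positions in [1, n-1]
        TreeSet<Integer> used = new TreeSet<>();
        while (used.size() < numCuts) used.add(1 + rnd.nextInt(n - 1));
        int[] cuts = new int[numCuts];
        int i = 0;
        for (int c : used) cuts[i++] = c;
        return cuts;
    }
}
```

The restarts are what rescue the climb from local optima; each restart is cheap because the neighborhood (shift one cut by one) is small.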
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy
Everything in one job
Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job
We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations
Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted, even if the buyer bought several times in A and B
✓ The top co-occurrences of each shop are its recommendations

Improvements
✓ The most popular shops are filtered out, because almost everybody buys in them
✓ Recommendations by category, by location, and by both
✓ Different calculation periods
Shop recommendations
Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
• Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops considered
✓ Only the top M shops where the client bought the most are used
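The pair-generation step with the top-M cap can be sketched per buyer as follows. The record shape and names are illustrative, not the actual Pangool job's schema:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Co-occurrence pair generation for one card holder: from the distinct
// shops where the person bought, capped to the top M by purchase count
// (to keep the N * (N - 1) pair explosion under control), emit one
// ordered pair per shop combination. Illustrative sketch.
public class CoOccurrences {
    /** shopCounts: distinct shop -> number of purchases by this buyer. */
    public static List<String[]> pairsForBuyer(Map<String, Integer> shopCounts, int maxShops) {
        // keep only the top M shops where the client bought the most
        List<String> shops = new ArrayList<>(shopCounts.keySet());
        shops.sort((a, b) -> shopCounts.get(b) - shopCounts.get(a));
        if (shops.size() > maxShops) shops = shops.subList(0, maxShops);

        List<String[]> pairs = new ArrayList<>();
        for (String a : shops)
            for (String b : shops)
                if (!a.equals(b)) pairs.add(new String[]{a, b}); // N * (N - 1) ordered pairs
        return pairs;
    }
}
```

Counting identical pairs across all buyers and keeping the top pairs per shop then yields the recommendations; the cap bounds each buyer's contribution to at most M * (M - 1) pairs.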
Future
✓ Time-aware co-occurrences: the client bought in A and B within a short period of time
Some numbers
Estimated resources needed with 1 year of data
✓ 270 GB of stats to serve
✓ 24 large instances, ~11 hours of execution
✓ ~$3,500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster
Conclusion
It was possible to develop a Big Data solution for a bank
✓ With a low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile: improvements are easy to implement
✓ Prepared to withstand human failures
✓ At a reasonable cost

Main advantage: always recomputing everything from scratch
Future: Splout
Key/value datastores have limitations
✓ They only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
✓ That is not always possible -> the data explodes
✓ For this particular case, time ranges are fixed

Splout: like Voldemort, but SQL!
✓ The idea: replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow building a kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced!
https://github.com/datasalt/splout-db