datasalt - bbva case study - extracting value from credit card transactions
TRANSCRIPT
Value extraction from BBVA credit card transactions
Case Study
104,000 employees 47 million customers
The idea
Extract value from anonymized credit card transaction data & share it
Always:
✓ Impersonal
✓ Aggregated
✓ Dissociated
✓ Irreversible
Helping consumers and sellers

Consumers: informed decisions
✓ Shop recommendations (by location and by category)
✓ Best time to buy
✓ Activity & fidelity of a shop's customers

Sellers: learning client patterns
✓ Activity & fidelity of the shop's customers
✓ Sex & age & location
✓ Buying patterns

Shop stats for different periods
✓ All, year, quarter, month, week, day

… and much more
The applications
Customers
Internal use
Sellers
The challenges
Company silos
The amount of data
The costs
Security
Development flexibility/agility
Human failures
The platform
✓ S3: data storage
✓ Elastic MapReduce: data processing
✓ EC2: data serving
The architecture
Hadoop
Distributed filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed computing
✓ MapReduce
✓ Batch oriented: input files are processed and converted into output files
✓ Horizontal scalability
Easier Hadoop Java API
✓ But keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance-based configuration
✓ First-class multiple inputs/outputs

A Tuple MapReduce implementation for Hadoop
Tuple MapReduce
Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10–13, 2012.
An evolution of Google's MapReduce
Tuple MapReduce
Example: the sales difference between the top-selling offices for each location
Tuple MapReduce
Main constraint
✓ The group-by clause must be a subset of the sort-by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
Efficiency
http://pangool.net/benchmark.html
Similar efficiency to Hadoop
Voldemort
Distributed key/value store
Voldemort & Hadoop
Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
• Providing agility/flexibility: big development changes are not a pain
• Easier survival of human errors: fix the code and run again
• Easy to set up new clusters with different topologies
Basic statistics
Count, average, min, max, stdev

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics
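The per-group reduce logic can be sketched in plain Java: all five statistics come out of a single pass over the grouped values by keeping running sums. This is an illustrative stand-in for what a reducer would do per group, not Pangool's actual API; the class and method names are made up for the example.

```java
// One-pass computation of count, average, min, max and (population)
// standard deviation over the values of one group, e.g. all the
// transaction amounts of one shop. Illustrative stand-in for the
// reducer-side logic; names are not from the Pangool API.
public class BasicStats {
    public final long count;
    public final double avg, min, max, stdev;

    private BasicStats(long n, double avg, double min, double max, double stdev) {
        this.count = n; this.avg = avg; this.min = min; this.max = max; this.stdev = stdev;
    }

    /** Single pass over a group's values, using running sums only. */
    public static BasicStats of(double[] values) {
        long n = 0;
        double sum = 0, sumSq = 0;
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            n++; sum += v; sumSq += v * v;
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double avg = sum / n;
        // population stdev recovered from the two running sums
        double stdev = Math.sqrt(sumSq / n - avg * avg);
        return new BasicStats(n, avg, min, max, stdev);
    }
}
```

Because only running sums are kept, the same logic also works as a combiner-friendly aggregation: partial (count, sum, sumSq, min, max) tuples can be merged before the final reduce.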
Computing several time periods in the same job
✓ Use the mapper to replicate each datum once per period
✓ Add a period identifier field to the tuple and include it in the group-by clause
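The replication step above can be sketched as a map function that emits one tagged copy of each transaction per period. The record shape (periodId, shop, amount) and the "ALL" sentinel are assumptions for the example, not the platform's actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "replicate per period" map step: each transaction is
// emitted once per time period, tagged with a period identifier that
// becomes part of the group-by key downstream. Illustrative only.
public class PeriodReplicator {
    /** Emits one (periodId, shop, amount) record per period for a
     *  transaction dated "YYYY-MM-DD". */
    public static List<String[]> map(String shop, String date, double amount) {
        String year = date.substring(0, 4);   // e.g. "2012"
        String month = date.substring(0, 7);  // e.g. "2012-11"
        String day = date;                    // e.g. "2012-11-23"
        List<String[]> out = new ArrayList<>();
        for (String period : new String[]{"ALL", year, month, day}) {
            out.add(new String[]{period, shop, String.valueOf(amount)});
        }
        return out;
    }
}
```

Grouping by (periodId, shop) then yields the per-period statistics for every shop in the same job, at the cost of multiplying the mapper output by the number of periods.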
Distinct count
Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct-count on
✓ Detecting changes in that field
Example

Shop    Card
Shop 1  1234   <- change: +1
Shop 1  1234
Shop 1  1234
Shop 1  5678   <- change: +1
Shop 1  5678

2 distinct buyers for shop 1
✓ Group by shop, sort by shop and card
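The change-detection trick in the example reduces to a few lines once the values of one group arrive secondary-sorted: a new card is seen exactly when the current value differs from the previous one. A plain-Java sketch of the reducer-side logic:

```java
// Distinct count via change detection: within one group (one shop),
// cards arrive already sorted (secondary sort), so counting the
// positions where the value changes gives the number of distinct
// cards without holding them all in memory.
public class DistinctCount {
    /** Counts distinct values in a stream already sorted by that value. */
    public static int countSorted(String[] sortedCards) {
        int distinct = 0;
        String prev = null;
        for (String card : sortedCards) {
            if (!card.equals(prev)) { // change detected -> +1
                distinct++;
                prev = card;
            }
        }
        return distinct;
    }
}
```

Only the previous value is kept, which is what makes the single-job, constant-memory distinct count possible.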
Histograms
Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin
Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
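One way to realize a one-pass histogram with a fixed number of adapting bins is a streaming histogram in the style of Ben-Haim & Yom-Tov: keep at most `maxBins` (centroid, count) pairs and, whenever an insertion exceeds the limit, merge the two closest bins. This is a plausible sketch of the idea, not necessarily the exact variant used in the BBVA platform.

```java
import java.util.ArrayList;
import java.util.List;

// One-pass adaptive histogram: at most maxBins (centroid, count)
// pairs; when a new value would exceed the limit, the two closest
// bins are merged into their weighted centroid. Illustrative sketch.
public class AdaptiveHistogram {
    private final int maxBins;
    private final List<double[]> bins = new ArrayList<>(); // {centroid, count}

    public AdaptiveHistogram(int maxBins) { this.maxBins = maxBins; }

    public void add(double value) {
        // insert as its own bin, keeping bins sorted by centroid
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < value) i++;
        bins.add(i, new double[]{value, 1});
        if (bins.size() > maxBins) mergeClosest();
    }

    private void mergeClosest() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        double[] a = bins.get(best), b = bins.remove(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count; // weighted centroid
        a[1] = count;
    }

    public int binCount() { return bins.size(); }
}
```

The bin boundaries are never fixed up front, which is what removes the min/max first pass of the classic two-pass algorithm.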
Optimal histogram
Calculate the histogram that best represents the original one, using a limited number of flexible-width bins
✓ Reduced storage needs
✓ More representative than fixed-width bins -> better visualization
Optimal histogram
Exact algorithm: Petri Kontkanen, Petri Myllymäki, "MDL Histogram Density Estimation", http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until there is no improvement
      1. Move to the next better possible solution
✓ A solution is just a way of grouping the existing bins
✓ From a solution, you can move to some close solutions
✓ Some moves are better: they reduce the representation error
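The procedure above can be sketched concretely. Here a solution is a set of cut points grouping consecutive fine-grained bins into k coarse bins, a move shifts one cut point by one position, and the representation error is the squared deviation of each fine bin count from its coarse bin's mean. The error function and move set are illustrative choices, not necessarily the ones used in production.

```java
import java.util.Random;
import java.util.TreeSet;

// Random-restart hill climbing for an approximate optimal histogram.
// counts = fine-grained equal-width bin counts; a solution is k-1
// sorted cut points; error = squared deviation of each count from
// its coarse bin's mean. Illustrative sketch.
public class HistogramHillClimb {

    public static double error(double[] counts, int[] cuts) {
        double err = 0;
        int start = 0;
        for (int c = 0; c <= cuts.length; c++) {
            int end = (c < cuts.length) ? cuts[c] : counts.length;
            double mean = 0;
            for (int i = start; i < end; i++) mean += counts[i];
            mean /= (end - start);
            for (int i = start; i < end; i++) err += (counts[i] - mean) * (counts[i] - mean);
            start = end;
        }
        return err;
    }

    /** Best cut points found over `restarts` random restarts. */
    public static int[] solve(double[] counts, int k, int restarts, long seed) {
        Random rnd = new Random(seed);
        int[] best = null;
        double bestErr = Double.POSITIVE_INFINITY;
        for (int r = 0; r < restarts; r++) {
            int[] cuts = randomCuts(counts.length, k - 1, rnd);
            double err = error(counts, cuts);
            boolean improved = true;
            while (improved) {                 // climb until no better neighbor
                improved = false;
                for (int i = 0; i < cuts.length; i++) {
                    for (int d : new int[]{-1, 1}) {
                        int old = cuts[i], moved = old + d;
                        int lo = (i == 0) ? 1 : cuts[i - 1] + 1;
                        int hi = (i == cuts.length - 1) ? counts.length - 1 : cuts[i + 1] - 1;
                        if (moved < lo || moved > hi) continue;
                        cuts[i] = moved;
                        double e = error(counts, cuts);
                        if (e < err) { err = e; improved = true; }
                        else cuts[i] = old;    // undo a non-improving move
                    }
                }
            }
            if (err < bestErr) { bestErr = err; best = cuts.clone(); }
        }
        return best;
    }

    private static int[] randomCuts(int n, int numCuts, Random rnd) {
        // distinct sorted cut positions in [1, n-1]
        TreeSet<Integer> used = new TreeSet<>();
        while (used.size() < numCuts) used.add(1 + rnd.nextInt(n - 1));
        int[] cuts = new int[numCuts];
        int i = 0;
        for (int c : used) cuts[i++] = c;
        return cuts;
    }
}
```

The restarts are what rescue the climb from local optima; each restart is cheap because the neighborhood (shift one cut by one) is small.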
Optimal histogram
Alternative: approximate algorithm
Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy
Everything in one job
Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job
We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations
Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted, even if the buyer bought several times in A and B
✓ The top co-occurrences of each shop are its recommendations

Improvements
✓ The most popular shops are filtered out, because almost everybody buys in them
✓ Recommendations by category, by location, and by both
✓ Different calculation periods
Shop recommendations
Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
• Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops considered
✓ Only the top M shops where the client bought the most are used
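The pair-generation step with the top-M cap can be sketched per buyer as follows. The record shape and names are illustrative, not the actual Pangool job's schema:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Co-occurrence pair generation for one card holder: from the distinct
// shops where the person bought, capped to the top M by purchase count
// (to keep the N * (N - 1) pair explosion under control), emit one
// ordered pair per shop combination. Illustrative sketch.
public class CoOccurrences {
    /** shopCounts: distinct shop -> number of purchases by this buyer. */
    public static List<String[]> pairsForBuyer(Map<String, Integer> shopCounts, int maxShops) {
        // keep only the top M shops where the client bought the most
        List<String> shops = new ArrayList<>(shopCounts.keySet());
        shops.sort((a, b) -> shopCounts.get(b) - shopCounts.get(a));
        if (shops.size() > maxShops) shops = shops.subList(0, maxShops);

        List<String[]> pairs = new ArrayList<>();
        for (String a : shops)
            for (String b : shops)
                if (!a.equals(b)) pairs.add(new String[]{a, b}); // N * (N - 1) ordered pairs
        return pairs;
    }
}
```

Counting identical pairs across all buyers and keeping the top pairs per shop then yields the recommendations; the cap bounds each buyer's contribution to at most M * (M - 1) pairs.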
Future
✓ Time-aware co-occurrences: the client bought in A and B within a short period of time
Some numbers
Estimated resources needed with 1 year of data
✓ 270 GB of stats to serve
✓ 24 large instances, ~11 hours of execution
✓ ~$3,500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster
Conclusion
It was possible to develop a Big Data solution for a bank
✓ With a low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile: improvements are easy to implement
✓ Prepared to withstand human failures
✓ At a reasonable cost

Main advantage: always recomputing everything from scratch
Future: Splout
Key/value datastores have limitations
✓ They only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
✓ That is not always possible -> the data explodes
✓ For this particular case, time ranges are fixed

Splout: like Voldemort, but SQL!
✓ The idea: replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow building a kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced!
https://github.com/datasalt/splout-db