datasalt - bbva case study - extracting value from credit card transactions

Value extraction from BBVA credit card transactions Case Study

Upload: datasalt

Post on 23-Jun-2015

TRANSCRIPT

Page 1: Datasalt - BBVA case study - extracting value from credit card transactions

Value extraction from BBVA credit card transactions

Case Study

Page 2: Datasalt - BBVA case study - extracting value from credit card transactions

104,000 employees
47 million customers

Page 3: Datasalt - BBVA case study - extracting value from credit card transactions

The idea

Extract value from anonymized credit card transaction data and share it

Always:
✓ Impersonal
✓ Aggregated
✓ Dissociated
✓ Irreversible

Page 4: Datasalt - BBVA case study - extracting value from credit card transactions

Helping

Consumers: informed decisions
✓ Shop recommendations (by location and by category)
✓ Best time to buy
✓ Activity & fidelity of shop's customers

Sellers: learning clients' patterns
✓ Activity & fidelity of shop's customers
✓ Sex & age & location
✓ Buying patterns

Page 5: Datasalt - BBVA case study - extracting value from credit card transactions

Shop stats

For different periods
✓ All, year, quarter, month, week, day

… and much more

Page 6: Datasalt - BBVA case study - extracting value from credit card transactions

The applications

Customers

Internal use

Sellers

Page 7: Datasalt - BBVA case study - extracting value from credit card transactions

The challenges

Company silos

The amount of data

The costs

Security

Development flexibility/agility

Human failures

Page 8: Datasalt - BBVA case study - extracting value from credit card transactions

The platform

S3: data storage
Elastic MapReduce: data processing
EC2: data serving

Page 9: Datasalt - BBVA case study - extracting value from credit card transactions

The architecture

Page 10: Datasalt - BBVA case study - extracting value from credit card transactions

Hadoop

Distributed Filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed Computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability

Page 11: Datasalt - BBVA case study - extracting value from credit card transactions

Easier Hadoop Java API
✓ But keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance-based configuration
✓ First-class multiple input/output

Tuple MapReduce implementation for Hadoop

Page 12: Datasalt - BBVA case study - extracting value from credit card transactions

Tuple MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012.

Our evolution of Google's MapReduce

Page 13: Datasalt - BBVA case study - extracting value from credit card transactions

Tuple MapReduce

Sales difference between the best-selling offices in each location

Page 14: Datasalt - BBVA case study - extracting value from credit card transactions

Tuple MapReduce

Main constraint

✓ Group by clause must be a subset of sort by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation

• Pangool -> Tuple MapReduce over Hadoop
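The group-by/sort-by constraint can be illustrated with a small in-memory simulation. This is plain Python, not the Pangool API: `tuple_map_reduce`, `top_two_difference` and the sample sales data are hypothetical names invented for the illustration. The sketch uses the stricter "prefix" form of the constraint (group-by fields must be the leading sort-by fields), which is an assumption made to keep the simulation simple.

```python
from itertools import groupby
from operator import itemgetter


def tuple_map_reduce(tuples, group_by, sort_by, reducer):
    """Simulate one Tuple MapReduce step in memory.

    sort_by is a list of (field, descending) pairs. group_by must be a
    prefix of the sort-by fields, so each group arrives contiguous and
    already ordered by the remaining sort fields.
    """
    sort_fields = [field for field, _ in sort_by]
    if sort_fields[:len(group_by)] != list(group_by):
        raise ValueError("group by must be a prefix of sort by")
    ordered = list(tuples)
    # Stable sorts applied from the last key to the first give a
    # multi-key ordering with a direction per field.
    for field, descending in reversed(sort_by):
        ordered.sort(key=itemgetter(field), reverse=descending)
    key = itemgetter(*group_by)
    return [reducer(k, list(g)) for k, g in groupby(ordered, key=key)]


def top_two_difference(location, rows):
    # Rows arrive sorted by sales descending, so the first two are the top sellers.
    best = rows[0]["sales"]
    second = rows[1]["sales"] if len(rows) > 1 else 0
    return (location, best - second)


sales = [
    {"location": "Madrid", "office": "A", "sales": 120},
    {"location": "Madrid", "office": "B", "sales": 200},
    {"location": "Bilbao", "office": "C", "sales": 90},
    {"location": "Madrid", "office": "D", "sales": 150},
    {"location": "Bilbao", "office": "E", "sales": 60},
]

result = tuple_map_reduce(
    sales,
    group_by=["location"],
    sort_by=[("location", False), ("sales", True)],
    reducer=top_two_difference,
)
print(result)
```

Because the reducer receives each group already sorted, it only needs to look at the first two rows instead of buffering and sorting the group itself.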

Page 15: Datasalt - BBVA case study - extracting value from credit card transactions

Efficiency

http://pangool.net/benchmark.html

Similar efficiency to Hadoop

Page 16: Datasalt - BBVA case study - extracting value from credit card transactions

Voldemort

Distributed key/value store

Page 17: Datasalt - BBVA case study - extracting value from credit card transactions

Voldemort & Hadoop

Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Providing agility/flexibility
    § Big development changes are not a pain
  • Easier recovery from human errors
    § Fix code and run again
  • Easy to set up new clusters with different topologies

Page 18: Datasalt - BBVA case study - extracting value from credit card transactions

Basic statistics

Count, Average, Min, Max, Stdev

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics.

Computing several time periods in the same job
✓ Use the mapper for replicating each datum for each period
✓ Add a period identifier field in the tuple and include it in the group by clause
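The period-replication trick can be sketched as follows. This is a minimal in-memory illustration in plain Python, not Pangool code: the function names, the period labels and the sample transactions are assumptions, and a dictionary stands in for the shuffle phase.

```python
import math
from collections import defaultdict
from datetime import date


def map_periods(shop, day, amount):
    """Mapper: emit one copy of the transaction per time period."""
    iso_year, iso_week, _ = day.isocalendar()
    yield (shop, "all"), amount
    yield (shop, f"year:{day.year}"), amount
    yield (shop, f"month:{day.year}-{day.month:02d}"), amount
    yield (shop, f"week:{iso_year}-W{iso_week:02d}"), amount


def reduce_stats(amounts):
    """Reducer: count, average, min, max and population standard deviation."""
    n = len(amounts)
    mean = sum(amounts) / n
    variance = sum((a - mean) ** 2 for a in amounts) / n
    return {
        "count": n,
        "avg": mean,
        "min": min(amounts),
        "max": max(amounts),
        "stdev": math.sqrt(variance),
    }


def run_job(transactions):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for shop, day, amount in transactions:
        for key, value in map_periods(shop, day, amount):
            groups[key].append(value)  # group by (shop, period identifier)
    return {key: reduce_stats(values) for key, values in groups.items()}


transactions = [
    ("shop1", date(2012, 1, 10), 10.0),
    ("shop1", date(2012, 1, 20), 30.0),
    ("shop1", date(2012, 2, 5), 20.0),
]
stats = run_job(transactions)
print(stats[("shop1", "month:2012-01")])
```

Since the period identifier is part of the grouping key, a single pass produces the statistics for every period at once, at the cost of emitting each record several times.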

Page 19: Datasalt - BBVA case study - extracting value from credit card transactions

Distinct count

Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct count on
✓ Detecting changes on that field

Example

Shop     Card
Shop 1   1234   <- change, +1
Shop 1   1234
Shop 1   1234
Shop 1   5678   <- change, +1
Shop 1   5678

2 distinct buyers for shop 1

✓ Group by shop, sort by shop and card
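The change-detection idea can be sketched in a few lines. This is plain Python rather than Pangool code, with `sorted` and `groupby` standing in for the framework's sort and group phases; the function name is hypothetical.

```python
from itertools import groupby


def distinct_buyers(rows):
    """rows: iterable of (shop, card) pairs. Returns {shop: distinct cards}."""
    ordered = sorted(rows)  # sort by shop, then by card (the secondary sort)
    result = {}
    for shop, group in groupby(ordered, key=lambda row: row[0]):  # group by shop
        count = 0
        previous = object()  # sentinel that never equals a real card
        for _, card in group:
            if card != previous:  # the card changed: one more distinct buyer
                count += 1
                previous = card
        result[shop] = count
    return result


rows = [
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
    ("Shop 1", "1234"),
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
]
counts = distinct_buyers(rows)
print(counts)
```

Because equal cards are adjacent after sorting, the reducer only needs to remember the previous card, so it never holds a set of all cards in memory.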

Page 20: Datasalt - BBVA case study - extracting value from credit card transactions

Histograms

Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin

Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
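The slides do not say how the bins adapt. One common way to get a one-pass histogram with a fixed number of bins is to store each bin as a centroid with a count, and to merge the two closest bins whenever the limit is exceeded. The sketch below follows that approach as an assumption; it is not necessarily the method used in this project, and the class name is hypothetical.

```python
import bisect


class AdaptiveHistogram:
    """One-pass histogram that never holds more than max_bins bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def add(self, value):
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1  # exact hit on an existing centroid
            return
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair of centroids with the smallest gap.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        # Replace both bins with their count-weighted average.
        merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
        self.bins[j:j + 2] = [merged]


hist = AdaptiveHistogram(max_bins=3)
for amount in [1, 2, 10, 11, 50]:
    hist.add(amount)
print(hist.bins)
```

The minimum and maximum do not need to be known in advance, which is what removes the first pass of the classic algorithm.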

Page 21: Datasalt - BBVA case study - extracting value from credit card transactions

Optimal histogram

Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs
✓ More representative than fixed-width ones -> better visualization

Page 22: Datasalt - BBVA case study - extracting value from credit card transactions

Optimal histogram

Exact algorithm
Petri Kontkanen, Petri Myllymäki: MDL Histogram Density Estimation. http://eprints.pascal-network.org/archive/00002983/

Too slow for production use

Page 23: Datasalt - BBVA case study - extracting value from credit card transactions

Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing

Algorithm
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until no improvement
      1. Move to the next better possible movement

✓ A solution is just a way of grouping existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error
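The random-restart hill climbing loop can be sketched as below. Several details are assumptions, since the slides leave them open: a solution is encoded as a tuple of cut positions between the original bins, a "close solution" is one where a single cut moves by one position, and the representation error is the squared difference between each original bin count and the average of its group. All function names are hypothetical.

```python
import random


def error(counts, cuts):
    """Squared error of replacing each group of bins by its average height."""
    total = 0.0
    edges = [0, *cuts, len(counts)]
    for start, end in zip(edges, edges[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        total += sum((c - mean) ** 2 for c in group)
    return total


def neighbours(cuts, n):
    """Solutions reachable by shifting a single cut by one position."""
    for i, cut in enumerate(cuts):
        low = cuts[i - 1] if i > 0 else 0
        high = cuts[i + 1] if i + 1 < len(cuts) else n
        for moved in (cut - 1, cut + 1):
            if low < moved < high:  # keep every group non-empty
                yield cuts[:i] + (moved,) + cuts[i + 1:]


def hill_climb(counts, groups, restarts, seed=0):
    rng = random.Random(seed)
    n = len(counts)
    best = None
    for _ in range(restarts):  # iterate N times, keeping the best solution
        current = tuple(sorted(rng.sample(range(1, n), groups - 1)))
        improved = True
        while improved:  # iterate until no improvement
            improved = False
            for candidate in neighbours(current, n):
                if error(counts, candidate) < error(counts, current):
                    current, improved = candidate, True
                    break  # move to the first better neighbour found
        if best is None or error(counts, current) < error(counts, best):
            best = current
    return best


counts = [5, 5, 5, 20, 20, 1, 1, 1]
best = hill_climb(counts, groups=3, restarts=20)
print(best, error(counts, best))
```

On this toy input the grouping with cuts `(3, 5)` has zero error, but a single climb can get stuck in a local optimum such as `(5, 7)`, which is why the restarts matter.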

Page 24: Datasalt - BBVA case study - extracting value from credit card transactions

Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy

Page 25: Datasalt - BBVA case study - extracting value from credit card transactions

Everything in one job

Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job

Page 26: Datasalt - BBVA case study - extracting value from credit card transactions

Shop recommendations

Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
✓ The top co-occurrences for each shop are the recommendations

Improvements
✓ The most popular shops are filtered out because almost everybody buys in them
✓ Recommendations by category, by location and by both
✓ Different calculation periods
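The co-occurrence logic can be sketched in memory as follows. This is plain Python, not the Pangool jobs used in the project; the function name, the `popular_share` threshold used to drop very popular shops and the sample data are assumptions made for the illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def recommend(purchases, top_n=2, popular_share=0.9):
    """purchases: iterable of (buyer, shop) pairs, repeats allowed."""
    shops_by_buyer = defaultdict(set)  # a set keeps one entry per buyer and shop
    for buyer, shop in purchases:
        shops_by_buyer[buyer].add(shop)

    # Drop shops visited by more than popular_share of all buyers.
    buyers_per_shop = Counter(s for shops in shops_by_buyer.values() for s in shops)
    limit = popular_share * len(shops_by_buyer)
    too_popular = {s for s, n in buyers_per_shop.items() if n > limit}

    # One co-occurrence per buyer for each ordered pair of shops.
    cooccurrences = defaultdict(Counter)
    for shops in shops_by_buyer.values():
        for a, b in permutations(sorted(shops - too_popular), 2):
            cooccurrences[a][b] += 1

    return {shop: [other for other, _ in counts.most_common(top_n)]
            for shop, counts in cooccurrences.items()}


purchases = [
    ("ana", "bakery"), ("ana", "bakery"), ("ana", "bookshop"), ("ana", "supermarket"),
    ("luis", "bakery"), ("luis", "bookshop"), ("luis", "supermarket"),
    ("eva", "bakery"), ("eva", "florist"), ("eva", "supermarket"),
    ("juan", "florist"), ("juan", "supermarket"),
]
recommendations = recommend(purchases)
print(recommendations)
```

Here every buyer visits the supermarket, so it is filtered out and never shows up as a recommendation.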

Page 27: Datasalt - BBVA case study - extracting value from credit card transactions

Shop recommendations

Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops to consider
✓ Only uses the top M shops where the client bought the most

Future
✓ Time-aware co-occurrences: the client bought in A and B, and did so within a short period of time.
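The size of the problem and the effect of the cap can be checked with a short sketch. The function names are hypothetical, and "bought the most" is interpreted here as the number of purchases, which is an assumption (it could equally mean the amount spent).

```python
from collections import Counter


def pair_count(n):
    """Ordered co-occurrence pairs produced by a buyer with n distinct shops."""
    return n * (n - 1)


def top_shops(purchases, m):
    """Keep only the m shops where one buyer bought the most times.

    purchases: list of shop names, one entry per transaction of that buyer.
    """
    return [shop for shop, _ in Counter(purchases).most_common(m)]


print(pair_count(1000))  # pairs for a buyer seen in 1000 shops
print(pair_count(20))    # pairs after capping the same buyer at M = 20
print(top_shops(["a", "b", "a", "c", "a", "b"], 2))
```

The number of pairs grows quadratically with N, so capping a heavy buyer at a small M removes most of the output without touching typical buyers.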

Page 28: Datasalt - BBVA case study - extracting value from credit card transactions

Some numbers

Estimated resources needed with 1 year of data

270 GB of stats to serve

24 large instances, ~11 hours of execution

$3,500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster

Page 29: Datasalt - BBVA case study - extracting value from credit card transactions

Conclusion

It was possible to develop a Big Data solution for a bank
✓ With low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile. Improvements are easy to implement
✓ Prepared to withstand human failures
✓ At a reasonable cost

Main advantage: always computing everything

Page 30: Datasalt - BBVA case study - extracting value from credit card transactions

Future: Splout

Key/value datastores have limitations
✓ Only accept querying by the key
✓ Aggregations not possible
✓ In other words, we are forced to pre-compute everything
✓ Not always possible -> data explodes
✓ For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
✓ The idea: to replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow creating some kind of Google Analytics for the statistics discussed in this presentation
✓ Open sourced!!!

https://github.com/datasalt/splout-db
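The difference between pre-computed key/value lookups and query-time aggregation can be illustrated with a small SQL example. Python's built-in `sqlite3` module is used here only as a stand-in for a SQL-capable serving store; this is not the Splout SQL API, and the `daily_stats` table, its columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_stats (shop TEXT, day TEXT, tx_count INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO daily_stats VALUES (?, ?, ?, ?)",
    [
        ("shop1", "2012-01-10", 3, 60.0),
        ("shop1", "2012-01-20", 1, 30.0),
        ("shop1", "2012-02-05", 2, 50.0),
        ("shop2", "2012-01-15", 5, 100.0),
    ],
)


def shop_totals(shop, start, end):
    """Aggregate at query time over an arbitrary date range."""
    return conn.execute(
        "SELECT SUM(tx_count), SUM(amount) FROM daily_stats "
        "WHERE shop = ? AND day BETWEEN ? AND ?",
        (shop, start, end),
    ).fetchone()


totals = shop_totals("shop1", "2012-01-15", "2012-02-28")
print(totals)
```

With a key/value store, a range such as 15 January to 28 February would need its own pre-computed entry; with SQL, only the daily rows are stored and any range is summed on demand.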