enterprise data problems in investment banks “bigdata ... · pdf filenathan marz...

25
SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS CHARLES CAI ASHWANI ROY 8 March 2013 1 Enterprise Data Problems in Investment Banks “BigData” History and Trend – Driven by Google CAP Theorem for Distributed Computer System Open Source Building Blocks: Hadoop, Solr, Storm.. Hypothetical Solution using Lambda Architecture Where “BigData” Industry is Going? 3548

Upload: phungngoc

Post on 17-Mar-2018

222 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS CHARLES CAI ASHWANI ROY

8 March 2013

1

Enterprise Data Problems in Investment Banks

“BigData” History and Trend – Driven by Google

CAP Theorem for Distributed Computer System

Open Source Building Blocks: Hadoop, Solr, Storm..

Hypothetical Solution using Lambda Architecture

Where “BigData” Industry is Going? 3548

Page 2: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Presenter: Charles Cai

¨  Charles Cai makes a living by designing and implementing trading and risk systems for investment banks.

¨  Currently a Chief Front Office Technical Architect in a global energy trading firm.

¤ Twitter: @caidong ¤ Linkedin: charlescai

Page 3: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Presenter: Ashwani Roy

¨  Ashwani Roy – Masters in Finance Student at London Business School and VP at a Tier 1 Investment Bank.

¨  Love to mix programming and Applied Mathematics to solve difficult problems in Investment Banking

¤ Twitter: @Ashwani_Roy ¤ Linkedin: ashwaniroy

Page 4: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

3548

Page 5: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Why Finance Industry should care ?

¨  We care because of ¤ Compliance requirements

¤ Risk Management

¤ Pricing

¤ Rise of Machines (Ecommerce)

¤ Cost Cutting

¤ BTW: Twitter is also part of Market Data

Page 6: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Sample Interest Model / Simulations

Page 7: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

A quick Monte Carlo Demo

¨  Demo – Computing this is functional Some Terminology

¨  PV = present value = Cash flows discounted to current time

¨  Delta = change in price / change in interest rate ¨  Gamma .. Vega .. Rho .. Theta .. Vanna …. And

other Greeks

Page 8: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Monte Carlo Simulations -Results

¨  <results> = func<I,j,k…… ¨  Parallelize computation with mappers ¨  Save results and run reducers ¨  [[ trade: 1 curveid: Orig PV:100 Delta:200]{ to OLAP}

..[ trade: 1 curveid: Sim1 PV:100 Delta:200] {big data}

..[ trade: 1 curveid: Sim2 PV:99 Delta:220]{big data} ] ]

Page 9: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Compliance

¤  Dodd-Frank requires >= five years records ¤  Fast Disaster recovery requirements (Tapes backup not acceptable) ¤  All Bloomberg and other chats to be saves in quick reportable form ¤  … Many more in Basel 3 and Dodd Frank Act

You need to #  get chats for [email protected] and [email protected] #  from the 5 years Bloomberg and Reuters log of a global investment bank

of 1TB(assume 1MB/Day/Trader * 220 trading days * 1000 traders* 5 years)

#  for all EURUSD swaps only ….. Additional filters and aggregation requirements

Page 10: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Big Data Industry History: Google’s Papers

10

Page 11: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Google’s Big Data Papers: 2003 – 2006

11

GFS – Google File System

• 2003 • Distributed file

system • 3 x copies • Commodity

machines • Colossus (2012)

MapReduce

• 2004 •  Input à Map à

Partition à Compare à Shuffle à Sort à Reduce à Output

BigTable

• 2006 • Distributed Key-

Value column-family based database

Page 12: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Hadoop Distributed File System (HDFS)

¨  http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png

Page 13: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Google’s MapReduce Programming Model

13

Page 14: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Apache Hbase: Column Family Distributed K-V Store

Page 15: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Google’s Big Data Papers 2: 2010 - now

15

Percolator

• 2010 • Incremental update/compute

•  built on BigTable • Adds transactions, locks, notifications

• SPFs: “Stream Processing Frameworks” + underlying database

Dremel

• 2010 • Online analytics and visualization

• SQL like language for structured data

• Each row is JSON object – in protobuf format

• Column based • Spanner (2012), BigQuery, F1

Pregel

• 2010 • Scalable graph computing • Worker threads à nodes à parallel “superstep” à messages à nodes à Aggregator/Combiners (global statistics)

• PageRank, shortest path, bipartite matching

Impala

Tez/Stinger

Microsoft Trinity

Page 16: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Unstructured Data: Index/Search Engine

¨  Github Code Search: 17 TB

Page 17: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Apache Lucene/SOLR

¨  Open Source Indexing and Search Engine

¨  4,000+ Enterprise users ¤  IBM, HP, Cisco

¤  Apple, Linkedin

¤  Wikipedia

¤  CNet, Sky

¤  Twitter

Page 18: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

What’s Next for Hadoop? Real-time!

Nathan Marz

Page 19: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Some more use cases

¤ Save money to save your jobs

¤ Save money to your firm can do more

¤ E Commerce is norm…

¤ Market sentiment analysis cannot be relied on using

“Bloomberg's sentiment analysis” only

¤  .. Add some more

Page 20: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

“Lambda Architecture” – Nathan Marz, BackType/Twitter

¨  query = func (data, ...)

20

•  Real-time ticks, events… •  Historical (all history data

points) •  Curated/cleansed curves… •  Derived curves… •  Back-testing models… •  …

•  Technical analysis… •  Alerts… •  Join across data sources (e.g.

correlation among weather / energy) •  Curating/cleanse curves… •  Derive curves, building models… •  Back-testing models… •  Visualization of the above! •  …

•  Excel / VBA •  Java, C#/F#... •  MatLab •  3rd party ETL Tools •  R •  …

Page 21: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Batch  Layer  (Hadoop) Servicing  Layer

Speed  Layer  (Storm)

RDBMS/DW  +  Full-­‐text  Search  +  Graph  Database

QFD  1

Batch  recompute

All  data  (HDFS/HBase)

QFD  1

Tableau/Spotfire

Excel/Apps

MDX/DW(T+ 1)

New  data  stream

Process  stream

Precompute  Views  (MapReduce)

Realtime  increment  

QFD  2

QFD  2

QFD  NBatch  views  (HDFS/Impala)

QFD  NRealtime  views  (Apache  Hbase)

Metadata  /  C lassification  /  Curation

Automation  /  Agg regation  /  Centralization

RDBMS Graph  Database

Merge

Full-­‐text  Search

COTS  Reporting  Tools

Ad-­‐hock  Analysis/Writeback:  Java/C#,R/Clojure,  HIVE/PIG,  Talend/3rd  

party,  ...

Alerts

Visualization

Qua

lity  /  Access  /  

Man

ipulation

Access  /  Centralization  /  Manipulation

Acquisition

Visualization

Lambda  Architecture :  query  =  func  (data,  ...)

Page 22: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Online resources and alternative stacks

¨  An Introduction to Data Science.PDF – Free e-book on Data Science with R under Creative Commons Licenses

¨  Berkeley Data Analytics Stack (Open Source: Mesos – cluster management, Spark/Streaming – cluster computing, Shark-SQL/DW)

¨  Learning Statistics with R, Free Big Data Education: Advanced Data Science

¨  DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr…)

¨  An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL

¨  Nathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture

¨  Open source clustered Lucene: elasticsearch used by GitHub (17 TB code)

Page 23: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

Distributed Computing System: CAP Theorem

https://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overview http://en.wikipedia.org/wiki/CAP_theorem http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

Consistency •  all nodes see the same data at

the same time

Availability •  a guarantee that every request

receives a response about whether it was successful or failed

Partition tolerance •  the system continues to operate

despite arbitrary message loss or failure of part of the system

Page 24: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

“Lambda Architecture”: Enterprise Data

•  Quality of data •  Ways to improve

data quality •  Discover hidden

business insights

•  Data sources •  Data formats (./

semi-/non-structured…)

•  Speed of change •  Speed of reaction

•  Data size •  Retention granular

level…

Volume Velocity

Value Variety

Page 25: Enterprise Data Problems in Investment Banks “BigData ... · PDF fileNathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture ! Open source clustered Lucene: elasticsearch

“Lambda Architecture” – Nathan Marz, BackType/Twitter

¨  Design Principle: ¤  Human fault-tolerance

¤  Immutability

¤  Pre-computation

¨  Lambda Architecture: ¤  Batch Layer

¤  Serving Layer

¤  Speed Layer

¨  Technology Stack ¤  Apache Hadoop/HBase/Cloudera Impala

¤  Twitter Storm