Maximising data value
A vendor’s perspective
Contents
Executive summary ... 03
Opportunity ... 04
• Addressing market challenges ... 05
• Getting to grips with new data sources and techniques ... 06
• The Buy Side is increasing its focus on alternative data ... 07
• Capitalising on the market’s diverse data needs ... 08
Vision execution ... 10
• The information value chain ... 11
• Developing a signals factory proposition ... 12
• Mapping the data estate ... 13
• Effective monetisation of data assets ... 14
• Improving all facets of data quality with machine learning ... 15
• The four lines of defence for data quality ... 16
02
Maximising data value | A vendor’s perspective
Our analysis is intended to be a starting point for understanding the advantages and implications of using machine learning to maximise the value of alternative data. We hope to empower data vendors to provide additional utility for producers and consumers of alternative data.
Investors are increasing their focus on alternative data to maximise returns. The associated technology to fully harness alternative data remains a significant barrier to all but the largest firms. How can data vendors capitalise on the market’s diverse data needs?
Executive summary
Addressing market challenges
Alternative data is increasingly important in alpha generation, revealing multiple untapped customer engagement models for data vendors servicing investors.
Developing a signals factory
Creating and maintaining a signals factory proposition requires a diverse talent pool as a foundation, well designed processes and a high‑end technology stack.
Effective monetisation of data assets
Comprehensive mapping of the estate, the embedding of an appropriate governance model and a suitable value monetisation strategy are required to maximise data value.
Improving data quality with machine learning
Machine learning techniques can enhance quality by both automating existing tasks and extending monitoring and prediction to previously resistant quality dimensions.
Opportunity
Quantitative investment strategies and vendor solutions with alpha generation capabilities are becoming a critical component in returning the buy and sell sides’ ROE to pre-crisis levels.
Addressing market challenges
Sources: Macro trends database, January 2019; The Shift From Active to Passive Investing: Potential Risks to Financial Stability?, Federal Reserve System 2019; Deloitte Global Cost Survey, 2019; Alternative data for investment decisions: Today’s innovation could be tomorrow’s requirement. Deloitte Centre for Financial Services, 2018
Sophisticated quants
Systematic/quant investors, typically building their own analytics
Who
• Hedge Funds • Sophisticated Buy Side Firms
Key challenges
• Access to good quality raw data or to curated alternative data
• Maintaining access to cutting edge technology and algorithms
Customer needs
• Co‑location of analytics and data
• Simplified access to data and computation
• Simplified, but bespoke, data access

Traditional quants
Interested in derived analytics and more intuitive solutions
Who
• Large Sell Side (GSIBs) • Traditional Buy Side Firms
Key challenges
• Reducing technology costs associated with efficient research tools
• Retention and expansion of innovation talent
Customer needs
• Simplified access to data and computation
• Curated signals
• Simplified, but bespoke, data access

Traditional investors
Most intuitive solutions needed; limited technology and programming capability
Who
• Smaller Sell Side (DSIBs) • Small Buy Side and Family Offices
Key challenges
• Reducing technology costs associated with efficient research tools
• Building/maintaining an edge against passive benchmark returns
Customer needs
• Simplified access to data and computation
• Curated signals
• Sophisticated, but low maintenance/build cost, analytics platforms
• Elastic access to analytics and associated data science talent

Fintechs
Sophisticated but ultra small scale, with a focus on highly scalable business models
Who
• Alternative Data Providers • Signal Factories
Key challenges
• Simplified access to data
• Ability and agility to scale
• Support of cutting edge algorithms and alternative data sets
Customer needs
• Simplified access to data and computation
• Simplified, but bespoke, data access
• Sophisticated, but low maintenance/build cost, analytics platforms
• Marketplace creation
Investors are increasingly spending on alternative data, but building data science and engineering teams, and the associated analytics platforms to fully harness such diverse data, remains a significant barrier for all but the largest firms.
Getting to grips with new data sources and techniques
Market trends
Buy side spend on alternative data has increased over the previous three years and is expected to continue to grow:
• Poor active investment performance is driving a shift to passive products and fee compression
• Active investing strategies are starting to require more diverse data to generate strong alpha and beta predictive signals
• Savings from bundling of data streams are not currently possible due to the segmentation of the data providers market, but are becoming highly desirable
Total Buy Side spend on alternative data ($m)
Barriers to entry
Setting up a data science/engineering team capable of harnessing alternative data signals can be both expensive and time consuming:
• A diverse talent pool, typically not found within existing functions, is required to find, analyse, model and productionise alternative insights
• The technology infrastructure required to integrate alternative and traditional datasets further increases costs, serving as a pre‑requisite for analysis
• Processes that are not optimally engineered (e.g. by inappropriate staffing) often lead to technical debt, production failures and associated costs
• Alternative data variability makes proactive quality monitoring and remediation an issue in which significant resources are invested
Average buy side spend on datasets ($k)
AUM 2016 2017 2018E 2019E
<$1bn 35 63 107 158
$1bn – $10bn 213 340 506 764
>$10bn 1,288 1,954 3,104 4,041
Buy Side Avg. 841 1,267 2,005 2,640
Annual Salaries of Associated Talent
Role Entry Level Salary Bonus
Data Analyst $80k‑$100k ~25%
Data Scientist $80k‑$100k ~40%
Data Scout $70k‑$90k ~15%
Data Engineer $80k‑$110k ~30%
Head of Data $250k‑$1,000k ~100%
Total Buy Side spend on alternative data ($m): 2016: $232 · 2017: $400 · 2018E: $656 · 2019E: $1,088 · 2020E: $1,708
Source: Alternativedata.org
Minimum data team: 1 × Head of Data, 1 × Data Scientist, 1 × Data Engineer, 1 × Data Scout, 3 × Data Analyst.
Anticipated minimum spend between $1m–$2m p/a, dependent on technology maturity, existing talent and complexity of ambition.
The majority of buy side firms believe alternative data will positively impact their investment performance. Deloitte has surveyed over 100 investment managers (IMs) and has observed significant technological, talent and risk challenges that integrating such diverse data presents.
The Buy Side is increasing its focus on alternative data
IM firms’ opinion about the impact of alternative data on investment processes:
• Minimal impact
• Some impact; firms that utilise alternative data early may see some temporary advantages
• Alternative data leaders will see sustained advantages in some asset classes
• Alternative data represents a secular change in IM, and expertise in this area will separate winners and losers over the next 5 years

What is your organisation’s status for utilising alternative data?
• No part of the strategic plan
• Considering it, but no action at this point
• Currently using alternative data in a test environment
• Using alternative data to augment portfolio management decisions
Do you think utilisation of alternative data (or not) presents new or different risks to IM firms?
No, it’s business as usual
Our existing risk mgmt. framework can be adapted in the normal course of business to handle alternative data
A fresh look at the risks associated with this development is appropriate
The issues presented by alternative data are significant – our firm needs to refresh the risk mgmt. framework to assess them
Other
Our organisation’s adoption journey for alternative data utilisation likely will include, or already includes:
• Proprietary platforms and processes
• Use of alternative data aggregators or brokers to facilitate acquisition
• Use of alternative data crowd-sourced insights supplied by vendors
• Use of insights developed by sell-side analysts from alternative data
• More than one of these
Source: c. 110 responses from IM firms from the polls conducted during the Alternative Data Dbrief session on April 24, 2018. Data has been cleaned to exclude blank and ‘Don’t Know/Not Applicable’ responses
Multiple untapped value‑add customer engagement models exist for data vendors. Centralising the technology, talent and data processing to support these models presents multiple synergies for data vendors while also unlocking utility benefits for both producers and consumers.
Capitalising on the market’s diverse data needs
1. Data as a service
Minimally refined data supplied directly to customers. State of the art provides:
• Connected data, via a single point of access, and the ability to customise the data feed to a client’s specific requirements
• Cleansed data with appropriate imputation and normalised data concepts and entities
Focal customers
Sophisticated quants who build their own analytics and associated platforms, e.g.:
• Large Sell Side institutions • Quantitatively advanced Hedge Funds
Example players
Data Vendors: • IHS Markit • Bloomberg • Refinitiv • Euroclear (developing) • Deutsche Börse (developing)

2. Infra/Platform as a service
Flexible cloud infrastructure (and platforms) provisioned with simplified access to data. State of the art provides:
• Simplified access to data, while improving usage monitoring
• Co‑located cloud infrastructure capable of supporting ultra low latency algorithmic decisions (and reducing comms infra costs)
• Access to cloud based elastic/burst computing capabilities and a variety of price point storage solutions
Focal customers
As per DaaS, with a greater focus on latency dependent trading strategies:
• Large scale seeking ultra low latency
Example players
Data Vendors: • Refinitiv
Other Examples: • Google, Amazon, Microsoft (without co‑lo)

3. Analytics as a service
Analytics data platform built upon IaaS/PaaS with pre‑built environments for large scale. State of the art provides:
• Simplified access to data processing, providing off‑the‑rack data platform solutions that can be readily accessed
• App store engagement model that fosters an agile fintech ecosystem
Focal customers
• Mid‑Scale unable/unwilling to develop complex, data‑processing centric, cloud platforms
• Large scale looking to simplify path to innovation
• Fintechs seeking a lean data science focused operating model
Example players
Data Vendors: • No comprehensive/deep offerings within the major players
Other Examples: • Generic analytics vendors e.g. SAS, Cloudera, Pivotal
Capitalising on the market’s diverse data needs (Continued)
4. Managed analytics service
Combining the AaaS model with a diverse data science talent pool. State of the art provides:
• Access to seasoned users of the AaaS platform
• Access to rare skill sets such as graph theory, natural language processing, image processing etc., able to generate signals from data outside customer competencies
• Ultra flexible staffing model minimising overheads for R&D efforts
Focal customers
• Mid‑Scale unable/unwilling to develop complex, data‑processing centric, cloud platforms
• Large scale looking to simplify path to innovation
Example players
Data Vendors: • No major players (however the Quandl model is similar)
Other Examples: • Prof. services, e.g. Deloitte, Accenture, BCG Gamma etc.

5. Signal as a service
Pre‑generated signals that are sold to clients at a premium. State of the art provides:
• Pre‑built signals targeting market segments and use cases; where alternative data is used, a series of robust quality checks is applied
• Support for 3rd party vendors (i.e. those employing AaaS) to sell signals
• Utilisation of spare capacity within the Managed Analytics service
Focal customers
• Dependent on nature and pricing strategy of signal
• Smaller Scale Wealth Managers
Example players
Data Vendors: • IHS Markit (Research Signals service, ~60 clients) • Refinitiv (white papers on signals) • Quandl
Vision execution
In order to realise maximum value from the combined data assets, a combination of prioritisation, enhancement and consumer analysis is required, together with a sophisticated pricing structure that reflects the value of data assets to the end users.
The information value chain
Thorough risk assessment is required throughout the value chain to ensure that the data stored within the vendor and delivered to customers is regulatory compliant, technologically robust and ethically sound.
Data sourcing
Collecting diverse, market relevant data from a variety of traditional and alternative sources.

Data cleansing
Significant improvements in data quality require monitoring, cleansing and imputation using a variety of techniques, and must be applied throughout data’s lifecycle, from capture to archive.

Signal factories
The most significant value is generated from curating signals from raw data. A go to market strategy that provides both signals and the facility to create more signals is a true differentiator.

Sales
Unrefined data pricing is relatively straightforward, with commoditised pricing the norm. Signals data presents a larger challenge, with the value of the signal inversely proportional to its usage. Sophisticated pricing strategies, such as automated auctions, must be employed to maximise the asset’s value to both data vendors and their customers.

Distribution
A distribution strategy that minimises unnecessary latency by co-locating elastic compute and data capabilities is necessary.
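The inverse relationship between a signal’s value and its usage can be sketched with a toy model (the base value and decay rate below are invented for illustration, not figures from this report): per-consumer value decays with each additional subscriber, so vendor revenue has an interior maximum that motivates capping access.

```python
# Toy model: a signal's per-consumer value decays as more consumers
# trade on it, so total vendor revenue n * value(n) peaks at a finite
# consumer count. Parameters are illustrative assumptions only.

def signal_value(n_consumers, base_value=100.0, decay=0.15):
    """Per-consumer annual value of a signal, decaying with crowding."""
    return base_value * (1 - decay) ** (n_consumers - 1)

def vendor_revenue(n_consumers):
    """Total revenue if the signal is licensed to n consumers."""
    return n_consumers * signal_value(n_consumers)

# Search for the revenue-maximising consumer cap under these assumptions.
best_n = max(range(1, 51), key=vendor_revenue)
# With a 15% decay per additional consumer, revenue peaks at a cap of 6.
```

Under these toy assumptions the vendor is better off capping the feed at six consumers than selling it unrestricted, which is the intuition behind the licence-count strategies discussed later in the report.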
Creating and maintaining a signals factory requires a diverse talent pool as a foundation, well designed processes and a high end technology stack.
Developing a signals factory proposition
People
A diverse talent pool is required both to build and maintain the data engineering and analytics platform, and to support client engagement in a managed service model:
• Product managers – likely to serve as scrum (or equivalent) managers
• Data & machine learning engineers – expected both to build and maintain the platform infrastructure, and to productise models developed within the data science pool
• Data scientists – including image, NLP and network specialists, in addition to more traditional finance quant analysts
• Business analysts – expected to contain financial analysts capable of analysing and translating customer requests into data science problems
• Customer success managers – ensuring reliable, high quality outcomes are delivered to all customers

Technology
A robust and well maintained technology platform is critical to a signal factory’s success, and a partnership with a cloud supplier is likely to be a pre‑requisite. Key considerations include:
• Building in a cloud native fashion, to take full advantage of elastic storage and compute capabilities
• Support for a variety of data storage paradigms (e.g. graph, key value, columnar, relational)
• Seamless integration of exploration tools (e.g. Jupyter Notebooks, Tableau, Qlik)
• Model management frameworks, to simplify the promotion of models to production (likely to involve containerisation)
• State of the art cyber security to both ring‑fence sensitive client data and prevent external attacks
• Support for diverse hardware, including SSDs, GPUs, FPGAs etc.

Process
Building a frictionless service model for third parties to access either the signal factory platform or the data science talent that supports it relies upon robust governance of data, technology and talent:
• Strong data governance and stewardship, to ensure that data management is scalable without the need to scale effort
• Clear duty segregation to minimise key person risks and bottlenecks
• Pod models to support autonomy and agility

Marketplace
The utility benefits of the platform are fully realised when a diverse set of consumers has adopted the platform. Key areas of focus are:
• Encouraging data exploration (and hence increased consumption of data)
• Supporting open source, where possible, to build customers’ stake in the platform and its capabilities
• Reducing friction for peer‑to‑peer resale, fostering smaller (likely fintech) signal generating consumers
• Close monitoring and enhancement of user experience
Maximising the value of a data estate requires a comprehensive mapping of the estate and embedding an appropriate governance model prioritised by the estimated market value of data.
Mapping the data estate
Map
• Map all traditional assets and associated dictionaries
• Document existing distribution and storage approaches
• Identify data owners and stewards

Quality
• Assess data quality within assets, focusing on Clarity & Uniqueness, Validity & Consistency, Timeliness & Completeness and the Accuracy, Credibility & Confidence of the data sources

Value
Internal
• Define and share an explicit valuation methodology encompassing collection, usage, storage coverage and governance
External
• Define an appropriate pricing strategy for assets (see slide 14 ‘Effective monetisation of data assets’)

Market Map & Gap
• Map the existing data assets to the current and potential consumers
• Match the demand for analytics services with the investment in data assets

Network Maximisation
• Close gaps in coverage to realise network benefits of connected data
• Enhance depth of assets with proven value

Entities
• Reconcile duplicative data assets and cleanse where appropriate to drive data efficiency and minimise the risk of divergent and/or conflicting data

Access
• Document the access patterns available per data source, focusing on the consistency of the store front and the customisation ability of the customer
• Link data together to realise network valuation benefits
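The ‘Entities’ reconciliation step above can be illustrated with a minimal sketch (the records and the deliberately simple normalised-key match are hypothetical; a production estate would use probabilistic entity resolution, as the report notes elsewhere):

```python
# Sketch of reconciling duplicative entity records across feeds by a
# normalised name key. Records and feed labels are invented examples.
import unicodedata

def normalise(name):
    """Lowercase, strip accents and non-alphanumeric characters."""
    folded = unicodedata.normalize("NFKD", name.lower())
    return "".join(ch for ch in folded if ch.isascii() and ch.isalnum())

def reconcile(records):
    """Group records that share a normalised entity name."""
    merged = {}
    for rec in records:
        merged.setdefault(normalise(rec["name"]), []).append(rec)
    return merged

groups = reconcile([
    {"name": "Deutsche Börse", "feed": "A"},
    {"name": "DEUTSCHE BORSE", "feed": "B"},  # same entity, second feed
    {"name": "Euroclear", "feed": "A"},
])
# groups has two keys: 'deutscheborse' (two records) and 'euroclear'.
```

The design choice here is that divergence is detected before cleansing: duplicates are grouped rather than silently dropped, so conflicting attributes can be inspected.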
As a non‑depletable, non‑degradable asset, data represents a unique valuation challenge, particularly pertinent in financial markets, where greater usage of an asset crowds out value.
Effective monetisation of data assets
* Profit Maximization Auction and Data Management in Big Data Markets, Nanyang Technological University 2017
Value maximisation strategies

1. Qualitative
A qualitative approach is likely required to provide a benchmark that measures/complements other approaches. Considerations include:
• Cost of development
• Data quality of signal (degree of imputation etc.)
• Depth and breadth of signal coverage
• Value of other similar signal assets
• Uniqueness of the dataset/signal

2. License & latency
• Constraining the number of consumers of high-value data feeds is a useful heuristic to prevent over‑exploitation
• Dependent on the use case, artificial latency constraints can be used (or combined with other techniques) to support multiple consumers without over‑exploitation and consequent erosion of alpha generation opportunities

3. Profit sharing
Complex profit sharing mechanisms create feedback within a pricing system that incentivises both vendor and consumer to maximise the value of a given signal asset. Significant complexities exist within:
• Implementation, for example the negotiation of the degree of profit share and exposure in the event of signal failures
• Monitoring the agreed terms of the share

4. Auction
Allowing the market to define the price of an asset through an auction shows promise as a theoretically optimal solution*, but presents significant design and implementation challenges:
• How to support multiple customers exploiting different use cases
• Auction frequency
• Auction technology (e.g. automated agent based approach vs offline manual)
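As a minimal illustration of the auction strategy, a sealed-bid second-price (Vickrey) auction is one well-known design in which bidders are incentivised to bid their true value; the bidders and bid amounts below are hypothetical, and a real design would also address the multi-winner, frequency and agent questions listed above.

```python
# Sketch of a sealed-bid second-price auction for an exclusive
# data-feed licence: highest bidder wins, paying the second-highest bid.

def second_price_auction(bids):
    """Return (winner, price paid) for a dict of {bidder: bid}."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    # A lone bidder simply pays their own bid.
    price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
    return winner, price

winner, price = second_price_auction(
    {"fund_a": 120.0, "fund_b": 95.0, "fund_c": 150.0}
)
# winner == "fund_c", price == 120.0
```

Because the winner pays the runner-up’s bid rather than their own, each bidder’s dominant strategy is to reveal their true valuation of the feed, which is exactly the price-discovery property the auction strategy is after.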
Throughout data’s usage lifecycle, machine learning techniques can enhance quality by both automating existing tasks and extending monitoring and prediction to previously resistant quality dimensions.
Improving all facets of data quality with machine learning
Clarity & Uniqueness
Data quality challenge
• Clarity: is there sufficient data definition clarity to support decision making with the data?
• Uniqueness: is there a single source of truth, both globally and within a given data set?
Machine learning efficacy
• Data definitions should be profiled and monitored at scale with natural language processing
• Entity and data point resolution should be automated using probabilistic frameworks

Validity & Consistency
Data quality challenge
• Validity: is the data internally structurally sound, with datatype requirements obeyed throughout dimensions?
• Consistency: is the data externally structurally sound, with no impossible combinations of data attributes?
Machine learning efficacy
• Machine learning derived validity rules should complement existing manual approaches
• Frequent pattern mining techniques should propose and monitor credible data combinations, complementing existing MSCW frameworks

Timeliness & Completeness
Data quality challenge
• Timeliness: is the data available at the required time for a given application?
• Completeness: is data missing irrespective of time?
Machine learning efficacy
• Regardless of the root cause (atemporal/temporal) of absent data, machine learning techniques support effective imputation of missing data

Accuracy, Credibility & Confidence
Data quality challenge
• Accuracy: is the data an accurate reflection of the real world event(s) it describes?
• Credibility & Confidence: is the data credible, and what confidence level can be attributed to it, given its context (including any transformations it has undergone)?
Machine learning efficacy
• Accuracy should be measured to ensure that data is not just ‘grammatically correct’ but ‘semantically correct’
• Outlier analysis, both atemporal and temporal, should monitor the credibility of data
• Confidence measures for this accuracy are a critical, and underutilised, component in a holistic data quality framework
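The temporal outlier analysis described under Accuracy, Credibility & Confidence can be sketched with a rolling z-score check; the price series below is an invented toy example, and real monitoring would use richer probabilistic models.

```python
# Flag low-credibility points in a series: a point is suspect when it
# sits more than `threshold` standard deviations from the trailing window.
from statistics import mean, stdev

def rolling_zscore_outliers(series, window=5, threshold=3.0):
    """Return indices of points far outside their trailing window."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = mean(hist), stdev(hist)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

prices = [100, 101, 100, 102, 101, 100, 180, 101, 102]  # 180 is a bad tick
# rolling_zscore_outliers(prices) flags index 6 (the 180 tick).
```

A follow-up step in a full framework would attach a confidence score to each flagged point rather than a hard pass/fail, matching the ‘confidence measures’ bullet above.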
Significant improvements in data quality require monitoring using a variety of techniques and must be applied throughout data’s lifecycle, from capture to archive.
The four lines of defence for data quality
1. Rules based analysis
Transparent rules that describe known business scenarios
Strengths
• Low tech solution with a low barrier of entry
• Mature technology landscape with multiple vendors
• Transparent solution which harnesses depth of existing business knowledge (avoiding a ‘cold start’)
Weaknesses
• Effective rule transparency can be hampered by an unmanageable volume of rules
• Unsophisticated approach; not a credible solution to complex, often probabilistic, multidimensional issues
Machine learning techniques
• CART • FP Growth • RuleFit • Expert Systems

2. Probabilistic analysis
Extending coverage to known unknowns and unknown unknowns
Strengths
• Extends DQ coverage into more sophisticated probabilistic issues by removing the rule articulation constraint
• Supports sophisticated outlier analysis to flag unusual datapoints
Weaknesses
• Reduced transparency owing to complex probabilistic models
Machine learning techniques
• Bayesian Networks • Variational AutoEncoders • Iso‑Forests • Traditional Stats

3. Imputation
Addressing missing data caused by either timeliness or completeness issues
Strengths
• Minimises the effect of absent data by providing a richer (and more correct) version of the data for downstream modelling
Weaknesses
• High complexity, especially in time series data
• Large degrees of imputation can lead to undesirable uniformity in data and hence must be monitored over time (e.g. through the creation of additional metadata and reporting that surfaces issues)
Machine learning techniques
• GANs • Bayesian Networks • Spectral Analysis • Gaussian Mixture Models

4. Time series monitoring
Long term monitoring of the credibility of data and its evolution over time
Strengths
• Supports monitoring of single and multi‑dimensional data over time and hence is able to identify unusual trends (e.g. unexpectedly variable or uniform data) that may trigger downstream model updates
• Harnesses and respects the time sensitive nature of data streams
Weaknesses
• Relative complexity, with in house solutions presenting a maintenance burden and third party solutions costing a premium
Machine learning techniques
• Prophet • LSTMs • Econometrics • Anodot (example vendor solution)
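Of the four lines of defence, imputation is the simplest to sketch: the toy below fills gaps by linear interpolation between observed neighbours, far simpler than the GAN or Gaussian mixture approaches listed, but it shows the shape of the task (the series itself is invented).

```python
# Sketch of defence line 3 (imputation): fill None gaps in a series by
# linear interpolation between the nearest observed neighbours.

def interpolate_missing(series):
    """Return a copy of `series` with interior None gaps interpolated."""
    filled = list(series)
    observed = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(observed, observed[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

# interpolate_missing([10.0, None, None, 16.0, 18.0])
# -> [10.0, 12.0, 14.0, 16.0, 18.0]
```

Note the weakness called out above applies even here: heavy interpolation smooths the series, so the share of imputed points should itself be tracked as metadata for downstream consumers.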
This publication has been written in general terms and we recommend that you obtain professional advice before acting or refraining from action on any of the contents of this publication. Deloitte LLP accepts no liability for any loss occasioned to any person acting or refraining from action as a result of any material in this publication.
Deloitte LLP is a limited liability partnership registered in England and Wales with registered number OC303675 and its registered office at 1 New Street Square, London EC4A 3HQ, United Kingdom.
Deloitte LLP is the United Kingdom affiliate of Deloitte NSE LLP, a member firm of Deloitte Touche Tohmatsu Limited, a UK private company limited by guarantee (“DTTL”). DTTL and each of its member firms are legally separate and independent entities. DTTL and Deloitte NSE LLP do not provide services to clients. Please see www.deloitte.com/about to learn more about our global network of member firms.
© 2020 Deloitte LLP. All rights reserved.
Designed and produced by 368 at Deloitte. J19113