TRANSCRIPT
Democratizing Data Science in the Enterprise
Better Title: The NO BS Guide to Getting Insights from your Business Data
About Me
• Hackerpreneur
• Founder of Tellago
• Founder of KidoZen
• Board member
• Advisor: Microsoft, Oracle
• Angel Investor
• Speaker, Author
http://jrodthoughts.com
https://twitter.com/jrdothoughts
Agenda
• A brief history of data science
• Democratizing data science in the enterprise
• Building a great data science infrastructure
• Solving the last mile usability challenge
Key Takeaways
• How to build data science solutions in the real world without breaking the bank?
• What technologies can help?
• Myths and realities of data science solutions
Data Science….Still Magic?
It’s not a trick, it’s an illusion.
“Any sufficiently advanced technology is indistinguishable from magic.”
Arthur C. Clarke
1. Create technology: people who are not experts can use it easily, with little difficulty, and trust the output
2. Make it “sufficiently advanced”
“data science” (D. Conway, 2010)
• Basic Research: “Maybe someday, someone can use this.”
• Applied Research: “I might be able to use this.”
• Working Prototype: “I can use this (sometimes).”
• Quality Code: “Software engineers can use this.”
• Tool or Service: “People can use this.”
The Wizard….The Data Scientist
Fred Benenson (@fredbenenson), 21 Aug 2013:
“IMHO the majority of data work boils down to 3 things:
1. Counting stuff
2. Figuring out the denominator
3. The reproducibility of 1 & 2”
They’re hot these days…
“data science”: jobs, jobs, jobs
Where do they come from?
“data science”: ancient history, 2001

1962: John Tukey, “The Future of Data Analysis,” introduces “exploratory data analysis”
Tukey 1965, via John Chambers: TUKEY BEGAT S, WHICH BEGAT R
Tukey 1972, Jerome H. Friedman
TUKEY BEGAT ESL
TUKEY BEGAT VDQI
Tukey 1977: TUKEY BEGAT EDA
Fast forward -> 2001
Data Science in the Enterprise
Seems like magic…
But it boils down to 2 factors….
Data Science Success Factors in the Enterprise
• Building a great data science infrastructure
• Solving the last mile problem
Tricks to build a great data science infrastructure
Trick#1: Centralized Data Aggregation…
Goals & Challenges
I would like to…
• Correlate data from disparate data sources
• Enable a centralized data store for your enterprise
• Incorporate new information sources in an agile way
But…
• Traditional multi-dimensional data warehouses are difficult to modify
• They are designed around a specific set of questions (schema-first)
• It is hard to incorporate semi-structured and unstructured data
Centralized Data Aggregation: Best Practices
• Implement an enterprise data lake
• Rely on big data DW platforms such as Apache Hive
• Use a federated architecture efficiently partitioned for different business units
• Establish SQL as the common query language
• Leverage in-memory computing to optimize query performance
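As a sketch of the aggregation idea, the following uses Python's built-in sqlite3 as an in-memory stand-in for a SQL-queryable data lake such as Apache Hive; the table names and sample records are illustrative assumptions:

```python
import csv
import io
import json
import sqlite3

# In-memory SQLite stands in for a SQL-queryable data lake (e.g. Apache Hive).
lake = sqlite3.connect(":memory:")

# Source 1: structured CSV export from a line-of-business system.
crm_csv = "customer_id,region\n1,EMEA\n2,APAC\n"
lake.execute("CREATE TABLE crm (customer_id INTEGER, region TEXT)")
lake.executemany(
    "INSERT INTO crm VALUES (?, ?)",
    [(int(r["customer_id"]), r["region"])
     for r in csv.DictReader(io.StringIO(crm_csv))])

# Source 2: semi-structured JSON events, loaded without a rigid upfront schema.
events = [{"customer_id": 1, "amount": 120.0}, {"customer_id": 2, "amount": 80.0}]
lake.execute("CREATE TABLE orders (customer_id INTEGER, payload TEXT)")
lake.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(e["customer_id"], json.dumps(e)) for e in events])

# Both sources are now queryable with one language: SQL.
total = lake.execute("SELECT COUNT(*) FROM crm").fetchone()[0]
print(total)
```

The point is that new sources land as-is (raw JSON payloads included) and become queryable immediately, instead of waiting on a schema change to a warehouse.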
Centralized Data Aggregation: Technologies & Vendors
Trick#2: Data Discovery…
Goals & Challenges
I would like to…
• Organically discover data sources relevant to my job
• Help others discover data more efficiently
• Collaborate with colleagues about specific data sources
But…
• Business users typically don’t have access to the data lake
• There is no corporate data repository
• There is no search and metadata repository
Data Discovery: Best Practices
• Implement a corporate data catalog
• The data catalog should be the user interface to interact with the corporate data lake
• Copy ideas from data catalogs on the internet
• Provide rich metadata experience in your data catalog
• Extend your data lake with search capabilities
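A corporate data catalog with keyword search over metadata can be sketched minimally; the entries, fields, and search function below are illustrative, not any vendor's API:

```python
# Minimal sketch of a corporate data catalog: each entry carries rich
# metadata (owner, tags, description) so users can discover sources.
catalog = [
    {"name": "sales_orders", "owner": "finance", "tags": ["orders", "revenue"],
     "description": "Daily order transactions from the ERP system"},
    {"name": "web_clicks", "owner": "marketing", "tags": ["web", "behavior"],
     "description": "Clickstream events from the corporate website"},
]

def search(term):
    """Return names of catalog entries whose metadata mentions the term."""
    term = term.lower()
    return [e["name"] for e in catalog
            if term in e["description"].lower()
            or term in e["name"].lower()
            or any(term in t for t in e["tags"])]

hits = search("revenue")
print(hits)  # ['sales_orders']
```

In a real deployment the metadata would be harvested from the data lake and indexed by a search engine, but the user-facing contract stays this simple: type a business term, get back candidate data sources.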
Data Discovery: Technologies & Vendors
Trick#3: Establish a Common Query Language…
Goals & Challenges
I would like to…
• Query data from different business systems in a consistent way
• Correlate information from different line of business systems
• Reuse queries as new sources of information
But…
• Different business systems use different protocols to query data
• I need to learn a new query language to interact with my big data infrastructure
• Queries over large data sources can be SLOW
Query Language: Best Practices
• Standardize on SQL as the language to query business data
• Implement a SQL interface for your data lake
• Correlate data sources using simple SQL joins
• Materialize query results in your data lake for future reuse
• Invest in in-memory technologies to optimize performance
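The join-and-materialize pattern above can be sketched with Python's sqlite3 standing in for the data lake's SQL interface; the tables and figures are illustrative:

```python
import sqlite3

# sqlite3 stands in for the data lake's SQL interface.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Correlate two sources with a plain SQL join, then materialize the
# result as a new table so later queries can reuse it.
db.execute("""
    CREATE TABLE customer_revenue AS
    SELECT c.name, SUM(o.amount) AS revenue
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
""")
rows = dict(db.execute("SELECT name, revenue FROM customer_revenue").fetchall())
print(rows)  # {'Acme': 150.0, 'Globex': 75.0}
```

Materializing `customer_revenue` is what makes a slow correlation query reusable: downstream consumers read the precomputed table instead of re-running the join.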
Query Language: Technologies & Vendors
Trick#4: Focus on Data Quality…
Goals & Challenges
I would like to…
• Trust corporate data for my applications
• Actively merge new and historical data
• Integrate new data back into line of business systems
But…
• Data in line of business systems is poorly curated
• Some data records need to be validated or cleansed
• Some data records need to be enriched with additional data points
Data Quality: Best Practices
• Implement a data quality process
• Leverage your data catalog as the main user interface to control data quality
• Trust the wisdom of the crowds to manage data quality
• Provide a great user experience for data quality
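A data quality process with validate/cleanse/enrich steps might look like this sketch; the field names and the region lookup table are assumptions for illustration:

```python
# Sketch of a three-step data quality pass: validate, cleanse, enrich.
# Field names and the region lookup are illustrative assumptions.
REGION_BY_COUNTRY = {"DE": "EMEA", "JP": "APAC"}

def clean(records):
    out = []
    for r in records:
        # Validate: drop records missing required fields.
        if not r.get("email"):
            continue
        # Cleanse: normalize formatting.
        r["email"] = r["email"].strip().lower()
        # Enrich: add data points derived from reference data.
        r["region"] = REGION_BY_COUNTRY.get(r.get("country"), "UNKNOWN")
        out.append(r)
    return out

raw = [
    {"email": " Alice@Example.COM ", "country": "DE"},
    {"email": None, "country": "JP"},  # fails validation, dropped
]
cleaned = clean(raw)
print(cleaned)
```

Records that fail validation would normally be routed to a review queue in the data catalog rather than silently dropped; the crowd-sourced review is where "wisdom of the crowds" comes in.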
Data Quality: Technologies & Vendors
Trick#5: Understand your data….
Goals & Challenges
I would like to…
• Execute efficient queries against my corporate data
• Discover patterns and trends about business data sources
• Rapidly adapt to new data sources added to our business processes
But…
• There is no simple way to understand corporate data sources
• We rely on users to determine which queries to execute
• New data patterns and trends often go undetected
Understanding your Data: Best Practices
• Leverage machine learning algorithms to understand business data sources
• Leverage clustering algorithms to detect interesting patterns from your business data
• Leverage classification algorithms to place data records in well-defined groups
• Leverage statistical distribution algorithms to reveal interesting information about your data
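As a toy illustration of how clustering surfaces patterns without a user deciding which query to run, here is a one-dimensional k-means over a single business metric; the data, k=2, and "order sizes" framing are assumptions:

```python
import statistics

# Toy 1-D k-means (k=2): clustering a metric such as order size can reveal
# that the data splits into "small" and "large" segments nobody asked about.
def kmeans_1d(values, k=2, iters=20):
    # Seed centers by sampling the sorted values evenly.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[i].append(v)
        # Move each center to the mean of its group.
        centers = [statistics.mean(g) if g else c
                   for g, c in zip(groups, centers)]
    return sorted(centers)

order_sizes = [9, 10, 11, 98, 100, 102]
centers = kmeans_1d(order_sizes)
print(centers)  # [10.0, 100.0]
```

Real deployments would use a library implementation over many dimensions, but the mechanism is the same: the algorithm, not the analyst, proposes the groups.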
Understanding your Data: Technologies & Vendors
Trick#6: Predict…
Goals & Challenges
I would like to…
• Efficiently predict well-known variables in my business data
• Adapt results to future predictions
• Take actions based on the predicted outcomes
But…
• Our analytics are based on after-the-fact reports
• Traditional predictive analytics technologies don’t work well with semi-structured and unstructured data
• Traditional predictive analytics require complex infrastructure
Predict: Best Practices
• Implement a modern predictive analytics platform
• Leverage the data lake as the main source of information for predictive analytics algorithms
• Leverage classification and clustering algorithms as the main mechanisms to train predictions
• Expose predictions to other applications for future reuse
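A minimal sketch of training on labeled history and reusing the resulting predictions; the nearest-centroid approach and the feature/label names are illustrative choices, not a prescribed algorithm:

```python
import statistics

# Nearest-centroid classifier: average the features of each labeled class,
# then predict new records by the closest class centroid. Features here are
# illustrative, e.g. [monthly_spend, support_tickets] per customer.
def train(samples):
    """samples: list of (features, label). Returns label -> centroid."""
    by_label = {}
    for feats, label in samples:
        by_label.setdefault(label, []).append(feats)
    return {label: [statistics.mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def predict(model, feats):
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(model, key=lambda label: dist(model[label]))

model = train([
    ([100.0, 0.0], "retained"), ([120.0, 1.0], "retained"),
    ([10.0, 5.0], "churned"),   ([15.0, 6.0], "churned"),
])
print(predict(model, [110.0, 1.0]))  # retained
```

The trained `model` dict is the artifact worth exposing to other applications: any service that can serialize it can reuse the prediction without re-training.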
Predict: Technologies & Vendors
Trick#7: Take Actions…
Goals & Challenges
I would like to…
• Take actions on my business data without having to read a report
• Model automatic actions based on well-defined data rules
• Evaluate the effectiveness of the rules and adapt
But…
• Data results are mostly communicated via reports and dashboards
• There is no interface to design rules against business data
• Actions are implemented based on human interpretation of data
Take Actions: Best Practices
• Implement a rules engine that evaluates conditions over business data
• Model actions as well-defined rules against data conditions
• Trigger actions automatically based on analytic and predicted outcomes
• Measure the effectiveness of the rules and adapt them over time
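The goal of "automatic actions based on well-defined data rules" can be sketched as condition/action pairs evaluated over records; the rules, record fields, and actions below are illustrative assumptions:

```python
# Sketch of a small rules engine: rules are (condition, action) pairs over
# data records, so actions fire automatically instead of via reports.
triggered = []  # stands in for real side effects (tickets, alerts, orders)

rules = [
    (lambda r: r["inventory"] < r["reorder_point"],
     lambda r: triggered.append(("reorder", r["sku"]))),
    (lambda r: r["defect_rate"] > 0.05,
     lambda r: triggered.append(("alert_quality", r["sku"]))),
]

def evaluate(record, rules):
    """Fire the action of every rule whose condition matches the record."""
    for condition, action in rules:
        if condition(record):
            action(record)

evaluate({"sku": "A-1", "inventory": 3, "reorder_point": 10,
          "defect_rate": 0.01}, rules)
evaluate({"sku": "B-2", "inventory": 50, "reorder_point": 10,
          "defect_rate": 0.08}, rules)
print(triggered)  # [('reorder', 'A-1'), ('alert_quality', 'B-2')]
```

Logging what `triggered` (and what didn't) is also how you evaluate rule effectiveness and adapt the thresholds over time.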
Take Actions: Technologies & Vendors
Trick#8: Embrace developers…
Goals & Challenges
I would like to…
• Leverage data analyses in new applications
• Help developers embrace corporate data infrastructure
• Expose data analyses to new mediums such as mobile or IOT
But…
• Data results are mostly communicated via reports and dashboards
• Data analysis efforts are typically led by non-developers
• There is no easy way to organically discover and reuse corporate data sources
Leverage Developers: Best Practices
• Expose data sources and analyses via APIs
• Leverage industry standards to integrate with third party tools
• Provide data access samples and SDKs for different environments such as mobile and IOT clients
• Incorporate developer’s feedback into your data sources
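Exposing an analysis result as a JSON API might look like this sketch using Python's standard http.server; the endpoint path and payload are assumptions:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Illustrative analysis result to expose; in practice this would be read
# from the data lake or a materialized query.
ANALYSIS = {"metric": "churn_rate", "value": 0.042}

class DataAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/churn":
            body = json.dumps(ANALYSIS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port in a background thread, then act as a client.
server = ThreadingHTTPServer(("127.0.0.1", 0), DataAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/api/churn"
with urllib.request.urlopen(url) as resp:
    payload = json.loads(resp.read())
server.shutdown()
print(payload)  # {'metric': 'churn_rate', 'value': 0.042}
```

Once the analysis is behind plain HTTP/JSON, any mobile, web, or IoT client can consume it with its stock HTTP library, which is the point of the best practice.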
Leverage Developers: Technologies & Vendors
Trick#9: Real time data is different…
Goals & Challenges
I would like to…
• Process large volumes of real time data
• Aggregate real time and historical data
• Detect and filter conditions in my real time data before it goes into corporate systems
But…
• There is no infrastructure to query real time data
• We process real time and historical data using the same models
• Large data volumes affect performance
Real Time Data Processing: Best Practices
• Implement a stream analytics platform
• Model queries over real time data streams
• Add the results of the aggregated queries into the data lake
• Replay data streams to simulate real time conditions
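Modeling a query over a stream, and replaying a recorded stream to simulate real-time conditions, can be sketched with a sliding window; the window size, threshold, and readings are illustrative:

```python
from collections import deque

# Sketch of a windowed query over a real-time stream: keep a sliding window
# of recent readings and emit an aggregate for each full window.
def window_averages(stream, size=3):
    window = deque(maxlen=size)
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield sum(window) / size

# Replaying a recorded stream simulates real-time conditions for testing.
recorded = [10, 12, 11, 40, 42, 41]
averages = list(window_averages(recorded))

# Detect a condition before the data lands in corporate systems.
alerts = [a for a in averages if a > 30]
print(averages)  # [11.0, 21.0, 31.0, 41.0]
```

The aggregated averages, not the raw firehose, are what would be written back into the data lake, which keeps real-time and historical models separate while still letting them join later.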
Real Time Data Processing: Technologies & Vendors
Solving the last mile problem
Trick#1: Killer user experience…
Create a Killer User Experience
• Design matters
• Invest in an easy way for users to interact with corporate data sources
• Leverage modern UX principles that work across channels (mobile, web)
• Make data discoverable
• Leverage metadata
• Facilitate collaboration
Trick#2: Test test test…
Test Test Test
• Incorporate test models into your data sources
• Simulate real world conditions at the data level
• Assume everything will fail
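Data-level tests that simulate real-world failure conditions might look like this sketch; the checks and sample records are assumptions:

```python
# Sketch of "assume everything will fail": data-level checks run against a
# source, fed with simulated real-world defects (nulls, duplicate keys).
def check_not_null(records, field):
    """Return indexes of records where the field is null."""
    return [i for i, r in enumerate(records) if r.get(field) is None]

def check_unique(records, field):
    """Return indexes of records whose field value repeats an earlier one."""
    seen, dupes = set(), []
    for i, r in enumerate(records):
        if r[field] in seen:
            dupes.append(i)
        seen.add(r[field])
    return dupes

# Simulated real-world conditions: a missing value and a duplicate key.
records = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "c@x.com"},
]
null_rows = check_not_null(records, "email")
dupe_rows = check_unique(records, "id")
print(null_rows, dupe_rows)  # [1] [2]
```

Wiring checks like these into the pipeline, rather than the UI, is what "incorporate test models into your data sources" means in practice.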
Trick#3: Integrate with existing tools…
Integrate with Third Party Tools
• Integrate your data lake with mainstream tools like Tableau or Excel
• Use industry standards so that data sources can be incorporated by third party tools
Trick#4: Collaborate…
Collaborate
• Integrate data sources with modern messaging and collaboration tools: Slack, Yammer, etc.
• Distribute updates via emails, push notifications, SMSs
Other things to consider
• On-premise, cloud or hybrid?
• Apply agile development practices to your data science infrastructure
• Infrastructure is cool but usability is more important
Summary
• Data science is not magic, it’s an illusion
• Implementing data science in the enterprise is about solving two problems
• Building a great data infrastructure
• Solving the last mile usability challenge
• Today this can be done with commodity technology
• Data scientists are just “people” ;)
THANKS
Jesus Rodriguez
https://twitter.com/jrdothoughts
http://jrodthoughts.com/