Monitoring Anomalies in Experimentation Platform
Deepak Vasthimal – MTS @ eBay
Connect Me - https://www.linkedin.com/in/whatisdeepak
Available on eBay Tech Blog - https://goo.gl/6bUbE9
Overview of Experimentation (A/B Test) Platform
• A/B testing compares different experiences and measures their performance.
• Variations can be in UI, components, algorithms, etc.
• Measures bankable metrics and non-bankable metrics (activity click rates).
• Enables data-driven decisions.
• Avoids releasing features to over 150 million users based on intuition.
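To make the comparison concrete, here is a minimal sketch of how a non-bankable metric such as a click rate might be compared between control and test. The function names and the numbers in the usage note are illustrative assumptions, not the platform's actual metric definitions.

```python
def click_rate(clicks: int, impressions: int) -> float:
    """Fraction of impressions that resulted in a click."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def relative_lift(control_rate: float, test_rate: float) -> float:
    """Relative change of the test metric over the control metric."""
    if control_rate == 0:
        raise ValueError("control rate must be non-zero")
    return (test_rate - control_rate) / control_rate
```

For example, with made-up counts, `relative_lift(click_rate(50, 1000), click_rate(60, 1000))` gives a lift of roughly 20% for the test experience. A real report would also attach a significance test before anyone acts on the number.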
Monitoring Anomalies in the Experimentation Platform
Experimentation Reporting
• 1500+ experiments
• Migrated from Teradata/SQL to Hadoop using Scala.
• Processes 100s of TBs of data daily on a Hadoop cluster with around 400 MapReduce (M/R) jobs.
• 200+ metrics generated daily using a batch system.
• Built using a mix of open source technologies like Scala, Scoobi, Hive, Hadoop and proprietary tech like Teradata and MicroStrategy.
High Level Flow
Anomalies with Experiments
• Traffic corruption – Traffic between test and control is skewed by UID/GUID.
• Tag corruption – Data is lost or corrupted during logging and transfer to HDFS.
• GUID Reset – Browser cookie.
• Cache refresh – eBay application servers maintain caches of experiment configurations. A software or hardware glitch can corrupt a cache.
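As an illustration of the first anomaly type, traffic skew between test and control can be flagged with a chi-square check of the observed split against the configured split. This is a common sample-ratio check swapped in for illustration, not necessarily the method the platform uses. The `TrafficCounts` type, its field names, and the 3.841 threshold (the 1-degree-of-freedom critical value at p = 0.05) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TrafficCounts:
    experiment_id: str   # hypothetical identifier
    test_users: int      # distinct GUIDs observed in the test group
    control_users: int   # distinct GUIDs observed in the control group


# Chi-square critical value for 1 degree of freedom at p = 0.05 (illustrative choice).
CRITICAL_VALUE = 3.841


def chi_square_skew(counts: TrafficCounts, expected_test_share: float = 0.5) -> float:
    """Chi-square statistic of the observed test/control split vs. the expected split."""
    if not 0.0 < expected_test_share < 1.0:
        raise ValueError("expected_test_share must be strictly between 0 and 1")
    total = counts.test_users + counts.control_users
    if total == 0:
        return 0.0
    expected_test = total * expected_test_share
    expected_control = total - expected_test
    return ((counts.test_users - expected_test) ** 2 / expected_test
            + (counts.control_users - expected_control) ** 2 / expected_control)


def is_traffic_skewed(counts: TrafficCounts,
                      expected_test_share: float = 0.5,
                      critical: float = CRITICAL_VALUE) -> bool:
    """True when the split deviates from the expectation more than chance would explain."""
    return chi_square_skew(counts, expected_test_share) > critical
```

With a 50/50 design, a split of 5300 versus 4700 users yields a statistic of 36 and is flagged, while 5050 versus 4950 yields 1.0 and is not.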
Monitoring Anomalies
• Identify and categorize anomalies within experiments using Teradata/Hive.
• Store identified anomalies in HDFS and route them to InfluxDB (a time series database, TSDB).
• Visualize using Grafana.
• I was introduced to Grafana through a SpaceX tweet by Torkel.
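The routing step can be sketched as turning each identified anomaly into a point in InfluxDB's line protocol (`measurement,tags fields timestamp`). The measurement name, tag keys, and field names below are hypothetical, since the slides do not show the actual schema, and the sketch only formats the line without covering the HTTP write itself.

```python
def escape_tag(value: str) -> str:
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def escape_measurement(value: str) -> str:
    """Escape commas and spaces in a measurement name."""
    for ch in (",", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value) -> str:
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):      # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol point; tags are sorted by key."""
    if not fields:
        raise ValueError("a point needs at least one field")
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{escape_tag(k)}={format_field(v)}" for k, v in fields.items())
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical anomaly record for illustration only.
point = to_line_protocol(
    "experiment_anomalies",
    {"experiment_id": "exp42", "type": "traffic_corruption"},
    {"affected_users": 120},
    1465839830100400200,
)
```

Keeping the anomaly category as a tag rather than a field is a design choice: tags are indexed in InfluxDB, so a Grafana panel can filter or group by category cheaply.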
Reasons we chose Grafana
• Visually pleasing graphs.
• Easy setup.
• Built-in query editor (SQL & UI).
• Instantly change dashboards using the duplicate dashboards/panels feature.
• InfluxDB was chosen because it can be set up in minutes, compared to Graphite/Prometheus.
• The entire pipeline took a couple of days.
Home Page
DrillDown
Search
Scale
• InfluxDB (v 0.11-1) is installed on a single node with 45 GB of memory.
• Grafana (v 3.0.2) is installed on a single node with 45 GB of memory.
• 2000 points are ingested daily, which is minuscule.
• Around 10 months of historical data are currently stored.
Grafana <3 is spreading.
• Performance analysis of the Experimentation platform's MapReduce jobs using job counters.
• Monitoring an Elasticsearch cluster (10k+ points per second).
• Anomaly detection in tracking data (10k+ points per minute).
• Each use case stores data in InfluxDB.
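To put these ingest rates next to the 2000 points per day of the anomaly dashboards, a quick back-of-the-envelope conversion helps. The rates are the lower bounds quoted on the slides, and the helper names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_DAY = 24 * 60


def per_day_from_per_second(points_per_second: int) -> int:
    """Daily point count for a sustained per-second ingest rate."""
    return points_per_second * SECONDS_PER_DAY


def per_day_from_per_minute(points_per_minute: int) -> int:
    """Daily point count for a sustained per-minute ingest rate."""
    return points_per_minute * MINUTES_PER_DAY


elasticsearch_daily = per_day_from_per_second(10_000)  # 864,000,000 points per day
tracking_daily = per_day_from_per_minute(10_000)       # 14,400,000 points per day
anomaly_daily = 2_000                                  # from the Scale slide
```

Sustained at those lower bounds, the Elasticsearch use case ingests about 432,000 times as many points per day as the anomaly dashboards, and the tracking use case about 7,200 times as many.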
Thank You