Importance of ‘Centralized Event collection’ and BigData platform for Analysis !

~/Piyush, Manager, Website Operations at MakeMyTrip

DevOpsDays India, Bangalore - 2013


DESCRIPTION

DevOpsDays India, Bangalore - 2013

TRANSCRIPT

Page 1

Importance of ‘Centralized Event collection’ and BigData platform for Analysis !

~/Piyush, Manager, Website Operations at MakeMyTrip

DevOpsDays India, Bangalore - 2013

Page 2

What to expect:

• MakeMyTrip data challenges!
• Event Data a.k.a. Logs & Log Analysis
• Why Centralized Logging … for systems and applications!
• Capturing Events: Why structured data emitted from apps for machines is a better approach!
• Data Service Platform: DSP – Why?
• Inputs: Data for DSP
• Top Architecture Considerations
• Top-level key tasks
• Tools Arsenal and API Management and Service Cloud

Page 3

MakeMyTrip data challenges …!

• Multi-DC / colocation setup
• Different types of data sources: internal / external (structured, semi-structured, unstructured)
  – Online Transaction Data Store
  – ERP
  – CRM
    • Email Behavior / Survey results
  – Web Analytics
  – Logs
    • Web
    • Application
    • User Activity logs
  – Social Media
  – Inventory / Catalog
  – Data residing in Excel files
  – Monitoring Metric Data:
    • Graphite (time-series Whisper)
    • Splunk, ElasticSearch (Logstash)
  – Many other different sources
• Storing and Analyzing Huge Event Data !

Page 4

Some challenges …!

• Aggregate web usage data and transactional data to generate one view
• Process multiple GBs-TBs of data every day
• Serve more than a million data services API requests per day
• Ensure business continuity as reliance on MyDSP keeps increasing
• Store terabytes of historical data
• Mesh transactional (online and offline) data with consumer behavior and derive analytics
• Build a flexible data ingestion platform to manage many data feeds from multiple data sources

Page 5

Flow of an Event

Page 6

Event Data a.k.a. Logs

• Event Data -> a set of chronologically sequenced data records that capture information about an event!

• Virtually every form of system produces event data – capture it from all components, and from both client-side and server-side events!

• You may think of logs as the footprint generated by any activity within the system/app.

• Event Data has different characteristics from data stored in traditional data warehouses:
  – Huge Volume: Event data accumulates rapidly and often must be stored for years; many organizations are managing hundreds of terabytes, and some are managing petabytes.
  – Format: Because of the huge variety of sources, event data is unstructured or semi-structured.
  – Velocity: New event data is constantly coming in.
  – Collection: Event data is difficult to collect because of broadly dispersed systems and networks.
  – Time-stamped: Event data is always inserted once, with a timestamp. It never changes.

Page 7

Log Analysis

• Logs are one of the most useful things when it comes to analysis; in simple terms, log analysis is making sense out of system/app-generated log messages (or just LOGS). Through logs we get insight into what is happening in the system.
• Helps root-cause analysis after any incident.
• Personalize the user experience by analyzing web usage data.

“Security Req”:
• Traditionally there are compliance requirements too: Log Management / SEM + SIM => SIEM.
• For data security – have one centralized platform for collecting ALL events (logs), correlate them, and have real-time intelligent visibility.
• Monitor not just the network, OS, devices, etc., but ALL applications and business processes too.

Page 8

Why Centralized Logging …for systems and applications !

• The need for centralized logging is quite important nowadays due to:
  – growth in the number of applications,
  – distributed architecture (Service-Oriented Architecture),
  – cloud-based apps,
  – the number of machines and the infrastructure size increasing day by day.

• This means that centralized logging and the ability to spot errors in distributed systems & applications has become even more “valuable” & “needed”. And most importantly:
  – be able to understand the customers and how they interact with websites;
  – understand change: whether using A/B or multivariate experiments, or tweaking / understanding new implementations.

Page 9

Capturing Events: Why structured data emitted from apps for machines is a better approach!

• Need for standardization:
  – Developers assume that the first-level consumer of a log message is a human, and only they know what information is needed to debug an issue.
  – Logs are not just for humans! The primary consumers of logs are shifting from humans to computers. This means log formats should have a well-defined structure that can be parsed easily and robustly.
  – Logs change! If logs never changed, writing a custom parser might not be too terrible: the engineer would write it once and be done. But in reality, logs change. Every time you add a feature, you start logging more data, and as you add more data, the printf-style format inevitably changes. This implies that the custom parser has to be updated constantly, consuming valuable development time.

• Suggested Approach: “Logging in JSON Format”
  – To keep it simple and generic for any application, the recommended approach is a {Key: Value} JSON log format (structured/semi-structured).
  – This approach makes parsing and consumption easy, irrespective of whatever technology/tools we choose to use! (A minimal sketch follows.)
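A minimal Python sketch of this JSON-logging idea, using only the standard library. The field names (severity, facility, short_message, uuid, pagename, RT) are borrowed from the sample event on Page 17; the logger name and values are illustrative assumptions, not the actual MakeMyTrip setup.

import json
import logging
import time
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "severity": record.levelname,
            "facility": record.name,
            "short_message": record.getMessage(),
        }
        # Merge any structured fields the caller passed via extra={"event_fields": {...}}
        event.update(getattr(record, "event_fields", {}))
        return json.dumps(event)

logger = logging.getLogger("clientSide")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One structured event instead of a free-form printf-style string
logger.info("interstitial display banner",
            extra={"event_fields": {"uuid": str(uuid.uuid4()),
                                    "pagename": "funnel:example com:page1",
                                    "RT": 2}})

Because the structure lives in the record rather than in a printf format string, adding a field later does not break downstream parsers.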

Page 10

Key things to keep in mind / Rules

• Use timestamps for every event.
• Use unique identifiers (IDs) like Transaction ID / User ID / Session ID, or append a unique user identification (UUID) number to track unique users.
• Log in text format – avoid logging binary information!
• Log anything that can add value when aggregated, charted, or further analyzed.
• Use categories for “severity”: INFO, WARN, ERROR, and DEBUG.
• The 80/20 rule: 80% of our goals can be achieved with 20% of the work, so don’t log too much.
• NTP-sync the same date/time and timezone on every producer and collector machine (#ntpdate ntp.example.com).
• Reliability: like video recordings … you don’t want to lose the most valuable shot, so you record every frame, and later during analysis you may throw away the rest, picking your best shot/frame. Here too – logs are recorded as events and should be recorded with proper reliability, so that you don’t lose any important and usable part, like that important video frame.
• Correlation rules for various event streams, to generate and minimize alerts/events (a toy example follows this list).
• Write connectors for integrations.
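A toy Python sketch of one possible correlation rule: collapse repeated ERROR events from the same UUID inside a 60-second window into a single alert. The window length, field names, and the rule itself are illustrative assumptions, not the rule engine described later in the deck.

from collections import defaultdict

WINDOW_SECONDS = 60  # assumed de-duplication window

def correlate(events):
    """events: iterable of dicts with 'uuid', 'severity' and epoch 'timestamp'."""
    last_alert = defaultdict(lambda: float("-inf"))  # uuid -> time of last alert
    alerts = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["severity"] != "ERROR":
            continue
        key = event["uuid"]
        if event["timestamp"] - last_alert[key] >= WINDOW_SECONDS:
            alerts.append({"uuid": key,
                           "first_seen": event["timestamp"],
                           "message": event.get("short_message", "")})
            last_alert[key] = event["timestamp"]
    return alerts

if __name__ == "__main__":
    sample = [{"uuid": "u1", "severity": "ERROR", "timestamp": t,
               "short_message": "payment gateway timeout"} for t in (0, 5, 30, 90)]
    print(correlate(sample))  # two alerts: at t=0 and t=90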

Page 11

Data Service Platform : DSP

Why do we need a data services platform?
– An integration layer to bring data from more sources in less time
– Serve various components – applications, and also monitoring systems, etc.

Page 12

Inputs : Data – what data to include

• Clickstream / Web Usage Data
  – User Activity Logs
• Transactional Data Store
• Off-line
  – CRM
  – Email Behavior -> Logs / Events

Page 13

Top Architecture Considerations

• Non-blocking data ingestion (see the sketch after this list)
• UUID-tagged events / messages
• Load-balanced data processing across data centers
• Use of memory-based data storage for real-time data systems
• Easily scalable, HA (highly available), and easy to maintain for large historical data sets
• Data caching to achieve low latency
• To ensure business continuity, parallel processing between two different data centers
• Use of a centralized service cloud for API management, security (authentication, authorization), metering, and integration
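A minimal Python sketch of the non-blocking ingestion idea: the request path only does a non-blocking put onto a bounded in-memory buffer, and a background worker ships batches downstream. The buffer size, batch size, and the send_downstream sink are illustrative assumptions; the deck does not specify this mechanism.

import queue
import threading
import time

BUFFER = queue.Queue(maxsize=10000)  # bounded, in-memory event buffer

def track_event(event):
    """Called inline by the application; must never block the user request."""
    try:
        BUFFER.put_nowait(event)
    except queue.Full:
        pass  # shed load instead of slowing the website (count drops in real life)

def send_downstream(batch):
    print(f"shipping {len(batch)} events")  # placeholder for Kafka / a log collector

def shipper():
    while True:
        batch = [BUFFER.get()]               # block only in the background thread
        while not BUFFER.empty() and len(batch) < 500:
            batch.append(BUFFER.get_nowait())
        send_downstream(batch)

threading.Thread(target=shipper, daemon=True).start()
track_event({"uuid": "24617072-...", "event1": "loading"})
time.sleep(0.1)  # give the shipper a moment to flush before the demo exits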

Page 14

Top-level key tasks for User Activity Logging & Analysis

1. Data collection of both client-side and server-side user activity streams
   • Tag every website visitor with a UUID, similar to the system UUIDs
   • Collect the activity streams on the BigData platform for analysis, through Kafka queues & NoSQL data stores (a producer sketch follows this list)

2. Near real-time data processing
   • Preprocessing / aggregations
   • Filtering, etc.
   • Pattern discovery along with the already-available cooked data from point 4
   • Clustering / classification / association discovery / sequence mining

3. Rule engine / recommendation algorithms
   • Rule engine: building an effective business rule engine / correlating events
   • Content-based filtering / collaborative filtering

4. Batch processing / post-processing using the Hadoop ecosystem
   • Analysis & storing cooked data in a NoSQL data store

5. Data services (web services)
   • RESTful APIs to make the data/insights consumable through various data services

6. Reporting/search interface & visualization for product development teams and other business owners.
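A short sketch of step 1 using the kafka-python client: each user-activity event is serialized as JSON and published to a Kafka topic. The topic name, broker address, and extra fields are assumptions for illustration; the deck only states that activity streams flow through Kafka queues into NoSQL stores.

import json
import uuid
from kafka import KafkaProducer  # kafka-python client

# Broker address and topic name below are illustrative placeholders.
producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_activity(pagename, session_id, **fields):
    """Publish one user-activity event to the (assumed) clickstream topic."""
    event = {"uuid": str(uuid.uuid4()),
             "pagename": pagename,
             "sessionID": session_id,
             **fields}
    producer.send("clickstream-events", value=event)

publish_activity("funnel:example com:page1", "11111111111111", event1="loading")
producer.flush()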

Page 15

Data System

• Every event is data! Let’s store everything!
• Precompute views: query = function(data)

Layered Architecture:
• Batch Layer: Hadoop M/R
• Speed Layer: Storm NRT computation
• Serving Layer (a merged-view sketch follows)
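A tiny Python illustration of the “query = function(data)” idea: the serving layer answers a query by merging a precomputed batch view with the speed layer’s incremental counts. The dicts and the page-view metric are stand-ins; in the deck the real layers are Hadoop M/R, Storm, and a NoSQL serving store.

# Batch view: recomputed periodically over all historical events (Hadoop M/R output).
batch_view = {"page:home": 1_200_000, "page:flights": 800_000}
# Speed view: incremental counts for events seen since the last batch run (Storm).
speed_view = {"page:home": 4_312, "page:flights": 2_109}

def page_views(page_key):
    """Merge batch and speed layers so queries see (almost) all events."""
    return batch_view.get(page_key, 0) + speed_view.get(page_key, 0)

print(page_views("page:home"))  # 1204312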

Page 16

Page 17

Clickstream / User Activities Capture: Data is -> “Events”

• Tag every website visitor with a UUID using an Apache module – Done
  – https://github.com/piykumar/modified_mod_cookietrack
  – Cookie: UUID like 24617072-3124-674f-4b72-675746562434.1381297617597249

• JSON messages like:

{
  "timestamp": "2012-12-14T02:30:18",
  "facility": "clientSide",
  "clientip": "123.123.123.123",
  "uuid": "24617072-3124-5544-2f61-695256432432.1379399183414528",
  "domain": "www.example.com",
  "server": "abc-123",
  "request": "/page/request",
  "pagename": "funnel:example com:page1",
  "searchKey": "1234567890_",
  "sessionID": "11111111111111",
  "event1": "loading",
  "event2": "interstitial display banner",
  "severity": "WARN",
  "short_message": "....meaning short message for aggregation...",
  "full_message": "full LOG message",
  "userAgent": "...blah...blah..blah...",
  "RT": 2
}

Page 18

Tools Arsenal

• ETL: Talend
• BI: SpagoBI & QlikView
• Hadoop: Hortonworks
• NRT Computation: Twitter Storm
• Document-Oriented NoSQL DB: Couchbase
• Distributed Search: ElasticSearch
• Log Collection: Flume, Logstash, Syslog-NG
• Distributed Messaging System: Kafka, RabbitMQ
• NoSQL: Cassandra, Redis, Neo4j (Graph)
• API Management: WSO2 API Manager, 3scale / Nginx
• Programming Languages: Java, Python, R

Page 19

API Management and Data Services Cloud

• 3scale / Nginx, WSO2 API Manager, etc.
  – A centralized, distributed repository to serve APIs, providing throttling, metering, security features, etc. (a toy throttling sketch follows this list)

• Build the data-services-layer mindset into the culture, and make sure that whatever components you create can either be chained into the pipeline or called independently.
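The deck delegates throttling and metering to 3scale / Nginx / WSO2 API Manager; the toy token bucket below only illustrates the per-client throttling concept in Python, and is not how those products are actually configured.

import time

class TokenBucket:
    """Allow bursts up to `burst` calls, refilling at `rate_per_sec` tokens/second."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=10, burst=20)  # ~10 API calls/sec per client (assumed limit)
print(all(bucket.allow() for _ in range(20)), bucket.allow())  # True False (burst spent)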

Page 20

Thanks!

Questions – If Any !

~/Piyush | @piykumar | http://piyush.me