elasticsearch + cascading for scalable log processing

26
DRIVING INNOVATION THROUGH DATA Elasticsearch Meetup, Oct 30 2014 LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH

Upload: cascading

Post on 12-Jul-2015

234 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Elasticsearch + Cascading for Scalable Log Processing

DRIVING INNOVATION THROUGH DATA

Elasticsearch Meetup, Oct 30 2014

LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH

Page 2: Elasticsearch + Cascading for Scalable Log Processing

• Making sense of large amounts of [semi|un]structured data • What type of log file data? ‣ Syslog ‣ Web log files (Apache, Nginix, WebTrends, Omniture) ‣ POS transactions ‣ Advertising impressions (Doubleclick DART, OpenX, Atlas) ‣ Twitter firehose (yes, it’s a log file!)

• Anything with a timestamp and data

WHAT IS LOG FILE ANALYTICS?

2

Page 3: Elasticsearch + Cascading for Scalable Log Processing

LOGSTASH ARCHITECTURE

3

h"p://www.slashroot.in/logstash1tutorial1linux1central1logging1server7

• Data  collec*on  is  flexible  • Lots  of  input/output  plugins  • Grok  filtering  is  easy  • Kibana  UI  is  a?rac*ve  

Page 4: Elasticsearch + Cascading for Scalable Log Processing

• Provide richer log-processing capabilities • Integrate & correlate with other information ‣ Large list of integration adapters

• Analyze large volumes of log data • Capture & retain unfiltered log data • Operationalize your log-processing application

WHAT CAN WE DO WITH CASCADING + LOGSTASH?

4

Page 5: Elasticsearch + Cascading for Scalable Log Processing

GET TO KNOW CONCURRENT

5

Leader in Application Infrastructure for Big Data

• Building enterprise software to simplify Big Data application development and management

Products and Technology

• CASCADINGOpen Source - The most widely used application infrastructure for building Big Data apps with over 175,000 downloads each month

• DRIVEN Enterprise data application management for Big Data apps

Proven — Simple, Reliable, Robust

• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.

Founded: 2008 HQ: San Francisco, CA

CEO: Gary Nakamura CTO, Founder: Chris Wensel

www.concurrentinc.com

Page 6: Elasticsearch + Cascading for Scalable Log Processing

Cascading Apps

CASCADING - DE-FACTO STANDARD FOR DATA APPS

6

New Fabrics

ClojureSQL Ruby

StormTez

Supported Fabrics and Data Stores

Mainframe DB / DW Data Stores HadoopIn-Memory

• Standard for enterprise data app development

• Your programming language of choice

• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …

Page 7: Elasticsearch + Cascading for Scalable Log Processing

CASCADING 3.0

7

“Write once and deploy on your fabric of choice.”

• The Innovation — Cascading 3.0 will allow for data apps to execute on existing and emerging fabrics through its new customizable query planner.

• Cascading 3.0 will support — Local In-Memory, Apache MapReduce and soon thereafter (3.1) Apache Tez, Apache Spark and Apache Storm

Enterprise Data Applications

MapReduceLocal In-Memory

Apache Tez, Storm,

Computation Fabrics

Page 8: Elasticsearch + Cascading for Scalable Log Processing

… AND INCLUDES RICH SET OF EXTENSIONS

8

http://www.cascading.org/extensions/

Page 9: Elasticsearch + Cascading for Scalable Log Processing

DEMO: WORD COUNT EXAMPLE WITH CASCADING

9

String docPath = args[ 0 ];String wcPath = args[ 1 ];Properties properties = new Properties();AppProps.setApplicationJarClass( properties, Main.class );HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

configuration

integration// create source and sink tapsTap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

processing

// specify a regex to split "document" text lines into token streamFields token = new Fields( "token" );Fields text = new Fields( "text" );RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );// only returns "token"Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );// determine the word countsPipe wcPipe = new Pipe( "wc", docPipe );wcPipe = new GroupBy( wcPipe, token );wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

scheduling

// connect the taps, pipes, etc., into a flow definitionFlowDef flowDef = FlowDef.flowDef().setName( "wc" ) .addSource( docPipe, docTap ) .addTailSink( wcPipe, wcTap );// create the FlowFlow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of WorkwcFlow.complete(); // <<-- Runs jobs on Cluster

Page 10: Elasticsearch + Cascading for Scalable Log Processing

• Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical

• Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct)

• Aggregations ‣ Count, Average, etc

SOME COMMON PATTERNS

10

filter

filter

function

functionfilterfunctiondata

PipelineSplit Join

Merge

data

Topology

Page 11: Elasticsearch + Cascading for Scalable Log Processing

THE STANDARD FOR DATA APPLICATION DEVELOPMENT

11

www.cascading.org

Build data apps that are

scale-free

Design principals ensure best practices at any scale

Test-Driven Development

Efficiently test code and process local files before

deploying on a cluster

Staffing Bottleneck

Use existing Java, SQL, modeling skill sets

Operational Complexity

Simple - Package up into one jar and hand to

operations

Application Portability

Write once, then run on different computation

fabrics

Systems Integration

Hadoop never lives alone. Easily integrate to existing

systems

Proven application development framework for building data apps

Application platform that addresses:

Page 12: Elasticsearch + Cascading for Scalable Log Processing

• Java API

• Separates business logic from integration

• Testable at every lifecycle stage

• Works with any JVM language

• Many integration adapters

CASCADING

12

Process Planner

Processing API Integration APIScheduler API

Scheduler

Apache Hadoop

Cascading

Data Stores

ScriptingScala, Clojure, JRuby, Jython, Groovy

Enterprise Java

Page 13: Elasticsearch + Cascading for Scalable Log Processing

BUSINESSES DEPEND ON US

13

• Cascading Java API

• Data normalization and cleansing of search and click-through logs for

use by analytics tools, Hive analysts

• Easy to operationalize heavy lifting of data in one framework

Page 14: Elasticsearch + Cascading for Scalable Log Processing

BROAD SUPPORT

14

Hadoop ecosystem supports Cascading

Page 15: Elasticsearch + Cascading for Scalable Log Processing

Confidential

CASCADING DEPLOYMENTS

1515

Page 16: Elasticsearch + Cascading for Scalable Log Processing

OPERATIONAL EXCELLENCE

From Development — Building and Testing• Design & Development • Debugging • Tuning

To Production — Monitoring and Tracking• Maintain Business SLAs • Balance & Controls • Application and Data Quality • Operational Health • Real-time Insights

Visibility Through All Stages of App Lifecycle

16

Page 17: Elasticsearch + Cascading for Scalable Log Processing

DRIVEN ARCHITECTURE

Page 18: Elasticsearch + Cascading for Scalable Log Processing

• Easily comprehend, debug, and tune your data applications

• Get rich insights on your application performance

• Monitor applications in real-time • Compare app performance with

historical (previous) iterations

DEEPER VISUALIZATION INTO YOUR HADOOP CODE

18

Debug and optimize your Hadoop applications more effectively with Driven

Page 19: Elasticsearch + Cascading for Scalable Log Processing

• Quickly breakdown how often applications execute based on their tags, teams, or names

• Immediately identify if any application is monopolizing cluster resources

• Understand the utilization of your cluster with a timeline of all applications running

GET OPERATIONAL INSIGHTS WITH DRIVEN

19

Visualize the activity of your applications to help maintain SLAs

Page 20: Elasticsearch + Cascading for Scalable Log Processing

• Easily keep track of all your applications by segmenting them with user-defined tags

• Segment your applications for trending analysis, cluster analysis, and developing chargeback models

• Quickly breakdown how often applications execute based on their tags, teams, or names

ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY

20

Segment your applications for greater insights across all your applications

Page 21: Elasticsearch + Cascading for Scalable Log Processing

• Invite others to view and collaborate on a specific application

• Gain visibility to all the apps and their owners associated with each team

• Simply manage your teams and the users assigned to them

COLLABORATE WITH TEAMS

21

Utilize teams to collaborate and gain visibility over your set of applications

Page 22: Elasticsearch + Cascading for Scalable Log Processing

• Identify problematic apps with their owners and teams

• Search for groups of applications segmented by user-defined tags

• Compare specific applications with their previous iterations to ensure that your application can meet its SL

MANAGE PORTFOLIO OF BIG DATA APPLICATIONS

22

Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for

Page 23: Elasticsearch + Cascading for Scalable Log Processing

• Understand the anatomy of your Hive app • Track execution of queries as single business process • Identify outlier behavior by comparison with historical runs • Analyze rich operational meta-data • Correlate Hive app behavior with other events on cluster

DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS

23

Page 24: Elasticsearch + Cascading for Scalable Log Processing

• Logstash provides a flexible and a robust way to collect log data; Grok lets you parse logs without coding. Kibana UI is attractive to analyze the information

• Cascading is the de-facto framework for building Big Data (Hadoop) applications and processing data at scale

• Cascading+Logstash let’s you develop applications to collect and process large volumes of data

• With Driven, you can put your mission critical log-processing applications in production and monitor SLAs

TAKE AWAY POINTS

24

Page 25: Elasticsearch + Cascading for Scalable Log Processing

CONTACT INFORMATION

Supreet [email protected]

650-868-7675 (m) @supreet_online

Page 26: Elasticsearch + Cascading for Scalable Log Processing

DRIVING INNOVATION THROUGH DATASupreet Oberoi

THANK YOU