Analyzing Log Data with Spark and Looker (May 18, 2016)


TRANSCRIPT

Page 1: Analyzing Log Data with Spark and Looker

May 18, 2016

Analyzing Log Data with Spark and Looker

Page 2: Analyzing Log Data with Spark and Looker

Housekeeping

• We will do Q&A at the end.

• You should see a box on the right side of your screen.

• There is a button marked “Q&A” on the bottom menu.

• We will be recording this webinar.

• We will send you the recording tomorrow.

• We will also send you the slides tomorrow.


Page 3: Analyzing Log Data with Spark and Looker

Meet Our Presenters

Scott Hoover, Data Scientist, Looker

Daniel Mintz, Chief Data Evangelist, Looker

Page 4: Analyzing Log Data with Spark and Looker


AGENDA

1. Overview
2. Quick intro to Looker
3. Creating a pipeline with Spark
4. Analyzing Log Data
5. Deriving insight from Log Data
6. Q&A

Page 5: Analyzing Log Data with Spark and Looker

Looker makes it easy for everyone to find, explore, and understand the data that drives your business.


Page 6: Analyzing Log Data with Spark and Looker

THE TECHNICAL PILLARS THAT MAKE IT POSSIBLE

100% In-Database
Leverage all your data. Avoid summarizing or moving it.

Modern Web Architecture
Access from anywhere. Share and collaborate. Extend to anyone.

LookML Intelligent Modeling Layer
Describe the data. Create reusable and shareable business logic.


Page 7: Analyzing Log Data with Spark and Looker


Creating a Pipeline with Spark

Page 12: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

Who is this webinar for?

● You have some exposure to Spark.

● You have log data (with an interest in processing it at scale).

● Terminal doesn’t terrify you.

● You know some SQL.

Page 16: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

Alright—just in case: What is Spark?

● In-memory, distributed compute framework.

● Drivers to access Spark: spark-shell, JDBC, submit application jar.

● Core concepts:

  ○ Resilient distributed dataset (RDD)
  ○ DataFrame
  ○ Discretized stream (DStream)
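
To make those concepts concrete, here is a minimal sketch (not from the deck), assuming a Spark 1.6-era spark-shell where sc and sqlContext are predefined; all names below are illustrative.

import sqlContext.implicits._

// RDD: a distributed collection of raw values
val linesRdd = sc.parallelize(Seq("GET /department", "GET /home"))

// DataFrame: the same data with a schema, queryable via Spark SQL
case class Request(method: String, uri: String)
val requestsDf = linesRdd.map(_.split(" ")).map(parts => Request(parts(0), parts(1))).toDF()
requestsDf.show()

// DStream: a sequence of RDDs arriving over time (sketch only; needs a StreamingContext)
// val ssc = new StreamingContext(sc, Seconds(10))
// val requestStream = ssc.socketTextStream("localhost", 9999)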

Page 22: Analyzing Log Data with Spark and Looker

Why is Spark of interest to us?

● APIs galore: Java, Scala, Python and R.

● Rich ecosystem of applications:

  ○ Spark SQL
  ○ Streaming
  ○ MLlib
  ○ GraphX

● Hooks into the broader Hadoop ecosystem:

  ○ Filesystem: S3, HDFS
  ○ Movement: Kafka, Flume, Kinesis
  ○ Database: HBase, Cassandra, and JDBC

● Learning curve is relatively forgiving.

● Looker can issue SQL to Spark via JDBC!

Creating a Pipeline with Spark
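
On that last point, the sketch below (not from the deck) shows roughly what a JDBC client such as Looker does against Spark SQL. It assumes the Spark Thrift Server is running on localhost:10000 with the Hive JDBC driver on the classpath, and it queries the logdata.event table defined later in this webinar; the object name and query are illustrative.

import java.sql.DriverManager

object SparkJdbcSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // The Spark Thrift Server speaks the HiveServer2 protocol, so a plain Hive JDBC URL works.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("select status, count(*) as hits from logdata.event group by status")
    while (rs.next()) println(s"${rs.getInt("status")} -> ${rs.getLong("hits")}")
    conn.close()
  }
}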

Page 23: Analyzing Log Data with Spark and Looker

Let’s review a couple of proposed architectures.

Creating a Pipeline with Spark

Page 24: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

[Architecture diagram. Stages: Data Sources → Ingestion (RPC/REST/Pull, RPC/Put, stream) → Storage (raw logs/text) → Compute (MapReduce over raw logs, producing Parquet/ORC registered as an external table in the metastore) → BI/Viz./Analytics via JDBC.]

Page 25: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

[Architecture diagram (streaming variant). Stages: Data Sources → Ingestion (RPC/REST/Pull, RPC/Put of raw logs) → streaming Compute writes Parquet/ORC to Storage, registered as an external table in the metastore → BI/Viz./Analytics via JDBC.]

Page 26: Analyzing Log Data with Spark and Looker

Sidebar: Why the emphasis on Parquet and ORC?

Creating a Pipeline with Spark

Page 27: Analyzing Log Data with Spark and Looker

Let’s step through the essential components of this pipeline.

Creating a Pipeline with Spark

Page 28: Analyzing Log Data with Spark and Looker

# name the components of agent
agent.sources = terminal
agent.sinks = logger spark
agent.channels = memory1 memory2

# describe source
agent.sources.terminal.type = exec
agent.sources.terminal.command = tail -f /home/hadoop/generator/logs/access.log

# describe logger sink
agent.sinks.logger.type = logger

# describe spark sink
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = localhost
agent.sinks.spark.port = 9988

# channel buffers events in memory (used with logger sink)
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 10000
agent.channels.memory1.transactionCapacity = 1000

# channel buffers events in memory (used with spark sink)
agent.channels.memory2.type = memory
agent.channels.memory2.capacity = 10000
agent.channels.memory2.transactionCapacity = 1000

# tie source and sinks with respective channels
agent.sources.terminal.channels = memory1 memory2
agent.sinks.logger.channel = memory1
agent.sinks.spark.channel = memory2

Creating a Pipeline with Spark
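
Not shown on the slide: an agent configured this way would typically be launched with Flume's own CLI, roughly flume-ng agent --conf conf --conf-file agent.conf --name agent, where the --name value must match the property prefix used above (agent) and the Spark sink's hostname and port (localhost:9988) must match what the streaming job polls. The file name agent.conf is illustrative.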

Page 29: Analyzing Log Data with Spark and Looker

2.174.143.4 - - [09/May/2016:05:56:03 +0000] "GET /department HTTP/1.1" 200 1226 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2)" "USER=0;NUM=9"

Creating a Pipeline with Spark

Page 30: Analyzing Log Data with Spark and Looker

scala> case class Foo(bar: String)

scala> val instanceOfFoo = Foo("i am a bar")

scala> instanceOfFoo.bar
res0: String = i am a bar

Creating a Pipeline with Spark

Page 31: Analyzing Log Data with Spark and Looker

case class LogLine (
  ip_address: String,
  identifier: String,
  user_id: String,
  created_at: String,
  method: String,
  uri: String,
  protocol: String,
  status: String,
  size: String,
  referer: String,
  agent: String,
  user_meta_info: String
)

Creating a Pipeline with Spark

Page 32: Analyzing Log Data with Spark and Looker

val ip_address     = "([0-9\\.]+)"             // 2.174.143.4
val identifier     = "(\\S+)"                  // -
val user_id        = "(\\S+)"                  // -
val created_at     = "(?:\\[)(.*)(?:\\])"      // 09/May/2016:05:56:03 +0000
val method         = "(?:\")([A-Z]+)"          // GET
val uri            = "(\\S+)"                  // /department
val protocol       = "(\\S+)(?:\")"            // HTTP/1.1
val status         = "(\\S+)"                  // 200
val size           = "(\\S+)"                  // 1226
val referer        = "(?:\")(\\S+)(?:\")"      // -
val agent          = "(?:\")([^\"]*)(?:\")"    // Mozilla/4.0 ...
val user_meta_info = "(.*)"                    // USER=0;NUM=9

val lineMatch = s"$ip_address\\s+$identifier\\s+$user_id\\s+$created_at\\s+$method\\s+$uri\\s+$protocol\\s+$status\\s+$size\\s+$referer\\s+$agent\\s+$user_meta_info".r

Creating a Pipeline with Spark
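
As a quick sanity check (not on the slide), the pattern can be exercised in spark-shell against the sample line shown a few slides back, assuming the vals above have already been pasted in; the user-agent string is shortened here for readability.

val sample = "2.174.143.4 - - [09/May/2016:05:56:03 +0000] " +
  "\"GET /department HTTP/1.1\" 200 1226 \"-\" \"Mozilla/4.0 (compatible; MSIE 8.0)\" \"USER=0;NUM=9\""

sample match {
  case lineMatch(ip, _, _, ts, method, uri, _*) => println(s"$ip requested $uri via $method at $ts")
  case _                                        => println("no match")
}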

Page 33: Analyzing Log Data with Spark and Looker

def extractValues(line: String): Option[LogLine] = {
  line match {
    case lineMatch(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info, _*) =>
      return Option(LogLine(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info))
    case _ => None
  }
}

Creating a Pipeline with Spark

Page 34: Analyzing Log Data with Spark and Looker

def main(argv: Array[String]): Unit = {

  // create streaming context
  val ssc = new StreamingContext(conf, Seconds(batchDuration))

  // ingest data
  val stream = FlumeUtils.createPollingStream(ssc, host, port)

  // extract message body from avro-serialized flume events
  val mapStream = stream.map(event => new String(event.event.getBody().array()))

  // parse log into LogLine class and write parquet to hdfs/s3
  mapStream.foreachRDD { rdd =>
    rdd.map { line => extractValues(line).get }
      .toDF()             // convert rdd to dataframe with schema
      .coalesce(1)        // consolidate to single partition
      .write              // serialize
      .format("parquet")  // write parquet
      .mode("append")     // don’t overwrite files; append instead
      .save("hdfs://some.endpoint:8020/path/to/logdata.parquet")
  }

}

Creating a Pipeline with Spark
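
For completeness, none of the following appears on the slide: the snippet assumes an enclosing application object that defines conf, batchDuration, host, and port, pulls in the spark-streaming-flume integration, and (inside main) imports the SQL implicits that provide toDF. A rough skeleton with illustrative names and values might look like this; the real application also needs ssc.start() and ssc.awaitTermination() to run.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object LogDataStreamer {                 // illustrative name, not taken from the repo
  val conf          = new SparkConf().setAppName("logDataWebinar")
  val batchDuration = 10                 // seconds per micro-batch (illustrative)
  val host          = "localhost"        // must match the Flume Spark sink hostname
  val port          = 9988               //   ...and its port

  // Inside main, after creating ssc, the implicits that enable rdd.toDF():
  //   val sqlContext = new SQLContext(ssc.sparkContext)
  //   import sqlContext.implicits._

  // def main(argv: Array[String]): Unit = { ... }   // as shown above, plus ssc.start() / ssc.awaitTermination()
}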

Page 35: Analyzing Log Data with Spark and Looker

create external table logdata.event (
    ip_address string
  , identifier string
  , user_id string
  , created_at timestamp
  , method string
  , uri string
  , protocol string
  , status int
  , size int
  , referer string
  , agent string
  , user_meta_info string
)
stored as parquet
location 'hdfs://some.endpoint:8020/path/to/logdata.parquet';

Creating a Pipeline with Spark
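
Because the table is external, this DDL only registers metadata in the metastore and points it at the directory the streaming job writes to; the Parquet files themselves are left untouched. Any client that reaches Spark over JDBC (Looker included) can then query it like an ordinary table. For example, from spark-shell with a Hive-backed sqlContext (an illustrative query, not from the deck):

sqlContext.sql("select uri, count(*) as hits from logdata.event group by uri order by hits desc limit 10").show()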

Page 36: Analyzing Log Data with Spark and Looker

Live demo of streaming pipeline.

Creating a Pipeline with Spark

Page 37: Analyzing Log Data with Spark and Looker

Example streaming application can be found here: https://github.com/looker/spark_log_data

Creating a Pipeline with Spark

Page 38: Analyzing Log Data with Spark and Looker

Analyzing Log Data with Looker Demo


Page 39: Analyzing Log Data with Spark and Looker

Q&A


Page 40: Analyzing Log Data with Spark and Looker

THANK YOU FOR JOINING

Recording and slides will be posted.

We will email you the links tomorrow.

See you next time!

Next from Looker Webinars: Data Stack Considerations: Build vs. Buy at Tout, on June 2.

See how Looker works on your data.

Visit looker.com/free-trial or email [email protected].

Page 41: Analyzing Log Data with Spark and Looker

THANK YOU!