
May 18, 2016

Analyzing Log Data with Spark and Looker

Housekeeping

• We will do Q&A at the end.

• You should see a box on the right side of your screen.

• There is a button marked “Q&A” on the bottom menu.

• We will be recording this webinar.

• We will send you the recording tomorrow.

• We will also send you the slides tomorrow.

Meet Our Presenters

Scott Hoover, Data Scientist, Looker

Daniel Mintz, Chief Data Evangelist, Looker

AGENDA

1. Overview

2. Quick intro to Looker

3. Creating a pipeline with Spark

4. Analyzing log data

5. Deriving insight from log data

6. Q&A

Looker makes it easy for everyone to find, explore and understand the data that drives your business.

THE TECHNICAL PILLARS THAT MAKE IT POSSIBLE

100% In Database
● Leverage all your data
● Avoid summarizing or moving it

Modern Web Architecture
● Access from anywhere
● Share and collaborate
● Extend to anyone

LookML Intelligent Modeling Layer
● Describe the data
● Create reusable and shareable business logic

Creating a Pipeline with Spark

Who is this webinar for?

● You have some exposure to Spark.

● You have log data (with an interest in processing it at scale).

● Terminal doesn’t terrify you.

● You know some SQL.

Creating a Pipeline with Spark

Alright—just in case: What is Spark?

● In-memory, distributed compute framework.

● Drivers to access Spark: spark-shell, JDBC, submit application jar.

● Core concepts:
  ○ Resilient distributed dataset (RDD)
  ○ Dataframe
  ○ Discretized Stream (DStream)
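These three abstractions come up repeatedly below, so here is a minimal sketch of each (ours, not from the deck). It assumes a spark-shell session where sc and sqlContext are predefined; the HDFS path and localhost:9999 are placeholders.

// RDD: a partitioned, immutable collection processed in parallel
val linesRdd = sc.textFile("hdfs://some.endpoint:8020/path/to/access.log")

// DataFrame: an RDD with a schema, queryable like a table
import sqlContext.implicits._
val linesDf = linesRdd.toDF("raw_line")
linesDf.printSchema()

// DStream: a stream of RDDs produced on a fixed batch interval
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lineStream = ssc.socketTextStream("localhost", 9999)
lineStream.print()   // (ssc.start() would begin the computation; omitted here)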

Creating a Pipeline with Spark

Why is Spark of interest to us?

● APIs galore: Java, Scala, Python and R.

● Rich ecosystem of applications:
  ○ Spark SQL
  ○ Streaming
  ○ MLlib
  ○ GraphX

● Hooks into the broader Hadoop ecosystem:
  ○ Filesystem: S3, HDFS
  ○ Movement: Kafka, Flume, Kinesis
  ○ Database: HBase, Cassandra and JDBC

● Learning curve is relatively forgiving.

● Looker can issue SQL to Spark via JDBC!
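The deck does not show the JDBC hookup itself. As a rough sketch of what "SQL over JDBC" means here (ours, not the presenters'), the snippet below assumes a Spark Thrift Server is running on localhost:10000 with the Hive JDBC driver on the classpath; Looker's Spark connection points at the same kind of endpoint.

import java.sql.DriverManager

// the Thrift Server speaks the HiveServer2 protocol, so the hive2 JDBC URL applies
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()

// any SQL a client generates can be issued the same way
val rs = stmt.executeQuery("show tables")
while (rs.next()) println(rs.getString(1))

conn.close()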

Creating a Pipeline with Spark

Let’s review a couple of proposed architectures.

[Diagram: batch pipeline — Data Sources → Ingestion (RPC/REST/pull, RPC/put) → Storage (raw logs/text and Parquet/ORC) → Compute (MapReduce over raw logs, metastore, external table) → BI/Viz./Analytics over JDBC.]

Creating a Pipeline with Spark

[Diagram: streaming pipeline — Data Sources → Ingestion (RPC/REST/pull, RPC/put of raw logs) → Stream → Storage (Parquet/ORC) → Compute (metastore, external table) → BI/Viz./Analytics over JDBC.]

Sidebar: Why the emphasis on Parquet and ORC?
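The short answer (discussed verbally, not spelled out on the slide): Parquet and ORC are compressed, columnar formats, so a query engine can read only the columns and row groups it needs instead of re-parsing raw text on every query. A quick way to see the effect, assuming a DataFrame of parsed log lines named logDf (hypothetical name) and a spark-shell sqlContext; the path is a placeholder:

// raw text: every query re-reads and re-parses whole lines
// columnar parquet: column pruning and predicate pushdown do most of the work
logDf.write.mode("overwrite").parquet("hdfs://some.endpoint:8020/tmp/logdata.parquet")

val parquetDf = sqlContext.read.parquet("hdfs://some.endpoint:8020/tmp/logdata.parquet")
parquetDf
  .select("status", "uri")      // only these columns are read from disk
  .filter("status = '500'")     // predicate pushed down to the parquet reader
  .count()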

Creating a Pipeline with Spark

Let’s step through the essential components of this pipeline.

Creating a Pipeline with Spark

# name the components of agent
agent.sources = terminal
agent.sinks = logger spark
agent.channels = memory1 memory2

# describe source
agent.sources.terminal.type = exec
agent.sources.terminal.command = tail -f /home/hadoop/generator/logs/access.log

# describe logger sink
agent.sinks.logger.type = logger

# describe spark sink
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = localhost
agent.sinks.spark.port = 9988

# channel buffers events in memory (used with logger sink)
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 10000
agent.channels.memory1.transactionCapacity = 1000

# channel buffers events in memory (used with spark sink)
agent.channels.memory2.type = memory
agent.channels.memory2.capacity = 10000
agent.channels.memory2.transactionCapacity = 1000

# tie source and sinks with respective channels
agent.sources.terminal.channels = memory1 memory2
agent.sinks.logger.channel = memory1
agent.sinks.spark.channel = memory2
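This is a standard Flume agent definition: the exec source tails the generated access log, and the SparkSink exposes buffered events for Spark Streaming's polling Flume receiver. Assuming the file is saved as, say, spark-sink.conf (hypothetical name), the agent would typically be launched with Flume's CLI, where the --name value must match the property prefix used above:

flume-ng agent --conf conf --conf-file spark-sink.conf --name agent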

Creating a Pipeline with Spark

2.174.143.4 - - [09/May/2016:05:56:03 +0000] "GET /department HTTP/1.1" 200 1226 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2)" "USER=0;NUM=9"

Creating a Pipeline with Spark

scala> case class Foo(bar: String)

scala> val instanceOfFoo = Foo("i am a bar")

scala> instanceOfFoo.bar
res0: String = i am a bar

Creating a Pipeline with Spark

case class LogLine(
  ip_address: String,
  identifier: String,
  user_id: String,
  created_at: String,
  method: String,
  uri: String,
  protocol: String,
  status: String,
  size: String,
  referer: String,
  agent: String,
  user_meta_info: String
)

Creating a Pipeline with Spark

val ip_address     = "([0-9\\.]+)"           // 2.174.143.4
val identifier     = "(\\S+)"                // -
val user_id        = "(\\S+)"                // -
val created_at     = "(?:\\[)(.*)(?:\\])"    // 09/May/2016:05:56:03 +0000
val method         = "(?:\")([A-Z]+)"        // GET
val uri            = "(\\S+)"                // /department
val protocol       = "(\\S+)(?:\")"          // HTTP/1.1
val status         = "(\\S+)"                // 200
val size           = "(\\S+)"                // 1226
val referer        = "(?:\")(\\S+)(?:\")"    // -
val agent          = "(?:\")([^\"]*)(?:\")"  // Mozilla/4.0 ...
val user_meta_info = "(.*)"                  // USER=0;NUM=9

val lineMatch = s"$ip_address\\s+$identifier\\s+$user_id\\s+$created_at\\s+$method\\s+$uri\\s+$protocol\\s+$status\\s+$size\\s+$referer\\s+$agent\\s+$user_meta_info".r
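As a quick sanity check (ours, not from the deck), the compiled pattern can be exercised in the spark-shell against the sample line shown earlier; the user-agent portion is abbreviated for readability:

scala> val sample = "2.174.143.4 - - [09/May/2016:05:56:03 +0000] " +
     |   "\"GET /department HTTP/1.1\" 200 1226 \"-\" \"Mozilla/4.0 (compatible; ...)\" \"USER=0;NUM=9\""

scala> lineMatch.findFirstMatchIn(sample).map(_.group(1))   // ip_address
res0: Option[String] = Some(2.174.143.4)

scala> lineMatch.findFirstMatchIn(sample).map(_.group(6))   // uri
res1: Option[String] = Some(/department)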

Creating a Pipeline with Spark

def extractValues(line: String): Option[LogLine] = {
  line match {
    case lineMatch(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info, _*) =>
      Option(LogLine(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info))
    case _ => None
  }
}

Creating a Pipeline with Spark

def main(argv: Array[String]): Unit = {

  // create streaming context
  val ssc = new StreamingContext(conf, Seconds(batchDuration))

  // ingest data
  val stream = FlumeUtils.createPollingStream(ssc, host, port)

  // extract message body from avro-serialized flume events
  val mapStream = stream.map(event => new String(event.event.getBody().array()))

  // parse log into LogLine class and write parquet to hdfs/s3
  mapStream.foreachRDD { rdd =>
    rdd.map { line => extractValues(line).get }.toDF()  // convert rdd to dataframe with schema
      .coalesce(1)          // consolidate to single partition
      .write                // serialize
      .format("parquet")    // write parquet
      .mode("append")       // don’t overwrite files; append instead
      .save("hdfs://some.endpoint:8020/path/to/logdata.parquet")
  }

  // start the streaming job and block until it terminates
  ssc.start()
  ssc.awaitTermination()
}
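One design note on the parsing step: extractValues(line).get throws if a malformed line fails the regex and yields None, which fails the batch. A slightly safer variant (ours, not the presenters') is to flatMap over the Option so unparseable lines are simply dropped:

// Option behaves as a zero-or-one element collection here,
// so lines that fail to parse disappear from the RDD
rdd.flatMap(line => extractValues(line)).toDF()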

Creating a Pipeline with Spark

create external table logdata.event (
    ip_address string
  , identifier string
  , user_id string
  , created_at timestamp
  , method string
  , uri string
  , protocol string
  , status int
  , size int
  , referer string
  , agent string
  , user_meta_info string
)
stored as parquet
location 'hdfs://some.endpoint:8020/path/to/logdata.parquet';
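With the external table registered in the metastore, anything that speaks SQL over JDBC, Looker included, can query it. As an illustrative example (ours, not from the deck) of the kind of question explored in the demo, issued here through Spark SQL and assuming a Hive-enabled sqlContext as in a spark-shell built with Hive support:

// server-error counts by uri; column names come from the DDL above
val errorRates = sqlContext.sql("""
  select
      uri
    , count(*)                                        as requests
    , sum(case when status >= 500 then 1 else 0 end)  as server_errors
  from logdata.event
  group by uri
  order by server_errors desc
""")
errorRates.show()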

Creating a Pipeline with Spark

Live demo of streaming pipeline.

Creating a Pipeline with Spark

Example streaming application can be found here: https://github.com/looker/spark_log_data

Creating a Pipeline with Spark

Analyzing Log Data with Looker Demo


Q&A


THANK YOU FOR JOINING

Recording and slides will be posted.

We will email you the links tomorrow.

See you next time!

Next from Looker Webinars: Data Stack Considerations: Build vs. Buy at Tout, on June 2.

See how Looker works on your data.

Visit looker.com/free-trial or email discover@looker.com.

THANK YOU!
