
May 18, 2016

Analyzing Log Data with Spark and Looker

Housekeeping

• We will do Q&A at the end.

• You should see a box on the right side of your screen.

• There is a button marked “Q&A” on the bottom menu.

• We will be recording this webinar.

• We will send you the recording tomorrow.

• We will also send you the slides tomorrow.

Meet Our Presenters

Scott Hoover, Data Scientist, Looker

Daniel Mintz, Chief Data Evangelist, Looker

AGENDA

1. Overview

2. Quick intro to Looker

3. Creating a pipeline with Spark

4. Analyzing log data

5. Deriving insight from log data

6. Q&A

Looker makes it easy for everyone to find, explore and understand the data that drives your business.

THE TECHNICAL PILLARS THAT MAKE IT POSSIBLE

100% In Database
● Leverage all your data
● Avoid summarizing or moving it

Modern Web Architecture
● Access from anywhere
● Share and collaborate
● Extend to anyone

LookML Intelligent Modeling Layer
● Describe the data
● Create reusable and shareable business logic

Creating a Pipeline with Spark

Who is this webinar for?

● You have some exposure to Spark.

● You have log data (with an interest in processing it at scale).

● Terminal doesn’t terrify you.

● You know some SQL.

Creating a Pipeline with Spark

Alright—just in case: What is Spark?

● In-memory, distributed compute framework.

● Drivers to access Spark: spark-shell, JDBC, submit application jar.

● Core concepts:
  ○ Resilient distributed dataset (RDD)
  ○ Dataframe
  ○ Discretized Stream (DStream)
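These three abstractions come up repeatedly below, so here is a minimal sketch of each (ours, not from the deck). It assumes a spark-shell session where sc and sqlContext are predefined; the HDFS path and localhost:9999 are placeholders.

// RDD: a partitioned, immutable collection processed in parallel
val linesRdd = sc.textFile("hdfs://some.endpoint:8020/path/to/access.log")

// DataFrame: an RDD with a schema, queryable like a table
import sqlContext.implicits._
val linesDf = linesRdd.toDF("raw_line")
linesDf.printSchema()

// DStream: a stream of RDDs produced on a fixed batch interval
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10))
val lineStream = ssc.socketTextStream("localhost", 9999)
lineStream.print()   // (ssc.start() would begin the computation; omitted here)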

Creating a Pipeline with Spark

Why is Spark of interest to us?

● APIs galore: Java, Scala, Python and R.

● Rich ecosystem of applications:
  ○ Spark SQL
  ○ Streaming
  ○ MLlib
  ○ GraphX

● Hooks into the broader Hadoop ecosystem:
  ○ Filesystem: S3, HDFS
  ○ Movement: Kafka, Flume, Kinesis
  ○ Database: HBase, Cassandra and JDBC

● Learning curve is relatively forgiving.

● Looker can issue SQL to Spark via JDBC!
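The deck does not show the JDBC hookup itself. As a rough sketch of what "SQL over JDBC" means here (ours, not the presenters'), the snippet below assumes a Spark Thrift Server is running on localhost:10000 with the Hive JDBC driver on the classpath; Looker's Spark connection points at the same kind of endpoint.

import java.sql.DriverManager

// the Thrift Server speaks the HiveServer2 protocol, so the hive2 JDBC URL applies
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
val stmt = conn.createStatement()

// any SQL a client generates can be issued the same way
val rs = stmt.executeQuery("show tables")
while (rs.next()) println(rs.getString(1))

conn.close()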

Creating a Pipeline with Spark

Let’s review a couple of proposed architectures.

[Diagram: batch pipeline — Data Sources → Ingestion (RPC/REST/pull, RPC/put) → Storage (raw logs/text and Parquet/ORC) → Compute (MapReduce over raw logs, metastore, external table) → BI/Viz./Analytics over JDBC.]

Creating a Pipeline with Spark

[Diagram: streaming pipeline — Data Sources → Ingestion (RPC/REST/pull, RPC/put of raw logs) → Stream → Storage (Parquet/ORC) → Compute (metastore, external table) → BI/Viz./Analytics over JDBC.]

Sidebar: Why the emphasis on Parquet and ORC?
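The short answer (discussed verbally, not spelled out on the slide): Parquet and ORC are compressed, columnar formats, so a query engine can read only the columns and row groups it needs instead of re-parsing raw text on every query. A quick way to see the effect, assuming a DataFrame of parsed log lines named logDf (hypothetical name) and a spark-shell sqlContext; the path is a placeholder:

// raw text: every query re-reads and re-parses whole lines
// columnar parquet: column pruning and predicate pushdown do most of the work
logDf.write.mode("overwrite").parquet("hdfs://some.endpoint:8020/tmp/logdata.parquet")

val parquetDf = sqlContext.read.parquet("hdfs://some.endpoint:8020/tmp/logdata.parquet")
parquetDf
  .select("status", "uri")      // only these columns are read from disk
  .filter("status = '500'")     // predicate pushed down to the parquet reader
  .count()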

Creating a Pipeline with Spark

Let’s step through the essential components of this pipeline.

Creating a Pipeline with Spark

# name the components of agent
agent.sources = terminal
agent.sinks = logger spark
agent.channels = memory1 memory2

# describe source
agent.sources.terminal.type = exec
agent.sources.terminal.command = tail -f /home/hadoop/generator/logs/access.log

# describe logger sink
agent.sinks.logger.type = logger

# describe spark sink
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = localhost
agent.sinks.spark.port = 9988

# channel buffers events in memory (used with logger sink)
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 10000
agent.channels.memory1.transactionCapacity = 1000

# channel buffers events in memory (used with spark sink)
agent.channels.memory2.type = memory
agent.channels.memory2.capacity = 10000
agent.channels.memory2.transactionCapacity = 1000

# tie source and sinks with respective channels
agent.sources.terminal.channels = memory1 memory2
agent.sinks.logger.channel = memory1
agent.sinks.spark.channel = memory2
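This is a standard Flume agent definition: the exec source tails the generated access log, and the SparkSink exposes buffered events for Spark Streaming's polling Flume receiver. Assuming the file is saved as, say, spark-sink.conf (hypothetical name), the agent would typically be launched with Flume's CLI, where the --name value must match the property prefix used above:

flume-ng agent --conf conf --conf-file spark-sink.conf --name agent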

Creating a Pipeline with Spark

2.174.143.4 - - [09/May/2016:05:56:03 +0000] "GET /department HTTP/1.1" 200 1226 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2)" "USER=0;NUM=9"

Creating a Pipeline with Spark

scala> case class Foo(bar: String)

scala> val instanceOfFoo = Foo("i am a bar")

scala> instanceOfFoo.bar
res0: String = i am a bar

Creating a Pipeline with Spark

case class LogLine(
  ip_address: String,
  identifier: String,
  user_id: String,
  created_at: String,
  method: String,
  uri: String,
  protocol: String,
  status: String,
  size: String,
  referer: String,
  agent: String,
  user_meta_info: String
)

Creating a Pipeline with Spark

val ip_address     = "([0-9\\.]+)"           // 2.174.143.4
val identifier     = "(\\S+)"                // -
val user_id        = "(\\S+)"                // -
val created_at     = "(?:\\[)(.*)(?:\\])"    // 09/May/2016:05:56:03 +0000
val method         = "(?:\")([A-Z]+)"        // GET
val uri            = "(\\S+)"                // /department
val protocol       = "(\\S+)(?:\")"          // HTTP/1.1
val status         = "(\\S+)"                // 200
val size           = "(\\S+)"                // 1226
val referer        = "(?:\")(\\S+)(?:\")"    // -
val agent          = "(?:\")([^\"]*)(?:\")"  // Mozilla/4.0 ...
val user_meta_info = "(.*)"                  // USER=0;NUM=9

val lineMatch = s"$ip_address\\s+$identifier\\s+$user_id\\s+$created_at\\s+$method\\s+$uri\\s+$protocol\\s+$status\\s+$size\\s+$referer\\s+$agent\\s+$user_meta_info".r
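As a quick sanity check (ours, not from the deck), the compiled pattern can be exercised in the spark-shell against the sample line shown earlier; the user-agent portion is abbreviated for readability:

scala> val sample = "2.174.143.4 - - [09/May/2016:05:56:03 +0000] " +
     |   "\"GET /department HTTP/1.1\" 200 1226 \"-\" \"Mozilla/4.0 (compatible; ...)\" \"USER=0;NUM=9\""

scala> lineMatch.findFirstMatchIn(sample).map(_.group(1))   // ip_address
res0: Option[String] = Some(2.174.143.4)

scala> lineMatch.findFirstMatchIn(sample).map(_.group(6))   // uri
res1: Option[String] = Some(/department)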

Creating a Pipeline with Spark

def extractValues(line: String): Option[LogLine] = {
  line match {
    case lineMatch(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info, _*) =>
      Option(LogLine(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info))
    case _ => None
  }
}

Creating a Pipeline with Spark

def main(argv: Array[String]): Unit = {

  // create streaming context
  val ssc = new StreamingContext(conf, Seconds(batchDuration))

  // ingest data
  val stream = FlumeUtils.createPollingStream(ssc, host, port)

  // extract message body from avro-serialized flume events
  val mapStream = stream.map(event => new String(event.event.getBody().array()))

  // parse log into LogLine class and write parquet to hdfs/s3
  mapStream.foreachRDD { rdd =>
    rdd.map { line => extractValues(line).get }.toDF()  // convert rdd to dataframe with schema
      .coalesce(1)          // consolidate to single partition
      .write                // serialize
      .format("parquet")    // write parquet
      .mode("append")       // don’t overwrite files; append instead
      .save("hdfs://some.endpoint:8020/path/to/logdata.parquet")
  }

  // start the streaming job and block until it terminates
  ssc.start()
  ssc.awaitTermination()
}
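One design note on the parsing step: extractValues(line).get throws if a malformed line fails the regex and yields None, which fails the batch. A slightly safer variant (ours, not the presenters') is to flatMap over the Option so unparseable lines are simply dropped:

// Option behaves as a zero-or-one element collection here,
// so lines that fail to parse disappear from the RDD
rdd.flatMap(line => extractValues(line)).toDF()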

Creating a Pipeline with Spark

create external table logdata.event (
    ip_address string
  , identifier string
  , user_id string
  , created_at timestamp
  , method string
  , uri string
  , protocol string
  , status int
  , size int
  , referer string
  , agent string
  , user_meta_info string
)
stored as parquet
location 'hdfs://some.endpoint:8020/path/to/logdata.parquet';
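With the external table registered in the metastore, anything that speaks SQL over JDBC, Looker included, can query it. As an illustrative example (ours, not from the deck) of the kind of question explored in the demo, issued here through Spark SQL and assuming a Hive-enabled sqlContext as in a spark-shell built with Hive support:

// server-error counts by uri; column names come from the DDL above
val errorRates = sqlContext.sql("""
  select
      uri
    , count(*)                                        as requests
    , sum(case when status >= 500 then 1 else 0 end)  as server_errors
  from logdata.event
  group by uri
  order by server_errors desc
""")
errorRates.show()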

Creating a Pipeline with Spark

Live demo of streaming pipeline.

Creating a Pipeline with Spark

Example streaming application can be found here: https://github.com/looker/spark_log_data

Creating a Pipeline with Spark

Analyzing Log Data with Looker Demo


Q&A


THANK YOU FOR JOINING

Recording and slides will be posted.

We will email you the links tomorrow.

See you next time!

Next from Looker Webinars: Data Stack Considerations: Build vs. Buy at Tout, on June 2.

See how Looker works on your data.

Visit looker.com/free-trial or email discover@looker.com.

THANK YOU!
