Analyzing Log Data with Spark and Looker (May 18, 2016)


TRANSCRIPT

Page 1: Analyzing Log Data with Spark and Looker

May 18, 2016

Analyzing Log Data with Spark and Looker

Page 2: Analyzing Log Data with Spark and Looker

Housekeeping

• We will do Q&A at the end.

• You should see a box on the right side of your screen.

• There is a button marked “Q&A” on the bottom menu.

• We will be recording this webinar.

• We will send you the recording tomorrow.

• We will also send you the slides tomorrow.


Page 3: Analyzing Log Data with Spark and Looker

Meet Our Presenters

Scott Hoover, Data Scientist, Looker

Daniel Mintz, Chief Data Evangelist, Looker

Page 4: Analyzing Log Data with Spark and Looker


AGENDA

1. Overview
2. Quick intro to Looker
3. Creating a pipeline with Spark
4. Analyzing Log Data
5. Deriving insight from Log Data
6. Q&A

Page 5: Analyzing Log Data with Spark and Looker

Looker makes it easy for everyone to find, explore, and understand the data that drives your business.


Page 6: Analyzing Log Data with Spark and Looker

THE TECHNICAL PILLARS THAT MAKE IT POSSIBLE

100% In-Database
Leverage all your data. Avoid summarizing or moving it.

Modern Web Architecture
Access from anywhere. Share and collaborate. Extend to anyone.

LookML Intelligent Modeling Layer
Describe the data. Create reusable and shareable business logic.


Page 7: Analyzing Log Data with Spark and Looker


Creating a Pipeline with Spark

Page 12: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

Who is this webinar for?

● You have some exposure to Spark.

● You have log data (with an interest in processing it at scale).

● Terminal doesn’t terrify you.

● You know some SQL.

Page 16: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

Alright—just in case: What is Spark?

● In-memory, distributed compute framework.

● Drivers to access Spark: spark-shell, JDBC, submit application jar.

● Core concepts:

  ○ Resilient distributed dataset (RDD)
  ○ DataFrame
  ○ Discretized stream (DStream)
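
To make those concepts concrete, here is a minimal sketch (not from the deck), assuming a Spark 1.6-era spark-shell where sc and sqlContext are predefined; all names below are illustrative.

import sqlContext.implicits._

// RDD: a distributed collection of raw values
val linesRdd = sc.parallelize(Seq("GET /department", "GET /home"))

// DataFrame: the same data with a schema, queryable via Spark SQL
case class Request(method: String, uri: String)
val requestsDf = linesRdd.map(_.split(" ")).map(parts => Request(parts(0), parts(1))).toDF()
requestsDf.show()

// DStream: a sequence of RDDs arriving over time (sketch only; needs a StreamingContext)
// val ssc = new StreamingContext(sc, Seconds(10))
// val requestStream = ssc.socketTextStream("localhost", 9999)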

Page 22: Analyzing Log Data with Spark and Looker

Why is Spark of interest to us?

● APIs galore: Java, Scala, Python and R.

● Rich ecosystem of applications:

  ○ Spark SQL
  ○ Streaming
  ○ MLlib
  ○ GraphX

● Hooks into the broader Hadoop ecosystem:

  ○ Filesystem: S3, HDFS
  ○ Movement: Kafka, Flume, Kinesis
  ○ Database: HBase, Cassandra, and JDBC

● Learning curve is relatively forgiving.

● Looker can issue SQL to Spark via JDBC!

Creating a Pipeline with Spark
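
On that last point, the sketch below (not from the deck) shows roughly what a JDBC client such as Looker does against Spark SQL. It assumes the Spark Thrift Server is running on localhost:10000 with the Hive JDBC driver on the classpath, and it queries the logdata.event table defined later in this webinar; the object name and query are illustrative.

import java.sql.DriverManager

object SparkJdbcSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // The Spark Thrift Server speaks the HiveServer2 protocol, so a plain Hive JDBC URL works.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("select status, count(*) as hits from logdata.event group by status")
    while (rs.next()) println(s"${rs.getInt("status")} -> ${rs.getLong("hits")}")
    conn.close()
  }
}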

Page 23: Analyzing Log Data with Spark and Looker

Let’s review a couple of proposed architectures.

Creating a Pipeline with Spark

Page 24: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

[Architecture diagram. Stages: Data Sources → Ingestion (RPC/REST/Pull, RPC/Put, stream) → Storage (raw logs/text) → Compute (MapReduce over raw logs, producing Parquet/ORC registered as an external table in the metastore) → BI/Viz./Analytics via JDBC.]

Page 25: Analyzing Log Data with Spark and Looker

Creating a Pipeline with Spark

[Architecture diagram (streaming variant). Stages: Data Sources → Ingestion (RPC/REST/Pull, RPC/Put of raw logs) → streaming Compute writes Parquet/ORC to Storage, registered as an external table in the metastore → BI/Viz./Analytics via JDBC.]

Page 26: Analyzing Log Data with Spark and Looker

Sidebar: Why the emphasis on Parquet and ORC?

Creating a Pipeline with Spark

Page 27: Analyzing Log Data with Spark and Looker

Let’s step through the essential components of this pipeline.

Creating a Pipeline with Spark

Page 28: Analyzing Log Data with Spark and Looker

# name the components of agent
agent.sources = terminal
agent.sinks = logger spark
agent.channels = memory1 memory2

# describe source
agent.sources.terminal.type = exec
agent.sources.terminal.command = tail -f /home/hadoop/generator/logs/access.log

# describe logger sink
agent.sinks.logger.type = logger

# describe spark sink
agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
agent.sinks.spark.hostname = localhost
agent.sinks.spark.port = 9988

# channel buffers events in memory (used with logger sink)
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 10000
agent.channels.memory1.transactionCapacity = 1000

# channel buffers events in memory (used with spark sink)
agent.channels.memory2.type = memory
agent.channels.memory2.capacity = 10000
agent.channels.memory2.transactionCapacity = 1000

# tie source and sinks with respective channels
agent.sources.terminal.channels = memory1 memory2
agent.sinks.logger.channel = memory1
agent.sinks.spark.channel = memory2

Creating a Pipeline with Spark
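
Not shown on the slide: an agent configured this way would typically be launched with Flume's own CLI, roughly flume-ng agent --conf conf --conf-file agent.conf --name agent, where the --name value must match the property prefix used above (agent) and the Spark sink's hostname and port (localhost:9988) must match what the streaming job polls. The file name agent.conf is illustrative.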

Page 29: Analyzing Log Data with Spark and Looker

2.174.143.4 - - [09/May/2016:05:56:03 +0000] "GET /department HTTP/1.1" 200 1226 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.3; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; InfoPath.2)" "USER=0;NUM=9"

Creating a Pipeline with Spark

Page 30: Analyzing Log Data with Spark and Looker

scala> case class Foo(bar: String)

scala> val instanceOfFoo = Foo("i am a bar")

scala> instanceOfFoo.bar
res0: String = i am a bar

Creating a Pipeline with Spark

Page 31: Analyzing Log Data with Spark and Looker

case class LogLine (
  ip_address: String,
  identifier: String,
  user_id: String,
  created_at: String,
  method: String,
  uri: String,
  protocol: String,
  status: String,
  size: String,
  referer: String,
  agent: String,
  user_meta_info: String
)

Creating a Pipeline with Spark

Page 32: Analyzing Log Data with Spark and Looker

val ip_address     = "([0-9\\.]+)"             // 2.174.143.4
val identifier     = "(\\S+)"                  // -
val user_id        = "(\\S+)"                  // -
val created_at     = "(?:\\[)(.*)(?:\\])"      // 09/May/2016:05:56:03 +0000
val method         = "(?:\")([A-Z]+)"          // GET
val uri            = "(\\S+)"                  // /department
val protocol       = "(\\S+)(?:\")"            // HTTP/1.1
val status         = "(\\S+)"                  // 200
val size           = "(\\S+)"                  // 1226
val referer        = "(?:\")(\\S+)(?:\")"      // -
val agent          = "(?:\")([^\"]*)(?:\")"    // Mozilla/4.0 ...
val user_meta_info = "(.*)"                    // USER=0;NUM=9

val lineMatch = s"$ip_address\\s+$identifier\\s+$user_id\\s+$created_at\\s+$method\\s+$uri\\s+$protocol\\s+$status\\s+$size\\s+$referer\\s+$agent\\s+$user_meta_info".r

Creating a Pipeline with Spark
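
As a quick sanity check (not on the slide), the pattern can be exercised in spark-shell against the sample line shown a few slides back, assuming the vals above have already been pasted in; the user-agent string is shortened here for readability.

val sample = "2.174.143.4 - - [09/May/2016:05:56:03 +0000] " +
  "\"GET /department HTTP/1.1\" 200 1226 \"-\" \"Mozilla/4.0 (compatible; MSIE 8.0)\" \"USER=0;NUM=9\""

sample match {
  case lineMatch(ip, _, _, ts, method, uri, _*) => println(s"$ip requested $uri via $method at $ts")
  case _                                        => println("no match")
}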

Page 33: Analyzing Log Data with Spark and Looker

def extractValues(line: String): Option[LogLine] = {
  line match {
    case lineMatch(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info, _*) =>
      return Option(LogLine(ip_address, identifier, user_id, created_at, method, uri, ..., user_meta_info))
    case _ => None
  }
}

Creating a Pipeline with Spark

Page 34: Analyzing Log Data with Spark and Looker

def main(argv: Array[String]): Unit = {

  // create streaming context
  val ssc = new StreamingContext(conf, Seconds(batchDuration))

  // ingest data
  val stream = FlumeUtils.createPollingStream(ssc, host, port)

  // extract message body from avro-serialized flume events
  val mapStream = stream.map(event => new String(event.event.getBody().array()))

  // parse log into LogLine class and write parquet to hdfs/s3
  mapStream.foreachRDD { rdd =>
    rdd.map { line => extractValues(line).get }
      .toDF()             // convert rdd to dataframe with schema
      .coalesce(1)        // consolidate to single partition
      .write              // serialize
      .format("parquet")  // write parquet
      .mode("append")     // don’t overwrite files; append instead
      .save("hdfs://some.endpoint:8020/path/to/logdata.parquet")
  }

}

Creating a Pipeline with Spark
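
For completeness, none of the following appears on the slide: the snippet assumes an enclosing application object that defines conf, batchDuration, host, and port, pulls in the spark-streaming-flume integration, and (inside main) imports the SQL implicits that provide toDF. A rough skeleton with illustrative names and values might look like this; the real application also needs ssc.start() and ssc.awaitTermination() to run.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object LogDataStreamer {                 // illustrative name, not taken from the repo
  val conf          = new SparkConf().setAppName("logDataWebinar")
  val batchDuration = 10                 // seconds per micro-batch (illustrative)
  val host          = "localhost"        // must match the Flume Spark sink hostname
  val port          = 9988               //   ...and its port

  // Inside main, after creating ssc, the implicits that enable rdd.toDF():
  //   val sqlContext = new SQLContext(ssc.sparkContext)
  //   import sqlContext.implicits._

  // def main(argv: Array[String]): Unit = { ... }   // as shown above, plus ssc.start() / ssc.awaitTermination()
}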

Page 35: Analyzing Log Data with Spark and Looker

create external table logdata.event (
    ip_address string
  , identifier string
  , user_id string
  , created_at timestamp
  , method string
  , uri string
  , protocol string
  , status int
  , size int
  , referer string
  , agent string
  , user_meta_info string
)
stored as parquet
location 'hdfs://some.endpoint:8020/path/to/logdata.parquet';

Creating a Pipeline with Spark
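
Because the table is external, this DDL only registers metadata in the metastore and points it at the directory the streaming job writes to; the Parquet files themselves are left untouched. Any client that reaches Spark over JDBC (Looker included) can then query it like an ordinary table. For example, from spark-shell with a Hive-backed sqlContext (an illustrative query, not from the deck):

sqlContext.sql("select uri, count(*) as hits from logdata.event group by uri order by hits desc limit 10").show()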

Page 36: Analyzing Log Data with Spark and Looker

Live demo of streaming pipeline.

Creating a Pipeline with Spark

Page 37: Analyzing Log Data with Spark and Looker

Example streaming application can be found here: https://github.com/looker/spark_log_data

Creating a Pipeline with Spark

Page 38: Analyzing Log Data with Spark and Looker

Analyzing Log Data with Looker Demo


Page 39: Analyzing Log Data with Spark and Looker

Q&A


Page 40: Analyzing Log Data with Spark and Looker

THANK YOU FOR JOINING

Recording and slides will be posted.

We will email you the links tomorrow.

See you next time!

Next from Looker Webinars: Data Stack Considerations: Build vs. Buy at Tout, on June 2.

See how Looker works on your data.

Visit looker.com/free-trial or email [email protected].

Page 41: Analyzing Log Data with Spark and Looker

THANK YOU!