Kafka for DBAs

Upload: gwen-chen-shapira

Post on 18-Jul-2015

TRANSCRIPT

1© Cloudera, Inc. All rights reserved.

Apache Kafka for Oracle DBAs

What is Kafka

Why should you care

How to learn Kafka

About me

• Oracle DBA

• Turned Oracle Consultant

• Turned Hadoop Solutions Architect

• Turned Developer

Committer on Apache Sqoop

Contributor to Apache Kafka and Apache Flume

An Optical Illusion

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

We’ll talk about:

• Redo log as an abstraction

• How redo logs are useful

• Pub-sub message queues

• How message queues are useful

• What exactly is Kafka

• How do people use Kafka

• Where can you learn more

Redo Log:

The most crucial structure for recovery operations … store all changes made to the database as they occur.

Important Point

The redo log is the only reliable source of information about the current state of the database.

The redo log is used to:

• Recover a consistent state of the database

• Replicate the database (Data Guard, Streams, GoldenGate…)

• Update materialized logs (well, it’s a log anyway)

If you look far enough back into the archive logs, you can reconstruct the entire database.
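
The idea that a log alone is enough to rebuild state can be sketched in a few lines of Python. This is a toy model, not how Oracle or Kafka store data: the `Change` record, the `replay` function, and the `emp:` keys are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One redo-style record: set `key` to `value`, or delete the key when value is None."""
    key: str
    value: object


def replay(log, upto=None):
    """Rebuild state by applying changes in order.

    `upto` limits how many records are applied (None means the whole log).
    """
    state = {}
    for change in log[:upto]:
        if change.value is None:
            state.pop(change.key, None)  # a delete
        else:
            state[change.key] = change.value  # an insert or update
    return state


# Hypothetical log of changes, oldest first.
log = [
    Change("emp:1", "Alice"),
    Change("emp:2", "Bob"),
    Change("emp:1", "Alicia"),
    Change("emp:2", None),
]

current = replay(log)            # state after every change
as_of_two = replay(log, upto=2)  # state after only the first two changes
```

Replaying the whole log gives the current state; replaying a prefix gives the state as of an earlier point, which is roughly the idea behind recovering to a point in time.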

What if…

You built an entire data storage system that is just a transaction log?

Kafka can log

• Transactions from any database

• Clicks from websites

• Application logs (ERROR, WARN, INFO…)

• Metrics: CPU, memory, I/O

• Audit events

• And any system can read those logs: Hadoop, alerts, dashboards, databases.

Only one thing is missing

Q: How do you query a redo log?

A: Not very efficiently

Sometimes we just need the events – no need to query.

Other times, we need to load the results into a database.

While messages are in transit – we can do all kinds of transformations.

Publish-Subscribe Message Queue

Raise your hand if this sounds familiar

“My next project was to get a working Hadoop setup…

Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.”

--Jay Kreps, Kafka PMC

Data pipelines start like this.

[Diagram: one client and one source]

Then we reuse them.

[Diagram: four clients and one source]

Then we add consumers to the existing sources.

[Diagram: four clients, a backend, and another backend]

Then it starts to look like this.

[Diagram: four clients, a backend, and three more backends]

With maybe some of this.

[Diagram: four clients, a backend, and three more backends]

Queues decouple systems: Both statically and in time

This is where we are trying to get

Kafka decouples data pipelines

[Diagram: four source systems (producers) write to Kafka (brokers); Hadoop, security systems, real-time monitoring, and the data warehouse (consumers) read from it]

Important notes:

• Producers and Consumers don’t need to know about each other

• Performance issues on Consumers don’t impact Producers

• Consumers are protected from herds of Producers

• Lots of flexibility in handling load

• Messages are available to anyone: lots of new use cases, monitoring, audit, troubleshooting

http://www.slideshare.net/gwenshap/queues-pools-caches

So… What is Kafka?

Kafka provides a fast, distributed, highly scalable, highly available, publish-subscribe messaging system.

In turn this solves part of a much harder problem:

Communication and integration between components of large software systems

The Basics

• Messages are organized into topics

• Producers push messages

• Consumers pull messages

• Kafka runs in a cluster. Nodes are called brokers
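
A minimal in-memory sketch of these basics, assuming a single process and no persistence. `ToyBroker`, `ToyConsumer`, and the `clicks` topic are hypothetical names, not the Kafka client API. It shows producers pushing into a topic and consumers pulling at their own pace, each tracking its own position.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: each topic is an append-only list."""

    def __init__(self):
        self.topics = defaultdict(list)

    def push(self, topic, message):
        """Producer side: append the message and return its offset."""
        log = self.topics[topic]
        log.append(message)
        return len(log) - 1

    def pull(self, topic, offset, max_messages=10):
        """Consumer side: read from `offset` without removing anything."""
        return self.topics[topic][offset:offset + max_messages]


class ToyConsumer:
    """Each consumer remembers its own offset, independently of the others."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.broker.pull(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


broker = ToyBroker()
for i in range(5):
    broker.push("clicks", f"click-{i}")

fast = ToyConsumer(broker, "clicks")
slow = ToyConsumer(broker, "clicks")
fast_batch = fast.poll()                # reads everything available
slow_batch = slow.poll(max_messages=2)  # reads only two messages for now
```

Because reading does not delete anything, the slow consumer can come back later and pick up where it left off, and it never holds back the producer or the fast consumer.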

Topics, Partitions and Logs

Each partition is a log

Each broker has many partitions

[Diagram: three copies each of Partition 0, Partition 1, and Partition 2, spread across brokers]

Producers load balance between partitions

[Diagram: a client writing to Partition 0, Partition 1, and Partition 2, spread across brokers]
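
One way a producer might spread messages across partitions, sketched under assumptions: messages with a key always go to the same partition (hash of the key), and messages without a key are dealt out round-robin. `ToyPartitioner` is a hypothetical class, and `zlib.crc32` is chosen here only because it is deterministic; real Kafka clients ship their own partitioners and hash functions.

```python
import itertools
import zlib


class ToyPartitioner:
    """Pick a partition number for each outgoing message."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            # No key: spread the load evenly, one partition after another.
            return next(self._round_robin)
        # Keyed: the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions


partitioner = ToyPartitioner(3)
keyless = [partitioner.partition() for _ in range(6)]
same_key = {partitioner.partition("customer-42") for _ in range(5)}
```

Keeping one key on one partition matters when the order of events for that key matters, since each partition is its own log.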

Consumers

Why is Kafka better than other message queues?

• Can keep data forever

• Scales very well: high throughput, low latency, lots of storage

• Scales to any number of consumers

How do people use Kafka?

• As a message bus

• As a buffer for replication systems (like Advanced Queuing in Streams)

• As a reliable feed for event processing

• As a buffer for event processing

• To decouple apps from the database (both OLTP and DWH)

Need More Kafka?

• https://kafka.apache.org/documentation.html

• My video tutorial: http://shop.oreilly.com/product/0636920038603.do

• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/

• Try with Cloudera Manager: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_install.html

One more thing...

Schema is a MUST HAVE for data integration

Kafka only stores Bytes – So where’s the schema?

• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”

• There’s utility code for reading/writing messages that everyone reuses

• Schema embedded in the message

• A centralized repository for schemas

• Each message has Schema ID

• Each topic has Schema ID
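
The "centralized repository plus a schema ID in each message" option might look like the sketch below. Everything here is an assumption for illustration: `ToySchemaRegistry`, the 4-byte ID prefix, and the JSON payload are not a real registry's wire format. The point is only that Kafka sees bytes, while readers can use the ID to find out what those bytes mean.

```python
import json
import struct


class ToySchemaRegistry:
    """In-memory stand-in for a central schema repository."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema):
        """Return the ID for this schema, assigning a new one if it is unseen."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._by_id[schema_id]


def encode(schema_id, record):
    """Message bytes = 4-byte big-endian schema ID + JSON payload (assumed layout)."""
    return struct.pack(">I", schema_id) + json.dumps(record).encode("utf-8")


def decode(registry, message):
    """Read the ID prefix, fetch the schema, then parse the payload."""
    (schema_id,) = struct.unpack(">I", message[:4])
    schema = registry.lookup(schema_id)
    record = json.loads(message[4:].decode("utf-8"))
    return schema, record


registry = ToySchemaRegistry()
schema = {"name": "Employee", "fields": ["id", "name"]}  # hypothetical schema
schema_id = registry.register(schema)
message = encode(schema_id, {"id": 7, "name": "Alice"})
decoded_schema, decoded_record = decode(registry, message)
```

Compared with embedding the whole schema in every message, the ID adds only a few bytes per message, and nobody has to ask around what the 5th field means.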

I ♥ Avro

• Define Schema

• Generate code for objects

• Serialize / Deserialize into Bytes or JSON

• Embed schema in files / records… or not

• Support for our favorite languages… Except Go.

• Schema Evolution

• Add and remove fields without breaking anything
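
The "add and remove fields without breaking anything" idea can be sketched without any Avro library. This is not the Avro API; `resolve`, `REQUIRED`, and the field names are hypothetical. It only mimics the general approach: the reader projects each record onto the fields it expects, fills in defaults for fields the writer did not know about, and ignores fields it does not know about.

```python
REQUIRED = object()  # marker for reader fields that have no default


def resolve(record, reader_fields):
    """Project a written record onto the reader's view of the schema.

    `reader_fields` maps field name -> default value, or REQUIRED.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is REQUIRED:
            raise ValueError(f"missing required field: {name}")
        else:
            resolved[name] = default  # field added after this record was written
    return resolved  # fields unknown to the reader are simply dropped


# Version 1 has id and name; version 2 adds an optional email.
v1_reader = {"id": REQUIRED, "name": REQUIRED}
v2_reader = {"id": REQUIRED, "name": REQUIRED, "email": None}

v1_record = {"id": 7, "name": "Alice"}
v2_record = {"id": 8, "name": "Bob", "email": "bob@example.com"}

new_reads_old = resolve(v1_record, v2_reader)  # email filled with the default
old_reads_new = resolve(v2_record, v1_reader)  # email ignored
```

New readers can consume old messages and old readers can consume new ones, which is what lets producers and consumers be upgraded independently.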

Replicating from Oracle to Kafka? Don’t lose the schema!

Schemas are Agile

• Leave out MySQL and your favorite DBA for a second

• Schemas allow adding readers and writers easily

• Schemas allow modifying readers and writers independently

• Schemas can evolve as the system grows

• Schemas allow validating data soon after it’s written

• No need to throw away data that doesn’t fit!

Thank you