Accumulo Summit 2015: Using Fluo to Incrementally Process Data in Accumulo [API]


  • Using Fluo to incrementally process data in Accumulo

    Mike Walch

  • Problem: Maintain counts of inbound links

    [Slide diagram: an example graph of links among fluo.io, github.com, apache.org, and nytimes.com, alongside an example data table mapping each website to its number of inbound links]

  • Solution 1 - Maintain counts using batch processing

    [Slide diagram: a web crawler pulls pages from the Internet into a web cache; periodic MapReduce jobs turn the cache into a link count change log (+1/-1 entries per website), aggregate the changes into last-hour totals, and merge those with the historical counts to produce the latest inbound link counts]

  • Solution 2 - Maintain counts using Fluo

    [Slide diagram: the web crawler writes pages from the Internet into a web cache, and Fluo applies the +1/-1 link count updates directly to a Fluo table holding the per-website inbound link counts]

  • Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo

    [Slide chart: the long-tailed distribution of inbound links per website; heavily linked sites like nytimes.com are updated every hour using MapReduce, while the long tail of sites like github.com and fluo.io is updated in real time using Fluo]

  • Fluo 101 - Basics

    - Provides cross-row transactions and snapshot isolation, which make it safe to do concurrent updates (see the sketch after this slide)
    - Allows for incremental processing of data
    - Based on Google's Percolator paper
    - Started as a side project by Keith Turner in 2013
    - Originally called Accismus
    - Tested using synthetic workloads
    - Almost ready for production environments
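
    A minimal sketch of what a cross-row transaction looks like, using only the TypeLayer/TypedTransaction calls that appear on the Client API slide later in this deck; the rows, the CREDIT_COL column, and the transfer logic are hypothetical, made up only to illustrate that reads and writes spanning multiple rows commit atomically.

    // Hypothetical example: atomically move "credit" from one row to another
    public void transferCredit(FluoClient fluoClient, String fromRow, String toRow, int amount) {
      TypeLayer typeLayer = new TypeLayer(new StringEncoder());
      try (TypedTransaction tx = typeLayer.wrap(fluoClient.newTransaction())) {
        // snapshot-isolated reads across two rows; missing cells read as null
        String fromVal = tx.get().row(fromRow).col(CREDIT_COL).toString();
        String toVal = tx.get().row(toRow).col(CREDIT_COL).toString();
        int from = (fromVal == null) ? 0 : Integer.parseInt(fromVal);
        int to = (toVal == null) ? 0 : Integer.parseInt(toVal);

        // writes to both rows commit together or not at all
        tx.mutate().row(fromRow).col(CREDIT_COL).set(Integer.toString(from - amount));
        tx.mutate().row(toRow).col(CREDIT_COL).set(Integer.toString(to + amount));
        tx.commit();
      }
    }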

  • Fluo 101 - Accumulo vs Fluo

    - Fluo is a transactional API built on top of Accumulo
    - Fluo stores its data in Accumulo
    - Fluo uses Accumulo conditional mutations for transactions
    - Fluo has a table structure (row, column, value) similar to Accumulo, except Fluo has no timestamp
    - Each Fluo application runs its own processes:
      - The oracle allocates timestamps for transactions
      - Workers run user code (called observers) that perform transactions

  • Fluo 101 - Architecture

    [Slide diagram: each Fluo application (e.g. Fluo Application 1 and Fluo Application 2) runs its own oracle and worker processes in YARN, with the workers hosting that application's observers; each application stores its data in its own Accumulo table, backed by HDFS and coordinated through Zookeeper, while the Fluo clients for each application run on a separate client cluster]

  • Fluo 101 - Client API

    Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc.)

    public void addDocument(FluoClient fluoClient, String docId, String content) {
      TypeLayer typeLayer = new TypeLayer(new StringEncoder());
      try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
        // only add the document if it is not already present
        if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
          tx1.mutate().row(docId).col(CONTENT_COL).set(content);
          tx1.commit();
        }
      }
    }
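
    For context, a rough sketch (not from the slides) of how a client like the one above might be obtained and used; FluoFactory and FluoConfiguration are part of the Fluo client API, while the properties file name and the sample arguments are assumptions for illustration.

    // Sketch: create a FluoClient from a properties file and add a document
    FluoConfiguration config = new FluoConfiguration(new File("fluo.properties")); // file name assumed
    try (FluoClient fluoClient = FluoFactory.newClient(config)) {
      addDocument(fluoClient, "doc1", "my first hello world");
    }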

  • Fluo 101 - Observers

    - Developers can write observers that are triggered when a column is modified; observers are run by Fluo workers
    - Best practice: prefer doing work/transactions in observers rather than in client code

    public class DocumentObserver extends TypedObserver {

      @Override
      public void process(TypedTransactionBase tx, Bytes row, Column column) {
        // do work here
      }

      @Override
      public ObservedColumn getObservedColumn() {
        return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
      }
    }

  • Example Fluo Application

    - Problem: Maintain word & document counts in real time as documents are added to and deleted from Fluo

    - The Fluo client performs two actions:
      1. Add document to table
      2. Mark document for deletion

    - Which trigger two observers (sketched below):
      - Add Observer - increases word and document counts
      - Delete Observer - decreases counts and cleans up
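
    A rough sketch of what the Add Observer's logic might look like, using only the typed transaction calls shown on the earlier slides; the row naming (w : <word>, total : docs), the COUNT_COL constant, and the whitespace splitting are assumptions that mirror the tables on the following slides, not the actual example code.

    public class AddObserver extends TypedObserver {

      @Override
      public void process(TypedTransactionBase tx, Bytes row, Column column) {
        // read the newly added document content (the triggering column)
        String docId = row.toString();
        String content = tx.get().row(docId).col(CONTENT_COL).toString();
        if (content == null) {
          return;
        }

        // increment a count row ("w : <word>") for each word in the document
        for (String word : content.split(" ")) {
          String cur = tx.get().row("w:" + word).col(COUNT_COL).toString();
          int count = (cur == null) ? 0 : Integer.parseInt(cur);
          tx.mutate().row("w:" + word).col(COUNT_COL).set(Integer.toString(count + 1));
        }

        // increment the total document count ("total : docs")
        String total = tx.get().row("total:docs").col(COUNT_COL).toString();
        int docs = (total == null) ? 0 : Integer.parseInt(total);
        tx.mutate().row("total:docs").col(COUNT_COL).set(Integer.toString(docs + 1));
      }

      @Override
      public ObservedColumn getObservedColumn() {
        return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
      }
    }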

  • Add first document to table

    [Slide diagram: a Fluo client on the client cluster writes to the Fluo table, which the AddObserver and DeleteObserver watch]

    Fluo Table:

    Row      | Column | Value
    d : doc1 | doc    | my first hello world

  • An observer increments word counts

    Fluo Table:

    Row          | Column | Value
    d : doc1     | doc    | my first hello world
    w : first    | cnt    | 1
    w : hello    | cnt    | 1
    w : my       | cnt    | 1
    w : world    | cnt    | 1
    total : docs | cnt    | 1

  • A second document is added

    Fluo Table:

    Row          | Column | Value
    d : doc1     | doc    | my first hello world
    d : doc2     | doc    | second hello world
    w : first    | cnt    | 1
    w : hello    | cnt    | 2
    w : my       | cnt    | 1
    w : second   | cnt    | 1
    w : world    | cnt    | 2
    total : docs | cnt    | 2

  • First document is marked for deletion

    Fluo Table:

    Row          | Column | Value
    d : doc1     | doc    | my first hello world
    d : doc1     | delete |
    d : doc2     | doc    | second hello world
    w : first    | cnt    | 1
    w : hello    | cnt    | 2
    w : my       | cnt    | 1
    w : second   | cnt    | 1
    w : world    | cnt    | 2
    total : docs | cnt    | 2

  • Observer decrements counts and deletes document

    [Slide table: the DeleteObserver removes doc1 and its delete marker and decrements the word counts doc1 had contributed, leaving counts for the words of doc2 ("second hello world") and total : docs = 1]

  • Things to watch out for...

    - Collisions occur when two transactions update the same data at the same time
      - Only one transaction will succeed; the others need to be retried
      - A few collisions are OK, but too many can slow computation
      - Avoid collisions by not updating the same row/column in every transaction

    - Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing each other's update
      - The result is different than if the transactions were serialized
      - Prevent write skew by making both transactions update the same row/column; if they run concurrently, a collision will occur and only one transaction will succeed (see the sketch after this list)
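
    A minimal sketch of the shared-cell trick described above, using the typed transaction API from the earlier slides; the account rows, the BALANCE_COL and GUARD_COL columns, and the invariant (a combined balance that must stay non-negative) are hypothetical, chosen only to illustrate the pattern.

    // Two concurrent withdrawals that each read both balances but write only their own
    // row could break the invariant without ever conflicting (write skew). Having both
    // transactions also write a shared "guard" cell forces a collision, so only one commits.
    public void withdraw(FluoClient fluoClient, String accountRow, int amount) {
      TypeLayer typeLayer = new TypeLayer(new StringEncoder());
      try (TypedTransaction tx = typeLayer.wrap(fluoClient.newTransaction())) {
        // assumes both balances were seeded beforehand
        int a = Integer.parseInt(tx.get().row("account:A").col(BALANCE_COL).toString());
        int b = Integer.parseInt(tx.get().row("account:B").col(BALANCE_COL).toString());

        if (a + b - amount >= 0) {
          int cur = accountRow.equals("account:A") ? a : b;
          tx.mutate().row(accountRow).col(BALANCE_COL).set(Integer.toString(cur - amount));
          // the shared cell both withdrawals write; concurrent withdrawals now collide
          tx.mutate().row("account:guard").col(GUARD_COL).set(accountRow);
          tx.commit(); // if the other withdrawal committed first, this commit fails and is retried
        }
      }
    }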

  • How does Fluo fit in?

    [Slide chart: processing latency (slower to faster) vs. large join throughput (lower to higher); batch processing (MapReduce, Spark) offers the highest join throughput at the slowest latency, stream processing (Storm) the fastest latency at the lowest join throughput, and incremental processing (Fluo, Percolator) sits in between]

  • Don't use Fluo if...

    1. You want to do ad-hoc analysis on your data (use batch processing instead)
    2. Your incoming data is being joined with a small data set (use stream processing instead)

  • Use Fluo if...

    1. You want to maintain a large-scale computation using a series of small transactional updates
    2. Periodic batch processing jobs are taking too long to join new data with existing data

  • Fluo Application Lifecycle

    1. Use batch processing to seed the computation with historical data
    2. Use Fluo to process incoming data and maintain the computation in real time
    3. While processing, Fluo can be queried and notifications can be sent to users

  • Major Progress

    - 2010: Google releases the Percolator paper
    - 2013: Keith Turner starts work on a Percolator implementation for Accumulo as a side project (originally called Accismus)
    - 2014 - 2015: Fluo can process transactions; 1.0.0-alpha released; oracle and worker can be run in YARN; project name changed to Fluo; solidified the Fluo Client/Observer API; automated running a Fluo cluster on Amazon EC2; added multi-application support; improved how observer notifications are found; created the stress test; 1.0.0-beta releasing soon

  • Fluo Stress Test

    - Motivation: Needed a test that stresses Fluo and is easy to verify for correctness
    - The stress test computes the number of unique integers by building a bitwise trie
    - New integers are added at leaf nodes
    - Observers watch all nodes, create parents, and percolate totals up to the root node (see the sketch after this slide)
    - The test runs successfully if the count at the root is the same as the number of leaf nodes
    - Multiple transactions can operate on the same nodes, causing collisions

    [Slide diagram: an example trie whose leaf nodes are 4-bit integers; interior nodes hold the count of unique integers under their bit prefix (11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1) and the root xxxx holds the total of 5]
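
    A rough sketch of the percolation idea described above (not the actual fluo-stress code); the 4-bit node encoding, the COUNT_COL column, and the helper methods are assumptions made for illustration, while the observer and transaction calls mirror the earlier slides.

    public class NodeObserver extends TypedObserver {

      @Override
      public void process(TypedTransactionBase tx, Bytes row, Column column) {
        // a node row is a 4-character string like "1110", "11xx", or "xxxx" (assumed encoding)
        String node = row.toString();
        String parent = parentOf(node);
        if (parent == null) {
          return; // the root has no parent to update
        }

        // recompute the parent's count as the sum of its children's counts
        long sum = 0;
        for (String child : childrenOf(parent)) {
          String val = tx.get().row(child).col(COUNT_COL).toString();
          sum += (val == null) ? 0 : Long.parseLong(val);
        }

        // writing the parent's count triggers this observer on the parent,
        // percolating the total toward the root; concurrent updates to the
        // same parent collide and only one transaction commits
        tx.mutate().row(parent).col(COUNT_COL).set(Long.toString(sum));
      }

      @Override
      public ObservedColumn getObservedColumn() {
        return new ObservedColumn(COUNT_COL, NotificationType.STRONG);
      }

      // each trie level masks two more trailing bits with 'x'
      private String parentOf(String node) {
        int firstX = node.indexOf('x');
        int prefixLen = (firstX == -1) ? node.length() : firstX;
        if (prefixLen == 0) {
          return null; // already at the root ("xxxx")
        }
        return node.substring(0, prefixLen - 2) + "xxxx".substring(prefixLen - 2);
      }

      private String[] childrenOf(String node) {
        int firstX = node.indexOf('x');
        int prefixLen = (firstX == -1) ? node.length() : firstX;
        String[] bits = {"00", "01", "10", "11"};
        String[] children = new String[4];
        for (int i = 0; i < 4; i++) {
          String prefix = node.substring(0, prefixLen) + bits[i];
          children[i] = prefix + "xxxx".substring(prefix.length());
        }
        return children;
      }
    }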

  • Easy to run Fluo

    1. On a machine with Maven and Git, clone the fluo-dev and fluo repos
    2. Follow some basic configuration steps
    3. Run the following commands

    fluo-dev download   # Download Accumulo, Hadoop, Zookeeper tarballs
    fluo-dev setup      # Set up Accumulo, Hadoop, etc. locally
    fluo-dev deploy     # Build the Fluo distribution and deploy it locally
    fluo new myapp      # Create configuration for the myapp Fluo application
    fluo init myapp     # Initialize myapp in Zookeeper
    fluo start myapp    # Start the oracle and worker processes of myapp in YARN
    fluo scan myapp     # Print a snapshot of the data in myapp's Fluo table

    It's just as easy to run a Fluo cluster on Amazon EC2

  • Fluo Ecosystem

    - fluo - main project repo
    - fluo-quickstart - simple Fluo example
    - phrasecount - in-depth Fluo example
    - fluo-stress - stresses Fluo on a cluster
    - fluo-deploy - runs Fluo on an EC2 cluster
    - fluo-dev - helps developers run Fluo locally
    - fluo-io.github.io - Fluo project website

  • Future Direction

    - Primary focus: Ship a production-ready 1.0 release with a stable API
    - Other possible work:
      - Fluo-32: Real-world example application (possibly using CommonCrawl data)
      - Fluo-58: Support writing observers in Python
      - Fluo-290: Support running Fluo on Mesos
      - Fluo-478: Automatically scale Fluo workers up & down based on workload

  • Get involved!

    1. Experiment with Fluo
       - The API has stabilized
       - Tools and the development process make it easy
       - Not recommended for production yet (wait for 1.0)
    2. Contribute to Fluo
       - ~85 open issues on GitHub
       - Review-then-commit process