Accumulo Summit 2015: Accumulo In-Depth: Building Bulk Ingest [Sponsored]

Post on 15-Jul-2015

TRANSCRIPT

  • Building Bulk Import

    Eric Newton
    SW Complete, Inc.

  • Ingest 101

    The life of a mutation:
        Send to server
        Write to Write-Ahead Log
        Store in memory
        Write memory to a file
        Merge, re-write as needed

    At least two writes, small in-memory sort
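
The life of a mutation can be sketched as a toy model. This is a minimal illustration, not Accumulo code: the class `ToyTablet`, its method names, and the `flush_threshold` parameter are all hypothetical. It only shows why each mutation costs at least two writes (log, then file) and why the sort is limited to what fits in memory.

```python
class ToyTablet:
    """Toy write path: log append, in-memory store, flush to a sorted file, merge."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # write-ahead log: append-only, arrival order
        self.memory = {}         # in-memory store, sorted only when flushed
        self.files = []          # each "file" is a sorted list of (key, value)
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def apply(self, key, value):
        self.wal.append((key, value))   # write 1: the log
        self.memory[key] = value        # store in memory
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memory:
            # write 2: a small sort of whatever was in memory
            self.files.append(sorted(self.memory.items()))
            self.memory = {}

    def compact(self):
        merged = {}
        for f in self.files:            # oldest first, newer values win
            merged.update(f)
        # write 3..n: merge and re-write as needed
        self.files = [sorted(merged.items())] if merged else []
```

With `flush_threshold=2`, four mutations produce two small sorted files, and `compact()` rewrites them into one.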

  • Bulk Ingest

    Accumulo: heavy ingest
    Use Map-Reduce efficiency
    Pre-sort incoming data
    Hand whole sorted files to Accumulo
    One write
    Larger sorts

  • Version 1

    importDirectory(String dir, String failDir)
    Client computes the servers that need the file:
        Analysis of file
        Moves directory under Accumulo
        Retry logic (to handle splits, failures)
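
One way to picture the "analysis of file" step: compare the file's row range against the table's split points. The sketch below is an assumption-laden illustration, not the Accumulo client logic. It assumes each tablet covers rows up to and including its split point, uses only the file's first and last row, and the function name `tablets_for_file` is hypothetical.

```python
from bisect import bisect_left

def tablets_for_file(split_points, first_row, last_row):
    """Return indexes of tablets whose row range overlaps [first_row, last_row].

    Tablet i covers (split_points[i-1], split_points[i]]; the last tablet
    covers everything after the final split point. split_points must be sorted.
    """
    lo = bisect_left(split_points, first_row)   # tablet holding first_row
    hi = bisect_left(split_points, last_row)    # tablet holding last_row
    return list(range(lo, hi + 1))
```

For splits `["g", "n", "t"]`, a file spanning rows `"h"` to `"p"` maps to tablets 1 and 2. If a tablet splits between the analysis and the assignment, the answer is stale, which is why the retry logic is needed.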

  • Version 1: problems

    Limited to the client's computational power
    Permission: client had to be all-knowing
    Files could be added to servers many times
    Defer file collection while bulk importing
    Clients can fail

  • Version 1.1

    Clients hand bulk imports to the master
    Fixed permission problems
    Added a bulk import test to the Random Walk test suite

  • The Test

    Create a sorted file with a lot of 1s:
        12345678 -> 1
    Create an identical file, with lots of -1s:
        12345678 -> -1
    Add a summation iterator over the table
    Verify: every entry should be zero:
        12345678 -> 0
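
The idea behind the test can be sketched in a few lines. This is a stand-alone illustration with hypothetical helper names (`make_file`, `summed_scan`, `nonzero_entries`); the real test uses an Accumulo table with a summation iterator, which is modeled here as a plain per-key sum.

```python
from collections import defaultdict

def make_file(keys, value):
    """Build a sorted 'file' mapping every key to the same value."""
    return [(k, value) for k in sorted(keys)]

def summed_scan(files):
    """Model a summation iterator: add up all values seen for each key."""
    totals = defaultdict(int)
    for f in files:
        for key, value in f:
            totals[key] += value
    return dict(sorted(totals.items()))

def nonzero_entries(files):
    """Any key whose sum is not zero points at a lost or duplicated import."""
    return {k: v for k, v in summed_scan(files).items() if v != 0}
```

If both files are imported exactly once, `nonzero_entries` is empty. If the file of 1s is loaded twice, every key sums to 1 and the check fails.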

  • Random Walk 101

    Randomly:
        Import files in random order
        Split the files into random sizes
        Split tablets at random points
        Kill tablet servers (agitate)

    Under load
    At scale

  • Version 2.0

    Master only coordinates file processing
    Distribute work to tablet servers
    Master distributes files to tablet servers
    Tablet servers:
        Analyze files for assignments
        Retry
        Communicate with destination tablet servers

  • Command Flow

    [Diagram: command flow from Client to Master to several TabletServers; arrows labeled "Files"]

  • Problems

    Problems solved:
        Permissions controlled by Master
        File processing is distributed
        Bulk import tested heavily

    Consistency
        Not so much

  • Version 3.0

    Problem: file imported more than once
        Repeated reloading
        RPC timeouts
        Tablet migration
        Tablet split

    Add flags to metadata table to prevent:
        file garbage collection
        repeated imports
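
The marker idea can be sketched as follows. This is a simplified model, not the Accumulo metadata schema: `ToyMetadata`, its fields, and its method names are hypothetical. A marker recorded per (tablet, file) pair lets a retried request be recognized, and the same marker tells the garbage collector to leave the file alone.

```python
class ToyMetadata:
    """Toy metadata table holding 'loaded' markers for bulk-imported files."""

    def __init__(self):
        self.loaded = set()      # markers: (tablet, path)
        self.tablet_files = {}   # tablet -> list of file paths

    def import_file(self, tablet, path):
        """Add the file once; a repeated request is a no-op returning False."""
        marker = (tablet, path)
        if marker in self.loaded:
            return False
        self.loaded.add(marker)
        self.tablet_files.setdefault(tablet, []).append(path)
        return True

    def can_garbage_collect(self, path):
        """A file that still has a marker must not be collected."""
        return all(p != path for _, p in self.loaded)
```

A second request for the same file, for example after an RPC timeout, returns `False` and leaves the tablet's file list unchanged.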

  • Version 3.0

    Problem solved
        Reduced Name Node ops
        Reduced trash lying around from failed imports
        Imports not repeated

    Does it stand up to the Random Walk Test?
        Not so much

  • Death by Slow Thread

    Please take this file
    OK, looks good, never saw this before
    Sleep

    Hey, please take this file, again
    OK, looks good, never saw this before
    Thanks!
    Compact!
    Wakeup! Time to import that file from the 1st request!

  • Zookeeper to the Rescue

    Add a 3rd party negotiator
    Define a session
    Add a file only while session is active
    Store session in Zookeeper
    Get agreement about the session at each critical point, including metadata table updates
    Session guides clean-up of markers
    Session closes only after all agree

  • Session

    Take this file, for session 123
    OK, working on session 123
    Never saw this file before
    Sleep

    Repeat, file for session 123
    Never saw this file before

    Anybody working on session 123?

  • Session

    Yes, I'm processing session 123!
    OK, finish up.

    Double check on session 123, import file
    Anybody working on session 123?

    What's session 123?
    Remove markers in metadata
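
The session exchange can be modeled with an in-memory registry standing in for the third-party negotiator. This is only a sketch of the protocol shape under stated assumptions: it is not a ZooKeeper client, and `SessionRegistry`, `ToyTabletServer`, and all method names are hypothetical.

```python
class SessionRegistry:
    """In-memory stand-in for the negotiator that stores sessions."""

    def __init__(self):
        self._next_id = 123
        self._active = {}            # session id -> workers still processing

    def open(self):
        sid = self._next_id
        self._next_id += 1
        self._active[sid] = set()
        return sid

    def begin_work(self, sid, worker):
        if sid not in self._active:  # "What's session 123?"
            return False
        self._active[sid].add(worker)
        return True

    def end_work(self, sid, worker):
        self._active.get(sid, set()).discard(worker)

    def is_active(self, sid):
        return sid in self._active

    def close(self, sid):
        if self._active.get(sid):    # "Yes, I'm processing session 123!"
            return False             # closes only after all agree
        self._active.pop(sid, None)
        return True


class ToyTabletServer:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.files = []

    def start_import(self, sid, path):
        return self.registry.begin_work(sid, self.name)

    def finish_import(self, sid, path):
        # Double check the session at the critical point before adding the file.
        ok = self.registry.is_active(sid) and path not in self.files
        if ok:
            self.files.append(path)
        self.registry.end_work(sid, self.name)
        return ok
```

A slow thread that wakes up after the session has closed fails the double check, so the file from the first request is not imported a second time.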

  • Bulk Import

    Problem solved:
        Distributed processing
        Permissions
        Files imported once, and only once
        Markers are cleaned up in the face of failures

    Performance
        Not so much

  • Master as Bottleneck

    Bulk import thousands of files
    Every 15 minutes
    Master renames files and puts them under /accumulo

    Bottleneck
        One master competes for NN ops with N tablet servers

  • Master, Go Faster

    Add configurable thread pool
    Push more move requests to the NN
    Compete more fiercely for resources
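
A configurable pool of this kind might look like the sketch below. It is an illustration only: `move_files`, the `pool_size` parameter, and the caller-supplied `rename` function are hypothetical, and no real Name Node call is made.

```python
from concurrent.futures import ThreadPoolExecutor

def move_files(paths, rename, pool_size=4):
    """Issue rename requests concurrently instead of one at a time.

    rename is any callable taking a source path and returning the result of
    the move; pool_size bounds how many requests are in flight at once.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # map() keeps results in the same order as the input paths
        return list(pool.map(rename, paths))
```

A larger `pool_size` pushes more requests at the Name Node at once, which helps the master only as long as the Name Node itself can keep up.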

  • Bulk Import

    More efficient data ingest
    Easy: just a file to the right tablets
    Hard: consistency in a distributed system
    Testing is your friend
    Nothing prepares you for large-scale problems!