Apache Accumulo 1.8.0 Overview Josh Elser Apache Accumulo Meetup Group 2016/06/27

Upload: josh-elser

Post on 16-Apr-2017





Page 1: Apache Accumulo 1.8.0 Overview

Apache Accumulo 1.8.0 Overview
Josh Elser
Apache Accumulo Meetup Group
2016/06/27

Page 2: Apache Accumulo 1.8.0 Overview

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Apache Accumulo 1.8.0

First release candidate in the works
A “minor” release, but significantly more work required than a “patch” release
  – ContinuousIngest and verification
  – RandomWalk

Long time coming..

Page 3: Apache Accumulo 1.8.0 Overview

Semantic Versioning

Defines a set of rules for software projects to adhere to across different versions
Clear understanding of compatibility
Rules are defined in terms of a “public API”
  – Defined by the project adopting SemVer
Major
  – Incompatible changes, deprecations removed
Minor
  – Backwards-compatible features added
Patch
  – Backwards-compatible bug-fixes only (no features)

http://semver.org - major.minor.patch
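
The major.minor.patch rules can be sketched in a few lines of Python. This is an illustration only: the function names `parse_version`, `classify_upgrade` and `api_compatible` are hypothetical, and the sketch ignores pre-release tags and the special rules SemVer has for 0.x versions.

```python
def parse_version(text):
    """Split 'major.minor.patch' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def classify_upgrade(old, new):
    """Name the kind of release that separates two versions."""
    o, n = parse_version(old), parse_version(new)
    if n < o:
        return "downgrade"
    if n[0] != o[0]:
        return "major"  # incompatible changes allowed, deprecations removed
    if n[1] != o[1]:
        return "minor"  # backwards-compatible features added
    if n[2] != o[2]:
        return "patch"  # backwards-compatible bug fixes only
    return "same"


def api_compatible(built_against, running):
    """True if SemVer promises that code written against the public API of
    `built_against` keeps working on `running`: same major, not older."""
    o, n = parse_version(built_against), parse_version(running)
    return n[0] == o[0] and n >= o
```

For example, `classify_upgrade("1.7.1", "1.8.0")` names a minor release, and `api_compatible("1.8.0", "1.7.0")` is false because the newer API may contain features the older release lacks.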

Page 4: Apache Accumulo 1.8.0 Overview

Apache Accumulo and Semantic Versioning

Apache Accumulo defines a public API
  – Made up of Java classes, defined by packages
  – The goal is to describe how user code should function across releases
  – Recursively, all public types in the following packages (excluding impl, thrift, or crypto)
      • org.apache.accumulo.core.{client,data,security}
      • org.apache.accumulo.minicluster
Other concerns for compatibility too
  – RPC classes
  – Persistent data (RFiles and ZooKeeper)
Not comprehensive!
  – Not all user-facing code is yet included in the public API
      • Monitoring UIs and data
      • Start/stop scripts
      • The Accumulo Shell
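
The package rule for the public API (everything under the listed packages, recursively, except impl, thrift, or crypto) can be written as a small predicate. This is a sketch for illustration: `in_public_api` is a hypothetical helper, not an Accumulo tool, and it looks only at the package name, not at whether a type is actually declared public.

```python
# Package roots and exclusions as listed on the slide.
PUBLIC_ROOTS = (
    "org.apache.accumulo.core.client",
    "org.apache.accumulo.core.data",
    "org.apache.accumulo.core.security",
    "org.apache.accumulo.minicluster",
)
EXCLUDED_PARTS = {"impl", "thrift", "crypto"}


def in_public_api(class_name):
    """True if the fully qualified class name falls inside the public API packages."""
    package = class_name.rsplit(".", 1)[0]
    for root in PUBLIC_ROOTS:
        if package == root or package.startswith(root + "."):
            # Any excluded component below the root removes the type from the API.
            sub_packages = package[len(root):].split(".")
            return not EXCLUDED_PARTS.intersection(sub_packages)
    return False
```

With this rule, `org.apache.accumulo.core.client.Connector` is inside the public API, while anything in `org.apache.accumulo.core.client.impl` or `org.apache.accumulo.core.util` is not.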

Page 5: Apache Accumulo 1.8.0 Overview

Apache Accumulo and Semantic Versioning

Is your 1.7.1 application guaranteed to work against 1.8.0?

What about a 1.6.5 application?

Are you guaranteed to be able to roll back an upgrade from 1.8.0 to 1.7.1?

Is your 1.8.0 application guaranteed to work against 1.7.0?


Page 6: Apache Accumulo 1.8.0 Overview

Notable changes currently staged for Apache Accumulo 1.8.0

Page 7: Apache Accumulo 1.8.0 Overview

System Administrator Changes

[ACCUMULO-925] - Launch scripts should use a PIDfile
  – New script: start-daemon.sh
  – Encapsulates only the things that need to happen on the machine starting a process
      • No SSH’ing
  – Support for PID files to track processes
  – Rotating .out and .err files on start
      • Critical for delayed JVM layer issues
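
The two ideas on this slide, tracking a process through a PID file and rotating its .out/.err files at start, can be sketched as follows. This is not the content of start-daemon.sh: the helper names, the file names and the number of rotated copies kept are all assumptions made for illustration, and the liveness check is POSIX-only.

```python
import os
from pathlib import Path


def read_pid(pid_file):
    """Return the PID recorded in pid_file, or None if it is missing or garbled."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def is_alive(pid):
    """Check whether a process with this PID exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_pid_file(pid_file, pid):
    """Record pid in pid_file unless a different live process already owns it."""
    existing = read_pid(pid_file)
    if existing is not None and existing != pid and is_alive(existing):
        return False  # a daemon is already running, refuse to start a second one
    Path(pid_file).write_text(f"{pid}\n")
    return True


def rotate(log_file, keep=5):
    """Shift log -> log.1 -> log.2 ..., keeping at most `keep` old copies."""
    log = Path(log_file)
    for i in range(keep - 1, 0, -1):
        older = Path(f"{log}.{i}")
        if older.exists():
            older.replace(f"{log}.{i + 1}")
    if log.exists():
        log.replace(f"{log}.1")
```

Rotating before each start means the output of an earlier run is still on disk when a problem is investigated later, instead of being overwritten by the restart.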

Page 8: Apache Accumulo 1.8.0 Overview

Performance!

[ACCUMULO-3423] - Speed up write-ahead log (WAL) roll-overs
  – Changes how references to WALs are stored by Accumulo
  – Reduces the number of writes when switching to a new WAL
  – Uses ZooKeeper to track the state, copies into tablet row before recovery starts
  – 10-30% faster than the previous implementation (measured while exacerbating the problem)

[ACCUMULO-1124] - Optimize index size in RFile
  – RFiles have “data” and “index” blocks; the index maps from a RowID to the data block containing that RowID
  – Large RowIDs bloat the index (e.g. inverted URL)
  – Fewer index blocks can be cached
  – Related work: [ACCUMULO-4164] and [ACCUMULO-4314]
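
One general way sorted-file formats keep such an index small is to store a short separator key per data block instead of a full key. The sketch below shows that general technique only; it is an assumption for illustration, not taken from the Accumulo source, and the slides do not say which technique ACCUMULO-1124 uses. The name `shortest_separator` and the example keys are hypothetical.

```python
def shortest_separator(last_key, next_key):
    """Return a short key s with last_key <= s < next_key in byte-wise order.

    last_key is the final key of one data block, next_key the first key of
    the following block. Any such s can stand in for last_key in the index.
    """
    assert last_key < next_key
    # Find the length of the shared prefix.
    i = 0
    limit = min(len(last_key), len(next_key))
    while i < limit and last_key[i] == next_key[i]:
        i += 1
    if i == limit:
        # last_key is a prefix of next_key, so nothing shorter separates them.
        return last_key
    if last_key[i] + 1 < next_key[i]:
        # Bump the first differing byte and drop everything after it.
        return last_key[:i] + bytes([last_key[i] + 1])
    return last_key  # fall back to the full key


last_key = b"com.example.www/articles/2016/06/accumulo-overview"
next_key = b"org.apache.accumulo/index"
separator = shortest_separator(last_key, next_key)
```

Here a 50-byte inverted URL is replaced by a one-byte separator, which is the kind of saving that lets more of the index fit in cache.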

Page 9: Apache Accumulo 1.8.0 Overview

New Features

[ACCUMULO-3913] - Add per table sampling
  – Helpful in running analytics over some percentage of the total data
  – Can automatically create samples during compaction or on the fly using Iterators
  – Configurable hashing to ensure consistency across “index” and “data” tables
      • No dangling references from index records or unreachable data records
  – Consider snapshot’ing a sample of a table. After compaction, just a “normal” table
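
Why hashing keeps an “index” table and a “data” table consistent can be shown with a toy model. The sketch below is not Accumulo's sampler API: `in_sample`, the modulus and the two in-memory “tables” are hypothetical. The point is only that both tables decide membership by hashing the same field (the document ID), so they always agree.

```python
import hashlib


def in_sample(doc_id, modulus=10):
    """Keep roughly 1/modulus of all document IDs, chosen by a stable hash."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus == 0


# Toy "data" table: document ID -> content.
data_table = {"doc%03d" % i: "text %d" % i for i in range(200)}
# Toy "index" table: (term, document ID) entries pointing into the data table.
index_table = [("term%d" % (i % 7), "doc%03d" % i) for i in range(200)]

# Both samples hash the document ID, not the row of their own table.
data_sample = {doc: text for doc, text in data_table.items() if in_sample(doc)}
index_sample = [(term, doc) for term, doc in index_table if in_sample(doc)]
```

Every entry in `index_sample` points at a document present in `data_sample`, and every sampled document is still reachable from the sampled index. Sampling each table by its own row would break both properties.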

[ACCUMULO-4187] - Rate limiting of major compactions
  – Compactions can strain system resources: hardware, JVM and HDFS
  – Normally, desirable to process compactions as fast as possible
  – Can negatively affect low-latency workloads
  – Configure a limit in bytes per second that a TabletServer should process during compaction
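
A bytes-per-second limit of this kind can be sketched as a simple limiter that makes the caller wait before each chunk. This is a conceptual sketch, not Accumulo's implementation: the class name `ByteRateLimiter` and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class ByteRateLimiter:
    """Delays callers so that, on average, at most bytes_per_second pass."""

    def __init__(self, bytes_per_second, clock=time.monotonic, sleep=time.sleep):
        self.rate = bytes_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_free = clock()  # earliest time the next chunk may start

    def acquire(self, num_bytes):
        """Wait until num_bytes may be processed; return the seconds waited."""
        now = self.clock()
        start = max(now, self.next_free)
        # Reserve the time this chunk "costs" at the configured rate.
        self.next_free = start + num_bytes / self.rate
        wait = start - now
        if wait > 0:
            self.sleep(wait)
        return wait
```

A compaction loop would call `acquire(len(chunk))` before writing each chunk. The first chunk passes immediately and later chunks are spaced out, which leaves I/O headroom for low-latency scans at the price of slower compactions.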

Page 10: Apache Accumulo 1.8.0 Overview

New Features (pt.2)

[ACCUMULO-3948] - Enable A/B testing of scan iterators on a table
  – Classpath context is a definition of JARs which the TabletServer should dynamically load
  – Configuration allows a context to be specified when using a [Batch]Scanner
  – Multiple implementations of the same SKVIterator classes can co-exist
  – Useful in testing new implementations of iterators on real data before switching production

[ACCUMULO-626] - Create an iterator fuzz tester
  – Writing SKVIterators is notoriously difficult
  – Many common pitfalls and gotchas, often not appearing until “real” use
  – A testing framework codifies these edge cases and can automatically test iterators
      • Similar to “security fuzzing”
  – Users must provide data sets and the expected outcome from using their SKVIterator
  – A supplement to unit testing and MiniAccumuloCluster, not a replacement
  – Test cases implicitly encourage good design of SKVIterators

Page 11: Apache Accumulo 1.8.0 Overview

New APIs

[ACCUMULO-2883] - Add API to fetch current locations of Tablets
  – Long-standing feature request (order of years)
  – Extremely useful for distributed execution engines for locality-aware computation
      • Apache Hive, Presto, Apache Drill, Apache Spark, etc.
  – Smart placement can reduce client <--> Accumulo network traffic
      • Locality with Accumulo Tablets also implies locality with HDFS data (over time)

[ACCUMULO-4165] - Create a user level API for RFile
  – Example of a “glaring” hole in the public API
  – Previously, the only stable way to create an RFile was via MapReduce
  – Provides a supported API for reading and writing RFiles
  – Simplifies implementation and use of RFile access internally too

Page 12: Apache Accumulo 1.8.0 Overview

Changes to be wary of

[ACCUMULO-3409] - Move default ports out of ephemeral range
  – Traditional ephemeral range on Linux: [32768, 61000]
  – Transient connections can prevent processes from starting
  – Monitor HTTP port moves from 50095 to 9995
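
A quick way to see why the Monitor port moved is to test ports against the ephemeral range. The sketch below uses the range quoted on the slide as its default. The function names are hypothetical, and reading `/proc/sys/net/ipv4/ip_local_port_range` assumes a Linux host (the sketch falls back to the slide's range elsewhere).

```python
EPHEMERAL_RANGE = (32768, 61000)  # range quoted on the slide, inclusive


def local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the ephemeral port range configured on this host, if available."""
    try:
        with open(path) as f:
            low, high = f.read().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return EPHEMERAL_RANGE


def in_ephemeral_range(port, port_range=EPHEMERAL_RANGE):
    """True if the OS might hand this port to an outgoing connection."""
    low, high = port_range
    return low <= port <= high
```

The old Monitor port 50095 falls inside the range, so a transient outgoing connection could already hold it when the Monitor starts. The new port 9995 falls outside it.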

[ACCUMULO-4077] - Upgrade to Apache Thrift 0.9.3
  – Thrift is used by Accumulo for RPCs
  – Serialized messages are compatible (with caveats) across releases, but Java classes are not
  – A massive pain for downstream integrations
  – If you require a different version of Thrift and want to use Accumulo 1.8.0:
      • Shade+Relocate your version of Thrift in your application
      • Upgrade to Apache Thrift 0.9.3

Page 13: Apache Accumulo 1.8.0 Overview

Thank You

Email: [email protected]
Twitter: @josh_elser
Mailing list: [email protected]