Accumulo Summit 2015: Attempting to Answer Unanswerable Questions: Key Management in Accumulo for Encryption at Rest

Anthony Young-Garner, Cloudera

TRANSCRIPT

Key Management in Accumulo for Encryption at Rest

Accumulo 1.6 introduced support for encryption in transit and at rest. These capabilities increase the protection Accumulo provides against the threat of unauthorized data access. Michael Allen covered these features in a presentation at last year's Accumulo Summit. However, at that point in time, there were some unanswered questions around key management. Today, I'd like to highlight where advances in the underlying Hadoop platform and HDFS start to provide clearer choices and better answers to these questions.

Past and future threats, a refresher

[Diagram: Accumulo Cluster, HDFS Cluster, Client Machines, Zookeeper Cluster; Sven (User), Jo (User), Molly (Network Admin), William (Accumulo Admin), Halim (HDFS Admin); Trusted Zone (implicit). Image design credit: Michael Allen, see slide 7.]

Last year, Michael Allen introduced us to a set of threat vectors and a cast of characters. As a reminder, Accumulo's built-in access control mechanisms prevent unauthorized data access by users and processes participating in the Accumulo processing paths. However, the design of the system also creates an implicit zone of trust composed of users and processes that are not participants in the Accumulo processing paths but that do have visibility into portions of those paths. In particular, there are user roles with administrator privileges within the larger data storage and network environments in which Accumulo operates who can see data in the clear: over the wire between Accumulo clients and servers, and in RFiles and WAL files at rest on HDFS. Thus, whether by side effect or intention, these user roles must be considered trusted by the system unless steps can be taken to push them out of the Trusted Zone.

This is not theoretical (1 of 3)

[root@secure-1 lib]# accumulo shell -u root
Password: **************

Shell - Apache Accumulo Interactive Shell
-
- version: 1.6.0-cdh5.1.4
- instance name: accumulo
- instance id: cce72c83-826a-41bf-a11f-a8aecebeebaf
-
- type 'help' for a list of available commands
-

root@accumulo table1> scan -s public,private
alice properties:age [public] 48
alice properties:ssn [private] 123-45-6789
bob properties:age [public] 51
bob properties:ssn [private] ...
root@accumulo table1> quit

Accumulo user with proper visibility authorizations accessing data.

The threat is not theoretical. Here we see a properly authorized Accumulo user accessing table data on what is intended to be a secure production cluster. Within Accumulo, the data is protected from unauthorized access by visibility rules defined with identity theft rather than age discrimination in mind. But regardless of the naming choices made in defining the visibility labels, the table data is only visible to authenticated users with specific authorization for these specific labels (public and private). An Accumulo user who has not been specifically authorized to view data with the public visibility label will not be able to see the age data in table1, and an Accumulo user who has not been specifically authorized to view data with the private visibility label will not be able to see the social security number data in table1.

This is not theoretical (2 of 3)

[hdfs@secure-1 ~]$ hadoop distcp \
  hdfs://secure-1:8020/accumulo/tables/3/default_tablet/F000018w.rf \
  hdfs://insecure-1:8020/tmp/table1_export_dest/

HDFS administrator copying an RFile from a cluster on which he has no privileges to one on which he does.

But whatever the niceties of privacy, our HDFS administrator is really curious to know Alice's age. And after all, if he really wasn't supposed to see the data on the secure cluster, it wouldn't be so easy for him to copy it to an insecure cluster over which he has full control.

This is not theoretical (3 of 3)

[root@insecure-1 ~]# accumulo shell -u root
Password: **************

Shell - Apache Accumulo Interactive Shell
-
- version: 1.6.0-cdh5.1.4
- instance name: accumulo
- instance id: ebfe2e64-ba12-4231-8261-3a89115046ed
-
- type 'help' for a list of available commands
-

root@accumulo> importtable table1_copy /tmp/table1_export_dest
root@accumulo table1> setauths -u root -s public,private
root@accumulo> scan -t table1_copy -s public,private
alice properties:age [public] 48
alice properties:ssn [private] 123-45-6789
bob properties:age [public] 51
bob properties:ssn [private] ...
root@accumulo> quit

HDFS admin reading unauthorized data.

Once the unencrypted data is on an insecure cluster under the curious HDFS administrator's control, he can give himself whatever privileges he needs to view the data.

Proper auditing might alert others to the HDFS administrator's actions. We and other vendors offer auditing functionality that, properly configured, would detect, report and possibly even send alerts in response to the suspect distcp invocation. But even so, detection after the fact does not change the fact of the unauthorized data access, and therefore the only available actions at that point are reactions and damage control. And this is the least sophisticated method. An OS administrator with rights on the underlying file system could copy from the local filesystem rather than using distcp and use a hexdump to read the data. A network administrator might sniff the data off the wire while it is in transit between Accumulo client and server or between servers in the cluster.

Past and future threats, where we left off

[Diagram: Accumulo Cluster, HDFS Cluster, Client Machines, Zookeeper Cluster; Sven (User), Jo (User), Molly (Network Admin), William (Accumulo Admin), Halim (HDFS Admin); Trusted Zone (implicit). Image design credit: Michael Allen, see slide 7.]

The encryption capabilities in Accumulo 1.6 address these threats. The introduction of SSL encryption between Accumulo clients and servers, and between the Accumulo servers themselves, pushes those with network sniffing privileges out of the implicit zone of trust. And introducing encryption of Accumulo data before writing to persistent storage pushes HDFS administrators out of the zone of trust. The SSL functionality is straightforward, but before continuing, I'll talk a bit more about the way in which Accumulo encryption of data at rest works.

Accumulo SecretKeyEncryptionStrategy

- Accumulo encryption at rest encrypts each RFile and WAL file with a data encryption key (DEK)
- Data encryption keys are encrypted with a key encryption key (KEK)
- Data is secure at rest and in transit
- Key encryption key is stored in HDFS (default implementation)
- See Michael Allen's "Past and Future Threats: Encryption and Security in Accumulo" presentation from Accumulo Summit 2014 for more detail on message encryption (SSL) and data encryption support in Accumulo 1.6: http://accumulosummit.com/archives/2014/program/talks/

The collection of Accumulo 1.6 encryption at rest JIRAs encrypt each RFile and WAL file with a data encryption key before writing to persistent storage. Each of these data encryption keys is encrypted with a service-wide key encryption key before also being written to storage, each DEK along with its associated file. This design largely achieves the goal of ensuring data security both in transit and at rest. The linchpin in this scheme, then, is deciding how to secure the service-wide key encryption key. This is done by implementing the Accumulo SecretKeyEncryptionStrategy interface. The SecretKeyEncryptionStrategy interface defines two methods, encryptSecretKey and decryptSecretKey, which are intended to do, to the service-wide key encryption key, exactly what their names suggest. The default implementation of the SecretKeyEncryptionStrategy interface in Accumulo 1.6 stores the key encryption key in HDFS.
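As a rough sketch of the shape of that extension point (the interface and method names are as described above; the CryptoModuleParameters parameter type and the package are assumptions about the Accumulo 1.6 internals rather than verbatim source):

package org.apache.accumulo.core.security.crypto;

// A sketch of the SecretKeyEncryptionStrategy extension point, not verbatim
// Accumulo source. Implementations receive the secret (key encryption) key
// inside the crypto parameters, wrap or unwrap it, and hand the parameters back.
public interface SecretKeyEncryptionStrategy {

  // Wrap the plaintext KEK carried in params (e.g., write it to HDFS or a KMS).
  CryptoModuleParameters encryptSecretKey(CryptoModuleParameters params);

  // Reverse the operation: recover the plaintext KEK for the crypto module.
  CryptoModuleParameters decryptSecretKey(CryptoModuleParameters params);
}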

In this talk, I'm going to focus on the use and operational aspects of the data encryption at rest support. For more detail on the design and implementation of both message encryption (SSL) and data encryption support in Accumulo 1.6, see Michael Allen's "Past and Future Threats: Encryption and Security in Accumulo" presentation from last year's Accumulo Summit: http://accumulosummit.com/archives/2014/program/talks/

Enabling Accumulo encryption - accumulo-site.xml

Enabling data encryption in Accumulo is mostly painless, involving the setting of about a dozen properties in the accumulo-site.xml file, most of which can be set by rote. Only a few are shown here. The most important settings to note for the purposes of our discussion today are the crypto.secret.key.encryption.strategy.class and crypto.default.key.strategy.key.location properties: the first defines which implementation class will be used to secure and store the key encryption key, and the second, in the case of the CachingHDFSSecretKeyEncryptionStrategy, specifies where in HDFS the KEK will be stored.

Properties: crypto.module.class, crypto.cipher.suite, crypto.cipher.algorithm.name, crypto.block.stream.size, crypto.cipher.key.length, crypto.secure.rng, crypto.secure.rng.provider, crypto.secret.key.encryption.strategy.class, crypto.default.key.strategy.hdfs.uri, crypto.default.key.strategy.key.location, crypto.default.key.strategy.cipher.suite
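The XML on the slide did not survive extraction, so here is a hedged reconstruction of how the two properties called out above might look in accumulo-site.xml; the property names come from the list above and the key path matches the one used later in this talk, while the implementation class's package is an assumption:

<!-- Hypothetical accumulo-site.xml excerpt; values are illustrative. -->
<property>
  <name>crypto.secret.key.encryption.strategy.class</name>
  <value>org.apache.accumulo.core.security.crypto.CachingHDFSSecretKeyEncryptionStrategy</value>
</property>
<property>
  <name>crypto.default.key.strategy.key.location</name>
  <value>/accumulo/crypto/secret/keyEncryptionKey</value>
</property>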

Access attempt thwarted

[root@insecure-1 ~]# accumulo shell -u root

root@accumulo> importtable table1_copy /tmp/table1_export_dest

root@accumulo> scan -t table1_copy -s public,private
2015-04-18 22:45:15,282 [shell.Shell] ERROR: java.lang.RuntimeException: org.apache.accumulo.core.client.impl.AccumuloServerException: Error on server insecure-5.vpc.cloudera.com:...
root@accumulo> quit

HDFS admin attempt to read unauthorized data fails.

Once Accumulo encryption at rest is configured, a rogue actor can still use distcp or other means to copy data from the secure cluster to an insecure location. But attempts to actually use the data will fail, because the encrypted data is meaningless to the Accumulo runtime, which cannot parse the encrypted file(s).

Data access threats summarized

Vector: Protection mechanism
- Unauthorized users: visibility labels
- Network administrator: Thrift/SSL
- HDFS administrator: Accumulo encryption at rest
- Misconfiguration: all of the above

So let's pause for a moment and summarize what we've discussed so far. We've recognized that visibility labels are probably the Accumulo project's most famous security feature. They provide fine-grained, cell-level access control and thereby protect against unauthorized data access by Accumulo users and client applications. But as we've seen, visibility labels only protect against data access from within Accumulo processes. HDFS administrators can still access the Accumulo RFiles on disk, and network administrators can access Accumulo data in transit between hosts. The support for SSL and encryption at rest addresses these threat vectors and allows us to more closely manage who is included in the implicit zone of trust around our Accumulo cluster.

Not so fast...

[hdfs@secure-1 ~]$ hadoop distcp \
  hdfs://secure-1:8020/accumulo/crypto/secret/keyEncryptionKey \
  hdfs://insecure-4:8020/accumulo/crypto/secret/keyEncryptionKey

HDFS admin can copy the Accumulo key encryption key!

But as Michael Allen has discussed in the past, and as I've alluded to earlier in our discussion, we're not quite done here, because the key encryption key itself is still available to anyone with proper access on HDFS, including the HDFS administrator.

Current threats: nearly back where we started?!

[Diagram: Accumulo Cluster, HDFS Cluster, Client Machines, Zookeeper Cluster; Sven (User), Jo (User), Molly (Network Admin), William (Accumulo Admin), Halim (HDFS Admin); Trusted Zone (implicit).]

So it may seem that we've done quite a bit of work, or at least in our particular case, a lot of talking, to end up right back where we started. Our network administrator has definitely been removed from the zone of trust by SSL, but it seems that our HDFS administrator is a bit more stubborn.

An interlude: HDFS transparent encryption at rest

- Data in encryption zones is transparently encrypted by the HDFS client
- Secure at rest and in transit
- Prevents attacks at the HDFS, FS and OS levels
- Key management is independent of HDFS
- Designed for performance, scalability, compartmentalization and compatibility
- Keys are stored by the Hadoop Key Management Service (KMS)
  - Proxy between the key store and HDFS encryption subsystems on HDFS client/server

In the interim between the release of Accumulo 1.6 and today, support for encryption at rest in HDFS and key management in Hadoop were added to the platform. This support became generally available in Hadoop 2.6. HDFS encryption at rest, like Accumulo encryption at rest, secures data at rest and in transit between HDFS client and server. (Full stop.) It prevents attacks at the HDFS, FS and OS levels. In addition, it leverages the general-purpose Key Management Service (KMS), also introduced in Hadoop 2.6. Both features, HDFS encryption and the KMS, are designed to meet the performance and scalability requirements necessary to provide transparent data encryption and key management functions to Hadoop services without adversely impacting the operational capabilities of those services. HDFS encryption, in particular, was designed and implemented to be entirely transparent to and compatible with clients and services running in and on top of HDFS, other than the hopefully minimal performance impact of the actual encryption and decryption operations. The KMS, in particular, is designed not only to provide a robust proxy between the Hadoop cluster and back-end key stores but also to provide the management, administrative and role-based compartmentalization of access and function needed to effectively leverage key management within the Hadoop ecosystem.

Before talking about how HDFS encryption and the KMS help us to solve the problem of key management for encrypted Accumulo data, I'll talk in a bit more detail about how both HDFS encryption and the KMS work.

HDFS encryption, simple version

[Diagram: HDFS client, HDFS cluster (Name Node, Data Node, files with metadata), and the Hadoop KMS behind REST/HTTP and the Hadoop Key Provider API.]
1. User or process creates key (KEK)
2. HDFS admin creates encryption zone, associating an empty directory and a KEK
3. User or process initiates read/write to a file in the encryption zone

HDFS encryption, like Accumulo encryption, makes use of data encryption keys (DEKs) to protect individual files and key encryption keys (KEKs) to protect a set of data encryption keys. The implementation also relies on the Hadoop Key Management Service (KMS) to manage and provide the root of trust for key encryption keys, which removes the ability of HDFS administrators to control or access KEK key material. This also blocks HDFS administrator access to encrypted data.

Operational use of HDFS encryption starts with a user or process creating a key via the Hadoop Key Management Service. In our case, the Accumulo user might create an accumulo-key. Then the HDFS administrator can create an encryption zone for this user on an empty directory set up with the appropriate ownership and ACLs. The encryption zone ties together a directory (and all of its subdirectories) with a particular key. The HDFS admin only needs to know the name of the key; he neither has nor requires access to the key material. Once these two steps are completed, client reads and writes to the encryption zone are automatically and transparently decrypted and encrypted by the HDFS client. As a matter of fact, no user or process on the HDFS server ever sees or has access to the key encryption key. The result is that no processes or users on the server side are able to decrypt the data.
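Concretely, those first two steps look like this (the key name and zone path match the ones used later in this talk; a sketch, not a full runbook):

# Step 1: the Accumulo service user creates the KEK in the KMS.
sudo -u accumulo hadoop key create accumulo-key

# Step 2: the HDFS admin ties an empty, properly owned directory to that key.
# He needs only the key name, never the key material.
sudo -u hdfs hdfs crypto -createZone -keyName accumulo-key -path /accumulo/crypto/secret

# Step 3 is implicit: reads and writes under /accumulo/crypto/secret are now
# encrypted and decrypted transparently by the HDFS client.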

The most important takeaway message for our purposes here is that HDFS encryption establishes the root of trust and control in the KMS, rather than through any mechanism or authority within HDFS itself, which allows the flexibility to separate the data access roles from the file administration roles. The most important message in general about the KMS within the context of HDFS encryption is that the key encryption key never leaves the KMS host (and if a key server is used, it never leaves the KMS process).

HDFS encryption, name node actions

[Diagram: HDFS client, HDFS cluster (Name Node, Data Node, files with metadata), and the Hadoop KMS behind REST/HTTP and the Hadoop Key Provider API.]
3. User or process initiates read/write to a file in the encryption zone
4. On file creation, the name node requests an encrypted data encryption key (EDEK) from the KMS. The EDEK is stored with the file metadata on the Name Node.
5. Name node returns the file stream and the encrypted key to the client

NOTE: This level of detail may be unnecessary. Consider skipping this slide unless people want to know more.

The steps on the previous slide are all you need to know to use HDFS encryption. All other details after step 3 are transparent to the user and to the HDFS client processes. However, in order to understand how the key encryption keys (and, thus, the data) are protected, you may want to know more. The details in steps 4 and 5 above are the mechanism used to keep data encryption keys away from any server-side processes, including the name node and data nodes. So then how does the client get the unencrypted data encryption key?

HDFS encryption, client actions

[Diagram: HDFS client, HDFS cluster (Name Node, Data Node, files with metadata), and the Hadoop KMS behind REST/HTTP and the Hadoop Key Provider API.]
6. Client requests the decrypted DEK from the KMS
7. KMS uses the KEK to decrypt the DEK and returns the decrypted DEK to the client
8. Client uses the DEK to read/write encrypted data to/from the stream

NOTE: This level of detail may be unnecessary. Consider skipping this slide unless people want to know more.

The client requests the decrypted data encryption key directly from the KMS. This happens transparently, implemented in the CryptoStream support in HDFS encryption. The decryption of the data encryption key occurs on the KMS, so the key encryption key never leaves the KMS. Furthermore, assuming ACLs are configured properly, only the client process is able to request the decryption of the data encryption key.
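For the curious, here is a hedged sketch of that KMS round trip through the Hadoop KeyProvider API (the classes and methods below are from Hadoop 2.6's org.apache.hadoop.crypto.key package, but the call sequence is illustrative rather than the literal HDFS client code, and the KMS host is the one that appears later in this talk):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.crypto.key.KeyProvider;
import org.apache.hadoop.crypto.key.KeyProviderCryptoExtension;
import org.apache.hadoop.crypto.key.KeyProviderFactory;

public class KmsRoundTripSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point at the KMS over its REST interface.
    KeyProvider provider = KeyProviderFactory.get(
        new URI("kms://http@secure-3.vpc.cloudera.com:16000/kms"), conf);
    KeyProviderCryptoExtension crypto =
        KeyProviderCryptoExtension.createKeyProviderCryptoExtension(provider);

    // Step 4 analogue: ask the KMS to generate a DEK and return it encrypted
    // under the named KEK. Only the EDEK ever reaches HDFS.
    KeyProviderCryptoExtension.EncryptedKeyVersion edek =
        crypto.generateEncryptedKey("accumulo-key");

    // Steps 6-7 analogue: an authorized client asks the KMS to decrypt the EDEK.
    // The KEK never leaves the KMS; only the decrypted DEK comes back.
    KeyProvider.KeyVersion dek = crypto.decryptEncryptedKey(edek);
    System.out.println("DEK length in bytes: " + dek.getMaterial().length);
  }
}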

The key takeaway from this slide and the previous one is that all of these steps happen within the HDFS implementation with no changes to client code. This is why it is called transparent encryption. Only administrative actions are required to enable or disable encryption. No code changes are necessary.

Hadoop KMS: what's in the black (orange) box?

[Diagram: Hadoop KMS behind REST/HTTPS and the Hadoop Key Provider API.]

- Hadoop Key Management Server is a proxy between KMS clients and a backing key store
  - Default store is a Java key store file
  - Implementations for full-featured key servers with support for Hardware Security Module (HSM) integration are available today
  - HSM integration moves the root of trust to the HSM
- Provides a unified API and scalability
- Configurable caching support
- Provides key lifecycle management (create, delete, roll, etc.)
- Provides a broad set of access control capabilities
  - Per-user ACL configuration for access to the KMS
  - Per-key ACL configuration for access to specific keys
  - Strong authentication via Kerberos support
- Full-featured hadoop shell command line provided

I've said that the Hadoop KMS is the root of trust when using HDFS encryption. This is true, full stop. HDFS encryption keys are secured by the KMS. However, the keys are not stored by the KMS. Instead, the KMS provides a key provider API that defines how keys are stored and relies on an implementation of this API to actually store the keys somewhere. The default implementation of the key provider API stores keys in Java key store files on the local file system of the KMS. This is relatively secure, assuming the KMS host is protected from unauthorized local file system access. However, it is not particularly robust, in that it cannot provide for high availability or failover. And in some environments, storing key material on an operational host, even one that is protected from unauthorized local file system access, is a violation of policy. In such environments, it is likely that a dedicated and full-featured key server is in use. For example, we provide a key provider implementation that leverages a back-end keystore that provides for availability and failover and also supports the Hardware Security Modules (HSMs) used by the most security-conscious organizations.
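For reference, the backing store is selected in kms-site.xml; hadoop.kms.key.provider.uri is the actual Hadoop property and the jceks scheme is the stock Java keystore provider, but the file path shown is an illustrative assumption:

<property>
  <name>hadoop.kms.key.provider.uri</name>
  <value>jceks://file@/var/lib/kms/kms.keystore</value>
</property>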

Having the KMS as a proxy allows it to provide the management, administrative and role-based compartmentalization of access and function needed to effectively leverage key management within the Hadoop ecosystem, while respecting the fact that the most security-conscious environments will require a dedicated key server, whether to meet quality of service standards or to meet security requirements that the root of trust for keys rest in an organizational HSM.

Hadoop KMS ACL example: blacklisting the hdfs admin

Getting back to our Accumulo use case, one key aspect of the separation of roles and concerns between HDFS encryption and the Hadoop KMS is that it allows us to define KMS ACLs that block particular users, including the HDFS administrator, from accessing key material and/or particular key management functions.

- KMS ACLs support both white and black lists on all exposed KMS functions (create, delete, rollover, get key, get key metadata, generate encrypted key, decrypt encrypted key)
- You can also configure white lists on a per-key basis
- ACL updates in config files are loaded dynamically without restarting the service
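The slide's configuration excerpt was lost in extraction, so here is a hedged reconstruction of what such a kms-acls.xml entry might look like; the blacklist property and the DECRYPT_EEK operation are standard Hadoop KMS configuration and match the error we are about to see, while the per-key entry is an extra illustration:

<!-- Hypothetical kms-acls.xml excerpt. -->
<!-- Blacklist the hdfs superuser from decrypting any encrypted DEK. -->
<property>
  <name>hadoop.kms.blacklist.DECRYPT_EEK</name>
  <value>hdfs</value>
</property>

<!-- Optionally, whitelist only the accumulo user on the accumulo-key itself. -->
<property>
  <name>key.acl.accumulo-key.DECRYPT_EEK</name>
  <value>accumulo</value>
</property>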

Finally, the HDFS admin is truly blocked

[hdfs@secure-1 ~]$ hadoop distcp \
  hdfs://secure-1:8020/accumulo/crypto/secret/keyEncryptionKey \
  hdfs://insecure-4:8020/accumulo/crypto/secret/keyEncryptionKey
15/04/18 22:41:09 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
...
15/04/18 22:41:10 ERROR util.RetriableCommand: Failure in Retriable command: Copying hdfs://secure-1.vpc.cloudera.com:8020/accumulo/crypto/secret/keyEncryptionKey to hdfs://insecure-4.vpc.cloudera.com:8020/accumulo/crypto/secret/keyEncryptionKey
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand$CopyReadException: org.apache.hadoop.security.authorize.AuthorizationException: User:hdfs not allowed to do 'DECRYPT_EEK' on 'accumulo-key'

HDFS admin can no longer copy the Accumulo key encryption key!

Having enabled HDFS encryption and configured the KMS, we can finally block the HDFS administrator from getting unauthorized access to secure data.

Well, mostly blocked...

[hdfs@secure-1 ~]$ hadoop distcp \
  hdfs://secure-1:8020/.reserved/raw/accumulo/crypto/secret/keyEncryptionKey \
  hdfs://insecure-4:8020/.reserved/raw/accumulo/crypto/secret/keyEncryptionKey
15/04/18 22:43:52 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/04/18 22:43:52 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
...
15/04/18 22:43:53 INFO mapreduce.Job: Job job_local2004063696_0001 completed successfully

The /.reserved/raw virtual path allows HDFS admins to perform distcp operations, but the data is moved in its encrypted form. No decryption occurs.

But then again, we have to stop for a bit and consider that moving data around on the cluster is a core part of the HDFS administrator's job. We don't want to tempt the HDFS administrator with access to data he shouldn't see, but we also don't want to tie his hands. Therefore, HDFS encryption introduces the /.reserved/raw virtual path. This virtual path allows the HDFS admin to perform arbitrary HDFS operations on data, but when operations are performed in this virtual path, no transparent encryption or decryption occurs. So, returning to our now canonical example, we see that the HDFS administrator can still use distcp to copy the Accumulo key encryption key from the secure cluster to the insecure cluster. But this time the copy of the file on the insecure cluster is still encrypted via HDFS encryption and cannot be read on the insecure cluster, which doesn't have access to the necessary keys.
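A quick way to see the difference for yourself (a hypothetical session; hadoop fs -cat and hadoop fs -checksum are standard commands, and the behavior described in the comments follows from the raw-path semantics above):

# Through the normal path, the HDFS client decrypts transparently (if the
# caller is authorized to have the DEK decrypted).
hadoop fs -cat /accumulo/crypto/secret/keyEncryptionKey

# Through the raw path, the same bytes come back still encrypted; comparing
# checksums of the raw source and the exfiltrated copy would show identical
# ciphertext on both clusters.
hadoop fs -cat /.reserved/raw/accumulo/crypto/secret/keyEncryptionKey
hadoop fs -checksum /.reserved/raw/accumulo/crypto/secret/keyEncryptionKey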

Accumulo SecretKeyEncryptionStrategy revisited

- Accumulo encryption at rest encrypts each RFile with a data encryption key (DEK)
- Data encryption keys are encrypted with a key encryption key (KEK)
- Data is secure at rest and in transit
- Key encryption key is stored in HDFS
- Options to protect the Accumulo KEK:
  - Leverage HDFS encryption in Hadoop 2.6
    - Default SecretKeyEncryptionStrategy with HDFS encryption
    - Accumulo on HDFS encryption
  - Custom SecretKeyEncryptionStrategy

With the introduction of HDFS encryption, we now have a couple of viable options to protect the Accumulo key encryption key without needing to implement a custom SecretKeyEncryptionStrategy, although that remains a reasonable option for specific situations. The first option is to use HDFS encryption to protect the Accumulo KEK stored by the default SecretKeyEncryptionStrategy. The second option is to use HDFS encryption to secure the entire Accumulo data directory directly in HDFS. I'll discuss each of these options in turn.

Using HDFS encryption to protect the Accumulo KEK via the Hadoop KMS

[Diagram: Accumulo Cluster, HDFS Cluster, Zookeeper Cluster, Hadoop KMS; William (Accumulo Admin), Halim (HDFS Admin); Trusted Zone (implicit).]

The simplest option, especially if you're already using Accumulo encryption at rest and the default SecretKeyEncryptionStrategy implementation, is to use HDFS encryption to protect the Accumulo key encryption key (KEK), with the root of trust in the backing key store of the Hadoop KMS. In this deployment model, Accumulo data continues to be protected using Accumulo encryption at rest, while the Accumulo KEK itself is protected by HDFS encryption, effectively blocking the HDFS administrator from unauthorized access.

Moving the Accumulo KEK to an encryption zone

# sudo -u accumulo hadoop key create accumulo-key
accumulo-key has been successfully created with options Options{cipher='AES/CTR/NoPadding', bitLength=128, description='null', attributes=null}.
KMSClientProvider[http://secure-3.vpc.cloudera.com:16000/kms/v1/] has been updated.

# sudo -u accumulo hadoop fs -mv /accumulo/crypto/secret /accumulo/crypto/secret-tmp
# sudo -u hdfs hadoop fs -mkdir -p /accumulo/crypto/secret
# sudo -u hdfs hadoop fs -chown accumulo:accumulo /accumulo/crypto/secret

# sudo -u hdfs hdfs crypto -createZone -keyName accumulo-key -path /accumulo/crypto/secret
Added encryption zone /accumulo/crypto/secret

# sudo -u hdfs hadoop distcp -pugpx -skipcrccheck -update /accumulo/crypto/secret-tmp \
  /accumulo/crypto/secret

# sudo -u accumulo hadoop fs -rm -r /accumulo/crypto/secret-tmp
Deleted /accumulo/crypto/secret-tmp

Creating the KEK in an existing encryption zone is much simpler.

How do we actually do this? The process is straightforward. First, we'll need to find a brief Accumulo maintenance window (no more than five minutes, though you may want to allow more time to run acceptance tests or smoke tests afterward), as access to Accumulo data may be briefly interrupted while we encrypt the Accumulo key encryption key.
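Before reopening for business, it is worth verifying the zone and the key (hdfs crypto -listZones and hadoop key list are standard commands; the output shown is illustrative):

# sudo -u hdfs hdfs crypto -listZones
/accumulo/crypto/secret  accumulo-key

# sudo -u accumulo hadoop key list
Listing keys for KeyProvider: KMSClientProvider[http://secure-3.vpc.cloudera.com:16000/kms/v1/]
accumulo-key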

Tradeoffs of the hybrid approach (Accumulo KEK + KMS)

Pros:
- Least-effort path forward
- Minimal operational risk
- Minimal Accumulo downtime
- Allows gentle adoption of HDFS encryption and the Hadoop KMS
- Leverages nearly all administrative capabilities of the Hadoop KMS

Cons:
- Accumulo 1.6 encryption at rest supports RFiles and write-ahead logs, but not yet recovered write-ahead logs
- Current implementation and framework is experimental

This hybrid approach requires little effort to implement and introduces the least possible operational risk, assuming you're already using Accumulo encryption at rest. It also requires the least amount of Accumulo downtime, because the distcp operation only has to copy the one Accumulo key encryption key file, which in most cases will complete in seconds. Since only the Accumulo KEK is being protected by HDFS encryption, it allows your Accumulo administrator, HDFS administrator and other Hadoop administrators to gently ramp up their adoption of HDFS encryption and the Hadoop KMS without having to become experts on day one, while still providing your admins with nearly all of the administrative capabilities that full use of the Hadoop KMS with Accumulo data would provide.

One negative aspect of this approach, especially if you're not already using Accumulo encryption at rest, is that the Accumulo encryption at rest implementation in Accumulo 1.6 is considered experimental and is not yet fully complete. While Accumulo 1.6 supports encryption of RFiles and write-ahead logs, it does not yet provide support for encryption of the recovered write-ahead logs that are created and stored on HDFS when a tablet server fails.

See JIRA https://issues.apache.org/jira/browse/ACCUMULO-981 (support pluggable encryption codecs for MapFile when recovering write-ahead logs) for more details.

Using HDFS encryption to protect the Accumulo directory directly

[Diagram: Accumulo Cluster, HDFS Cluster, Zookeeper Cluster, Hadoop KMS; William (Accumulo Admin), Halim (HDFS Admin); Trusted Zone (implicit).]

Another option worth considering, especially if you're not already using Accumulo encryption at rest, is leveraging HDFS transparent encryption to protect your entire Accumulo data directory.

Moving the Accumulo directory to an encryption zone

# sudo -u accumulo hadoop key create accumulo-key
accumulo-key has been successfully created with options Options{cipher='AES/CTR/NoPadding', bitLength=128, description='null', attributes=null}.
KMSClientProvider[http://secure-3.vpc.cloudera.com:16000/kms/v1/] has been updated.

# sudo -u accumulo hadoop fs -mv /accumulo /accumulo-tmp
# sudo -u hdfs hadoop fs -mkdir /accumulo
# sudo -u hdfs hadoop fs -chown accumulo:accumulo /accumulo

# sudo -u hdfs hdfs crypto -createZone -keyName accumulo-key -path /accumulo
Added encryption zone /accumulo

# sudo -u hdfs hadoop distcp -pugpx -skipcrccheck -update /accumulo-tmp /accumulo

# sudo -u hdfs hadoop fs -rm -r /accumulo-tmp
Deleted /accumulo-tmp

Stop tablet servers before moving the data directory.

The procedure to move the entire Accumulo data directory to an encryption zone is quite similar to the process we followed to encrypt the Accumulo key encryption key. However, in this case, because much more data is likely involved, a longer maintenance window will be needed, both because the actual distcp operation will likely take longer and because you may want to perform more extensive verification and smoke testing on the Accumulo cluster after the modification.

IMPORTANT NOTES: When using CDH, tablet servers should still be stopped via the decommission-a-node method described in the Apache Accumulo user manual BEFORE stopping the service using CM. This prevents errors in the distcp process due to stale HDFS Name Node information about the Accumulo write-ahead logs.

The command is accumulo admin stop (be sure to include all tablet servers), as sketched below.
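With hypothetical tablet server host names and the default port, the order of operations might look like this:

# Decommission each tablet server first so its write-ahead logs are cleanly closed...
accumulo admin stop tserver-1:9997 tserver-2:9997 tserver-3:9997
# ...then stop the remaining Accumulo processes (via CM or your init scripts).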

The Accumulo gc.trash.ignore property should be set to true before restarting the Accumulo service, since the /user/accumulo/.Trash directory is now outside of the encryption zone and files cannot be moved between encryption zones.
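In accumulo-site.xml terms (gc.trash.ignore is the actual Accumulo property; a minimal sketch):

<property>
  <name>gc.trash.ignore</name>
  <value>true</value>
</property>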

Doing the steps in this order avoids the issue in CDH-26178 on certain versions of Hadoop (pre C5.4, I think). Gives more people a chance of avoiding issues if they try this at home on C5.3.

Tradeoffs of the full HDFS encryption approach

Pros:
- Least-effort path forward
- HDFS encryption and KMS can be leveraged by multiple services (skill re-use and operational efficiency)
- HDFS encryption and KMS are generally available
- Leverage all administrative capabilities of HDFS encryption and the Hadoop KMS

Cons:
- Moderate operational risk (see HBase)
- Accumulo downtime during the data move
- Possible operational performance impact

As we saw on the previous slide, using HDFS encryption to protect your entire Accumulo data directory requires little more effort than protecting only the Accumulo key encryption key. However, because the entire Accumulo data directory is being moved and encrypted, the distcp operation is likely to take significantly longer. And you may want to run more tests after encrypting the entire Accumulo directory than you would after encrypting only the Accumulo key encryption key.

One great advantage of using HDFS encryption and KMS is that the skills and operational practices can be used across the set of services that run on top of HDFS rather than having different solutions for Accumulo versus other services. Also, HDFS encryption and KMS are generally available and fully supported today.

The primary risk of immediately moving all of your Accumulo data to an HDFS encryption zone is that HDFS encryption, while generally available and fully supported, is a relatively new feature. It was delivered in Hadoop 2.6, which was released less than a year ago. As a consequence, HDFS and service developers are occasionally, though infrequently thus far, finding that there are subtle interactions between services running in HDFS and HDFS encryption that were not accounted for in the implementation. As a user, you would probably call these subtle interactions bugs, either bugs in HDFS encryption or bugs in the service that heretofore had little impact but which surfaced with the advent of HDFS encryption. And we've seen one such example in HBase recently. But over time, this disadvantage could become an advantage. Because there are a large number of services that run on top of HDFS, and therefore on top of HDFS encryption, it is likely that bugs of this type will surface quickly and that HDFS encryption will achieve maturity quickly.

Another consideration in choosing to move the Accumulo directory into an encryption zone is that there will be longer downtime depending on the amount of data you have stored in Accumulo as all of the data is copied from an unencrypted location to an encrypted location.

There may also be performance impacts, as encryption and decryption are computationally expensive operations. In testing of a sample of HDFS workloads, our teams have seen performance hits of between 4 and 10 percent. However, the performance impact for services running on top of HDFS will vary. For example, nascent testing has shown that some sample HBase workloads take a smaller performance hit than seen on HDFS workloads, because the additional processing overhead of HBase itself makes the overhead added by encryption/decryption operations negligible. Performance testing is ongoing, and we'd definitely welcome offers of real and/or realistic workloads that you're willing to share with us for testing purposes.

(If more detail is requested about the HBase issue:) We've found that, under load, the HBase write-ahead log can fail when running in an HDFS encryption zone due to differing concurrency assumptions between the HBase WAL implementation and HDFS. This is being fixed in an Apache JIRA and, like many concurrency bugs, only occurs in specific circumstances, but certainly issues like this are a risk and cannot be discounted when introducing significant new functionality at the HDFS layer.

https://issues.apache.org/jira/browse/HBASE-13221 (HDFS Transparent Encryption breaks WAL writing)

Other options

- Custom SecretKeyEncryptionStrategy
  - Tighter connection to core Accumulo functionality and release cycle
  - Support for arbitrary key servers
  - But it's easy to get the details of both encryption and key management wrong
  - Arbitrary key server support can also be developed via a custom key provider for the Hadoop KMS
- Native KMS SecretKeyEncryptionStrategy
  - Leverage the administrative functions of the KMS without relying on HDFS encryption

Thank you. Let's talk about keys!