
Data-Centric Security in Hadoop | White Paper

DATA-CENTRIC SECURITY IN HADOOP

Creating a seamless fabric of protection for enterprise Hadoop


Contents

Introduction
Key Security Considerations for Hadoop
Data-Centric Security in Hadoop
  Data-centric security methods
  Coarse-grained security
  Fine-grained security
  Tokenization
  Encryption
  Masking
Comprehensive Security Policy and Authorization
  Data security policy
  Separation of duties (SoD)
Unified Security Administration Across Hadoop, Enterprise, and Cloud
  Support for the Hadoop ecosystem
  Enterprise compatibility and integration
  Policy and key management
  Auditing and visibility
Summary
About Protegrity


Introduction

Leading companies in nearly every industry are using data as an essential new driver of competitive advantage. Big Data platforms such as Hadoop hold the potential to provide visibility into a greater quantity, variety, and complexity of data to gain a better understanding of the business in ways that weren’t previously possible. As such, Big Data projects are hungry for data, and increasingly these projects involve data which is personally identifiable or otherwise considered sensitive according to internal or external regulations.

This document is intended both as a guideline to best practices for Big Data security in Hadoop-based environments and as a technological review of the data security options available, including the Protegrity Big Data Protector.

While some regulations point to specific rules and means of protecting sensitive data, most security practices tend to be dictated by balancing two traditionally opposing factors: the risk of potential threats versus the benefit of accessing and using the data. As organizations leverage Big Data to analyze much larger, more diverse data sets, the challenge of effectively securing sensitive data while maintaining usability becomes increasingly difficult.

The best practices to address all of these concerns can be consolidated into three fundamental categories:

• Data-Centric Security

• Comprehensive Security Policy and Authorization

• Unified Security Administration Across Hadoop, Enterprise, and Cloud

Key Security Considerations for Hadoop

Over time, the massive stores of operational, behavioral and personal data that drive your organization’s development become some of its most valuable and sensitive assets. Like other valuable assets, these assets need to be protected with strong, comprehensive security. However, security must not hinder the analysis and storage advantages of the Hadoop system. Maintaining the usability of data is key to truly extracting full value from Big Data.

While Hadoop provides a strong, low-cost and quickly evolving framework for data management and analysis, its fundamental architecture was not initially designed with security in mind. This causes many companies to rely primarily on access control and authentication, while more mature security options are still at an early stage and see little adoption.

In addition, users across multiple business units often require access to data throughout the Hadoop ecosystem (using a variety of applications) to refine, explore and enrich that data at will, using methods of their own choosing. This not only increases the risk of exposure to unauthorized users, it also further complicates access control and security administration.

Lastly, the external enterprise ecosystem of data and operational systems feeding Hadoop is highly dynamic and can introduce new security threats on a regular basis. Security must be compatible with or extend to and from other enterprise systems to maintain a seamless fabric of protection and to take advantage of economies of scale and efficiencies of a unified approach.


Data-Centric Security in Hadoop

Data-centric security ensures that sensitive data is protected from internal and external threats by securing the data itself, rather than the platform or environment within which it is stored and/or processed. Best practices dictate securing data at the moment of ingestion, so that it remains protected throughout the cluster, at rest, in flight, and in use. The levels, granularity, and types of security vary depending on the solution chosen.

Data-centric security methods

• Coarse-Grained

o Full OS/Disk Encryption

o Extended HDFS File Encryption

• Fine-Grained

o Protegrity Vaultless Tokenization (PVT)

o AES Encryption

o Format-Preserving Encryption (FPE)

o Masking

Coarse-grained security

Some enterprise requirements for data security and compliance for data at rest can be met via coarse-grained security, including OS and file encryption. The Protegrity Big Data Protector supports the ability to encrypt files as they are stored in Hadoop. This protects Hadoop data from being read in the clear outside of standard access controls. Protegrity’s advanced coarse-grained security can also extend native HDFS functionality to provide highly transparent, high-performance file encryption on the node.

However, coarse-grained security can introduce performance and security challenges, as protection is applied on an “all-or-nothing” basis. Because entire files or volumes are encrypted, the entire file must also be decrypted in order to access any data held within it, including innocuous, non-sensitive data. This means that an organization utilizing file encryption must either segment its volumes or files between sensitive and non-sensitive content, or allow users to potentially access sensitive data, even when they do not require it. Neither of these situations is ideal, especially in an environment where a massive amount of diverse data may be involved in a single job, and data is very often stored first and segmented later.

Coarse-Grained Security (File/Volume) | Fine-Grained Security (Field/Column)
Methods: File or volume encryption | Methods: Field/column PVT, encryption or masking
“All or nothing” approach | Data is protected wherever it goes
Secures data only at rest, and files in transit | Business intelligence increases usability

Figure 1: Coarse-Grained vs. Fine-Grained Data Security

Fine-grained security

Protegrity goes beyond coarse-grained security by providing capabilities for different types of fine-grained security for specific fields or data elements. Fine-grained security protects the actual data element rather than entire files or volumes. By doing so, the data is always protected at rest, in transit, and in use. Tokenization, encryption and masking are fine-grained security methods that can protect specific sensitive data within Hadoop.

Fine-grained security within Hadoop can be specified at the column, label, cell, job, or file level, and now even at the row level. Protegrity’s ability to protect Hadoop data at the row level is an industry first. In traditional column-level security, a field is encrypted or tokenized for a particular role or user. Row-level security adds the capability to determine whether entire records may be viewed by a role or user, in addition to specific fields of data. This type of functionality is ideal for multi-tenant situations in Hadoop, where it isolates the data of different tenants within the same Hadoop installation.

Protegrity data-centric security leverages the architecture of Hadoop to provide the highest performance encryption and tokenization available. By utilizing the parallel processing architecture of Hadoop, Protegrity scales with every node and distributes the workload across the cluster via its own cluster management capability.

Tokenization

Tokenization is the reversible process of replacing data with inert, random values of the same type and length. This is done either through a 1:1 token lookup vault (traditional, vault-based tokenization), where both the original and token values are securely stored in pairs, or through a vaultless approach using progressive block replacement and a static token table (Protegrity Vaultless Tokenization, or PVT), where no sensitive data is stored at all. Whereas the vault-based approach creates an ever-growing database of token pairs which drags on performance in lookups and replication, PVT maintains high performance with a tiny fraction of the footprint.
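To make the vault-based variant of tokenization described above concrete, the following is a minimal Python sketch. It is a toy under stated assumptions, not Protegrity's implementation and not PVT: the class name `TokenVault` and its methods are hypothetical, and the vault is an in-memory dictionary rather than a secured store. It shows the two properties the text names (tokens keep the type and length of the original, and the process is reversible) and why the vault grows with every new value.

```python
import secrets
import string


class TokenVault:
    """Toy vault-based tokenizer (hypothetical; not Protegrity's implementation)."""

    def __init__(self):
        # The vault: every original value is stored next to its token.
        self._token_by_value = {}
        self._value_by_token = {}

    @staticmethod
    def _random_like(value):
        # Swap digits for random digits and ASCII letters for random letters of
        # the same case; separators stay put, so type and length are preserved.
        out = []
        for ch in value:
            if ch in string.digits:
                out.append(secrets.choice(string.digits))
            elif ch in string.ascii_lowercase:
                out.append(secrets.choice(string.ascii_lowercase))
            elif ch in string.ascii_uppercase:
                out.append(secrets.choice(string.ascii_uppercase))
            else:
                out.append(ch)
        return "".join(out)

    def tokenize(self, value):
        # Known values reuse their stored token (1:1 mapping).
        if value in self._token_by_value:
            return self._token_by_value[value]
        # Retry a bounded number of times if the random token is already taken.
        for _ in range(100):
            token = self._random_like(value)
            if token not in self._value_by_token:
                self._token_by_value[value] = token
                self._value_by_token[token] = value
                return token
        raise RuntimeError("could not find an unused token")

    def detokenize(self, token):
        # Reversal is a lookup in the stored pairs.
        return self._value_by_token[token]

    def size(self):
        # One stored pair per distinct value: the vault only ever grows.
        return len(self._value_by_token)
```

Every distinct value adds one pair to the vault, which is the growth that the vault-based approach has to carry through lookups and replication.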

Identifier | Clear | Protected | Authorized Role 1 (can see most data) | Authorized Role 2 (can see limited data)
Name | Joe Smith | csu wusoj | Joe Smith | Joe Smith
Address | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, HA | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, HA
Date of Birth | 12/25/1966 | 01/02/1966 | 12/25/1966 | 01/02/1966
Social Security | 076-39-2778 | 478-39-8920 | 076-39-2778 | 478-39-2778
Credit Card | 3678 2289 3907 3378 | 3846 2290 3371 3890 | 3846 2290 3371 3378 | 3846 2290 3371 3890
Email | [email protected] | [email protected] | [email protected] | [email protected]
Telephone | 760-278-3389 | 998-389-2289 | 760-278-3389 | 998-389-2289

Figure 2: Data tokenized differently based on user role


While PVT technology provides proven, robust data security and significant performance gains over vault-based tokenization or Format-Preserving Encryption, it also allows for seamless analysis of non-sensitive data with no additional processing overhead.

In cases where some sensitive data is required in the clear for analysis, PVT also provides the capability to protect only part of the sensitive data, exposing only what’s necessary for business intelligence. Sensitive data can also be exposed only to authorized processes, and never revealed to the user, in cases where individual identity is not needed in the result set.
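Partial protection of this kind can be sketched as follows. This is an illustration only: `toy_tokenize` is a hypothetical stand-in that substitutes digits through a fixed table, which is far weaker than any real tokenization scheme and is not PVT, and `protect_except_last` is a name invented for this example. The sketch tokenizes every digit except the trailing ones that the business needs in the clear.

```python
# Hypothetical static digit table, used only to make the example deterministic.
DIGIT_TABLE = str.maketrans("0123456789", "7092845136")


def toy_tokenize(text):
    # Stand-in for a real tokenizer: digits are substituted, other characters kept.
    return text.translate(DIGIT_TABLE)


def protect_except_last(value, keep_last, tokenize):
    """Tokenize all digits of `value` except the last `keep_last` digits."""
    digit_positions = [i for i, ch in enumerate(value) if ch.isdigit()]
    clear = set(digit_positions[-keep_last:]) if keep_last > 0 else set()
    return "".join(
        ch if i in clear else tokenize(ch) for i, ch in enumerate(value)
    )


# The last four digits stay in the clear; everything before them is replaced.
print(protect_except_last("3678 2289 3907 3378", 4, toy_tokenize))
```

The same call with a Social Security number keeps only its last four digits readable, which matches the kind of partially revealed value shown for the authorized roles in Figure 2.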

Encryption

Encryption technology utilizes mathematical algorithms and cryptographic keys to translate data into binary ciphertext. It is reversible only using the correct key with the correct algorithm. There are many forms of data encryption, with various key strengths and other options available. Because encrypted (ciphertext) output is binary data and looks nothing like the original cleartext, it may require changing the data type for the field.

In some cases, encryption is acceptable for fine-grained security for Hadoop, but typically tokenization through PVT enables higher performance and fewer schema modifications, especially in structured applications, such as Hive tables.
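The point about data types can be illustrated with a short Python sketch. Because the Python standard library does not include AES, the example swaps in a toy stream cipher built from SHA-256; it is not AES, it omits a nonce, and it is not safe for real use. The function names are hypothetical. What it demonstrates is only the behavior described above: the output is binary, it is recovered only with the right key, and storing it in a text column requires an encoding that no longer fits the original field.

```python
import base64
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    # Hash key + counter repeatedly until enough pseudo-random bytes exist.
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return stream[:length]


def toy_encrypt(key: bytes, cleartext: str) -> bytes:
    # XOR the UTF-8 bytes with the keystream; the result is arbitrary binary.
    data = cleartext.encode("utf-8")
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def toy_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    # XOR with the same keystream restores the original bytes.
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))


ciphertext = toy_encrypt(b"demo-key", "076-39-2778")
# An 11-character field becomes 11 raw bytes, or 16 characters once
# Base64-encoded for storage in a text column.
encoded = base64.b64encode(ciphertext).decode("ascii")
```

A column declared for an 11-character identifier therefore has to become a binary column or a wider string column, which is the schema change that tokenization avoids.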

Masking

Masking alters the cleartext data to create values that often look much like the original, but contain no real data. It is ideal for and used frequently in development or test scenarios/implementations. However, because obfuscated data cannot be reversed to return to the original data, masking is not suitable for most production implementations.

Comprehensive Security Policy and Authorization

Any comprehensive security solution for Hadoop must provide a breadth of security functionality with policy definitions that allow for specific criteria on how data is represented, secured, and accessed by a consuming application, user, or role. These policies can dictate what data can be seen, how it is protected, and who has access, among other criteria.

Data security policy

The Protegrity Enterprise Security Administrator (ESA) provides administrators with deep visibility into the security administration process, as required for auditing purposes. Security administrators have the flexibility to define security policies for a database, table and column, or for a file, and to administer permissions for specific LDAP-based groups or individual users.

Typically, data security policy defines:

• HOW data is secured

• WHO has access to data

• WHAT data may be accessed

• WHICH systems/applications allow access


Rules based on dynamic conditions such as time (WHEN) or geography (WHERE) can also be added to an existing policy rule. Some roles may see all sensitive data while others may see tokenized, partially revealed, or masked data. For instance, in applications where aggregate data is important, elements such as “year” may be revealed while tokenizing “month” and “day”.
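One way to picture such a policy is as a list of rules evaluated per field. The Python sketch below is a simplified assumption, not the ESA policy model: the `Rule` fields, the action names and the digit-substitution `tokenize` stand-in are all hypothetical. It covers WHO, WHAT, HOW and WHICH, an optional WHEN window, and the "year revealed, month and day tokenized" case mentioned above; fields with no matching rule default to tokenized.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional

# Hypothetical stand-in tokenizer: fixed digit substitution, illustration only.
_TABLE = str.maketrans("0123456789", "7092845136")


def tokenize(value):
    return value.translate(_TABLE)


@dataclass(frozen=True)
class Rule:
    role: str                      # WHO has access
    field: str                     # WHAT data may be accessed
    action: str                    # HOW it is shown: "clear", "year_only", "tokenize"
    application: str               # WHICH system allows access
    start: Optional[time] = None   # WHEN (optional time window)
    end: Optional[time] = None


def _applies(rule, role, field, application, now):
    in_window = rule.start is None or rule.end is None or rule.start <= now <= rule.end
    return (rule.role, rule.field, rule.application) == (role, field, application) and in_window


def _render(rule, value):
    if rule.action == "clear":
        return value
    if rule.action == "year_only":
        # Expects MM/DD/YYYY: reveal the year, tokenize month and day.
        month, day, year = value.split("/")
        return f"{tokenize(month)}/{tokenize(day)}/{year}"
    return tokenize(value)


def view(record, role, application, now, rules):
    out = {}
    for field, value in record.items():
        rule = next((r for r in rules if _applies(r, role, field, application, now)), None)
        # Default when no rule matches: the field stays tokenized.
        out[field] = _render(rule, value) if rule else tokenize(value)
    return out
```

With a `year_only` rule on a date-of-birth field, an analyst can still aggregate by year while the month and day remain protected.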

Figure 3: Comprehensive data security policy administration

Separation of duties (SoD)

The term “separation of duties” refers to the separation or segregation of functions for the security officers, who have control over the data security policy (including granting access to sensitive data), from systems administrators who work with or manage environments containing sensitive data, and who may or may not require access to the data for job functions.

When deployed properly, SoD will not prevent the systems administrators or IT staff from performing their jobs of managing different aspects of enterprise IT, data flows and data environments.

Hadoop does not natively support a separation of duties, but as a security best practice, every environment containing sensitive data must provide it. Protegrity provides this essential functionality to Hadoop through the Enterprise Security Administrator.

Security officers control access to sensitive data through the data security policy set in ESA, preventing unauthorized technologists such as DBAs, programmers, or system engineers from seeing sensitive data in the clear. Security officers can also be prevented from viewing the data in the clear, when their roles do not require it, as part of the SoD objective.
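The separation described here can be expressed as a few simple checks. The Python sketch below is a hypothetical model, not how ESA enforces SoD: the role names and function names are assumptions made for illustration. It captures three ideas from the text: policy is edited only by security officers, cleartext access comes only from an explicit policy grant rather than from an administrative role, and a user who holds both kinds of role is flagged.

```python
SECURITY_ROLES = {"security_officer"}
SYSTEM_ROLES = {"dba", "programmer", "system_engineer"}


def sod_violations(assignments):
    """Return users who hold both a policy role and a system role."""
    return sorted(
        user
        for user, roles in assignments.items()
        if roles & SECURITY_ROLES and roles & SYSTEM_ROLES
    )


def can_edit_policy(roles):
    # Only security officers control the data security policy.
    return bool(roles & SECURITY_ROLES)


def can_read_clear(user, field, clear_grants):
    # Cleartext access comes only from an explicit policy grant, never from
    # holding an administrative role on the cluster.
    return field in clear_grants.get(user, set())
```

Under this model a DBA can keep managing the environment while still seeing only protected values, and a security officer without a grant is treated the same way.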


Unified Security Administration Across Hadoop, Enterprise, and Cloud

The complexity of business processes across Hadoop, and of the technologies that support these processes, imposes many challenges when applying data security solutions. When extending policies beyond Hadoop, Protegrity delivers the broad interoperability with various databases, applications, operating systems, the cloud, and other platforms that is essential to solving critical and complex enterprise data security challenges.

Administrators can use Protegrity to define centralized security policy for the following components:

• Apache Hadoop

o HDFS

o YARN

o Hive

o HBase

o Pig

• Outside Hadoop

o Enterprise Applications

o Databases

o File Systems

o Mainframes

o Cloud Applications

o Cloud Storage

Support for the Hadoop ecosystem

Protegrity was the first vendor to develop a comprehensive data protection path for Apache Hadoop. Protegrity provides full integration and support for nearly every component in the Hadoop ecosystem, with protection agents installed on each node. As new libraries and APIs emerge for Hadoop, the Big Data Protector is also fully extensible to utilize these frameworks and services.

Figure 4 illustrates a sample protected node in a Hadoop cluster.


Figure 4: Securing data throughout the Hadoop ecosystem

OS/File System: The Protegrity Big Data Protector can be used to encrypt files as they are landing in HDFS (after Hadoop has performed file redundancy splitting across the nodes). This coarse-grained protection can be in the form of file encryption, or the entire volume can be encrypted.

Extended HDFS Encryption (EHE): As an optional added feature, the Big Data Protector can leverage a native HDFS CODEC to encrypt files within HDFS. EHE greatly reduces the complexity compared to OS file protection systems, while facilitating faster, easier delivery of decrypted files to users and processes. Users requesting files from HDFS are instead subject to the Protegrity data security policy. EHE also solves many of the previous issues with securing multi-tenant environments and securing areas within HDFS for staging or containment.

MapReduce: The Big Data Protector includes a Java interface to MapReduce to deliver protection and enforcement capabilities within the framework. This functionality is designed for delivering fine-grained data security, to protect or unprotect data at the field or cell level.

Hive, Pig, etc.: Products such as Hive, Pig, Impala, HAWQ, Sqoop, Flume, and other frameworks that access or manipulate data can use Protegrity User Defined Functions (UDFs) for fine-grained protect or un-protect operations.
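The idea of protecting a single field inside a processing step can be sketched in a few lines of Python. This is not the Big Data Protector's Java interface or one of its UDFs, whose names and signatures are not given here; `protect_column` and the digit-substitution `toy_protect` are hypothetical stand-ins. The sketch applies a protect function to one column of delimited records as they stream through a map-style step and leaves every other field untouched.

```python
import csv
import io

# Hypothetical stand-in for a protect operation: fixed digit substitution.
_TABLE = str.maketrans("0123456789", "7092845136")


def toy_protect(value):
    return value.translate(_TABLE)


def protect_column(lines, column, protect, delimiter=","):
    """Yield each delimited record with one column replaced by its protected form."""
    for line in lines:
        if not line.strip():
            continue  # skip blank input lines
        row = next(csv.reader([line], delimiter=delimiter))
        row[column] = protect(row[column])
        buf = io.StringIO()
        csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerow(row)
        yield buf.getvalue().rstrip("\n")


records = ["1,Joe Smith,076-39-2778", "2,Ann Lee,478-39-8920"]
protected = list(protect_column(records, 2, toy_protect))
```

Because only the sensitive column changes, the remaining fields can still be joined, filtered and aggregated without any unprotect step.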

The Protegrity Big Data Protector is currently available for and certified on many distributions of Apache Hadoop, including Hortonworks Data Platform, Cloudera Distribution for Hadoop, MapR Distribution including Apache Hadoop, IBM BigInsights, and Pivotal HD.

Fine-grained protection persists while data is in use and during analytical processing. Business intelligence can be bled through, so that users and processes that require only part of the original data can still function. The API in each of these programs can also be augmented to provide protect/un-protect functions.

Kerberos authentication verifies the identity of users and processes, so that only authorized users and processes can access data.

Stored sensitive data can be protected with coarse-grained file or volume security (including Extended HDFS Encryption), or with fine-grained cell-level security (such as Protegrity Vaultless Tokenization). Only the data security policy administered by security officers determines access to sensitive data in the clear. This multi-faceted approach provides the best means to protect the data from all threats, including internal rogue privileged users.

Figure 4 components: Kerberos (Authentication); Other Programs (Hive, Pig, Sqoop, Flume, HBase, etc.); MapReduce; HDFS; OS

Enterprise compatibility and integration

Rarely is a single environment or application responsible for an entire data flow or operational process. The data lifecycle involves numerous users, applications, and storage platforms with multiple heterogeneous systems common in most enterprises.


Delivering a broad interoperability with various applications, systems and platforms, Protegrity meets the challenges of cross-platform security through a robust security platform which protects data throughout the entire enterprise ecosystem with consistent security and policies.

Protegrity provides unified policy, security and auditing with vital flexibility in many areas, including the following:

• Wide breadth of platform coverage: Extensive interoperability with a large variety of applications, databases, operating systems and platforms, including cloud and big data.

• Cross-platform consistency: Secure data on one system (e.g. Teradata), send it securely to a target system from a different vendor (e.g. Hortonworks), and deliver unprotected data on the target system to authorized users.

Policy and key management

Protegrity provides security administrators with the ability to manage keys and authorization policies for enterprise key management (EKM). Protegrity customers also have the flexibility to leverage an open source key management store or use EKM solutions provided by others.

The Protegrity Data Security Platform components: Application Protectors; Database Protectors; MPP Protectors; Big Data Protectors; File Protectors; File Protector Gateway/Vault; Protegrity Cloud Gateway; IBM Mainframe Protectors; Enterprise Security Administrator

Figure 5: Centralized data security across heterogeneous environments

Protegrity also provides a path to extend enterprise-wide policies and fine-grained data security to the cloud, via the Protegrity Cloud Gateway, which secures data before it is sent to the cloud, including cloud storage, applications and third-party SaaS platforms.

By employing “configuration-over-programming” architecture, Protegrity can enable security for a variety of proprietary SaaS platforms and applications without requiring heavy vendor intervention for updates or platform changes.

Protegrity gateways can also be installed “inline” inside the enterprise to segregate security functions from critical business infrastructure by protecting the data as it enters or leaves various applications and platforms. This allows organizations to utilize more high-cost IT infrastructure for high-value jobs and processes without the need to spend valuable processing power on security.


Auditing and visibility

As customers deploy Hadoop into corporate data and processing environments, metadata and data governance become vital parts of any enterprise-ready implementation. This makes it possible for users to gain a comprehensive view of data lineage and access audits, with the ability to query and filter audit records based on data classification, users or groups, and other filters.

Monitoring and auditing provide the first and last lines of defense in detecting abnormal and potentially malicious activity. They are also among the few ways to protect against an insider attack from an authorized user. For scenarios requiring centralized auditing of the enterprise, Protegrity also provides enhanced audit and sophisticated reporting capabilities to achieve a complete view of enterprise security, whether within Hadoop, a local enterprise asset, or a resource in the cloud. This allows organizations to take advantage of economies of scale and a unified repository of audit logs and alerts to analyze across all enterprise systems.

The heterogeneous capabilities of the Protegrity Data Security Platform are unique to the industry and allow customers to consistently manage and administer security across a variety of environments, minimizing disruption of business processes while maximizing portability of the data, regardless of where it is used.

Summary

To properly protect their Hadoop environments, organizations must adopt a holistic, comprehensive approach to security. This includes multiple layers of protection which, when properly combined, provide both strong, meaningful security and the ability to serve up data to those users and applications which need it to drive the business forward.

To help you on your journey, here is a short security checklist:

• Implement native access controls if you have not yet done so.

• Protect the data itself with fine-grained security, including encryption or tokenization.

• Install flexible data security policies, which provide granular control on data access rights and complement governance policies, to ensure sensitive data is only accessed as needed.

• Utilize centralized security administration tools with heterogeneous compatibility across the enterprise and the cloud, for more consistent, efficient security management and auditing.

Following this list of best practices will enable strong, flexible security for your Hadoop implementation, and provide seamless protection across the entire Hadoop ecosystem and beyond, to enterprise IT infrastructure, applications, and the cloud.


About Protegrity: Proven experts in data security

Protegrity is the only enterprise data security software platform that leverages scalable, data-centric encryption, tokenization and masking to help businesses secure sensitive information while maintaining data usability. Built for complex, heterogeneous business environments, the Protegrity Data Security Platform provides unprecedented levels of data security certified across applications, data warehouses, mainframes, big data, and cloud environments. Companies trust Protegrity to help them manage risk, achieve compliance, enable business analytics, and confidently adopt new platforms.

Protegrity is headquartered in Stamford, Connecticut USA, with regional offices around the world. For additional information visit www.protegrity.com or call 1.203.326.7200.

Copyright © 2016 Protegrity Corporation. All rights reserved. Protegrity® and the Protegrity logo are trademarks of Protegrity Corporation. All other trademarks are property of their respective owners.

Corporate Headquarters
Protegrity USA, Inc.
5 High Ridge Park, 2nd Floor
Stamford, CT 06905
Phone: +1.203.326.7200

www.protegrity.com