Data Lake: Powering Security Insights
Abstract
The high tech industry is transitioning from
traditional IT systems to a pool of integrated and
loosely-coupled infrastructure and software
components that generate huge amounts of data
on a continual basis. Enterprises use metrics
based on this data to address non-functional
system requirements, and provide actionable
insights into their capacity and performance
needs. While organizations focus on improving
functional integration between IT systems, they
tend to overlook the performance, capacity, and
security metrics generated by the systems,
and the correlation between these metrics
across systems.
This white paper proposes an approach to
building a data lake that can be used to capture
various non-functional metrics across IT systems
and establish correlations among those metrics.
Such an approach enables the development of an
ecosystem where capacity or performance
constraints of a certain application can help
isolate related applications and infrastructure
components that are likely to be impacted. In
addition, it provides operational insights to
improve system performance and availability by
proactively addressing constraints. The result: a
future-ready, agile enterprise IT environment that
can support superior performance and security.
Why Traditional Data Architecture
Falls Short
As high tech organizations undertake transformation using
disruptive digital technologies, Big Data analytics and actionable
insights play a critical role in effective decision making.
Traditional data architecture includes data warehouses built
upon data marts that are designed to support specific business
decisions. Such architecture is not amenable to dynamic
changes in the data structure to enable effective decision
making. Most traditional systems are also business transaction
focused and not geared to drive IT decisions.
The introduction of unstructured data sets and Big Data
platforms into decision making systems calls for sophisticated
decision support systems that can integrate heterogeneous and
diverse data structures and sources. Such systems provide
additional insights through contextual information for effective
decision making across the business, improving performance
and ensuring the security and reliability of IT systems.
Non-functional requirements play a critical role in IT decision
making with respect to the performance, availability, reliability,
security, and compliance of IT systems. Some critical questions
that need to be answered are:
• Are users satisfied with the latency metrics across mission-critical systems? What does the user satisfaction score tell us? (A latency scoring sketch follows below.)
• Is storage consumption and tiering handled in a cost-effective way?
• Are there any vulnerabilities in the IT environment? What are the cost implications of such vulnerabilities?
• Which set of users or infrastructure assets poses a business risk in terms of security and unavailability?
The right blend of data warehouse and data lake solutions
provides answers to these questions.
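To illustrate the first question, a user satisfaction score can be derived from raw latency samples with an Apdex-style calculation; the sketch below is illustrative only, and the 500 ms target threshold is a hypothetical value that would be tuned per application.

# Apdex-style satisfaction score from latency samples (illustrative sketch;
# the threshold is an assumed target, not a recommendation).
def satisfaction_score(latencies_ms, threshold_ms=500):
    """Return a score in [0, 1]: satisfied requests count fully, tolerating ones half."""
    if not latencies_ms:
        return 0.0
    satisfied = sum(1 for l in latencies_ms if l <= threshold_ms)
    tolerating = sum(1 for l in latencies_ms if threshold_ms < l <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms)

# Example: mostly fast responses with two slow outliers
print(satisfaction_score([120, 300, 450, 900, 2500, 150]))  # 0.75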
Using Data Warehousing Systems for Non-Functional Requirements
The Limitations
Most monitoring and management data warehouse systems
accumulate a set of capacity, performance, availability, and
security metrics, and provide reporting and dashboard
capabilities to effectively visualize the data. Such systems are
capable of providing historical reports and future projections
based on past trends, and are built using extract, transform,
load (ETL) solutions (see Figure 1).
Figure 1: ETL Solution-based System for Systems Monitoring and Management
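As a minimal sketch of the ETL pattern in Figure 1, the example below assumes a hypothetical CSV export from a monitoring tool and a local SQLite store; the file name, table, and column names are all illustrative.

# Minimal ETL sketch (illustrative): extract capacity metrics from a CSV export,
# apply transformation logic (schema mapping, unit conversion), load into SQLite.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation logic is fixed up front: schema and units must be known
    # before loading, which is what makes later changes costly.
    return [
        (r["host"], r["metric"], float(r["value_kb"]) / 1024, r["timestamp"])
        for r in rows
    ]

def load(records, db="metrics.db"):
    con = sqlite3.connect(db)
    con.execute(
        "CREATE TABLE IF NOT EXISTS capacity "
        "(host TEXT, metric TEXT, value_mb REAL, ts TEXT)"
    )
    con.executemany("INSERT INTO capacity VALUES (?, ?, ?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("storage_export.csv")))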
While most of these systems work well in silos, leveraging them
to generate correlated metrics across the enterprise requires
significant investment in integration effort and cost. The
integration involves incorporating data transformation logic and
schemas into the systems, which increases complexity and limits
the flexibility to modify them later.
The Upside
For certain non-functional IT metrics, especially security
domain metrics, organizations require real-time feeds from
unstructured data sources to drive real-time decision making
and to categorize and isolate anomalies. This is where a data
lake comes in.
Figure 2: Data Lake System with Big Data Platform for
IT Security Metrics
The major difference between a data lake and a standard
ETL-based system lies in the way data is ingested. A data lake
leverages a Big Data platform to ingest the data into the
system in native format, rather than perform heavy
transformation logic prior to loading (see Figure 2).
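A hedged sketch of that difference, using PySpark as an example Big Data platform and assuming JSON-formatted audit log lines; all paths and field names are hypothetical. Raw logs land in the lake unaltered, and a schema is applied only when the data is read.

# Schema-on-read sketch with PySpark (paths and fields are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Ingest: copy raw security/audit logs into the lake in native format,
# with no upfront transformation logic.
raw = spark.read.text("/landing/security_audit/*.log")
raw.write.mode("append").text("/lake/raw/security_audit/")

# Read: apply a schema only at analysis time, so the schema can evolve freely.
schema = StructType([
    StructField("user", StringType()),
    StructField("event", StringType()),
    StructField("latency_ms", DoubleType()),
])
events = (
    spark.read.text("/lake/raw/security_audit/")
    .select(from_json(col("value"), schema).alias("e"))
    .select("e.*")
)
events.groupBy("event").count().show()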
Data lakes are particularly suitable for aggregated analysis of
real-time streaming log sources. This helps provide a
consolidated view of system health across various technology
platforms, along with a correlation mechanism to monitor,
troubleshoot, and remediate issues across the system
landscape. The capabilities of such data lake systems can
be enriched by increasing the quantity and variety of
log information.
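As an illustration of such real-time aggregation, the sketch below uses Spark Structured Streaming to compute a rolling error count per platform from streaming log lines; the Kafka broker, topic, and field names are assumptions, not a prescribed setup.

# Streaming health view sketch (Spark Structured Streaming; topic/fields assumed).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("log-health").getOrCreate()

schema = StructType([
    StructField("platform", StringType()),
    StructField("level", StringType()),
    StructField("ts", TimestampType()),
])

logs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "it-logs")                     # hypothetical topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("l"))
    .select("l.*")
)

# Consolidated health view: errors per platform over 5-minute windows.
health = (
    logs.filter(col("level") == "ERROR")
    .groupBy(window(col("ts"), "5 minutes"), col("platform"))
    .count()
)

health.writeStream.outputMode("update").format("console").start().awaitTermination()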
Building an Integrated Data Lake for IT and
Security Analytics
Before diving into a data lake, it is important to note that data
warehouse-based monitoring systems have built-in domain
capabilities, such as transformation logic written using domain
knowledge, that are hard to ignore. In addition, enterprises need
to enable advanced correlation and analytics capabilities using
multiple heterogeneous data sources. To achieve this,
enterprises can use an integrated data lake that combines the
advanced technology capabilities of a Big Data platform with
the domain capabilities of traditional ETL-based systems
(see Figure 3).
Figure 3: Integrated Data Lake: Combining Big Data Platform with
Domain Capabilities of ETL-based Solutions
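One way to read Figure 3 in code, as a sketch only: raw data stays in the lake in native format, while transformation logic already proven in the ETL system is re-expressed as a job over the lake copy. The severity-to-risk-tier rule below is a made-up stand-in for real domain logic, and the paths are hypothetical.

# Sketch: reuse domain transformation logic over lake data (rule is hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("integrated-lake").getOrCreate()

# Raw events were ingested unaltered; domain logic is applied downstream.
events = spark.read.parquet("/lake/raw/security_events/")   # path assumed

# Domain rule carried over from the ETL system: classify severity into risk tiers.
classified = events.withColumn(
    "risk_tier",
    when(col("severity") >= 8, "critical")
    .when(col("severity") >= 5, "high")
    .otherwise("standard"),
)

# Publish the refined view for the existing reporting layer to consume.
classified.write.mode("overwrite").parquet("/lake/refined/security_events/")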
In this approach, a data lake for IT and security metrics would
leverage the existing monitoring systems and ingest the data
into the data lake, along with contextual feeds from a variety
of new data sources. These include configuration management
database (CMDB), IT service management systems (ITSM),
social media feeds, and so on. For the new data sources,
enterprises can leverage the domain capabilities of an existing
ETL solution, such as operations management or storage
resource management software, for advanced reporting
and analytics.
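A sketch of the contextual enrichment described above, with all paths and column names assumed for illustration: metrics already collected by the monitoring system are joined in the lake with CMDB and ITSM extracts to add ownership and incident context.

# Contextual enrichment sketch (paths and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("context-join").getOrCreate()

metrics = spark.read.parquet("/lake/raw/monitoring_metrics/")   # host, metric, value
cmdb    = spark.read.parquet("/lake/raw/cmdb_assets/")          # host, owner, app
itsm    = spark.read.parquet("/lake/raw/itsm_incidents/")       # host, open_incidents

# Correlate capacity/performance metrics with asset ownership and incident load.
enriched = (
    metrics.join(cmdb, "host", "left")
           .join(itsm, "host", "left")
)
enriched.write.mode("overwrite").parquet("/lake/refined/metrics_with_context/")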
One way to assess the applicability of an integrated data lake
for augmenting storage resource management software would
be to build a proof-of-concept (PoC) (see Figure 4). Hadoop is
an ideal Big Data platform for integrating heterogeneous data
from backup devices, ITSM systems, and CMDBs. At the same
time, it can access virtualization platform, security audit, and
HTTP server logs, as well as user account data. Customized data
marts can be created to address security, backup, and incident
related problem statements. Companies can also create
advanced reports using these data marts in storage resource
management software to demonstrate additional capabilities
that can be introduced with the help of a Big Data platform.
Figure 4: Conceptual View of Integrated Data Lake Augmenting
Storage Resource Management Software
(Figure 4 depicts data sources such as application logs, application metrics, and ITSM/ServiceNow metrics feeding a data pipeline and landing zone into a Big Data store that holds the original, unaltered data. Data transformation then produces refined, trusted, and metadata layers; a data warehouse engine hosts the data lake mart alongside data warehouse marts; and analytics tools provide query, reporting, visualization, and analytics on top.)
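As a hedged sketch of how one of the customized data marts mentioned above might be expressed on a Hadoop platform, the Spark SQL below builds a small security mart counting failed logins per user, joined with CMDB criticality; the table and column names are illustrative, not the actual PoC schema.

# PoC-style data mart sketch (Spark SQL over Hive tables; names are hypothetical).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("security-mart")
    .enableHiveSupport()
    .getOrCreate()
)

# Security mart: failed logins per user, joined with CMDB criticality,
# materialized as a table the reporting layer can query directly.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mart_failed_logins AS
    SELECT a.user_id,
           c.asset_criticality,
           COUNT(*) AS failed_attempts
    FROM   raw_security_audit a
    JOIN   raw_cmdb c ON a.host = c.host
    WHERE  a.event_type = 'LOGIN_FAILURE'
    GROUP  BY a.user_id, c.asset_criticality
""")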
Some Use Cases
An integrated data lake leverages Big Data analytics to gain
meaningful insights that address specific business challenges
such as cost optimization, simplification, and security. All
analysis and recommendations can be presented in the context
of associated cost implications. For example, security controls
might vary based on information risk categorization. Additional
security measures can be adopted after evaluating cost against
the value of information. This helps enterprises make effective
business decisions in terms of what it would take to achieve
the desired measures and avoid unnecessary changes.
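A minimal illustration of that cost-versus-value trade-off, using entirely hypothetical figures: an additional control is justified only when the expected loss it prevents exceeds what it costs.

# Illustrative cost-vs-value check for an additional security control.
# All figures are hypothetical placeholders, not benchmarks.
annual_breach_probability = 0.05      # estimated likelihood without the control
information_value = 2_000_000         # business value of the protected data
control_effectiveness = 0.60          # fraction of expected loss the control removes
control_annual_cost = 40_000

expected_loss = annual_breach_probability * information_value   # 100,000
loss_avoided = expected_loss * control_effectiveness            # 60,000

print("Adopt control" if loss_avoided > control_annual_cost else "Skip control")
# -> "Adopt control" (60,000 avoided > 40,000 cost)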
An integrated data lake approach can be applied to all use
cases of traditional ETL-based and modern data lake systems.
In addition, it addresses scenarios such as the following:
1. Capacity consumption: Optimize capacity consumption with
tiering recommendations derived from machine learning or
Big Data analytics.
2. Performance anomalies and recommended actions: Detect
performance anomalies in the current environment and use
machine learning to recommend remedial measures, supported
by Big Data reporting (see the sketch after this list).
3. User satisfaction scores for mission-critical systems: Store
end user feedback on Big Data platforms and correlate the
feedback to data from the enterprise environment.
4. Advanced correlation for threat detection: Use machine
learning to model threats across correlated data sources,
building future-ready threat detection capabilities.
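As a sketch of the anomaly detection mentioned in use case 2, the example below applies scikit-learn's IsolationForest to per-host performance metrics exported from the lake; the feature set, sample values, and contamination rate are assumptions for illustration.

# Performance anomaly detection sketch (IsolationForest; inputs are assumed).
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-host features exported from the lake:
# [cpu_utilization_%, io_latency_ms, error_rate_per_min]
metrics = np.array([
    [35, 4.1, 0.2],
    [40, 3.8, 0.1],
    [38, 4.5, 0.3],
    [92, 48.0, 7.5],   # a constrained host that should stand out
    [36, 4.0, 0.2],
])

model = IsolationForest(contamination=0.2, random_state=42).fit(metrics)
labels = model.predict(metrics)   # -1 marks anomalies, 1 marks normal hosts

for row, label in zip(metrics, labels):
    if label == -1:
        print("Anomalous host metrics:", row)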
The Data Lake: A Critical Technology
Component for Long-term Strategic
Decision-making
With the increasing focus on leveraging data-driven analytics
for competitive advantage, it is critical to build a data lake that
addresses some of the key aspects of non-functional
requirements such as capacity, performance, and security.
While acting as a bridge that transitions customers from
traditional legacy monitoring systems to advanced digital IT
platforms, the IT and security data lake also serves as a critical
technology component for long-term strategic decision-making.
However, to realize its true value, a data lake must be part of a
carefully architected end-to-end platform that integrates
different data sources and leverages machine learning
algorithms for superior business decisions.
All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2018 Tata Consultancy Services Limited
About Tata Consultancy Services Ltd (TCS)
Tata Consultancy Services is an IT services, consulting and business solutions
organization that delivers real results to global business, ensuring a level of
certainty no other firm can match. TCS offers a consulting-led, integrated portfolio
of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the
benchmark of excellence in software development. A part of the Tata Group,
India’s largest industrial conglomerate, TCS has a global footprint and is listed on
the National Stock Exchange and Bombay Stock Exchange in India.
For more information, visit us at www.tcs.com
About The Authors
Ankur Srivastava
Ankur Srivastava is an
Enterprise Architect with TCS’
HiTech business unit and he
currently heads the HiTech
Solutions Lab. With 13 years of
experience, Srivastava drives
the cybersecurity and cloud
infrastructure initiatives within
the unit and is responsible for
developing differentiated digital
automation solutions in the
cloud and cybersecurity space.
He has a Master of Technology
degree in Software Systems
with specialization in Data
Analytics from the Birla
Institute of Technology and
Science, Pilani, India.
Ashish Pandey
Ashish Pandey is a Solution
Developer with TCS’ HiTech
business unit. He has over
seven years of experience and
focuses on developing new
solutions leveraging Big Data
platforms and technologies.
Pandey holds a Bachelor of
Technology degree in Computer
Science and Engineering from
Uttar Pradesh Technical
University, Lucknow, India.
Contact
Visit the HiTech page on www.tcs.com
Email: [email protected]
Subscribe to TCS White Papers
TCS.com RSS: http://www.tcs.com/rss_feeds/Pages/feed.aspx?f=w
Feedburner: http://feeds2.feedburner.com/tcswhitepapers