
Babu Analakkat
Tata Consultancy Services Ltd.

GENERATING VALUE FROM BIG DATA


Table of Contents

Data Explosion – A Flashback
Big Data Overview
Big Data – Problem or Opportunity?
Layers of Big Data
Big Data Analytics – Introduction to Hadoop
Stepping into Data Analytics – A Few Guidelines
Challenges of Big Data
Big Data Use Cases
Latest Trends in Big Data
Common Myths about Big Data
Conclusion

Disclaimer: The views, processes, or methodologies published in this article are those of the

author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.


Data Explosion – A Flashback

Many years ago, data were recorded on clay discs and stones. More recently, the invention of the computer enabled data to be stored on magnetic discs, allowing it to be retrieved, shared, and reused for purposes not imagined when it was collected. When the computer first came on the scene, only computer operators and company staff generated data, and the volumes were modest. With the evolution of the Internet, many more users started generating data and the volume grew significantly. Today, sensors, cameras, mobile phones, and websites generate huge volumes of data compared to earlier times. A data explosion has taken place over the past several years, as the statistics in Figure 1 clearly show.

Figure 1: Explosion of data in recent years

“Every 2 days we create as much information as we did from the beginning of time until 2003.”

“Every day we create 2.5 quintillion bytes of data. Over 90% of all the data available in the world has been created in the past 2 years alone.”

“It is expected that by 2020 the amount of digital information in existence will have grown from 3.2 zettabytes today to 40 zettabytes.”


More than 80% of the data available today is unstructured (e.g. text, audio, video), which is very difficult to process using traditional Relational Database Management Systems (RDBMS).

Source: IDC

Figure 2: Structured and Unstructured data

Big Data has become a well-known buzzword in the IT industry and beyond. While there is no standard definition for Big Data, the one I find most appropriate is this: “A huge collection of data sets, both structured and unstructured, which is difficult to process using a traditional database.”


Big Data Overview

Suppose we have a 100 MB document which is difficult to send, a 100 MB image which is difficult to view, or a 100 TB video which is difficult to edit. In any of these instances, we have a Big Data problem. Or suppose company ‘A’ is able to process a 300 TB video while company ‘B’ cannot. We would say that company ‘B’ has a Big Data problem. Thus, as you can see, Big Data can be system-specific or organization-specific.

Big Data is not only about the size of the data; it is related to the velocity and variety as well. These are known as the 3 V’s of Big Data – Volume, Velocity, and Variety. In addition, another V is often added to the Big Data dimensions: Veracity.

Figure 3: The 3 V’s of Big Data

Velocity and Volume focus on the speed and amount of data. Variety and Veracity refer to the category and trustworthiness of data, respectively.

Volume (how much): Capturing and processing large quantities of data.

Velocity (changing frequency and real-time processing): Processing data in real time.

Variety (category of data): How many different types of data can be processed, e.g. email, video, audio, text, log files, and various transaction data.

Veracity (truthfulness and confidence): How much you can trust the legitimacy of the data when it is pouring in from various sources.


Big Data – Problem or Opportunity?

Whether Big Data is perceived as a problem or an opportunity depends on how you or your organization approaches it. In the beginning, it is certainly a problem for everyone, as we do not have the right infrastructure and resources to process it, and we are unsure about the insights and value that will be derived from it. Once the data is processed with the right tools, it can lead to breakthrough insights that support strategic decision making and thereby the growth of the organization. In a survey conducted by supply chain LLC, 76% of customers considered Big Data an opportunity rather than a problem. A recent CNBC quote supports this notion: “Data is the new oil - in its raw form oil has little value; once processed it helps power the world.”

Meanwhile, a study conducted by IBM in 2012 found that nearly half of the respondents considered customer satisfaction their top priority. They see Big Data as a key to understanding the customer and predicting their behavior. Organizations are interested in investing in Big Data solutions for various reasons spanning many business processes, as illustrated in Figure 4. Analysis of operational data appears to be the biggest driver for using Big Data solutions.

Figure 4: Top drivers for using Big Data Analytics or Business Intelligence


Layers of Big Data

To gain actionable insight from data, the data has to pass through different stages, as illustrated in Figure 5. There are four layers in a Big Data platform which everyone should be aware of.

1. Data Sources Layer: This is the first stage, where the organization accumulates its data from various sources, e.g. social networking sites, emails, transaction records, and data residing in existing databases. It is best to perform a detailed assessment of the problem you are going to address and how it helps the business, and to measure that against the data you currently have. You may need to go to new sources of data.

Figure 5: Layers in Big Data

2. Data Storage Layer: This is where data gets stored once it is collected from various sources. As computing and storage capacity have increased over the decades, data storage has attained prime importance in Big Data. While considering a file system for storing data, keep in mind that it should be easily accessible, free from cyber threats, and easy to implement and manage. Google came up with such a file system — GFS (Google File System) — over a decade ago, but did not release it as open source. Later, Yahoo did a lot of work in this area, came up with the Hadoop Distributed File System (HDFS), and released it as open source under the Apache Software Foundation. HDFS runs on commodity hardware and handles large-scale data with the help of MapReduce (the component of Hadoop that handles data processing).


3. Data Processing or Analysis Layer: This phase analyzes the data collected in the previous layer to derive insights from it. MapReduce is the common tool used in this analysis. It is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster (a minimal sketch of the model appears after this list). The analysis phase results in the trends and patterns of a particular business.

4. Data Output Layer: In this phase, insights gained from the analysis phase are delivered to those who are supposed to act on them. These outputs can be in the form of charts, figures, and key recommendations. Clear and concise communication is a must here, as the decision makers may not have a thorough knowledge of statistics.
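To make the MapReduce programming model mentioned in layer 3 concrete, here is a minimal word-count sketch in Python. It is illustrative only: it runs locally and simply simulates the map, shuffle, and reduce steps in sequence, whereas a real Hadoop job would distribute the same map and reduce logic across the task trackers of a cluster.

```python
# A minimal, local word-count sketch of the MapReduce programming model.
# Illustrative only; a real Hadoop job would run the same map and reduce
# logic in parallel across the task trackers of a cluster.
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) pairs for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group the intermediate values by key (word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all counts for one word into a single total."""
    return key, sum(values)

if __name__ == "__main__":
    documents = [
        "big data is not only about volume",
        "velocity and variety matter for big data too",
    ]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    for word, counts in sorted(shuffle(intermediate).items()):
        print(reduce_phase(word, counts))
```

The developer essentially supplies only the map and reduce functions; the framework takes care of distributing them across the cluster and shuffling the intermediate key-value pairs.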


Big Data Analytics – Introduction to Hadoop

There are many Big Data platforms, such as Hadoop and NoSQL databases like MongoDB. Hadoop is useful when dealing with massive amounts of data, as it provides the parallel processing capability needed to handle Big Data. It is a framework of tools originated by Doug Cutting and developed further at Yahoo, inspired by the Google File System (which was not open source). The framework was made open source and is distributed by Apache. Hadoop can run on commodity hardware such as Linux servers. HDFS and MapReduce are the two core components of the Hadoop framework.

Figure 6 depicts machines arranged in parallel at the bottom, where each machine has a data node and a task tracker. The data nodes belong to HDFS, while the task trackers belong to MapReduce. A data node holds a portion of the data sets, and the task tracker performs operations on that data. The task trackers on the individual machines need to be controlled and synchronized, which is done by a job tracker. A name node coordinates all of the data nodes, handling how the data is distributed to each machine.

Figure 6: Hadoop Framework
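The division of labor in Figure 6 can be pictured with a small toy in Python. This is not Hadoop code; it is a hedged sketch that only mimics how a name node might split a file into blocks and record which data node holds each block. The block size, node names, and sample file are all invented for illustration.

```python
# Toy illustration of the name node / data node split described above.
# NOT Hadoop code; it only mimics how a name node records which data node
# holds each block of a file.
BLOCK_SIZE = 16  # bytes per block here; real HDFS blocks are typically 64-128 MB

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}             # (filename, block_id) -> bytes

class NameNode:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}          # filename -> list of (node name, block_id)

    def store(self, filename, content):
        """Split the content into blocks and place them round-robin on data nodes."""
        blocks = [content[i:i + BLOCK_SIZE]
                  for i in range(0, len(content), BLOCK_SIZE)]
        placements = []
        for block_id, block in enumerate(blocks):
            node = self.data_nodes[block_id % len(self.data_nodes)]
            node.blocks[(filename, block_id)] = block
            placements.append((node.name, block_id))
        self.block_map[filename] = placements

if __name__ == "__main__":
    nodes = [DataNode(f"datanode-{i}") for i in range(3)]
    name_node = NameNode(nodes)
    name_node.store("clicks.log", "user1 /home\nuser2 /cart\nuser3 /checkout\n")
    print(name_node.block_map["clicks.log"])   # which node holds which block
```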


Stepping into Data Analytics – A Few Guidelines

Data is valuable only if it can help lead to better decisions. Even a large pool of data holds little value when organizations are unable to translate it into meaningful insights that drive their business. Success in deriving insight does not depend on the size of the data but on the effectiveness of the analytics used to generate results. Let’s look at a few steps.

How to define your data: What is the outcome the organization aims to achieve by processing the data? Who is interested in it, and who is going to invest in it? Are there other major initiatives or decisions that can support this one, and how should they be prioritized? Are there new results that were not possible with existing systems? To what degree will the new result change the future of the organization? How do we integrate the new outcome into the organization?

Creating the data framework: Understanding the data by preparing charts, figures, and tables helps indicate where the important insights lie. Prepare sketches and designs and relate them to one another. Try to filter out the unwanted data and concentrate on the most important data. Prioritize the data and summarize it. Start creating visuals out of it; it is advisable to use visualization tools. Tips that help in visualizing include:

Look for trends rather than looking at a single data value

Look for relationships and derive correlations

Examine the data over a time range, e.g. week over week or month over month

Examine the data from the perspective of others; their viewpoints are important for insights
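As a hedged illustration of these tips, the following Python sketch (assuming pandas and NumPy are installed) aggregates made-up daily data to a weekly view, computes the week-over-week change, and derives a simple correlation between two measures. The column names and data are hypothetical.

```python
# Hedged sketch of the tips above: look at week-over-week trends and simple
# correlations instead of single values. Column names and data are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2015-01-01", periods=90, freq="D")
df = pd.DataFrame({
    "page_views": rng.integers(800, 1200, size=len(days)),
    "orders": rng.integers(20, 60, size=len(days)),
}, index=days)

weekly = df.resample("W").sum()            # aggregate daily values to weeks
trend = weekly.pct_change()                # week-over-week change
correlation = weekly["page_views"].corr(weekly["orders"])

print(trend.tail())
print(f"page_views vs orders correlation: {correlation:.2f}")
```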

Bringing data into action: Many of the above techniques will help reveal insights in the data. There is no standard procedure that every organization can adopt; it is up to each organization to choose the techniques that deliver the best insights.


Challenges of Big Data

In its simplest form there are four major challenges that every organization will have to face

when implementing a Big Data platform.

1. Ownership: Since Big Data is heavily business-oriented, the top management of the organization has to play a major role; they should be the leaders of Big Data projects. Big Data is helping organizations of all sizes make better business decisions, save costs, improve customer service, deliver a better user experience, and identify security risks. The insights gleaned from Big Data, and the corresponding organizational changes, have to be managed very carefully, which is why top management has to play a vital role.

2. Data: Identifying the correct and most relevant data is another challenge, as there are many sources of it. Only relevant data will produce the meaningful insights that guide management in making critical decisions. For example, if the organization wants to analyze the customer experience on its website, it would be better to collect failed login attempts and other related error logs from the site rather than only the logs of successful connections (a small sketch after this list illustrates the idea).

3. People: For a successful Big Data project, the team should be a mixture of Data

Scientists, Technology experts, and Business owners. Data Scientists will use their skills

and expertise to correlate data sets, identify patterns, and generate the final insights.

Technology experts form the core of the Big Data initiative by playing a role in identifying

the right set of software and hardware tools required for the platform.

Business Owners define the outcome and work with Data Scientists and Technology

experts to achieve the outcome at the right time.

4. Technology: This is the backbone of the Big Data platform in any organization. Hardware infrastructure and software tools are the two technology components that need to be in place according to the requirements of the organization. Cloud computing can be one of the options from a hardware infrastructure point of view. Tools such as Hadoop and NoSQL databases like MongoDB should be identified and selected for collecting, processing, storing, and analyzing data sets.
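As a hedged illustration of challenge 2, the sketch below keeps only the failed login attempts and server errors from a web log and drops the routine success entries. The log line format, status codes, and field layout are assumptions made purely for illustration.

```python
# Sketch for challenge 2 ("Data"): keep only the records that matter for the
# question being asked. The log line format here is a made-up example.
import re

FAILED_LOGIN = re.compile(r"POST /login .* (401|403)$")
SERVER_ERROR = re.compile(r" 5\d\d$")

def relevant_lines(lines):
    """Yield only failed logins and server errors; drop routine success logs."""
    for line in lines:
        if FAILED_LOGIN.search(line) or SERVER_ERROR.search(line):
            yield line

if __name__ == "__main__":
    sample = [
        '10.0.0.1 "GET /home HTTP/1.1" 200',
        '10.0.0.2 "POST /login HTTP/1.1" 401',
        '10.0.0.3 "GET /cart HTTP/1.1" 500',
    ]
    for line in relevant_lines(sample):
        print(line)
```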

All of the challenges above need to be addressed and managed in a balanced way so that the

Big Data project will succeed. Neglecting any one of these will create great problems for Big

Data projects.


Big Data Use Cases

Normally, organizations will not reveal their Big Data strategy to others out of fear that doing so might affect their competitive edge in the market. However, this fear may be groundless, as Big Data projects are starting to deliver benefits; that is why the market for Hadoop and NoSQL services is growing fast. A September 2013 study by open source research firm Wikibon, for instance, forecasts an annual Big Data software growth rate of 45% through 2017.

1. 360 degree view of the customer: Online marketing firms and other retailers want to know how much time customers spend on their websites, which pages they search, what they like, and when they leave. This unstructured data is processed and analyzed along with transactional data, which is stored as structured data in the company’s ERP system. Social media sentiment is also added to the mix, and together these sources give a 360 degree view of the customer. Before making offers to a customer, these retailers already know what the customer has bought in the past, along with their sentiment and behavior patterns (a small sketch of such a join appears after this list).

2. Smart devices: Today almost all machines and pieces of hardware are built with a number of sensors which can transfer a lot of information when the devices are connected over the Internet. They collect information such as device health, usage, and security. This mostly unstructured (and some structured) data is processed in the background to derive correlations and patterns that reveal device performance over time and how customers use the devices.

3. Optimizing the data warehouse: Customers can identify their archived data and unstructured data and store it more cost-effectively on a Hadoop architecture.

4. Information security: Security vendors and large organizations with sophisticated

security architectures can leverage Hadoop as a platform that can offer reliable and

cheap data protection. Fraud analysis and identification will be much easier as this

platform can derive thousands of relations and patterns.

5. Health Care: A Big Data platform can play a major role in predicting outbreaks of

contagious diseases and illness diagnosis by processing information collected over

many years and pulling patterns and trends from it. Data from disease control efforts,

hospitals, and accident reports can easily show which geographical areas are over-served or under-served by the current health care efforts in place.


6. Object analytics: Object analytics looks for connections among various objects, e.g. places, things, transactions, and locations. Billions of such data points can be used to predict suspicious activity in a location and to detect fraud.
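As a hedged sketch of use case 1, the following Python example (assuming pandas is installed) joins clickstream activity, ERP purchase history, and a social sentiment score into one combined profile per customer. All table and column names, and the data itself, are invented for illustration.

```python
# Hedged sketch of use case 1: joining web clickstream activity, ERP purchase
# history, and a social sentiment score per customer. Data is invented.
import pandas as pd

clickstream = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "pages_viewed": [12, 8, 3, 20],
})
purchases = pd.DataFrame({
    "customer_id": [1, 2, 2],
    "total_spent": [250.0, 40.0, 15.0],
})
sentiment = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "sentiment_score": [0.8, -0.2, 0.4],   # e.g. derived from social media posts
})

profile = (
    clickstream.groupby("customer_id", as_index=False)["pages_viewed"].sum()
    .merge(purchases.groupby("customer_id", as_index=False)["total_spent"].sum(),
           on="customer_id", how="left")
    .merge(sentiment, on="customer_id", how="left")
)
print(profile)   # one combined "360 degree" row per customer
```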


Latest Trends in Big Data

More companies are getting involved in Big Data analytics to keep pace with their competition. With the Internet of Things (IoT) growing fast, customers are very aware — and not always happy — when a business uses their data in an underhanded way. Consequently, companies that are keen on Big Data analytics have to ensure that they have a strong data management plan to handle customer data in an ethical manner. Let’s look at some of the interesting trends.

1. Open Source: The ongoing trend toward open source products such as Hadoop, Spark, and R is going to continue, and these will have even bigger markets in 2015. The stability of any open source product must be ensured before placing it alongside your existing database systems. Commercial distributions of open source are becoming more popular than pure open source implementations because the code changes less frequently in the former, so functionality is not affected as often.

2. Cognitive computing: Cognitive computing makes new classes of problems computable by addressing complex situations characterized by ambiguity and uncertainty. Its goal is to develop computing technology that senses and responds to stimuli in a way similar to the human brain. Many IT pioneers are already working in this area, and the trend will grow in 2015.

3. Big Data analytics in the cloud: Hadoop and its related framework of tools were originally designed to work on clusters of physical machines. Now the trend has changed, and an increasing number of technologies, including Hadoop, are available for processing data in the cloud. Google’s BigQuery data analytics service and Amazon’s Redshift hosted BI data warehouse are good examples.

4. NoSQL: Alternatives to traditional SQL-based relational databases, NoSQL (short for

“Not Only SQL”) databases are rapidly gaining popularity as tools for use in specific

kinds of analytic applications. That momentum will continue to grow.

5. In-memory analytics: An in-memory database management system relies primarily on main memory for data storage. Main-memory databases are faster than disk-optimized databases because the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory also eliminates seek time when querying the data, which provides faster performance than disk (a tiny illustration follows this list).
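A minimal way to see the in-memory idea is Python's built-in sqlite3 module, which can create a database that lives entirely in RAM. This is only a toy illustration; dedicated in-memory analytics engines go far beyond it.

```python
# Minimal illustration of the in-memory idea using Python's built-in sqlite3.
# A ":memory:" database lives entirely in RAM, so queries avoid disk seeks.
import sqlite3

conn = sqlite3.connect(":memory:")          # RAM-resident database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 75.5), ("north", 60.0)],
)

# The GROUP BY aggregation is answered entirely from memory.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

conn.close()
```

Because the table never touches disk, the aggregation query is served from RAM, which is the essence of the in-memory approach described above.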


Common Myths about Big Data

Those starting their Big Data journey should be aware of some common myths so that the

project will not be a waste of time or manpower.

1. Big is simple: We know that Apache Hadoop can store and process tons of data and that it provides built-in fault tolerance, such as in-cluster replication, to improve cluster availability. However, HDFS doesn’t natively provide a solution for advanced data protection or disaster recovery. For such functionality, enhanced Hadoop distributions such as the one from MapR would be required.

2. Fast analytics using Hadoop: A common misconception about Hadoop is that it’s fast. It is designed only for high-throughput, batch-style processing, with mechanisms to reduce the impact of the hardware failures that are common in large systems. These days, however, there are a number of enhancements that address performance, among them integration with traditional database, streaming data, and in-memory processing products.

3. Store everything: Big Data hype has created the impression that a Big Data platform can store, forever, all the data an enterprise has. That may be technically possible, but it defeats the ultimate purpose of having a data analytics solution. The truth is that you can expect a faster, more efficient, and more cost-effective solution when you keep less of the unneeded data on the framework.

4. Start it as others are doing it: This is the wrong approach, at least for Big Data, as it can waste money and effort. Simply putting tons of data on a scalable cluster and expecting a Data Scientist to pull out insights for you will not work. As with any other project, success will mostly depend on having a well-thought-out plan and strategy in place to drive the whole framework of tools and other resources, including Data Scientists.


Conclusion

Big Data is a buzzword heard just about everywhere. Why is Big Data getting this much attention? Because it has the potential to profoundly affect the way we do business. In the past, we used to look at small data and make our decisions. Now, with the Internet of Things and advances in technology, we are bringing huge sets of data to computing (instead of moving the technology towards the data). This has become what is known as Big Data Analytics.

Big Data is about analytics, not storage. Start with questions, not data. Not all problems are Big Data problems. We have to audit our data, classify problems, and set an approach. Invest in up-skilling your people, build the parts, and plan the whole. In the coming years, Big Data is going to transform how we live, how we work, and how we think. Using the positive side of Big Data Analytics and the corresponding insights holds the promise of changing the world. Hence, Big Data is a big deal!

EMC believes the information in this publication is accurate as of its publication date. The

information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any EMC software described in this publication requires an

applicable software license.