Big Data Notes
To keep things simple, we typically define Big
Data using four Vs; namely,
volume, variety, velocity, and veracity. We
added the veracity characteristic
recently in response to the quality and source
issues our clients began facing
with their Big Data initiatives. Some analysts
include other V-based descriptors,
such as variability and visibility, but we’ll leave
those out of this discussion.
Volume is the obvious Big Data trait. At the
start of this chapter we rhymed
off all kinds of voluminous statistics that do two
things: go out of date the
moment they are quoted and grow bigger! We
can all relate to the cost of
home storage; we remember geeking out and bragging to our friends about the new 1TB drive we bought for $500. That same capacity now costs about $60, and in a couple of years a consumer version will fit on your fingernail.
The thing about Big Data and data volumes is
that the language has
changed. Aggregation that used to be measured
in petabytes (PB) is now
referenced by a term that sounds as if it’s from a
Star Wars movie: zettabytes
(ZB). A zettabyte is a trillion gigabytes (GB), or
a billion terabytes!
Since we’ve already given you some great
examples of the volume of data
in the previous section, we’ll keep this section
short and conclude by referencing
the world’s aggregate digital data growth rate.
In 2009, the world had
about 0.8ZB of data; in 2010, we crossed the
1ZB marker, and at the end of
2011 that number was estimated to be 1.8ZB
(we think 80 percent is quite the
significant growth rate). Six or seven years from
now, the number is estimated
(and note that any future estimates in this book
are out of date the
moment we saved the draft, and on the low side
for that matter) to be around
35ZB, equivalent to about four trillion 8GB
iPods! That number is astonishing
considering it’s a low-sided estimate. Just as
astounding are the challenges
and opportunities that are associated with this
amount of data.
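To make the unit arithmetic concrete, here's a quick back-of-the-envelope check in Python (decimal SI units assumed; the variable names are ours, not from the chapter):

    # Back-of-the-envelope check of the volume figures above (decimal SI units).
    GB = 10**9           # gigabyte
    TB = 10**12          # terabyte
    ZB = 10**21          # zettabyte

    print(ZB / GB)       # 1e12 -> a zettabyte is a trillion gigabytes
    print(ZB / TB)       # 1e9  -> ... or a billion terabytes

    projected_world_data = 35 * ZB      # the ~35ZB projection quoted above
    ipod_capacity = 8 * GB
    print(projected_world_data / ipod_capacity)  # ~4.4e12, roughly four trillion 8GB iPods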
The variety characteristic of Big Data is really
about trying to capture all of the
data that pertains to our decision-making
process. Making sense out of
unstructured data, such as opinion and intent
musings on Facebook, or analyzing
images, isn’t something that comes naturally for
computers. However, this
kind of data complements the data that we use
to drive decisions today. Most
of the data out there is semistructured or
unstructured. (To clarify, all data has
some structure; when we refer to unstructured
data, we are referring to the subcomponents
that don’t have structure, such as the freeform
text in a comments
field or the image in an auto-dated picture.)
Consider a customer call center; imagine being
able to detect the change in
tone of a frustrated client who raises his voice to
say, “This is the third outage
I’ve had in one week!” A Big Data solution
would not only identify the terms
“third” and “outage” as negative polarity
trending to consumer vulnerability,
but also the tonal change as another indicator
that a customer churn incident
is trending to happen. All of this insight can be
gleaned from unstructured
data. Now combine this unstructured data with
the customer’s record data
and transaction history (the structured data with
which we’re familiar), and
you’ve got a very personalized model of this
consumer: his value, how brittle
he’s become as your customer, and much more.
(You could start this usage
pattern by attempting to analyze recorded calls
not in real time, and mature
the solution over time to one that analyzes the
spoken word in real time.)
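To make the idea concrete, here's a deliberately naive Python sketch (our illustration, not IBM's analytics): it counts negative-polarity keywords in free-form call text and blends that signal with a hypothetical structured customer record to flag churn risk. The keywords, field names, and thresholds are assumptions for illustration only.

    # Minimal illustration: flag churn risk by combining unstructured transcript text
    # with structured customer data. Keywords and thresholds are illustrative only.
    NEGATIVE_TERMS = {"outage", "third", "cancel", "frustrated", "unacceptable"}

    def negative_polarity(transcript: str) -> int:
        """Count naive negative-polarity hits in free-form call-center text."""
        words = transcript.lower().split()
        return sum(1 for w in words if w.strip(".,!?\"'") in NEGATIVE_TERMS)

    def churn_risk(transcript: str, customer: dict) -> str:
        """Blend the unstructured signal with structured record/transaction data."""
        score = negative_polarity(transcript)
        if customer.get("open_tickets", 0) > 2:
            score += 2                      # structured signal: repeated incidents
        if customer.get("monthly_spend", 0) > 100:
            score += 1                      # high-value customers get extra attention
        return "high" if score >= 3 else "low"

    call = "This is the third outage I've had in one week!"
    record = {"customer_id": 42, "open_tickets": 3, "monthly_spend": 120}
    print(churn_risk(call, record))         # -> "high"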
An IBM business partner, TerraEchos, has
developed one of the most
sophisticated sound classification systems in the
world. This system is used
for real-time perimeter security control; a
thousand sensors are buried underground
to collect and classify detected sounds so that
appropriate action can
be taken (dispatch personnel, dispatch aerial
surveillance, and so on) depending
on the classification. Consider the problem of
securing the perimeter of
a nuclear reactor that’s surrounded by parkland.
The TerraEchos system can
near-instantaneously differentiate the whisper of
the wind from a human
voice, or the sound of a human footstep from the
sound of a running deer.
In fact, if a tree were to fall in one of its
protected forests, TerraEchos can affirm
that it makes a sound even if no one is around to
hear it. Sound classification
is a great example of the variety characteristic of
Big Data.
One of our favorite but least understood
characteristics of Big Data is velocity.
We define velocity as the rate at which data
arrives at the enterprise and is
processed or well understood. In fact, we
challenge our clients to ask themselves,
once data arrives at their enterprise’s doorstep:
“How long does it
take you to do something about it or know it has
even arrived?” Think about it for a moment. The
opportunity cost clock on your data
starts ticking the moment the data hits the wire.
As organizations, we’re taking
far too long to spot trends or pick up valuable
insights. It doesn’t matter
what industry you’re in; being able to more
swiftly understand and respond
to data signals puts you in a position of power.
Whether you’re trying to
understand the health of a traffic system, the
health of a patient, or the health
of a loan portfolio, reacting faster gives you an
advantage. Velocity is perhaps
one of the most overlooked areas in the Big
Data craze, and one in
which we believe that IBM is unequalled in the
capabilities and sophistication
that it provides.
In the Big Data craze that has taken the
marketplace by storm, everyone
is fixated on at-rest analytics, using optimized engines such as the Netezza
technology behind the IBM PureData System
for Analytics or Hadoop to
perform analysis that was never before possible,
at least not at such a large
scale. Although this is vitally important, we
must nevertheless ask: “How
do you analyze data in motion?” This capability
has the potential to provide
businesses with the highest level of
differentiation, yet it seems to be somewhat
overlooked. The IBM InfoSphere Streams
(Streams) part of the IBM Big Data
platform provides a real-time streaming data
analytics engine. Streams is a
platform that provides fast, flexible, and
scalable processing of continuous
streams of time-sequenced data packets. We’ll
delve into the details and
capabilities of Streams in Part III, “Analytics for
Big Data in Motion.”
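Streams programs are written with their own languages and toolkits, which we cover in Part III; purely to illustrate the flavor of continuous, windowed analysis over time-sequenced data, here's a tiny generic Python sketch (the event shape and the one-minute window are our assumptions, not Streams code):

    # Illustration only: a continuous, windowed aggregate over a time-sequenced stream.
    from collections import deque

    class SlidingWindowAverage:
        """Keep a rolling average over the last `window_seconds` of readings."""
        def __init__(self, window_seconds=60):
            self.window_seconds = window_seconds
            self.events = deque()           # (timestamp, value) pairs, oldest first
            self.total = 0.0

        def add(self, timestamp, value):
            self.events.append((timestamp, value))
            self.total += value
            # Evict readings that have fallen out of the window.
            while self.events and self.events[0][0] < timestamp - self.window_seconds:
                _, old_val = self.events.popleft()
                self.total -= old_val
            return self.total / len(self.events)

    window = SlidingWindowAverage(window_seconds=60)
    for ts, reading in [(0, 10.0), (30, 14.0), (90, 30.0)]:   # stand-in for an endless feed
        print(ts, window.add(ts, reading))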
You might be thinking that velocity can be
handled by Complex Event
Processing (CEP) systems, and although they
might seem applicable on the
surface, in the Big Data world, they fall very
short. Stream processing enables
advanced analysis across diverse data types with
very high messaging data
rates and very low latency (μs to s). For
example, one financial services sector
(FSS) client analyzes and correlates over five
million market messages/
second to execute algorithmic option trades with
an average latency of 30
microseconds. Another client analyzes over
500,000 Internet protocol detail
records (IPDRs) per second, more than 6 billion
IPDRs per day, on more than
4PB of data per year, to understand the trending
and current-state health of their
network. Consider an enterprise network
security problem. In this domain,
threats come in microseconds, so you need technology that can respond and keep pace. However, you also need something
that can capture lots of data
quickly, and analyze it to identify emerging
signatures and patterns on the
network packets as they flow across the network
infrastructure.
Finally, from a governance perspective,
consider the added benefit of a Big
Data analytics velocity engine: If you have a
powerful analytics engine that
can apply very complex analytics to data as it
flows across the wire, and you
can glean insight from that data without having
to store it, you might not
have to subject this data to retention policies,
and that can result in huge savings
for your IT department.
Today’s CEP solutions are targeted to
approximately tens of thousands of
messages/second at best, with seconds-to-
minutes latency. Moreover, the
analytics are mostly rules-based and applicable
only to traditional data
types (as opposed to the TerraEchos example
earlier). Don’t get us wrong;
CEP has its place, but it has fundamentally
different design points. CEP is a
non-programmer-oriented solution for the
application of simple rules to
discrete, “complex” events.
Note that not a lot of people are talking about
Big Data velocity, because
there aren’t a lot of vendors that can do it, let
alone integrate at-rest technologies
with velocity to deliver economies of scale for
an enterprise’s current
investment. Take a moment to consider the
competitive advantage that your
company would have with an in-motion, at-rest
Big Data analytics platform,
by looking at Figure 1-1 (the IBM Big Data
platform is covered in detail in
Chapter 3).
You can see how Big Data streams into the
enterprise; note the point at
which the opportunity cost clock starts ticking
on the left. The more time
that passes, the less the potential competitive
advantage you have, and the
less return on data (ROD) you’re going to
experience. We feel this ROD metric will dominate the future IT landscape in a Big Data world: we're used to talking about return on investment (ROI), which considers the entire solution investment; in a Big Data world, however, ROD is a finer-grained measure that helps fuel future Big Data investments.
Traditionally, we’ve used at-rest solutions
(traditional data warehouses,
Hadoop, graph stores, and so on). The T box on
the right in Figure 1-1
represents the analytics that you discover and
harvest at rest (in this case,
it’s text-based sentiment analysis).
Unfortunately, this is where many
vendors’ Big Data talk ends. The truth is that
many vendors can’t help you
build the analytics; they can only help you to
execute it. This is a key
differentiator that you’ll find in the IBM Big
Data platform. Imagine being
able to seamlessly move the analytic artifacts
that you harvest at rest and
apply that insight to the data as it happens in
motion (the T box by the
lightning bolt on the left). This changes the
game. It makes the analytic
model adaptive, a living and breathing entity
that gets smarter day by day
and applies learned intelligence to the data as it
hits your organization’s
doorstep. This model is cyclical, and we often
refer to this as adaptive
analytics because of the real-time and closed-
loop mechanism of this
architecture.
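Here's a rough sketch of that cycle in Python (our simplification, not the platform's actual mechanics): a trivial keyword model is harvested from historical records at rest and then applied to events as they arrive; in the closed loop, newly labeled events feed the next batch run. All names and the scoring scheme are illustrative assumptions.

    # Sketch of the adaptive-analytics loop: learn at rest, score in motion, repeat.
    # The "model" here is deliberately trivial (word frequencies); names are illustrative.
    from collections import Counter

    def train_at_rest(historical_records):
        """Batch step: harvest an analytic artifact from labeled history."""
        negative_words = Counter()
        for text, label in historical_records:
            if label == "churned":
                negative_words.update(text.lower().split())
        return negative_words

    def score_in_motion(event_text, model):
        """Streaming step: apply the harvested artifact to data as it arrives."""
        return sum(model.get(word, 0) for word in event_text.lower().split())

    history = [("third outage this week", "churned"),
               ("great service thanks", "retained")]
    model = train_at_rest(history)

    for event in ["another outage again", "thanks for the quick fix"]:
        print(event, "->", score_in_motion(event, model))

    # In the closed loop, newly scored (and later labeled) events are appended to
    # `history` and the batch step reruns, so the in-motion model keeps improving.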
The ability to have seamless analytics for both
at-rest and in-motion data
moves you from the forecast model that’s so
tightly aligned with traditional
warehousing (on the right) and energizes the business with a nowcast model.
The whole point is getting the insight you learn
at rest to the frontier of the
business so it can be optimized and understood
as it happens. Ironically, the more times the enterprise goes through this adaptive analytics cycle, the smarter its analytic models become.
Veracity is a term that’s being used more and
more to describe Big Data; it
refers to the quality or trustworthiness of the
data. Tools that help handle Big
Data’s veracity transform the data into
trustworthy insights and discard
noise.
Collectively, a Big Data platform gives
businesses the opportunity to analyze
all of the data (whole population analytics), and
to gain a better understanding
of your business, your customers, the
marketplace, and so on. This
opportunity leads to the Big Data conundrum:
although the economics of
deletion have caused a massive spike in the data
that’s available to an organization,
the percentage of the data that an enterprise can
understand is on
the decline. A further complication is that the
data that the enterprise is trying
to understand is saturated with both useful
signals and lots of noise (data
that can’t be trusted, or isn’t useful to the
business problem at hand), as
shown in Figure 1-2.
We all have firsthand experience with this;
Twitter is full of examples of
spambots and directed tweets, which are untrustworthy data. The 2012 presidential election in Mexico turned into a Twitter veracity
example
with fake accounts, which polluted political
discussion, introduced derogatory
hash tags, and more. Spam is nothing new to
folks in IT, but you
need to be aware that in the Big Data world,
there is also Big Spam potential,
and you need a way to sift through it and figure
out what data can and
can’t be trusted. Of course, there are words that
need to be understood in
context, jargon, and more (we cover this in
Chapter 8).
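As a crude illustration of sifting Big Spam from trustworthy data (real veracity tooling is far more sophisticated, and every rule and threshold below is an assumption of ours), a few naive heuristics might look like this:

    # Naive veracity heuristics for a tweet-like record; thresholds are illustrative only.
    def looks_untrustworthy(tweet):
        account = tweet["account"]
        text = tweet["text"]
        if account.get("age_days", 0) < 2 and account.get("followers", 0) < 5:
            return True                       # throwaway-account pattern
        if text.count("#") > 5:
            return True                       # hashtag stuffing
        if account.get("tweets_per_hour", 0) > 100:
            return True                       # machine-like posting rate
        return False

    sample = {"account": {"age_days": 1, "followers": 0, "tweets_per_hour": 200},
              "text": "#win #free #now #vote #best #really"}
    print(looks_untrustworthy(sample))        # -> True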
As previously noted, embedded within all of this
noise are useful signals:
the person who professes a profound disdain for
her current smartphone
manufacturer and starts a soliloquy about the
need for a new one is expressing
monetizable intent. Big Data is so vast that
quality issues are a reality, and
veracity is what we generally use to refer to this
problem domain. The fact
that one in three business leaders don’t trust the information that they use to make decisions is a strong indicator that a good Big Data platform must address veracity.
BIG DATA – Nathan Marz
1.5 Desired Properties of a Big Data System
The properties you should strive for in Big Data
systems are as much about
complexity as they are about scalability. Not
only must a Big Data system perform
well and be resource-efficient, it must be easy to
reason about as well. Let's go
over each property one by one. You don't need
to memorize these properties, as we
will revisit them as we use first principles to
show how to achieve these properties.
1.5.1 Robust and fault-tolerant
Building systems that "do the right thing" is
difficult in the face of the challenges
of distributed systems. Systems need to behave
correctly in the face of machines
going down randomly, the complex semantics
of consistency in distributed
databases, duplicated data, concurrency, and
more. These challenges make it
difficult just to reason about what a system is
doing. Part of making a Big Data
system robust is avoiding these complexities so
that you can easily reason about
the system.
Additionally, it is imperative for systems to be
"human fault-tolerant." This is
an oft-overlooked property of systems that we
are not going to ignore. In a
production system, it's inevitable that someone
is going to make a mistake
sometime, like by deploying incorrect code that
corrupts values in a database. You
will learn how to bake immutability and
recomputation into the core of your
systems to make your systems innately resilient
to human error. Immutability and
recomputation will be described in depth in
Chapters 2 through 5.
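To preview the idea before Chapters 2 through 5, here is a minimal sketch of our own (not code from those chapters): raw facts are only ever appended, and derived views are recomputed from them, so a bad deployment can corrupt a view but never the underlying history.

    # Sketch of human fault-tolerance via immutability and recomputation.
    # Raw facts are append-only; any derived view can be rebuilt from them.
    master_dataset = []                      # immutable log of raw facts

    def record_fact(fact):
        master_dataset.append(fact)          # append-only: facts are never updated in place

    def recompute_balance_view(facts):
        """Derived view: rebuilt from scratch, so a buggy version can simply be rerun."""
        balances = {}
        for f in facts:
            balances[f["account"]] = balances.get(f["account"], 0) + f["amount"]
        return balances

    record_fact({"account": "alice", "amount": 100})
    record_fact({"account": "alice", "amount": -30})
    record_fact({"account": "bob", "amount": 50})

    print(recompute_balance_view(master_dataset))   # {'alice': 70, 'bob': 50}
    # If bad code corrupts the view, the raw facts are untouched: fix the code,
    # recompute, and the correct view reappears.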
1.5.2 Low latency reads and updates
The vast majority of applications require reads to be satisfied with very low latency, typically between a few milliseconds and a few hundred milliseconds. On
the other hand, the update latency requirements
vary a great deal between
applications. Some applications require updates
to propagate immediately, while in
other applications a latency of a few hours is
fine. Regardless, you will need to be
able to achieve low latency updates when you need them in your Big Data systems.
More importantly, you need to be able to
achieve low latency reads and updates
without compromising the robustness of the
system. You will learn how to achieve
low latency updates in the discussion of the
"speed layer" in Chapter 7.
1.5.3 Scalable
Scalability is the ability to maintain
performance in the face of increasing data
and/or load by adding resources to the system.
The Lambda Architecture is
horizontally scalable across all layers of the
system stack: scaling is accomplished
by adding more machines.
1.5.4 General
A general system can support a wide range of
applications. Indeed, this book
wouldn't be very useful if it didn't generalize to
a wide range of applications! The
Lambda Architecture generalizes to applications
as diverse as financial
management systems, social media analytics,
scientific applications, and social
networking.
1.5.5 Extensible
You don't want to have to reinvent the wheel
each time you want to add a related
feature or make a change to how your system
works. Extensible systems allow
functionality to be added with a minimal
development cost.
Oftentimes a new feature or change to an
existing feature requires a migration
of old data into a new format. Part of a system
being extensible is making it easy to
do large-scale migrations. Being able to do big
migrations quickly and easily is
core to the approach you will learn.
1.5.6 Allows ad hoc queries
Being able to do ad hoc queries on your data is
extremely important. Nearly every
large dataset has unanticipated value within it.
Being able to mine a dataset
arbitrarily gives opportunities for business
optimization and new applications.
Ultimately, you can't discover interesting things
to do with your data unless you
can ask arbitrary questions of it. You will learn
how to do ad hoc queries in
Chapters 4 and 5 when we discuss batch
processing.
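As a toy example of an unanticipated question asked of raw data (the records and the query are invented for illustration), any function can be run over the full dataset after the fact:

    # Ad hoc query sketch: an arbitrary, unanticipated question asked of raw records.
    pageviews = [
        {"user": "u1", "url": "/pricing", "hour": 9},
        {"user": "u2", "url": "/pricing", "hour": 9},
        {"user": "u1", "url": "/docs",    "hour": 10},
    ]

    # A question nobody planned for: unique visitors per URL during business hours.
    def unique_visitors_per_url(records, start_hour=9, end_hour=17):
        result = {}
        for r in records:
            if start_hour <= r["hour"] < end_hour:
                result.setdefault(r["url"], set()).add(r["user"])
        return {url: len(users) for url, users in result.items()}

    print(unique_visitors_per_url(pageviews))   # {'/pricing': 2, '/docs': 1}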
1.5.7 Minimal maintenance
Maintenance is the work required to keep a
system running smoothly. This
includes anticipating when to add machines to
scale, keeping processes up and
running, and debugging anything that goes
wrong in production.
An important part of minimizing maintenance is
choosing components that
have as small an implementation complexity as possible. That is, you want to rely on components that have simple mechanisms
underlying them. In particular,
distributed databases tend to have very
complicated internals. The more complex a
system, the more likely something will go
wrong and the more you need to
understand about the system to debug and tune
it.
You combat implementation complexity by
relying on simple algorithms and
simple components. A trick employed in the
Lambda Architecture is to push
complexity out of the core components and into
pieces of the system whose
outputs are discardable after a few hours. The
most complex components used, like
read/write distributed databases, are in this layer
where outputs are eventually
discardable. We will discuss this technique in
depth when we discuss the "speed
layer" in Chapter 7.
1.5.8 Debuggable
A Big Data system must provide the information necessary to debug the system when things go wrong. The key is to be able to trace, for each value in the system, exactly what caused it to have that value.
Achieving all these properties together in one
system seems like a daunting
challenge. But by starting from first principles,
these properties naturally emerge
from the resulting system design. Let's now take a look at the Lambda Architecture, which derives from first principles and satisfies all of these properties.
Computing arbitrary functions on an arbitrary
dataset in realtime is a daunting
problem. There is no single tool that provides a
complete solution. Instead, you
have to use a variety of tools and techniques to
build a complete Big Data system.
The Lambda Architecture solves the problem of
computing arbitrary functions
on arbitrary data in realtime by decomposing the
problem into three layers: the batch layer, the serving layer, and the speed layer.
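As a minimal sketch of how those layers cooperate (our simplified illustration, not the book's code): the batch layer precomputes views over the entire master dataset, the speed layer maintains a view over only the recent data the last batch run hasn't absorbed, and the serving layer answers queries by merging the two.

    # Simplified sketch of the Lambda Architecture's three layers cooperating.
    master_dataset = [("alice", 1), ("alice", 1), ("bob", 1)]   # immutable raw events
    recent_events  = [("alice", 1)]                             # not yet in the batch view

    def batch_layer(events):
        """Precompute a view over ALL historical data (slow, thorough)."""
        view = {}
        for user, count in events:
            view[user] = view.get(user, 0) + count
        return view

    def speed_layer(events):
        """Maintain a view over only the recent data (fast, incremental)."""
        return batch_layer(events)   # same logic here, but over a tiny slice

    def serving_layer_query(user, batch_view, realtime_view):
        """Answer queries by merging the batch view with the realtime view."""
        return batch_view.get(user, 0) + realtime_view.get(user, 0)

    batch_view = batch_layer(master_dataset)
    realtime_view = speed_layer(recent_events)
    print(serving_layer_query("alice", batch_view, realtime_view))   # -> 3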