The Briefing Room As You Seek—How Search Enables Big Data Analytics

Upload: inside-analysis

Post on 20-Aug-2015


The Briefing Room

As You Seek—How Search Enables Big Data Analytics

Twitter Tag: #briefr


Welcome

Host: Eric Kavanagh


Mission

• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!

JUNE: Database

July: CLOUD

August: HIGH PERFORMANCE ANALYTICS

September: ANALYTICS


Database

Better SEARCH

Faster INSIGHT


Analyst: Robin Bloor

Robin Bloor is Chief Analyst at The Bloor Group


MarkLogic

• MarkLogic is an enterprise-class NoSQL database company
• Key features of its database include ACID transactions, horizontal scaling, real-time indexing, high availability, disaster recovery, and government-grade security
• Its platform provides full-text query and search capabilities, application services, and big data analytics

David Gorbet

David Gorbet is Vice President of Engineering for MarkLogic, where he also runs the Support organization. Gorbet brings two decades of experience delivering some of the highest-volume applications and enterprise software in the world. Prior to MarkLogic, Gorbet helped pioneer Microsoft’s business online services strategy by founding and leading the SharePoint Online team. Gorbet holds a Bachelor of Applied Science degree in Systems Design Engineering with an additional major in Psychology from the University of Waterloo, and an MBA from the University of Washington Foster School of Business.

MarkLogic: What it is, how it works

David Gorbet, VP Engineering

Slide 2 Copyright © 2013 MarkLogic® Corporation. All rights reserved.

WE ARE THE NEW GENERATION DATABASE

Any Structure Era: “For all your data!”
• Schema-agnostic
• Massive scale
• Query and search
• Analytics
• Application services
• Faster time-to-results

Relational Era: “For all your structured data!”
• Normalized, tabular model
• Application-independent query
• User control

Hierarchical Era: “For your application data!”
• Application- and hardware-specific

Real Value From Big Data

Make The World More Secure

Provide Access To Valuable Information

Create New Revenue Streams

Gain Insights to Increase Market Share

Reduce Bottom Line Expense


The MarkLogic Advantage

Only Enterprise NoSQL Database

ACID compliant

Big data search

High availability

Replication

Point-in-time recovery

Government-grade security

Real-time your Hadoop

Proven customer success


How Does It Work?

Schema-agnostic design

Real-time indexing and query

Event processing and alerting

Scale-out shared-nothing cluster topology

Analytics and Visualization

High availability and disaster recovery


Hierarchical Data Model

MarkLogic Server is a document-centric database

Supports any-structured data via a hierarchical data model

[Diagram: two example document trees. The first is a Document with Title, Author (First, Last), Metadata, and Section nodes, some of them nested. The second is a Trade whose nodes include tradeID, Cashflows, Party Identifier, Party Reference, Payer Party, Receiver Party, Net Payment, Payment Date, and Payment Amount.]
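To make the hierarchical model concrete, here is a minimal C++ sketch of a document tree and a path lookup over it. The `Node` type, the `find` helper, and the sample values are hypothetical illustrations; the slides do not describe how MarkLogic represents documents internally.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration only.
struct Node {
    std::string name;                              // element or property name
    std::string text;                              // leaf text; empty for containers
    std::vector<std::unique_ptr<Node>> children;   // ordered child nodes

    explicit Node(std::string n, std::string t = "")
        : name(std::move(n)), text(std::move(t)) {}

    // Append a child and return a reference to it so callers can nest further.
    Node& add(std::string n, std::string t = "") {
        children.push_back(std::make_unique<Node>(std::move(n), std::move(t)));
        return *children.back();
    }
};

// Follow `path` (a list of child names) down from `node` and collect the
// text of every node reached at the end of the path.
void find(const Node& node, const std::vector<std::string>& path,
          std::size_t depth, std::vector<std::string>& out) {
    if (depth == path.size()) {
        out.push_back(node.text);
        return;
    }
    for (const auto& child : node.children) {
        if (child->name == path[depth]) find(*child, path, depth + 1, out);
    }
}

// Build a small document shaped like the Document tree on the slide.
Node make_sample_document() {
    Node doc("Document");
    doc.add("Title", "MarkLogic Server");
    Node& author = doc.add("Author");
    author.add("First", "John");
    author.add("Last", "Doe");
    Node& section = doc.add("Section");
    section.add("Section", "nested text");
    return doc;
}
```

Calling `find(doc, {"Author", "Last"}, 0, out)` on the sample document collects the author's last name. No schema is declared anywhere: the shape of the data lives in the document itself.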

MarkLogic is Schema Agnostic

JSON and XML are self-describing

<article>
  <title> MarkLogic Server:… </title>
  <author>
    <first-name> John </first-name>
    <last-name> Doe </last-name>
  </author>
  <abstract>
    . . . . <company> MarkLogic </company> . . . .
  </abstract>
  <body>
    <section>
      <section> . . . . </section>
    </section>
    <section> …index… </section>
  </body>
  <copyright> Copyright © … </copyright>
</article>

MarkLogic is Schema Agnostic

JSON and XML are self-describing

[Diagram: the same article drawn as a tree of elements]

<article>
  <title>  MarkLogic Server:…
  <author>
    <first-name>  John
    <last-name>  Doe
  <abstract>
    . . . .
    <company>  MarkLogic
    . . . .
  <body>
    <section>
      <section>  . . . .
    <section>  …index…
  <copyright>  Copyright © …

Universal Index

MarkLogic indexes words, phrases, stemming, structure, values, collections, and security permissions.

Index type             Term                        Term list (document references)
Words                  “brown”                     123, 125, 129, 152, 344, 491, …
Words                  “mice”                      123, 125, 126, 129, 130, 152, …
Phrases                “brown mice”                125, 152, 516, 522, 765, 890, …
Stemming               STEM “mouse”                123, 125, 126, 129, 130, 152, …
Stemming               STEM “brown mouse”          125, 152, 516, 522, 765, 890, …
Structure              <article>                   …
Structure              <article>/<abstract>        …
Structure              <section>/<paragraph>       …
Values                 <animal>mouse</animal>      …
Values                 <year>1950</year>           …
Collections            Collection:Draft            …
Security Permissions   Role:Editor + Action:Read   …
…                      …                           …

Which draft articles contain the phrase brown mice?

Document references: 125, 516, 890, …
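One way to picture how the query above is resolved is as an intersection of sorted term lists. The C++ sketch below assumes that model. The "brown mice" list is copied from the slide, but the `Collection:Draft` list is elided there, so the values used for it are hypothetical, as are the key names and the helper functions.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <string>
#include <vector>

using TermList = std::vector<int>;  // document IDs, kept sorted

// Intersect two sorted term lists.
TermList intersect(const TermList& a, const TermList& b) {
    TermList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Resolve an AND query: keep only documents present in every term's list.
TermList resolve_all(const std::map<std::string, TermList>& index,
                     const std::vector<std::string>& terms) {
    TermList result;
    bool first = true;
    for (const auto& term : terms) {
        auto it = index.find(term);
        if (it == index.end()) return {};  // unknown term: nothing can match
        if (first) {
            result = it->second;
            first = false;
        } else {
            result = intersect(result, it->second);
        }
    }
    return result;
}

std::map<std::string, TermList> sample_index() {
    return {
        // Term list taken from the slide.
        {"phrase:brown mice", {125, 152, 516, 522, 765, 890}},
        // Hypothetical values: the slide elides this list.
        {"collection:Draft", {125, 300, 516, 890}},
    };
}
```

With these sample lists, resolving `phrase:brown mice` together with `collection:Draft` yields 125, 516, 890, matching the document references shown on the slide. Because every index type reduces to the same kind of list, text, structure, collection, and permission constraints can be combined in a single pass.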

Scalar Queries

[The universal index term lists from the previous slide are repeated here.]

Which draft articles that contain the phrase brown mice were written before 2010?

Document references: 125, 516, 890, …

Range Indexes

Map document IDs to values, and vice versa, in a compact in-memory representation.

Sorted by value        Sorted by document ID

Value   ID             ID   Value
2002    3              1    2009
2003    10             3    2002
2004    5              4    2007
2004    11             5    2004
2007    4              8    2011
2007    17             10   2003
2009    1              11   2004
2011    8              17   2007
…       …              …    …
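The two tables can be modeled as a pair of sorted arrays, one ordered by value and one by document ID. The C++ sketch below uses the year data from the slide. The `RangeIndex` structure and its method names are assumptions made for illustration, not MarkLogic's actual layout.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical in-memory range index over integer values.
struct RangeIndex {
    std::vector<std::pair<int, int>> by_value;  // (value, docId), sorted by value
    std::vector<std::pair<int, int>> by_id;     // (docId, value), sorted by docId

    explicit RangeIndex(const std::vector<std::pair<int, int>>& id_value_pairs) {
        for (const auto& p : id_value_pairs) {
            by_id.emplace_back(p.first, p.second);
            by_value.emplace_back(p.second, p.first);
        }
        std::sort(by_id.begin(), by_id.end());
        std::sort(by_value.begin(), by_value.end());
    }

    // IDs of all documents whose value is strictly below `limit`, in ID order.
    std::vector<int> ids_below(int limit) const {
        auto stop = std::lower_bound(
            by_value.begin(), by_value.end(), limit,
            [](const std::pair<int, int>& entry, int v) { return entry.first < v; });
        std::vector<int> ids;
        for (auto it = by_value.begin(); it != stop; ++it) ids.push_back(it->second);
        std::sort(ids.begin(), ids.end());
        return ids;
    }

    // Look up the value stored for one document; returns false if not indexed.
    bool value_of(int id, int& out) const {
        auto it = std::lower_bound(
            by_id.begin(), by_id.end(), id,
            [](const std::pair<int, int>& entry, int key) { return entry.first < key; });
        if (it == by_id.end() || it->first != id) return false;
        out = it->second;
        return true;
    }
};

// The (docId, year) pairs shown on the slide.
RangeIndex sample_year_index() {
    std::vector<std::pair<int, int>> pairs = {
        {1, 2009}, {3, 2002}, {4, 2007}, {5, 2004},
        {8, 2011}, {10, 2003}, {11, 2004}, {17, 2007}};
    return RangeIndex(pairs);
}
```

A constraint such as "written before 2010" becomes a binary search in the value-ordered array (`ids_below(2010)`), and the resulting ID list can then be intersected with term lists from the universal index.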


Geospatial Index: A 2-Dimensional Range Index

Fully composable with all other indexes!

Built-in support for:

Point

Box

Circle

Polygon

Complex Polygon

Polygon Intersection

Polygon Containment


Reverse Indexes (Alerting)

1. Load serialized queries as query documents

2. For a given data document, find all queries that match

Can provide real-time alerts during loads

With no significant performance impact!

Can let documents store values as "ranges"

Documents about cities self-defining their geo boundaries

Person documents defining birthdays as ranges, sequences

Can power classifiers and "matchmaker" queries
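The two steps above (store queries, then match each incoming document against them) can be sketched as follows. This is a deliberately simplified C++ model in which a stored query is just a set of words that must all appear in the document. The `ReverseIndex` class and the query IDs are hypothetical, and real serialized queries are far richer.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified reverse index: each stored query is a set of
// words that must ALL be present for the query to match a document.
class ReverseIndex {
public:
    // Step 1: register a query by indexing it under each word it requires.
    void add_query(const std::string& query_id,
                   const std::set<std::string>& required_words) {
        required_count_[query_id] = required_words.size();
        for (const auto& word : required_words) {
            queries_by_word_[word].push_back(query_id);
        }
    }

    // Step 2: for one incoming document, return every stored query it satisfies.
    std::vector<std::string> matches(const std::set<std::string>& doc_words) const {
        std::map<std::string, std::size_t> hits;  // query ID -> required words seen
        for (const auto& word : doc_words) {
            auto it = queries_by_word_.find(word);
            if (it == queries_by_word_.end()) continue;
            for (const auto& query_id : it->second) ++hits[query_id];
        }
        std::vector<std::string> matched;
        for (const auto& h : hits) {
            if (h.second == required_count_.at(h.first)) matched.push_back(h.first);
        }
        return matched;  // ordered by query ID, since std::map iterates in key order
    }

private:
    std::map<std::string, std::vector<std::string>> queries_by_word_;
    std::map<std::string, std::size_t> required_count_;
};
```

The cost of matching depends on the words in the incoming document rather than on the total number of stored queries, which is the intuition behind alerting during loads without a large performance penalty.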

Range Indexes

[The range index tables from the earlier Range Indexes slide are repeated here.]

Range Indexes work like a built-in in-memory column store

Facets and Aggregation


Interactive Visualization


In-database Analytic Functions

Leverage ready-made analytic built-ins for commonly-used numeric applications

Variance

Covariance

Correlation

Standard deviation

Linear model

Median

Mode

Percentile

Rank

Percent-rank

Benefits

Faster analytics-based application development

Supports more users & more data

Eliminates costs associated with writing custom code
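For reference, a few of the listed functions can be written out in plain C++ as below. This is only a sketch of the underlying formulas, not the database built-ins. It uses the sample (n - 1) form of variance and covariance, which is an assumption, since the slide does not say which form the built-ins use.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean; requires a non-empty input.
double mean(const std::vector<double>& xs) {
    double total = 0.0;
    for (double x : xs) total += x;
    return total / static_cast<double>(xs.size());
}

// Sample covariance (divides by n - 1); requires xs.size() == ys.size() >= 2.
double covariance(const std::vector<double>& xs, const std::vector<double>& ys) {
    const double mx = mean(xs);
    const double my = mean(ys);
    double acc = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        acc += (xs[i] - mx) * (ys[i] - my);
    }
    return acc / static_cast<double>(xs.size() - 1);
}

// Sample variance is the covariance of a series with itself.
double variance(const std::vector<double>& xs) {
    return covariance(xs, xs);
}

double standard_deviation(const std::vector<double>& xs) {
    return std::sqrt(variance(xs));
}

// Pearson correlation coefficient; undefined if either series is constant.
double correlation(const std::vector<double>& xs, const std::vector<double>& ys) {
    return covariance(xs, ys) / (standard_deviation(xs) * standard_deviation(ys));
}
```

Running such functions inside the database, over range indexes, avoids shipping raw values to the application just to reduce them to one number.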

User-defined Functions

class InfluenceRank : public AggregateUDF
{
public:
    // Running totals that make up this aggregate's state
    struct Value {
        double sum, sum_sq, count;
        Value() : sum(0), sum_sq(0), count(0) {}
    } value;

    AggregateUDF* clone() const { return new InfluenceRank(*this); }
    void close() { delete this; }

    // Lifecycle hooks: set up, accumulate values, merge partial states, emit the result
    void start(Sequence&, Reporter&) {}
    void finish(OutputSequence& os, Reporter& reporter);
    void map(TupleIterator& values, Reporter& reporter);
    void reduce(const AggregateUDF* _o, Reporter& reporter);

    // Serialize and deserialize the state
    void encode(Encoder& e, Reporter& reporter);
    void decode(Decoder& d, Reporter& reporter);
};

In-database MapReduce

[Diagram: UDF method calls across the cluster, shown as three call sequences: start, encode; decode, reduce, finish; decode, map, reduce, encode.]
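The call sequences in the diagram can be imitated with a small standalone C++ mock. It does not use the MarkLogic UDF SDK: the `Partial` struct, its text wire format, and `cluster_stddev` are all illustrative assumptions. Each simulated host maps its local values into a partial state, the state is encoded and decoded as if sent over the network, and the partials are reduced before a final finish step.

```cpp
#include <algorithm>
#include <cmath>
#include <sstream>
#include <string>
#include <vector>

// Standalone mock of a map / reduce / encode / decode / finish lifecycle.
struct Partial {
    double sum = 0.0;
    double sum_sq = 0.0;
    double count = 0.0;

    // map: fold one host's local values into its partial state.
    void map(const std::vector<double>& values) {
        for (double v : values) {
            sum += v;
            sum_sq += v * v;
            count += 1.0;
        }
    }

    // reduce: merge another partial state into this one.
    void reduce(const Partial& other) {
        sum += other.sum;
        sum_sq += other.sum_sq;
        count += other.count;
    }

    // encode: serialize the state to a text wire format (an assumption here).
    std::string encode() const {
        std::ostringstream os;
        os.precision(17);
        os << sum << ' ' << sum_sq << ' ' << count;
        return os.str();
    }

    // decode: rebuild a state from the wire format.
    static Partial decode(const std::string& wire) {
        std::istringstream is(wire);
        Partial p;
        is >> p.sum >> p.sum_sq >> p.count;
        return p;
    }

    // finish: turn the merged state into the final answer,
    // here the population standard deviation.
    double finish() const {
        if (count == 0.0) return 0.0;
        const double m = sum / count;
        return std::sqrt(std::max(0.0, sum_sq / count - m * m));
    }
};

// Simulate a cluster: map on each host, ship the encoded partial, reduce, finish.
double cluster_stddev(const std::vector<std::vector<double>>& hosts) {
    Partial total;
    for (const auto& local_values : hosts) {
        Partial local;
        local.map(local_values);
        total.reduce(Partial::decode(local.encode()));
    }
    return total.finish();
}
```

The state kept here (sum, sum of squares, count) mirrors the `Value` struct in the UDF class on the previous slide. It is small and mergeable, so only a few numbers per host cross the network rather than the raw values.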


SQL and BI Tools

ODBC

SQL

Range Indexes


HA/DR Features of MarkLogic

Needs expansion:
• How local-disk/shared-disk failover works
• How Flexrep works
• How DBRep works

MarkLogic 6

Powerful: Everything you need to deliver business value
• Flexible Indexes
• Full Text Search
• Schema-Agnostic
• Scalable Analytic Functions
• Hadoop Distribution
• Alerting & Event Processing
• Geospatial Query
• In-database MapReduce
• Visualization Widgets

Trusted: Enterprise-ready for mission-critical apps
• Transactions
• Role-based Security
• Automated Failover
• Replication
• Journal Archiving
• Point-in-time Recovery
• Database Rollback
• Backup/Restore
• Distributed Transactions
• Super-clusters

Accessible: Leverage existing tools, knowledge, skills
• REST & Java APIs
• JSON Storage
• Application Builder
• Information Studio
• Hadoop Connector
• Content Pump
• BI Integration
• SQL Support
• Monitoring & Management
• OS Support

Any Questions?


What is Semantics Technology?


Elasticity

Aligning infrastructure + demand, continually

• New tools to characterize and monitor the resource requirements of your applications and loads.
• A dynamic provisioning system that can add or subtract resources on the fly to match the loads.
• Distributed and virtualized environments, including VMware, Amazon AWS, and Hadoop, are supported for scaling out.
• Make the cloud a first-class citizen: use Hadoop HDFS or Amazon S3 for backup.

Tiered storage

[Diagram: storage tiers labeled ML, SSD, local, HDFS, and Amazon S3]

Benefits
• Keep data on tiers appropriate to access needs = lower costs
• Detach and reattach storage when needed. Fewer compute nodes required = lower costs
• Leverage Hadoop HDFS investment
• Choose infrastructure based on value of data stored.
• 100% online with different tiers at different SLAs/topologies
• On-line/near-line mix utilizing mount on-demand and dynamic node spin-up.

Tiered Storage New Constructs
• Range partitions by Date/Scalar manage group of forests by range (“Q1” or “1990-1995”)
• Super Databases federate queries across multiple databases

Tiered Storage

[Chart: Total Size (TB), Total Cost ($000), and Effective Unit Cost ($/GB) for the Operational, Compliance, and Analytic tiers. Values shown: 96, 504, 1,044; 592, 2,066, 2,080; $25, $4, $1.50.]

Perceptions & Questions

Analyst: Robin Bloor

The Bloor Group

Database Innovation

Database used to be a “zero-innovation market.” Now it is the opposite.

• Traditional (relational) database is now seen (rightly) as inadequate in many respects
• Big Data is, mainly, new data posing new problems
• New products are emerging and some older products are being given a make-over (and gaining popularity)
• Hadoop has changed perceptions and thinking about database

Multiple Database Roles

HAVE INCREASED SIGNIFICANTLY…


The Analytics Issue


The Origin of Big Data


NoSQL Confusion

As the graph indicates, NoSQL is a very confusing descriptor.

The important question is: WHAT CAN A GIVEN DATABASE ACTUALLY DO?

The Joys and Sorrows of SQL

SQL:
• Very good for set manipulation
• Works for OLTP and many query environments
• Not good for nested data structures (documents, web pages, etc.)
• Not good for ordered data sets
• Not good for data graphs (networks of values)

• In my view we have reached a situation where there will be multiple “data engines.” Is that MarkLogic’s view?
• Specifically, are there data structures or database contexts for which MarkLogic is inappropriate?
• What new features or capabilities are on the MarkLogic roadmap?
• In your view, is the “age of the data warehouse” over?

• Which sectors/businesses are currently in MarkLogic’s “sweet spot”?
• Data analytics involves much more than having analytical functions in the database. It is more than 50% data prep (merging, cleansing, joining, transformation, etc.). How does MarkLogic accommodate that?
• What is MarkLogic’s attitude to the cloud? Specifically, where would it recommend cloud deployment?

Upcoming Topics

July: CLOUD

August: HIGH PERFORMANCE ANALYTICS

September: ANALYTICS

www.insideanalysis.com

Thank You for Your Attention