Cloud Computing and Big Data Xiaofeng Meng Renmin University of China Forum of Future Data


Page 1: Cloud Computing and Big Data Xiaofeng Meng Renmin ...idke.ruc.edu.cn/invited_talk/Big data.pdfSybase IQ Infobright Kognitb WX2 ... Only support fast rowkey based query ... Organized

Cloud Computing and Big Data

Xiaofeng Meng

Renmin University of China

Forum of Future Data

FFD, 2012, Wuyishan

Outline

1. Introduction to Big Data

2. Cloud Computing and Big Data

3. Challenging Problems

4. Our Work

5. Conclusion



Big Data is so hot!

Google Trends of Big Data

Big Data Across the Federal Government (USA, March 2012)


What is Big Data?

DB (Database) vs. BD (Big Data)

"Small data", Very Large Database (VLDB): MB, structured data

Treats data as the object: solves the problems of storing and managing it

Big Data, Extremely Large Database (XLDB): >PB, unstructured data

Treats data as a resource: solves problems in many domains

Data Engineering

Data Thinking

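What Can Big Data Do?

Wall Street sells stocks based on public sentiment

Hedge funds analyze companies' product sales based on customer reviews on shopping sites

Banks infer the employment rate from the number of postings on job sites

Investment firms collect and analyze statements from listed companies to look for early signs of bankruptcy

The US Centers for Disease Control and Prevention analyzes the worldwide spread of influenza and other epidemics based on web users' searches

US President Obama's campaign team analyzes voters' preferences for the presidential candidates in real time based on voters' microblog posts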


Prediction

Big Data Application

Application | Users | Accuracy | Reliability | Data volume | Response
Scientific computing | Few | Very high | Low to medium | Tera | Slow
Stock trading | Many | High | Very high | Giga | Fast
Web data | Many | Medium to high | Medium | Peta | Fast
Microblog data | Many | Medium to high | Medium | 100 Peta | Fast
...


Cloud Computing and Big Data

Cloud Computing is like a highway that can carry many kinds of traffic

Big Data can be seen as one vehicle on that highway

Cloud Computing is the infrastructure, while Big Data is what it serves

Big Data Analysis Pipeline

Acquisition

Extraction & Cleaning

Integration

Analysis

Interpretation

Cloud computing can greatly help each step of this process

From:


Data, Data and Data!

Data is all around you!

Data types are varied

Most data is held by companies

It is difficult for researchers to get the data

No Size Fits All

Web data

Science data

Financial data

Moving object data

…

Big Data Analytics Tools in Use

[Figure: bar chart, horizontal axis from 0% to 40%. Categories: Oracle Exadata; Microsoft SQL PDW; IBM DB2 Smart Analytics System; Hadoop/MapReduce; IBM Netezza; HP Vertica; Teradata EDW; EMC Greenplum; Sybase IQ; Infobright; Kognitio WX2; ParAccel Analytic Database; Other; We aren't using big data analytics tools; Don't know. Values as extracted, in order: 21%, 18%, 12%, 11%, 10%, 9%, 9%, 8%, 4%, 2%, 1%, 1%, 3%, 35%, 11%.]

"Fishing in the ocean" vs. "fishing in a pond"

"Data is widely available; what is scarce is the ability to extract wisdom from it."

Parallelism

Parallelism across nodes in a cluster

Parallelism within a single node

Cloud Computing

New hardware: SSD, PCM…

Timeliness

Many situations require analysis results immediately

Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media.

Compute partial results in advance, then update them incrementally

New index structures are required
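
The idea of computing partial results in advance and then updating them incrementally can be sketched as follows. This is a minimal illustration and not code from the talk; the class name `RunningStats` and the sample values are assumptions made for the example.

```python
class RunningStats:
    """Keeps a partial aggregate (count, sum) that is updated per record."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        # Incremental step: O(1) work per new record, no rescan of old data.
        self.count += 1
        self.total += value

    def mean(self):
        return self.total / self.count if self.count else 0.0


stats = RunningStats()
for price in [10.0, 12.0, 11.0]:   # partial result computed in advance
    stats.add(price)
stats.add(15.0)                    # a new record arrives later
current_mean = stats.mean()
```

The answer is available at any moment from the stored partial state, which is what makes the approach attractive when results are needed immediately.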

From:

Privacy

Managing privacy is both a technical and a sociological problem

New data sources bring new problems: LBS, microblogs…

Share private data while limiting disclosure and ensuring sufficient data utility in the shared data

Differential privacy is a very important step, but it reduces information content too far to be useful in most practical cases
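
The trade-off between privacy and utility mentioned above can be illustrated with the Laplace mechanism, a standard way to answer a counting query under differential privacy. This is a hedged sketch, not material from the talk; the function names, the `ages` data, and the epsilon values are assumptions chosen for the example.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Counting query with Laplace noise; a count has sensitivity 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 37, 61, 19]   # made-up records
true_count = sum(1 for a in ages if a >= 30)
# Smaller epsilon means stronger privacy but a noisier, less useful answer.
noisy_strong = private_count(ages, lambda a: a >= 30, epsilon=0.1, rng=rng)
noisy_weak = private_count(ages, lambda a: a >= 30, epsilon=10.0, rng=rng)
```

With `epsilon=0.1` the noise has scale 10, larger than the true count itself, which shows how quickly the information content can be lost on small data.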

From:


Overview of our work: Web Data Management

2011-Present: ScholarSpace

2010: EasyScholar

2009: C-DBLP

2006: Deep Web Integration

2001: Surface Web Data Extraction

Domain-oriented Web data integration: several online systems were successfully developed, validating the effectiveness of the data integration techniques

ScholarSpace; Gongzuotong (工作通) data integration system

Public opinion monitoring platform; book price comparison site

(more than 3.5 million visits) (more than 3 million integrated records)

(more than 4.5 million integrated records) (dynamic integration, real-time data)

ScholarSpace

Publications: 500,000; authors: 400,000

Total visits: 4 million; daily visits: 6,000

ScholarSpace

Entities: authors, papers, journals, conferences, research institutions, …

Relationships: author, publication, co-author, affiliation, advisor, reference, …

Data extraction

Data integration

[Figure: entity graph with edges labeled Advisor, Co-Author, Author-Of, Published-In, Member, Classmate, Reference]

Relationship evolution: relationship discovery, deletion, update

Browse (task-based), Query (multiple forms), Analysis (rich and varied)

Web Data Management Framework

Significance of the Results

Established an approach to managing data in structured form, laying the foundation for solving big data integration problems in specific domains

In turn, this offers a new line of attack on big data management

Overview of our work: Cloud Data Management

Present: Extensive Research & TaijiDB_v2.0

2011/06: Query Process & Benchmark_v2.0

2010/06: TaijiDB_v1.0

2010/01: System Survey & Benchmark_v1.0

2008/04: Introduction & Index for Cloud

join query, distribution strategy, progress estimate

multidimensional index, query optimization, online aggregation


Current Work

Multidimensional Index in the Cloud

Multi-Fields Query Processing in the Cloud

Online Aggregation in the Cloud

Our Prototype System: TaijiDB

Benchmarking Cloud-based Data Management Systems

Motivated by practical industrial applications


Multi-dimensional Index in the cloud - motivation

Massive

Millions of sensors or GPS-enabled devices

10^6 * 2 * 60 * 24 * 1 KB ≈ 3 TB/day

High update frequency

Data collection frequency

Hundreds of thousands of insertions per second

Multi-dimensional

Inherent attributes: spatio-temporal attributes

Other attributes: speed, direction …
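
The daily volume figure can be checked with a few lines of arithmetic. The slide does not say what the factor 2 stands for; the sketch below assumes it is two records per device per minute, and it uses decimal units (1 TB = 10^9 KB). Both are assumptions.

```python
devices = 10**6            # one million devices (from the slide)
records_per_minute = 2     # assumption: the factor 2 is a per-minute rate
minutes_per_day = 60 * 24
record_size_kb = 1         # 1 KB per record (from the slide)

kb_per_day = devices * records_per_minute * minutes_per_day * record_size_kb
tb_per_day = kb_per_day / 10**9   # decimal units: 1 TB = 10^9 KB
```

Under these assumptions the result is 2.88 TB per day, which the slide rounds to 3 TB/day.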

Toyota: G-Book

1 million+ members

GM: OnStar

5 million+ members

Internet of vehicles

Collaboration with NEC

Limits of Current Approaches

Traditional DBMS

• Struggles with scalability

• Cannot support high insert throughput

Key-value Stores

Pros

• High scalability, availability, and fault tolerance

• Efficient random reads and writes

• Supports high insertion throughput

Cons

• Only supports fast rowkey-based queries

• Cannot support multi-dimensional queries efficiently

Requirements

Design a new index model that supports efficient multi-dimensional queries, based on the characteristics of IoT applications

The index model must support high insert throughput at the same time

Implement the new index on HBase

Multi-level Index Framework

Divide the data into current data and historical data, and index them at different granularities

For current data, index time intervals and subspaces at a coarse level; for historical data, index each record, in batches
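
One way to picture the two granularities is the toy structure below. It is a sketch under stated assumptions, not the talk's HBase implementation: the class `TwoLevelIndex`, the bucket sizes `INTERVAL` and `CELL`, and the integer coordinates are all hypothetical.

```python
from collections import defaultdict

INTERVAL = 60    # seconds per time bucket (assumed value)
CELL = 1000      # grid cell size in metres (assumed value)


def coarse_key(ts, x, y):
    # Coarse level: one key per (time interval, subspace), not per record.
    return (ts // INTERVAL, x // CELL, y // CELL)


class TwoLevelIndex:
    def __init__(self):
        self.current = defaultdict(list)   # coarse key -> recent records
        self.historical = {}               # (ts, record id) -> position

    def insert(self, rec_id, ts, x, y):
        # Cheap insert path: append to the bucket for the coarse key.
        self.current[coarse_key(ts, x, y)].append((rec_id, ts, x, y))

    def archive(self, before_ts):
        # Batch step: move fully elapsed buckets into the per-record index.
        old = [k for k in self.current if (k[0] + 1) * INTERVAL <= before_ts]
        for key in old:
            for rec_id, ts, x, y in self.current.pop(key):
                self.historical[(ts, rec_id)] = (x, y)
        return len(old)


index = TwoLevelIndex()
index.insert("a", 10, 1200, 3400)
index.insert("b", 20, 1900, 3100)
index.insert("c", 130, 5200, 700)
moved_buckets = index.archive(before_ts=120)
```

Inserts touch only a coarse bucket, so the write path stays cheap, while the per-record work is deferred to the batch `archive` step.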

Z-order Based Dynamic Space Partitioning

Advantages

Ensures the data is distributed evenly

Data that is close in the original time and space dimensions can be stored in the same regions
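
A Z-order (Morton) key is built by interleaving the bits of the coordinates, so points that are close in space tend to get nearby keys. The sketch below shows the standard two-dimensional encoding; the talk's scheme also involves the time dimension and dynamic partitioning, which are not modeled here, and the function name `z_order` is an assumption.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y into one Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits at odd positions
    return key


# Cells of a 4 x 4 grid, sorted by their Z-order key.
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: z_order(*c))
first_quadrant = cells[:4]
```

The first four keys cover exactly the 2 x 2 block at the origin, so a store that range-partitions on the key would place that block in one region.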


Multi-Fields Query Processing in the Cloud - Motivation

Input

Output

R (msisdn, url, ts, size, otherData)

    Select Top 100 url, sum(size) s, count(msisdn) c
    From R
    Where msisdn = 861346672558 And ts > 20120205 And ts < 20120429
    Group by url
    Order by c

    Select Top 100 msisdn, sum(size) s, count(url) c
    From R
    Where url = 'www.baidu.com' And ts > 20120205 And ts < 20120429
    Group by msisdn
    Order by c

Collaboration with Nokia Siemens Networks (诺西)

Multiple Layer Grid Tree For Telecom

    typedef struct MLGT {
        int N;
        int M;
        boolean bm[M][N];
        long insets[M][N];
        long trange;
        long mspace;
        Map<RegionID, MLGT> SR;
    } MLGT;

[Figure: grid with a ts axis and an msisdn axis; corner points (0,0), (mspace,0), (0,trange). Labels: Region (Cell), Split Region, sub-MLGT.]

MLGT (Multiple Layer Grid Tree)

Solution: MLGT + Optimized MapReduce Algorithm

Organize all the Regions of a given table into a multiple layer grid tree (MLGT)
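
The struct on this slide can be read as a grid whose cells may point to nested grids. The sketch below follows that reading in Python; the meaning given to each field (`bm` as a "cell has been split" bitmap, `sr` as the map from a cell to its sub-MLGT) and the `locate` method are assumptions, and coordinates are assumed to be already offset into [0, mspace) and [0, trange).

```python
from dataclasses import dataclass, field


@dataclass
class MLGT:
    m: int          # number of cells along the msisdn axis
    n: int          # number of cells along the ts axis
    mspace: int     # extent of the msisdn axis
    trange: int     # extent of the ts axis
    bm: list = field(init=False)      # bm[i][j]: cell (i, j) has been split
    insets: list = field(init=False)  # per-cell counters (meaning assumed)
    sr: dict = field(default_factory=dict)   # (i, j) -> sub-MLGT

    def __post_init__(self):
        self.bm = [[False] * self.n for _ in range(self.m)]
        self.insets = [[0] * self.n for _ in range(self.m)]

    def cell(self, msisdn, ts):
        i = min(msisdn * self.m // self.mspace, self.m - 1)
        j = min(ts * self.n // self.trange, self.n - 1)
        return i, j

    def locate(self, msisdn, ts, path=()):
        """Return the chain of cells from the root grid down to a leaf."""
        i, j = self.cell(msisdn, ts)
        path = path + ((i, j),)
        if self.bm[i][j]:
            # Descend with coordinates relative to the split cell's origin.
            sub = self.sr[(i, j)]
            return sub.locate(msisdn - i * self.mspace // self.m,
                              ts - j * self.trange // self.n, path)
        return path


root = MLGT(m=4, n=4, mspace=1000, trange=400)
root.bm[1][2] = True
root.sr[(1, 2)] = MLGT(m=2, n=2, mspace=250, trange=100)
deep_path = root.locate(300, 260)
flat_path = root.locate(900, 10)
```

A query on a range of msisdn and ts values would use the same descent to pick only the Regions it has to scan, which is presumably where the optimized MapReduce step comes in.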

Multiple Layer Grid Tree For Telecom

[Figure: query processing flow with numbered steps 1 to 5. Components: Query decomposition, HBase (regions), Map/Reduce (map tasks). Data flows: Query, Tablets meta info, setting parameters, Job settings, Query results.]


Online Aggregation in the Cloud - Motivation

Wikipedia Page Traffic Statistics

    SELECT language, SUM(pageviews)
    FROM table
    WHERE language IN ('en', 'ja', 'de', 'es', 'fr', 'it', 'pl')
    GROUP BY language

20TB

Amazon EC2, 60-node cluster


Batch processing: 95 h, $1400

Online aggregation: 1 h, results with 95% confidence

Saves cost!
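
The contrast between batch processing and online aggregation can be sketched with a running estimate of a SUM that tightens as more rows are read. This is a generic textbook-style illustration, not COLA's algorithm; the function `online_sum`, the synthetic `pageviews` data, and the normal-approximation interval with z = 1.96 are assumptions.

```python
import math
import random


def online_sum(values, batch_size, z=1.96):
    """Yield (rows_seen, estimated_sum, half_width) after each batch.

    Rows are read in random order; the running sample mean is scaled up to
    the table size, and a normal-approximation interval is attached.
    """
    n_total = len(values)
    order = list(range(n_total))
    random.Random(42).shuffle(order)

    seen, total, total_sq = 0, 0.0, 0.0
    for start in range(0, n_total, batch_size):
        for idx in order[start:start + batch_size]:
            v = values[idx]
            seen += 1
            total += v
            total_sq += v * v
        mean = total / seen
        var = max(total_sq / seen - mean * mean, 0.0)
        half_width = z * n_total * math.sqrt(var / seen)
        yield seen, n_total * mean, half_width


pageviews = [(i * 37) % 101 for i in range(10_000)]   # synthetic data
estimates = list(online_sum(pageviews, batch_size=1_000))
first_seen, first_est, first_hw = estimates[0]
last_seen, last_est, last_hw = estimates[-1]
```

A user can stop as soon as the interval is narrow enough, instead of paying for a full scan. The sketch omits the finite-population correction, so the interval does not shrink to zero even after every row has been read.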

COLA - Architecture

Online Aggregation Executor: state management, estimation, progress prediction

Query Engine: backward compatibility, transparent

User Interface: 2 interfaces, 2 processing modes

Data Manager: data sampling, metadata management

COLA - Implementation

Components: Result Estimator, State Manager, Data Sampler, OLA Translator, Progress Predictor

Map Translator, Combine Translator, Reduce Translator, No Translator

Result Estimation & Confidence Interval Computation

Combiner + Reducer

Split-based Queue: a queue for a table, equal length

A State Manager for a Reducer; Stateful Incremental Computation

MapReduce DAG Graph, Task-based PERT Network, Critical Path


Benchmarking the CloudDBMS - Motivation

How is the performance?

How to choose the most appropriate system?

How to evaluate the systems?

Existing CloudDB

[Figure labels: Data Analysis, Web Data Management, Applications, Architecture, Storage, Key-Value, Data Model]

Benchmark Design

Standardization

Broad representation

Efficiency

Benchmark

Scenario: real application scenario from industry

Metrics: a series of metrics to evaluate performance

Operation: representative operations in the business application

Test case: business process in the application


Benchmark Scenario

Input

Output


Benchmark Operations

PUT

PUT(KEY,VALUE)

GET

VALUE = GET(KEY)

SCAN

RESULT = SCAN(STARTKEY,ENDKEY)

LOAD

LOAD(PATH)
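
The four benchmark operations can be made concrete with a toy in-memory store. This is a sketch for illustration only, not the benchmark's driver or any real system's API: the class `ToyKVStore`, the half-open scan range, and the tab-separated file format for LOAD are assumptions.

```python
import bisect
import os
import tempfile


class ToyKVStore:
    """In-memory stand-in for a key-value store with sorted keys."""

    def __init__(self):
        self._keys = []     # kept sorted so SCAN can use binary search
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def scan(self, start_key, end_key):
        # Range is [start_key, end_key), an assumed convention.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, end_key)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]

    def load(self, path):
        # Bulk load from a file with one "key<TAB>value" pair per line.
        count = 0
        with open(path, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                self.put(key, value)
                count += 1
        return count


store = ToyKVStore()
store.put("row3", "c")
store.put("row1", "a")

with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("row2\tb\nrow4\td\n")
loaded = store.load(tmp.name)
os.remove(tmp.name)

value = store.get("row1")
rows = store.scan("row2", "row4")
```

A benchmark driver would time many such calls against each system under test and report the metrics defined for the scenario.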

Evaluation Results

[Figure: charts titled Data Import, File Load, and Scalability. Axes: response time, nodes. Series: Partition, Without partition.]


TaijiDB - Motivation

Real World Applications

Big Data

Cloud Computing

Cloud Based DBMS

No one-size-fits-all solution in the cloud

TaijiDB: A TitAnIc and Just-In-time DataBase

System Architecture

2012/7/10

[Figure: TaijiDB architecture diagram. Labels: E-Commerce; Internet of Things; Telecom; Front-end Interface; Basic SQL Interface/Application Interfaces; Unified API; Query Processing; Query Optimization; Random Sampling Algorithm; Multi-Level Index; MLGT Algorithm; Progress Estimating; Online Aggregation; Unified Execution Engine; Index Management; Storage Manager; Storage Management; SSD & Buffer Management; Operation & Management Service; Security; Lock; Monitoring; Load Balance; Metadata; Testing; HBase (HMaster, HRegionServer); Cassandra (Keyspace, SuperColumn, Thrift Interface); HDFS (Tables & Files & Logs)]


Summary

Cloud Computing helps organizations store, manage, share, and analyze their Big Data in an affordable and easy-to-use way

The concept of Big Data is broad and vague. We must focus on one or a few domains.

Data thinking: nothing can be done without data

Different situations need different types of processing: batch or stream

Both hardware and software need to be updated

Xiangshan Science Conference

Network Data Science and Engineering

Li Guojie (李国杰), Hua Yunsheng (华云生), Yao Qizhi (姚期智), Cheng Siwei (成思危)

Main topics

Applications of network big data in society, the economy, and IT

Common theoretical foundations of network data science

Building a healthy ecosystem for network big data

China Science Daily, Li Guojie

XLDB Asia 2012

Invited Talks

Reference cases from scientific communities: Astroinformatics, Geoinformatics, Earth…

Reference cases from industry: Facebook, eBay, EMC, Taobao…

Research on Big Data Management: Laura (IBM), Xiaodong Zhang (Ohio), Martin (MonetDB)…

Panel Discussion

Handling Extremely Large Scientific Data

NoSQL: the Cure for Big Data?

Evolution or Revolution: Database Research for Big Data

Lightning talks

"The amount of data produced in every coming 18 months will equal the total amount of data produced in all prior history."

--Jim Gray, 1998 Turing Award lecture

Thank you!