big-data analytics: challenges and opportunities

57
Big-data Analytics: Challenges and Opportunities Chih-Jen Lin Department of Computer Science National Taiwan University Talk at c˙x}t, August 30, 2014 Chih-Jen Lin (National Taiwan Univ.) 1 / 54

Post on 27-Nov-2014

19.131 views

Category:

Technology


1 download

DESCRIPTION

林智仁 (Chih-Jen Lin) 國立臺灣大學資訊工程學系特聘教授 林智仁教授畢業於台大數學系,之後前往美國密西根大學獲得碩士與博士學位。他的研究團隊多年來致力於機器學習之相關研究,所發展的資料分類軟體在世界上被廣泛使用,是台灣計算機科學界突出的重要成果。

TRANSCRIPT

Page 1: Big-data analytics: challenges and opportunities

Big-data Analytics: Challenges andOpportunities

Chih-Jen LinDepartment of Computer Science

National Taiwan University

Talk at 台灣資料科學愛好者年會, August 30, 2014

Chih-Jen Lin (National Taiwan Univ.) 1 / 54

Page 2: Big-data analytics: challenges and opportunities

Everybody talks about big data now, but it’s not easy tohave an overall picture of this subject

In this talk, I will give some personal thoughts ontechnical developments of big-data analytics. Some arevery pre-mature, so your comments are very welcome

Chih-Jen Lin (National Taiwan Univ.) 2 / 54

Page 3: Big-data analytics: challenges and opportunities

Outline

1 From data mining to big data

2 Challenges

3 Opportunities

4 Discussion and conclusions

Chih-Jen Lin (National Taiwan Univ.) 3 / 54

Page 4: Big-data analytics: challenges and opportunities

From data mining to big data

Outline

1 From data mining to big data

2 Challenges

3 Opportunities

4 Discussion and conclusions

Chih-Jen Lin (National Taiwan Univ.) 4 / 54

Page 5: Big-data analytics: challenges and opportunities

From data mining to big data

From Data Mining to Big Data

In early 90’s, a buzzword called data miningappeared

Many years after, we have another one called bigdata

Well, what’s the difference?

Chih-Jen Lin (National Taiwan Univ.) 5 / 54

Page 6: Big-data analytics: challenges and opportunities

From data mining to big data

Status of Data Mining and MachineLearning

Over the years, we have all kinds of effectivemethods for classification, clustering, and regression

We also have good integrated tools for data mining(e.g., Weka, R, Scikit-learn)

However, mining useful information remains difficultfor some real-world applications

Chih-Jen Lin (National Taiwan Univ.) 6 / 54

Page 7: Big-data analytics: challenges and opportunities

From data mining to big data

What’s Big Data?

• Though many definitions areavailable, I am consideringthe situation that data arelarger than the capacity of acomputer

• I think this is a maindifference between datamining and big data

• So in a sense we are talkingabout distributed datamining or machine learning

(a), (b): distributedsystemsImage from Wikimedia

Chih-Jen Lin (National Taiwan Univ.) 7 / 54

Page 8: Big-data analytics: challenges and opportunities

From data mining to big data

From Small to Big Data

Two important differences:

Negative side:

Methods for big data analytics are not quite ready,not even mentioned to integrated tools

Positive side:

Some (Halevy et al., 2009) argue that the almostunlimited data make us easier to mine information

I will discuss the first difference

Chih-Jen Lin (National Taiwan Univ.) 8 / 54

Page 9: Big-data analytics: challenges and opportunities

Challenges

Outline

1 From data mining to big data

2 Challenges

3 Opportunities

4 Discussion and conclusions

Chih-Jen Lin (National Taiwan Univ.) 9 / 54

Page 10: Big-data analytics: challenges and opportunities

Challenges

Possible Advantages of Distributed DataAnalytics

Parallel data loading

Reading several TB data from disk is slow

Using 100 machines, each has 1/100 data in itslocal disk ⇒ 1/100 loading time

But having data ready in these 100 machines isanother issue

Fault tolerance

Some data replicated across machines: if one fails,others are still available

Chih-Jen Lin (National Taiwan Univ.) 10 / 54

Page 11: Big-data analytics: challenges and opportunities

Challenges

Possible Advantages of Distributed DataAnalytics (Cont’d)

Workflow not interrupted

If data are already distributedly stored, it’s notconvenient to reduce some to one machine foranalysis

Chih-Jen Lin (National Taiwan Univ.) 11 / 54

Page 12: Big-data analytics: challenges and opportunities

Challenges

Possible Disadvantages of DistributedData Analytics

More complicated (of course)

Communication and synchronization

Everybody says moving computation to data, butthis isn’t that easy

Chih-Jen Lin (National Taiwan Univ.) 12 / 54

Page 13: Big-data analytics: challenges and opportunities

Challenges

Going Distributed or Not Isn’t Easy toDecide

Quote from Yann LeCun (KDnuggets News 14:n05)

“I have seen people insisting on using Hadoop fordatasets that could easily fit on a flash drive andcould easily be processed on a laptop.”

Now disk and RAM are large. You may load severalTB of data once and conveniently conduct allanalysis

The decision is application dependent

We will discuss this issue again later

Chih-Jen Lin (National Taiwan Univ.) 13 / 54

Page 14: Big-data analytics: challenges and opportunities

Challenges

Distributed Environments

Many easy tasks on one computer become difficultin a distributed environment

For example, subsampling is easy on one machine,but may not be in a distributed system

Usually we attribute the problem to slowcommunication between machines

Chih-Jen Lin (National Taiwan Univ.) 14 / 54

Page 15: Big-data analytics: challenges and opportunities

Challenges

Challenges

Big data, small analysis

versus

Big data, big analysis

If you need a single record from a huge set, it’sreasonably easy

For example, accessing your high-speed railreservation is fast

However, if you want to analyze the whole set byaccessing data several time, it can be much harder

Chih-Jen Lin (National Taiwan Univ.) 15 / 54

Page 16: Big-data analytics: challenges and opportunities

Challenges

Challenges (Cont’d)

Most existing data mining/machine learningmethods were designed without considering dataaccess and communication of intermediate results

They iteratively use data by assuming they arereadily available

Example: doing least-square regression isn’t easy ina distributed environment

Chih-Jen Lin (National Taiwan Univ.) 16 / 54

Page 17: Big-data analytics: challenges and opportunities

Challenges

Challenges (Cont’d)

So we are facing many challenges

methods not ready

no convenient tools

rapid change on the system side

and many others

What should we do?

Chih-Jen Lin (National Taiwan Univ.) 17 / 54

Page 18: Big-data analytics: challenges and opportunities

Opportunities

Outline

1 From data mining to big data

2 Challenges

3 Opportunities

4 Discussion and conclusions

Chih-Jen Lin (National Taiwan Univ.) 18 / 54

Page 19: Big-data analytics: challenges and opportunities

Opportunities

Opportunities

Looks like we are in the early stage of a researchtopic

But what is our chance?

Chih-Jen Lin (National Taiwan Univ.) 19 / 54

Page 20: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Outline

3 OpportunitiesLessons from past developments in one machineSuccessful examples?Design of big-data algorithms

Chih-Jen Lin (National Taiwan Univ.) 20 / 54

Page 21: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Algorithms for Distributed Data Analytics

This is an on-going research topic.

Roughly there are two types of approaches1 Parallelize existing (single-machine) algorithms2 Design new algorithms particularly for distributed

settings

Of course there are things in between

Chih-Jen Lin (National Taiwan Univ.) 21 / 54

Page 22: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Algorithms for Distributed Data Analytics(Cont’d)

Given the complicated distributed setting, wewonder if easy-to-use big-data analytics tools canever be available?

I don’t know either. Let’s try to think about thesituation on one computer first

Indeed those easy-to-use analytics tools on onecomputer were not there at the first day

Chih-Jen Lin (National Taiwan Univ.) 22 / 54

Page 23: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Past Development on One Computer

The problem now is we take many things forgranted on one computer

On one computer, have you ever worried aboutcalculating the average of some numbers?

Probably not. You can use Excel, statisticalsoftware (e.g., R and SAS), and many things else

We seldom care internally how these tools work

Can we go back to see the early development onone computer and learn some lessons/experiences?

Chih-Jen Lin (National Taiwan Univ.) 23 / 54

Page 24: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product

Consider the example of matrix-matrix products

C = A× B , A ∈ Rn×d ,B ∈ Rd×m

where

Cij =d∑

k=1

AikBkj

This is a simple operation. You can easily write yourown code

Chih-Jen Lin (National Taiwan Univ.) 24 / 54

Page 25: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

A segment of C code (assume n = m here)for (i=0;i<n;i++)

for (j=0;j<n;j++)

{

c[i][j]=0;

for (k=0;k<n;k++)

c[i][j] += a[i][k]*b[k][j];

}

For 3, 000× 3, 000 matrices$ gcc -O3 mat.c

$ time ./a.out

3m24.843sChih-Jen Lin (National Taiwan Univ.) 25 / 54

Page 26: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

But on Matlab (single-thread mode)

$ matlab -singleCompThread

>> tic; c = a*b; toc

Elapsed time is 4.095059 seconds.

Chih-Jen Lin (National Taiwan Univ.) 26 / 54

Page 27: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

How can Matlab be much faster than ours?

The fast implementation comes from some deepresearch and development

Matlab calls optimized BLAS (Basic Linear AlgebraSubroutines) that was developed in 80’s-90’s

Our implementation is slow because data are notavailable for computation

Chih-Jen Lin (National Taiwan Univ.) 27 / 54

Page 28: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

CPU

↓Registers

↓Cache

↓Main Memory

↓Secondary storage (Disk)

↑: increasing in speed

↓: increasing incapacity

Optimized BLAS: tryto make data availablein a higher level ofmemory

You don’t waste timeto frequently movedata

Chih-Jen Lin (National Taiwan Univ.) 28 / 54

Page 29: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

Optimized BLAS uses block algorithms

A× B =

A11 · · · A14...

A41 · · · A44

B11 · · · B14...

B41 · · · B44

=

[A11B11 + · · ·+ A14B41 · · ·

... . . .

]If we compare the number of page faults (cachemisses)

Ours: much larger

Block: much smallerChih-Jen Lin (National Taiwan Univ.) 29 / 54

Page 30: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

I like this example because it involves both

mathematical operations (matrix products),andcomputer architecture (memory hierarchy)

Only if knowing both, you can make breakthroughs

Chih-Jen Lin (National Taiwan Univ.) 30 / 54

Page 31: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Matrix-matrix Product (Cont’d)

For big-data analytics, we are in a similar situation

We want to run mathematical algorithms(classification and clustering) in a complicatedarchitecture (distributed system)

But we are like at the time point before optimizedBLAS was developed

Chih-Jen Lin (National Taiwan Univ.) 31 / 54

Page 32: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Algorithms and Systems

To have technical breakthroughs for big-dataanalytics, we should know both algorithms andsystems well, and consider them together

Indeed, if you are an expert on both topics,everybody wants you now

Many machine learning Ph.D. students don’t knowmuch about systems. But this isn’t the case in theearly days of computer science

Chih-Jen Lin (National Taiwan Univ.) 32 / 54

Page 33: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Algorithms and Systems (Cont’d)

At that time, every numerical analyst knowscomputer architecture well.

That’s how they successfully developedfloating-point systems and IEEE 754/854 standard

Chih-Jen Lin (National Taiwan Univ.) 33 / 54

Page 34: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Machine Learning Using Spark

Recently we developed a classifier on Spark

Spark is an in-memory cluster-computing platform

Beyond algorithms we must take details of

SparkScala

into account

For example, you want to know

the difference between mapPartitions andmap in Spark, andthe slower for loop than while loop in Scala

Chih-Jen Lin (National Taiwan Univ.) 34 / 54

Page 35: Big-data analytics: challenges and opportunities

Opportunities Lessons from past developments in one machine

Example: Machine Learning Using Spark(Cont’d)

During our development, Spark was significantlyupgraded from version 0.9 to 1.0. We must learntheir changes

It’s like when you write a code on a computer, butthe compiler or OS is actively changed. We are in astage just like that.

Chih-Jen Lin (National Taiwan Univ.) 35 / 54

Page 36: Big-data analytics: challenges and opportunities

Opportunities Successful examples?

Outline

3 OpportunitiesLessons from past developments in one machineSuccessful examples?Design of big-data algorithms

Chih-Jen Lin (National Taiwan Univ.) 36 / 54

Page 37: Big-data analytics: challenges and opportunities

Opportunities Successful examples?

Example of Distributed Machine Learning

I don’t think we have many successful examples yet

Here I will show one: CTR (Click Through Rate)prediction for computational advertising

Many companies now run distributed classificationfor CTR problems

Chih-Jen Lin (National Taiwan Univ.) 37 / 54

Page 38: Big-data analytics: challenges and opportunities

Opportunities Successful examples?

Example: CTR Prediction

Definition of CTR:

CTR =# clicks

# impressions.

A sequence of events

Not clicked Features of userClicked Features of userNot clicked Features of user· · · · · ·

A binary classification problem.

Chih-Jen Lin (National Taiwan Univ.) 38 / 54

Page 39: Big-data analytics: challenges and opportunities

Opportunities Successful examples?

Example: CTR Prediction (Cont’d)

Chih-Jen Lin (National Taiwan Univ.) 39 / 54

Page 40: Big-data analytics: challenges and opportunities

Opportunities Design of big-data algorithms

Outline

3 OpportunitiesLessons from past developments in one machineSuccessful examples?Design of big-data algorithms

Chih-Jen Lin (National Taiwan Univ.) 40 / 54

Page 41: Big-data analytics: challenges and opportunities

Opportunities Design of big-data algorithms

Design Considerations

Generally you want to minimize the data access andcommunication in a distributed environment

It’s possible that

method A better than B on one computer

but

method A worse than B in distributed environments

Chih-Jen Lin (National Taiwan Univ.) 41 / 54

Page 42: Big-data analytics: challenges and opportunities

Opportunities Design of big-data algorithms

Design Considerations (Cont’d)

Example: on one computer, often we do batchrather than online learning

Online and streaming learning may be more usefulfor big-data applications

Example: very often we design synchronous parallelalgorithms

Maybe asynchronous ones are better for big data?

Chih-Jen Lin (National Taiwan Univ.) 42 / 54

Page 43: Big-data analytics: challenges and opportunities

Opportunities Design of big-data algorithms

Workflow Issues

Data analytics is often only part of the workflow ofa big-data application

By workflow, I mean things from raw data to finaluse of the results

Other steps may be more complicated than theanalytics step

In one-computer situation, the focus is often on theanalytics step

Chih-Jen Lin (National Taiwan Univ.) 43 / 54

Page 44: Big-data analytics: challenges and opportunities

Opportunities Design of big-data algorithms

How to Get Started?

In my opinion, we should start from applications

Applications → programming frameworks andalgorithms → general tools

Now almost every big-data application requiresspecial settings of algorithms, but I believe generaltools will be possible

Chih-Jen Lin (National Taiwan Univ.) 44 / 54

Page 45: Big-data analytics: challenges and opportunities

Discussion and conclusions

Outline

1 From data mining to big data

2 Challenges

3 Opportunities

4 Discussion and conclusions

Chih-Jen Lin (National Taiwan Univ.) 45 / 54

Page 46: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk of This Topic

It’s unclear how successful we can be

Two problems:

Technology limitsApplicability limits

Chih-Jen Lin (National Taiwan Univ.) 46 / 54

Page 47: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Technology limits

It’s possible that we cannot get satisfactory resultsbecause of the distributed configuration

Recall that parallel programming or HPC (highperformance computing) wasn’t very successful inearly 90’s. But there are two differences this time

1 We are using commodity machines2 Data become the focus

Well, every area has its limitation. The degree ofsuccess varies

Chih-Jen Lin (National Taiwan Univ.) 47 / 54

Page 48: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Technology Limits (Cont’d)

Let’s compare two matrix products:

Dense matrix products: very successful as the finaloutcome (optimized BLAS) is much better thanwhat ordinary users wrote

Sparse matrix products: not as successful. My codeis about as good as those provided by Matlab

For big data analytics, it’s too early to tell

We never know until we try

Chih-Jen Lin (National Taiwan Univ.) 48 / 54

Page 49: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Applicability Limits

What’s the percentage of applications that needbig-data analytics?

Not clear. Indeed some think the percentage issmall (so they think big-data analytics is a hype)

One main reason is that you can always analyze arandom subest on one machine

But you may say this is a chicken and egg problem –because of no available tools, so no applications??

Chih-Jen Lin (National Taiwan Univ.) 49 / 54

Page 50: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Applicability Limits (Cont’d)

Another problem is the mis-understanding

Until recently, few universities or companies canaccess data center environments. They thereforethink those big ones (e.g., Google) are doingbig-data analytics for everything

In fact, the situation isn’t like that

Chih-Jen Lin (National Taiwan Univ.) 50 / 54

Page 51: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Applicability Limits (Cont’d)

A quote from Dan Ariely, “Big data is like teenagesex: everyone talks about it, nobody really knowshow to do it, everyone thinks everyone else is doingit, so everyone claims they are doing it ...”

In my recent visit to a large company, their peopledid say that most analytics works are still done onone machine

Chih-Jen Lin (National Taiwan Univ.) 51 / 54

Page 52: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Applicability Limits (Cont’d)

A quote from Dan Ariely, “Big data is like teenagesex: everyone talks about it, nobody really knowshow to do it, everyone thinks everyone else is doingit, so everyone claims they are doing it ...”

In my recent visit to a large company, their peopledid say that most analytics works are still done onone machine

Chih-Jen Lin (National Taiwan Univ.) 51 / 54

Page 53: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Applicability Limits (Cont’d)

A quote from Dan Ariely, “Big data is like teenagesex: everyone talks about it, nobody really knowshow to do it, everyone thinks everyone else is doingit, so everyone claims they are doing it ...”

In my recent visit to a large company, their peopledid say that most analytics works are still done onone machine

Chih-Jen Lin (National Taiwan Univ.) 51 / 54

Page 54: Big-data analytics: challenges and opportunities

Discussion and conclusions

Risk: Applicability Limits (Cont’d)

A quote from Dan Ariely, “Big data is like teenagesex: everyone talks about it, nobody really knowshow to do it, everyone thinks everyone else is doingit, so everyone claims they are doing it ...”

In my recent visit to a large company, their peopledid say that most analytics works are still done onone machine

Chih-Jen Lin (National Taiwan Univ.) 51 / 54

Page 55: Big-data analytics: challenges and opportunities

Discussion and conclusions

Open-source Developments

Open-source developments are very important forbig data analytics

How it works:

The company must do an application X. Theyconsider an open-source tool Y. But Y is notenough for X. Then their engineers improve Y andsubmit pull requests

Through this process, core developers of a projectare formed. They are from various companies

Chih-Jen Lin (National Taiwan Univ.) 52 / 54

Page 56: Big-data analytics: challenges and opportunities

Discussion and conclusions

Open-source Developments (Cont’d)

For Taiwanese data-science companies, I think weshould actively participate in such developments

Indeed industry rather than schools are in a betterposition to do this

Chih-Jen Lin (National Taiwan Univ.) 53 / 54

Page 57: Big-data analytics: challenges and opportunities

Discussion and conclusions

Conclusions

Big-data analytics is in its infancy

It’s challenging to development algorithms and toolsin a distributed environment

To start, we should take both algorithms andsystems into consideration

Hopefully we will get some breakthroughs in thenear future

Chih-Jen Lin (National Taiwan Univ.) 54 / 54