Page 1: Semi-Nonparametric Inferences for Massive Data

Guang Cheng
Department of Statistics, Purdue University

Statistics Seminar at NCSU, October 2015

Acknowledgments: NSF, the Simons Foundation, and ONR. Joint work with Tianqi Zhao and Han Liu.

Page 2: The Era of Big Data

At the 2010 Google Atmosphere Convention, Google's CEO Eric Schmidt pointed out:

“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.”

No wonder that the era of Big Data has arrived...

Page 3: Recent News on Big Data

On August 6, 2014, Nature [2] released the news: “US Big-Data Health Network Launches Aspirin Study.”

In this $10-million pilot study, the use of aspirin to prevent heart disease will be investigated;

Participants will take daily doses of aspirin that fall within the range typically prescribed for heart disease, and be monitored to determine whether one dosage works better than the others;

Health-care data such as insurance claims, blood tests and medical histories will be collected from as many as 30 million people in the United States through PCORnet [3];

[2] http://www.nature.com/news/us-big-data-health-network-launches-aspirin-study-1.15675

[3] A network set up by the Patient-Centered Outcomes Research (PCOR) Institute for collecting health-care data.


Page 7: Recent News on Big Data (cont’d)

PCORnet will connect multiple smaller networks, giving researchers access to records at a large number of institutions without creating a central data repository;

This decentralization creates one of the greatest challenges: how to merge and standardize data from different networks to enable accurate comparison;

The many types of data – scans from medical imaging, vital-signs records and, eventually, genetic information – can be messy, and record-keeping systems vary among health-care institutions.


Page 10: Challenges of Big Data

Motivated by this US health network data, we summarize the features of big data as 4D:

Distributed: computation and storage bottlenecks;

Dirty: the curse of heterogeneity;

Dimensionality: the dimension scales with the sample size;

Dynamic: a non-stationary underlying distribution.

This talk focuses on “Distributed” and “Dirty”.


Page 16: General Goal

In the era of massive data, here are my questions of curiosity:

Can we guarantee a high level of statistical inferential accuracy under a given computation/time constraint?

Equivalently, what is the least computational cost at which the best possible statistical inferences can be obtained?

How can we break the curse of heterogeneity by exploiting commonality information?

How can we perform large-scale heterogeneity testing?


Page 21

The oracle rule for massive data is the key [4].

[4] Simplified technical results are presented to better convey the insights.

Page 22: Part I: Homogeneous Data

Page 23: Outline

1. Divide-and-Conquer Strategy

2. Kernel Ridge Regression

3. Nonparametric Inference

4. Simulations

Page 24: Divide-and-Conquer Approach

Consider a univariate nonparametric regression model:

Y = f(Z) + \epsilon;

Entire dataset (iid data): X_1, X_2, \ldots, X_N, where X = (Y, Z);

Randomly split the dataset into s subsamples (with equal sample size n = N/s): P_1, \ldots, P_s;

Perform nonparametric estimation in each subsample:

P_j = \{X_1^{(j)}, \ldots, X_n^{(j)}\} \Longrightarrow \hat{f}_n^{(j)};

Aggregation, such as \bar{f}_N = (1/s) \sum_{j=1}^{s} \hat{f}_n^{(j)}.
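
In code, this split-fit-average recipe is a one-screen sketch, assuming a generic smoother fit(Z, Y) that returns a callable estimate (dc_estimate and fit are illustrative names, not notation from the talk):

```python
import numpy as np

def dc_estimate(Z, Y, s, fit, rng=np.random.default_rng(0)):
    """Divide-and-conquer: random split into s parts, fit each, average."""
    idx = rng.permutation(len(Z))              # random split of {1, ..., N}
    parts = np.array_split(idx, s)             # s subsamples of size n = N/s
    fits = [fit(Z[I], Y[I]) for I in parts]    # per-subsample estimates f_n^(j)
    return lambda z: np.mean([f(z) for f in fits], axis=0)   # aggregated f_bar_N
```

Any nonparametric smoother can be plugged in; the next slides use kernel ridge regression.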


Page 29: A Few Comments

As far as we are aware, the statistical studies of the D&C method focus on either parametric inferences, e.g., bootstrap (Kleiner et al., 2014, JRSS-B) and Bayesian (Wang and Dunson, 2014, arXiv), or nonparametric minimaxity (Zhang et al., 2014, arXiv);

Semi/nonparametric inferences for massive data still remain untouched (although they are crucially important in evaluating reproducibility in modern scientific studies).


Page 31: Splitotics Theory (s → ∞ as N → ∞)

In theory, we want to derive a theoretical upper bound for s under which the following oracle rule holds: “the nonparametric inferences constructed based on \bar{f}_N are (asymptotically) the same as those based on the oracle estimator \hat{f}_N computed from the entire dataset.”

Meanwhile, we want to know how to choose the smoothing parameter in each subsample;

Allowing s → ∞ significantly complicates the traditional theoretical analysis.


Page 34: Kernel Ridge Regression (KRR)

Define the KRR estimate \hat{f}_n : \mathbb{R} \to \mathbb{R} as

\hat{f}_n = \arg\min_{f \in H} \left\{ \frac{1}{n} \sum_{i=1}^{n} (Y_i - f(Z_i))^2 + \lambda \|f\|_H^2 \right\},

where H is a reproducing kernel Hilbert space (RKHS) with a kernel K(z, z') = \sum_{i=1}^{\infty} \mu_i \phi_i(z) \phi_i(z'). Here, the \mu_i's are eigenvalues and the \phi_i(\cdot)'s are eigenfunctions.

Explicitly, \hat{f}_n(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) with \alpha = (K + \lambda n I)^{-1} y.

The smoothing spline is a special case of KRR estimation.

Early studies of KRR estimation on large datasets focus on either low-rank approximation or early stopping.
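
A minimal sketch of this closed form, transcribing \alpha = (K + \lambda n I)^{-1} y directly; the Gaussian kernel and the name krr_fit are illustrative choices:

```python
import numpy as np

def krr_fit(Z, Y, lam, kernel):
    """KRR on one (sub)sample via alpha = (K + lambda*n*I)^{-1} y."""
    n = len(Z)
    K = kernel(Z[:, None], Z[None, :])                  # n x n Gram matrix
    alpha = np.linalg.solve(K + lam * n * np.eye(n), Y)
    return lambda z: kernel(np.asarray(z)[:, None], Z[None, :]) @ alpha

gauss = lambda a, b: np.exp(-(a - b) ** 2)              # Gaussian kernel, sigma = 1
```

It plugs straight into the dc_estimate sketch above, e.g. dc_estimate(Z, Y, s, lambda Zj, Yj: krr_fit(Zj, Yj, lam, gauss)).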


Page 38: Commonly Used Kernels

The decay rate of \mu_k characterizes the smoothness of f.

Finite rank (\mu_k = 0 for k > r): the polynomial kernel K(x, x') = (1 + xx')^d has rank r = d + 1;

Exponential decay (\mu_k \asymp \exp(-\alpha k^p) for some \alpha, p > 0): the Gaussian kernel K(x, x') = \exp(-\|x - x'\|^2/\sigma^2) corresponds to p = 2;

Polynomial decay (\mu_k \asymp k^{-2m} for some m > 1/2): kernels for Sobolev spaces, e.g., K(x, x') = 1 + \min\{x, x'\} for the first-order Sobolev space; this covers the smoothing spline estimate (Wahba, 1990).
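
For concreteness, the remaining two kernel families as plain numpy functions (the hyperparameter d = 2 is an illustrative value; inputs are scalars or arrays in [0, 1]):

```python
import numpy as np

poly = lambda x, y, d=2: (1.0 + x * y) ** d       # finite rank, r = d + 1
# the Gaussian kernel (exponential decay, p = 2) was sketched above as `gauss`
sobolev = lambda x, y: 1.0 + np.minimum(x, y)     # polynomial decay, m = 1
```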


Page 46: Local Confidence Interval [5]

Theorem 1. Suppose regularity conditions on \epsilon, K(\cdot,\cdot) and \phi_j(\cdot) hold, e.g., a tail condition on \epsilon and \sup_j \|\phi_j\|_\infty \le C_\phi. Given that H is not too large (in terms of its packing entropy), we have for any fixed x_0 \in \mathcal{X},

\sqrt{Nh}\,\big(\bar{f}_N(x_0) - f_0(x_0)\big) \xrightarrow{d} N(0, \sigma_{x_0}^2),   (1)

where h = h(\lambda) = r(\lambda)^{-1} and r(\lambda) \equiv \sum_{i=1}^{\infty} (1 + \lambda/\mu_i)^{-1}.

An important consequence is that the rate \sqrt{Nh} and the variance \sigma_{x_0}^2 are the same as those of the oracle estimator \hat{f}_N (based on the entire dataset). Hence, the oracle property of the local confidence interval holds under the above conditions, which determine s and \lambda.

[5] The simultaneous confidence band result delivers similar theoretical insights.
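
Spelled out, Theorem 1 yields the usual plug-in interval, assuming a consistent variance estimate \hat{\sigma}_{x_0} (not constructed on the slide):

```latex
% 100(1 - alpha)% local CI implied by (1); \hat\sigma_{x_0} is an assumed
% consistent estimate of \sigma_{x_0}.
\[
  \bar f_N(x_0) \;\pm\; z_{1-\alpha/2}\, \frac{\hat\sigma_{x_0}}{\sqrt{Nh}},
  \qquad
  h = r(\lambda)^{-1}, \quad
  r(\lambda) = \sum_{i=1}^{\infty} \frac{1}{1 + \lambda/\mu_i}.
\]
```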

Page 47: Examples

The oracle property of the local confidence interval holds under the following conditions on \lambda and s:

Finite rank (with a rank r): \lambda = o(N^{-1/2}) and \log(\lambda^{-1}) = o(\log^2 N);

Exponential decay (with a power p): \lambda = o((\log N)^{1/(2p)}/\sqrt{N}) and \log(\lambda^{-1}) = o(\log^2 N);

Polynomial decay (with a power m > 1/2): \lambda \asymp N^{-d} for some 2m/(4m+1) < d < 4m^2/(8m-1).

Choose \lambda as if working on the entire dataset with sample size N. Hence, the standard generalized cross-validation method (applied to each subsample) fails in this case.


Page 54

Specifically, we have the following upper bounds for s:

For a finite rank kernel (with any finite rank r): s = O(N^\gamma) for any \gamma < 1/2;

For an exponential decay kernel (with any finite power p): s = O(N^{\gamma'}) for any \gamma' < \gamma < 1/2;

For a polynomial decay kernel (with m = 2): s = o(N^{4/27}) \approx o(N^{0.29}).
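
A toy calculator that simply encodes the displayed polynomial-decay conditions (the values of N and m, and the midpoint choice of d, are illustrative; the theory only requires d to lie in the open interval):

```python
m, N = 2, 2 ** 20                                    # illustrative values
d_lo, d_hi = 2 * m / (4 * m + 1), 4 * m ** 2 / (8 * m - 1)
d = (d_lo + d_hi) / 2                                # any admissible exponent
lam = N ** (-d)                                      # lambda chosen at full-N scale
s_max = N ** (4 / 27)                                # stated upper bound on s (m = 2)
print(f"d in ({d_lo:.3f}, {d_hi:.3f}), lambda ~ {lam:.2e}, s below ~{s_max:.0f}")
```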


Page 57: Penalized Likelihood Ratio Test

Consider the following test:

H_0 : f = f_0  vs.  H_1 : f \ne f_0,

where f_0 \in H;

Let L_{N,\lambda} be the (penalized) likelihood function based on the entire dataset;

Let PLRT_{n,\lambda}^{(j)} be the (penalized) likelihood ratio based on the j-th subsample;

Given the divide-and-conquer strategy, we have two natural choices of test statistic:

\overline{PLRT}_{N,\lambda} = (1/s) \sum_{j=1}^{s} PLRT_{n,\lambda}^{(j)};

\widehat{PLRT}_{N,\lambda} = L_{N,\lambda}(\bar{f}_N) - L_{N,\lambda}(f_0).
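
Schematically, the two statistics differ only in where the likelihood is evaluated; a sketch with the penalized log-likelihood pen_loglik(data, f, lam) left abstract (all names here are placeholders introduced for illustration):

```python
import numpy as np

def plrt_bar(plrt_sub):
    """Averaged statistic: (1/s) * sum_j PLRT_n^(j) over subsample values."""
    return np.mean(plrt_sub)

def plrt_hat(data, f_bar_N, f0, lam, pen_loglik):
    """Full-data penalized likelihood ratio evaluated at the aggregated fit."""
    return pen_loglik(data, f_bar_N, lam) - pen_loglik(data, f0, lam)
```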


Page 63: Penalized Likelihood Ratio Test

Theorem 2. We prove that \overline{PLRT}_{N,\lambda} and \widehat{PLRT}_{N,\lambda} are both consistent under some upper bound on s, but the latter is minimax optimal (Ingster, 1993) when choosing some s strictly smaller than the above upper bound required for consistency.

An additional big-data insight: we have to sacrifice a certain amount of computational efficiency (i.e., avoid choosing the largest possible s) to obtain optimality.


Page 65: Summary

Big data insights:

The oracle rule holds when s does not grow too fast;

Choose the smoothing parameter as if not splitting the data;

Sacrifice computational efficiency to obtain optimality.

Key technical tool: the Functional Bahadur Representation in Shang and C. (2013, AoS).


Page 70: Phase Transition of Coverage Probability

[Figure: (a) the true function f(x) on [0, 1]; (b) coverage probabilities (CPs) at x_0 = 0.5 against log(s)/log(N) for N = 256, 512, 1024, 2048; (c) CPs over [0, 1] for N = 512 with s = 1, 4, 16, 64; (d) CPs over [0, 1] for N = 1024 with s = 1, 4, 16, 64.]

Page 71: Phase Transition of Mean Squared Error

[Figure: Mean squared errors of \bar{f}_N (at z_0 = 0.95) under different choices of N and s, plotted against log(s)/log(N) for N = 256, 528, 1024, 2048, 4096.]

Page 72: Part II: Heterogeneous Data

Page 73: Outline

1. A Partially Linear Modelling

2. Efficiency Boosting

3. Heterogeneity Testing

Page 74: Revisit US Health Data

Let us revisit the news on the US Big-Data Health Network.

Different networks such as US hospitals conduct the same clinical trial on the relation between a response variable Y, i.e., heart disease, and a set of predictors Z, X_1, X_2, \ldots, X_p including the dosage of aspirin;

Medical knowledge suggests that the relation between Y and Z (e.g., blood pressure) should be homogeneous for all humans;

However, for the other covariates X_1, X_2, \ldots, X_p (e.g., certain genes), we allow their (linear) relations with Y to potentially vary across networks (located in different areas). For example, the genetic functionality of different races might be heterogeneous;

The linear relation is assumed here for simplicity, and is particularly suitable when the covariates are discrete, such as the dosage of aspirin, e.g., 1 or 2 tablets each day.


Page 79: A Partially Linear Modelling

Assume that there exist s heterogeneous subpopulations:P1, . . . , Ps (with equal sample size n = N/s);

In the j-th subpopulation, we assume

Y = XTβ(j)0 + f0(Z) + ε, (1)

where ε has a sub-Gaussian tail and V ar(ε) = σ2;

We call β(j) as the heterogeneity and f as the commonalityof the massive data in consideration;

(1) is a typical semi-nonparametric model (see C. andShang, 2015, AoS) since β(j) and f are both of interest.
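
To make model (1) concrete, here is a minimal simulation sketch in Python; the particular choices $f_0(z) = \sin(2\pi z)$, $\sigma = 0.5$, and the sizes s, n, p are illustrative assumptions of ours, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
s, n, p = 8, 200, 3                     # s subpopulations, n samples in each

def f0(z):                              # hypothetical common function f_0
    return np.sin(2 * np.pi * z)

betas0 = rng.normal(size=(s, p))        # heterogeneous coefficients beta_0^(j)

def draw(j):
    Z = rng.uniform(0, 1, n)
    X = rng.normal(size=(n, p))
    eps = 0.5 * rng.normal(size=n)      # sub-Gaussian noise, sigma = 0.5
    return X, Z, X @ betas0[j] + f0(Z) + eps    # model (1)

data = [draw(j) for j in range(s)]
```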


Page 83: Semi-Nonparametric Inferences for Massive Data

Estimation Procedure

Individual estimation in the $j$-th subpopulation:
$$(\hat\beta_n^{(j)}, \hat f_n^{(j)}) = \operatorname*{argmin}_{(\beta, f) \in \mathbb{R}^p \times \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \big(Y_i^{(j)} - \beta^T X_i^{(j)} - f(Z_i^{(j)})\big)^2 + \lambda \|f\|_{\mathcal{H}}^2;$$

Aggregation: $\bar f_N = (1/s)\sum_{j=1}^s \hat f_n^{(j)}$;

A plug-in estimate for the $j$-th heterogeneity parameter:
$$\check\beta_n^{(j)} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \big(Y_i^{(j)} - \beta^T X_i^{(j)} - \bar f_N(Z_i^{(j)})\big)^2;$$

Our final estimate is $(\check\beta_n^{(j)}, \bar f_N)$.
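
The three steps can be sketched as follows with a Gaussian RKHS kernel, reusing the simulated `data` from the previous sketch; the function names (`fit_joint`, `f_bar`, `beta_check`) and the tuning constants `gamma`, `lam` are ours. Profiling is exact here: for fixed $\beta$ the optimal $f$ has representer coefficients $\alpha = (K + n\lambda I)^{-1}(Y - X\beta)$, and substituting back reduces the criterion to $\lambda (Y - X\beta)^T (K + n\lambda I)^{-1} (Y - X\beta)$, a generalized least-squares problem in $\beta$.

```python
import numpy as np

gamma, lam = 10.0, 1e-2                 # hand-picked tuning constants

def K_rbf(a, b):                        # Gaussian RKHS kernel matrix
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def fit_joint(X, Z, Y, lam=lam):
    """Step 1: joint minimizer over (beta, f).  For fixed beta, f has
    representer coefficients alpha = (K + m*lam*I)^{-1}(Y - X beta);
    substituting back leaves a generalized least-squares solve for beta."""
    m = len(Y)
    M = np.linalg.inv(K_rbf(Z, Z) + m * lam * np.eye(m))
    beta = np.linalg.solve(X.T @ M @ X, X.T @ M @ Y)
    return beta, M @ (Y - X @ beta)     # (beta_hat_j, alpha_hat_j)

fits = [fit_joint(X, Z, Y) for X, Z, Y in data]

def f_bar(z):                           # Step 2: average the s individual fits
    z = np.atleast_1d(z)
    return np.mean([K_rbf(z, Zj) @ a
                    for (_, Zj, _), (_, a) in zip(data, fits)], axis=0)

def beta_check(j):                      # Step 3: plug-in least squares
    X, Z, Y = data[j]
    return np.linalg.lstsq(X, Y - f_bar(Z), rcond=None)[0]
```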


Page 87: Semi-Nonparametric Inferences for Massive Data

Relation to Homogeneous Data

The major concern with homogeneous data is the extremely high computational cost. Fortunately, this can be dealt with by the divide-and-conquer approach;

However, when analyzing heterogeneous data, our major interest1 is how to efficiently extract common features across many subpopulations while exploring the heterogeneity of each subpopulation as $s \to \infty$;

Therefore, comparisons between $(\check\beta_n^{(j)}, \bar f_N)$ and the oracle estimate (in terms of both risk and limiting distribution) are needed.

1D&C can be applied to any subpopulation with a large sample size.


Page 90: Semi-Nonparametric Inferences for Massive Data

Oracle Estimate

We define the oracle estimate for $f$ as if the heterogeneity information $\beta_0^{(j)}$ were known:
$$\hat f_{or} = \operatorname*{argmin}_{f \in \mathcal{H}} \frac{1}{N}\sum_{j=1}^s \sum_{i=1}^n \big(Y_i^{(j)} - (\beta_0^{(j)})^T X_i^{(j)} - f(Z_i^{(j)})\big)^2 + \lambda \|f\|_{\mathcal{H}}^2.$$

The oracle estimate for $\beta^{(j)}$ can be defined similarly, as if $f_0$ were known:
$$\hat\beta_{or}^{(j)} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \big(Y_i^{(j)} - \beta^T X_i^{(j)} - f_0(Z_i^{(j)})\big)^2.$$
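
In a simulation, both oracle fits can be computed directly since the truth is known there. A short sketch, reusing `data`, `betas0`, `f0`, `K_rbf`, `lam`, `s`, and `n` from the sketches above:

```python
import numpy as np

# Simulation-only oracle benchmarks: they plug in the true beta_0^(j) and f_0.
N = s * n
Z_all = np.concatenate([Z for _, Z, _ in data])
R_all = np.concatenate([Y - X @ b for (X, _, Y), b in zip(data, betas0)])
alpha_or = np.linalg.solve(K_rbf(Z_all, Z_all) + N * lam * np.eye(N), R_all)

def f_or(z):                            # oracle f: one KRR fit on all N points
    return K_rbf(np.atleast_1d(z), Z_all) @ alpha_or

# Oracle beta^(j): ordinary least squares after subtracting the true f_0.
beta_or = [np.linalg.lstsq(X, Y - f0(Z), rcond=None)[0] for X, Z, Y in data]
```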

Page 91: Semi-Nonparametric Inferences for Massive Data

A Preliminary Result: Joint Asymptotics

Theorem 3. Given proper $s \to \infty$2 and $\lambda \to 0$, we have3
$$\begin{pmatrix} \sqrt{n}\,\big(\hat\beta_n^{(j)} - \beta_0^{(j)}\big) \\ \sqrt{Nh}\,\big(\bar f_N(z_0) - f_0(z_0)\big) \end{pmatrix} \rightsquigarrow N\left(0,\; \sigma^2 \begin{pmatrix} \Omega^{-1} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}\right),$$
where $\Omega = E\big[(X - E(X|Z))^{\otimes 2}\big]$.

2The asymptotic independence between $\hat\beta_n^{(j)}$ and $\bar f_N(z_0)$ is mainly due to the fact that $n/N = s^{-1} \to 0$.
3The asymptotic variance $\Sigma_{22}$ of $\bar f_N$ is the same as that of $\hat f_{or}$.
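
The marginal CLT yields plug-in confidence intervals for the linear coefficients. A sanity-check sketch, reusing `data`, `fits`, `K_rbf`, `n`, and `p` from above; note it exploits the simulation's independence of X and Z (so $E(X|Z) = E(X)$ and $\Omega$ reduces to $\mathrm{Cov}(X)$), a shortcut that in general requires a nonparametric regression of X on Z:

```python
import numpy as np

# Plug-in 95% CI for beta_hat^(j); Omega is estimated by Cov(X) only because
# X and Z were simulated independently here.
j = 0
X, Z, Y = data[j]
beta_hat, alpha_hat = fits[j]
resid = Y - X @ beta_hat - K_rbf(Z, Z) @ alpha_hat
sigma2 = resid @ resid / (n - p)               # estimate of sigma^2
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(np.cov(X, rowvar=False))) / n)
for k in range(p):
    print(f"beta_hat[{k}] = {beta_hat[k]: .3f} +/- {1.96 * se[k]:.3f}")
```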

Page 92: Semi-Nonparametric Inferences for Massive Data

Efficiency Boosting

Theorem 4 implies that $\hat\beta_n^{(j)}$ is semiparametric efficient:
$$\sqrt{n}\,\big(\hat\beta_n^{(j)} - \beta_0^{(j)}\big) \rightsquigarrow N\Big(0,\; \sigma^2 \big(E(X - E(X|Z))^{\otimes 2}\big)^{-1}\Big).$$

We next illustrate an important feature of massive data: strength-borrowing. That is, the aggregation of commonality in turn boosts the estimation efficiency of $\check\beta_n^{(j)}$ from the semiparametric level to the parametric level.

By imposing a lower bound on $s$ (so that strength is borrowed from a sufficient number of subpopulations), we show that4
$$\sqrt{n}\,\big(\check\beta_n^{(j)} - \beta_0^{(j)}\big) \rightsquigarrow N\big(0,\; \sigma^2 (E[XX^T])^{-1}\big),$$
as if the commonality information were available.

4Recall that $\check\beta_n^{(j)} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \big(Y_i^{(j)} - \beta^T X_i^{(j)} - \bar f_N(Z_i^{(j)})\big)^2$.
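
A rough Monte Carlo illustration of strength-borrowing, reusing `K_rbf` and `fit_joint` from the estimation sketch. We now make X depend on Z, so that $\Omega = \mathrm{Var}(X - E(X|Z))$ shrinks and the semiparametric variance $\sigma^2\Omega^{-1}$ of $\hat\beta_n^{(j)}$ strictly exceeds the parametric variance $\sigma^2(E[XX^T])^{-1}$ attainable by $\check\beta_n^{(j)}$; all design constants below are our own:

```python
import numpy as np

# Compare the spread of beta_hat (individual fit) and beta_check (plug-in
# with the aggregated f) when X is correlated with Z.
rng = np.random.default_rng(2)
s2, n2, reps = 32, 100, 200
err_hat, err_chk = [], []
for _ in range(reps):
    sets = []
    for _ in range(s2):
        Z = rng.uniform(0, 1, n2)
        X = (Z - 0.5 + 0.3 * rng.normal(size=n2))[:, None]  # X depends on Z
        sets.append((X, Z, X[:, 0] + np.sin(2 * np.pi * Z)
                     + 0.5 * rng.normal(size=n2)))          # true beta = 1
    fj = [fit_joint(X, Z, Y) for X, Z, Y in sets]
    X, Z, Y = sets[0]
    fbar_Z = np.mean([K_rbf(Z, Zj) @ a
                      for (_, Zj, _), (_, a) in zip(sets, fj)], axis=0)
    err_hat.append(fj[0][0][0] - 1.0)
    err_chk.append(np.linalg.lstsq(X, Y - fbar_Z, rcond=None)[0][0] - 1.0)
print("sd of sqrt(n)*err, beta_hat:  ", np.sqrt(n2) * np.std(err_hat))
print("sd of sqrt(n)*err, beta_check:", np.sqrt(n2) * np.std(err_chk))
```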


Page 95: Semi-Nonparametric Inferences for Massive Data

[Figure omitted: line plot of coverage probability (CP) versus $\log(s)/\log(N)$, one curve per $N \in \{256, 528, 1024, 2048, 4096\}$.]

Figure: Coverage probability of the 95% confidence interval based on $\check\beta_n^{(j)}$.
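
A stripped-down version of this kind of coverage experiment can be run as follows, reusing `K_rbf` and `fit_joint` from above; the grid of s values, the number of replicates, and the use of the known $\sigma = 0.5$ in the standard error are our simplifications, not the talk's settings:

```python
import numpy as np

# Fix the total sample size, vary s, and record how often the 95% CI for the
# first coordinate of beta_check covers the true coefficient (here 1.0).
rng = np.random.default_rng(4)
N_tot, reps = 1024, 200
for s_try in (2, 8, 32, 128):
    n_try, cover = N_tot // s_try, 0
    for _ in range(reps):
        sets = []
        for _ in range(s_try):
            Z = rng.uniform(0, 1, n_try)
            X = rng.normal(size=(n_try, 1))
            sets.append((X, Z, X[:, 0] + np.sin(2 * np.pi * Z)
                         + 0.5 * rng.normal(size=n_try)))
        fj = [fit_joint(X, Z, Y) for X, Z, Y in sets]
        X, Z, Y = sets[0]
        fbar_Z = np.mean([K_rbf(Z, Zj) @ a
                          for (_, Zj, _), (_, a) in zip(sets, fj)], axis=0)
        b = np.linalg.lstsq(X, Y - fbar_Z, rcond=None)[0][0]
        se = 0.5 / np.sqrt((X[:, 0] ** 2).sum())    # sigma treated as known
        cover += abs(b - 1.0) <= 1.96 * se
    print(f"s = {s_try:4d}: empirical coverage ~ {cover / reps:.2f}")
```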

Page 96: Semi-Nonparametric Inferences for Massive Data

Large Scale Heterogeneity Testing

Consider the high dimensional simultaneous test
$$H_0: \beta^{(j)} = \bar\beta \text{ for all } j \in J, \qquad (2)$$
where $\bar\beta$ denotes a common value shared across subpopulations, $J \subseteq \{1, 2, \ldots, s\}$, and $|J| \to \infty$, versus
$$H_1: \beta^{(j)} \neq \bar\beta \text{ for some } j \in J; \qquad (3)$$

Test statistic:
$$T_0 = \sup_{j \in J} \sup_{k \in [p]} \sqrt{n}\,\big|\check\beta_k^{(j)} - \bar\beta_k\big|;$$

We can consistently approximate the quantiles of the null distribution via the bootstrap, even when $|J|$ diverges at an exponential rate in $n$, by a nontrivial application of recent Gaussian approximation theory.
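
A sketch of the sup-statistic with a Gaussian multiplier bootstrap, reusing `data`, `n`, `s`, `f_bar`, and `beta_check` from the estimation sketch. The influence-function linearization $\sqrt{n}(\check\beta - \beta) \approx n^{-1/2} (X^TX/n)^{-1} \sum_i X_i \varepsilon_i$ is our reading of the general recipe, and the $O(s^{-1/2})$ effect of centering at the pooled estimate is ignored, so treat it as illustrative only:

```python
import numpy as np

# Multiplier bootstrap for T_0: perturb each subpopulation's estimated
# influence sums with i.i.d. N(0,1) multipliers and take the sup.
rng = np.random.default_rng(3)
B = 500
betas_chk = np.array([beta_check(j) for j in range(s)])
beta_bar = betas_chk.mean(axis=0)              # pooled (common-value) estimate
T0 = np.sqrt(n) * np.abs(betas_chk - beta_bar).max()

pre = []
for j, (X, Z, Y) in enumerate(data):
    eps_hat = Y - X @ betas_chk[j] - f_bar(Z)  # residuals under the fit
    pre.append((X, eps_hat, np.linalg.inv(X.T @ X / n)))

boot = np.empty(B)
for b in range(B):
    boot[b] = max(np.abs(G @ (X.T @ (rng.normal(size=n) * e)) / np.sqrt(n)).max()
                  for X, e, G in pre)

print(f"T0 = {T0:.3f}, bootstrap 95% critical value = {np.quantile(boot, 0.95):.3f}")
```

Since the `data` simulated earlier are genuinely heterogeneous, T0 should far exceed the bootstrap critical value here, i.e., the test rejects H0 as it should.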


Page 99: Semi-Nonparametric Inferences for Massive Data

Thank You!