poggi analytics - ensamble - 1b

Buenos Aires, March 2016

Eduardo Poggi

Topics

Ensembles, Bagging, Boosting, Random Forest


Ensembles

Ensemble: a set of models used together as a single “meta-model”.

The underlying idea is a familiar one: draw on knowledge from different sources when making decisions.

Ensembles

Committee of experts: many members, all highly knowledgeable about the same topic, and they vote.

Cabinet of advisers: experts in different areas, each highly knowledgeable; a head decides who knows about the topic at hand.

Flat ensembles:

-Fusion

-Bagging

-Boosting

-Random Forest

Divisive ensembles:

-Mixture of experts

-Stacking

Crowd decisions?

Ensembles

Two basic components:

-A method for selecting or building the members. Same or different domain? Different datasets x different models x different configurations.

-A method for combining the decisions. Simple voting, weighted voting, averaging, a specific function, selectivity …
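The first three combination rules named above can be sketched in a few lines. This is only an illustration: the function names (`simple_vote`, `weighted_vote`, `average`) and the example weights are assumptions, not part of the original slides.

```python
from collections import Counter, defaultdict

def simple_vote(predictions):
    """Return the label predicted by the most members."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight behind it."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def average(scores):
    """Combine numeric outputs (for example, class probabilities) by their mean."""
    return sum(scores) / len(scores)

# Three members disagree; the weights are hypothetical trust levels.
members = ["a", "b", "a"]
print(simple_vote(members))                       # majority label
print(weighted_vote(members, [0.2, 0.7, 0.1]))    # the heavily weighted member wins
print(average([0.2, 0.4, 0.9]))
```

Note how the simple and the weighted vote can disagree on the same predictions, which is why the choice of combination method is a component of the ensemble in its own right.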

Ensembles

Flat: many experts, all of them good.

-Each one needs to be as good as possible individually. Otherwise they are usually of no use.

-But they need to disagree in some cases. If they all always agree, I might as well keep just one!

Ensembles

Divisive:

-Split the problem into a series of subproblems with minimal overlap.

-A “divide and conquer” strategy.

-Useful for tackling large problems.

-A function is needed to decide which classifier should act.

Ensembles

If a good “learner” produces a good classifier, might many “learners” produce something better?

Why not learn { h1, h2, h3 } and then take h*(x) = majority { h1(x), h2(x), h3(x) }?

If the hi have independent errors, h* is more accurate: if Error(hi) = ε, then Error(h*) ≈ 3ε^2 (0.01 → 0.0003).
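As a quick check of that figure, here is a minimal sketch assuming the classifiers err independently and all with the same probability ε. Under that assumption the majority-vote error is a binomial tail, which for three voters is 3ε^2(1 − ε) + ε^3. The function name `majority_error` is introduced here only for illustration.

```python
from math import comb

def majority_error(eps, n=3):
    """Probability that a majority of n independent classifiers is wrong (n odd)."""
    k_min = n // 2 + 1  # smallest number of wrong votes that loses the majority
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(k_min, n + 1))

for n in (1, 3, 5, 11):
    print(n, majority_error(0.01, n))
```

With ε = 0.01 and three voters the exact value is 0.000298, which rounds to the 0.0003 quoted above. The benefit depends entirely on the independence assumption, which real classifiers trained on the same data only approximate.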


Ensembles

1. Subsample the training sample

-Bagging

-Boosting

2. Manipulate input features

3. Manipulate output targets

-ECOC

4. Inject randomness

-Data

-Algorithm

5. Algorithm-specific methods

Other combinations

Why do ensembles work?

Ensembles

Subsampling

Ensembles

Manipulate Input Features

Ensembles

Manipulate Output Targets

Ensembles

A learner is said to be unstable if the classifier it produces changes substantially under small variations in the training data.

Unstable: decision trees, neural networks, …

Stable: linear regression, nearest neighbor, ...

Subsampling works best with unstable learners.

Ensembles

Voting algorithms:

-Take an inducer and a training set.

-Run the inducer multiple times, changing the distribution of the training set instances each time.

-Combine the generated classifiers, … and then use the combination to classify the set.

Ensembles

Voting algorithms can be divided into two types:

-those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods), and

-those that do not (as in Bagging).


Bagging Algorithm

Bootstrap aggregating (Breiman 96)

-Votes classifiers generated by different bootstrap samples (replicates).

-A bootstrap sample is generated by uniformly sampling m instances from the training set with replacement.

-T bootstrap samples B1, B2, … , BT are generated and a classifier Ci is built from each bootstrap sample Bi.

-A final classifier C* is built from C1, C2, … , CT whose output is the class predicted most often by its sub-classifiers, with ties broken arbitrarily.
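The procedure above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the base inducer `train_stump` (a single-threshold rule on one numeric feature), the toy dataset and the choice T=25 are all assumptions made for the example.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical base inducer: best single-threshold rule on 1-D data."""
    xs = sorted({x for x, _ in sample})
    # Candidate thresholds: below every point, and midway between neighbours.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        left = Counter(y for x, y in sample if x <= t)
        right = Counter(y for x, y in sample if x > t)
        right_label = right.most_common(1)[0][0]
        left_label = left.most_common(1)[0][0] if left else right_label
        errors = sum(1 for x, y in sample
                     if (left_label if x <= t else right_label) != y)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def bagging(train, inducer, T, rng):
    """Build T classifiers from bootstrap samples and return the voting classifier C*."""
    m = len(train)
    classifiers = []
    for _ in range(T):
        bootstrap = [rng.choice(train) for _ in range(m)]  # m draws with replacement
        classifiers.append(inducer(bootstrap))

    def c_star(x):
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]  # ties are broken arbitrarily
    return c_star

# Toy data: label is 1 when x >= 10.
train = [(float(x), int(x >= 10)) for x in range(20)]
c_star = bagging(train, train_stump, T=25, rng=random.Random(0))
print([c_star(x) for x in (0.0, 9.0, 10.0, 19.0)])
```

Because each of the T classifiers depends only on its own bootstrap sample, the loop body could run in parallel, which is the contrast with boosting drawn later in the deck.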


Bagging Algorithm

An instance in the training set has probability 1−(1−1/m)^m of being selected at least once when m instances are drawn at random with replacement.

For large m, this is about 1 − 1/e = 63.2%, which means that each bootstrap sample contains only about 63.2% of the unique instances in the training set.
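The 63.2% figure can be checked both from the formula and by drawing one bootstrap sample. The sketch below uses only the standard library; the helper names `p_selected` and `unique_fraction` are introduced for this illustration.

```python
import math
import random

def p_selected(m):
    """Probability that a given instance appears at least once in m draws."""
    return 1 - (1 - 1 / m) ** m

def unique_fraction(m, rng):
    """Fraction of distinct instances in one simulated bootstrap sample of size m."""
    drawn = {rng.randrange(m) for _ in range(m)}
    return len(drawn) / m

for m in (10, 100, 1000, 10000):
    print(m, round(p_selected(m), 4))
print("limit 1 - 1/e:", round(1 - math.exp(-1), 4))
print("simulated:", round(unique_fraction(100000, random.Random(0)), 3))
```

The instances left out of a given bootstrap sample (roughly 36.8%) are what out-of-bag estimates are computed on.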

If the inducer is unstable (ANN, DT), bagging can improve performance.

If the inducer is stable (k-nearest neighbor), bagging may slightly degrade performance.


Adaboost Algorithm

Boosting (Schapire 90), AdaBoost.M1 (Freund & Schapire 96)

Generates the classifiers sequentially, while Bagging can generate them in parallel.

AdaBoost also changes the weights of the training instances provided as input to each inducer, based on the classifiers that were previously built.

The goal is to force the inducer to minimize expected error over different input distributions.

C* = weighted voting. The weight of each classifier depends on its performance on the training set used to build it.

Adaboost Algorithm

The incorrectly classified instances are weighted by a factor inversely proportional to the error on the training set, i.e., 1/(2Ei). Small training set errors, such as 0.1%, will cause weights to grow by several orders of magnitude.

The AdaBoost algorithm requires a weak learning algorithm whose error is bounded by a constant strictly less than 1/2. In practice, the inducers we use provide no such guarantee.

The original algorithm aborted when the error bound was breached.

Resampling + reweighting.

Success (???): distribution of the “margins”.
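A compact sketch of the reweighting loop may make these points concrete. It is written under several assumptions that are not stated in the slides: binary labels in {-1, +1}, a one-dimensional threshold rule (`train_stump`) as the weak learner, correctly classified instances rescaled by 1/(2(1 − Ei)) so the weights keep summing to 1, and each classifier voting with weight log((1 − Ei)/Ei). The loop aborts when the error reaches 1/2, as the original algorithm did.

```python
import math

def train_stump(X, y, w):
    """Illustrative weak learner: threshold rule minimising the weighted error."""
    xs = sorted(set(X))
    thresholds = [xs[0] - 0.5] + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = None
    for t in thresholds:
        for sign in (+1, -1):
            # Predict `sign` to the right of t and `-sign` to the left.
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (sign if xi > t else -sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    err, t, sign = best
    return (lambda x: sign if x > t else -sign), err

def adaboost_m1(X, y, T):
    m = len(X)
    w = [1.0 / m] * m                  # start from the uniform distribution
    ensemble = []                      # pairs (classifier, vote weight)
    for _ in range(T):
        h, eps = train_stump(X, y, w)
        if eps >= 0.5:                 # weak-learning bound breached: abort
            break
        if eps == 0:                   # assumption: a perfect classifier is used alone
            ensemble = [(h, 1.0)]
            break
        ensemble.append((h, math.log((1 - eps) / eps)))
        # Misclassified instances grow by 1/(2*eps); the rest shrink by 1/(2*(1-eps)).
        w = [wi / (2 * eps) if h(xi) != yi else wi / (2 * (1 - eps))
             for xi, yi, wi in zip(X, y, w)]

    def classify(x):
        score = sum(alpha * h(x) for h, alpha in ensemble)
        return 1 if score > 0 else -1
    return classify

# Toy problem no single threshold can solve: positives sit in the middle.
X = list(range(10))
y = [-1] * 3 + [1] * 4 + [-1] * 3
one_round = adaboost_m1(X, y, 1)
three_rounds = adaboost_m1(X, y, 3)
print(sum(one_round(x) != t for x, t in zip(X, y)), "errors after 1 round")
print(sum(three_rounds(x) != t for x, t in zip(X, y)), "errors after 3 rounds")
```

On this toy data a single threshold rule misclassifies three points, while the weighted vote of three such rules classifies every training point correctly.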


Adaboost : How Will Test Error Behave? (Guess!)

Expect…

-training error to continue to drop (or reach 0)

-test error to increase when h* becomes “too complex”

-“Occam’s razor”

-overfitting

Adaboost : How Will Test Error Behave? (Real!)

But…

-test error does not increase, even after 1000 rounds

-test error continues to drop, even after training error is 0!

Occam’s razor: “simpler rule is better”... appears to not apply!

Adaboost : Margins

key idea:

-training error only measures whether classifications are right or wrong

-should also consider confidence of classifications

measure confidence by margin = strength of the vote = (weighted fraction voting correctly) − (weighted fraction voting incorrectly)
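The margin of a single example can be computed directly from that definition. In this sketch the function name `margin` and the example vote weights are hypothetical; the weights are normalised so that the result lies in [-1, 1].

```python
def margin(votes, alphas, true_label):
    """(weighted fraction voting correctly) - (weighted fraction voting incorrectly)."""
    total = sum(alphas)
    correct = sum(a for v, a in zip(votes, alphas) if v == true_label)
    incorrect = total - correct
    return (correct - incorrect) / total

# Three weak classifiers with hypothetical vote weights.
alphas = [0.5, 0.3, 0.2]
print(margin([+1, +1, -1], alphas, +1))   # correct and fairly confident
print(margin([+1, -1, -1], alphas, +1))   # correct by a zero margin? no: tied weights
print(margin([-1, -1, -1], alphas, +1))   # unanimously wrong
```

A positive margin means the example is classified correctly; the closer it is to 1, the more confident the vote. Continued boosting can keep increasing the margins of training examples even after the training error has reached 0.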


Adaboost : Application detecting Faces [Viola & Jones]

-problem: find faces in photograph or movie

-weak classifiers: detect light/dark rectangles in image

-many clever tricks to make extremely fast and accurate

Adaboost : practical advantages

-fast

-simple and easy to program

-no parameters to tune (except T, sometimes)

-flexible: can combine with any learning algorithm

-no prior knowledge needed about weak learner

-provably effective, given a weak classifier; shift in mind set: the goal now is merely to find classifiers barely better than random guessing

-versatile: can use with data that is textual, numeric, discrete, etc.; has been extended to learning problems well beyond binary classification

Adaboost : warnings

Performance of AdaBoost depends on data and weak learner.

Consistent with theory, AdaBoost can fail if...

-weak classifiers too complex → overfitting

-weak classifiers too weak (γt → 0 too quickly) → underfitting → low margins → overfitting

Empirically, AdaBoost seems especially susceptible to uniform noise.

Adaboost : Conclusions

Boosting is a practical tool for classification and other learning problems:

-grounded in rich theory

-performs well experimentally

-often (but not always!) resistant to overfitting

-many applications and extensions

Recognizing Handwritten Digits

“Obvious” approach: learn F: Scribble → {0,1,2,...,9}

...doesn’t work very well (too hard!)

Or... “decompose” the learning task into 6 “subproblems”:

-Learn 6 classifiers, one for each “sub-problem”.

-To classify a new scribble: run each classifier, then predict the class whose code-word is closest (Hamming distance) to the predicted code.
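The decoding step can be sketched as below. The 6-bit code-words in `CODEBOOK` are made up for this illustration (chosen so that any two differ in at least two bits); they are not the code used in the slides, and the six bit-level classifiers themselves are assumed to exist already.

```python
# Hypothetical 6-bit code-words, one per digit (illustrative only).
CODEBOOK = {
    0: (0, 0, 0, 0, 0, 0),
    1: (1, 1, 0, 0, 0, 0),
    2: (0, 0, 1, 1, 0, 0),
    3: (0, 0, 0, 0, 1, 1),
    4: (1, 0, 1, 0, 1, 0),
    5: (0, 1, 0, 1, 0, 1),
    6: (1, 1, 1, 1, 0, 0),
    7: (0, 0, 1, 1, 1, 1),
    8: (1, 1, 0, 0, 1, 1),
    9: (0, 1, 1, 1, 1, 0),
}

def hamming(a, b):
    """Number of positions in which two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(predicted_bits):
    """Return the digit whose code-word is closest to the six predicted bits."""
    return min(CODEBOOK, key=lambda digit: hamming(CODEBOOK[digit], predicted_bits))

print(decode((0, 0, 1, 1, 0, 0)))   # exact match with the code-word for 2
print(decode((1, 0, 1, 0, 1, 1)))   # one classifier erred; nearest code-word is 4's
```

The second call shows the error-correcting effect: a single wrong sub-classifier still leads to the intended digit when no other code-word is equally close.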



Random Forest: Bagging + trees

Using bootstraps generates diversity, but the trees remain highly correlated.

The same variables tend to take the first splits every time.

Example:

Two trees generated with rpart from bootstraps of the Pima.tr dataset. The same variable is at the root.

Random Forest

Add a bit of randomness to tree growth:

-At each node, select a small group of variables at random and evaluate only those variables.

-It adds no bias: in the long run every variable comes into play.

-It adds variance, but that is easily fixed by averaging models.

-It is effective at decorrelating the trees.


Random Forest

Grows each tree until everything is separated. There is no pruning and no stopping criterion.

The value of m (mtry in R) is important. The default is sqrt(p) (for classification), which is usually good.

With m=p, bagging is recovered.

The number of trees is not important, as long as there are many: 500, 1000, 2000.
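The per-node step that distinguishes a random forest from bagged trees, evaluating only m randomly chosen variables, might look like the sketch below. The Gini impurity criterion, the function names `gini` and `best_split`, and the toy data are assumptions for illustration; this is not the code of R's randomForest package.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y, mtry, rng):
    """Best (score, feature, threshold) among mtry randomly chosen features."""
    p = len(X[0])
    candidates = rng.sample(range(p), mtry)   # the random-forest step
    best = None
    for j in candidates:
        values = sorted({row[j] for row in X})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best, candidates

# Toy data with p = 4 variables; only variable 0 separates the classes cleanly.
X = [[0, 5, 1, 7], [1, 3, 0, 2], [2, 8, 1, 4],
     [3, 1, 0, 9], [4, 6, 1, 1], [5, 2, 0, 6]]
y = [0, 0, 0, 1, 1, 1]
p = len(X[0])

full, _ = best_split(X, y, mtry=p, rng=random.Random(0))   # m = p: plain bagging
print("m = p:", full)
sub, tried = best_split(X, y, mtry=max(1, int(math.sqrt(p))), rng=random.Random(1))
print("m = sqrt(p):", sub, "tried variables", tried)
```

With m = p every node always picks the dominant variable, which is why bagged trees look alike; with a smaller m that variable is sometimes unavailable, so other variables get to define splits and the trees become less correlated.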

Random Forest

Typical order

RF: Example

Random Forest

Summary:

-An improvement on bagging, for trees only.

-Better predictions than Bagging.

-Widely used. Almost automatic.

-Results comparable to the best current methods.

-Useful by-products, especially the OOB estimate and variable importance.


Bagging or Boosting: The bias-variance dilemma

Unbiased predictors have high variance (and vice versa).

There are two ways to resolve the dilemma:

-Reduce the variance of unbiased predictors: build many predictors and average them (Bagging and Random Forest).

-Reduce the bias of stable predictors: build a sequence such that the combination has less bias (Boosting).
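The variance-reduction side of the dilemma can be illustrated with a small simulation. It assumes idealised predictors: each is unbiased, has unit-variance Gaussian noise, and errs independently of the others, which bagged models only approximate. The names `noisy_predictor` and `averaged` and the constants are invented for the example.

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 3.0

def noisy_predictor():
    """One unbiased, high-variance prediction of TRUE_VALUE."""
    return TRUE_VALUE + rng.gauss(0.0, 1.0)

def averaged(n):
    """Average of n independent noisy predictions."""
    return sum(noisy_predictor() for _ in range(n)) / n

single = [noisy_predictor() for _ in range(2000)]
bagged = [averaged(25) for _ in range(2000)]

var_single = statistics.variance(single)
var_bagged = statistics.variance(bagged)
print("variance of one predictor:", round(var_single, 3))
print("variance of the average of 25:", round(var_bagged, 3))
print("mean of the averaged predictor:", round(statistics.mean(bagged), 3))
```

Averaging leaves the mean (and so the bias) essentially unchanged while the variance falls roughly as 1/n. When the predictors are correlated the reduction is smaller, which is the motivation for decorrelating trees in Random Forest.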

Bias and Variance

Which functions should we use?

-Rigid functions: good estimation of the optimal parameters, little flexibility.

-Flexible functions: good fit, poor estimation of the optimal parameters.

Bias error

Variance error

So what now?

Ensemble tools have been shown to improve on the performance of the atomic techniques they are built from.

There are theorems showing that AdaBoost does better provided the boosted model has certain weakness characteristics (limited, not complex).

So what now?

Corollary: the voters do not need to be intelligent, well educated, experts, etc.; it is enough that they are diverse and true to their limited abilities.

“A committee of fools works better than an expert …”

What would a parliament of boosted legislators look like?


eduardo-poggi

http://ar.linkedin.com/in/eduardoapoggi

https://www.facebook.com/eduardo.poggi

@eduardoapoggi


Bibliography

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm


