8/3/2019 Adabag 2.0 presentacion

Adabag 2.0: an R library for Adaboost.M1 and Bagging
Esteban Alfaro Cortés, Matías Gámez Martínez, Noelia García Rubio
Jornadas de usuaRios de R, 17–18 November 2011
Escuela de Organización Industrial / Univ. de Castilla-La Mancha
Facultad de CC. Económicas y Empresariales, Albacete
Índice (Outline)
1. Introduction
2. Algorithms
   - Adaboost
   - Bagging
3. Functions
   - Boosting functions
   - Bagging functions
   - Margin function
4. Example
5. Conclusions
1. Introduction
Over the last decades, many ensemble classification methods based on classification trees have been developed. The library presented here implements two of the most popular ones: AdaBoost and Bagging. While AdaBoost [Freund and Schapire, 1996] builds its base classifiers sequentially, updating the distribution over the training set to create each of them, Bagging [Breiman, 1996] combines individual classifiers built on bootstrap replicates of the training set.
1. Introduction
Adabag 2.1 is an update of Adabag that adds a new function, margins, to compute the margins of the classifiers. The first version was published in June 2006 and has been widely used for classification tasks in fields such as:
- Kriegler, B. & Berk, R. (2007). Estimation of homeless people in Los Angeles (small-area spatial estimation).
- Stewart, B. & Zhukov, Y. (2009). Content analysis for the classification of military documents.
- De Bock, K.W., Coussement, K. & Van den Poel, D. (2010). Ensemble classification based on Generalized Additive Models.
- De Bock, K.W. & Van den Poel, D. (2011). Prediction of customer churn.
And in R packages such as digeR, a GUI to analyze two-dimensional Difference In-Gel Electrophoresis (DIGE) data, by Fan, Y., Murphy, T.B. & Watson, W.G. (2009): digeR Manual.
2. Algorithms: AdaBoost
- Uses reweighting instead of resampling.
- The previous weak classifier has only 50% accuracy on the training set with the new weights.
- The final classification is based on the weighted vote of the individual classifiers.
AdaBoost.M1
AdaBoost.M1 is applied as follows:
1. Start with the training set $T_b$; initially all weights are $w_i^1 = 1/n$.
2. Repeat the procedure for $b = 1, 2, \dots, B$:
   a) Build on $T_b$ an individual classifier $C_b(x_i) \in \{1, \dots, k\}$.
   b) Compute the weighted error and the constant $\alpha_b$:
      $e_b = \sum_{i=1}^{n} w_i^b \, I(C_b(x_i) \neq y_i), \qquad \alpha_b = \tfrac{1}{2}\ln\left(\frac{1-e_b}{e_b}\right)$
   c) Update and normalize the weights:
      $w_i^{b+1} = w_i^b \exp\left(\alpha_b \, I(C_b(x_i) \neq y_i)\right)$
3. The final classifier is the weighted vote:
   $C_f(x) = \arg\max_{j \in Y} \sum_{b=1}^{B} \alpha_b \, I(C_b(x) = j)$
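The reweighting scheme above can be sketched in plain Python. This is a hypothetical, minimal re-implementation (the `base_learner` callable and helper names are assumptions for illustration, not the adabag code, which grows rpart trees):

```python
import math

def adaboost_m1(X, y, base_learner, B=10):
    """Minimal AdaBoost.M1 sketch (reweighting variant).

    base_learner(X, y, w) must return a function h(x) -> class label.
    Returns the individual classifiers and their alpha weights.
    """
    n = len(X)
    w = [1.0 / n] * n                                 # step 1: uniform weights
    classifiers, alphas = [], []
    for b in range(B):
        h = base_learner(X, y, w)                     # step 2a: fit on weighted data
        miss = [h(x) != yi for x, yi in zip(X, y)]
        e = sum(wi for wi, m in zip(w, miss) if m)    # step 2b: weighted error
        if e == 0 or e >= 0.5:                        # stop: perfect or too weak
            if e == 0:
                classifiers.append(h); alphas.append(1.0)
            break
        alpha = 0.5 * math.log((1 - e) / e)           # 'Breiman' coefficient
        # step 2c: raise weights of misclassified points, then normalize
        w = [wi * math.exp(alpha if m else 0.0) for wi, m in zip(w, miss)]
        s = sum(w)
        w = [wi / s for wi in w]
        classifiers.append(h); alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, x):
    # step 3: weighted vote over all classes
    votes = {}
    for h, a in zip(classifiers, alphas):
        votes[h(x)] = votes.get(h(x), 0.0) + a
    return max(votes, key=votes.get)
```

Any weak learner that accepts observation weights fits this template; adabag uses classification trees.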
[Diagram: Boosting (AdaBoost), Freund and Schapire. Training sets T1, T2, ..., TB are drawn from a probability distribution (bootstrap or weighted) that is updated at each step; the classifiers C1, C2, ..., CB are combined by weighted vote into the final classifier Cf.]
2. Algorithms: Bagging
Bootstrapping and aggregating (Breiman, 1996)
For b = 1, 2, ..., B:
- Draw, with replacement, n' = n samples from the training set (Tb).
- Train the classifier Cb.
The final classifier is obtained by simple majority vote of C1, ..., CB.
It increases the stability of the classifier / decreases its variance.
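The loop above can be sketched as follows (hypothetical helper names for illustration, not the adabag implementation, which fits rpart trees):

```python
import random
from collections import Counter

def bagging_fit(X, y, base_learner, B=10, seed=0):
    """Minimal bagging sketch: train B classifiers on bootstrap replicates."""
    rng = random.Random(seed)
    n = len(X)
    classifiers = []
    for _ in range(B):
        # draw n samples with replacement (a bootstrap replicate of T)
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        classifiers.append(base_learner(Xb, yb))
    return classifiers

def bagging_predict(classifiers, x):
    # simple majority vote over the individual predictions
    votes = Counter(h(x) for h in classifiers)
    return votes.most_common(1)[0][0]
```

Unlike boosting, each replicate is drawn independently, so the B classifiers could even be trained in parallel.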
[Diagram: Bagging (Breiman, 1996). Bootstrap training sets T1, T2, ..., TB are drawn from T by random selection with replacement; the classifiers C1, C2, ..., CB are combined by simple vote into the final classifier Cf.]
3. Functions: Boosting
adaboost.M1 — Applies the Adaboost.M1 algorithm to a data set
Description
Fits the Adaboost.M1 algorithm proposed by Freund and Schapire in 1996 using classification trees as single classifiers.
Usage
adaboost.M1(formula, data, boos = TRUE, mfinal = 100, coeflearn = 'Breiman', control)
Arguments
formula    a formula, as in the lm function.
data       a data frame in which to interpret the variables named in formula.
boos       if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each observation on that iteration. If FALSE, every observation is used with its weights.
mfinal     an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal = 100 iterations.
coeflearn  if 'Breiman' (by default), alpha = 1/2 ln((1-err)/err) is used. If 'Freund', alpha = ln((1-err)/err) is used. Where alpha is the weight updating coefficient.
control    options that control details of the rpart algorithm. See rpart.control for more details.
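The two coeflearn options differ only by a constant factor, which the following sketch makes explicit (an illustrative helper, not part of adabag):

```python
import math

def alpha(err, coeflearn="Breiman"):
    """Weight-updating coefficient for the two coeflearn options."""
    if coeflearn == "Breiman":
        return 0.5 * math.log((1 - err) / err)   # alpha = 1/2 ln((1-err)/err)
    return math.log((1 - err) / err)             # 'Freund': alpha = ln((1-err)/err)
```

For the same error, the 'Freund' coefficient is exactly twice the 'Breiman' one; both vanish at err = 0.5, where the weak classifier is no better than chance.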
3. Functions: Boosting
adaboost.M1 — Applies the Adaboost.M1 algorithm to a data set
Details
Adaboost.M1 is a simple generalization of Adaboost for more than two classes.
Value
An object of class adaboost.M1, which is a list with the following components:
formula     the formula used.
trees       the trees grown along the iterations.
votes       a matrix describing, for each observation, the number of trees that assigned it to each class, weighting each tree by its alpha coefficient.
class       the class predicted by the ensemble classifier.
importance  returns the relative importance of each variable in the classification task. This measure is the number of times each variable is selected to split.
3. Functions: Boosting
predict.boosting — Predicts from a fitted adaboost.M1 object
Description
Classifies a data frame using a fitted adaboost.M1 object.
Usage
## S3 method for class 'boosting'
predict(object, newdata, ...)
Arguments
object   fitted model object of class adaboost.M1. This is assumed to be the result of some function that produces an object with the same named components as that returned by the adaboost.M1 function.
newdata  data frame containing the values at which predictions are required. The predictors referred to in the right side of formula(object) must be present by name in newdata.
...      further arguments passed to or from other methods.
3. Functions: Boosting
predict.boosting — Predicts from a fitted adaboost.M1 object
Value
An object of class predict.boosting, which is a list with the following components:
formula    the formula used.
votes      a matrix describing, for each observation, the number of trees that assigned it to each class, weighting each tree by its alpha coefficient.
class      the class predicted by the ensemble classifier.
confusion  the confusion matrix which compares the real class with the predicted one.
error      returns the average error.
3. Functions: Boosting
boosting.cv — Runs v-fold cross validation with adaboost.M1
Description
The data are divided into v non-overlapping subsets of roughly equal size. Then, adaboost.M1 is applied on (v-1) of the subsets. Finally, predictions are made for the left out subsets, and the process is repeated for each of the v subsets.
Usage
boosting.cv(formula, data, v = 10, boos = TRUE, mfinal = 100, coeflearn = "Breiman", control)
Arguments
formula    a formula, as in the lm function.
data       a data frame in which to interpret the variables named in formula.
boos       if TRUE (by default), a bootstrap sample of the training set is drawn using the weights for each observation on that iteration. If FALSE, every observation is used with its weights.
v          an integer, specifying the type of v-fold cross validation. Defaults to v = 10. If v is set as the number of observations, leave-one-out cross validation is carried out. Besides this, every value between two and the number of observations is valid and means that roughly every v-th observation is left out.
mfinal     an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal = 100 iterations.
coeflearn  if "Breiman" (by default), alpha = 1/2 ln((1-err)/err) is used. If "Freund", alpha = ln((1-err)/err) is used. Where alpha is the weight updating coefficient.
control    options that control details of the rpart algorithm. See rpart.control for more details.
3. Functions: Boosting
boosting.cv — Runs v-fold cross validation with adaboost.M1
Value
An object of class boosting.cv, which is a list with the following components:
class      the class predicted by the ensemble classifier.
confusion  the confusion matrix which compares the real class with the predicted one.
error      returns the average error.
3. Functions: Bagging
bagging — Applies the Bagging algorithm to a data set
Description
Fits the Bagging algorithm proposed by Breiman in 1996 using classification trees as single classifiers.
Usage
bagging(formula, data, mfinal = 100, control)
Arguments
formula  a formula, as in the lm function.
data     a data frame in which to interpret the variables named in formula.
mfinal   an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal = 100 iterations.
control  options that control details of the rpart algorithm. See rpart.control for more details.
3. Functions: Bagging
bagging — Applies the Bagging algorithm to a data set
Details
Unlike boosting, in bagging the individual classifiers are independent of one another.
Value
An object of class bagging, which is a list with the following components:
formula     the formula used.
trees       the trees grown along the iterations.
votes       a matrix describing, for each observation, the number of trees that assigned it to each class.
class       the class predicted by the ensemble classifier.
samples     the bootstrap samples used along the iterations.
importance  returns the relative importance of each variable in the classification task. This measure is the number of times each variable is selected to split.
3. Functions: Bagging
predict.bagging — Predicts from a fitted bagging object
Description
Classifies a data frame using a fitted bagging object.
Usage
## S3 method for class 'bagging'
predict(object, newdata, ...)
Arguments
object   fitted model object of class bagging. This is assumed to be the result of some function that produces an object with the same named components as that returned by the bagging function.
newdata  data frame containing the values at which predictions are required. The predictors referred to in the right side of formula(object) must be present by name in newdata.
...      further arguments passed to or from other methods.
3. Functions: Bagging
predict.bagging — Predicts from a fitted bagging object
Value
An object of class predict.bagging, which is a list with the following components:
formula    the formula used.
votes      a matrix describing, for each observation, the number of trees that assigned it to each class.
class      the class predicted by the ensemble classifier.
confusion  the confusion matrix which compares the real class with the predicted one.
error      returns the average error.
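The confusion and error components returned by the predict functions can be sketched as (an illustrative helper, not part of adabag):

```python
def confusion_and_error(observed, predicted):
    """Confusion counts (predicted class x observed class) and average error."""
    conf = {}
    for o, p in zip(observed, predicted):
        conf[(p, o)] = conf.get((p, o), 0) + 1   # row: predicted, col: observed
    # average error = share of observations whose prediction differs
    error = sum(o != p for o, p in zip(observed, predicted)) / len(observed)
    return conf, error
```

Diagonal entries (p == o) count correct classifications; the error is one minus their share.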
3. Functions: Bagging
bagging.cv — Runs v-fold cross validation with Bagging
Description
The data are divided into v non-overlapping subsets of roughly equal size. Then, bagging is applied on (v-1) of the subsets. Finally, predictions are made for the left out subsets, and the process is repeated for each of the v subsets.
Usage
bagging.cv(formula, data, v = 10, mfinal = 100, control)
Arguments
formula  a formula, as in the lm function.
data     a data frame in which to interpret the variables named in formula.
v        an integer, specifying the type of v-fold cross validation. Defaults to v = 10. If v is set as the number of observations, leave-one-out cross validation is carried out. Besides this, every value between two and the number of observations is valid and means that roughly every v-th observation is left out.
mfinal   an integer, the number of iterations for which boosting is run or the number of trees to use. Defaults to mfinal = 100 iterations.
control  options that control details of the rpart algorithm. See rpart.control for more details.
3. Functions: Bagging
bagging.cv — Runs v-fold cross validation with Bagging
Value
An object of class bagging.cv, which is a list with the following components:
class      the class predicted by the ensemble classifier.
confusion  the confusion matrix which compares the real class with the predicted one.
error      returns the average error.
3. Functions: Margins
margins — Calculates the margins
Description
Calculates the margins of an adaboost.M1 or bagging classifier for a data frame.
Usage
margins(object, newdata, ...)
Arguments
object   This object must be the output of one of the functions bagging, adaboost.M1, predict.bagging or predict.boosting. This is assumed to be the result of some function that produces an object with two components named formula and class, as those returned for instance by the bagging function.
newdata  The same data frame used for building the object.
3. Functions: Margins
margins — Calculates the margins
Details
Intuitively, the margin for an observation is related to the certainty of its classification. It is calculated as the difference between the support of the correct class and the maximum support of a wrong class.
Value
An object of class margins, which is a list with only one component:
margins  a vector with the margins.
4. Example: Vehicle (4 classes)
> library(adabag)
> data(Vehicle)
> l <- length(Vehicle[,1])
> sub <- sample(1:l, 2*l/3)  # training set indices
4. Example: Vehicle (4 classes)
> Vehicle.bagging <- bagging(Class ~ ., data = Vehicle[sub, ])
> Vehicle.bagging.pred <- predict.bagging(Vehicle.bagging, newdata = Vehicle[-sub, ])
> Vehicle.bagging.pred$confusion
               Observed Class
Predicted Class bus opel saab van
           bus   67   10    9   1
           opel   1   32   11   0
           saab   1   25   40   0
           van    0    5   12  68
> Vehicle.bagging.pred$error
[1] 0.2659574
4. Example: Vehicle (4 classes)
> Vehicle.bagging$importance
        Comp      Circ    D.Circ    Rad.Ra Pr.Axis.Ra  Max.L.Ra   Scat.Ra     Elong
    9.433962  2.744425  4.459691  2.572899   3.087479 14.236707  4.459691  9.777015
Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis    Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis
    0.000000  12.006861     4.116638     9.777015  1.372213   2.229846   6.861063   1.715266
  Kurt.Maxis   Holl.Ra
    5.660377  5.488851
> barplot(sort(Vehicle.bagging$importance, dec = T), col = "blue", horiz = T, las = 1)
[Figure: barplot of the sorted variable importance]
4. Example: Vehicle (4 classes)
> Vehicle.bagging.cv <- bagging.cv(Class ~ ., data = Vehicle)
> names(Vehicle.bagging.cv)
[1] "class" "confusion" "error"
> Vehicle.bagging.cv$confusion
               Observed Class
Predicted Class bus opel saab van
           bus  206   20   28   1
           opel   3  106   72   0
           saab   0   70   91   1
           van    9   16   26 197
> Vehicle.bagging.cv$error
[1] 0.2907801
4. Example: Vehicle (4 classes)
> margins(Vehicle.bagging, Vehicle[sub, ]) -> Vehicle.bagging.margins # training set
> margins(Vehicle.bagging.pred, Vehicle[-sub, ]) -> Vehicle.bagging.predmargins # test set
> margins.test <- Vehicle.bagging.predmargins[[1]]
> margins.train <- Vehicle.bagging.margins[[1]]
> plot(sort(margins.train), (1:length(margins.train))/length(margins.train), type = "l", xlim = c(-1,1),
+ main = "Bagging, Margin cumulative distribution graph", xlab = "m", ylab = "% observations", col = "blue3", lty = 2)
> abline(v = 0, col = "red", lty = 2)
> lines(sort(margins.test), (1:length(margins.test))/length(margins.test), type = "l", cex = .5, col = "green")
[Figure: cumulative distribution of the margins on the training (blue) and test (green) sets]
4. Example: Vehicle (4 classes)
> Vehicle.adaboost <- adaboost.M1(Class ~ ., data = Vehicle[sub, ])
> Vehicle.adaboost.pred <- predict.boosting(Vehicle.adaboost, newdata = Vehicle[-sub, ])
> error.adaboost <- Vehicle.adaboost.pred$error
> error.adaboost
[1] 0.2304965
4. Example: Vehicle (4 classes)
> Vehicle.adaboost$importance
        Comp      Circ    D.Circ    Rad.Ra Pr.Axis.Ra  Max.L.Ra   Scat.Ra     Elong
    7.100592  3.550296  6.213018  3.254438   9.171598 11.538462  6.804734  5.917160
Pr.Axis.Rect Max.L.Rect Sc.Var.Maxis Sc.Var.maxis    Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis
    0.000000  10.946746     5.029586     5.325444  6.213018   4.733728   5.325444   3.254438
  Kurt.Maxis   Holl.Ra
    3.254438  2.366864
> barplot(sort(Vehicle.adaboost$importance, dec = T), col = "blue", horiz = T, las = 1)
[Figure: barplot of the sorted variable importance]
4. Example: Vehicle (4 classes)
> Vehicle.adaboost.cv <- boosting.cv(Class ~ ., data = Vehicle)
> names(Vehicle.adaboost.cv)
[1] "class" "confusion" "error"
> Vehicle.adaboost.cv$confusion
               Observed Class
Predicted Class bus opel saab van
           bus  213    2    3   1
           opel   3  116   74   2
           van    1    6    6 192
> Vehicle.adaboost.cv$error
[1] 0.2257683
4. Example: Vehicle (4 classes)
> margins(Vehicle.adaboost, Vehicle[sub, ]) -> Vehicle.adaboost.margins # training set
> margins(Vehicle.adaboost.pred, Vehicle[-sub, ]) -> Vehicle.adaboost.predmargins # test set
> margins.test <- Vehicle.adaboost.predmargins[[1]]
> margins.train <- Vehicle.adaboost.margins[[1]]
> plot(sort(margins.train), (1:length(margins.train))/length(margins.train), type = "l", xlim = c(-1,1),
+ main = "AdaBoost, Margin cumulative distribution graph", xlab = "m", ylab = "% observations", col = "blue3", lty = 2)
> abline(v = 0, col = "red", lty = 2)
> lines(sort(margins.test), (1:length(margins.test))/length(margins.test), type = "l", cex = .5, col = "green")
[Figure: cumulative distribution of the margins on the training (blue) and test (green) sets]
5. Conclusions
The functions included in this library allow:
1. Solving classification problems with more than two classes.
3. Measuring the relative importance of the variables.
4. Computing the margins of the ensemble classifiers.
5. Estimating the error through cross validation.
Thank you for your attention!
e-mail: [email protected]