7/29/2019 AdaBoost-1
AdaBoost Theory and Application
Ying Qin
AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Freund and
Schapire. AdaBoost is adaptive in the sense that subsequent classifiers are built to focus on the
instances misclassified by previous classifiers. In this report, I summarize the basic theory of
boosting and AdaBoost and introduce applications of AdaBoost in music information
retrieval (MIR).
1 Boosting
There is an old saying that there is strength in numbers, meaning that the result of a
group can be greater than the simple sum of its parts. This is also, to some extent, true for
machine learning. Boosting is a supervised machine learning approach that builds a strong
classifier from weak ones. Each weak classifier receives an input and returns a positive or
negative vote, and the final strong classifier outputs the weighted vote, where the weights
depend on the quality of the weak classifiers. In this way, every added weak classifier contributes
to and improves the outcome.
The development of boosting algorithms dates back to 1988, when Kearns and Valiant first
explored the potential of boosting a weak classifier (one only slightly better than chance) into a
strong classifier. Later, in 1990, Schapire showed that a learner, even if rough and moderately
inaccurate, could always improve its performance by training two additional classifiers on filtered
versions of the input data stream. The first provable polynomial-time boosting algorithm was
discussed in Schapire's work "The Strength of Weak Learnability". Inspired by Schapire, Freund
proposed in 1995 a far more efficient algorithm that combines a large number of hypotheses.
However, that algorithm has practical drawbacks, for it assumes that each hypothesis has a fixed
error rate. Finally, the AdaBoost algorithm was introduced in 1997 by Freund and Schapire, and it
solved many of the practical difficulties of the previous boosting algorithms (de Haan 2010).
2 AdaBoost
AdaBoost needs no prior knowledge of the accuracies of the weak classifiers. Rather, it iteratively
applies a learning algorithm to the same training data and adds the resulting classifiers to the final
classifier. At each iteration, it generates a confidence parameter that changes according to the
error of the weak hypothesis. This is the basis of its name: Ada is short for adaptive.
2.1 Understanding AdaBoost
Given a binary classification case, the training set will contain both typical and rare samples, and we
usually have no idea of the importance of each sample. Thus, we might simply give them equal
weights to initiate training. The classification error is then calculated to reweight the data for the next
classifier, and the aim of re-weighting is to make the correctly and incorrectly classified samples
each carry 50% of the total weight. Since we assume that the error rate is always smaller than 1/2,
the reweighting will reduce the weights of correct samples and increase the weights of error samples.
In other words, the weak classifier at the first iteration (the first classifier) is good on average
training samples, and the classifier at the second iteration (the second classifier) is good on the
errors of the first classifier. After a number of iterations, the sample weights focus the attention of
the weak learner on the hard examples near the boundary between the two classes.
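The 50/50 re-weighting described above can be checked numerically. A minimal sketch (the six-sample setup and the error pattern are illustrative assumptions, not from the report):

```python
import numpy as np

# Six equally weighted samples; suppose the weak classifier gets
# two of them wrong, so the weighted error is eps = 1/3.
w = np.full(6, 1 / 6)
correct = np.array([True, True, True, True, False, False])

eps = w[~correct].sum()                     # weighted error: 1/3
alpha = 0.5 * np.log((1 - eps) / eps)       # confidence parameter

# Shrink correct samples, grow misclassified ones, then renormalize.
w = w * np.exp(np.where(correct, -alpha, alpha))
w = w / w.sum()

# Correct and misclassified samples now each hold 50% of the weight.
print(w[correct].sum(), w[~correct].sum())
```

Whatever the initial error rate below 1/2, this choice of α makes the next round a maximally hard (50/50) problem for the previous classifier.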
During the whole training process, once the weak classifier h_t has been received, AdaBoost
assigns a confidence parameter α_t to h_t, which is directly related to its error ε_t. In this way, we
give more weight to classifiers with lower error, and this choice decreases the overall error. The
strong classifier results as a weighted linear combination of the weak classifiers, whose weights are
determined by their errors:

    α_t = (1/2) ln((1 - ε_t)/ε_t) > 0.

The algorithm must terminate if α_t ≤ 0, which is equivalent to ε_t ≥ 1/2. A step-by-step
illustration of the AdaBoost algorithm can be formulated as follows.
1. Build the distribution D_1, assuming all samples equally important: D_1(i) = 1/N
2. For t = 1, ..., T (rounds of boosting)
- Select the weak classifier h_t with the lowest weighted error ε_t from a group of candidates
- Check if the error is larger than 1/2
(YES: terminate; NO: go on)
- Calculate the confidence parameter α_t = (1/2) ln((1 - ε_t)/ε_t), the weight of the sub-classifier
- Re-weight the data samples to give poorly classified samples an increased weight:
D_{t+1}(i) = D_t(i) exp(-α_t y_i h_t(x_i)) / Z_t,
where Z_t is the normalization factor that makes D_{t+1} sum to one
3. At the end (the T-th round), the final strong classifier results:
H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

When using AdaBoost, the training data must represent reality, and the total number of samples
must be relatively large compared to the number of features. Since AdaBoost is particularly
suited to working with many features, it prefers large databases. Besides, it is important to
remember that the weak classifier must be weak enough; otherwise, the resulting strong learner
might overfit easily. In fact, boosting seems to be especially susceptible to noise in such cases. The
most popular choices for the weak classifier are decision trees or decision stumps (decision trees with
two leaves) if no a priori knowledge is available on the domain of the learning problem.
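The boosting loop above can be sketched in a few lines of Python, using decision stumps on a single feature as the weak learners. The toy dataset and the threshold grid are illustrative choices, not from the report:

```python
import numpy as np

def stump_predict(x, theta, p):
    # Decision stump on one feature: predict p where x < theta, else -p.
    return np.where(x < theta, p, -p)

def adaboost_train(x, y, T=10):
    n = len(x)
    w = np.full(n, 1.0 / n)                 # D_1: all samples equally important
    xs = np.unique(x)
    # Candidate thresholds: midpoints, plus one below/above the data range
    # so constant classifiers are also available.
    thetas = np.concatenate(([xs[0] - 1.0],
                             (xs[:-1] + xs[1:]) / 2.0,
                             [xs[-1] + 1.0]))
    ensemble = []
    for _ in range(T):
        # Select the weak classifier with the lowest weighted error.
        best = None
        for theta in thetas:
            for p in (1, -1):
                err = w[stump_predict(x, theta, p) != y].sum()
                if best is None or err < best[0]:
                    best = (err, theta, p)
        eps, theta, p = best
        if eps >= 0.5:                      # no better than chance: terminate
            break
        eps = max(eps, 1e-12)               # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)   # confidence parameter
        # Re-weight: raise misclassified samples, then normalize (divide by Z_t).
        w = w * np.exp(-alpha * y * stump_predict(x, theta, p))
        w = w / w.sum()
        ensemble.append((alpha, theta, p))
    return ensemble

def adaboost_predict(ensemble, x):
    # Strong classifier: sign of the weighted vote of the weak classifiers.
    score = sum(a * stump_predict(x, th, p) for a, th, p in ensemble)
    return np.sign(score)

# Toy data that no single stump can separate, but three rounds can:
x = np.array([0.0, 1, 2, 3, 4, 5])
y = np.array([1, 1, -1, -1, 1, 1])
ensemble = adaboost_train(x, y, T=3)
```

On this data the first stump still errs on a third of the samples, yet the three-stump weighted vote classifies every training point correctly.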
2.2 AdaBoost extensions
It is possible to extend the basic AdaBoost algorithm to obtain better performance. Two major
extensions are abstention and regularization. As we have seen, in typical AdaBoost the
binary weak learner h : X → {-1, 1} is forced to give an opinion on every example x. This is not
always desirable, as the weak learner may not be suited to classify every x. The solution to this
problem is the abstention base classifier, which knows when to abstain and has the form
h : X → {-1, 0, 1}. AdaBoost might also overfit in some cases if it is run long enough. To address
this, the general approach is to tune the number of iterations on a validation set, while
regularization introduces an edge offset parameter θ ≥ 0 into the confidence formula:

    α_t = (1/2) ln((1 - ε_t)/ε_t) - (1/2) ln((1 + θ)/(1 - θ)).

This formula shows how the confidence is decreased by a constant term every iteration,
suggesting a mechanism similar to weight decay for reducing the effect of overfitting.
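As a small sketch, the edge-offset confidence can be computed as follows (the function name and the example values ε = 0.2, θ = 0.1 are illustrative assumptions):

```python
import numpy as np

def alpha_regularized(eps, theta=0.0):
    # Confidence with an edge offset theta >= 0: the usual
    # (1/2) ln((1 - eps)/eps) minus a constant penalty term.
    return 0.5 * np.log((1 - eps) / eps) - 0.5 * np.log((1 + theta) / (1 - theta))

# theta = 0 reduces to plain AdaBoost; a positive theta shrinks
# every round's confidence by the same constant amount.
print(alpha_regularized(0.2), alpha_regularized(0.2, theta=0.1))
```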
2.3 Multi-class AdaBoost
Binary AdaBoost is a simple and well understood scenario, but we still need extensions to
deal with multi-class problems. AdaBoost.M1, proposed by Freund and Schapire, is the simplest
and most straightforward approach. Here the weak learner is a full multi-class algorithm
itself, and the AdaBoost algorithm does not need to be modified in any sense. However,
this method fails if the weak learner cannot achieve at least 50% accuracy on all classes when run
on hard problems. In AdaBoost.MH, proposed by Schapire and Singer, the weak learner receives a
distribution of weights w_{i,ℓ} defined over both the data samples and the classes. In general, this
weight expresses how hard it is to classify sample x_i into its correct class. Schapire and Singer also
proposed AdaBoost.MO, which partitions the multi-class problem into a set of binary problems.
This method can be implemented by use of error-correcting output codes (ECOC) decomposition
(Casagrande 2005).
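A single AdaBoost.M1 re-weighting round can be sketched as follows, using the β_t = ε_t/(1 - ε_t) update of Freund and Schapire (the helper name and the toy three-class labels are illustrative assumptions):

```python
import numpy as np

def m1_round(w, y_true, y_pred):
    # One AdaBoost.M1 round: labels may come from any number of classes,
    # but the weak learner must stay below 50% weighted error.
    miss = y_pred != y_true
    eps = w[miss].sum()
    if eps >= 0.5:
        raise ValueError("weak learner no better than chance; stop boosting")
    beta = eps / (1 - eps)
    w = np.where(miss, w, w * beta)        # scale down correct samples
    return w / w.sum(), np.log(1 / beta)   # new weights, vote weight of h_t

# Five samples from three classes, one misclassified (eps = 0.2):
w = np.full(5, 0.2)
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 2])         # last sample wrong
w_new, vote = m1_round(w, y_true, y_pred)
```

As in the binary case, the update leaves the misclassified mass and the correctly classified mass at 50% each after normalization.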
3 AdaBoost in MIR
AdaBoost has been used in a number of MIR problems in recent years. Dixon et al. presented a
method of genre classification from automatically extracted rhythmic patterns using AdaBoost
(Dixon et al. 2004). Casagrande described an approach using multi-class AdaBoost to classify
audio files based on extracted features (Casagrande 2005). Bergstra et al. presented an algorithm
that predicts musical genre and artist from an audio waveform, using AdaBoost to select from a
set of audio features (Bergstra et al. 2006). Eck et al. proposed a method for predicting social tags
for music recommendation directly from MP3 files using AdaBoost (Eck et al. 2007). Bertin-Mahieux
et al. extended the work of Eck et al. by replacing the AdaBoost batch learning algorithm with
FilterBoost, an online version of AdaBoost (Bertin-Mahieux et al. 2008). Overall, AdaBoost has
been shown to be an effective machine learning algorithm for music classification.
References
1. de Haan, Gerard. Digital Video Post Processing. Eindhoven, 2010.
2. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
3. Bergstra, J., N. Casagrande, D. Erhan, D. Eck, and B. Kégl. "Aggregate Features and AdaBoost for Music Classification." Machine Learning 65, no. 2 (2006): 473-84.
4. Bertin-Mahieux, T., D. Eck, F. Maillet, and P. Lamere. "Autotagger: A Model for Predicting Social Tags from Acoustic Features on Large Music Databases." Journal of New Music Research 37, no. 2 (2008): 115-35.
5. Casagrande, Norman. "Automatic Music Classification Using Boosting Algorithms and Auditory Features." PhD thesis, Department of Computer Science and Operations Research, University of Montreal, 2005.
6. Dixon, S., F. Gouyon, and G. Widmer. "Towards Characterization of Music via Rhythmic Patterns." In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR), 509-16. 2004.
7. Eck, D., P. Lamere, T. Bertin-Mahieux, and S. Green. "Automatic Generation of Social Tags for Music Recommendation." Advances in Neural Information Processing Systems 20, no. 20 (2007): 1-8.