Class Imbalance in Text Classification
Project ID: 08
Elham Jebalbarezi, Nedjma Ousidhoum
Outline
• Class Imbalance
• Algorithms for Class Imbalance
• Text Classification
• Feature Selection for Text Classification
• Experiments
• Results
• Discussion
The Class Imbalance Problem (1)
• A common problem in machine learning: almost all instances belong to one majority class, and the rest belong to the minority class.
• Imbalance level = |majority class| / |minority class|. It can be huge (on the order of 10^6).
• Applications: detecting oil spills, text classification, fraud detection, and many medical applications such as automatic diagnosis.
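The imbalance level defined above is simple to compute; a minimal sketch (the `imbalance_level` helper and the toy label list are illustrative, not from the project):

```python
from collections import Counter

def imbalance_level(labels):
    """Ratio |majority class| / |minority class| over a list of class labels."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy dataset: 9 negative instances, 1 positive instance.
labels = ["neg"] * 9 + ["pos"]
level = imbalance_level(labels)  # → 9.0
```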
The Class Imbalance Problem (2)
• Many classification algorithms are sensitive to an imbalanced class distribution.
• Class imbalance is taken into account in the design of new classifiers.
• Solutions: cost-sensitive learning, data resampling, feature selection.
Cost-Sensitive Algorithms
• Penalties are assigned to mistakes made by the classification algorithm.
• Asymmetric misclassification costs are assigned to the classes: the penalty is higher for mistakes on the minority class, to emphasize the correct classification of minority instances.
• Cost-sensitive learning does not modify the class distribution.
Data Resampling
• Learning instances in the majority class and minority class are manipulated in order to balance the class distribution.
• Effective but may introduce noise or remove useful information.
Data Resampling: Oversampling
• Duplicates minority-class instances so they carry more weight in the learning algorithm.
• Effective, but prone to overfitting.
• Variants: SMOTE (Synthetic Minority Oversampling Technique), MSMOTE (Modified SMOTE), …
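The two flavours above can be sketched in a few lines: random oversampling duplicates minority instances, while SMOTE creates synthetic instances by interpolating between a minority point and one of its neighbours. The helper names and toy data below are illustrative, not the project's code:

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority instances until classes are balanced."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_needed = sum(lab != minority_label for lab in y) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_needed)]
    return X + extra, y + [minority_label] * n_needed

def smote_point(a, b, u):
    """SMOTE-style synthetic instance: interpolate between minority point a
    and neighbour b with a random factor u in [0, 1]."""
    return [ai + u * (bi - ai) for ai, bi in zip(a, b)]

X = [[0.0], [0.1], [1.0], [1.1], [1.2], [1.3]]
y = ["min", "min", "maj", "maj", "maj", "maj"]
X_bal, y_bal = random_oversample(X, y, "min")
```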
Data Resampling: Undersampling
• Using a subset of the majority class to train the classifier.
• Many majority class examples are ignored so that the training set becomes more balanced and the training process becomes faster.
• Effective but may discard useful information.
• There are variants of undersampling, e.g. one-sided undersampling.
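A minimal sketch of random undersampling under the description above (helper name and toy data are assumptions): keep a random majority subset of the same size as the minority class.

```python
import random

def random_undersample(X, y, majority_label, seed=0):
    """Keep a random subset of the majority class, same size as the minority class."""
    rng = random.Random(seed)
    majority = [(x, lab) for x, lab in zip(X, y) if lab == majority_label]
    minority = [(x, lab) for x, lab in zip(X, y) if lab != majority_label]
    kept = rng.sample(majority, len(minority)) + minority
    return [x for x, _ in kept], [lab for _, lab in kept]

X = [[i] for i in range(10)]
y = ["maj"] * 8 + ["min"] * 2
X_bal, y_bal = random_undersample(X, y, "maj")
```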
Bagging/Boosting
• Bootstrapping is random sampling with replacement.
• Bagging aggregates classifiers induced over independently drawn bootstrap samples.
• Boosting focuses on difficult samples by giving them higher weights.
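The bootstrapping and aggregation steps above can be sketched as follows; the helper names are hypothetical, and the aggregation shown is a simple majority vote:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bootstrap sample: random draws with replacement, same size as the original."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Bagging aggregation step: the most common prediction wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))
# Bagging would train one classifier per independently drawn bootstrap sample.
samples = [bootstrap_sample(data, rng) for _ in range(3)]
vote = majority_vote(["pos", "neg", "pos"])
```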
Feature Selection
• Feature selection can improve the performance of naive Bayes and regularized logistic regression on imbalanced data.
• The challenges of feature selection and imbalanced classification meet when the dataset is high-dimensional and its class distribution is highly imbalanced.
Text Classification
• Sorting natural language texts or documents into predefined categories based on their content.
• Applications: automatic indexing, document organization, text filtering, hierarchical categorization of web pages, spam filtering, …
• Class imbalance is common in text classification (e.g. in spam filtering).
Feature Selection in Text Classification
• Common in text classification because it can improve classification performance.
• Features are selected using different metrics (TF, chi-square, information gain) for a nearly optimal classification.
• Positive and negative features can be used.
• Combining positive and negative features might be useful.
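TF-based selection, the metric used later in the experiments, simply keeps the terms with the highest corpus-wide term frequency. A minimal sketch (function name, tokenization by whitespace, and toy documents are assumptions):

```python
from collections import Counter

def select_features_by_tf(documents, k):
    """Keep the k terms with the highest corpus-wide term frequency (TF)."""
    tf = Counter()
    for doc in documents:
        tf.update(doc.split())  # naive whitespace tokenization
    return [term for term, _ in tf.most_common(k)]

docs = ["oil spill oil", "oil price rises", "spill cleanup"]
top = select_features_by_tf(docs, 2)  # → ['oil', 'spill']
```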
Experiments
• We implemented random oversampling, random undersampling, SMOTE, MSMOTE, and one-sided undersampling.
• Our approach: we combined feature selection and resampling by
  1. calculating term frequency (TF),
  2. applying a resampling algorithm.
• Dataset: Reuters-21578.
• Chosen evaluation metrics:
  precision = tp / (tp + fp), recall = tp / (tp + fn), F-measure = 2 · precision · recall / (precision + recall)
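The three metrics above follow directly from the minority-class confusion counts; a minimal sketch (the confusion counts below are hypothetical, not taken from our result tables):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion counts for a minority class.
p, r, f = precision_recall_f1(tp=15, fp=85, fn=6)
```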
Experiments: Data
Experiments: Random Oversampling
Experiments: SMOTE
Experiments: MSMOTE
Experiments: Random Undersampling
Experiments: One-sided Undersampling
Results (1)

|           | Without sampling | Random oversampling | Random undersampling | One-sided undersampling | SMOTE  | MSMOTE  |
|-----------|------------------|---------------------|----------------------|-------------------------|--------|---------|
| Precision | 0.03191          | 0.09259             | 0.14772              | 0.15957                 | 0.0909 | 0.0434  |
| Recall    | 0.14285          | 0.23809             | 0.61904              | 0.71428                 | 0.2380 | 0.09523 |
| F-measure | 0.05217          | 0.13333             | 0.2385               | 0.26086                 | 0.1315 | 0.0597  |

No feature selection
Results (2)

|           | Without sampling | Random oversampling | Random undersampling | One-sided undersampling | SMOTE  | MSMOTE |
|-----------|------------------|---------------------|----------------------|-------------------------|--------|--------|
| Precision | 1                | 0.6111              | 0.0884               | 0.0851                  | 0.5    | 0.5384 |
| Recall    | 0.0476           | 0.5238              | 0.6190               | 0.7619                  | 0.5238 | 0.3333 |
| F-measure | 0.0909           | 0.5641              | 0.1547               | 0.1531                  | 0.5116 | 0.4117 |

100 features selected after using TF
Results (3)

|           | Without sampling | Random oversampling | Random undersampling | One-sided undersampling | SMOTE   | MSMOTE |
|-----------|------------------|---------------------|----------------------|-------------------------|---------|--------|
| Precision | 0.0476           | 0.1777              | 0.0937               | 0.1666                  | 0.16666 | 0.4    |
| Recall    | 0.0476           | 0.38095             | 0.2857               | 0.5238                  | 0.3809  | 0.5714 |
| F-measure | 0.0476           | 0.2424              | 0.1411               | 0.2528                  | 0.2318  | 0.4705 |

500 features selected after using TF
Discussion
• Feature selection improves oversampling.
• Feature selection also improves undersampling recall.
• Adding more features does not always improve the results.
Thank you!