Topic Clustering of Stemmed Transcribed Arabic Broadcast News
Authors: Ahmed Abdelaziz Jafar (O6U)
Prof. Mohamed Waleed Fakhr (AAST)
Prof. Mohamed Hesham Farouk (Cairo University)
Outlines
• Motivations
• Challenges
• Objectives
• Research Procedure
• Experimental Results
• Conclusion
Motivations
• Why Topic Clustering of Transcribed Broadcast News
– The amount of audible news broadcast on TV channels, radio stations, and the Internet is growing rapidly.
– This rapid growth demands reliable and fast techniques to organize and store those vast amounts of news in order to facilitate future processing.
Motivations (cont’d)
• Why Transcribing News:
– News is important, thus archiving it is also important.
– Prepared news stories are carefully edited and highly structured.
• Why Arabic Language:
– Arabic is one of the six most widely spoken languages in the world.
– Automatic transcription and processing of Arabic documents is an active research field due to the language's complex morphological nature.
Challenges
• Speech Transcription Challenges:
– Transcription Errors include:
– Word Deletion Errors
– Word Insertion Errors
– Word Misidentification (Substitution) Errors
– Minor Spelling Errors
– Such errors result mainly from drawbacks of the ASR system.
Challenges (cont’d)
• Speech Transcription Challenges:
– Grammatical Errors:
– Use of grammatically incorrect sentences.
– A common problem in conversational speech.
– Out-of-Vocabulary (OOV) Problem:
– Presence of unknown words that appear in the speech but not in the recognition vocabulary of the ASR.
– The daily growth of natural languages is the main cause of this problem.
– Combinations of the previously mentioned problems.
Objectives
• Achieve automatic topic clustering of transcribed speech documents.
• Overcome the negative effect of some of the transcription errors by using stemming techniques with the aid of a Chi-square-based similarity measure.
Research Procedure
[Research procedure pipeline diagram]
Audio Files → ASR System (Transcription Process) → Transcribed Documents
→ Preprocessing Steps: Tokenization → Stop Words Removal → Words Formatting → Stemming → Weighted Matrix Construction
→ Clustering-Based Topic Identification: Similarity Measure → Clustering Algorithm → N topics identified
Research Procedure (cont’d)
• Transcription Process
• ASR System: Dragon Dictation
– Free application by Nuance Communications, currently available only on the iOS platform.
– Speaker-independent recognizer that supports many languages, including Arabic.
– Open-domain recognizer, hence it does not require writing a grammar.
Research Procedure (cont’d)
• Transcription Process
• ASR System Output (Transcribed Documents)
– 1000 transcribed news stories:
– collected from various Arabic news network broadcasts: Al-Jazeera, Al-Arabiya, and BBC Arabic.
– divided into five general topics: arts and culture, economics, politics, science, and sports.
– The average length of an original audible news story is about two minutes.
Research Procedure (cont’d)
• Transcription Process• ASR System Evaluation
– The ASR system is evaluated using the Word Error Rate (WER), which is commonly used to measure speech recognition accuracy. It is based on the frequency of occurrence of three types of errors: substitutions (S), insertions (I), and deletions (D).
– WER is calculated as follows, where N is the number of reference words:
WER = (S + I + D) / N × 100%
Research Procedure (cont’d)
• Transcription Process• ASR System Evaluation using WER
| #Reference Words | #Substitutions | #Insertions | #Deletions | WER %  |
| 68720            | 17327          | 2105        | 574        | 29.11% |
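The WER figure reported above can be reproduced directly from the error counts; a minimal sketch (the function name is illustrative):

```python
def wer(substitutions: int, insertions: int, deletions: int, reference_words: int) -> float:
    """Word Error Rate as a percentage: WER = (S + I + D) / N * 100."""
    return (substitutions + insertions + deletions) / reference_words * 100

# Counts reported for the full transcribed corpus:
print(round(wer(17327, 2105, 574, 68720), 2))  # -> 29.11
```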
Research Procedure (cont’d)
• Preprocessing
• Impact of the Stop Words Removal Step on the Transcription Errors

| Reference Words | Substitutions | Insertions | Deletions | WER %  |
| 30040           | 3607          | 1831       | 766       | 20.65% |
Research Procedure (cont’d)
• Preprocessing
• Words Formatting
• Unify all the different shapes of the same letter into one form.
• Remove some unwanted suffixes (وا, ا, و) in order to fine-tune the input for the stemming step.
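A minimal sketch of such a words-formatting step. The exact normalization rules used in the study are not given on the slide; the letter mappings below are common Arabic normalizations and are assumptions, apart from the three suffixes the slide mentions:

```python
import re

def normalize(token: str) -> str:
    """Unify letter variants and strip the unwanted suffixes (illustrative rules)."""
    token = re.sub(r"[أإآٱ]", "ا", token)    # unify alef variants to bare alef
    token = re.sub(r"ى", "ي", token)          # alef maqsura -> ya
    token = re.sub(r"ة", "ه", token)          # ta marbuta -> ha
    token = re.sub(r"(وا|ا|و)$", "", token)   # strip the suffixes named on the slide
    return token
```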
Research Procedure (cont’d)
• Preprocessing• Used Stemming Techniques
• Light Stemming: Light stemming does not deal with patterns or infixes; it is simply the process of stripping off prefixes and/or suffixes.
• Root-Based Stemming: Removes suffixes, infixes and prefixes and uses pattern matching to extract the roots.
• Rule-Based Light Stemming: A hybrid technique between light stemming and root-based stemming.
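Light stemming as described above can be sketched as simple affix stripping. The prefix and suffix lists here are assumptions (loosely following Larkey-style light stemmers [3]), not the exact lists used in the study:

```python
# Longest affixes first, so compound prefixes are tried before their parts.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ان", "ين", "ون", "ها", "ية", "ه", "ة", "ي"]

def light_stem(token: str) -> str:
    """Strip one prefix and one suffix, keeping a stem of at least 3 letters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token
```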
Research Procedure (cont’d)
|    | D1   | D2   | D3   | … | Dj   |
| W1 | CW11 | CW12 | CW13 | … | CW1j |
| W2 | CW21 | CW22 | CW23 | … | CW2j |
| W3 | CW31 | CW32 | CW33 | … | CW3j |
| …  | …    | …    | …    | … | …    |
| Wi | CWi1 | CWi2 | CWi3 | … | CWij |
• Preprocessing
• Weighted Matrix Construction
According to the Okapi method [5], the Combined Weight (CW) of word Wi in document Dj is calculated as follows:
CW(i, j) = CFW(i) × TF(i, j) × (K1 + 1) / (K1 × ((1 − b) + b × NDL(j)) + TF(i, j))
where CFW(i) = log(N / n(i)) is the collection frequency weight (N documents in the collection, n(i) containing Wi), TF(i, j) is the term frequency, and NDL(j) is the document length normalized by the average length.
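The Okapi combined weight above can be sketched as follows. The tuning constants k1 = 1.2 and b = 0.75 are common defaults from the Okapi literature, not values stated in this work:

```python
import math

def combined_weight(tf: int, df: int, n_docs: int, doc_len: int,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi combined weight: CFW * TF * (k1+1) / (k1 * ((1-b) + b*NDL) + TF)."""
    cfw = math.log(n_docs / df)      # collection frequency weight (IDF-like term)
    ndl = doc_len / avg_doc_len      # normalized document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)
```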
Research Procedure (cont’d)
• Topic Identification (Clustering-Based)
– Similarity Measure:
• Chi-square Similarity Measure
• Cosine Similarity Measure
– Clustering Algorithm:
• Basic k-means algorithm
• Spectral clustering algorithm (Shi–Malik)
– Output: N topics identified
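One common chi-square formulation for comparing two term-weight vectors, converted to a similarity score, can be sketched as below. The exact chi-square-based measure used in the thesis is not given on the slides, so this formulation is an assumption:

```python
import numpy as np

def chi_square_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Chi-square distance sum((a-b)^2 / (a+b)) over nonzero bins,
    mapped to a similarity in (0, 1] via 1 / (1 + distance)."""
    denom = a + b
    mask = denom > 0                              # skip bins absent from both vectors
    dist = float(np.sum((a[mask] - b[mask]) ** 2 / denom[mask]))
    return 1.0 / (1.0 + dist)
```

Identical documents get similarity 1.0; increasingly different term distributions push the score toward 0.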
Experimental Results
• Experiments
– Four test scenarios are evaluated:
– without applying stemming
– with light stemming applied
– with root-based stemming applied
– with rule-based stemming applied
– In each scenario, the dataset is divided into smaller subsets of sizes ranging from 50 to 200 documents per topic category.
Experimental Results (cont’d)
• Experiments
– The clustering algorithms are applied to all the subsets in each scenario twice per subset:
– once with the Chi-square similarity measure.
– once with the popular cosine similarity.
– The clustering accuracy is evaluated for each subset, and then the average accuracy is calculated over all the subsets.
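Clustering accuracy against known topic labels is typically computed under the best mapping of cluster ids to labels; a brute-force sketch (fine for the five topics used here, though larger label sets would call for the Hungarian algorithm):

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of cluster ids to topic labels."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_ids) if mapping[c] == t)
        best = max(best, hits)
    return best / len(true_labels)
```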
Experimental Results (cont’d)
• Results (Transcribed Documents) — Average Accuracy

| Clustering Approach / Similarity Measure | Non-Stemmed | Light-Stemmed | Root-Stemmed | Rule-Stemmed |
| k-Means / Cosine                         | 39.42%      | 44.61%        | 54.41%       | 60.04%       |
| k-Means / Chi-square                     | 44.3%       | 47.6%         | 56.5%        | 63.35%       |
| Spectral Clustering / Cosine             | 45.62%      | 50.96%        | 65.57%       | 71.33%       |
| Spectral Clustering / Chi-square         | 46.5%       | 53.8%         | 68.9%        | 76.11%       |
Experimental Results (cont’d)
• Results (Original Documents) — Average Accuracy

| Clustering Approach / Similarity Measure | Non-Stemmed | Light-Stemmed | Root-Stemmed | Rule-Stemmed |
| k-Means / Cosine                         | 62.2%       | 64.63%        | 68.06%       | 76.84%       |
| k-Means / Chi-square                     | 65.9%       | 67.97%        | 72.84%       | 79.05%       |
| Spectral Clustering / Cosine             | 72.2%       | 74.97%        | 80.77%       | 85.15%       |
| Spectral Clustering / Chi-square         | 74.87%      | 76.85%        | 82.74%       | 87.21%       |
Experimental Results (cont’d)
• Results Evaluation
– By comparing the accuracy results and observing the clustering confusion matrix for each clustering scenario on the original and transcribed data, it is concluded that:
– In both sets of data, there are documents causing clustering confusion.
– The existence of topic overlaps in the original data is the main cause of such confusion.
– The information loss due to transcription errors increases the confusion even further in the transcribed data.
Experimental Results (cont’d)
• Results Evaluation
Confusion Matrices

Sample 1: original text divided into subsets of 200 docs; rule-based stemming applied; spectral clustering applied.

|           | Arts | Economics | Politics | Science | Sports |
| Arts      | 156  | 13        | 21       | 8       | 2      |
| Economics | 25   | 102       | 39       | 22      | 12     |
| Politics  | 3    | 3         | 191      | 2       | 1      |
| Science   | 21   | 9         | 15       | 149     | 6      |
| Sports    | 17   | 5         | 11       | 7       | 160    |
| Total     | 222  | 132       | 277      | 188     | 181    |

Sample 2: transcribed text divided into subsets of 200 docs; rule-based stemming applied; spectral clustering applied.

|           | Arts | Economics | Politics | Science | Sports |
| Arts      | 170  | 10        | 13       | 6       | 1      |
| Economics | 21   | 125       | 33       | 15      | 6      |
| Politics  | 2    | 4         | 193      | 1       | 0      |
| Science   | 16   | 6         | 10       | 167     | 1      |
| Sports    | 9    | 5         | 7        | 2       | 177    |
| Total     | 218  | 150       | 256      | 191     | 185    |
Experimental Results (cont’d)
• Experiments (Phase 2)
– The fuzzy c-means algorithm and the possibilistic Gustafson–Kessel (GK) algorithm are applied to both the transcribed and the original data, and the membership matrix is analyzed to evaluate the number of confusing documents in each topic.
– A document is considered confusing to the clustering process if:
– its membership degrees to all clusters are under a certain predefined threshold, or
– its membership degrees to all clusters are convergent.
– By determining which documents are affecting the clustering accuracy, they can be excluded.
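The two "confusing document" rules above can be sketched as tests on a fuzzy membership matrix U (rows = documents, columns = clusters). The threshold values below are illustrative assumptions; the slides do not state them:

```python
import numpy as np

def confusing_documents(U: np.ndarray, low: float = 0.4, spread: float = 0.1):
    """Indices of documents whose memberships are all below `low`, or whose
    max and min memberships differ by less than `spread` (convergent degrees)."""
    all_low = U.max(axis=1) < low
    convergent = (U.max(axis=1) - U.min(axis=1)) < spread
    return np.where(all_low | convergent)[0].tolist()
```

Documents flagged this way can then be excluded before re-evaluating the clustering accuracy, as described in the Phase 2 results.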
Experimental Results (cont’d)
• Experiments (Phase 2)
Number of confusing documents detected in the original data [figures]:
a) Confusing documents detected by fuzzy c-means
b) Confusing documents detected by the possibilistic GK algorithm
Experimental Results (cont’d)
• Experiments (Phase 2)
Number of confusing documents detected in the transcribed data [figures]:
a) Confusing documents detected by fuzzy c-means
b) Confusing documents detected by the possibilistic GK algorithm
Experimental Results (cont’d)
• Results (Phase 2)
– After excluding the documents flagged by fuzzy c-means, the average clustering accuracy of the remaining data improved to a maximum of 79.34% and 90.52% respectively; after using the possibilistic GK algorithm, it improved to a maximum of 85.62% and 92.26% respectively.
– In both cases, the maximum average accuracy is obtained when spectral clustering is used on rule-based stemmed data.
– Manual categorization can be considered a solution for categorizing the excluded documents.
Conclusion
• Research Contributions:
– Utilizing stemming to overcome the negative effects of some of the transcription errors (misidentification errors) present in the transcribed Arabic text.
– Stemming techniques improved the accuracy of all clustering algorithms applied to the transcribed Arabic documents in all scenarios by an average of 19.7%.
– Rule-based light stemming improved the accuracy of the clustering process by an average of 23.75%.
– Root-based and light stemming improved the accuracy of the clustering process by averages of 17.39% and 5.28% respectively.
Conclusion (cont’d)
• Research Contributions:
– Utilizing the Chi-square similarity measure as a method that aids stemming in eliminating some of the transcription errors present in the transcribed Arabic text.
Conclusion (cont’d)
• The research has shown that:
– Rule-based light stemming improved the accuracy of the clustering process more than the other stemming techniques.
– The spectral clustering algorithm achieved higher accuracy than the k-means algorithm in all cases.
– The Chi-square similarity measure is superior to the popular, traditional cosine similarity, and it is best utilized by the spectral clustering algorithm.
Conclusion (cont’d)
• The research has shown that:
– Applying the fuzzy c-means and the possibilistic GK algorithms to both the transcribed and the original data revealed some characteristics of the data.
– The economics topic has the largest number of confusing documents.
– Arts and science take second and third place in the number of confusing documents.
– The politics topic has the second fewest confusing documents, yet it is the topic that received the most wrongly clustered documents from all other categories.
References
[1] Ibrahim Abu El-Khair, “Effects of stop words elimination for Arabic information retrieval: a comparative study,” International Journal of Computing & Information Sciences, 2006, pp. 119-133.
[2] Eiman Al-Shammari and Jessica Lin, “A novel Arabic lemmatization algorithm,” Proceedings of the second workshop on Analytics for noisy unstructured text data, 2008, pp. 113-118.
[3] L.S. Larkey and M.E. Connell, “Arabic information retrieval at UMass in TREC-10,” Proceedings of the Tenth Text REtrieval Conference (TREC-10), E.M. Voorhees and D.K. Harman, eds., 2001, pp. 562-570.
[4] S. Khoja and R. Garside, “Stemming Arabic text,” Lancaster, UK, Computing Department, Lancaster University, 1999.
[5] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. “Okapi at TREC-3,” Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994. NIST SP 500-225, 1995, 109-126.
[6] Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova, “A novel similarity based clustering algorithm for grouping broadcast news,” Proc. of SPIE Conf. 'Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, 2002.
[7] Michael Steinbach, George Karypis, and Vipin Kumar, “A comparison of document clustering techniques”, University of Minnesota, 2000.
[8] Ulrike von Luxburg, “A tutorial on spectral clustering,” Springer Statistics and Computing, vol. 17, Issue 4, pp 395-416, December 2007.
[9] Jianbo Shi and Jitendra Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, No. 8, August 2000.
[10] Dragon Dictation Application on iOS, https://itunes.apple.com/us/app/dragon-dictation/id341446764?mt=8.
[11] G. Kanaan, R. Al-Shalabi, M. Ababneh, and A. Al-Nobani, “Building an effective rule-based light stemmer for Arabic language to improve search effectiveness,” Innovations in Information Technology (IIT 2008), International Conference, pp. 312-316.
[12] D.E. Gustafson and W.C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proc. IEEE CDC, pages 761–766, San Diego, CA, USA, 1979.
Thank You
Questions