Topic Clustering of Stemmed Transcribed Arabic Broadcast News
Authors: Ahmed Abdelaziz Jafar (O6U)
Prof. Mohamed Waleed Fakhr (AAST)
Prof. Mohamed Hesham Farouk (Cairo University)
Outlines
• Motivations
• Challenges
• Objectives
• Research Procedure
• Experimental Results
• Conclusion
Motivations
• Why Topic Clustering of Transcribed Broadcast News
– The amount of audible news broadcast on TV channels, radio stations, and the Internet is growing rapidly.
– This rapid growth demands reliable and fast techniques to organize and store those vast amounts of news in order to facilitate future processing.
Motivations (cont’d)
• Why Transcribing News:
– News is important, thus archiving it is also important.
– Prepared news stories are carefully edited and highly structured.
• Why Arabic Language:
– Arabic is one of the six most widely spoken languages in the world.
– Automatic transcription and processing of Arabic documents is an active research field due to the language's complex morphological nature.
Challenges
• Speech Transcription Challenges:
– Transcription Errors include:
– Word Deletion Errors
– Word Insertion Errors
– Word Misidentification (Substitution) Errors
– Minor Spelling Errors
– Such errors result mainly from drawbacks of the ASR system.
Challenges (cont’d)
• Speech Transcription Challenges:
– Grammatical Errors:
– Use of grammatically incorrect sentences.
– A common problem in conversational speech.
– Out-of-Vocabulary (OOV) Problem:
– Presence of unknown words that appear in the speech but not in the recognition vocabulary of the ASR.
– The daily growth of natural languages is the main cause of this problem.
– Combinations of the previously mentioned problems.
Objectives
• Achieve automatic topic clustering of transcribed speech documents.
• Overcome the negative effect of some of the transcription errors by using stemming techniques with the aid of a Chi-square-based similarity measure.
Research Procedure
[Research procedure pipeline diagram]
Audio Files → ASR System (Transcription Process) → Transcribed Documents
→ Preprocessing Steps: Tokenization → Stop Words Removal → Words Formatting → Stemming → Weighted Matrix Construction
→ Clustering-Based Topic Identification: Similarity Measure → Clustering Algorithm → N topics identified
Research Procedure (cont’d)
• Transcription Process
• ASR System: Dragon Dictation
– Free application by Nuance Communications, currently available only on the iOS platform.
– Speaker-independent recognizer that supports many languages, including Arabic.
– Open-domain recognizer, hence it does not require writing a grammar.
Research Procedure (cont’d)
• Transcription Process
• ASR System Output (Transcribed Documents)
– 1000 transcribed news stories:
– collected from various Arabic news network broadcasts: Al-Jazeera, Al-Arabiya, and BBC Arabic.
– divided into five general topics: arts and culture, economics, politics, science, and sports.
– The average length of an original audible news story is about two minutes.
Research Procedure (cont’d)
• Transcription Process• ASR System Evaluation
– The ASR system is evaluated using the Word Error Rate (WER), which is commonly used to measure speech recognition accuracy. It is based on the frequency of occurrence of three types of errors: substitutions (S), insertions (I), and deletions (D).
– WER is calculated as follows, where N is the number of reference words:
WER = (S + I + D) / N × 100%
Research Procedure (cont’d)
• Transcription Process• ASR System Evaluation using WER
| #Reference Words | #Substitutions | #Insertions | #Deletions | WER %  |
| 68720            | 17327          | 2105        | 574        | 29.11% |
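The WER figure reported above can be reproduced directly from the error counts; a minimal sketch (the function name is illustrative):

```python
def wer(substitutions: int, insertions: int, deletions: int, reference_words: int) -> float:
    """Word Error Rate as a percentage: WER = (S + I + D) / N * 100."""
    return (substitutions + insertions + deletions) / reference_words * 100

# Counts reported for the full transcribed corpus:
print(round(wer(17327, 2105, 574, 68720), 2))  # -> 29.11
```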
Research Procedure (cont’d)
• Preprocessing
• Impact of the Stop Words Removal Step on the Transcription Errors

| Reference Words | Substitutions | Insertions | Deletions | WER %  |
| 30040           | 3607          | 1831       | 766       | 20.65% |
Research Procedure (cont’d)
• Preprocessing
• Words Formatting
• Unify all the different shapes of the same letter into one form.
• Remove some unwanted suffixes (وا, ا, و) in order to fine-tune the input for the stemming step.
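A minimal sketch of such a words-formatting step. The exact normalization rules used in the study are not given on the slide; the letter mappings below are common Arabic normalizations and are assumptions, apart from the three suffixes the slide mentions:

```python
import re

def normalize(token: str) -> str:
    """Unify letter variants and strip the unwanted suffixes (illustrative rules)."""
    token = re.sub(r"[أإآٱ]", "ا", token)    # unify alef variants to bare alef
    token = re.sub(r"ى", "ي", token)          # alef maqsura -> ya
    token = re.sub(r"ة", "ه", token)          # ta marbuta -> ha
    token = re.sub(r"(وا|ا|و)$", "", token)   # strip the suffixes named on the slide
    return token
```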
Research Procedure (cont’d)
• Preprocessing• Used Stemming Techniques
• Light Stemming: Light stemming does not deal with patterns or infixes; it is simply the process of stripping off prefixes and/or suffixes.
• Root-Based Stemming: Removes suffixes, infixes and prefixes and uses pattern matching to extract the roots.
• Rule-Based Light Stemming: A hybrid technique between light stemming and root-based stemming.
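Light stemming as described above can be sketched as simple affix stripping. The prefix and suffix lists here are assumptions (loosely following Larkey-style light stemmers [3]), not the exact lists used in the study:

```python
# Longest affixes first, so compound prefixes are tried before their parts.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ان", "ين", "ون", "ها", "ية", "ه", "ة", "ي"]

def light_stem(token: str) -> str:
    """Strip one prefix and one suffix, keeping a stem of at least 3 letters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token
```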
Research Procedure (cont’d)
|    | D1   | D2   | D3   | … | Dj   |
| W1 | CW11 | CW12 | CW13 | … | CW1j |
| W2 | CW21 | CW22 | CW23 | … | CW2j |
| W3 | CW31 | CW32 | CW33 | … | CW3j |
| …  | …    | …    | …    | … | …    |
| Wi | CWi1 | CWi2 | CWi3 | … | CWij |
• Preprocessing
• Weighted Matrix Construction
According to the Okapi method [5], the Combined Weight (CW) of word Wi in document Dj is calculated as follows:
CW(i, j) = CFW(i) × TF(i, j) × (K1 + 1) / (K1 × ((1 − b) + b × NDL(j)) + TF(i, j))
where CFW(i) = log(N / n(i)) is the collection frequency weight (N documents in the collection, n(i) containing Wi), TF(i, j) is the term frequency, and NDL(j) is the document length normalized by the average length.
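The Okapi combined weight above can be sketched as follows. The tuning constants k1 = 1.2 and b = 0.75 are common defaults from the Okapi literature, not values stated in this work:

```python
import math

def combined_weight(tf: int, df: int, n_docs: int, doc_len: int,
                    avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi combined weight: CFW * TF * (k1+1) / (k1 * ((1-b) + b*NDL) + TF)."""
    cfw = math.log(n_docs / df)      # collection frequency weight (IDF-like term)
    ndl = doc_len / avg_doc_len      # normalized document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)
```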
Research Procedure (cont’d)
• Topic Identification (Clustering-Based)
– Similarity Measure:
• Chi-square Similarity Measure
• Cosine Similarity Measure
– Clustering Algorithm:
• Basic k-means algorithm
• Spectral clustering algorithm (Shi–Malik)
– Output: N topics identified
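One common chi-square formulation for comparing two term-weight vectors, converted to a similarity score, can be sketched as below. The exact chi-square-based measure used in the thesis is not given on the slides, so this formulation is an assumption:

```python
import numpy as np

def chi_square_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Chi-square distance sum((a-b)^2 / (a+b)) over nonzero bins,
    mapped to a similarity in (0, 1] via 1 / (1 + distance)."""
    denom = a + b
    mask = denom > 0                              # skip bins absent from both vectors
    dist = float(np.sum((a[mask] - b[mask]) ** 2 / denom[mask]))
    return 1.0 / (1.0 + dist)
```

Identical documents get similarity 1.0; increasingly different term distributions push the score toward 0.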
Experimental Results
• Experiments
– Four test scenarios are evaluated:
– without applying stemming
– with light stemming applied
– with root-based stemming applied
– with rule-based stemming applied
– In each scenario, the dataset is divided into smaller subsets of sizes ranging from 50 to 200 documents per topic category.
Experimental Results (cont’d)
• Experiments
– The clustering algorithms are applied to all the subsets in each scenario twice per subset:
– once with the Chi-square similarity measure.
– once with the popular cosine similarity.
– The clustering accuracy is evaluated for each subset, and then the average accuracy is calculated over all the subsets.
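Clustering accuracy against known topic labels is typically computed under the best mapping of cluster ids to labels; a brute-force sketch (fine for the five topics used here, though larger label sets would call for the Hungarian algorithm):

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of cluster ids to topic labels."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for t, c in zip(true_labels, cluster_ids) if mapping[c] == t)
        best = max(best, hits)
    return best / len(true_labels)
```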
Experimental Results (cont’d)
• Results (Transcribed Documents) — Average Accuracy

| Clustering Approach / Similarity Measure | Non-Stemmed | Light-Stemmed | Root-Stemmed | Rule-Stemmed |
| k-Means / Cosine                         | 39.42%      | 44.61%        | 54.41%       | 60.04%       |
| k-Means / Chi-square                     | 44.3%       | 47.6%         | 56.5%        | 63.35%       |
| Spectral Clustering / Cosine             | 45.62%      | 50.96%        | 65.57%       | 71.33%       |
| Spectral Clustering / Chi-square         | 46.5%       | 53.8%         | 68.9%        | 76.11%       |
Experimental Results (cont’d)
• Results (Original Documents) — Average Accuracy

| Clustering Approach / Similarity Measure | Non-Stemmed | Light-Stemmed | Root-Stemmed | Rule-Stemmed |
| k-Means / Cosine                         | 62.2%       | 64.63%        | 68.06%       | 76.84%       |
| k-Means / Chi-square                     | 65.9%       | 67.97%        | 72.84%       | 79.05%       |
| Spectral Clustering / Cosine             | 72.2%       | 74.97%        | 80.77%       | 85.15%       |
| Spectral Clustering / Chi-square         | 74.87%      | 76.85%        | 82.74%       | 87.21%       |
Experimental Results (cont’d)
• Results Evaluation
– By comparing the accuracy results and observing the clustering confusion matrix for each clustering scenario on the original and transcribed data, it is concluded that:
– In both sets of data, there are documents causing clustering confusion.
– The existence of topic overlaps in the original data is the main cause of such confusion.
– The information loss due to transcription errors increases the confusion even further in the transcribed data.
Experimental Results (cont’d)
• Results Evaluation
Confusion Matrices

Sample 1: original text divided into subsets of 200 docs; rule-based stemming applied; spectral clustering applied.

|           | Arts | Economics | Politics | Science | Sports |
| Arts      | 156  | 13        | 21       | 8       | 2      |
| Economics | 25   | 102       | 39       | 22      | 12     |
| Politics  | 3    | 3         | 191      | 2       | 1      |
| Science   | 21   | 9         | 15       | 149     | 6      |
| Sports    | 17   | 5         | 11       | 7       | 160    |
| Total     | 222  | 132       | 277      | 188     | 181    |

Sample 2: transcribed text divided into subsets of 200 docs; rule-based stemming applied; spectral clustering applied.

|           | Arts | Economics | Politics | Science | Sports |
| Arts      | 170  | 10        | 13       | 6       | 1      |
| Economics | 21   | 125       | 33       | 15      | 6      |
| Politics  | 2    | 4         | 193      | 1       | 0      |
| Science   | 16   | 6         | 10       | 167     | 1      |
| Sports    | 9    | 5         | 7        | 2       | 177    |
| Total     | 218  | 150       | 256      | 191     | 185    |
Experimental Results (cont’d)
• Experiments (Phase 2)
– The fuzzy c-means algorithm and the possibilistic Gustafson–Kessel (GK) algorithm are applied to both the transcribed and the original data, and the membership matrix is analyzed to evaluate the number of confusing documents in each topic.
– A document is considered confusing to the clustering process if:
– its membership degrees to all clusters are under a certain predefined threshold, or
– its membership degrees to all clusters are convergent.
– By determining which documents are affecting the clustering accuracy, they can be excluded.
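The two "confusing document" rules above can be sketched as tests on a fuzzy membership matrix U (rows = documents, columns = clusters). The threshold values below are illustrative assumptions; the slides do not state them:

```python
import numpy as np

def confusing_documents(U: np.ndarray, low: float = 0.4, spread: float = 0.1):
    """Indices of documents whose memberships are all below `low`, or whose
    max and min memberships differ by less than `spread` (convergent degrees)."""
    all_low = U.max(axis=1) < low
    convergent = (U.max(axis=1) - U.min(axis=1)) < spread
    return np.where(all_low | convergent)[0].tolist()
```

Documents flagged this way can then be excluded before re-evaluating the clustering accuracy, as described in the Phase 2 results.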
Experimental Results (cont’d)
• Experiments (Phase 2)
Number of confusing documents detected in the original data [figures]:
a) Confusing documents detected by fuzzy c-means
b) Confusing documents detected by the possibilistic GK algorithm
Experimental Results (cont’d)
• Experiments (Phase 2)
Number of confusing documents detected in the transcribed data [figures]:
a) Confusing documents detected by fuzzy c-means
b) Confusing documents detected by the possibilistic GK algorithm
Experimental Results (cont’d)
• Results (Phase 2)
– After excluding the documents flagged by fuzzy c-means, the average clustering accuracy of the remaining data improved to a maximum of 79.34% and 90.52% respectively; after using the possibilistic GK algorithm, it improved to a maximum of 85.62% and 92.26% respectively.
– In both cases, the maximum average accuracy is obtained when spectral clustering is used on rule-based stemmed data.
– Manual categorization can be considered a solution for categorizing the excluded documents.
Conclusion
• Research Contributions:
– Utilizing stemming to overcome the negative effects of some of the transcription errors (misidentification errors) present in the transcribed Arabic text.
– Stemming techniques improved the accuracy of all clustering algorithms applied to the transcribed Arabic documents in all scenarios by an average of 19.7%.
– Rule-based light stemming improved the accuracy of the clustering process by an average of 23.75%.
– Root-based and light stemming improved the accuracy of the clustering process by averages of 17.39% and 5.28% respectively.
Conclusion (cont’d)
• Research Contributions:
– Utilizing the Chi-square similarity measure as a method that aids stemming in eliminating some of the transcription errors present in the transcribed Arabic text.
Conclusion (cont’d)
• The research has shown that:
– Rule-based light stemming improved the accuracy of the clustering process more than the other stemming techniques.
– The spectral clustering algorithm achieved higher accuracy than the k-means algorithm in all cases.
– The Chi-square similarity measure is superior to the popular, traditional cosine similarity, and it is best utilized by the spectral clustering algorithm.
Conclusion (cont’d)
• The research has shown that:
– Applying the fuzzy c-means and the possibilistic GK algorithms to both the transcribed and the original data revealed some characteristics of the data.
– The economics topic has the largest number of confusing documents.
– Arts and science take second and third place in the number of confusing documents.
– The politics topic has the second fewest confusing documents, yet it is the topic that received the most wrongly clustered documents from all other categories.
References
[1] Ibrahim Abu El-Khair, “Effects of stop words elimination for Arabic information retrieval: a comparative study,” International Journal of Computing & Information Sciences, 2006, pp. 119-133.
[2] Eiman Al-Shammari and Jessica Lin, “A novel Arabic lemmatization algorithm,” Proceedings of the second workshop on Analytics for noisy unstructured text data, 2008, pp. 113-118.
[3] L.S. Larkey and M.E. Connell, “Arabic information retrieval at UMass in TREC-10,” Proceedings of the Tenth Text REtrieval Conference (TREC-10), E.M. Voorhees and D.K. Harman, eds., 2001, pp. 562-570.
[4] S. Khoja and R. Garside, “Stemming Arabic text,” Lancaster, UK, Computing Department, Lancaster University, 1999.
[5] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. “Okapi at TREC-3,” Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994. NIST SP 500-225, 1995, 109-126.
[6] Oktay Ibrahimov, Ishwar Sethi, and Nevenka Dimitrova, “A novel similarity based clustering algorithm for grouping broadcast news,” Proc. of SPIE Conf. 'Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, 2002.
[7] Michael Steinbach, George Karypis, and Vipin Kumar, “A comparison of document clustering techniques”, University of Minnesota, 2000.
[8] Ulrike von Luxburg, “A tutorial on spectral clustering,” Springer Statistics and Computing, vol. 17, Issue 4, pp 395-416, December 2007.
[9] Jianbo Shi and Jitendra Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, No. 8, August 2000.
[10] Dragon Dictation Application on iOS, https://itunes.apple.com/us/app/dragon-dictation/id341446764?mt=8.
[11] G. Kanaan, R. Al-Shalabi, M. Ababneh, and A. Al-Nobani, “Building an effective rule-based light stemmer for Arabic language to improve search effectiveness,” Innovations in Information Technology (IIT 2008), International Conference, pp. 312-316.
[12] D.E. Gustafson and W.C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proc. IEEE CDC, pages 761–766, San Diego, CA, USA, 1979.
Thank You
Questions