[ieee education (iccse 2011) - singapore, singapore (2011.08.3-2011.08.5)] 2011 6th international...

Text Cohesion Visualizer

Chakarida Nukoolkit, Praewphan Chansripiboon, Pornchai Mongkolnam, Richard Watson Todd*

Computer Science Program, School of Information Technology
School of Liberal Arts*
King Mongkut’s University of Technology Thonburi
Bangkok, 10140 Thailand

Abstract — In this paper, we describe the concept and design of a novel visualization tool to aid in academic writing of English as a Second Language students. The tool makes use of theory in classroom discourse, WordNet API, and linguistics rules given by a linguistics expert, by analyzing English language essays for their linguistic bond counts and links within and between paragraphs. These linguistic indicators reveal the structure and flow of essays, clusters of main ideas, as well as incoherent sentences, which obstruct essay unity. The lack of essay unity is one of the most common writing errors of English as a Second Language learners. The output of the system is shown as several kinds of visualizations that provide writing feedback to users, as well as an autocorrect functionality to improve essay unity. Novice English learners may benefit greatly from this system.

Index Terms — Text cohesion, English writing, Visualization, Text mining

I. INTRODUCTION

For learners of English as a second language, essay writing is one of the greatest difficulties. Many cannot yet think in English at the level an essay requires, so they prefer to translate a passage from their native language into English. Furthermore, they may include everything that comes to mind at the moment simply to fill space, resulting in a lack of unity within the piece of writing. In this paper, we describe the concept and design of a prototype system that analyzes the lexical cohesion of essays based on their structure and provides visualized output as writing feedback to users.

II. RELATED WORK

English-language writing aids are not an entirely new concept. Several writing aids, such as spelling and grammar checkers and thesauruses, are built into typical word processing software. However, these tools tend to address the common writing problems of native English language users [1] and focus on surface features of language that are easily identifiable by computers. There is clearly a need for writing aids that address more global issues for learners of English. One such issue is organization of writing, a crucial issue in producing quality writing on which learners receive little feedback from either computer writing aids or teachers [2]. Writing organization involves text cohesion, a problem for many learners of English whose writing abruptly jumps back and forth between separate ideas, or fails to properly connect separate ideas. Previous work in developing English writing aids, such as [3], provides minimal coverage of global writing issues and does not make use of today’s visualization technology. Therefore, to improve essay cohesion, text cohesion must be analyzed and visualized in more detail.

Cohesion in text shows the relationships between words and forms the organization of ideas that creates a logical flow throughout an essay. The concept of using lexical cohesion in text analysis is not novel; it is one of the techniques used in natural language processing [4-6]. There are some applications based on lexical cohesion. For instance, the work in [7] measures cohesion through quantitative indicators and uses it to segment narratives into coherent scenes. The work in [8, 9] is an interdisciplinary effort in computational linguistics that performs lexical cohesion analysis using the WordNet lexical database.

In our proposed system, we base our text cohesion analysis upon the work in [10, 11], which analyzes the lexical ties in discourse. The analysis process starts by identifying sentences as the unit of analysis. It then identifies reiterations of lexical terms and creates the links between sentences. After that, it identifies bonded sentences and clusters. However, our work aims to aid English essay writing; therefore, we adapt the previous work in [11] by augmenting the process with additional linguistic rules given by an expert in teaching English to second-language learners. After the bonded sentences and clusters are identified, re-representing the findings in a new form helps users understand the cohesion analysis results better. This feature has been implemented in our output visualization.

A similar recent work that uses visualization to represent paragraph closeness for academic writing support can be found in [12]. This work displays paragraph closeness or topic flow in a circular grid map, with arrows pointing from the current paragraph to the next paragraph. However, this tool is aimed at providing only a general assessment of essay writing, and leaves most of the work of essay improvement to the user. In contrast, our proposed system also provides autocorrection features wherever possible in the essay being analyzed. This should be a major additional advantage to learners of English as a second language. In addition, we have taken special care to design a visualization that is effective and allows users to easily detect cohesion errors down to the sentence level; we use a meaningful color scheme of text to represent the different types and degrees of discovered text cohesion problems. The importance of using appropriate text colors for better text understanding has been suggested in the earlier work of [13].

978-1-4244-9718-8/11/$26.00 ©2011 IEEE

The 6th International Conference on Computer Science & Education (ICCSE 2011), August 3-5, 2011. SuperStar Virgo, Singapore

We discuss the work flow and methodology of the proposed system in the following section.

III. METHODOLOGY

The Text Cohesion Visualizer system is implemented as a desktop application in the Java programming language, using the Stanford Part Of Speech (POS) Tagger API [14] and the API of the WordNet lexical database [15]. The system architecture is shown in Fig. 1.

Figure 1: Text Cohesion Visualizer system architecture

In the work flow of the system, the system receives input text from the user, splits the given text into paragraphs and sentences (preprocessing), matches keywords, creates a bond table, analyzes the bond table, and lastly provides visualized outputs. The system also has an autocorrect function that adjusts the text as recommended by the analysis results, for example by removing unnecessary sentences.

A. Preprocessing

After the program splits text into sentences, with the use of the POS Tagger, all the words in each sentence are tagged for their part of speech. Once we know the part of speech of each word, we can eliminate the words that are not a noun, pronoun, verb, adjective, or adverb. Auxiliary verbs are also removed. However, the program needs to keep two sets of text: one for processing, and another, identical to the original text, for displaying results to the user.
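The filtering step can be illustrated with a minimal sketch. It assumes the tokens have already been tagged and that the tags follow the Penn Treebank convention (NN*, PRP*, VB*, JJ*, RB*); the class name `ContentWordFilter`, the `{word, tag}` pair representation, and the auxiliary-verb list are hypothetical and are not taken from the system described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ContentWordFilter {
    // Hypothetical auxiliary list; the paper does not specify one.
    // Simplification: these forms are dropped even when used as main verbs.
    private static final Set<String> AUXILIARIES = Set.of(
            "be", "am", "is", "are", "was", "were", "been", "being",
            "have", "has", "had", "do", "does", "did");

    /** True for tags of nouns, pronouns, verbs, adjectives, and adverbs (Penn Treebank prefixes assumed). */
    static boolean isContentTag(String tag) {
        return tag.startsWith("NN") || tag.startsWith("PRP")
                || tag.startsWith("VB") || tag.startsWith("JJ") || tag.startsWith("RB");
    }

    /** Each token is a {word, tag} pair; returns the lower-cased words kept for processing. */
    static List<String> contentWords(List<String[]> tagged) {
        List<String> kept = new ArrayList<>();
        for (String[] token : tagged) {
            String word = token[0].toLowerCase(Locale.ROOT);
            if (isContentTag(token[1]) && !AUXILIARIES.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> tagged = List.of(
                new String[] {"The", "DT"}, new String[] {"students", "NNS"},
                new String[] {"are", "VBP"}, new String[] {"writing", "VBG"},
                new String[] {"essays", "NNS"}, new String[] {"quickly", "RB"});
        System.out.println(contentWords(tagged)); // [students, writing, essays, quickly]
    }
}
```

The input list is left untouched, which mirrors the requirement to keep the original text for display alongside the filtered text for processing.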

B. Matching keywords

In the matching keywords step, we count the number of matched words (links) between any two sentences and record the data in an internal matrix. There are four types of matching: repetition – match exactly the same word; complex repetition (or word families) – match the words that share a common base with different prefixes and suffixes; paraphrase (synonyms and hypernyms) – which can be done through WordNet API; and pronoun.
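A rough sketch of link counting follows. It is not the system's implementation: `roughBase` is a crude suffix stripper standing in for word-family matching, the `paraphrase` predicate is a hypothetical placeholder for a WordNet synonym or hypernym lookup, pronoun matching is omitted, and counting at most one link per distinct word of the earlier sentence is an assumption.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.BiPredicate;

final class LinkCounter {
    /** Crude stand-in for word-family matching: strips one common suffix. */
    static String roughBase(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s", "ly"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Repetition, complex repetition (approximated), or paraphrase (supplied by the caller). */
    static boolean matches(String a, String b, BiPredicate<String, String> paraphrase) {
        return a.equals(b) || roughBase(a).equals(roughBase(b)) || paraphrase.test(a, b);
    }

    /** Counts each distinct word of s1 at most once if it matches some word of s2. */
    static int countLinks(List<String> s1, List<String> s2, BiPredicate<String, String> paraphrase) {
        int links = 0;
        for (String a : new LinkedHashSet<>(s1)) {
            for (String b : s2) {
                if (matches(a, b, paraphrase)) {
                    links++;
                    break;
                }
            }
        }
        return links;
    }

    /** Builds the link table; the count from the earlier sentence is mirrored to keep it symmetric. */
    static int[][] linkTable(List<List<String>> sentences, BiPredicate<String, String> paraphrase) {
        int n = sentences.size();
        int[][] links = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                links[i][j] = countLinks(sentences.get(i), sentences.get(j), paraphrase);
                links[j][i] = links[i][j];
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = List.of(
                List.of("students", "write", "essays"),
                List.of("student", "essay", "improves"),
                List.of("weather", "sunny"));
        int[][] links = linkTable(sentences, (a, b) -> false); // no paraphrase source plugged in
        System.out.println(links[0][1] + " " + links[0][2] + " " + links[1][2]);
    }
}
```

Passing the paraphrase check as a predicate keeps the sketch independent of any particular WordNet library.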

C. Creating bond table

As the result of the matching keywords step, an internal matrix (link table) is created. This matrix is then used to generate the bond table, which is the key to text cohesion analysis. The bond table is, in other words, a table of Boolean values indicating whether or not there is a bond between sentences. We form the bond table by comparing the number of links between any two sentences against the threshold value (the minimum number of links necessary to establish a bond between sentences). The process of converting links to bonds is shown in Fig. 2.
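The conversion from link table to bond table can be sketched as below. The class and method names are hypothetical; the only rule taken from the text is that a bond exists when the link count reaches the threshold. Treating a sentence as never bonded to itself is an assumption.

```java
import java.util.Arrays;

final class BondTable {
    /** bonds[i][j] is true when sentences i and j share at least `threshold` links. */
    static boolean[][] fromLinks(int[][] links, int threshold) {
        int n = links.length;
        boolean[][] bonds = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                // Assumption: the diagonal is always false (no self-bonds).
                bonds[i][j] = i != j && links[i][j] >= threshold;
            }
        }
        return bonds;
    }

    public static void main(String[] args) {
        int[][] links = {
                {0, 3, 1},
                {3, 0, 2},
                {1, 2, 0}};
        System.out.println(Arrays.deepToString(fromLinks(links, 2)));
        System.out.println(Arrays.deepToString(fromLinks(links, 3)));
    }
}
```

With the sample link table, a threshold of 2 bonds sentence pairs (0, 1) and (1, 2), while a threshold of 3 leaves only the bond between sentences 0 and 1.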

Figure 2: The process of converting link to bond



IV. RESULTS AND DISCUSSION

After creating an internal bond table, the system also provides output visualization of the bond table to the user, as shown in Fig. 3. The visualization lets users mouse over the sentence number and see the corresponding original sentences. In accordance with the linguistic expert’s suggestion, there are six types of sentences that the system can interpret from the bond table, as shown in the variety of text colors in Fig. 4.

Figure 3: Output – Visualization of a bond table, linking to original sentences.

Figure 4: Main Output – Visualization of highlighted paragraphs.

In Fig. 4, which shows the main output of the Text Cohesion Visualizer, sentences that need improvement are highlighted with different colors according to their error type. The meaning of each color is as follows:

• Red – This sentence appears to have no bond at all with other sentences in this text. So, it should be removed from this text.

• Cyan – This type of sentence is at the end of a paragraph. It appears to have bonds with some sentences in the paragraph that follows it, but no bonds with other sentences in the same paragraph. Therefore, it can likely be moved to the beginning of the next paragraph.

• Pink – As opposed to Cyan, this type of sentence is at the beginning of a paragraph. It appears to have bonds with sentences in the paragraph that comes before it, but no bonds with other sentences in the same paragraph. The suggestion for this type is to move it to the end of the preceding paragraph.

• Yellow – This is a special case. We consider the whole paragraph to be one unit. If there is no bond between sentences in any two paragraphs, those paragraphs are highlighted in yellow. It means two seemingly separate, unconnected topics are in the same text. The program cannot correct this type of problem automatically. Correction depends on the writer’s preference. To improve it, the writer can either rewrite one paragraph or add text linking the two topics/paragraphs.

• Magenta – It has a similar meaning to the Red-highlighted sentences, but with a smaller scope. A highlight in this color indicates that the sentence appears to have no bond with other sentences in the same paragraph, but still has bonds with sentences in other paragraphs. The writer can move it to another paragraph in the text or simply delete it from the current paragraph.

• Blue – This color indicates a new topic has been introduced. It means that the sentence appears to have no bond with any preceding sentences. It suggests that the writer thought of and inserted the new idea immediately, without concern for whether it is related to the preceding text. It may be improved by adding some linking words.

In the special case of discovering isolated clusters or paragraphs, the whole paragraph will be highlighted in yellow. Normally, two paragraphs are highlighted. The user can then see that these two paragraphs are not related to each other. However, there may be cases in which three paragraphs of a single text are highlighted. In such cases, it is possible that one paragraph is not related to the other two paragraphs, while the other two paragraphs are related to each other. If that happens, how can users know which paragraphs are related to each other and which are not? The solution is to generate a node graph as shown in Fig. 5.
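The sentence-level color rules above can be sketched as a classification over the bond table. This is an illustration under stated assumptions, not the system's code: the names `SentenceClassifier`, `Color`, and `paragraphOf` are hypothetical, the order in which the rules are tested (Red, Cyan, Pink, Magenta, Blue) is an assumption because the text does not give a precedence, and the first sentence of the text is never marked Blue. Yellow is a paragraph-level case and is not handled here.

```java
import java.util.Arrays;

final class SentenceClassifier {
    enum Color { NONE, RED, CYAN, PINK, MAGENTA, BLUE }

    /**
     * bonds: symmetric sentence bond table; paragraphOf[i]: paragraph index of sentence i,
     * assumed to be non-decreasing and to increase by one at each paragraph boundary.
     */
    static Color[] classify(boolean[][] bonds, int[] paragraphOf) {
        int n = bonds.length;
        Color[] result = new Color[n];
        for (int i = 0; i < n; i++) {
            boolean same = false, other = false, prev = false, next = false, earlier = false;
            for (int j = 0; j < n; j++) {
                if (j == i || !bonds[i][j]) continue;
                int distance = paragraphOf[j] - paragraphOf[i];
                if (distance == 0) same = true; else other = true;
                if (distance == -1) prev = true;   // bond with the preceding paragraph
                if (distance == 1) next = true;    // bond with the following paragraph
                if (j < i) earlier = true;         // bond with any preceding sentence
            }
            boolean first = i == 0 || paragraphOf[i - 1] != paragraphOf[i];
            boolean last = i == n - 1 || paragraphOf[i + 1] != paragraphOf[i];

            if (!same && !other) result[i] = Color.RED;
            else if (!same && last && next) result[i] = Color.CYAN;
            else if (!same && first && prev) result[i] = Color.PINK;
            else if (!same) result[i] = Color.MAGENTA;
            else if (i > 0 && !earlier) result[i] = Color.BLUE;
            else result[i] = Color.NONE;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] paragraphOf = {0, 0, 0, 1, 1, 1};
        boolean[][] bonds = new boolean[6][6];
        int[][] pairs = {{0, 1}, {2, 3}, {3, 4}};
        for (int[] p : pairs) {
            bonds[p[0]][p[1]] = true;
            bonds[p[1]][p[0]] = true;
        }
        System.out.println(Arrays.toString(classify(bonds, paragraphOf)));
    }
}
```

In the sample, sentence 2 closes the first paragraph and bonds only with the next paragraph, so it falls under the Cyan rule, and sentence 5 has no bonds at all, so it falls under the Red rule.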

Figure 5: Output – Visualization of a paragraph bond.


For the paragraph bond, the node graph illustrates the relationships between yellow-highlighted paragraphs only. A node represents a paragraph, while a line indicates a bond between sentences of the two paragraphs. The absence of a line between any two nodes means there is no bond between those two paragraphs. We need this kind of output because more than two paragraphs in a text may be highlighted in yellow, and some of them may still be bonded to each other. As Fig. 5 shows, P2 (Paragraph 2) has no bond with P8 or P10, while P8 and P10 are bonded to each other.

In addition, the proposed system lets users adjust the link threshold value, as shown in Fig. 6. The higher the link threshold, the greater the number of links required for a bond, so more highlighted text is generated as the threshold is increased.
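The paragraph-level view behind the node graph of Fig. 5 can be sketched by lifting the sentence bond table to paragraphs: two paragraphs are connected when any sentence of one is bonded to any sentence of the other. The names `ParagraphBonds`, `fromSentenceBonds`, and `unbondedPairs` are hypothetical, and this is only one plausible reading of the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ParagraphBonds {
    /** pb[p][q] is true when some sentence in paragraph p is bonded to some sentence in paragraph q. */
    static boolean[][] fromSentenceBonds(boolean[][] bonds, int[] paragraphOf, int paragraphCount) {
        boolean[][] pb = new boolean[paragraphCount][paragraphCount];
        for (int i = 0; i < bonds.length; i++) {
            for (int j = 0; j < bonds.length; j++) {
                int p = paragraphOf[i];
                int q = paragraphOf[j];
                if (bonds[i][j] && p != q) {
                    pb[p][q] = true;
                }
            }
        }
        return pb;
    }

    /** Paragraph pairs {p, q} with p < q and no bond: candidates for yellow highlighting. */
    static List<int[]> unbondedPairs(boolean[][] pb) {
        List<int[]> pairs = new ArrayList<>();
        for (int p = 0; p < pb.length; p++) {
            for (int q = p + 1; q < pb.length; q++) {
                if (!pb[p][q]) {
                    pairs.add(new int[] {p, q});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Three one-sentence paragraphs; only the last two are bonded.
        boolean[][] bonds = new boolean[3][3];
        bonds[1][2] = true;
        bonds[2][1] = true;
        boolean[][] pb = fromSentenceBonds(bonds, new int[] {0, 1, 2}, 3);
        for (int[] pair : unbondedPairs(pb)) {
            System.out.println(Arrays.toString(pair));
        }
    }
}
```

The sample mirrors the situation in Fig. 5: the first paragraph has no bond with the other two, which are bonded to each other. Because the sentence bond table depends on the link threshold, rebuilding it with a higher threshold before this step can only remove paragraph bonds, never add them.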

Figure 6: Change in output visualization according to the change of threshold value

V. CONCLUSION AND FUTURE WORK

We applied the concept of text cohesion analysis in our Text Cohesion Visualizer prototype system in order to measure the quality of text and provide feedback for the academic writing of English as a Second Language learners. Overall, the proposed application can detect the cohesion errors in text correctly as experts indicate. With the use of intuitive visualized outputs showing highlighted paragraphs, bond tables, paragraph bonds, text clusters, and cohesion flow, the system helps students know which parts of their essay have cohesion errors, and for what reasons. This leads to improvement of the student’s writing, which is the objective of this system. However, the program is not 100-percent correct, because interpretation of each essay is subjective, and there are some limitations in the step of matching keywords, but its accuracy is at an acceptable level according to expert opinion.

In future work, we first plan to improve the process of matching keywords for more accurate results by augmenting the existing process with more specific linguistic rules. Second, we plan to further develop this prototype system as a web application so it can be easily accessed by a large number of users. Finally, we need to perform a more thorough usability evaluation of the proposed system.

ACKNOWLEDGMENT

The authors would like to acknowledge Mr. Anthony French for proofreading, and the School of Information Technology and the School of Liberal Arts, KMUTT, for supporting this work.

REFERENCES

[1] Hsien-Chin Liou, “Investigation of using text-critiquing programs in a process-oriented writing class”, CALICO Journal, Volume 10, Issue 4, pp. 17-38, 1993.
[2] Lesa A. Stern, Amanda Solomon, “Effective faculty feedback: the road less traveled”, Assessing Writing, Volume 11, Issue 1, pp. 22-41, 2006.
[3] Etienne Cornu, Natalie Kubler, Franck Bodmer, et al., “Prototype of a Second Language Writing Tool for French Speakers Writing in English”, Journal of Natural Language Engineering, Volume 2, Issue 3, pp. 211-228, 1996.
[4] Aleksander Szwedek, “Lexical Cohesion in Text Analysis”, Journal of Poznan Studies in Contemporary Linguistics, Volume 11, pp. 95-100, 1980.
[5] Jane Morris, Graeme Hirst, “Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text”, Journal of Computational Linguistics, Volume 17, Issue 1, pp. 21-48, March 1991.
[6] Ildiko Berzlanovich, “Lexical Cohesion and the Organization of Discourse”, First year Ph.D Progress report, University of Groningen, 2008.
[7] Hideki Kozima, “Computing Lexical Cohesion as a Tool for Text Analysis”, Ph.D Thesis, University of Electro-Communications, Japan, 1993.
[8] Elke Teich, Peter Fankhauser, “WordNet for Lexical Cohesion Analysis”, Proceedings of the Second International WordNet Conference, pp. 326-331, 2004.
[9] Elke Teich, Peter Fankhauser, “Exploring Lexical Patterns in Text: Lexical Cohesion Analysis with WordNet”, Proceedings of 2005 International Conference of Interdisciplinary Studies on Information Structure, pp. 129-145, 2005.
[10] Michael Hoey, “Patterns of Lexis in Text”, Oxford, Oxford University Press, 1991.
[11] Richard Watson Todd, “Topics in Classroom Discourse”, Ph.D Thesis, Liverpool University, 2003.


[12] Stephen T. O’Rourke, Rafael A. Calvo, “Visualizing Paragraph Closeness for Academic Writing Support”, Proceedings of the Ninth IEEE International Conference on Advanced Learning Technologies, pp. 688-692, 2009.
[13] Wibke Weber, “Text Visualization – What Colors Tell About a Text”, Proceedings of the Eleventh IEEE International Conference on Information Visualization, pp. 354-362, 2007.
[14] The Stanford Natural Language Processing Group, “The Stanford Log-Linear Part Of Speech (POS) Tagger API”, http://nlp.stanford.edu/index.shtml, Accessed January 2010.
[15] Princeton University, “About WordNet”, http://wordnet.princeton.edu, 2010.
