
Simplifying Text Processing with Grammatically Aware Regular Expressions

Rafal Rzepka†  Tyson Roberts‡  Kenji Araki†
†Graduate School of Information Science and Technology, Hokkaido University
{kabura,araki}@media.eng.hokudai.ac.jp
‡Google Japan
[email protected]

Abstract

In this paper we introduce Grammatically Aware Regular Expressions (GARE) and describe their usage with examples from a moral consequences retrieval task. GARE is an extension of the regular expression concept that overcomes many of the difficulties of traditional regexp by adding normalization (e.g., searching for all grammatical forms of a verb or adjective by its basic form) and POS awareness (e.g., searching only for adjectives after the "wa" particle). We explain how it works, what makes it more expressive for natural language, and how it solves a number of matching cases that traditional regular expressions cannot solve on their own.

1 Introduction

Text mining is becoming an increasingly popular research topic, and it is an interesting subject when teaching students Natural Language Processing techniques. One of the most common pattern matching and extraction methods, the regular expression, is useful for shallow parsing tasks, but its difficulty discourages many undergraduates (see Fig. 1 for the complexity of a traditional regexp). Moreover, in more complicated tasks regexp lacks flexibility, especially in situations where deeper grammatical parsing is required. As a solution to these problems, we introduce Grammatically Aware Regular Expressions (GARE) and describe their usage with examples from an ongoing moral consequences retrieval task. GARE utilizes dependency parsing, which allows it to retrieve related tokens that are impossible to distinguish with surface form information alone. With this paper we want to introduce the system to Japanese NLP researchers and illustrate its functions with a few examples from our research [1] on retrieving human moral behavior patterns¹.

1.1 Classic Regular Expression

Regular expressions (often abbreviated to regex or regexp) are used by researchers familiar with tools like UNIX grep, but for most lay people, using them to determine whether a particular pattern occurs in a larger text is a difficult task. A regular expression is

¹ Due to lack of space, a richer set of full matching examples will be presented on the poster this paper describes.

a formal language used to describe a particular regular grammar, which can then be read by a regular expression parser to examine whether a portion matching the described grammar exists in a target sequence. The parser examines the sequence and identifies the parts that match the provided specification. Typical elements of such a specification are concatenation, alternation, grouping, literals and repetition. For example, one can find a date within a sentence by searching for repeated digits next to particular words, such as the names of months. However, when dealing with natural languages, affixes, inflections and other linguistic phenomena change surface forms, which constantly forces the user to extend the specification. Linguistic information (e.g., part of speech, inflection information, and any other tagged data) is not available, and one has to use external data to perform a successful match.
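As a minimal sketch of the date-finding idea above, using Python's standard re module (the month list and the "Month DD, YYYY" format below are illustrative assumptions, not taken from our task):

```python
import re

# Months spelled out, followed by a day and a year: an assumed format
# chosen only to illustrate classic surface-form matching.
MONTHS = (r"(?:January|February|March|April|May|June|July|August"
          r"|September|October|November|December)")
DATE_RE = re.compile(MONTHS + r"\s+(\d{1,2}),\s*(\d{4})")

match = DATE_RE.search("The workshop was held on March 14, 2012 in Hiroshima.")
if match:
    day, year = match.groups()
    print(day, year)  # -> 14 2012
```

Note that such a pattern matches surface forms only; any inflectional variation of the surrounding words would require extending the expression by hand, which is exactly the limitation GARE addresses.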

1.2 Moral Judgment Retrieval Task

The examples used in this paper come from a task of extracting the moral consequences of human actions. It is a search task where the input is an action such as "to kill a man" or "to steal a car", and the output is the emotional reaction of bloggers combined with a set of other consequences captured by manually set keywords: praise/punishment, impact on society, the possibility of being imprisoned, etc. If the majority of bloggers react negatively to a given action, it is labeled as immoral. Enhanced regular expressions allow for deeper search and for refining the output, as one can easily extend or limit verb forms, or calculate data credibility by weighting down expressions indicating lower probability (e.g.

Proceedings of the 18th Annual Meeting of the Association for Natural Language Processing (March 2012)

― 867 ― Copyright(C) 2012 The Association for Natural Language Processing. All Rights Reserved

Figure 1: Brevity and clarity difference between classic regexp and GARE.

Figure 2: The final tree is of variable depth, and word objects potentially contain other words.

"maybe", "probably", "perhaps", etc.). Below we explain how this is possible.
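The credibility weighting mentioned above can be sketched as follows; the hedge list and the 0.5 penalty factor are assumptions for illustration, not values from our system:

```python
# A minimal sketch of down-weighting matched sentences that contain
# hedging expressions. Both the hedge set and the penalty factor are
# illustrative assumptions.
HEDGES = {"maybe", "probably", "perhaps", "possibly"}
HEDGE_PENALTY = 0.5  # assumed factor per hedge word

def credibility(sentence: str, base_weight: float = 1.0) -> float:
    """Return a weight for a matched sentence, reduced for each hedge word."""
    weight = base_weight
    for token in sentence.lower().split():
        if token.strip(".,!?") in HEDGES:
            weight *= HEDGE_PENALTY
    return weight

print(credibility("He was punished for it."))         # -> 1.0
print(credibility("Maybe he was probably punished"))  # -> 0.25
```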

2 MuCha System

The basis of GARE [3] is the MuCha system [2], which was developed for the Japanese-language implementation of MIT Media Lab's ConceptNet [4]. Named in honor of the CaboCha project, on which the underlying parsing tree is based, MuCha is a modular suite written in the Python programming language. Though MuCha is implemented for Japanese, it can be extended to any language that has a working dependency parser. Its interface is inspired by pyparsing², a general parsing library, and the system has two main modules: the Tree Parser and the Pattern Matcher.

² http://pyparsing.wikispaces.com/

2.1 Tree Module

The tree module is an extended version of the tree structure produced by the CaboCha dependency structure analyzer. The tree first builds an utterance by copying CaboCha's shallow chunk/token tree. These CaboCha tokens are then abstracted into "Word" nodes. Since Japanese is highly inflected, a single word can comprise a variable number of tokens (see Figs. 2 and 3). To achieve this abstraction, each token node is first wrapped in a Word node, and then custom combination rules based on Japanese grammar are applied recursively to build up words. Japanese is generally agglutinative, and inflections can change the part of speech of the entire word each time an affix is encountered; this can happen repeatedly for highly inflected words. Classes in MuCha describe the properties and idiosyncrasies of each type, which allows easy implementation of several major features, such as normalization of verb and adjective conjunctive forms or enumeration of the objects in an utterance by word, token, or chunk.

In the Chunk phase, the iterator operates by running rules over each child in a chunk. Initially, the children of each chunk are token objects. The rules themselves work by wrapping the child nodes in appropriate word nodes, sometimes recursively. By the end of this phase, all token objects are wrapped in word objects. The Word phase, which comes next, operates on the precondition that all children of chunks have been wrapped in (possibly recursive) word nodes. This phase is concerned with fixing certain idiosyncrasies of the previous phase and combining complex conjugations.

The tree structure is made up of several types of nodes in a strict hierarchy: utterance, chunk, word, and token objects. Each has its own object representation, derived from a common object type, TreeNode. The chunk class is contained within utterance nodes and contains word nodes; chunks represent a partition of the utterance that happens across linguistic boundaries. Words are branch nodes that are children of chunks and parents of other word nodes (complex words) or tokens (simple words). Lastly, tokens consist of only a single class and are not overridden. Tokens are the children of word nodes and contain the basic information about linguistic objects.
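The strict hierarchy described above can be sketched as follows; class and attribute names here are illustrative simplifications, not MuCha's actual API:

```python
# A sketch of the node hierarchy: Utterance > Chunk > Word > Token,
# all derived from a common TreeNode. Names are assumptions for
# illustration only.

class TreeNode:
    def __init__(self, children=None):
        self.children = children or []

class Token(TreeNode):
    """Leaf node carrying basic linguistic information for one CaboCha token."""
    def __init__(self, surface, pos):
        super().__init__()
        self.surface = surface
        self.pos = pos

class Word(TreeNode):
    """Branch node: child of a Chunk, parent of Tokens (simple word)
    or of other Words (complex word)."""

class Chunk(TreeNode):
    """Contained in an Utterance; after the Chunk and Word phases
    all of its children are Word nodes."""

class Utterance(TreeNode):
    """Root of the tree, copied from CaboCha's chunk/token output."""

# The initial Chunk phase wraps each token in a Word node:
tokens = [Token("yurushi", "verb"), Token("taku", "suffix")]
chunk = Chunk([Word([t]) for t in tokens])
assert all(isinstance(child, Word) for child in chunk.children)
```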

2.2 Verb and Adjective Normalization

A simple scheme is implemented: normalized verb conjugations are reported as part of the verb, separated by the swung-dash character ∼, but arbitrary normalization algorithms can be implemented (see Fig. 4). For example:

Kare wo yurushi taku nari masen deshita .


Figure 3: Visualization of MuCha : Tree Module : CaboCha Tree.

I didn't want to forgive him. [Polite]

Kare wo yurushi taku nara na katta .
I didn't want to forgive him. [Plain]

Both become:

Kare wo yurusu∼tai∼naru∼nai∼ta

In Japanese, inflections often have different conjunctive surface forms depending on the type of root they are attached to. As shown above, polite verb forms in Japanese regularly use conjugations that, while semantically identical, carry different surface forms from plain verb forms. Normalization helps to solve this problem by creating an extendable regular syntax for representing the Japanese language in a normalized form. The tree nodes that use the normalize functionality are called in the following way. The first step is to call the normalize() function on the node, passing a normalizer object as an argument. Then, if the node is a branch, it calls join() on child pairs, or join_with_type() on multiple children, which calls normalize() on them in turn. If not, the node normalizes itself as a string and returns the result. In this way, all of MuCha's nodes can be normalized in a flexible way, and the user is able to write new normalizers if desired.
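The recursive normalization walk just described can be sketched as follows; the Normalizer and Node classes below are simplified stand-ins, not MuCha's exact interface:

```python
# A sketch of the normalization walk: branches normalize their children
# and join the results; leaves report their own basic form. Class names
# and the basic_form attribute are illustrative assumptions.

class Normalizer:
    SEPARATOR = "\u223c"  # the swung dash used in the examples above

    def join(self, parts):
        return self.SEPARATOR.join(parts)

class Node:
    def __init__(self, basic_form=None, children=None):
        self.basic_form = basic_form    # set on leaves (tokens)
        self.children = children or []  # set on branches (words)

    def normalize(self, normalizer):
        if self.children:  # branch: normalize children, then join them
            return normalizer.join(c.normalize(normalizer) for c in self.children)
        return self.basic_form  # leaf: report its own basic form

# "yurushi taku nari masen deshita" normalizes to its basic forms:
word = Node(children=[Node("yurusu"), Node("tai"), Node("naru"),
                      Node("nai"), Node("ta")])
print(word.normalize(Normalizer()))  # -> yurusu∼tai∼naru∼nai∼ta
```

Because the normalizer is passed in as an argument, a user-supplied object with a different join() behavior changes the output format without touching the tree classes.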

2.3 MuCha Pattern Matching Module

The Pattern Matching module allows patterns to be written in a way similar to regular expressions,

Figure 4: A simplified version of the normalization class.

but it is much more powerful. The parser itself is a recursive descent parser which is capable of parsing context-sensitive grammars (it allows full backtracking), but is written to do as much lazy calculation as possible. Patterns are written in straight Python using an interface inspired by the pyparsing library mentioned before. The tokens on which the module operates are not strings but the enumerated words from the original parse tree, and this makes its pattern matching different from classic approaches. For example,

Kitto kare ni ha kakoku na batsu ga youi sareteru n darou .


Figure 5: A simple example of a matching pattern for the sentence Kitto kare ni ha kakoku na batsu ga youi sareteru n darou.

(Surely a severe punishment has been prepared for him.)

would operate on the following object stream:

<Adverb: Kitto> <Noun: kare> <Particle: ni> <Particle: ha> <SimpleAdjective: kakoku> <Da: na> <Noun: batsu> <Particle: ga> <NominalVerb: youi sareteru> <Noun: n> <ComplexVerb: darou> <Punctuation: .>

This implementation differs from a traditional regular expression because the grammar is represented in program code in the Python language, as opposed to a separately coded string. Elements (tokens) and operators are represented by Python objects directly. Thanks to this approach, the implementation is simpler than creating an interpreter for a regular expression engine based on a string-based grammar, and arbitrary Python code can be used for filtering. Non-linear matching and sub-matching are also possible, and matching grammars become more readable and can easily be built step by step. However, if traditional regular expression semantics need to be preserved, the user can override Python operators.
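The idea of representing elements and operators as Python objects can be sketched as follows; this is a minimal illustration of the technique, not MuCha's actual classes, and the word stream is simplified to (POS, surface) tuples:

```python
# A sketch of pattern objects combined with overloaded operators:
# `+` builds a sequence, `|` an alternation, and match() walks a
# stream of (pos, surface) word tuples. Names are illustrative.

class Pattern:
    def __add__(self, other):
        return Seq(self, other)

    def __or__(self, other):
        return Alt(self, other)

    def match(self, words, i=0):
        """Return the index after a match starting at i, or None."""
        raise NotImplementedError

class Pos(Pattern):
    """Matches a single word by part of speech."""
    def __init__(self, pos):
        self.pos = pos

    def match(self, words, i=0):
        if i < len(words) and words[i][0] == self.pos:
            return i + 1
        return None

class Seq(Pattern):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def match(self, words, i=0):
        j = self.first.match(words, i)
        return None if j is None else self.second.match(words, j)

class Alt(Pattern):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def match(self, words, i=0):
        j = self.left.match(words, i)
        return j if j is not None else self.right.match(words, i)

words = [("Adverb", "Kitto"), ("Noun", "kare"), ("Particle", "ni")]
pattern = Pos("Adverb") + (Pos("Noun") | Pos("Verb")) + Pos("Particle")
print(pattern.match(words))  # -> 3
```

Because each element is an ordinary Python object, arbitrary Python code can be attached for filtering, and grammars can be composed incrementally rather than written as one monolithic string.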

3 Conclusions and Future Work

In this paper we have briefly described GARE, an improved text matching technology using a novel method of matching. It improves regular expressions in natural language tasks by allowing grammatical constructs and other features to be used, which is currently impossible in regexp systems. It is developed to be easier to use, re-use and learn, as writing patterns is far less time consuming and requires less code. It should be suitable for use in education and research; we are currently working on an online matching pattern tester. Because MuCha³ is simple to extend and modify (by exploiting the Python object inheritance model), users can define their own patterns. We are planning to use it not only for consequences retrieval but also to extract concepts fitting the Open Mind Common Sense (OMCS [5]) entries, which are laboriously input by human volunteers. For instance, "AtLocation" node candidates can be matched by

Subject() + ComplexNoun() + NiPart() + (Iru() | Aru())

To use MuCha's extendability even further, we are also planning to add semantic categories (by using WordNet [6] data, for example) to allow limiting ComplexNoun() to one kind of noun:

Subject() + PLACE() + NiPart() + (Iru() | Aru())

We also need to work on optimization, as the system currently concentrates on complicated matches, not on matching speed (on a 2008 MacBook Pro the average time was approximately 68 ms per match).

³ MuCha 1.0.6 is available for download and testing here: http://arakilab.media.eng.hokudai.ac.jp/local/local/Links_files/mucha_package.1.0.6.tar.gz (your feedback is greatly appreciated).

References

[1] Komuda, R., Ptaszynski, M., Momouchi, Y., Rzepka, R. and Araki, K.: "Machine Moral Development: Moral Reasoning Agent Based on Wisdom of Web-Crowd and Emotions", International Journal of Computational Linguistics Research, Volume 1, Issue 3, pp. 155-163 (2010)

[2] Roberts, T., Rzepka, R. and Araki, K.: "A Japanese Natural Language Toolset Implementation for ConceptNet", Proceedings of Commonsense Knowledge, AAAI 2010 Fall Symposium (Technical Report FS-10-02), pp. 88-89, Arlington, USA (2010)

[3] Roberts, T., Rzepka, R. and Araki, K.: "Introducing Grammatically Aware Regular Expressions", Proceedings of the Pacific Association for Computational Linguistics, paper No. 12 (2011)

[4] Havasi, C., Speer, R. and Alonso, J.: "ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge", Proceedings of Recent Advances in Natural Language Processing, pp. 277-293 (2007)

[5] Singh, P., Lin, T., Mueller, E. T., Lim, G., Perkins, T. and Zhu, W.: "Open Mind Common Sense: Knowledge Acquisition from the General Public", Proceedings of ODBASE'02, Lecture Notes in Computer Science, Heidelberg: Springer-Verlag (2002)

[6] Fellbaum, C.: "WordNet: An Electronic Lexical Database", Cambridge, MA: MIT Press (1998)
