failed queries: a morpho-syntactic analysis based on transaction log files

21
“Failed Queries: a Morpho-Syntactic Analysis Based on Transaction Log Files” Anna Mastora 1 , Maria Monopoli 2 and Sarantos Kapidakis 1 1 Laboratory on Digital Libraries & Electronic Publishing, Department of Archives & Library Sciences, Ionian University, 72 Ioannou Theotoki str., 49100, Corfu, Greece 2 Library Section, Economic Research Department, Bank of Greece, 21 El. Venizelos ave., 10250, Athens, Greece {mastora, sarantos}@ionio.gr, [email protected] 1 st Workshop on Digital Information Management 30-31 March 2011, Corfu, Greece This research has been co-financed by the European Union (European Social Fund – ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Heracleitus II. Investing in knowledge society through the European Social Fund.

Upload: giannis-tsakonas

Post on 17-Jan-2015

698 views

Category:

Education


1 download

DESCRIPTION

Presentation in the First Workshop on Digital Information Management. The workshop is organized by the Laboratory on Digital Libraries and Electronic Publication, Department of Archives and Library Sciences, Ionian University, Greece and aims to create a venue for unfolding research activity on the general field of Information Science. The workshop features sessions for the dissemination of the research results of the Laboratory members, as well as tutorial sessions on interesting issues.

TRANSCRIPT

  • 1. 1st Workshop on Digital Information Management30-31 March 2011, Corfu, Greece Failed Queries:a Morpho-Syntactic AnalysisBased on Transaction Log FilesAnna Mastora1, Maria Monopoli2 and Sarantos Kapidakis11Laboratory on Digital Libraries & Electronic Publishing, Department of Archives & Library Sciences, Ionian University, 72 IoannouTheotoki str., 49100, Corfu, Greece2Library Section, Economic Research Department, Bank of Greece, 21 El.Venizelos ave., 10250, Athens, Greece {mastora, sarantos}@ionio.gr, [email protected] research has been co-financed by the European Union(European Social Fund ESF) and Greek national funds throughthe Operational Program "Education and Lifelong Learning" ofthe National Strategic Reference Framework (NSRF) - ResearchFunding Program: Heracleitus II. Investing in knowledgesociety through the European Social Fund.

2. Presentation outlineIntroductionAims and ObjectivesRelated ResearchDefinitions and MethodologyResultsConclusionsFuture Work2 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 3. Aims and ObjectivesThe aim of this study is to elaborate on theprocedure needed in order to analyze morpho-syntactically the typing-error queries submittedduring the search processThe objectives of the study are twofold Explore the extent and types of failed queries due totyping errors Explore the feasibility of their morpho-syntacticanalysis31st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 4. Related research (i)Information Retrieval techniques do not workeffectively at all timesNot working effectively includes: Not retrieving relevant documents (low Recall), or, Retrieving non relevant documents (low Precision)Part of studying what is not retrieved is theanalysis of Failed queries or Failure analysisSignificant interest has been expressed onfailed queries as the outcome of subjectsearching41st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 5. Related research (ii)Natural Language Processing is essential,especially in highly inflectional languages,such as GreekPart of Speech (PoS) tagging andmorpho-syntactic analysis of words arevaluable tools when mechanisms areapplied for word disambiguation (eithermorpho-syntactic or word-sense driven)51st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 6. Definitions and Limitations (i)What constitutes a failed query? It depends. Some approaches consider: Precision and Recall, applying retrieval effectiveness measures User satisfaction, applying users criteria to measure if a query was failed Transaction Log files analysis (with or without relevance judgments) Critical incident technique, using direct observation of human behavior6 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 7. Definitions and Limitations (ii)Failed query: a query with zero hits A bit arbitrary (yet practical) sincezero hits cannot solidly define a query as failednot all non-zero hits queries can be defined as successful For the purpose of this study this is a rather safedefinition sincethe database was known to include relevant itemsthe information need was relevant to the context of thedatabasewe used transaction log files without relevance judgments We analyze the queries morpho-syntactically, i.e. asbag of words, detached from any semanticdesignation (well most of the times)71st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 8. Definitions and Limitations (iii)Morpho-syntactic analysis: cognitive process thatconstitutes an intermediate layer between morphological andsyntactic analysis and aims to assign unambiguous morpho-syntactic information to words of textsMorpho-syntactic information: the morphologicalorigin and the morpho-syntactic properties of a word (e.g. theword is the genitive singular form of the masculinenoun [])Inflectional languages: with a high morpheme-per-word ratioMorpheme:the smallest meaningful linguistic unit8 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 9. Methodology: Experimental settingIn vitro experiment (27 participants, undergrads, Dpt. ofArchives & Library Sciences)13 given information needs related toenvironmental issues The queries submitted for this purpose formed the terms corpusSubject search only, Simple searchBibliographic database of the EvonymosEcological Library (customized)Transaction Log Files (1 xml log file/ per user/ persession)Language used: Greek (highly inflectional)91st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 10. Methodology: The Workflow101st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 11. Methodology: Analysis specifications Manual designations to categories of failed queries, all examined one-by-one Use of Open Xerox tokenizer and morphological analyzer (PoS tags also included)Use of MS Word dictionary For more definitions and a full list of citations, please, refer to the full paper of this presentation hosted at the workshops website111st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 12. Results (i) 1,284 queries were submitted overall 36% were failed queries (i.e. 459) Further categorization of Failed Queries Further categorization of Typing errors121st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 13. Results (ii) Typing error queries: 90 Result after tokenization: 156 tokens Morpho-syntactic analysis of tokens (terms as submitted): 20/156 identified and analyzed*Poor performance in identifying the terms*131st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 14. Results (iii) Error correction of tokens using MS Word dictionary At this point we interfered with the results assigning the semantically correct word141st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 15. Results (iv) Morpho-syntactic analysis of corrected tokens: 139/156 identified and analyzed *Good performance in identifying the terms*151st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 16. An example16 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 17. Conclusions (i) 36% of the queries submitted returned zero hits 19.6% of failed queries were due to typing errors Typing error queries require more steps in the process towards morpho-syntactic analysis Tools for morpho-syntactic analysis of the Greek language need to be rich in tags in order to work adequately. This affects the complexity of the tool but it is essential since Greek is a highly inflectional language and cannot be analyzed easily. Transaction Log files offer a good starting point but need further analysis and support from other tools in order to be used for successful word-sense disambiguation17 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 18. Conclusions (ii) MS Word dictionary Was consulted in order to correct ~70% of the typing errors. The remaining were the part of submitted phrases with no typing error. In 45.5% of the cases, first MS Word suggestion was correct. Seems to favor at all instances plural masculine over singular feminine, i.e. for , it first suggests and second . Does not recognize named entities181st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 19. Conclusions (iii) Open Xerox Works well for identifying and characterizing the segments Is not enriched with many tags which leads to multiple suggestions. The goal in these cases is the less possible suggestions for each segment Does not recognize capitalized words or words with no accent marks, meaning that the text submitted must be preprocessed to meet certain requirements Does not recognize named entities191st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 20. Future Work Deal with matters such asNamed entities recognitionLanguage identificationWord-sense disambiguation in order to achieve higher rates of morpho- syntactic analysis and apply automated mechanisms in this process Match the analyzed corpus of queries to Knowledge Organization Systems to assist query expansion20 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011 21. Thank you for your attention! Any questionsContact: Anna Mastora Laboratory on Digital Libraries & Electronic Publishing, Department of Archives & Library Sciences, Ionian University, 72 Ioannou Theotoki str., 49100, Corfu, Greece [email protected] 1st Workshop on Digital Information Management, Corfu, Greece, 30-31 March 2011