SINAI-GIR
A Multilingual Geographical IR System
University of Jaén (Spain)
José Manuel Perea Ortega
CLEF 2008, 18 September, Aarhus (Denmark)
Computer Science Department
Introduction
• Preliminary work of SINAI in GeoCLEF:
– 2006: query expansion using gazetteers and thesaurus [García-Vega et al., 2007]
– 2007: filtering documents based on manual rules [Perea-Ortega et al., 2007]
• GeoCLEF 2008:
– Filtering documents using new manual rules and new approaches (query reformulation, keyword and hyponym extraction, query geo-expansion)
GeoCLEF 2008, Aarhus
SINAI-GIR System overview
[Figure: system architecture diagram. Components: Translator, Query Analyzer, Collection Preprocessing subsystem, IR Subsystem, Validator. Resources: English collection, GeoNames. Data labels: Multilingual Query, English Query (Q), queries Q1, Q2 and Q3, keywords and geo-information extracted, documents retrieved, final re-ranked documents retrieved.]
Translator
• Translates the queries from other languages into English
• We have used SINTRAM (SINai TRAnslation Module) [García-Cumbreras et al., 2007]
• It works with different online machine translators
Collection Preprocessing subsystem
• Preprocessing: stemming, stopword removal, POS tagging
• The toponyms are extracted (NER)
• Two indexes are generated:
– Locations
– Keywords
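The two-index idea can be illustrated with a minimal sketch that builds one inverted index for toponyms and another for the remaining keywords. The small toponym set stands in for a real NER step, the stopword list is illustrative, and stemming is omitted; none of these names come from the SINAI-GIR system itself.

```python
from collections import defaultdict

# Hypothetical stand-ins: a real system would use an NER tagger, a stemmer
# and a full stopword list.
KNOWN_TOPONYMS = {"aarhus", "jaén", "spain", "denmark"}
STOPWORDS = {"the", "in", "of", "a", "and", "is"}


def build_indexes(docs):
    """Return (location_index, keyword_index), each mapping term -> set of doc ids."""
    location_index = defaultdict(set)
    keyword_index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:()")
            if not token or token in STOPWORDS:
                continue
            # Toponyms go to the locations index, everything else to keywords.
            if token in KNOWN_TOPONYMS:
                location_index[token].add(doc_id)
            else:
                keyword_index[token].add(doc_id)
    return location_index, keyword_index


docs = {
    "d1": "Floods in Spain and Denmark.",
    "d2": "The conference is in Aarhus, Denmark.",
}
locations, keywords = build_indexes(docs)
```

Keeping the indexes separate lets a later stage match geographic terms and thematic terms independently.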
Query Analyzer
• Query preprocessing: stemming, stopword removal, removal of irrelevant information
• The toponyms are extracted (NER)
• Spatial relations finder based on manual rules
• Query reformulation based on POS tagging and the query parsing subtask
• Geo-expansion using a gazetteer
• Keywords/hyponyms detection
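The slides do not say how the geo-expansion is computed, so the following is only a sketch of one plausible reading: when the query contains a spatial relation such as "in", the places contained in the detected toponym are appended to the query terms. The toy gazetteer stands in for GeoNames, and the function name and expansion rule are assumptions.

```python
# Toy gazetteer standing in for GeoNames: region -> places it contains.
# The entries and the expansion rule are illustrative assumptions.
TOY_GAZETTEER = {
    "andalusia": ["jaén", "granada", "seville"],
    "denmark": ["aarhus", "copenhagen"],
}


def geo_expand(keywords, toponym, relation="in"):
    """Append places contained in `toponym` when the spatial relation is 'in'."""
    terms = list(keywords) + [toponym]
    if relation == "in":
        terms.extend(TOY_GAZETTEER.get(toponym, []))
    return terms


expanded = geo_expand(["wine", "production"], "andalusia")
```

A toponym that is missing from the gazetteer is simply left unexpanded.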
IR Subsystem
• Lemur as the index and search engine
• Okapi with PRF (pseudo-relevance feedback) as the weighting function
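For readers unfamiliar with Okapi weighting, here is a small self-contained sketch of BM25 scoring. It is not Lemur's implementation: the parameter values k1 = 1.2 and b = 0.75 are common defaults rather than the values used in these experiments, the IDF is a non-negative variant, and the pseudo-relevance feedback step is left out.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Non-negative IDF variant (an assumption, not Lemur's exact formula).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates with k1; b controls length normalization.
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


corpus = [
    ["floods", "in", "spain"],
    ["conference", "in", "denmark"],
    ["wine", "in", "spain", "and", "spain"],
]
scores = [bm25_score(["spain", "wine"], doc, corpus) for doc in corpus]
```

In this toy corpus the third document scores highest because it matches both query terms, and the second scores zero because it matches neither.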
Validator
• Filters the list of documents retrieved by the IR subsystem, applying different manual rules and using the geographical data detected in the query
• Re-ranks the documents using predefined weights for each rule and the keywords/hyponyms detected in the query
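A filter-then-re-rank step of this kind might look like the sketch below. The slides do not list the actual manual rules or weights, so the single rule used here (keep a document only if it mentions a query location) and the keyword bonus of 0.1 are purely illustrative assumptions.

```python
def validate(retrieved, query_locations, query_keywords, keyword_weight=0.1):
    """Filter by location match, then re-rank with a keyword bonus.

    `retrieved` is a list of (doc_id, ir_score, doc_locations, doc_terms).
    The single filtering rule and the weight are illustrative assumptions.
    """
    reranked = []
    for doc_id, score, doc_locations, doc_terms in retrieved:
        # Rule: discard documents that mention none of the query locations.
        if not (set(doc_locations) & set(query_locations)):
            continue
        # Boost the IR score once per query keyword found in the document.
        matches = len(set(doc_terms) & set(query_keywords))
        reranked.append((doc_id, score * (1 + keyword_weight * matches)))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)


retrieved = [
    ("d1", 2.0, {"spain"}, {"floods"}),
    ("d2", 1.9, {"spain"}, {"wine", "production"}),
    ("d3", 3.0, {"denmark"}, {"wine"}),
]
ranking = validate(retrieved, {"spain"}, {"wine", "production"})
```

Here d3 is dropped despite having the highest IR score, and d2 overtakes d1 thanks to its two keyword matches.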
Experiments description
• SINAI has participated in the monolingual and bilingual tasks with a total of 15 experiments:
– MONO-EN: 9 experiments
– BILI-X2EN: 6 experiments
• Combining the content of topic labels: TD or TDN
• Baseline: Q1 without applying any filtering or re-ranking process
• Other experiments:
– Filtering and re-ranking of the fused list of documents retrieved by Q1, Q2 and Q3
– Using keywords and/or hyponyms in the re-ranking process
MONO-EN results
• Best result: baseline (no filtering and no re-ranking)
• In some filtering experiments the use of keywords improves the results
• Best results using only the TD topic labels
BILI-X2EN results
• Best result: baseline (no filtering and no re-ranking) with Portuguese topics
• Best results using only the TD topic labels
Conclusions
• The baseline experiment seems to work well because we include the geo-information in the retrieval process
• The filtering of documents does not seem to work well because we already include the geo-information in the query, and we are re-ranking documents which may not be relevant with respect to their content
• The use of keywords for re-ranking the retrieved documents could be interesting because in some experiments it improves on the results obtained without them
• Query reformulation could also be interesting because for some topics it retrieves valid documents which are not retrieved with the default query
TextMESS at GeoCLEF 2008
• Spanish TextMESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies): joint participation by the Polytechnic University of Valencia and the University of Jaén (SINAI)
• Method employed: merging algorithm based on the fuzzy Borda voting scheme, taking as input the two document lists returned by the two systems
• Second best result in the monolingual English task
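To give an idea of how a fuzzy Borda merge can work, the sketch below follows the scheme as it is commonly described: within each system, document i earns r_ij = w_i / (w_i + w_j) against every document j it is preferred to (r_ij > 0.5), and the per-system totals are summed. Whether the TextMESS runs used exactly this weighting, or normalized the scores first, is not stated in the slides; the scores, names and tie-breaking rule here are assumptions.

```python
from collections import defaultdict


def fuzzy_borda(ranked_lists):
    """Merge several {doc_id: score} lists with a fuzzy Borda count.

    Documents a system did not return get weight 0 for that system.
    Ties are broken by document id so the output is deterministic.
    """
    all_docs = set().union(*ranked_lists)
    totals = defaultdict(float)
    for weights in ranked_lists:
        for i in all_docs:
            w_i = weights.get(i, 0.0)
            for j in all_docs:
                if i == j:
                    continue
                w_j = weights.get(j, 0.0)
                if w_i + w_j == 0:
                    continue
                # Degree to which this system prefers document i over j.
                r_ij = w_i / (w_i + w_j)
                if r_ij > 0.5:
                    totals[i] += r_ij
    return sorted(all_docs, key=lambda d: (-totals[d], d))


system_a = {"d1": 0.9, "d2": 0.6, "d3": 0.3}
system_b = {"d2": 0.8, "d1": 0.7, "d4": 0.5}
merged = fuzzy_borda([system_a, system_b])
```

Unlike the classic Borda count, which only uses rank positions, this variant lets the size of the score gap between two documents influence the vote.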
Thank you
sinai.ujaen.es
• References
– García-Vega, Manuel and García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Perea-Ortega, José M. GEOUJA System. The first participation of the University of Jaén at GEOCLEF 2006. In LNCS, volume 4730, pages 913-917. Springer-Verlag, 2007.
– Perea-Ortega, José M. and García-Cumbreras, Miguel A. and García-Vega, Manuel and Montejo-Ráez, Arturo. GEOUJA System. University of Jaén at GEOCLEF 2007. In Proceedings of the Cross Language Evaluation Forum (CLEF 2007), page 52, 2007.
– García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Martínez-Santiago, Fernando and Perea-Ortega, José M. BRUJA System. The University of Jaén at the Spanish task of QA@CLEF 2006. In LNCS, volume 4730, pages 328-338. Springer-Verlag, 2007.
http://sinai.ujaen.es