Bruxelles, 2006-03-10
Computer Aided Document Indexing System (CADIS) with Eurovoc
Bojana Dalbelo Bašić
Faculty of Electrical Engineering and Computing
University of Zagreb
[email protected]
Marko Tadić
Faculty of Humanities and Social Sciences
University of Zagreb
[email protected]
Project AIDE
idea for a project
September 2004, conference at JRC, Ispra
interdisciplinary collaboration of 3 institutions
Croatian Information Documentation Referral Agency (HIDRA)
Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
AIDE – collaborating institutions
HIDRA
collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia
coordinator Maja Cvitaš, M.A.
ZEMRIS
research in the field of artificial intelligence, neural networks, machine learning, data and text mining
coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder
ZZL
computational linguistic research and building language technologies for Croatian
coordinator prof. Marko Tadić
AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
AIDE – how?
automatic indexing, how?
program which “learns to index”
Joint Research Center of EC (JRC), Ispra, Italy
at least 10,000 manually indexed documents
3-5 descriptors per document
10-15 documents per descriptor
indexed documents stored in XML format
Steinberger (2003)
compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors
situation with Croatian documentation in 2004
only a few hundred documents had been indexed
manual indexing: painfully slow
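The corpus figures quoted above (number of indexed documents, descriptors per document, documents per descriptor) can be checked on any collection of manually indexed documents. The following is a minimal sketch, not the project's own tooling: the function name `corpus_stats`, the input shape (document id mapped to a list of descriptors) and the sample descriptor labels are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def corpus_stats(indexed_docs):
    """Summarise a manually indexed corpus.

    indexed_docs maps a document id to the list of descriptors
    assigned to that document (hypothetical input shape).
    """
    # Number of distinct descriptors assigned to each document.
    per_doc = [len(set(descs)) for descs in indexed_docs.values()]
    # Number of documents in which each descriptor occurs.
    per_descriptor = Counter(
        d for descs in indexed_docs.values() for d in set(descs)
    )
    return {
        "documents": len(indexed_docs),
        "mean_descriptors_per_document": mean(per_doc) if per_doc else 0.0,
        "mean_documents_per_descriptor": (
            mean(per_descriptor.values()) if per_descriptor else 0.0
        ),
    }


# Toy data with made-up descriptor labels, for illustration only.
sample = {
    "doc1": ["agriculture", "fisheries", "trade"],
    "doc2": ["trade", "customs", "taxation"],
    "doc3": ["agriculture", "trade", "environment"],
}
stats = corpus_stats(sample)
```

Comparing these averages with the figures on the slide gives a rough sense of how far a corpus is from being usable as training data.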
AIDE – how?
how could we speed up the manual indexing?
plan:
to develop a workstation for computer aided document indexing
conduct the research and development of algorithms in the field of computational linguistics/language technologies
insert that knowledge in the workstation and turn it into Computer Aided Document Indexing System (CADIS)
CADIS features
Enhanced user interface
list of descriptors appearing in document
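A list of descriptors appearing in a document could, in the simplest case, be produced by looking up each thesaurus label in the text. The sketch below assumes plain case-insensitive whole-word matching; it is not the CADIS implementation, and it would miss inflected forms, which is presumably where a linguistic module comes in. The function name and the sample labels are hypothetical.

```python
import re


def descriptors_in_document(text, descriptors):
    """Return (descriptor, count) pairs for descriptors found in text.

    Matching is a plain case-insensitive whole-word search, so
    inflected or reordered forms are not recognised.
    """
    found = []
    for descriptor in descriptors:
        pattern = r"\b" + re.escape(descriptor) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            found.append((descriptor, len(hits)))
    # Most frequent first, ties broken alphabetically.
    return sorted(found, key=lambda item: (-item[1], item[0]))


text = (
    "The fisheries policy affects trade. "
    "Trade agreements cover fisheries and customs duties."
)
# Made-up labels standing in for thesaurus descriptors.
descriptors = ["fisheries policy", "trade", "customs duties", "taxation"]
result = descriptors_in_document(text, descriptors)
```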
CADIS features
Integration of corpus analysis
greyed n-grams are statistically relevant in the corpus
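The slide does not say which statistic decides that an n-gram is relevant in the corpus. As a rough sketch under that caveat, the example below assumes a deliberately simple criterion: an n-gram of the current document is marked if it occurs in at least `min_docs` documents of the corpus. The function names, the threshold and the sample texts are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relevant_ngrams(document, corpus, n=2, min_docs=2):
    """Return n-grams of document that occur in at least min_docs corpus texts.

    Document frequency is an assumed stand-in for whatever relevance
    measure the real system uses.
    """
    doc_freq = Counter()
    for other in corpus:
        # Count each n-gram once per corpus document.
        doc_freq.update(set(ngrams(other.lower().split(), n)))
    candidates = set(ngrams(document.lower().split(), n))
    return sorted(g for g in candidates if doc_freq[g] >= min_docs)


corpus = [
    "common fisheries policy reform",
    "reform of the common fisheries policy",
    "customs duties on imports",
]
marked = relevant_ngrams("the common fisheries policy", corpus)
```

In a user interface, the n-grams returned here would be the ones shown greyed.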
CADIS features
Manual marking of significant n-grams
important step towards automatic indexing
Further development
CADIS for other languages?
already available for Croatian and English
usable for other languages without the linguistic module
cooperation needed with respective language technology experts for development of linguistic module for other languages
partners for EU project proposals for the next step
AIDE
research on machine learning and text-mining
use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc
establishing the publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
Conclusion
CADIS is unique in Europe
Web info at:
HIDRA: www.hidra.hr/hidra/aide/aide.htm
ZEMRIS: textmining.zemris.fer.hr
for download contact: [email protected]