21. - 23. 2. 2007
VŠB - Technická univerzita Ostrava
Text Mining Services for Trialogical Learning
Pavel Smrž1, Ján Paralič2, Peter Smatana2, Karol Furdík2
1: Brno University of Technology, FIT, Božetěchova 2, 612 66 Brno,
University of Economics, Prague, W.Churchill Sq.4, 130 67 Praha, Czech Republic,
2: Technical University of Košice, Centre for Information Technologies,
Letná 9, 040 01 Košice, Slovakia
{Jan.Paralic, Peter.Smatana, Karol.Furdik}@tuke.sk
Contents
KP-Lab project
Trialogical Learning and Activity Theory
Semantic Web Knowledge Middleware
Text Mining Services
• Pre-processing
• Learning Ontologies
• Classification
Future work
KP-Lab Project
Full title: Knowledge Practices Laboratory
www.kp-lab.org
• Integrated EU-funded FP6 IST project No. 27490
• Starting date: February 1st, 2006
• Duration: 5 years
• 22 partners from 14 countries
Main goal: creating a learning system aimed at facilitating innovative practices of sharing, creating and working with knowledge in education and workplaces.
Trialogical Learning
Challenge - to capture innovative practices of both learning and working with knowledge, so-called knowledge practices.
Trialogical Learning focuses on the social processes by which learners collectively enrich/transform their individual and shared cognition.
Activity theory:
• the object-orientedness of human activity,
• mediation through cultural-historically developed tools of intelligent activity,
• contradictions emerging between the elements of activity systems.
Knowledge Artefacts
KA - a central notion of Trialogical Learning
• Mediators of all activities and tasks among learners;
• Capture and preserve the shared knowledge within a community.
Forms:
• Physical resources / tools (documents, SW code, ...);
• Concept maps, taxonomies, ontologies, domain models;
• Plans, scientific theories, languages.
Goal of the KP-Lab project: to provide a platform (tools & methodology) for the creation and transformation of KAs in the trialogical manner.
Scientific Challenges
1. Facilitating knowledge-creating learning beyond knowledge acquisition and social participation
2. Expanding and elaborating the "trialogical" object of educational activity
3. Eliciting the development of trialogical agencies
4. Facilitating horizontal and vertical boundary crossing
5. Developing tools for deliberate transformation of knowledge practices
6. Specifying design principles of trialogical technologies
7. Developing methods regarding research on longitudinal transformation of knowledge practices
8. Creating an open, developing community of trialogical technologies
Semantic Web Knowledge Middleware
SWKM goal - to facilitate knowledge creation processes by supporting advanced interactions of collaborating learners with knowledge artefacts, i.e. discovery, access, evolution, recommendation, and mining.
Generic modules:
• Knowledge Repository - scalable persistent services for large volumes of knowledge artefacts' descriptions and ontologies;
• Knowledge Mediator - services for handling the main registry, discovery, and evolution for KP-Lab knowledge artefacts;
• Knowledge Matchmaker - services supporting interactions of KP-Lab users with knowledge artefacts employing their semantic descriptions.
SWKM Architecture
Features:
• adopts SOA principles;
• built upon the RDFSuite OS platform;
• data: RDF, accessed by RQL / RUL.
Text Mining in the KP-Lab
Text mining services - intelligent access to and manipulation of the knowledge artefacts; they assist users in creating or updating the semantic descriptions of KP-Lab knowledge artefacts.
TMS fundamental tasks:
• Ontology learning - extraction of conceptual maps (clustering), i.e. automatic extraction of significant terms from the textual descriptions of KAs and their conversion into a structure of concepts and relationships.
• Classification of knowledge artefacts - grouping a given set of artefacts into predefined or ad hoc categories.
Schema of Text Mining Services
Pre-processing
Pre-processing phase - transforming data into the appropriate form. It consists of several language-dependent NLP steps that annotate the plain-text resources.
Unified modules:
• tokenization, stemming (or lemmatization, e.g. for CZ/SK), elimination of stop words, POS (part-of-speech) tagging.
Individual modules (crucial for some methods of ontology learning):
• chunking, WSD (word-sense disambiguation), full syntactic analysis.
GATE (http://www.gate.ac.uk/) - a platform for NLP, provides:
• an architecture, or organisational structure, for NLP software;
• a framework, or class library, which implements the architecture;
• a development environment built on top of the framework.
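To make the unified modules concrete, here is a minimal Python sketch of tokenization, stop-word elimination and stemming. It does not use GATE or any KP-Lab code: the stop-word list, the suffix list and the function names are assumptions chosen for illustration, and the suffix stripping is only a crude stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are language-specific).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "with"}

# Naive suffix list standing in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def strip_suffix(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("Learners are sharing and creating knowledge artefacts"))
```

For inflected languages such as Czech or Slovak, suffix stripping of this kind is usually too coarse, which is why the slide names lemmatization as the alternative.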
Ontology Learning (1)
1. Conversion to a plain-text format
• Structural information in the source file is used as meta-information in the next steps.
2. Processing by GATE
• Tokenization, sentence boundary detection, POS tagging (Brill's tagger), named entity recognition, Charniak's syntactic analyser.
3. Identification of significant terms (concepts)
• A background domain model, created from additional textual resources, is used.
4. Identification of semantic relations
• A set of pre-defined (or automatically identified) patterns and co-occurrence statistics are used.
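Steps 3 and 4 can be illustrated with a small Python sketch. The slides do not say which scoring function is used, so the smoothed log-ratio against background frequencies and the sentence-level co-occurrence counts below are assumptions, as are the function names. Pattern-based relation extraction is not covered.

```python
import math
from collections import Counter
from itertools import combinations


def significant_terms(doc_tokens, background_counts, background_size, top_n=3):
    """Rank terms by how much more frequent they are in the document
    than in the background corpus (add-one smoothed log ratio)."""
    doc_counts = Counter(doc_tokens)
    doc_size = len(doc_tokens)
    vocab = len(set(doc_counts) | set(background_counts))
    scores = {}
    for term, count in doc_counts.items():
        p_doc = count / doc_size
        p_bg = (background_counts.get(term, 0) + 1) / (background_size + vocab)
        scores[term] = p_doc * math.log(p_doc / p_bg)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


def cooccurrence(sentences, terms):
    """Count how often two selected terms appear in the same sentence."""
    pairs = Counter()
    wanted = set(terms)
    for sentence in sentences:
        present = sorted(wanted & set(sentence))
        pairs.update(combinations(present, 2))
    return pairs
```

Terms that are common in the document but rare in the background model rise to the top, while function words that are frequent everywhere receive low or negative scores. Pairs with high co-occurrence counts are only candidates for a semantic relation; the relation type would still have to come from patterns or other evidence.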
Ontology Learning (2)
5. Ontology merging
• The extracted structure is combined with the global domain ontology (stored in the KP-Lab knowledge repository). An explicit representation of uncertain knowledge is used in this step.
6. Visualisation
• Combination of the gained qualitative data and the relevance weights.
• The most suitable form of visualisation depends on the needs of KP-Lab users; a simple graphical view is proposed.
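One way to picture merging under explicit uncertainty is to attach a confidence value to every relation and to combine the values when the same relation appears in both sources. The Python sketch below is purely illustrative: the triple representation, the noisy-OR combination rule and the function names are assumptions, not the mechanism used in KP-Lab.

```python
def noisy_or(p, q):
    """Combine two independent confidence values in [0, 1]."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def merge_relations(domain, extracted):
    """Merge extracted relations into a copy of the domain ontology.

    Both arguments map (subject, predicate, object) triples to a confidence.
    """
    merged = dict(domain)
    for triple, confidence in extracted.items():
        if triple in merged:
            # Evidence from both sources reinforces the relation.
            merged[triple] = noisy_or(merged[triple], confidence)
        else:
            merged[triple] = confidence
    return merged
```

The resulting weights could also drive the visualisation, for example by showing only relations above a chosen confidence threshold.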
Ontology Learning (3)
7. Export to other formats
• Standard OWL export routines are currently supported. The emerging BayesOWL and FuzzyOWL formats are under development.
Creation of the training set - background model:
• 2-billion-word GigaCorpus for English;
• 600-million-word corpus for Czech;
• additional relevant documents provided by users.
Data simulation - using Wikiversity & Wikipedia texts.
Scenarios:
1. Collaborative knowledge acquisition in a company.
2. Description of a field of interest: writing an essay on given topic(s) in an academic environment.
Classification
The task is to automatically organize a set of knowledge artefacts into predefined or ad hoc categories - existing or new concepts of an ontology.
Classification is supervised by a model created from a training set of semantically annotated artefacts. The model contains a set of parameters (weights, rules, etc.) created during training and used to classify unknown examples.
Algorithms to be used:
• simple term matching, kNN, SVM, Winnow, Perceptron, Naive Bayes (multinomial and binomial), boosting, decision rules, and decision trees (various combinations of growing and pruning methods).
Implementation platform: JBowl library
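As an illustration of one of the listed algorithms, the following Python sketch implements multinomial Naive Bayes with add-one smoothing over tokenized artefact descriptions. It is not the JBowl API: the class name, method names and category labels are hypothetical, and the code only shows how the trained parameters (term counts per category) are used to classify an unknown example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        """Collect document counts per category and term counts per category."""
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log-probability of each category."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.term_counts[label]
            denominator = sum(counts.values()) + vocab_size
            score = math.log(doc_count / total_docs)
            for token in tokens:
                if token in self.vocabulary:  # ignore terms unseen in training
                    score += math.log((counts[token] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, tokens):
        """Return the category with the highest score."""
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

In the annotation setting described here, each category would correspond to a concept of an ontology, and the training documents would be artefacts already annotated with that concept.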
JBowl Library
JBowl - an open-source library in Java that provides support for:
• intelligent information retrieval, summarization, and information extraction from textual documents;
• text mining, clustering, categorization, and classification tasks.
Main characteristics:
• extendable modular architecture;
• platform for pre-processing (incl. NLP methods) and indexing of large textual collections;
• functions for creation and evaluation of text mining models (for both supervised and unsupervised algorithms).
Web: http://sourceforge.net/projects/jbowl/
JBowl Library - Architecture
[Figure: JBowl architecture diagram. Packages: models, data, analysis, utils, documents. Labelled components: tokenization, sentence chunking, NP chunking, POS tagging; statistics (TF, IDF), term selection; categorization, clustering, keyword extraction / summarization, information extraction; BLAS, matrixes, collections (utils); Lucene index, thesaurus, XML (documents).]
JBowl Library - Usage
JBowl provides:
• A text categorization method for active learning, which reduces the number of training examples needed.
• A heuristic that selects examples according to the confidence of the classifier's prediction for each example. This heuristic does not require a validation set and can be used effectively to select a small set of labeled examples.
• Integration of several classification methods, evaluation.
• Tools for NLP (incl. Slovak linguistic resources and tools).
Scenario for use of the classification service:
• Annotation of new or updated artefacts - the system can suggest suitable concepts from one or more ontologies to be assigned to the artefact as metadata or a conceptual description.
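The confidence-based selection mentioned above can be sketched as uncertainty sampling. The slides do not define the exact heuristic, so the margin between the two most probable categories is used here as an assumed confidence measure. The function names and the input format are hypothetical and unrelated to the JBowl API.

```python
def confidence_margin(probabilities):
    """Margin between the two most probable categories (small = uncertain)."""
    ranked = sorted(probabilities.values(), reverse=True)
    return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]


def select_for_labelling(predictions, batch_size):
    """Pick the unlabeled examples the classifier is least confident about.

    `predictions` maps an example id to its category probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda example: (confidence_margin(predictions[example]), example),
    )
    return ranked[:batch_size]
```

The selected examples would be shown to a user for annotation and then added to the training set, so that each round of labelling targets the artefacts on which the current model is least certain.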
Future Work
Solving multilinguality: find a minimal set of NLP resources that is sufficient for the (basic) functionality of the text mining services.
Increasing efficiency: requirement of a synchronous SOA system, e.g. by using the Extensible Messaging and Presence Protocol (XMPP).
Classification: selection of the most appropriate algorithms for automatic annotation of artefacts according to the semantics codified in several ontologies (with limited availability of training data).
Ontology learning: concentrate on better ways of ontology merging (incl. the need to combine extracted relations with those from existing domain ontologies).
Implementation of the first prototype of the SWKM (M24), testing and evaluation.
Thank you! Questions?
Further information: http://www.kp-lab.org