Sketchengine: Tool for Text-Based Terminology
INTEGRA TERMINOLOGY MANAGEMENT, ZAGREB
Using texts for term mining
- Building a corpus
  - Choosing texts
  - Converting into common format (txt)
  - Annotation
    - Croatian: http://nlp.ffzg.hr/api-for-our-language-technologies/
  - Alignment
    - CAT tools (SDL, memoQ) or LF Aligner, https://sourceforge.net/p/aligner/wiki/Home/
- Searching the corpus
  - Concordance tools: AntConc (free), Wordsmith (€), ParaConc (free)
  - Web-based corpus workbench: Sketchengine, http://www.sketchengine.co.uk
What is Sketchengine?
- Very powerful corpus workbench: https://www.sketchengine.co.uk/
- Provides access to multiple pre-compiled corpora (British National Corpus, hrWaC, DGT corpora and many more)
- NOT free, but not expensive :) (€5.99 per month)
- Allows the creation of ad hoc corpora from web texts
- Supports TMX import (for bilingual texts!)
- Provides ways to extract terminology semi-automatically
- Online tutorials: https://www.sketchengine.co.uk/sketch-engine-video-tutorials/
Simple concordances
Other query types
- simple: searches for word and its inflected forms
- lemma: searches for all words with this lemma
- phrase: for searching multiple words
- word: to search for a specific wordform
- character: to search for a string of characters
- CQL: corpus query language
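
The difference between the word, lemma and character query types can be illustrated on a toy token list. This is only a sketch: the tokens and their lemmas are assigned by hand for illustration, and the three helper functions are hypothetical stand-ins, not Sketchengine's own implementation.

```python
# Toy corpus: (word form, lemma) pairs. Lemmas are hand-assigned for illustration.
TOKENS = [
    ("Ribolov", "ribolov"),
    ("ribolova", "ribolov"),
    ("ribolovu", "ribolov"),
    ("ribolovni", "ribolovni"),
    ("brancin", "brancin"),
]


def word_query(tokens, form):
    """word: match one specific word form exactly."""
    return [word for word, _ in tokens if word == form]


def lemma_query(tokens, lemma):
    """lemma: match every word form that shares the given lemma."""
    return [word for word, tok_lemma in tokens if tok_lemma == lemma]


def character_query(tokens, chars):
    """character: match any token that contains the character string."""
    return [word for word, _ in tokens if chars in word]


print(word_query(TOKENS, "ribolova"))
print(lemma_query(TOKENS, "ribolov"))
print(character_query(TOKENS, "lov"))
```

The lemma query returns all three inflected forms of "ribolov" but not the adjective "ribolovni", while the character query also picks up the adjective because it only looks at the character string.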
WordSketches
Thesaurus – similar words
Keyword extraction
Term queries in the DGT parallel corpus
- simple queries: ribolov, brancin, grdobina
- lemma queries: ribolov -> ribolova, ribolovu, ribolov
- parallel query:
- querying using CQL syntax:
Basic CQL
- Typical format: [attribute="value"], e.g. [lemma="riba"]
- Specifying word class or case: [tag="N.*"] (any noun), [tag="A.*"] (any adjective)
- Regular expressions:
  - `.` (dot) matches any single character
  - `*` (asterisk) matches 0-100 repetitions
  - `+` (plus) matches 1-100 repetitions
  - `{n,k}` specifies exact range of repetitions, from n to k
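
The regular-expression operators listed above can be tried out with Python's `re` module. This is a rough sketch: Python's regex engine stands in for the one behind CQL, so details such as repetition limits may differ. The helper name `matches` is hypothetical. `re.fullmatch` is used on the assumption that a CQL attribute value has to match the whole token, not just part of it.

```python
import re


def matches(pattern, value):
    """Return True if the pattern matches the whole value (assumed CQL behaviour)."""
    return re.fullmatch(pattern, value) is not None


# dot: exactly one arbitrary character
print(matches("rib.", "riba"))
# asterisk: zero or more repetitions, so the bare stem matches too
print(matches("ulov.*", "ulov"))
# plus: at least one repetition, so the bare stem no longer matches
print(matches("ulov.+", "ulov"))
# {n,k}: between n and k repetitions
print(matches("N.{2,4}", "Ncmsn"))
```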
[lemma="rad"]
[tag="A.*"][lemma="riba"]
"ulov.*" []{0,3} [tag="N.*"]
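
To see what the query above asks for, it can be emulated on a hand-tagged toy sentence. This is a minimal sketch, not how the corpus engine evaluates queries: the function `find_matches` is hypothetical, and the tags in the sample sentence are illustrative values, not output from a real tagger.

```python
import re

# Toy sentence as (word, tag) pairs; the tags are illustrative only.
SENTENCE = [
    ("ulov", "Ncmsn"),
    ("male", "Agpfsgy"),
    ("plave", "Agpfsgy"),
    ("ribe", "Ncfsg"),
]


def find_matches(tokens, first_word, last_tag, max_gap=3):
    """Emulate  "first_word" []{0,max_gap} [tag="last_tag"]  on a token list."""
    hits = []
    for i, (word, _) in enumerate(tokens):
        # First position: the word form must match the pattern as a whole.
        if not re.fullmatch(first_word, word):
            continue
        # []{0,max_gap}: allow 0..max_gap arbitrary tokens in between.
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and re.fullmatch(last_tag, tokens[j][1]):
                hits.append(" ".join(w for w, _ in tokens[i:j + 1]))
    return hits


print(find_matches(SENTENCE, "ulov.*", "N.*"))
```

With the default gap of up to three tokens the noun "ribe" is reached across the two adjectives. With `max_gap=1` nothing is found, because only adjectives lie within reach.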
Challenges
- Search for verbs occurring before the word “ugovor” with up to 2 words in between.
- Search for words ending with “anje”.
- Search for defining contexts containing a noun in the nominative case followed by “je” followed by an adjective and noun in the nominative case.
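
Possible starting points for the three challenges, written as CQL strings inside a small Python script. These are untested sketches, not model answers. The third query assumes positional tags in which the case is the fifth character of a noun tag and the sixth character of an adjective tag (e.g. "Ncmsn", "Agpmsny"). Whether the corpus at hand uses such a tagset is an assumption to check before running it.

```python
import re

# Challenge 1: a verb, up to two arbitrary tokens, then the lemma "ugovor".
VERB_BEFORE_UGOVOR = '[tag="V.*"] []{0,2} [lemma="ugovor"]'

# Challenge 2: any word form ending in "anje".
ENDS_WITH_ANJE = '[word=".*anje"]'

# Challenge 3 (ASSUMPTION: positional tags, case = 5th char for nouns, 6th for adjectives).
NOUN_NOM = "N...n.*"
ADJ_NOM = "A....n.*"
DEFINING_CONTEXT = (
    f'[tag="{NOUN_NOM}"] [word="je"] [tag="{ADJ_NOM}"] [tag="{NOUN_NOM}"]'
)

# Quick sanity check of the tag patterns against illustrative tags.
for tag in ["Ncmsn", "Ncfsg", "Agpmsny", "Agpfsgy"]:
    is_nominative = bool(re.fullmatch(NOUN_NOM, tag) or re.fullmatch(ADJ_NOM, tag))
    print(tag, is_nominative)

print(VERB_BEFORE_UGOVOR)
print(ENDS_WITH_ANJE)
print(DEFINING_CONTEXT)
```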
Looking for definitions
- Exploit typical definition patterns:
  - [X] is a [Y]
  - [X] is defined as [Y]
  - [X] is a kind of [Y]
  - …
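
Outside the corpus workbench, the same definition patterns can be approximated with plain regular expressions over sentences. This is a minimal sketch: the pattern list and the function `find_definition` are hypothetical, and the patterns are deliberately loose, so they will also catch sentences that are not real definitions.

```python
import re

# One regex per definition pattern; the more specific patterns come first
# so that "is a kind of" is not swallowed by the generic "is a".
DEFINITION_PATTERNS = [
    re.compile(r"(?P<term>[\w\s-]+?) is defined as (?P<definition>.+)", re.IGNORECASE),
    re.compile(r"(?P<term>[\w\s-]+?) is a kind of (?P<definition>.+)", re.IGNORECASE),
    re.compile(r"(?P<term>[\w\s-]+?) is an? (?P<definition>.+)", re.IGNORECASE),
]


def find_definition(sentence):
    """Return (term, definition) for the first matching pattern, else None."""
    for pattern in DEFINITION_PATTERNS:
        match = pattern.match(sentence)
        if match:
            term = match.group("term").strip()
            definition = match.group("definition").strip(" .")
            return term, definition
    return None


# Invented sample sentences, for illustration only.
print(find_definition("Bluetongue is a viral disease of ruminants."))
print(find_definition("Sea bass is a kind of fish."))
print(find_definition("Fishing quotas were reduced."))
```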
WebBootCat
- Tool to create text collections from web pages
- User provides keywords & optionally selects sites to crawl
- Once the corpus is compiled, it can be queried or downloaded.
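
The keyword step can be pictured as follows: the seed keywords are combined into small tuples, and each tuple is used as one web search query whose result pages feed the corpus. The sketch below covers only this tuple-building step. The function `seed_tuples` and its parameters are hypothetical, and the tuple size and number of queries are illustrative defaults, not the tool's actual settings.

```python
import itertools
import random


def seed_tuples(keywords, tuple_size=3, max_tuples=5, seed=0):
    """Combine seed keywords into search-query strings, one tuple per query."""
    combos = list(itertools.combinations(keywords, tuple_size))
    # Shuffle with a fixed seed so that the selection is reproducible.
    random.Random(seed).shuffle(combos)
    return [" ".join(combo) for combo in combos[:max_tuples]]


queries = seed_tuples(["ribolov", "brancin", "grdobina", "ulov"])
for query in queries:
    print(query)
```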
TMX Upload
- Allows you to create corpora from your translation memories
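
A TMX file stores each translation unit (`tu`) as a set of language variants (`tuv`), each holding one segment (`seg`). The sketch below reads such units into aligned sentence pairs with Python's standard library. The sample TMX content is invented for illustration, and the function `read_segment_pairs` is a hypothetical helper, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented minimal TMX document, for illustration only.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example" creationtoolversion="1"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>fishing</seg></tuv>
      <tuv xml:lang="hr"><seg>ribolov</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_segment_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs for the two requested languages."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                lang = tuv.get(XML_LANG, "").lower()
                # itertext() also collects text inside inline markup elements.
                segments[lang] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


print(read_segment_pairs(SAMPLE_TMX, "en", "hr"))
```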
Terminology extraction
- Works for languages with a predefined “term grammar”
- Manage corpus -> Keywords and terms
- Terms can be exported into TBX or CSV
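
One common way to rank term candidates is to compare their relative frequency in the specialised corpus with that in a general reference corpus. The sketch below uses a smoothed frequency ratio and writes the ranked list as CSV. It is an illustration of the idea, not a description of what the tool does internally. The function names are hypothetical and all counts are invented.

```python
import csv
import io


def keyness(freq_focus, size_focus, freq_ref, size_ref, smoothing=1.0):
    """Smoothed ratio of per-million frequencies (focus corpus vs. reference corpus)."""
    fpm_focus = freq_focus * 1_000_000 / size_focus
    fpm_ref = freq_ref * 1_000_000 / size_ref
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)


def export_terms_csv(scored_terms):
    """Return a CSV string with terms sorted by descending score."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["term", "score"])
    ranked = sorted(scored_terms.items(), key=lambda item: item[1], reverse=True)
    for term, score in ranked:
        writer.writerow([term, f"{score:.2f}"])
    return buffer.getvalue()


# Invented counts: (frequency in focus corpus, frequency in reference corpus),
# both corpora assumed to contain one million tokens.
COUNTS = {"bluetongue": (50, 0), "disease": (10, 9)}
scores = {
    term: keyness(focus, 1_000_000, ref, 1_000_000)
    for term, (focus, ref) in COUNTS.items()
}
print(export_terms_csv(scores))
```

A word that is frequent in the specialised texts but rare in general language gets a high score, while a word that is common in both stays close to 1.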
Exercise
- Use the corpus-derived information on the following slides to create a term entry for “bluetongue”.