Tool for text-based terminology

Post on 29-Dec-2016

Sketch Engine: TOOL FOR TEXT-BASED TERMINOLOGY

INTEGRA TERMINOLOGY MANAGEMENT, ZAGREB

Using texts for term mining

- Building a corpus
  - Choosing texts
  - Converting into a common format (txt)
  - Annotation
    - Croatian: http://nlp.ffzg.hr/api-for-our-language-technologies/
  - Alignment
    - CAT tools (SDL, memoQ) or LF Aligner, https://sourceforge.net/p/aligner/wiki/Home/
- Searching the corpus
  - Concordance tools: AntConc (free), WordSmith (€), ParaConc (free)
  - Web-based corpus workbench: Sketch Engine, http://www.sketchengine.co.uk

What is Sketch Engine?

- Very powerful corpus workbench: https://www.sketchengine.co.uk/
- Provides access to multiple pre-compiled corpora (British National Corpus, hrWaC, DGT corpora and many more)
- NOT free, but not expensive :) (5.99 € per month)
- Allows the creation of ad hoc corpora from web texts
- Supports TMX import (for bilingual texts!)
- Provides ways to extract terminology semi-automatically
- Online tutorials: https://www.sketchengine.co.uk/sketch-engine-video-tutorials/

Simple concordances

Other query types

- simple: searches for a word and its inflected forms
- lemma: searches for all words with this lemma
- phrase: for searching multiple words
- word: to search for a specific word form
- character: to search for a string of characters
- CQL: Corpus Query Language
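
The difference between these query types can be sketched on a toy corpus. The snippet below is only an illustration in Python, not how Sketch Engine works internally: the token list, the hand-assigned lemmas and all function names are assumptions made for the example, and the "simple" query is approximated as a case-insensitive match on either the word form or the lemma.

```python
# Toy corpus of (word form, lemma) pairs. The lemmas are assigned by hand;
# in a real corpus they come from the annotation step.
TOKENS = [
    ("Ribolov", "ribolov"),
    ("je", "biti"),
    ("zabranjen", "zabraniti"),
    ("u", "u"),
    ("zoni", "zona"),
    ("ribolova", "ribolov"),
    (".", "."),
]


def word_query(tokens, form):
    """word: positions of tokens with exactly this word form."""
    return [i for i, (w, _) in enumerate(tokens) if w == form]


def lemma_query(tokens, lemma):
    """lemma: positions of all tokens that share this lemma."""
    return [i for i, (_, l) in enumerate(tokens) if l == lemma]


def character_query(tokens, chars):
    """character: positions of tokens whose form contains the string."""
    return [i for i, (w, _) in enumerate(tokens) if chars in w]


def phrase_query(tokens, phrase):
    """phrase: start positions where the word forms occur in sequence."""
    words = phrase.split()
    forms = [w for w, _ in tokens]
    last_start = len(forms) - len(words)
    return [i for i in range(last_start + 1) if forms[i:i + len(words)] == words]


def simple_query(tokens, text):
    """simple (approximation): case-insensitive match on form or lemma."""
    t = text.lower()
    return [i for i, (w, l) in enumerate(tokens) if w.lower() == t or l == t]
```

With this toy data, `word_query(TOKENS, "ribolova")` finds only the inflected form, while `lemma_query(TOKENS, "ribolov")` also finds the sentence-initial "Ribolov".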

Word Sketches

Thesaurus – similar words

Keyword extraction

Term queries in the DGT parallel corpus

- simple queries: ribolov, brancin, grdobina
- lemma queries: ribolov -> ribolova, ribolovu, ribolov
- parallel query:
- querying using CQL syntax:

Basic CQL

- Typical format: [attribute="value"], e.g. [lemma="riba"]
- Specifying word class or case: [tag="N.*"] (any noun), [tag="A.*"] (any adjective)
- Regular expressions:
  - . (dot) matches any single character
  - * (asterisk) matches 0-100 repetitions
  - + (plus) matches 1-100 repetitions
  - {n,k} specifies an exact range of repetitions, from n to k
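
The regular-expression operators can be tried out with Python's `re` module as a stand-in. This is a sketch under two assumptions: that a CQL value has to match the whole token (modelled here with `re.fullmatch`), and that Python's operators behave closely enough for illustration. Python's `*` and `+` have no fixed upper bound, so the 100-repetition limit mentioned above is not reproduced. The helper name `matches` is made up for this example.

```python
import re


def matches(pattern, word):
    """Return True if the pattern matches the whole word, not just a part."""
    return re.fullmatch(pattern, word) is not None


# . (dot): exactly one arbitrary character
dot_hit = matches("rib.", "riba")
dot_miss = matches("rib.", "ribar")

# * (asterisk): zero or more repetitions of the preceding item
star_zero = matches("ulov.*", "ulov")
star_many = matches("ulov.*", "ulovom")

# + (plus): at least one repetition of the preceding item
plus_zero = matches("ulov.+", "ulov")
plus_one = matches("ulov.+", "ulova")

# {n,k}: between n and k repetitions of the preceding item
range_in = matches("a{2,3}", "aa")
range_out = matches("a{2,3}", "aaaa")
```

Here `dot_hit`, `star_zero`, `star_many`, `plus_one` and `range_in` come out true, while `dot_miss`, `plus_zero` and `range_out` come out false.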

[lemma="rad"]

[tag="A.*"][lemma="riba"]

"ulov.*"  []{0,3}   [tag="N.*"]

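How the three example queries above read a sequence of tokens can be sketched with a small matcher in Python. This is a toy model, not Sketch Engine's query engine: each CQL position becomes a `(constraint, min, max)` step, an empty constraint stands for `[]`, and a bare quoted string such as `"ulov.*"` is assumed to test the `word` attribute. The sample sentence and its tags are hand-made placeholders that only imitate a Croatian tagset.

```python
import re

# Hand-annotated toy sentence; the tag values are illustrative placeholders.
SENTENCE = [
    {"word": "ulov", "lemma": "ulov", "tag": "Ncmsn"},
    {"word": "male", "lemma": "mali", "tag": "Agpfsgy"},
    {"word": "plave", "lemma": "plav", "tag": "Agpfsgy"},
    {"word": "ribe", "lemma": "riba", "tag": "Ncfsg"},
]

# [lemma="rad"]
LEMMA_RAD = [({"lemma": "rad"}, 1, 1)]
# [tag="A.*"][lemma="riba"]
ADJ_RIBA = [({"tag": "A.*"}, 1, 1), ({"lemma": "riba"}, 1, 1)]
# "ulov.*" []{0,3} [tag="N.*"]
ULOV_NOUN = [({"word": "ulov.*"}, 1, 1), ({}, 0, 3), ({"tag": "N.*"}, 1, 1)]


def token_ok(token, constraint):
    """Check every attribute regex; an empty constraint accepts any token."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())


def match_at(tokens, pos, steps):
    """Return the end position of a match starting at pos, or None."""
    if not steps:
        return pos
    constraint, lo, hi = steps[0]
    for count in range(lo, hi + 1):
        end = pos + count
        if end > len(tokens):
            break
        if not all(token_ok(t, constraint) for t in tokens[pos:end]):
            break  # a longer repetition would fail on the same token
        result = match_at(tokens, end, steps[1:])
        if result is not None:
            return result
    return None


def find_all(tokens, steps):
    """Return (start, end) spans for every start position that matches."""
    hits = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, steps)
        if end is not None and end > start:
            hits.append((start, end))
    return hits
```

On the toy sentence, `find_all(SENTENCE, ADJ_RIBA)` returns the span covering "plave ribe", and `find_all(SENTENCE, ULOV_NOUN)` returns the span from "ulov" to "ribe", with two tokens absorbed by the `[]{0,3}` step.
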
Challenges

- Search for verbs occurring before the word “ugovor” with up to 2 words in between.
- Search for words ending with “anje”.
- Search for defining contexts containing a noun in the nominative case followed by “je”, followed by an adjective and a noun in the nominative case.

Looking for definitions

- Exploit typical definition patterns:
  - [X] is a [Y]
  - [X] is defined as [Y]
  - [X] is a kind of [Y]
  - …
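
Outside a corpus workbench, the same definition patterns can be approximated with plain regular expressions over sentences. The following Python sketch assumes English text that is already split into sentences. The pattern list covers only the three patterns named above, and the names `DEFINITION_PATTERNS` and `find_definition` are invented for this example.

```python
import re

# Optional leading article, then a lazily matched term of one or more words.
ARTICLE = r"(?:(?:an?|the)\s+)?"
TERM = r"(?P<term>[\w-]+(?: [\w-]+)*?)"

# More specific patterns come first so that "is a kind of" is not
# swallowed by the generic "is a" pattern.
DEFINITION_PATTERNS = [
    re.compile(rf"^{ARTICLE}{TERM} is defined as (?P<definition>.+)$", re.IGNORECASE),
    re.compile(rf"^{ARTICLE}{TERM} is a kind of (?P<definition>.+)$", re.IGNORECASE),
    re.compile(rf"^{ARTICLE}{TERM} is an? (?P<definition>.+)$", re.IGNORECASE),
]


def find_definition(sentence):
    """Return (term, definition) if the sentence fits a pattern, else None."""
    text = sentence.strip()
    for pattern in DEFINITION_PATTERNS:
        m = pattern.match(text)
        if m:
            return m.group("term"), m.group("definition").rstrip(".")
    return None
```

For a made-up sentence such as "A trawl is a kind of fishing net.", `find_definition` returns the term "trawl" and the definition "fishing net". Sentences without any of the patterns return `None`. Real corpora will produce many false positives, so the hits are candidates for manual review rather than finished definitions.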

WebBootCaT

- Tool to create text collections from web pages
- User provides keywords and optionally selects sites to crawl
- When the corpus is compiled, it can be used for queries or downloaded.
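
BootCaT-style tools generally combine the user's keywords into small groups ("tuples") and send each group to a search engine as one query. How WebBootCaT does this internally is not shown on the slides, so the following Python sketch is an assumption about the general approach. The function name `seed_tuples` and its parameters are hypothetical.

```python
import itertools
import random


def seed_tuples(keywords, size=3, limit=10, seed=0):
    """Build up to `limit` search queries, each combining `size` keywords."""
    combos = list(itertools.combinations(keywords, size))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:limit]]


# Example seed words taken from the fisheries queries used earlier.
queries = seed_tuples(["ribolov", "brancin", "grdobina", "ulov"], size=3)
```

Four keywords combined three at a time give four possible queries. Each query would then be submitted to a search engine and the returned pages downloaded and cleaned.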

TMX Upload

- Allows you to create corpora from your translation memories
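
Before uploading, it can help to check what a TMX file actually contains. A TMX file stores translation units (`tu`), each with one variant (`tuv`) per language and the text in a `seg` element. The Python sketch below reads such a file with the standard library. The sample content is invented for the example, and `read_tmx` is a hypothetical helper, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal invented sample with a single translation unit.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>fishing</seg></tuv>
      <tuv xml:lang="hr"><seg>ribolov</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text):
    """Return one {language: segment text} dict per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(segments)
    return units
```

Calling `read_tmx(SAMPLE_TMX)` yields a list with one dictionary that maps "en" to "fishing" and "hr" to "ribolov".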

Terminology extraction

- Works for languages with a predefined “term grammar”
- Manage corpus -> Keywords and terms
- Terms can be exported into TBX or CSV
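
Keyword and term lists are ranked by comparing frequencies in the focus corpus with a reference corpus. Sketch Engine's documentation describes a "simple maths" score for this, the ratio of smoothed frequencies per million. The Python sketch below assumes that formula with a smoothing value of 1 and then writes the ranked candidates as CSV. All counts are invented, and the function names are hypothetical.

```python
import csv
import io


def keyness(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Ratio of smoothed per-million frequencies (focus vs. reference)."""
    focus_fpm = focus_count * 1_000_000 / focus_size
    ref_fpm = ref_count * 1_000_000 / ref_size
    return (focus_fpm + n) / (ref_fpm + n)


def terms_to_csv(rows):
    """Serialise (term, score) pairs as CSV text, highest score first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["term", "score"])
    for term, score in sorted(rows, key=lambda r: r[1], reverse=True):
        writer.writerow([term, f"{score:.2f}"])
    return buf.getvalue()


# Invented counts: 50 hits per million tokens in the focus corpus,
# 4 hits per million tokens in the reference corpus.
example_score = keyness(50, 1_000_000, 4, 1_000_000)
```

A larger smoothing value `n` favours more frequent candidates, while a smaller one favours rare candidates that are almost absent from the reference corpus.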

Exercise

- Use the corpus-derived information on the following slides to create a term entry for “bluetongue”.
