shebanq roma-2013-10-01

Data Archiving and Networked Services SHEBANQ Dirk Roorda - researcher @ DANS,TLA System for HEBrew Text: ANnotations for Queries and Markup TEI pre-conference workshop: Query Roma – 2013-10-01

Upload: dirk-roorda

Post on 26-Jun-2015

DESCRIPTION

The SHEBANQ project (at its half-way point) as a use case in querying language resources. The corpus is the text of the Hebrew Bible with linguistic features, packaged in a special text database and converted to LAF.

TRANSCRIPT

Page 1: Shebanq roma-2013-10-01

Data Archiving and Networked Services

SHEBANQ

Dirk Roorda - researcher @ DANS, TLA

System for HEBrew Text: ANnotations for Queries and Markup

TEI pre-conference workshop: Query

Roma – 2013-10-01

Page 2: Shebanq roma-2013-10-01

Overview

1.  Context: text, data, research in Hebrew Bible

2.  MdF database model, MQL query language

3.  Sharing the research process

4.  CLARIN-NL project: SHEBANQ

5.  Towards new tools

Page 3: Shebanq roma-2013-10-01

1 (of 5) Context

Text, data and research in the Hebrew Bible

Page 4: Shebanq roma-2013-10-01

VU Amsterdam

Eep Talstra Centre for Bible and Computer

text + linguistic features => database

database + research questions => publications

Page 5: Shebanq roma-2013-10-01

2 (of 5) MdF and MQL

•  MdF database model

•  MQL query language

Page 6: Shebanq roma-2013-10-01

Monad Object Feature

1977-now: Eep Talstra et al. ECA, WIVU. Print reference (Google Books)

1988-1994 Crist-Jan Doedens: Text Databases – One Database Model and Several Retrieval Languages (google books reference)

2004: Ulrik Petersen. Emdros - a text database engine for analyzed or annotated text. COLING

Page 7: Shebanq roma-2013-10-01

Monad-Object-Feature

[Figure: Genesis 1:1 in the standard edition text, divided into monads (atomic chunks of text) numbered 1 to 11. Layers of objects are built on top of the monads: word objects, subphrase objects, phrase_atom objects, phrase objects, clause_atom objects and sentence objects.]

Text: בראשית ברא אלהים את השמים ואת הארץ׃

Features of a word object:

lexeme_utf8 = ראשית

old_lexeme_utf8 = ראשית

vocalized_lexeme_utf8 = ראשית

surface_consonants_utf8 = ראשית

graphical_lexeme_utf8 = ראשי

Features of a clause_atom object:

clause_atom_number = 1

clause_atom_relation = 0

clause_atom_relation_daughter_tense = unknown

clause_atom_relation_kind = No_relation

clause_atom_relation_mother_tense = unknown

clause_atom_relation_preposition_class = none

clause_atom_type = xQtl

indentation = 0
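
The Monad-Object-Feature idea on this slide can be pictured with a small Python sketch. It is only an illustration, not the Emdros implementation: the class name `TextObject`, the methods `embeds` and `precedes`, and the monad numbers assigned to each object are assumptions made for the example. Only the feature names and values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class TextObject:
    """An object in the text database: a type, a set of monads, and features."""
    object_type: str                      # e.g. "word", "phrase", "clause_atom"
    monads: frozenset                     # integer positions the object occupies
    features: dict = field(default_factory=dict)

    def embeds(self, other: "TextObject") -> bool:
        """True if every monad of `other` also belongs to this object."""
        return other.monads <= self.monads

    def precedes(self, other: "TextObject") -> bool:
        """True if this object ends before `other` begins."""
        return max(self.monads) < min(other.monads)


# Monad assignments below are illustrative guesses, not taken from the database.
word = TextObject("word", frozenset({2}), {"lexeme_utf8": "ראשית"})
phrase = TextObject("phrase", frozenset({1, 2}))
verb_phrase = TextObject("phrase", frozenset({3}))
clause_atom = TextObject(
    "clause_atom",
    frozenset(range(1, 12)),              # monads 1..11, the whole verse
    {"clause_atom_type": "xQtl", "indentation": 0},
)

print(clause_atom.embeds(phrase), phrase.embeds(word), phrase.precedes(verb_phrase))
```

Treating objects as sets of monads makes embedding a subset test and sequence a comparison of positions, which is all the sketch relies on.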

Page 8: Shebanq roma-2013-10-01

MQL query language

topographic, i.e.:

the structure of the query expression mirrors (=~=) the structure of the query results w.r.t.

•  sequence

•  embedding

Page 9: Shebanq roma-2013-10-01

Example

SELECT ALL OBJECTS
WHERE
[Clause
  [Phrase
    [Word FOCUS
      part_of_speech = verb AND
      lexeme = "FJM["
    ]
  ]
  ..
  [Phrase FOCUS
    phrase_function = Objc OR
    phrase_function = IrpO
  ]
  ..
  [Phrase FOCUS
    phrase_function = Objc OR
    phrase_function = IrpO
  ]
]
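
What this query asks for can be approximated in plain Python on toy data: within one clause, a phrase containing the verb, followed at any distance by two phrases whose function is Objc or IrpO. This is a rough sketch, not how the MQL engine evaluates queries. The dictionary layout, the function names, and the phrase functions `Pred` and `Cmpl` in the toy clause are assumptions made for illustration.

```python
from itertools import combinations

OBJECT_FUNCTIONS = {"Objc", "IrpO"}


def has_verb(phrase, lexeme):
    """True if the phrase contains a verb word with the given lexeme."""
    return any(
        w["part_of_speech"] == "verb" and w["lexeme"] == lexeme
        for w in phrase["words"]
    )


def double_object_matches(clause, lexeme):
    """Return (verb_phrase, object_1, object_2) index triples, in text order."""
    phrases = clause["phrases"]
    hits = []
    for i, phrase in enumerate(phrases):
        if not has_verb(phrase, lexeme):
            continue
        # Object phrases after the verb phrase; gaps are allowed, like ".." in MQL.
        later_objects = [
            j for j in range(i + 1, len(phrases))
            if phrases[j]["function"] in OBJECT_FUNCTIONS
        ]
        hits.extend((i, a, b) for a, b in combinations(later_objects, 2))
    return hits


# Toy clause, invented for the example: verb phrase, object, complement, object.
toy_clause = {"phrases": [
    {"function": "Pred", "words": [{"part_of_speech": "verb", "lexeme": "FJM["}]},
    {"function": "Objc", "words": []},
    {"function": "Cmpl", "words": []},
    {"function": "Objc", "words": []},
]}

print(double_object_matches(toy_clause, "FJM["))  # [(0, 1, 3)]
```

The nesting of the loops follows the nesting of the query (embedding), and the index ordering follows the order of the blocks (sequence).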

Page 10: Shebanq roma-2013-10-01

3 (of 5) Sharing

Problem: how to share (intermediate) results of analysis

Solution: saving queries as annotations

Page 11: Shebanq roma-2013-10-01

Lock-in

scholarly-bibles.com

Stuttgart Electronic Study Bible

⇒ massive dissemination

But

⇒ not the right dynamics for tool development

Page 12: Shebanq roma-2013-10-01

Leiden: international workshop biblical scholarship

Desiderata:

new tool development

text transmission (variants)

linguistic analysis (features)

even combined!

a short history: 2012

leiden lorentz

Page 13: Shebanq roma-2013-10-01

Hebrew Text in the Archive

urn:nbn:nl:ui:13-ikjj-ek

Page 14: Shebanq roma-2013-10-01

Hebrew Text in the Archive

urn:nbn:nl:ui:13-ikjj-ek

how can people annotate our work?

Page 15: Shebanq roma-2013-10-01

Research Data Cycle

Page 16: Shebanq roma-2013-10-01

Research Data Cycle

[Diagram of the research data cycle, with these labels:]

Text transmission, tradition, editorial processes

Free University, theology faculty, server department, WIVU project

NWO projects

religious communities

theol. scholars

enlightened lay people

scholarly-bibles.com

Page 17: Shebanq roma-2013-10-01

Research Data Cycle

[Diagram of the research data cycle, extended with SHEBANQ and archiving, with these labels:]

Text transmission, tradition, editorial processes

Free University, theology faculty, server department, WIVU project

NWO projects

religious communities

theol. scholars

CLARIN SHEBANQ

linguists

Wider public: Annotation, Query Saving, via Linked Data

dig. hum

comp. hum

enlightened lay people

scholarly-bibles.com

Research Data Archiving

DANS

Page 18: Shebanq roma-2013-10-01

3 (of 5) Sharing (cont'd)

Solution: Queries As Annotations

Page 19: Shebanq roma-2013-10-01

queries-as-annotations

model | query | example

body | query instruction | SELECT ALL OBJECTS WHERE [Word FOCUS part_of_speech = verb AND lexeme = "שים"]

targets | query results in context | ו ישכם יעקב ב בקר ו יקח את ה אבן אשר שם מראשתיו ו ישם אתה מצבה ו יצק שמן על ראשה

annotation | published query | qu123 (just an identifier)

metadata | researcher, date created, date last run, research question | Janet Dyk; 2004-02-16; 2012-01-27; Can the verb שים have a double object? - article in Foundations for Syriac Lexicography
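
The mapping in this table can be sketched as a plain Python record. This is an illustration under assumptions, not the project's actual data format. The function name `make_query_annotation`, the dictionary keys, and the target identifier `result-1` are hypothetical. The query text, identifier and metadata values are the ones on the slide.

```python
import json


def make_query_annotation(identifier, query, targets, metadata):
    """Bundle a query (body), its results (targets) and its metadata."""
    return {
        "annotation": identifier,   # the published query
        "body": query,              # the query instruction
        "targets": list(targets),   # the query results in context
        "metadata": dict(metadata),
    }


annotation = make_query_annotation(
    identifier="qu123",
    query='SELECT ALL OBJECTS WHERE '
          '[Word FOCUS part_of_speech = verb AND lexeme = "שים"]',
    targets=["result-1"],           # hypothetical placeholder for a result passage
    metadata={
        "researcher": "Janet Dyk",
        "date_created": "2004-02-16",
        "date_last_run": "2012-01-27",
        "research_question": "Can the verb שים have a double object?",
    },
)

# ensure_ascii=False keeps the Hebrew readable in the serialized record.
print(json.dumps(annotation, ensure_ascii=False, indent=2))
```

Keeping the record as plain data means it can be serialized, archived and re-run later, which is the point of treating a query as an annotation.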

Page 20: Shebanq roma-2013-10-01

OpenAnnotation

openannotation.org

Page 21: Shebanq roma-2013-10-01

provenance

Page 22: Shebanq roma-2013-10-01

motivation

Page 23: Shebanq roma-2013-10-01

demonstrator

datanetworkservice.nl/qaa

Page 24: Shebanq roma-2013-10-01

demonstrator

datanetworkservice.nl/qaa

Page 25: Shebanq roma-2013-10-01

demonstrator

datanetworkservice.nl/qaa

Page 26: Shebanq roma-2013-10-01

demonstrator

datanetworkservice.nl/qaa

Page 27: Shebanq roma-2013-10-01

demonstrator

Page 28: Shebanq roma-2013-10-01

demonstrator

Page 29: Shebanq roma-2013-10-01

demonstrator

Page 30: Shebanq roma-2013-10-01

demonstrator

still missing:

saving queries

not semantic-web-enabled

sustainability

Page 31: Shebanq roma-2013-10-01

4 (of 5) Project

CLARIN-NL: SHEBANQ:

(A) Curation

(B) Demonstrator

Page 32: Shebanq roma-2013-10-01

SHEBANQ

System for Hebrew Text: ANnotations for Queries

CLARIN-NL project

data curation: LAF

demonstrator: query saver

#!/etc bc

s/g$/q/

Page 33: Shebanq roma-2013-10-01

Linguistic Annotation Framework

ISO 24612:2012

Nancy Ide, Laurent Romary

Page 34: Shebanq roma-2013-10-01
Page 35: Shebanq roma-2013-10-01
Page 36: Shebanq roma-2013-10-01
Page 37: Shebanq roma-2013-10-01
Page 38: Shebanq roma-2013-10-01

feature definitions

Page 39: Shebanq roma-2013-10-01

feature definitions

Page 40: Shebanq roma-2013-10-01

TEI ISO-FS schema

Page 41: Shebanq roma-2013-10-01

dcr:datcat on <fDecl> versus <f>

26,225,966 <f>s

2.5 GB redundant attribute material!
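
The two figures on this slide imply a per-element cost that is easy to work out. The sketch below is back-of-the-envelope arithmetic only. It assumes that "GB" means 10^9 bytes, and the helper `redundant_bytes` is a hypothetical function, not part of any LAF or TEI tool.

```python
F_ELEMENTS = 26_225_966        # number of <f> elements, from the slide
REDUNDANT_BYTES = 2.5e9        # 2.5 GB, assuming decimal gigabytes


def redundant_bytes(n_elements: int, attribute_text: str) -> int:
    """Bytes spent if `attribute_text` is repeated on every element."""
    return n_elements * len(attribute_text.encode("utf-8"))


# Average redundant material carried by each <f> element.
bytes_per_f = REDUNDANT_BYTES / F_ELEMENTS
print(round(bytes_per_f))      # 95

# Tiny check of the helper: a 6-byte attribute on 3 elements costs 18 bytes.
print(redundant_bytes(3, ' a="b"'))  # 18
```

Roughly 95 bytes per <f> is about the size of one attribute carrying a URI, which is why declaring it once per feature on <fDecl> rather than on every <f> makes such a difference.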

Page 42: Shebanq roma-2013-10-01

4 (of 5) Project (cont'd)

CLARIN-NL: SHEBANQ: (B) Demonstrator

Page 43: Shebanq roma-2013-10-01

[Mockup of the query saver: a query editor, a list of query results, and a form for saving the query.]

Edit Query

select all objects where
[clause [phrase phrase_function = Objc [word FOCUS tense = infinitive_absolute] ]]

Execute

Executing query ...

Query executed

Query results

Columns: Passage, Text, Controls (view in context)

Passages shown: Gen 1:1, 2Chron 3:4, 1Sam 12:4, Ex 23:2

Sample text shown: בראשית ברא אלהים את השמים ואת הארץ׃

Sample text shown: ויאמר חזקיהו מה אות כי אעלה בית יהוה׃

Pager: Prev, page numbers, Next

Save this query

Name: valency ברא

Researcher: Oliver Glanz

Date created: 2013-08-25

Date last run: 2013-08-25

Project: Data and Tradition

Institute: VU/Eep Talstra Centre for Bible and Computing

Reason: irregular valency of ברא

Comments: needs to be combined with query on אלהים

Buttons: Save, Publish, Cancel

Page 44: Shebanq roma-2013-10-01

[Mockup of a saved query: the stored results together with the information recorded about the query.]

Saved Query Results

Columns: Passage, Text, Controls (view in context)

Passages shown: Gen 1:1, 2Chron 3:4, 1Sam 12:4, Ex 23:2

Sample text shown: בראשית ברא אלהים את השמים ואת הארץ׃

Sample text shown: ויאמר חזקיהו מה אות כי אעלה בית יהוה׃

Pager: Prev, page numbers, Next

Information on this query

Name: valency ברא

Researcher: Oliver Glanz

Date created: 2013-08-25

Date last run: 2013-08-25

Project: Data and Tradition

Institute: VU/Eep Talstra Centre for Bible and Computing

Reason: irregular valency of ברא

Comments: needs to be combined with query on אלהים

Query Info

MQL query text:

select all objects where
[clause [phrase phrase_function = Objc [word FOCUS tense = infinitive_absolute] ]]

Persistent Identifier: urn:nbn:nl:ui:13-scpm-ji

http://www.persistent-identifier.nl/?identifier=urn...

Page 45: Shebanq roma-2013-10-01

datanetworkservice.nl/qaa

Page 46: Shebanq roma-2013-10-01

SHEBANQ: implementing Q-a-A

Page 47: Shebanq roma-2013-10-01

5 (of 5) Towards new tools

•  LAF tools

•  or generic graph algorithms

•  Emdros tools

•  or generic database technology

•  Linked Data tools

•  or generic SPARQL queries

Page 48: Shebanq roma-2013-10-01

Side conditions •  development close to the researchers

•  preferably in their own institutions

•  decent performance

•  within the scale of a laptop

•  usable to researchers

•  that is: non-programmers

•  persistence in mind

•  new results will be archived and re-enter the data cycle

Page 49: Shebanq roma-2013-10-01

thank you

[email protected]

slideshare.net/dirkroorda/

s/g$/q/

#!/etc bc Eep Talstra Centre for Bible and Computer