
Upload: databaseguys

Post on 20-May-2015


Page 1: Overheads

Using XML files as real corpora

making an XML database with the dbXML program

http://www.dbxml.com

Page 2: Overheads

The dbXML program

• The dbXML program is one of a range of programs that let you use a set of XML files as a database.

• The program is free and can be downloaded from the web.

• Many more programs like this are likely to appear over the next couple of years.

Page 3: Overheads

Basic concepts

• Using a database requires the following basic concepts:

– the set of files you are looking at is called a collection

– a collection of files must be indexed so that the program can find things quickly

– you ask questions by posting queries to the database manager
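The three concepts can be illustrated with a minimal Python sketch. It is not how dbXML works internally; it only uses the standard library, and the file names and XML content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus: file name -> XML text (invented for illustration).
FILES = {
    "dialogue1.xml": "<dialogue><turn speaker='A'>Hello</turn><turn speaker='B'>Hi</turn></dialogue>",
    "dialogue2.xml": "<dialogue><turn speaker='A'>Bye</turn></dialogue>",
}

# 1. Collection: the set of files being looked at, parsed once.
collection = {name: ET.fromstring(text) for name, text in FILES.items()}

# 2. Index: element name -> list of (file name, element), built in one pass
#    so that later lookups do not have to re-scan every file.
index = {}
for name, root in collection.items():
    for element in root.iter():
        index.setdefault(element.tag, []).append((name, element))

# 3. Query: a question answered by looking things up in the index.
def query(tag):
    return [(name, element.text) for name, element in index.get(tag, [])]

print(query("turn"))
```

A real database manager does the same three things on disk and at a much larger scale.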

Page 4: Overheads

Using the dbXML program to manage an XML database

• Our starting point assumes that we have some set of marked-up XML files that we want to manage.

• We first set up these files as a database.

• We then use the dbXML tool to extract information from this database.

Page 5: Overheads

Example XML files in our data set

Page 6: Overheads

Steps…

• Now we will see:

– how to add a collection of files to a database

– how to index those files

– how to ask queries to get information about the content of those files

Page 7: Overheads

Getting started… (1)

• First, we need to start up the dbXML server program.

This is the program that does all the actual work.

To do this:

– Make sure you know where the dbxml folder is

– Run the program startup-server.bat in that folder (e.g., by double clicking on it).

– This should start the dbxml server with a message like:

dbXML 2.0 (Dragonfly)

Logging to E:\junk\logging\dbXML.out

Page 8: Overheads

Getting started… (2)

• Next, we turn a set of XML files into an XML database. To do this we must start the dbxml administration program and tell it which files to use.

– Start a DOS command window

– Make sure you know where the dbxml folder is

– Run the command ‘startup-command-line.bat’ that is in the dbxml folder

– This should then start the dbxml program and you should get something that looks like the window on the next slide…

Page 9: Overheads

The program when it starts…

Page 10: Overheads

The DBXML administration actions

• Now you can tell the program which files you want to include in your database.

– To do this, you first have to log in to the program:

connect user= scott pass= tiger

You must use exactly this name and password for the moment!

– Then make a collection:

mkcol myXMLfiles

– Finally, go to the collection and say that everyone is allowed to look at it and exit:

col myXMLfiles

grant admin READ WRITE EXECUTE CREATE

exit

Page 11: Overheads

The dbXML program proper

• With the administrative details out of the way, we can start the main program.

• Find the dbxml item in the normal program start menu from Windows and click on it.

• This should bring up the following window:

If it does not, or if you cannot find it, you will have to ask for help.

Page 12: Overheads

Finding your collection

Expand the items in the list under “localhost” until you find the collection that you made in the previous step.

Page 13: Overheads

Finding your collection

Page 14: Overheads

Adding files to your collection

Expand your collection to find the ‘documents’

Click on this.

Select ‘Documents>Import Documents’ from the menu bar.

You will then be asked which files are to be added to the collection.


Page 15: Overheads

When you have added your documents (select them all in one go if possible)…

… you then have to index them…

Page 16: Overheads

Select the indexes folder in your collection…

Page 17: Overheads

Define an index as follows…

1. Give the index a name

2. Then you must type “pattern=*@*” to index all ELEMENTS + ATTRIBUTES

3. and click on create.
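To make "index all elements and attributes" concrete, here is a rough Python sketch of an index that records every element and every attribute in a set of documents. It is only an illustration of the idea, not dbXML's own index format; the function name `build_index` and the sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(documents):
    """Index every element and every attribute in `documents` (name -> XML text)."""
    elements = defaultdict(list)    # element name -> [(document name, text)]
    attributes = defaultdict(list)  # "element@attribute" -> [(document name, value)]
    for doc_name, xml_text in documents.items():
        for node in ET.fromstring(xml_text).iter():
            elements[node.tag].append((doc_name, (node.text or "").strip()))
            for attr_name, value in node.attrib.items():
                attributes[f"{node.tag}@{attr_name}"].append((doc_name, value))
    return elements, attributes

# Hypothetical one-document collection, invented for illustration.
docs = {"d1.xml": "<dialogue id='d1'><turn speaker='A'>Hello</turn></dialogue>"}
elements, attributes = build_index(docs)
print(sorted(elements))
print(sorted(attributes))
```

With an index like this, a question about a particular element or attribute becomes a dictionary lookup instead of a scan through every file.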

Page 18: Overheads

… you can now ask questions about their content

• using XPath

• XSLT

• full text

QUERY WINDOW

RESULT WINDOW

Page 19: Overheads

Selecting all ‘turns’ in the corpus
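The query shown on this slide is not preserved in the transcript. Assuming the elements are called `turn`, the XPath query would be something like `//turn`. The same selection can be tried outside dbXML with Python's standard library, as in this small sketch; the sample corpus is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample corpus; the element name "turn" is assumed from the slide title.
SAMPLE = """<corpus>
  <dialogue><turn speaker="A">Hello</turn><turn speaker="B">Hi</turn></dialogue>
  <dialogue><turn speaker="A">Bye</turn></dialogue>
</corpus>"""

root = ET.fromstring(SAMPLE)

# ".//turn" is ElementTree's spelling of the XPath query "//turn":
# every <turn> element at any depth below the root.
turns = root.findall(".//turn")

for turn in turns:
    print(ET.tostring(turn, encoding="unicode").strip())
```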

Page 20: Overheads

Selecting all ‘attrib’ in the corpus
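The transcript does not say whether ‘attrib’ is an element or an attribute. If it is an attribute, the XPath query would be along the lines of `//@attrib`. Python's ElementTree supports only a subset of XPath and cannot return attribute nodes directly, so this sketch selects the elements that carry the attribute and then reads its value. The sample data is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; "attrib" is assumed here to be an attribute name.
SAMPLE = """<corpus>
  <turn attrib="greeting">Hello</turn>
  <turn>Hi</turn>
  <note attrib="closing">Bye</note>
</corpus>"""

root = ET.fromstring(SAMPLE)

# ".//*[@attrib]" selects every element, at any depth, that has an "attrib" attribute.
carriers = root.findall(".//*[@attrib]")
values = [(element.tag, element.get("attrib")) for element in carriers]
print(values)
```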

Page 21: Overheads

The results…

• are presented as XML

• therefore you can pass them straight to a style sheet to look at them…
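One possible way to do this, sketched in Python: wrap the result fragments in a single root element and add an `xml-stylesheet` processing instruction, so that a browser or XSLT processor knows which style sheet to apply. The function name `wrap_results`, the `results` wrapper element, and the file name `turns.xsl` are all hypothetical; dbXML may offer its own way of applying a style sheet.

```python
import xml.etree.ElementTree as ET

def wrap_results(fragments, stylesheet_href):
    """Combine XML result fragments into one document that points at a style sheet."""
    root = ET.Element("results")  # hypothetical wrapper element
    for fragment in fragments:
        root.append(ET.fromstring(fragment))
    body = ET.tostring(root, encoding="unicode")
    # Standard processing instruction that associates a style sheet with the document.
    pi = f'<?xml-stylesheet type="text/xsl" href="{stylesheet_href}"?>'
    return '<?xml version="1.0"?>\n' + pi + "\n" + body

# Invented query results and a hypothetical style sheet name.
hits = ['<turn speaker="A">Hello</turn>', '<turn speaker="B">Hi</turn>']
document = wrap_results(hits, "turns.xsl")
print(document)
```

Saving this output next to the style sheet and opening it in a browser that supports XSLT would display the formatted results.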