Native XML Database for Information Systems

Chris Wallace
SMRG Seminar

Feb 2006

Chris Wallace, SMRG Seminar, Feb 2006

Exploring the design space

• “design as a conversation with the materials in the situation” (Schön)
• Native XML database (NXD)
  – Storing, querying and updating XML documents without mapping into relations
  – Schema-free
  – Trees are to NXD what tables are to RDBMS
  – Tables are trees
• Information Systems
  – Focus on semi-structured data (mixture of simple data items, text and complex nested structures)
  – Searching, derived data, visualisation
  – Process support
  – Large problem space variously supported by spreadsheets, Word documents, ad-hoc databases and, increasingly, web-integrated data
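The point that tables are trees can be made concrete: a table is a root element whose children are rows, each with one child per column. The following is a minimal sketch in Python using only the standard library. The element names `rooms` and `row`, the column names and the room "2Q50" are illustrative assumptions, not taken from the FOLD.

```python
import xml.etree.ElementTree as ET

def table_to_tree(name, columns, rows):
    """Represent a table as an XML tree: one element per row, one child per column."""
    root = ET.Element(name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return root

# Hypothetical sample data for illustration only.
rooms = table_to_tree("rooms", ["code", "capacity"], [("3P14", 30), ("2Q50", 60)])
print(ET.tostring(rooms, encoding="unicode"))
```

The reverse mapping does not hold in general: a tree with nested or repeating children has no single flat table equivalent, which is where the NXD approach differs from an RDBMS.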

eXist Native XML Database

• Open source Java
• European team of developers led by Wolfgang Meier
• Documents (files) are organised in collections (folders) in a file store
  – XML documents stored in an efficient B+ tree structure with indexes
  – Non-XML resources (XQuery, CSS, JPEG, etc.) can be stored as binary
• Deployable in different ways
  – Embedded in a Java application
  – Part of a Cocoon pipeline
  – As a web application in Apache/Tomcat
  – With embedded Jetty HTTP server (as on stocks)
• Multiple interfaces
  – REST – to Java servlet
  – SOAP
  – XML-RPC

NXD case studies

• FOLD
  – Modules, programmes, scheme operations, staff, organisational structures, events
• Family photos and history
  – Integration of meta-data on family photos with family history (births, deaths and marriages)
• ISD3 Assignment – a web-based calculator
  – e.g. a currency converter

Research Work

• Development of the FOLD (Faculty OnLine Data) - a pilot project for UWE
• Teaching students and staff in XML languages (XML Schema, XSLT, XQuery) and NXD database design
• Links with other eXist projects
• SPA2006 Workshop on NXD
• XML Prague (eXist)

Research Areas

• Design practice for NXD
  – ‘Pattern language’ to help map from conceptual model to multiple XML schemes
  – Identifier design
  – Structuring documents by responsibility and versions
• NXD in organisational use
  – Social effects of distributed responsibility
  – Visualisation of complex relationships
  – Handling integrity problems – accept inconsistency as a way of life
  – Management of veracity

The FOLD

• Faculty OnLine Data
• Technologies
  – eXist
  – (Java) – not yet
  – XQuery
  – XSLT
  – CSS
  – PHP – to be eliminated

The FOLD (2)

• Scope
  – Module and Programme specifications
  – Modular Scheme operations (runs)
  – Staff
  – Organisational structure
  – Events
• Functionality
  – Highly linked
  – (Integrating UWE sources)
  – (Personalized Interface)

FOLD - Modules and Programmes

[Figure: UML class diagram of the FOLD module and programme model. Classes and attributes recoverable from the diagram:]

• Module
  – moduleCode : String
• Module Specification
  – version : Year
  – faculty : Faculty
  – field : Field
  – title : String
  – credits : CreditsType
  – level : LevelType
  – syllabus : RestrictedHTML
  – readingStrategy : RestrictedHTML
• Programme
  – programmeCode : String
  – ucasCode : String [0..1]
• ProgrammeStructure
  – version : Year
• Stage
• OptionGroup
  – id : String
  – comment : String [0..1]
  – minCredits : int
  – maxCredits : int
• Core
• Option
• Module Combination
  – comment : String
• Learning Outcome
  – assessed in Comp A : boolean
  – assessed in Comp B : boolean
  – specification : RestrictedHTML
  – outcomeType : Learning Outcome
• Reading item
• Book
  – authors : String
  – title : String
  – year : String
  – source : String
• WebSite
  – url : URL
  – text : String
• Association role names: definition, structure, core, optional, pre-requisite, co-requisite, expression, Excluded
• Multiplicities shown: 1..1, 0..1, 1..*, 1..* {ordered}
• Note on "expression": this is a boolean expression such as (m1 and m2 and (m4 or (m5 and m6)))
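The boolean pre-requisite expression noted in the diagram can be evaluated against the set of modules a student has passed. The sketch below is one possible way to do this in Python. The nested-tuple representation and the function name `satisfied` are assumptions made for illustration; the slides do not say how the FOLD stores or evaluates these expressions.

```python
def satisfied(expr, passed):
    """Evaluate a nested pre-requisite expression against the set of module codes passed.

    expr is either a module code (str) or a tuple ("and" | "or", operand, operand, ...).
    """
    if isinstance(expr, str):
        return expr in passed
    operator, *operands = expr
    results = (satisfied(operand, passed) for operand in operands)
    if operator == "and":
        return all(results)
    if operator == "or":
        return any(results)
    raise ValueError(f"unknown operator: {operator}")

# (m1 and m2 and (m4 or (m5 and m6)))
prerequisite = ("and", "m1", "m2", ("or", "m4", ("and", "m5", "m6")))
```

In an XML document the same tree shape would map naturally onto nested elements, which is part of why a complex expression like this sits more comfortably in a tree than in a relational table.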

FOLD Design Issues

• Conceptual Modelling
• Conceptual – Logical – Physical mapping
• Identifiers
• Relationships and links
• Versioning
• Editing
• Views
• Responsibilities
• Processes

Mapping from Conceptual model to the Logical and Physical layers

• What criteria to use in breaking up the whole model into
  – Logical
    • Entity – a logical compound structure
  – Physical
    • Documents – a physical aggregation of entity instances
    • Collections – a physical aggregation of documents
• Examples
  – Module Specification [moduleCode]
    • Module Spec is an Entity
    • Each Module Spec is a Document
  – Module Run [moduleCode/year/runNo]
    • Module Run is an Entity
    • Set of Module Runs for a Field is a Document
• Issues
  – Where to develop Schemas?
  – No logical data in the physical – purely for convenience

Conceptual Modelling

• Conventional normalised data model
• Generality issue e.g. Module run
  – Roles as Attributes
    • <ModuleLeader>Stewart Green</ModuleLeader>
  – Roles as Entities
    • <role><title>Module Leader</title><person>Stewart Green</person></role>
  – Entities enable meta data, but defeat use of tables for data entry
    • Need views
• Attributes v elements – a Conceptual/logical mapping issue
  – <Module code="UFIEKG-20-3" level="3">…
  – <Module><ModuleCode>UFIEKG-20-3</ModuleCode>..
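The trade-off between roles as attributes and roles as entities shows up in how each form is queried. A minimal sketch in Python, using the two fragments from the slide: the wrapping `ModuleRun` element and the function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Roles as attributes: one dedicated element per role.
run_a = ET.fromstring(
    "<ModuleRun><ModuleLeader>Stewart Green</ModuleLeader></ModuleRun>"
)

# Roles as entities: a generic <role> element that could carry further meta data.
run_b = ET.fromstring(
    "<ModuleRun><role><title>Module Leader</title>"
    "<person>Stewart Green</person></role></ModuleRun>"
)

def leader_from_attribute_style(run):
    """Direct lookup: the role name is the element name."""
    return run.findtext("ModuleLeader")

def leader_from_entity_style(run):
    """Search the generic role elements for the one with the wanted title."""
    for role in run.findall("role"):
        if role.findtext("title") == "Module Leader":
            return role.findtext("person")
    return None
```

The attribute style maps straight onto a spreadsheet column; the entity style needs a search, which is one reason a view is wanted when tables are used for data entry.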

Conceptual Modelling Tools

• UML class model closest to suitable conceptual model
  – Allows multi-valued attributes
  – Distinguished relationship kinds
    • Composition
    • Bi-directional associations
    • Uni-directional associations (for multiplicity resolution)
  – QSEE/Rose
    • No identifiers (primary keys) ??
    • No indication of mapping to attributes or elements
    • No mapping into Entities
    • No mapping into Documents and Collections

Identifiers

• Principle adopted – use naturally occurring identifiers wherever possible
  – Persons : “Ian Beeson”
  – Rooms : “3P14”
• Plus
  – Reduces gap between RW (real-world) domain and system
  – Names in minutes of meetings, on spreadsheets are readable
• Minus
  – Duplicates
    • Duplicates not tolerable in the RW either, resolved through RW negotiation within a RW namespace e.g. the Faculty
    • Mergers generate duplicates
  – Aliases
  – Not all entities have unique identifiers
    • Programmes – ISIS Primary Award and UCAS are candidates but don’t work
• ?
  – All names need namespace – “Ian Beeson” at CEMS at UWE
  – Need to replace multiple naming conventions with a single naming scheme (e.g. initials)
  – URNs and semantic web

Alias handling

• Problem handling aliases in staff data
  – Currently a person can have multiple names – first is the prime
  – Better is a separate alias table
    • Look up the base table
    • If not found, try the alias table
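The two-step lookup with a separate alias table could be sketched as follows in Python. The dictionaries stand in for the base and alias tables, and the alias "I. Beeson" is a made-up example; neither reflects the actual FOLD staff data.

```python
def resolve_person(name, staff, aliases):
    """Look the name up in the base table first; if not found, try the alias table."""
    if name in staff:
        return name
    prime = aliases.get(name)
    if prime in staff:
        return prime
    return None

staff = {"Ian Beeson": {"faculty": "CEMS"}}
aliases = {"I. Beeson": "Ian Beeson"}  # hypothetical alias
```

Keeping aliases in their own table leaves exactly one prime name per person in the base table, so the prime name can still serve as the natural identifier.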

Relationships and Links

• Relationships need to be implemented
  – One – Many
    • RDBMS – primary key on the One side becomes foreign key on the Many side
    • NXD – choose which side on the basis of complexity and responsibility
      – Sequence (modules in a stage)
      – Complex (pre-requisite expression)
  – Many-Many
    • RDBMS – intersection table
    • NXD
      – as for one-many
      – or either side as appropriate – Groups and subgroups
• Issues
  – Referential integrity
    • RDBMS – ‘eager’
      – data not allowed in unless links OK, links maintained through updates
      – integrity failures transient, repair outside database
    • NXD – ‘lazy’
      – store the data and provide on-demand or on-trigger validation
      – integrity failures can be persisted (XLinkit) and repair is inside database
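The 'lazy' approach to referential integrity can be illustrated with an on-demand check that stores nothing but reports dangling references. This is a sketch in Python under stated assumptions: the `Stage`/`Core` fragment, the `IntegrityReport` element names and the second module code are invented for the example, and this is not how XLinkit itself works.

```python
import xml.etree.ElementTree as ET

def broken_links(programme, known_module_codes):
    """On-demand check: return referenced module codes that do not exist."""
    referenced = [ref.text for ref in programme.iter("ModuleCode")]
    return [code for code in referenced if code not in known_module_codes]

def integrity_report(programme, known_module_codes):
    """Record the failures as a document of their own, so that repair can
    happen later, inside the database, rather than blocking the update."""
    report = ET.Element("IntegrityReport")
    for code in broken_links(programme, known_module_codes):
        ET.SubElement(report, "MissingModule").text = code
    return report

programme = ET.fromstring(
    "<Stage><Core><ModuleCode>UFIEKG-20-3</ModuleCode>"
    "<ModuleCode>UFXXXX-00-0</ModuleCode></Core></Stage>"  # second code is made up
)
known = {"UFIEKG-20-3"}
```

The data with the broken link is accepted and stored; only the report changes when the link is later repaired.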

Versioning

• Based on yearly cycle
  – Base Year set in user’s session
  – Default set in system config
• Two different approaches
  – Module Run, Coursework Elements..
    • Explicit version identifier
      – ModuleCode/Year/RunNo
      – Selection is explicit [Year= $year]
  – Module Specification, Programme Structure
    • Implicit version defined by sequence of versions

Implicit Versioning

[Figure: timeline of versions 2002, 2005 and 2007]

• Year = 2004: latest version = 2002
• Year = 2006: latest version = 2005

Implicit Versioning

let $specPath := "/db/versionTest",
    $currentYear := "2005",
    $moduleCode := request:request-parameter("moduleCode", ""),
    $year := request:request-parameter("year", $currentYear),
    (: get the set of possible versions for this module :)
    $modspecs := collection($specPath)/moduleSpecification
                     [ModuleCode = $moduleCode]
                     [Version <= $year],
    (: select the version with the highest version number :)
    $modspec := $modspecs[Version = max($modspecs/Version)]
return $modspec
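The selection rule in the query (the highest version not later than the requested year) can be restated outside XQuery as a small function. This Python sketch uses the version years from the timeline slide; the function name `latest_version` is an assumption.

```python
def latest_version(versions, year):
    """Return the highest version year that is not later than the requested year, or None."""
    candidates = [v for v in versions if v <= year]
    return max(candidates) if candidates else None

versions = [2002, 2005, 2007]
```

For a year earlier than the first version the function returns `None`, which corresponds to the XQuery returning an empty sequence.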

Editing

• Table structured Document editing
  – Allows maintenance using familiar Spreadsheet tools (Excel 2003)
  – Schema is induced by Excel
  – Accommodations
    • Multi-valued fields as concatenated values
      – XPath string-join and tokenize functions
      – Embedded separator problem (a name with ‘,’ as a legitimate character)
      – Defeats indexing
    • Optional elements increase table width
    • Formatting choices not maintained (e.g. Freeze-Window)
• Structured Document editing
  – Allows maintenance with Word without a schema
    • With difficulty – no schema awareness
  – Use InfoPath to create desktop form based on schema
    • Need to redo if schema changes
• In-situ Updates
  – With XQuery-generated forms and update
  – With XForms
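The accommodation for multi-valued fields, and the embedded separator problem it brings, can be shown in a few lines. This Python sketch mirrors what joining and tokenising do in XPath; the sample names are illustrative and the helper names are assumptions.

```python
def join_values(values, separator=","):
    """Concatenate a multi-valued field into one spreadsheet cell."""
    return separator.join(values)

def split_values(cell, separator=","):
    """Recover the individual values from a concatenated cell."""
    return [part.strip() for part in cell.split(separator) if part.strip()]

leaders = ["Stewart Green", "Ian Beeson"]
cell = join_values(leaders)
round_trip = split_values(cell)

# Embedded separator problem: one name containing a comma comes back as two values.
awkward = ["Green, Stewart"]
broken = split_values(join_values(awkward))
```

Choosing a separator that cannot occur in the data, or escaping it, avoids the second case, but the concatenated cell still cannot be indexed value by value.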

Views

• Views arise from the need for de-normalisation
  – Coursework Element
    • As a simple element
      – Key : moduleCode/Year/runNo/elementNo
      – Data: due date
    • As a derived complex element
      – SuggestedHours (computed from Hours table)
      – Late date (computed from UWE calendar)
      – Weightings (extracted from relevant specification)
      – Module Leader (extracted from Module Run)
• Views as transient or materialized
• View definition
• View maintenance

declare function fold:courseworkElement($moduleCode, $year, $runNo, $elementNo) {
  let $mod := fold:moduleSpecification($moduleCode, $year),
      $run := fold:moduleRun($moduleCode, $year, $runNo),
      $elementRun := fold:elementRun($moduleCode, $year, $runNo, 'B', $elementNo),
      $elementSpec := $mod/Assessment/FirstAttempt/Components/ComponentB/Element[position() = $elementNo],
      $dueDate := $elementRun/DueDate,
      $returnDate := fold:workingDays($dueDate, 20),
      $componentWeight := $mod/Assessment/Weighting/ComponentWeightB,
      $weightInComponent := data($elementSpec/Weight),
      $weightInModule := round($weightInComponent * $componentWeight div 100),
      $load := fold:load($mod/Level),
      $hrs := round(data($mod/UWERating) div data($load/Credits) * $weightInModule div 100 * data($load/Hours))
  return
    <CourseworkElement>
      <ModuleCode>{$moduleCode}</ModuleCode>
      {$mod/Title}
      <RunNo>{$runNo}</RunNo>
      {$run/ModuleLeader}
      {$run/InternalModerator}
      {$run/ExternalExaminer}
      <Component>CW</Component>
      <ElementNo>{$elementNo}</ElementNo>
      {$elementSpec/Description}
      <SuggestedHours>{$hrs}</SuggestedHours>
      <WeightInComponent>{$weightInComponent}</WeightInComponent>
      <WeightInModule>{$weightInModule}</WeightInModule>
      <DueDate>{data($dueDate)}</DueDate>
      <ReturnDate>{data($returnDate)}</ReturnDate>
    </CourseworkElement>
};
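The two derived values in this view, the weight in the module and the suggested hours, follow directly from the arithmetic in the function. The Python sketch below repeats that arithmetic with made-up figures. It assumes that `UWERating` is the module's credit rating and that the load record gives hours per block of credits; the slides do not confirm either reading.

```python
import math

def xquery_round(value):
    """Round half up, like XQuery's round() (Python's round() rounds halves to even)."""
    return math.floor(value + 0.5)

def weight_in_module(weight_in_component, component_weight):
    """Element weight as a percentage of the whole module."""
    return xquery_round(weight_in_component * component_weight / 100)

def suggested_hours(uwe_rating, load_credits, load_hours, module_weight):
    """Share of the module's total hours that falls to this element."""
    return xquery_round(uwe_rating / load_credits * module_weight / 100 * load_hours)

# Hypothetical figures: a 20-credit module, component B worth 50% of the module,
# one element worth 40% of component B, and a load of 1200 hours per 120 credits.
w = weight_in_module(40, 50)
hours = suggested_hours(20, 120, 1200, w)
```

With these figures the element carries 20% of the module and 40 suggested hours, that is, 20% of the 200 hours implied for a 20-credit module.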

Process support

• Short term – Process support
  – Form generation
  – Linkage to process documentation
• Medium term – Process monitoring
  – Online capture of significant dates
    • Coursework hand-in date
    • Date exam sent to moderator
    • Date coursework returned to students
  – Derived information
    • Workload prediction based on coursework schedule and student numbers
    • Display of latest coursework returned and SMS message to students
• Long term – Process management
  – Workflow
  – Process enactment software

Short-term

• Session based logins to personalise the interface and specify parameters (currentYear)
• Form generation as passive documents
  – Update through the form an obvious extension
• Extend operational data with date-based status
  – Date-returned-to-students
    • If set (work has been returned)
      – Date used to generate page of coursework recently returned
      – Date used to monitor conformance to target return date(!)
• Link Forms to textual/graphical process description
  – Coursework from setting to field board
  – How to specialise a generic description?
    • By level
    • By module
    • By field
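Monitoring conformance to the target return date needs a working-day calculation of the kind the `fold:workingDays` call performs in the coursework view. The following Python sketch is an assumption about how such a check might look. It skips weekends only and ignores public holidays and the UWE calendar; the 20-day target comes from that call, and the due date is invented.

```python
from datetime import date, timedelta

def working_days_after(start, days):
    """Return the date `days` working days after `start`, skipping Saturdays and Sundays.

    Public holidays and university closure days are ignored in this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current

def returned_on_time(due_date, date_returned, target_days=20):
    """True if the work was returned on or before the target return date."""
    return date_returned <= working_days_after(due_date, target_days)

due = date(2006, 2, 3)  # a Friday; hypothetical due date
target = working_days_after(due, 20)
```

Because the returned date is only set once work has gone back to students, an unset date after the target has passed is itself a signal worth reporting.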

Responsibilities

• Responsibility allocation
  – Admin / architect decision
  – Physical level design for responsibility
    • All Module Runs in a Field in one document
    • Modules and Programme Structures in Field Collections (within Year)
  – Group access rights
    • For IS Field - ISAdmin
      – Anne Moggridge
      – Peter Rawlings
      – Lilly Cooke
      – Tracey Davis
• Need for check-in check-out of documents
  – WebDAV (Web Folders)

Conclusion

• Slide from prototype to production
• Pluses and Minuses of user enthusiasm
• Go for ‘low-hanging fruit’
• Pay attention to the learning process
  – XQuery, XSLT are non-trivial languages because deeply unlike Java/PHP
• Reflection forced by presentations and workshops
