Post on 21-Dec-2015
U of R eXtensible Catalog
Team MetaCat
Problem Domain
A Modern Library
• Card catalogs are stored on a computer
• Card catalogs store metadata about books
  Subject
  Author(s)
• Searching for a book is done via an OPAC (Online Public Access Catalog)
  Example: http://albert.rit.edu/
Card Catalog Metadata
• Two types of records
  A bibliographic record represents a book, and is linked to multiple authority records.
  An authority record represents a single author or subject.
• Metadata has been hand-typed by librarians across the country
  MARC: MAchine Readable Cataloging (XML), specifies both bib. and auth. record formats
  Dublin Core: also an XML format, but only for bib. records
Metadata Issues
• Since metadata has been hand-typed, it may be inconsistent
• An author could be: “Mark Twain”, “Twain, Mark”, “M. Twain”, “Samuel Clemens”
• If a user searches for “Mark Twain”, the search may not return all related books
Goals
• Bibliographic Record
  Author field: Name; Date of Birth, Death
• Authority Record
  Authorized Form
  Alternate Forms: Alternate form 1, Alternate form 2, …
  See Also: References to other authority records
Sponsor’s Solution
Iterative Process Flow
Requirement Elicitation
Requirement Analysis
Define Architecture
Update Release Plan
Produce SRS & acceptance tests
Subsystem Design
Identify Integration Tests
Implementation
Integration
Acceptance Testing
Delivery
For each release:
Update Documentation
Metrics
• Effort by type of activity
• Test metrics (JUnit)
• Defects by types
Effort by Type (hours)
Period              Meeting   Development   Documentation
Before              ~40       0             0
1/12-1/18           45        29            2
1/18-1/25           20        43            5
1/26-2/1 (R1)       24        41            4
2/2-2/8             20        31            7
2/9-2/15 (R2)       24        2             0
Total (1/12-2/15)   133       146           18
[Chart: Hours spent on activities. x-axis: Time (Before, 1/12-1/18, 1/18-1/25, 1/26-2/1 (R1), 2/2-2/8, 2/9-2/15 (R2)); y-axis: Hours (0 to 50); series: Meeting, Development, Documentation]
Issuetracker
Initially, issues were not recorded properly.
Issue Tracker is used to track
1. Issues (design, documentation, process)
2. Bugs
3. Discussions (new features, nice to have)
Issuetracker
Defects by Type
Status
• 3.1 Import a record into database
  (R1) FR-1.1: The system shall parse the XML record.
  (R2) FR-1.2: The system shall store the information obtained from parsing the XML record in the MySQL database.
  (R1) FR-1.3: The system shall be able to import multiple records at once (batch processing).
  (R1) FR-1.4: The system shall normalize strings.
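FR-1.4 does not say which normalization rules apply. A minimal sketch, assuming normalization means stripping diacritics and punctuation, collapsing whitespace, and lower-casing; the class name `NameNormalizer` and the rules themselves are illustrative assumptions, not the team's implementation:

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: the rules below are assumptions, not taken from the SRS.
final class NameNormalizer {
    private NameNormalizer() {}

    static String normalize(String original) {
        // Decompose accented characters so that accents become separate combining marks.
        String decomposed = Normalizer.normalize(original, Normalizer.Form.NFD);
        // Drop the combining marks produced by the decomposition.
        String unaccented = decomposed.replaceAll("\\p{M}+", "");
        // Replace anything that is not a letter, digit, or whitespace with a space.
        String lettersOnly = unaccented.replaceAll("[^\\p{L}\\p{N}\\s]", " ");
        // Trim, collapse whitespace runs, and lower-case.
        return lettersOnly.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

Under these rules “Twain, Mark” and “twain  mark” normalize to the same string, but “Mark Twain” does not: word order is left to the matching rules.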
Status cont.
• 3.2 Matching records
  (R1) FR-2.1: The system shall create a new authority record.
  (R2) FR-2.2: The system shall match two strings and give a confidence level for the match.
  (R2) FR-2.3: The system shall store the results of the matching, including the degree of certainty and the link(s) to the matched authorized record(s).
  (R1) FR-2.4: The system shall identify all unprocessed records in the records database. Unprocessed records are records that have not yet been matched.
  (R1) FR-2.5: The system shall create a new authority record, and store it in the database.
Status cont.
(R1) FR-2.6: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the degree of certainty is above the auto-accept threshold.
(R2) FR-2.7: The system shall mark the record to be reviewed by a person if the degree of certainty is between the auto-accept threshold and the auto-reject threshold.
FR-2.8: The system shall create a new authority record using the information from the current record, and create a link between those two records if the degree of certainty is below the auto-reject threshold.
(R1) FR-2.9: The system shall analyze unprocessed records on demand.
(R1) FR-2.10: The system shall attempt to match records first by comparing authority names.
(R2) FR-2.11: The system shall attempt to match records by comparing alternative names if the first attempt (FR-2.10) fails.
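The three threshold cases in FR-2.6 through FR-2.8 can be sketched as a small decision function. The names `MatchDecision` and `ThresholdPolicy` are hypothetical, the threshold values are supplied by the caller because the SRS does not state them, and treating a confidence exactly equal to a threshold as "needs review" is an assumption:

```java
// Hypothetical names; the SRS does not define these types or the threshold values.
enum MatchDecision { AUTO_ACCEPT, NEEDS_REVIEW, AUTO_REJECT }

final class ThresholdPolicy {
    private final double autoAccept;
    private final double autoReject;

    ThresholdPolicy(double autoAccept, double autoReject) {
        if (autoReject > autoAccept) {
            throw new IllegalArgumentException("auto-reject threshold must not exceed auto-accept threshold");
        }
        this.autoAccept = autoAccept;
        this.autoReject = autoReject;
    }

    MatchDecision decide(double confidence) {
        if (confidence > autoAccept) {
            return MatchDecision.AUTO_ACCEPT;   // FR-2.6: replace with authorized form and link
        }
        if (confidence < autoReject) {
            return MatchDecision.AUTO_REJECT;   // FR-2.8: create a new authority record
        }
        return MatchDecision.NEEDS_REVIEW;      // FR-2.7: mark for review by a person
    }
}
```

For example, with an auto-accept threshold of 0.9 and an auto-reject threshold of 0.5 (illustrative values only), a confidence of 0.7 would be queued for review.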
Status cont.
• 3.5 Review possible matches
  (R2) FR-5.1: The system shall gather from the database the collection of records that are marked for review. These questionable matches have a degree of certainty between the auto-accept threshold and the auto-reject threshold.
  (R2) FR-5.2: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the match.
  (R2) FR-5.3: The system shall replace the data in authority-controlled fields with its authorized form and store the link to its authorized form if the user approves the match.
Our Solution
Architecture
[Architecture diagram. Components: API, MySQL DB, Exporter, DAO (Data Access Object), GUI, «subsystem» Matcher, «subsystem» Import. Association label: -match]
Matcher
• In NACM, we need to be able to match bibliographic records (books) to authority records (authors).
• The information in the records may not always match exactly, or may match multiple records!
Matching Problems
• Different forms of the same name
  Nate versus Nathan, typos
• Different authors with the same name
  George Bush (41) versus George Bush (43)
• Aliases or pen names
  Samuel Clemens versus Mark Twain
Matching Problems
• To assist in matching different forms of an author’s name, Authority records have a list of alternate names in addition to the authorized form.
• Alternate names may not be distinct.
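Because alternate names may not be distinct, a lookup by name has to be able to return more than one authority record. A minimal in-memory sketch of that idea; the class `AlternateNameIndex` and the record ids are hypothetical, and the project itself keeps names in MySQL rather than in a map:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical index: one name can point at several authority record ids.
final class AlternateNameIndex {
    private final Map<String, Set<Integer>> recordIdsByName = new HashMap<>();

    void add(String alternateName, int authorityRecordId) {
        recordIdsByName
            .computeIfAbsent(alternateName, key -> new TreeSet<>())
            .add(authorityRecordId);
    }

    // Returns every authority record id registered under the name (possibly none).
    Set<Integer> lookup(String name) {
        return recordIdsByName.getOrDefault(name, Collections.emptySet());
    }
}
```

A lookup that returns two ids for one name is exactly the "may match multiple records" case the matcher has to resolve or hand to a reviewer.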
Matcher Design
• We need a matching strategy that is easy to extend to add new matching rules, while still being fast.
Matching Subsystem
MatchStrategy
• Abstract class that defines the basics of a matching rule
  Matching method
  Match confidence
• All matching strategies extend this class
StringTransformer
• Abstract class for string manipulation rules
  String transform method
  Transformation confidence
• All string manipulation rules extend this class
MatchDriver
• Handles performing a match
  Creates pairs of strategies & transformations
  Sorts pairs based on overall confidence
  Iterates through the pairs looking for matches
Matcher Extensibility
• Adding new rules
  Extend MatchStrategy or StringTransformer
  Implement new matching or transforming rules
  Assign a confidence
  Add to MatchDriver
• MatchDriver takes care of the rest
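The design described on these slides can be sketched roughly as follows. Only the class names MatchStrategy, StringTransformer, and MatchDriver come from the slides; the method signatures, the example rules (`ExactMatch`, `IdentityTransformer`, `LowerCaseTransformer`), their confidence values, and the choice to combine confidences by multiplication are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

abstract class MatchStrategy {
    abstract boolean matches(String candidate, String authorized); // matching method
    abstract double confidence();                                  // match confidence
}

abstract class StringTransformer {
    abstract String transform(String input); // string transform method
    abstract double confidence();            // transformation confidence
}

// Example rules (hypothetical); new rules are added by extending the abstract classes.
final class ExactMatch extends MatchStrategy {
    boolean matches(String candidate, String authorized) { return candidate.equals(authorized); }
    double confidence() { return 1.0; }
}

final class IdentityTransformer extends StringTransformer {
    String transform(String input) { return input; }
    double confidence() { return 1.0; }
}

final class LowerCaseTransformer extends StringTransformer {
    String transform(String input) { return input.toLowerCase(Locale.ROOT); }
    double confidence() { return 0.9; }
}

final class MatchDriver {
    private static final class Pair {
        final MatchStrategy strategy;
        final StringTransformer transformer;
        Pair(MatchStrategy strategy, StringTransformer transformer) {
            this.strategy = strategy;
            this.transformer = transformer;
        }
        // Assumption: overall confidence is the product of the two confidences.
        double confidence() { return strategy.confidence() * transformer.confidence(); }
    }

    private final List<MatchStrategy> strategies = new ArrayList<>();
    private final List<StringTransformer> transformers = new ArrayList<>();

    MatchDriver add(MatchStrategy strategy) { strategies.add(strategy); return this; }
    MatchDriver add(StringTransformer transformer) { transformers.add(transformer); return this; }

    // Returns the confidence of the first matching pair, or 0.0 if no pair matches.
    double match(String candidate, String authorized) {
        List<Pair> pairs = new ArrayList<>();
        for (MatchStrategy s : strategies) {
            for (StringTransformer t : transformers) {
                pairs.add(new Pair(s, t));
            }
        }
        // Highest overall confidence first.
        pairs.sort((x, y) -> Double.compare(y.confidence(), x.confidence()));
        for (Pair p : pairs) {
            String left = p.transformer.transform(candidate);
            String right = p.transformer.transform(authorized);
            if (p.strategy.matches(left, right)) {
                return p.confidence();
            }
        }
        return 0.0;
    }
}
```

Trying pairs in descending confidence order means the first hit is also the most trustworthy one, so the driver can stop early instead of evaluating every rule.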
Importer
• Takes in input streams and parses them to extract authority and bibliographic data
• Uses a SAX parser to build a Document Object Model (DOM) object
• Data is extracted from document, normalized, and inserted into the database
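A rough sketch of the extraction step, assuming MARCXML-style input in which the personal-name main entry is `datafield` tag 100, subfield `a`. It uses the JDK's `DocumentBuilder` as a stand-in for whatever parser the team used, and the class name `AuthorExtractor` is hypothetical:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls author names out of a MARCXML-style record stream.
final class AuthorExtractor {
    private AuthorExtractor() {}

    // Returns the text of subfield "a" for every datafield with tag "100".
    static List<String> extractAuthors(InputStream xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse record", e);
        }
        List<String> authors = new ArrayList<>();
        NodeList fields = doc.getElementsByTagName("datafield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if (!"100".equals(field.getAttribute("tag"))) {
                continue; // not an author field
            }
            NodeList subfields = field.getElementsByTagName("subfield");
            for (int j = 0; j < subfields.getLength(); j++) {
                Element subfield = (Element) subfields.item(j);
                if ("a".equals(subfield.getAttribute("code"))) {
                    authors.add(subfield.getTextContent().trim());
                }
            }
        }
        return authors;
    }
}
```

The extracted strings would then be normalized and written to the database, as the last bullet describes.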
Importer
MySQL data model
record_types: id (PK), name
names: id (PK), orig_string, nor_string
authority_records: id (PK), processed, generated, xml_hashcode, orig_xml, record_type_id (FK1), authority_name_id (FK2)
authority_records_alter_forms: name (PK, FK1), record_id (PK, FK2)
authority_records_see_also: name (PK, FK1), record_id (PK, FK2)
bib_records: id (PK), processed, xml_hashcode, orig_xml, record_type_id (FK1)
bib_records_titles: name (PK, FK1), record_id (PK, FK2)
bib_records_authors: name (PK, FK1), record_id (PK, FK2)
bib_records_subjects: name (PK, FK1), record_id (PK, FK2)
authority_records_links: id (PK), approved, flagged, rejected, auth_record_id (FK1), bib_record_id (FK2), evidence, time_found, time_verifed, approvedby, percent_confidence, string_id (FK3)
bib_records_author_links: bib_record_id (PK, FK1), auth_link_id (PK, FK2)
bib_records_subjects_links: bib_record_id (PK, FK2), auth_link_id (PK, FK1)
Using Hibernate
• Transparent Data Persistence
• Manages relationships between entities
• Benefits
  Query caching
  Lazy-loading of associated entities
  Automatic flagging of changes
  Programmatic API for complex queries
How it Works
• Define Schema
• Define Domain Model
• Use XML to map fields in classes to columns in tables
  Define cascading behavior
Hibernate Caveats
• Designed with transactions in mind
  But, we use batch processing!
• Query language lacks some of the power of SQL
• Not 100% transparent
  Design and use of domain model is affected
Results Viewing GUI
[Class diagram, package gui.resultsGUI]
ResultsTable: +refresh(), +sortBy(in column : int), +updateLinkCountLabel()
ResultsTableModel: +getValueAt(in row, in column)
FilterControls
PagingControls
SelectedLinkControls
Filter
AuthorityLinkDAO: +findAllWithFilter()
AuthorityLink
Note: Creates and lays out a JTable and other GUI components
Association labels: -creates, -database
• A table displaying all created links
• Can be filtered, sorted, and paged
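One way a paged table of links could be backed by a Swing table model is sketched below. `LinkRow` and `PagedLinkTableModel` are hypothetical names, the columns loosely follow the authority_records_links table, and the paging logic is an assumption rather than a description of the team's ResultsTableModel:

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.table.AbstractTableModel;

// Hypothetical row type with a few columns from authority_records_links.
final class LinkRow {
    final int bibRecordId;
    final int authRecordId;
    final double percentConfidence;

    LinkRow(int bibRecordId, int authRecordId, double percentConfidence) {
        this.bibRecordId = bibRecordId;
        this.authRecordId = authRecordId;
        this.percentConfidence = percentConfidence;
    }
}

// Hypothetical model: exposes one page of links at a time to a JTable.
final class PagedLinkTableModel extends AbstractTableModel {
    private static final String[] COLUMNS = {"Bib record", "Authority record", "Confidence"};
    private final List<LinkRow> allRows;
    private final int pageSize;
    private int page = 0;

    PagedLinkTableModel(List<LinkRow> rows, int pageSize) {
        this.allRows = new ArrayList<>(rows);
        this.pageSize = pageSize;
    }

    void setPage(int page) {
        this.page = page;
        fireTableDataChanged(); // tell any attached JTable to redraw
    }

    @Override public int getRowCount() {
        int remaining = allRows.size() - page * pageSize;
        return Math.max(0, Math.min(pageSize, remaining));
    }

    @Override public int getColumnCount() { return COLUMNS.length; }

    @Override public String getColumnName(int column) { return COLUMNS[column]; }

    @Override public Object getValueAt(int row, int column) {
        LinkRow link = allRows.get(page * pageSize + row);
        switch (column) {
            case 0: return link.bibRecordId;
            case 1: return link.authRecordId;
            case 2: return link.percentConfidence;
            default: throw new IndexOutOfBoundsException("column " + column);
        }
    }
}
```

A `JTable` constructed with this model would show only the current page; filtering and sorting could be applied to the row list before it is handed to the model.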
Future Plans
• Verify that matching algorithm is doing the right things
• Implement string transformers
• Create new XC records
• Merge and update records with new data upon import
• Configuration files for the system
Demo!