Nov 13, 2007 1
Master’s Thesis Defense
Bibliographic Tools In The Context Of WWW And LaTeX
Munushree Thummala
Committee members
  Dr. Prabhaker Mateti (Advisor)
  Dr. Thomas Hartrum
  Dr. T.K. Prasad
Agenda
  Introduction
  BibTeX Primer
  Bibliographic Tool Survey
  Requirements for the BibTeX Tools
  Design Discussion
  Conclusion
  Future Work
  Questions & Answers Session
  Demonstration
Introduction
  Preparing academic papers
  Collecting bibliographic entries
  Tools used to prepare the papers
  Common problems
BibTeX Primer
  What is BibTeX?
    Helps authors prepare the References section of their documents
    Defines entry types and required/optional fields
    Uses “style” files to define the format of references
    Standards for publications are specified in style files
  Used with LaTeX
    LaTeX collects \cite{}s in the .tex file
    BibTeX extracts the corresponding references from the .bib file
    BibTeX formats and sorts them according to the .bst style
    Output of the BibTeX program is LaTeX-formatted text
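The first step of that workflow, gathering the citation keys that BibTeX must later look up in the .bib file, can be illustrated with a minimal Python sketch. It is only an illustration: in a real run LaTeX records the keys in the .aux file and BibTeX reads them from there. The function name `collect_cite_keys` and the sample sentence are assumptions made for this example.

```python
import re

# Matches \cite{...} and \cite[...]{...}; other citation commands are ignored in this sketch.
CITE_RE = re.compile(r"\\cite(?:\[[^\]]*\])?\{([^}]*)\}")

def collect_cite_keys(tex_source):
    """Return citation keys in order of first appearance, without repeats."""
    keys = []
    for group in CITE_RE.findall(tex_source):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys

# Hypothetical .tex fragment using keys that appear elsewhere in this talk.
sample = r"As shown in \cite{Thummala-2007} and \cite[p.~3]{Holland-1975, Thummala-2007}."
print(collect_cite_keys(sample))  # ['Thummala-2007', 'Holland-1975']
```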
Sample BibTeX entry
@mastersthesis{Thummala-2007,
  author     = {Munushree Thummala},
  title      = {Bibliographic tools in the context of WWW and \latex},
  month      = {November},
  year       = {2007},
  school     = {Wright State University},
  OPTkey     = {},
  OPTtype    = {},
  OPTaddress = {},
  OPTnote    = {},
  OPTannote  = {},
  advisor    = {Prabhaker Mateti}
}
Contribution Of Thesis
  Evaluation of bibliographic tools
  BibTeX to Database Suite of Tools
    Database to store BibTeX entries
    LoadBiBTeX
    BibSearch
    Discovery of duplicate BibTeX entries
    Normalization of BibTeX entries
  Text to BibTeX Translation
    TextToBiBTeX command line tool & API
    PDFrefsToBiBTeX command line tool
    Integration of TextToBiBTeX into Aigaion
Bibliographic Tools
  There are 100+ tools
  In this thesis, 87 are reviewed
  Tools were evaluated for the following:
    Formats supported
    Navigating, searching and sorting capabilities
    Ease of maintaining bibliographic entries
    Duplicate discovery
    Import/export to other formats
Bibliographic Tools
  Web browser based tools
    Aigaion, Bibsonomy, CiteULike, Zotero, BibORB, Basilic, PubsOnline, etc.
  Desktop/small scale tools
    JabRef, KBibTeX, TkBibTeX, BibDB, BibEdit, Open Office Bibliographic Manager, Tellico, etc.
  Commercial tools
    Scholar’s Aid, Bookends, NotaBene, ProCite, etc.
  Utilities
    Bib2html, Bibclean, Bp, Bibdup, Sixpack, etc.
Aigaion
  Web application, open source
  Easy to use
  Supports basic editing features
  Supports multiple users
  Native format is BibTeX
  Organizes references by topics & sub-topics
  Maintains a list of authors to eliminate duplication
  Duplicate discovery present in import feature
Zotero
  Firefox browser extension
  Easy to use
  Organizes entries in collections
  Captures bibliographic entries from websites automatically
  Some drawbacks
    Loses BibTeX citation keys and custom fields while importing
    Not well suited for managing BibTeX bibliographies
    Local storage
Bibsonomy
  Web browser based, hosted service
  Easy to use
  References
    Users upload refs and bookmarks to Bibsonomy
    Made available to other users
    Tagged with keywords for categorization and search
    Can be exported as BibTeX
  Browser shortcuts to capture entries from the web
JabRef
  Desktop application
  Easy to use
  Multiple bib files can be edited
  Search online: CiteSeer, Medline, IEEEXplore, arXiv.org
  Native format is BibTeX
  Auto-generates BibTeX keys
  Imports/exports multiple formats
CiteULike
  Web browser based, hosted service
  Easy to use
  References
    Users upload refs to CiteULike
    Made available to other users
    Tagged with keywords for categorization and search
    Can be exported as BibTeX
  Browser shortcuts to capture entries from the web (cite the current article)
Requirements for New Tools
  Text to BibTeX translation
    Translating free style text into BibTeX
    Customizing the translation
    Certainty of Recognition measure
    Extract the references section from PDF papers
    Provide an API for other developers to integrate free style translation into their applications
    Command line invocation
    GUI also
    Normalized BibTeX output
Requirements (Contd. 2)
  Database of bibliographic entries
    Database to store BibTeX files
    Tool to detect duplicates
    Command line invocation
    Normalized BibTeX output
Requirements (Contd. 3)
  Search and generate BibTeX files
    Flexible searches
    Command line invocation
    Outputs BibTeX format
    Normalized BibTeX output
  Platform independent
Database on Local Machine
  Tables to store
    BibTeX entries
    Lookup data for text to BibTeX translation
    Search index data for fast and flexible searching
Database Of BibTeX Entries
  A schema to store BibTeX entries, including string macros
  Ability to specify a tag for each entry
    Tag defaults to the .bib filename
Database Of Lookup Data
  A database schema to store lookup tables
  Lookup tables:
    Author Sub Names
    Journal Names
    Publishers
    Cities
    States
    Months
    Organizations
Database Of Search Indexes
  A database schema to store BibTeX search index data
  Stores data as a sequence of tokens
  Provides the ability to search
    Any field(s)
    Any keyword(s)
  Citation key is also stored as tokens
LoadBiBTeX Tool
  Loads BibTeX files into the database and updates the search index tables
  Loads the lookup tables used by the Text to BibTeX tool
  Detects duplicates
LoadBiBTeX – Loads BibTeX Files
  Program usage
    LoadBiBTeX –loadentries –bibtag thesis2007 –bibfile thesis.bib
  Any entries that have errors are not loaded and are shown in the output
  Updates the index tables used by the BibSearch tool
LoadBiBTeX – Populate Lookup Tables
  Program usage
    LoadBiBTeX –loadauthors –loadpublishers –loadjournals –bibfile thesis.bib
  Only new values are loaded
  The above command does not load the BibTeX entries
LoadBiBTeX – Duplicate Discovery
  Program usage
    LoadBiBTeX –dupdisc –bibtag thesis2007 –bibfile thesis.bib
  The BibTeX entries in thesis.bib are read and compared to the entries in the database corresponding to the bibtag thesis2007
  Any entries considered to be duplicates are displayed for the user
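The slides do not spell out how two entries are judged to be duplicates. One plausible reading of comparing entries "logically" rather than as raw strings is sketched below in Python: each entry is reduced to a signature that ignores case, braces and punctuation, and signatures are compared. The choice of title plus year as the signature, and the names `signature` and `find_duplicates`, are assumptions for illustration, not the tool's actual rules.

```python
import re

def signature(entry):
    """Reduce an entry to a case-, brace- and punctuation-insensitive signature."""
    title_tokens = tuple(re.findall(r"[a-z0-9]+", entry.get("title", "").lower()))
    year = entry.get("year", "").strip()
    return (title_tokens, year)

def find_duplicates(new_entries, stored_entries):
    """Pair each incoming key with the stored key that shares its signature."""
    stored = {signature(e): e["key"] for e in stored_entries}
    return [(e["key"], stored[signature(e)])
            for e in new_entries if signature(e) in stored]

# The two spellings of the thesis entry shown elsewhere in this talk.
stored_entries = [{"key": "Thummala-2007", "year": "2007",
                   "title": r"{Bibliographic} tools in the context of {WWW} and \latex"}]
new_entries = [{"key": "Thummala2007", "year": "2007",
                "title": r"Bibliographic tools in the context of WWW and \latex"}]
print(find_duplicates(new_entries, stored_entries))
```

A plain string comparison of the two titles would report them as different because of the braces; the token signature treats them as the same work.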
BibSearch – Searching The Database
  Program usage
    BibSearch –bibtag thesis2007 –fields author –keywords Donald Knuth
  The database is searched for entries with the tag “thesis2007” and the words “Donald” and “Knuth” in the “author” field
  The resulting BibTeX entries and any required @String constructs are normalized and written to the output
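A token-level search of this kind can be sketched with a small in-memory index in Python. The dictionary stands in for the search index tables, whose real schema is not shown in the slides; the class `SearchIndex`, its methods and the sample entry are assumptions for illustration.

```python
import re
from collections import defaultdict

def tokens_of(text):
    """Split a field value into lower-case word tokens."""
    return [t.lower() for t in re.findall(r"\w+", text)]

class SearchIndex:
    def __init__(self):
        # (tag, field, token) -> set of citation keys containing that token
        self._postings = defaultdict(set)

    def add(self, tag, key, fields):
        for field, value in fields.items():
            for token in tokens_of(value):
                self._postings[(tag, field.lower(), token)].add(key)

    def search(self, tag, field, keywords):
        """Return the keys whose field contains every keyword as a token."""
        sets = [self._postings.get((tag, field.lower(), k.lower()), set())
                for k in keywords]
        return set.intersection(*sets) if sets else set()

index = SearchIndex()
# Illustrative entry only; not taken from the thesis database.
index.add("thesis2007", "Knuth-1984", {"author": "Donald E. Knuth", "title": "The TeXbook"})
print(index.search("thesis2007", "author", ["Donald", "Knuth"]))  # {'Knuth-1984'}
```

Because matching happens per token, the query finds "Donald E. Knuth" even though the string "Donald Knuth" never occurs in the stored field.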
Normalization
  Make BibTeX entries consistent
  Some of the rules
    Citation keys are consistent
    Fields are enclosed in {} to preserve formatting
    Month field abbreviations are expanded
    Missing required fields are indicated to the user appropriately
    Order of the fields in the output
  Where is it implemented?
    In whichever tool a particular rule makes sense
    Spread across TextToBiBTeX, LoadBiBTeX, BibSearch
Normalization (Example 2)
Input:
@mastersthesis{Thummala2007,
  title = "Bibliographic tools in the context of WWW and \latex",
  year = 2007,
  school = "Wright State University",
  month = "Nov",
  author = "Munushree Thummala",
  advisor = "Prabhaker Mateti",
}
Normalized output:
@MASTERSTHESIS{Thummala-2007,
  AUTHOR = {{Munushree} {Thummala}},
  TITLE = {{Bibliographic} tools in the context of {WWW} and \latex},
  MONTH = {November},
  YEAR = {2007},
  SCHOOL = {{Wright} {State} {University}},
  ADVISOR = {{Prabhaker} {Mateti}},
}
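Three of the normalization rules visible in this example (month expansion, bracing of capitalized words, and a key built from last names and year) can be sketched in Python. The rules are inferred from the sample output only; the function names are assumptions, and the sketch ignores cases such as values that already contain braces.

```python
import re

MONTHS = {"jan": "January", "feb": "February", "mar": "March", "apr": "April",
          "may": "May", "jun": "June", "jul": "July", "aug": "August",
          "sep": "September", "oct": "October", "nov": "November", "dec": "December"}

def expand_month(value):
    """Expand a month abbreviation such as 'Nov' or 'nov.'; leave other values alone."""
    return MONTHS.get(value.strip().rstrip(".").lower()[:3], value)

def protect_capitals(value):
    """Wrap every word that starts with an upper-case letter in braces."""
    return re.sub(r"\b[A-Z]\w*", lambda m: "{" + m.group(0) + "}", value)

def make_key(author_field, year):
    """Join the last name of each author and the year with hyphens."""
    last_names = [name.split()[-1] for name in author_field.split(" and ")]
    return "-".join(last_names + [str(year)])

print(expand_month("Nov"))                     # November
print(protect_capitals("Munushree Thummala"))  # {Munushree} {Thummala}
print(make_key("Munushree Thummala", 2007))    # Thummala-2007
```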
Normalization (Example 3)
Input:
@InCollection{lawrence01access,
  author = "Steve Lawrence",
  title = "Access to Scientific Literature",
  journal = "The {\it Nature} Yearbook of Science and Technology",
  editor = "Declan Butler",
  publisher = "Macmillan",
  address = "London, England",
  pages = "86-88",
  year = 2001
}
Normalized output:
@INCOLLECTION{Lawrence-2001,
  AUTHOR = {{Steve} {Lawrence}},
  TITLE = {{Access} to {Scientific} {Literature}},
  BOOKTITLE = {},
  YEAR = {2001},
  JOURNAL = {The {\it Nature} {Yearbook} of {Science} and {Technology}},
  EDITOR = {{Declan} {Butler}},
  PUBLISHER = {{Macmillan}},
  ADDRESS = {{London}, {England}},
  PAGES = {86-88},
}
Text to BibTeX Translation
  What are free style references and where would authors find them?
    References at the end of academic papers
    References on Internet sites like CiteSeer
    A jotted-down text description
  How do authors benefit from this translation?
    No need to manually convert to BibTeX
    Significantly better accuracy
    Speeds up the process of translating multiple references
Text to BibTeX Translation (Contd. 2)
  Ways to translate free style text
    Write a routine to analyze the strings and guess the fields
    Develop a language grammar
    Recursive descent parser
  Which method did we pick?
    Recursive descent parsing
    Tried other methods with varying degrees of success
Text to BibTeX Translation (Contd. 3)
  How does the parser work?
    Extent = a sequence of tokens
    Field type = an extent that matches the set of okTokens for that field and ends when a notOkToken (including a delimiting token) is hit
    Backtrack: if the current token in an extent does not match the field, the parser backtracks to the beginning token of the extent, which is then given a chance to match other field types
    Unrecognized: if the current token does not match any field type, it is appended to the unrecognized field list and the above process is repeated starting at the next token
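The control flow described above (try each field type at the current token, fall back to the start of the extent on failure, and collect tokens that match nothing) can be sketched in Python. This is a simplified illustration under stated assumptions, not the TextToBiBTeX parser: the matcher functions, the tiny lookup tables and the rule that each field is filled at most once are all invented for the example.

```python
def match_year(tokens, start):
    """Match one four-digit token between 1900 and 2100."""
    tok = tokens[start]
    if tok.isdigit() and len(tok) == 4 and 1900 <= int(tok) <= 2100:
        return start + 1
    return None

def make_lookup_matcher(phrases):
    """Build a matcher that accepts the longest lookup-table phrase at start."""
    def matcher(tokens, start):
        best = None
        for phrase in phrases:
            end = start + len(phrase)
            if tokens[start:end] == list(phrase) and (best is None or end > best):
                best = end
        return best
    return matcher

def parse(tokens, field_matchers):
    fields, unrecognized = {}, []
    i = 0
    while i < len(tokens):
        for name, matcher in field_matchers:
            if name in fields:
                continue
            # A failed attempt leaves i unchanged, so the next field type
            # starts again from the first token of the extent (the backtrack).
            end = matcher(tokens, i)
            if end is not None:
                fields[name] = " ".join(tokens[i:end])
                i = end
                break
        else:
            # No field type accepted this token.
            unrecognized.append(tokens[i])
            i += 1
    return fields, unrecognized

matchers = [
    ("place", make_lookup_matcher([("Ann", "Arbor")])),  # stand-in for the cities table
    ("state", make_lookup_matcher([("MI",)])),           # stand-in for the states table
    ("year", match_year),
]
fields, unrecognized = parse(["Ann", "Arbor", ",", "MI", "(", "1975", ")", "."], matchers)
print(fields)
print(unrecognized)
```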
Text to BibTeX Translation (Contd. 4)
  How is a series of tokens recognized as a field?
    Author, Journal fields - lookup table and heuristics
    Title field - quoted strings or heuristics
    Pages field - [PAGES.|PP.|P.] <number [–][–number]>
    Year field - a four digit number between 1900 and 2100
    Volume field - [VOL. | VOLUME] <number>
    Number field - [NO. | NUMBER] <number>
    Abbrev field - <volume>(<number>):<startpage>–[-]<endpage>
    Edition field - EDITION<number> or <number> EDITION
    Publisher field, Place, State - lookup table
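Several of these patterns translate naturally into regular expressions. The Python sketch below works on raw strings rather than on the token stream the thesis parser uses, so it is an approximation of the rules listed above; the constant and function names are assumptions.

```python
import re

# PAGES. / PP. / P. followed by a number and an optional second number after a dash
PAGES_RE = re.compile(r"\b(?:pages|pp|p)\.?\s*(\d+)(?:\s*[-–]{1,2}\s*(\d+))?", re.IGNORECASE)
# VOL. or VOLUME followed by a number
VOLUME_RE = re.compile(r"\b(?:vol\.?|volume)\s*(\d+)", re.IGNORECASE)
# NO. or NUMBER followed by a number
NUMBER_RE = re.compile(r"\b(?:no\.?|number)\s*(\d+)", re.IGNORECASE)
# Abbreviated form: volume(number):startpage-endpage
ABBREV_RE = re.compile(r"(\d+)\((\d+)\):(\d+)\s*[-–]{1,2}\s*(\d+)")
YEAR_RE = re.compile(r"\b(\d{4})\b")

def find_year(text):
    """Return the first four-digit number between 1900 and 2100, or None."""
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        if 1900 <= year <= 2100:
            return year
    return None

print(PAGES_RE.search("1991, pp.579-601.").groups())  # ('579', '601')
print(ABBREV_RE.search("12(3):45--67").groups())      # ('12', '3', '45', '67')
print(find_year("Ann Arbor, MI (1975)."))             # 1975
```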
Text to BibTeX Translation (Contd. 5)
  A lexical analyzer tokenizes:
    Holland, J. H. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI (1975).
  Resulting tokens:
    Holland  ,  J  .  H
    .  Adaptation  In  Natural  And
    Artificial  Systems  .  The  University
    Of  Michigan  Press  ,  Ann
    Arbor  ,  MI  (  1975
    )  .
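A lexical analyzer that splits a reference into word and punctuation tokens can be approximated in a few lines of Python. This sketch keeps the original letter case, whereas the token grid on the slide shows some words capitalized; the name `tokenize` is an assumption.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(reference):
    return TOKEN_RE.findall(reference)

tokens = tokenize("Holland, J. H. Adaptation in Natural and Artificial Systems. "
                  "The University of Michigan Press, Ann Arbor, MI (1975).")
print(len(tokens))   # 27
print(tokens[:6])    # ['Holland', ',', 'J', '.', 'H', '.']
print(tokens[-4:])   # ['(', '1975', ')', '.']
```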
Text to BibTeX Translation (Contd. 6)
  Author field recognition
    “Holland” was present in the author lookup table
    “J.”, “H.” are initials, so the author is recognized as being in the form lastname, firstname
    Author field is set to “J.H. Holland”
Text to BibTeX Translation (Contd. 7)
  Title field recognition
    Since “Adaptation” is not recognized as a possible starting token of any other field, tokens are gathered until the next punctuation mark as the title field
Text to BibTeX Translation (Contd. 8)
  Publisher field recognition
    The sequence of tokens “The”, “University”, “of”, “Michigan” and “Press” represents a valid publisher name in the publishers lookup table
    Thus “The University of Michigan Press” is the publisher field
Text to BibTeX Translation (Contd. 9)
  Place and State field recognition
    The sequence of tokens “Ann” and “Arbor” represents a valid place name in the cities lookup table
    The token “MI” represents a valid state name in the states lookup table
Text to BibTeX Translation (Contd. 10)
  Year field recognition
    The token “1975” is a valid year value in the range 1900 - 2100, so it becomes the year field
Text to BibTeX Translation (Contd. 11)
  Citation entry type
    Since no distinguishing fields are recognized, the entry type defaults to Misc
  CORN calculations
    Author field is fully recognized: CORN of 100
    Title field follows Author field: CORN of 100
    Publisher field is in the lookup table: CORN of 100
    There are no required fields for the Misc entry type, so the multiplier is 1
    Entry CORN = AVG(Author, Title, Publisher) * multiplier = 100
Text to BibTeX Translation (Contd. 12)
-- Entry CORN = 100 Author = 100 Title = 100
-- Publisher = 100
@MISC{Holland-1975,
  AUTHOR = {{J}. {H}. {Holland}},
  TITLE = {{Adaptation} in {Natural} and {Artificial} {Systems}},
  YEAR = {1975},
  PUBLISHER = {{The} {University} of {Michigan} {Press}},
  PLACE = {{Ann} {Arbor}},
  STATE = {MI},
}
Text to BibTeX Translation Example 1
  Werner Damm and Bernhard Josko. A sound and relatively complete Hoare-logic for a language with higher type procedures. Acta Informatica, 20:59-101, 1983.
-- Entry CORN = 87 Author=50 Title = 100 Journal = 100 Pages = 100
@ARTICLE{Damm-Josko-1983,
  AUTHOR = {{Werner} {Damm} and {Bernhard} {Josko}},
  TITLE = {{A} sound and relatively complete {Hoare}-logic for a language with higher type procedures},
  YEAR = {1983},
  JOURNAL = {{Acta} {Informatica}},
  PAGES = {59-101},
  VOLUME = {20},
}
Text to BibTeX Translation Example 2
  Collins R. J. and Jefferson D. R. "AntFarm: towards simulated evolution." In: C. G. Langton, C. Taylor, J. D. Farmer, and S. Rasmussen (Eds.), Artificial Life II, Vol. X of SFI Studies in the Sciences of Complexity. Redwood City, CA: Addison-Wesley, 1991, pp.579-601.
@INPROCEEDINGS{J-R-1991,
  AUTHOR = {{Collins} {R}. {J.} and {Jefferson} {D}. {R.}},
  TITLE = {{AntFarm}: towards simulated evolution.},
  YEAR = {1991},
  EDITOR = {{G}. {Langton} and {C}. {Taylor} and {J}. {D}. {Farmer} and {S}. {Rasmussen}},
  PAGES = {579-601},
  PUBLISHER = {{Addison} - {Wesley}},
  JOURNAL = {{In}: {C}},
  PLACE = {{Redwood} {City}},
  STATE = {CA},
  OPTERRORFIELD0 = {{Artificial} {Life} {II}},
  OPTERRORFIELD1 = {{Vol}. {X} of {SFI} {Studies} in the {Sciences} of {Complexity}},
}
Correctness Of Recognition Number
  CORN for the entire BibTeX entry is based on
    CORN for each field recognized
    Completeness of the entry (% of required fields present)
  CORN is calculated for:
    Author field
    Editor field
    Title field
    Journal field
    Publisher field
    Pages field
CORN – Example 1
@INPROCEEDINGS{Wegener-2002,
  AUTHOR = {{I}. {Wegener}},
  TITLE = {{Methods} for the {Analysis} of {Evolutionary} {Algorithms} on {PseudoBoolean} {Functions}},
  BOOKTITLE = {},
  YEAR = {2002},
  PUBLISHER = {{Kluwer} {Academic} {Publishers}},
  JOURNAL = {{In}: {Evolutionary} {Optimization}},
}
CORN – Example 1 (Contd.)
  Author, Title and Publisher were correctly recognized and their field CORN is set to 100 each.
  The Journal field was recognized only because of the presence of the string “In:”, so it is assigned a CORN of 50.
  The required field “Booktitle” is not present, so the multiplier is ¾.
  This reduces the entry CORN to 65: (100+100+100+50)/4 * 3/4
CORN – Example 2
@MISC{Luckham-1990,
  AUTHOR = {{David} {Luckham}},
  TITLE = {{Programming} with {Specifications}},
  YEAR = {1990},
  EDITION = {1},
  OPTERRORFIELD0 = {Springer},
  OPTERRORFIELD1 = {Berlin},
}
CORN – Example 2 (Contd.)
  One of the Author names is not fully recognized, which reduces the CORN for the author field to 1/2*100 = 50.
  Title is correctly recognized and its field CORN is set to 100.
  Year and Edition fields are correctly recognized but do not affect the entry CORN.
  Entry CORN = (100+50)/2 = 75. Since the entry type is MISC, the multiplier is 1.
CORN – Example 3
@INPROCEEDINGS{Collins-Jefferson-1990,
  AUTHOR = {{Robert} {J}. {Collins} and {David} {R}. {Jefferson}},
  TITLE = {{AntFarm}: {Towards} simulated evolution},
  BOOKTITLE = {},
  YEAR = {1990},
  PAGES = {579--601},
  MONTH = {February},
  PUBLISHER = {{Addison} - {Wesley}},
  JOURNAL = {{In} {Artificial} {Life} {II}: {Proceedings} of the {Workshop} on {Artificial} {Life}},
  PLACE = {{Santa} {Fe}},
  STATE = {NM},
}
CORN – Example 3 (Contd.)
  Author names are fully recognized, so the field CORN is set to 100.
  Title is correctly recognized and its field CORN is set to 100.
  Pages is recognized and the page range is valid, so its CORN is 100.
  Journal is recognized with a heuristic, so its CORN is set to 50.
  Publisher is in the publishers lookup table, so its CORN is set to 100.
  Entry CORN = (100+100+50+100+100)/5 * (3/4) = 67. The multiplier ¾ is due to the missing required field booktitle.
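The worked examples all follow the same arithmetic: average the field CORNs, then scale by the fraction of required fields present. A minimal Python sketch is given below. The function name and signature are assumptions, and truncating to an integer is inferred from the slide values (65.625 reported as 65, 67.5 reported as 67) rather than stated by the source.

```python
import math

def entry_corn(field_corns, required_present=0, required_total=0):
    """Average the field CORNs, scale by completeness, and truncate to an integer."""
    average = sum(field_corns) / len(field_corns)
    # Entry types without required fields (such as Misc) use a multiplier of 1.
    multiplier = required_present / required_total if required_total else 1.0
    return math.floor(average * multiplier)

print(entry_corn([100, 100, 100, 50], 3, 4))       # Example 1 -> 65
print(entry_corn([50, 100]))                       # Example 2 -> 75
print(entry_corn([100, 100, 50, 100, 100], 3, 4))  # Example 3 -> 67
```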
TextToBiBTeX API
  SetupDbConnection
  setInputString
  setMarkupStream – re: colorized HTML
  setBiBTeXStream – re: BibTeX entries
  textToBiBTeX – text to BibTeX translation
  getEntriesCount
  getBibTeXEntryFieldCount
  getBibTeXEntryField
TextToBiBTeX API (Contd.)
  Java library jar
  Non-Java programs can invoke
    TextToBiBTeX
    PDFrefsToBiBTeX
TextToBiBTeX
  Command line tool
  Free style input in a file
  BibTeX output
  Marked up HTML output
  Uses TextToBiBTeX API
  Usage:
    TextToBiBTeX <txt file> [bib file]
PDFrefsToBiBTeX
  Command line tool
  PDF file as input
  BibTeX output
  Marked up HTML output
  Uses 3rd party tool PDFBox for parsing the PDF file
  Uses TextToBiBTeX API
  Usage:
    PDFrefsToBiBTeX [-clean] <pdf file> [bib file]
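A non-Java program could drive the two command line tools by building the argument lists shown in the usage lines and handing them to the operating system. The Python sketch below assumes that executables named TextToBiBTeX and PDFrefsToBiBTeX are on the PATH; how the tools are actually launched (for example through a wrapper script around the jar) is not stated in the slides, and the helper names are hypothetical.

```python
import subprocess

def text_to_bibtex_cmd(txt_file, bib_file=None):
    """Build: TextToBiBTeX <txt file> [bib file]"""
    cmd = ["TextToBiBTeX", txt_file]
    if bib_file:
        cmd.append(bib_file)
    return cmd

def pdf_refs_to_bibtex_cmd(pdf_file, bib_file=None, clean=False):
    """Build: PDFrefsToBiBTeX [-clean] <pdf file> [bib file]"""
    cmd = ["PDFrefsToBiBTeX"]
    if clean:
        cmd.append("-clean")
    cmd.append(pdf_file)
    if bib_file:
        cmd.append(bib_file)
    return cmd

def run(cmd):
    # Assumes the tool is installed and on PATH; raises CalledProcessError on failure.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

print(pdf_refs_to_bibtex_cmd("paper.pdf", "refs.bib", clean=True))
```

Keeping the argument construction separate from `run` makes the command easy to inspect or log before anything is executed.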
Integrating into Aigaion
  Free style translation functionality integrated into Aigaion
  Free style recognition from PDF files
    Logic to clean the text recognized from PDF
  Synchronizing TextToBiBTeX lookup tables with entries from the Aigaion database
Conclusion
  Tool survey
    Evaluated over 80 tools
    Tool recommendations
  Database of BibTeX entries
    Store BibTeX files as database entries
    Searching is done at the token level instead of the string level, which yields good results
    Duplicates are detected logically instead of by string comparison
Conclusion (Contd.)
  Text to BibTeX translation
    TextToBiBTeX saves scholars time and effort by relieving them of the burden of translating and maintaining BibTeX entries
    TextToBiBTeX API allows other tools to reuse the free style functionality
    Integrated into the Aigaion tool
    Converted PDF references into BibTeX format
Future Work
  Better duplicate detection by letting the users configure the base rules for detecting duplicates
  Recognizing more variations in free style text
  Recognizing more fields
  Optimizing the database loading speed for BibTeX entries