CSCLab
BibPro: A Citation Parser System
www.iis.sinica.edu.tw
Introduction
Integrating bibliographical information
  Metadata: author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc.
Citation string
  Thousands of variations
  More than 2,000 formats in EndNote
Citation parsing problem
  Automatically recognize individual fields from a given citation string
  A template-based citation parser
Goal
Citation
  Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory. 2(3) 113--124.
BibPro
MetaData
  Author: Chomsky, Noam
  Title: Three models for the description of language
  Venue: IRE Transactions on Information Theory
  Volume: 2
  Issue: 3
  Page: 113-124
  Date: 1956
Machine Learning
Conditional Random Field
  F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.
Support Vector Machine
  Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48.
Hidden Markov Model
  K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42.
  Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60.
  Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.
Knowledge Base
A tree-like knowledge representation scheme that organizes the knowledge of reference concepts in a hierarchical fashion
Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.
A knowledge base automatically constructed from an existing set of sample metadata records of a given area
E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.
Template Base
Keep citation style as a template
  ParaCite http://paracite.eprints.org/
  I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.
  Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops, 2008, pp. 1175-1180.
Key Idea
Encode a citation string into a template for BLAST
  Keep the citation style information in a protein sequence
Utilize bioinformatics sequence tools
  BLAST
Use domain knowledge
  Reserved words
  Knowledge database (optional)
  Blocking rules (common-sense knowledge)
Question
How many symbols can be used in a protein sequence?
  23 symbols are used in BLAST
How many fields should be extracted from a citation?
  Choose the most commonly used fields
Which punctuation marks are treated as partition marks?
  Based on domain knowledge
How do we transform a citation into a protein sequence and retain its structural features?
  Define an encoding table
Encoding Procedure
Encoding Citation
  Regular Expression Extraction
  Keyword Encoding
    Author, Venue, Editor, Volume, Page, Date, Publisher, Institution, Http, ISBN
  Database Encoding (optional)
Encoding Table
Classification      Symbol  Representation
Contents
  Extracted Field   A       Author
                    T       Title
                    L       Venue (Journal, BookTitle, Technical Report)
                    V       Volume
                    W       Issue
                    P       Page
                    Y       Date (Year, Month)
  Unknown Field     X       Single Unknown
                    B       Continuous Unknown
                    N       Numeral
Punctuation Mark
  Partition Mark    R       ,
                    D       .
                    G       "
                    E       '
                    C       :
                    Z       ;
                    H       -
  Brackets          I       ( [ < {
                    K       ) ] > }
  Misc              Q       / _ ! @ # $ % ^ & * + = \ | ? 。 ~
Others              F       Editor
                    S       Institution
                    M       Publisher
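The punctuation rows of the encoding table amount to a character lookup. The following is a minimal Python sketch of such a lookup, not BibPro's actual code; the helper name `encode_punctuation` is hypothetical, and only the symbol assignments come from the table.

```python
# Symbol assignments copied from the encoding table.
# The function and constant names are illustrative assumptions.
PARTITION_MARKS = {",": "R", ".": "D", '"': "G", "'": "E", ":": "C", ";": "Z", "-": "H"}
OPEN_BRACKETS = "([<{"         # all encoded as I
CLOSE_BRACKETS = ")]>}"        # all encoded as K
MISC = "/_!@#$%^&*+=\\|?。~"    # all encoded as Q


def encode_punctuation(ch):
    """Return the one-letter symbol for a punctuation character, or None."""
    if len(ch) != 1:
        return None
    if ch in PARTITION_MARKS:
        return PARTITION_MARKS[ch]
    if ch in OPEN_BRACKETS:
        return "I"
    if ch in CLOSE_BRACKETS:
        return "K"
    if ch in MISC:
        return "Q"
    return None   # not punctuation covered by the table
```

Characters that return None (letters, digits) would be handled by the keyword and numeral encoding steps instead.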
Encoding Knowledge
A [AUTHOR]
  Name abbreviation
T [TITLE]
  Length of blocking
L [VENUE]
  booktitle (conference): Proceedings Proc Workshop Conf Conference Symposium Sympos Symp International Intern Annual Annu
  journaltitle: Transactions Trans Journal
  techtitle (thesis): Tech rep Rpt TR Master Masters Ph PhD Thesis thesis Dissertation dissertation
V [VOLUME]
  volume: Volume volume Vol vol Vo vo
  issue: Number number Nr nr No no NO Nos
P [PAGE]
  page: pp page pages PP Page Pages pg PG
Encoding Knowledge
Y [DATE]
  month: January February March April May June July August September October November December Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec Sept
  year: 1900-2010
F [EDITOR]
  editor: eds Eds editors Editors editor Eds ED Ed ed edited
S [INSTITUTION]
  institution: University Univ Department Dept Corporation
M [PUBLISHER]
  publisher: Press Pub Publishers Inc Publications
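Keyword encoding with these reserved words can be sketched as a dictionary lookup with a few fallbacks. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: it uses only a subset of the word lists, maps the issue keywords to W as in the encoding table, and treats a single capital letter as a name initial. The names `RESERVED`, `LOOKUP`, and `encode_token` are hypothetical.

```python
import re

# Subset of the reserved words listed above, grouped by encoding symbol.
# Mapping the issue keywords to W follows the encoding table (an assumption).
RESERVED = {
    "L": "Proceedings Proc Workshop Conf Conference Symposium Transactions Trans Journal",
    "V": "Volume volume Vol vol",
    "W": "Number number Nr nr No",
    "P": "pp page pages Page Pages pg",
    "Y": "January February March April May June July August "
         "September October November December",
    "F": "eds Eds editors Editors editor Ed ed edited",
    "S": "University Univ Department Dept Corporation",
    "M": "Press Pub Publishers Inc Publications",
}
LOOKUP = {word: symbol for symbol, words in RESERVED.items() for word in words.split()}


def encode_token(token):
    """Map one word or number token to an encoding symbol."""
    if token in LOOKUP:
        return LOOKUP[token]
    if re.fullmatch(r"\d{4}", token) and 1900 <= int(token) <= 2010:
        return "Y"   # four-digit year in the listed range
    if token.isdigit():
        return "N"   # any other numeral
    if re.fullmatch(r"[A-Z]", token):
        return "A"   # single capital letter read as a name initial (assumption)
    return "X"       # unknown word
```

A year outside 1900-2010, such as 1850, falls through to N in this sketch because it is only a numeral as far as the listed knowledge is concerned.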
Tokenizing and Encoding Citation
M . Bianchini , P . Frasconi , and M . Gori , " Learning in multilayered networks used as autoassociators , " IEEE Transactions on Neural Networks , vol . 6 , pp . 512 - 515 , March 1995 .
[Figure: each token of the citation above labeled with its encoding symbol, e.g. name initials as A, "." as D, "," as R, quotation marks as G, "Transactions" as L, "vol" as V, "pp" as P, "-" as H, numerals as N, "March" and "1995" as Y, and unrecognized words as X.]
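The tokenizing step separates words from punctuation so that each token can be encoded on its own. A minimal Python sketch, assuming a simple regular-expression tokenizer (the function name `tokenize` and the pattern are illustrative, not taken from BibPro):

```python
import re


def tokenize(citation):
    """Split a citation into word tokens and single punctuation tokens."""
    # \w+ grabs a run of letters or digits; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", citation)


tokens = tokenize("IEEE Transactions on Neural Networks, vol. 6, pp. 512-515, March 1995.")
print(" ".join(tokens))
# IEEE Transactions on Neural Networks , vol . 6 , pp . 512 - 515 , March 1995 .
```

Keeping every punctuation mark as its own token matters here because the partition marks carry the citation style.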
Blocking Mechanism
After encoding the citation, we can exploit its semi-structured nature through special patterns in the symbol sequence
Use blocking rules to merge each special pattern into a single unit
  e.g. ADXRA -> A
Blocking (1/2)
[Figure: blocking applied step by step to the encoded example citation. Runs of symbols are merged into single blocks: the author names into A, the unknown words inside the quotation marks into B, the venue words into L, the volume into V, and the pages into P.]
Blocking (2/2)
Index Form: ARGBRGLRVRPRYD
  (keep the blocking area information, i.e. the start and end position of each block, e.g. "A" start: 0, end: 11)
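Blocking can be sketched as an ordered list of patterns applied to the encoded symbol string, recording the start and end position of each merged block. The rules below are hypothetical regular expressions chosen so that the example citation produces the Index Form shown above; BibPro's real blocking rules are not given on the slides.

```python
import re

# Hypothetical blocking rules (assumptions), tried in order at each position.
RULES = [
    ("A", re.compile(r"ADX(?:RX?ADX)*")),  # initial . surname [, [and] initial . surname]...
    ("L", re.compile(r"X*L[XL]*")),        # venue keyword with surrounding unknown words
    ("V", re.compile(r"VD?N")),            # vol . 6
    ("P", re.compile(r"PD?NHN")),          # pp . 512 - 515
    ("Y", re.compile(r"Y+")),              # March 1995
    ("B", re.compile(r"X{2,}")),           # continuous unknown
]


def block(encoded):
    """Return (symbol, start, end) blocks; end positions are inclusive."""
    blocks, pos = [], 0
    while pos < len(encoded):
        for symbol, pattern in RULES:
            m = pattern.match(encoded, pos)
            if m:
                blocks.append((symbol, pos, m.end() - 1))
                pos = m.end()
                break
        else:
            # No rule applies: keep the symbol as a one-token block.
            blocks.append((encoded[pos], pos, pos))
            pos += 1
    return blocks


# Encoded form of the example citation, one symbol per token.
encoded = ("ADXRADXRXADX" + "RG" + "XXXXXXX" + "RG" + "XLXXX"
           + "R" + "VDN" + "R" + "PDNHN" + "R" + "YY" + "D")
blocks = block(encoded)
index_form = "".join(symbol for symbol, _, _ in blocks)
```

With these assumed rules, `index_form` is `ARGBRGLRVRPRYD` and the first block is `("A", 0, 11)`, which agrees with the start and end positions quoted above.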
Template Database
A record in the template database
  A citation item with both citation string and metadata
  Style Form
  Index Form
Once the template database has been constructed, BibPro can provide the citation parsing service on the fly
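A template database record, as described on this slide, could be represented by a small data class. This is only a sketch of the data structure; the class and field names are assumptions, while the example values come from the citation used throughout the slides.

```python
from dataclasses import dataclass, field


@dataclass
class TemplateRecord:
    """One template: a citation with known metadata and its two encoded forms."""
    citation: str                                  # raw citation string
    metadata: dict = field(default_factory=dict)   # known field values
    style_form: str = ""                           # blocks labeled with known fields (e.g. T)
    index_form: str = ""                           # same blocks with unknowns (e.g. B)


record = TemplateRecord(
    citation='M. Bianchini, P. Frasconi, and M. Gori, "Learning in multilayered '
             'networks used as autoassociators," IEEE Transactions on Neural '
             'Networks, vol. 6, pp. 512-515, March 1995.',
    metadata={
        "author": "M. Bianchini, P. Frasconi, and M. Gori",
        "title": "Learning in multilayered networks used as autoassociators",
        "venue": "IEEE Transactions on Neural Networks",
        "volume": "6",
        "page": "512-515",
        "date": "March 1995",
    },
    style_form="ARGTRGLRVRPRYD",
    index_form="ARGBRGLRVRPRYD",
)
```

Storing both forms lets the index form serve as the search key while the style form supplies the field labels once a match is found.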
Citation Style Template
[Figure: the encoded example citation shown twice. In the Style Form (known answer), the block inside the quotation marks is labeled T (title). In the Index Form (unknown answer), the same block is labeled B (continuous unknown). The other blocks (A, L, V, P) are the same in both forms.]
Parsing (Template Matching)
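[Figure: template matching. System preprocessing stores INDEXFORM/STYLEFORM pairs in the template database. During online parsing, the citation string is encoded into an INDEXFORM, and the search tool (BLAST) compares it with the stored INDEXFORM/STYLEFORM pairs.]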
Finding Citation Style Templates
Use a score mechanism
Find similar citation style templates
  BLAST score matrix
  Which fields exist in the query citation
  The order of the partition marks represents a citation style
  Choose by IndexForm
Align the query citation with each citation style template
  Score matrix (dynamic programming)
  Content symbols map to content symbols
  Partition marks map to partition marks
Choose the most suitable citation style template according to the alignment between IndexForm and StyleForm
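The dynamic-programming alignment can be illustrated with a plain Needleman-Wunsch global alignment score in which content symbols score well against content symbols and punctuation against punctuation. This is a sketch only: the score values, the set names, and the second template string are assumptions, not BibPro's actual score matrix or data.

```python
CONTENT = set("ATLVWPYXBNFSM")   # field, unknown, and numeral symbols
PUNCT = set("RDGECZHIKQ")        # partition marks, brackets, misc

# Illustrative score values (assumptions).
MATCH, SAME_CLASS, MISMATCH, GAP = 2, 1, -2, -1


def pair_score(a, b):
    """Score one aligned pair of symbols."""
    if a == b:
        return MATCH
    if (a in CONTENT and b in CONTENT) or (a in PUNCT and b in PUNCT):
        return SAME_CLASS
    return MISMATCH


def align_score(query, template):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    rows, cols = len(query) + 1, len(template) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * GAP
    for j in range(1, cols):
        dp[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair_score(query[i - 1], template[j - 1]),
                dp[i - 1][j] + GAP,   # gap in the template
                dp[i][j - 1] + GAP,   # gap in the query
            )
    return dp[-1][-1]


def best_template(query, templates):
    """Pick the template whose form aligns best with the query index form."""
    return max(templates, key=lambda t: align_score(query, t))


query = "ARGBRGLRVRPRYD"
templates = ["ARGTRGLRVRPRYD", "ADYDTDLRVIWKRPD"]   # second one is hypothetical
chosen = best_template(query, templates)
```

Under these assumed scores, the query and the first template differ only in B versus T, both content symbols, so that pair still earns a positive score and the first template is chosen.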
Parsing (Alignment Extraction)
(Query)    Index Form  ARGBRGLRVRPRYD
                       |||:||||||||||
(Template) Style Form  ARGTRGLRVRPRYD
[Figure: the example citation with its blocks labeled by the aligned Style Form ARGTRGLRVRPRYD. The block that was B (continuous unknown) in the Index Form is now labeled T (title); the other blocks keep their labels A, L, V, and P.]
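Once the Index Form is aligned with a Style Form, each block of the query inherits the label at the same position in the template. The Python sketch below assumes a gap-free alignment of equal length, as in the ARGBRGLRVRPRYD / ARGTRGLRVRPRYD example; the function name, the field-name mapping, and the short sample citation are illustrative assumptions.

```python
# Symbols that name an extracted field (from the encoding table).
FIELD_NAMES = {"A": "author", "T": "title", "L": "venue", "V": "volume",
               "W": "issue", "P": "page", "Y": "date"}


def extract_fields(tokens, blocks, style_form):
    """Label each query block with the symbol at the same position in style_form.

    tokens:     token list of the query citation
    blocks:     (symbol, start, end) per Index Form symbol, end inclusive
    style_form: template Style Form, assumed to be the same length as blocks
    """
    fields = {}
    for (_, start, end), label in zip(blocks, style_form):
        if label in FIELD_NAMES:
            fields[FIELD_NAMES[label]] = " ".join(tokens[start:end + 1])
    return fields


# Small hypothetical query: Index Form ARBRYD aligned with Style Form ARTRYD.
tokens = ["N", ".", "Chomsky", ",", "Three", "models", ",", "1956", "."]
blocks = [("A", 0, 2), ("R", 3, 3), ("B", 4, 5), ("R", 6, 6), ("Y", 7, 7), ("D", 8, 8)]
fields = extract_fields(tokens, blocks, "ARTRYD")
```

Here the unknown block B picks up the title label T from the template, which is the step the alignment makes possible. A real system would still need post-processing, for example to re-join tokens and strip keywords such as "vol".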
System Architecture
[Figure: system architecture diagram with two subsystems, the Parsing System and the Template Generating System. Components shown: Query Citation, Citation Pre-Processing (Blocking), Blocking Rule, Encoding Citation, Encode Table (Database), INDEX FORM, STYLE FORM, BLAST, Template Matching, Template Database, Template Filter, Alignment Extraction, Post-Processing, MetaData, Search Engine (CiteSeer), Citations, BibTeX.]
Experiment
Dataset
  INFOMAP Dataset
    A total of 160,000 citation records were collected from digital libraries on the Web
    Citation string data was generated for each of the six citation styles (APA, IEEE, ACM, MISQ, JMIS, and ISR)
  Cora Dataset
    500 records with diverse citation styles
  Flux-CIM Dataset
    2000 HS-domain records
    300 CS-domain records
Experiment
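Evaluation
Token-Level
  A is the number of true positive tokens
  B is the number of false negative tokens
  C is the number of false positive tokens
  D is the number of true negative tokens

  \[ \text{Precision} = \frac{A}{A + C} \qquad \text{Recall} = \frac{A}{A + B} \]
  \[ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
Field-Level
  \[ \text{Accuracy} = \frac{\text{Number of correctly extracted fields}}{\text{Total number of fields}} \]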
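The evaluation measures defined on this slide translate directly into code. A small Python sketch, assuming non-zero denominators; the function names and the sample counts are illustrative and are not results from the experiments.

```python
def token_metrics(a, b, c):
    """Token-level precision, recall, and F-measure.

    a: true positive tokens, b: false negative tokens, c: false positive tokens
    """
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def field_accuracy(correct_fields, total_fields):
    """Field-level accuracy: correctly extracted fields over all fields."""
    return correct_fields / total_fields


# Made-up counts, only to show the arithmetic.
precision, recall, f_measure = token_metrics(a=90, b=10, c=30)
accuracy = field_accuracy(correct_fields=45, total_fields=50)
```

The true negative count D does not enter any of these formulas, which is why the sketch takes only A, B, and C.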
INFOMAP Dataset Result
          Token-Level                        Field-Level
          Precision   Recall    F-Measure    Accuracy
Author    99.38%      99.37%    99.38%       98.14%
Title     99.58%      97.23%    98.39%       95.02%
Venue     98.00%      97.85%    97.92%       96.36%
Volume    96.44%      97.83%    98.90%       97.36%
Issue     98.89%      95.58%    97.21%       94.04%
Page      99.58%      98.93%    99.25%       98.16%
Date      99.29%      99.59%    99.44%       99.16%
Average   98.74%      98.06%    98.64%       96.89%
Cora Dataset Result
          Token-Level                        Field-Level
          Precision   Recall    F-Measure    Accuracy
Author    95.87%      97.75%    96.79%       90.99%
Title     97.45%      95.15%    96.29%       90.46%
Venue     91.77%      91.21%    91.47%       79.77%
Volume    87.57%      87.94%    87.72%       85.64%
Issue     79.96%      93.85%    86.22%       77.51%
Page      97.15%      97.01%    97.07%       95.78%
Date      96.91%      97.16%    97.03%       94.04%
Average   92.38%      94.30%    93.22%       87.74%
Flux-CIM HS Domain Dataset
          Token-Level                        Field-Level
          Precision   Recall    F-Measure    Accuracy
Author    93.45%      99.74%    96.49%       93.30%
Title     97.45%      95.11%    96.26%       92.09%
Venue     96.21%      99.29%    97.71%       98.65%
Volume    99.70%      98.64%    99.17%       98.85%
Issue     n/a         n/a       n/a          n/a
Page      100.00%     97.07%    98.51%       96.99%
Date      98.38%      100.00%   99.18%       99.00%
Average   97.53%      98.31%    97.89%       96.48%
Flux-CIM CS Domain Dataset
          Token-Level                        Field-Level
          Precision   Recall    F-Measure    Accuracy
Author    97.89%      98.26%    98.05%       96.64%
Title     99.30%      98.12%    98.70%       96.32%
Venue     98.40%      98.80%    98.60%       90.89%
Volume    83.71%      85.77%    84.71%       82.13%
Issue     89.01%      88.56%    86.90%       87.75%
Page      99.47%      96.91%    98.17%       96.17%
Date      91.74%      98.71%    95.07%       89.59%
Average   94.22%      95.02%    94.31%       91.35%
Conclusion
Parsing citations is a challenging problem
  Diversity in citation formats
We present a template-based parser
  Parser System
    http://csclws.iis.sinica.edu.tw:8080/input.jsp
  Template Generator System
    http://csclws.iis.sinica.edu.tw:8080/tpin.jsp
Reference
F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.
Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48.
K. Seymore, A. McCallum, R. Rosenfeld. Learning hidden Markov model structure for information extraction. AAAI-99 Workshop on Machine Learning for Information Extraction, 1999, 37-42.
Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60.
Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.
E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: flexible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215-224, Vancouver, BC, Canada, 2007. ACM.
I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.
Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho. BibPro: A Citation Parser Based on Sequence Alignment Techniques. AINA Workshops, 2008, pp. 1175-1180.
Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.
S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman. A basic local alignment search tool. J. Mol. Biol., 215, 1990, 403-410.
Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 1970, 443-453.