bibpro: a citation parser system

32
CSCLab BibPro: A Citation Parser System

Upload: kieve

Post on 12-Feb-2016

62 views

Category:

Documents


0 download

DESCRIPTION

BibPro: A Citation Parser System. Introduction. Integrating bibliographical information Metadata author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc. Citation string Thousands of variations More than 2,000 formats in Endnote - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: BibPro: A Citation Parser System

CSCLab

BibPro: A Citation Parser System

Page 2: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Introduction

Integrating bibliographical information Metadata

author, title of the article, title of the book containing the paper, journal name, month and year of publication, etc.

Citation string Thousands of variations

More than 2,000 formats in Endnote Citation Parsing Problem

automatically recognize individual fields from a given citation string

A template citation parser

Page 3: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Goal

CitationChomsky, Noam. 1956. Three models for the description oflanguage. IRE Transactions on Information Theory. 2(3) 113--124.

BibPro

MetaDataAuthor: Chomsky, NoamTitle: Three models for the description of languageVenue: IRE Transactions on Information TheoryVolumn: 2Issue: 3Page: 113-124Date:

Page 4: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Machine Learning

Condition Random Field F. Peng, A. McCallum. Accurate information extraction from research papers

using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.

Support Vector Machine Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A.

Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48.

Hidden Markov Model K. Seymore, A. McCallum, R. Rosenfeld. Learning hiddenMarkov model

structure for information extraction. AAAI-99Workshop on Machine Learning for Information Extraction, 1999, 37-42.

Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60.

Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.

Page 5: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Knowledge Base

A tree-like knowledge representation scheme that organizes the knowledge of reference concepts in a hierarchical fashion

Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.

A knowledge base automatically constructed from an existing set of sample metadata records of a given area

E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: exible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215{224, Vancouver, BC, Canada, 2007. ACM.

Page 6: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Template Base

Keep citation style as a template ParaCite http://paracite.eprints.org/ I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua

Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.

Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho, BibPro: A Citation Parser Based on Sequence Alignment Techniques, ainaw,pp.1175-1180, aina workshops 2008, 2008.

Page 7: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Key Idea

Encode a citation string into a template for BLAST Keeping citation style information into a

protein sequence Utilizing bioinformatics sequence tools

BLAST Using Domain Knowledge

Reserved word Knowledge database (optional) Blocking rule (common sense knowledge)

Page 8: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Question

How many symbols can be used in a protein sequence?23 symbols used in BLAST

How many fields should be extracted from a citation?choose the most common used field

Which punctuation marks are treated as partition marksbase on domain knowledge

How do we transform a citation into a protein sequence and retain its structure feature?define a encode table

Page 9: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Encoding Procedure

Encoding Citation

Regular ExpressionExtraction

Keyword Encoding

Author Venue EditorVolume Page Date

Publisher Instituation

Http ISBN

Database Encoding(optional)

Page 10: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Encoding Table  Classification Symbol Representation

Contents

Extracted Field

A AuthorT Title

L Venue (Journal, BookTitle, Technical Report)

V VolumeW IssueP PageY Date (Year Month)

Unknow FieldX Single UnknowB Continuous UnknowN Numeral

Punctuation Mark

Partition Mark

R ,D .G "E 'C :Z ;H -

BracketsI ( [ < {K ) ] > }

Misc Q / _ ! @ # $ % ^ & * + = \ | ? 。 ~

Others  F EditorS InstitutionM Publisher

Page 11: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Encoding Knowledge

A [AUTHOR] Name abbreviation

T [TITLE] Length of Blocking

L [VENUE] booktitle(conference): Proceedings Proc Workshop Conf

Conference Symposium Sympos Symp International Intern Annual Annu

journaltitle: Transactions Trans Journal techtitle(thesis): Tech rep Rpt TR Master Masters Ph PhD

Thesis thesis Dissertation dissertation V [VOLUME]

volume: Volume volume Vol vol Vo vo issue: Number number Nr nr No no NO Nos

P [PAGE] page: pp page pages PP Page Pages pg PG

Page 12: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Encoding Knowledge

Y [DATE] month: January February March April May June July August

September October November December Jan Feb Mar Apr Jun Jul Aug Sep Oct Nov Dec Sept

year: 1900-2010 F [EDITOR]

editor: eds Eds editors Editors editor Eds ED Ed ed edited S [INSTITUTION]

institution: University Univ Department Dept Corporation M [PUBLISHER]

publisher: Press Pub Publishers Inc Publications

Page 13: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Tokenizing and Encoding Citation

M . Bianchini , P . Frasconi , and M . Gori , " Learning in multilayered networks used as autoassociators , " IEEE Transactions on Neural Networks , vol . 6 , pp . 512 -515 , March 1995 .

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

A D X R A D X R A .X GRXD X X X X X X X

RXX L X X V RN YD DP YND H N RGR

Page 14: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Blocking Mechanism

After encoding the citation, we can utilize semi-structured characteristic of citation by some special pattern of sequence Using blocking rule to merge special pattern

into a single unit e.g ADXRA A

Page 15: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Blocking(1/2)

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

D X R A D X R .X GRXD X X X X X X

RXX X X RN YD DYND C N RGR

A B

L V P

A

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

A D X R A D X R A .X GRXD X X X X X X X

RXX L X X V RN YD DP YND H N RGR

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

D X R A D X R A .X GRXD X X X X X X

RXX X X RN MD DYND C N RGR

A B

L V P

Page 16: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

D X R A D X R A .X GRXD X X X X X X

RXX X X RN MD DYND C N RGR

A B

L V P

Blocking(2/2)

Index Form ARGBRGLRVRPRYD

(keep the blocking area information: start position and end position

e.g. “A” start:0 end:11 )

Page 17: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

A record in the Template Database A citation item with both citation string and

metadata Style Form Index Form

Once the template database has been constructed, BibPro can provide the citation parsing service on-the-fly

Template Database

Page 18: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Citation Style Template

Index Form (Unknown Answer)

Style Form (Known Answer)M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

D X R A D X R A .X GRXD X X X X X X

RXX X X RN MD DYND C N RGR

A T

L V P

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

D X R A D X R A .X GRXD X X X X X X

RXX X X RN MD DYND C N RGR

A B

L V P

Page 19: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Parsing (Template Matching)

Citation String

INDEXFORM

INDEXFORM

STYLEFORM

Search ToolBLAST

INDEXFORM

STYLEFORM

INDEXFORM

STYLEFORM

.

...

System Preprocess Online parsing

Page 20: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Finding Citation Style Templates

Using Score mechanism Finding similar citation style templates

Blast Score Matrix which fields exist in query citation the order of partition mark represent a citation style Choose by IndexForm

Align query citation with citation style template Score Matrix (dynamic programming)

Content Symbol map to Content Symbol Partition Mark map to Partition Mark Choose the most suitable citation style template

according to alignment between IndexForm and StyleForm

Page 21: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

  Classification Symbol Representation

Contents

Extracted Field

A AuthorT TitleL Venue (Journal, BookTitle, Technical ReportV Volume IssueP PageY Date (Year Month)

Unknow FieldX Single UnknowB Continuous UnknowN Numeral

Punctuation Mark

Partition Mark

R ,D .G "E 'C :Z ;H -

BracketsI ( [ < {K ) ] > }

Misc Q / _ ! @ # $ % ^ & * + = \ | ? 。 ~

Others  F EditorS InstitutionM PublisherW

Page 22: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

(Query) Index Form ARGBRGLRVRPRYD

|||:||||||||||

(Template) Style Form ARGTRGLRVRPRYD

ARGTRGLRVRPRYD

Parsing (Alignment Extraction)

M . Bianchini , P . Frasconi , M .and “,Gori. Learning in multilayered newtorks used as autoassociators

,NetworksIEEE Transactions on Neural vol ,6 March. .pp 1995512. - 515 ,“,

D X R A D X R A .X GRXD X X X X X X

RXX X X RN MD DYND C N RGR

A T

L V P

Page 23: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

System Architecture

QueryCitation

EncodingCitation

BLAST

Template DataBase

AlignmentExtraction

MetaData

INDEX FORM

Encode Table(DataBase)

Template Filter

EncodingCitation

SearchEngine

(CiteSeer)

Parsing System Template Generating System

PostProcessing

Blocking Rule

Pre-Processing(Blocking)

Pre-Processing(Blocking)

Citations MetaData

INDEXFORM

STYLEFORM

CitationPreProcessing

TemplateMatching

BibTex

STYLEFORM

Page 24: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Experiment

Dataset INFOMAP Dataset

A total of 160,000 citation records were collected from digital libraries on the Web

Citation string data was generated for each of the six citation styles (APA, IEEE, ACM, MISQ, JMIS, and ISR)

Cora Dataset 500 records with diversity citation style

Flux-CIM Dataset 2000 HS-domain records 300 CS-domain records

Page 25: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Experiment

Evaluation Token-Level

A is the number of true positive tokens B is the number of false negative tokens C is the number of false positive tokens D is the number of true negative tokens

Field-Level

RecallPrecisionRecallPrecision2measure-F

BAARecall

CAAPrecision

fieldsofnumberTotalfieldsextractedcorrectlyofNumberAccuracy

Page 26: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

INFOMAP Dataset Result

Token-Level Field-Level

Precision Recall F-Measure Accuracy

Author 99.38% 99.37% 99.38% 98.14%

Title 99.58% 97.23% 98.39% 95.02%

Venue 98.00% 97.85% 97.92% 96.36%

Volume 96.44% 97.83% 98.90% 97.36%

Issue 98.89% 95.58% 97.21% 94.04%

Page 99.58% 98.93% 99.25% 98.16%

Date 99.29% 99.59% 99.44% 99.16%

Average 98.74% 98.06% 98.64% 96.89%

Page 27: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Cora Dataset Result

Token-Level Field-Level

Precision Recall F-Measure Accuracy

Author 95.87% 97.75% 96.79% 90.99%

Title 97.45% 95.15% 96.29% 90.46%

Venue 91.77% 91.21% 91.47% 79.77%

Volume 87.57% 87.94% 87.72% 85.64%

Issue 79.96% 93.85% 86.22% 77.51%

Page 97.15% 97.01% 97.07% 95.78%

Date 96.91% 97.16% 97.03% 94.04%

Average 92.38% 94.30% 93.22% 87.74%

Page 28: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Flux-CIM HS Domain Dataset

Token-Level Field-Level

Precision Recall F-Measure Accuracy

Author 93.45% 99.74% 96.49% 93.30%

Title 97.45% 95.11% 96.26% 92.09%

Venue 96.21% 99.29% 97.71% 98.65%

Volume 99.70% 98.64% 99.17% 98.85%

Issue n/a n/a n/a n/a

Page 100.00% 97.07% 98.51% 96.99%

Date 98.38% 100.00% 99.18% 99.00%

Average 97.53% 98.31% 97.89% 96.48%

Page 29: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Flux-CIM CS Domain Dataset

Token-Level Field-Level

Precision Recall F-Measure Accuracy

Author 97.89% 98.26% 98.05% 96.64%

Title 99.30% 98.12% 98.70% 96.32%

Venue 98.40% 98.80% 98.60% 90.89%

Volume 83.71% 85.77% 84.71% 82.13%

Issue 89.01% 88.56% 86.90% 87.75%

Page 99.47% 96.91% 98.17% 96.17%

Date 91.74% 98.71% 95.07% 89.59%

Average 94.22% 95.02% 94.31% 91.35%

Page 30: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Conclusion

Parsing citation is a challenging problem Diversity in citation formats

We present a template-based parser Parser System

http://csclws.iis.sinica.edu.tw:8080/input.jsp Template Generator System

http://csclws.iis.sinica.edu.tw:8080/tpin.jsp

Page 31: BibPro: A Citation Parser System

www.iis.sinica.edu.tw

Reference

F. Peng, A. McCallum. Accurate information extraction from research papers using conditional random fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, 329-336.

Hui Han, Giles, C.L., Manavoglu, E., Hongyuan Zha, Zhenyue Zhang, Fox, E.A. Automatic document metadata extraction using support vector machines. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 37-48.

K. Seymore, A. McCallum, R. Rosenfeld. Learning hiddenMarkov model structure for information extraction. AAAI-99Workshop on Machine Learning for Information Extraction, 1999, 37-42.

Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, 2003, 49-60.

Min-Yuh Day et al. Reference metadata extraction using a hierarchical knowledge representation framework. Decision Support Systems, 2006.

E. Cortez, A. S. da Silva, M. A. Goncalves, F. Mesquita, and E. S. de Moura. FLUX-CiM: exible unsupervised extraction of citation metadata. In Proc. of the 7th ACM/IEEE Joint Conf. on Digital Libraries, pages 215{224, Vancouver, BC, Canada, 2007. ACM.

I-Ane Huang, Jan-Ming Ho, Hung-Yu Kao, and Shian-Hua Lin. Extracting citation metadata from online publication lists using BLAST. In PAKDD, 2004, 539-548.

Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao, Jan-Ming Ho, BibPro: A Citation Parser Based on Sequence Alignment Techniques, ainaw,pp.1175-1180, aina workshops 2008, 2008.

Erik Hetzner. A simple method for citation metadata extraction using hidden Markov models. JCDL 2008.

S. F. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman. A basic local alignment search tool. J. Mol. Biol., 215, 1990, 403-410.

Needleman, S. B. and Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 1970, 443-453.

Page 32: BibPro: A Citation Parser System

CSCLab