Bibliographic metadata (including citation)

Upload: ukoln-dev-university-of-bath

Post on 27-Jun-2015

Category:

Education


DESCRIPTION

A talk given at the automatic metadata extraction workshop by Intrallect and Jisc. This talk covers bibliographic metadata extraction in the context of automated extraction.

TRANSCRIPT

1. Bibliographic metadata (including citation)

  • Tuesday 7th July 2009, AMG 2nd workshop, University of Leicester, Leicester
  • www.bath.ac.uk
  • UKOLN is supported by:
  • Alexey Strelnikov, Research Officer, UKOLN
  • Contributions from Emma Tonkin

2. Agenda

  • Introduction
  • What and why
  • Use cases
  • Key points
  • Issues
  • Recommendations

8. Introduction

  • Metadata extraction is the process of describing extrinsic and intrinsic qualities of a resource

9. Bibliographic metadata

  • Bibliographic metadata extraction is a particular case of metadata extraction
  • For example:
    • Title
    • Authors
    • Emails
    • Citations

15. What and why

  • General metadata extraction tends to involve machine learning
  • Citation and reference analysis usually involves regular expressions
  • Might involve visual structure analysis and text mining
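As a rough illustration of the regular-expression approach to reference analysis, the sketch below pulls authors, year and title out of a reference written in one simple author-year layout. The pattern, the field names and the sample reference are assumptions made for illustration, not part of the talk; real reference lists vary far more and would need a pattern per citation style.

```python
import re

# Hypothetical pattern for one author-year layout:
#   "Surname, I., Surname, I. (2009). Title. Venue."
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s*"      # everything before the year
    r"\((?P<year>\d{4})\)\.\s*"  # four-digit year in parentheses
    r"(?P<title>[^.]+)\."        # title runs up to the next full stop
)

def parse_reference(text):
    """Return a dict of fields, or None if the layout does not match."""
    match = REFERENCE.match(text.strip())
    return match.groupdict() if match else None

# Made-up reference, used only to exercise the pattern.
sample = "Smith, J., Jones, A. (2009). An example title. Example Journal."
print(parse_reference(sample))
```

Returning `None` for an unmatched layout lets the caller fall back to another pattern or to manual entry instead of accepting a wrong guess.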

18. What and why (2)

  • To speed up long, tedious manual operations on metadata:
    • Generation of metadata on document deposit
    • Revision of metadata
    • Comparison and aggregation

22. What and why (3)

  • Automatic extraction can make a system more robust (in addition to existing approaches)
  • It is not a drop-in replacement for manual creation, but semi-automated feature extraction can make for better metadata quality overall

24. Use case (1)

  • Dominik is a researcher, publishing his new paper
  • Instead of a fully manual deposit (typing in all values), he makes use of system suggestions, which make the process faster and simpler

26. Use case (2)

  • Fiona is a researcher, assessing the impact made by her paper
  • How many citations does my work have?
  • Network of citations (existing systems: Google Scholar, citeseer.net...)

29. Use case (3)

  • Bob is a repository manager, checking for inconsistencies in the repository's metadata
  • Makes use of system recommendations and a confidence level for each generated value
  • Easier to find invalid or obsolete metadata values
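A minimal sketch of how such a check might work, assuming the extraction service returns a suggested value and a confidence between 0 and 1 for each field. The threshold, the function name and the record layout are hypothetical choices for illustration only.

```python
# Hypothetical threshold below which a suggestion is ignored.
MIN_CONFIDENCE = 0.8

def flag_suspect_fields(stored, suggestions, min_confidence=MIN_CONFIDENCE):
    """List fields where a confident suggestion disagrees with the stored value.

    stored:      {field: value} as held in the repository
    suggestions: {field: (suggested_value, confidence between 0 and 1)}
    """
    suspects = []
    for field, (suggested, confidence) in suggestions.items():
        if confidence < min_confidence:
            continue  # the extractor is unsure, so do not raise a flag
        if stored.get(field) != suggested:
            suspects.append((field, stored.get(field), suggested, confidence))
    return suspects

# Made-up record and suggestions.
record = {"title": "Untitled", "year": "2009"}
suggested = {
    "title": ("Bibliographic metadata", 0.93),
    "year": ("2009", 0.99),
    "language": ("en", 0.41),
}
for field, old, new, conf in flag_suspect_fields(record, suggested):
    print(f"{field}: stored {old!r}, suggested {new!r} (confidence {conf:.2f})")
```

Only the title is reported here: the year agrees with the stored value, and the language suggestion falls below the threshold.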

32. Use case (4)

  • Edward is an application profile/standard curator, checking inter-repository metadata
  • Has an application profile, but no feedback on how it is followed
  • Consistent errors:
    • Not filled
    • Systematically wrong value (might be related to research field, environment)
  • Comparison & aggregation report
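One simple form such a report could take is a per-field fill rate across harvested records, which makes "not filled" errors visible. The sketch below assumes records arrive as plain dictionaries keyed by profile field name; the field names and sample records are made up.

```python
from collections import Counter

def fill_rates(records, profile_fields):
    """Return the share of records with a non-empty value for each profile field."""
    filled = Counter()
    for record in records:
        for field in profile_fields:
            value = record.get(field) or ""
            if str(value).strip():
                filled[field] += 1
    total = len(records)
    return {f: (filled[f] / total if total else 0.0) for f in profile_fields}

# Hypothetical profile and harvested records.
profile = ["title", "creator", "date"]
harvested = [
    {"title": "Paper A", "creator": "A. Author", "date": "2009"},
    {"title": "Paper B", "creator": "", "date": "2008"},
    {"title": "Paper C", "date": "2009"},
    {"title": "Paper D", "creator": "D. Author"},
]
print(fill_rates(harvested, profile))
```

Running the same report per repository would let a curator compare how closely each one follows the profile.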

36. Summary for use cases

  • All approaches have a manual analogue
  • Automated metadata extraction would be an improvement, but not a replacement
  • The service is invisible, it just makes suggestions: for example 'the metadata field title should be Some name'

39. Key points

  • Standards - those involved in the workflow make a big impact
    • The nice thing about standards is that there are so many of them to choose from (Andrew S. Tanenbaum)
  • Tools - existing applications to extract metadata

40. Standards

  • Should consider a number of standards for representation and format, as well as languages and locales
    • Document encoding
    • Metadata encoding
    • Locale specifics
    • Citation formats

44. Document encoding

  • Important because this may impact correct reading of a resource
  • Document formats:
    • PDF, Doc, PPT, etc.
  • Font encoding:
    • UTF, locale specific
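To show why encoding matters for reading a resource, here is a small sketch that tries strict UTF-8 first and falls back to a single-byte encoding. The candidate list and function name are assumptions; a real system would pick fallbacks to suit its locale or use a detection library.

```python
def decode_text(raw, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in turn; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: keep going with replacement characters, and say so.
    return raw.decode("utf-8", errors="replace"), None

name = "M\u00fcller"  # "Müller", an author name with a non-ASCII letter
print(decode_text(name.encode("utf-8")))
print(decode_text(name.encode("latin-1")))
```

UTF-8 goes first because it rejects most text written in other encodings, whereas latin-1 accepts any byte sequence and so can only serve as a last resort whose result may still be wrong.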

46. Metadata encoding

  • This has a direct impact on the result's usability in a given context
  • Examples of metadata standards:
    • OAI-DC
    • SWAP
    • LOM
    • OAI-ORE
    • MARC
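As an example of one such encoding, the sketch below serialises a few extracted values as an OAI-DC record using Python's standard library. The helper name and the input layout are assumptions, and the schema location attribute that a full OAI-PMH response carries is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the oai_dc container and the Dublin Core element set.
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

def to_oai_dc(fields):
    """Build an oai_dc:dc element from {dc_element_name: [values]}."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in fields.items():
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return root

# Values taken from the title slide of this talk.
extracted = {
    "title": ["Bibliographic metadata (including citation)"],
    "creator": ["Strelnikov, Alexey"],
    "date": ["2009-07-07"],
}
print(ET.tostring(to_oai_dc(extracted), encoding="unicode"))
```

Keeping the extracted values in a neutral dictionary and serialising at the end means the same extraction result could be written out in another of the listed standards by swapping only this last step.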

52. Locale specifics

  • Country and culture specific formats of text elements
  • For example:
    • Right-to-left languages
    • Date format:
      • dd/mm/yyyy
      • mm/dd/yyyy
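The two date formats cannot always be told apart from the string alone. The sketch below, with hypothetical names, tries both readings and returns every one that parses, so the caller can see when a value is ambiguous.

```python
from datetime import datetime

# The two layouts named above, expressed as strptime patterns.
FORMATS = {"dd/mm/yyyy": "%d/%m/%Y", "mm/dd/yyyy": "%m/%d/%Y"}

def interpret_date(text):
    """Return {format_label: date} for every format that parses the string."""
    readings = {}
    for label, pattern in FORMATS.items():
        try:
            readings[label] = datetime.strptime(text, pattern).date()
        except ValueError:
            pass  # this layout does not fit the string
    return readings

print(interpret_date("27/06/2015"))  # only the dd/mm/yyyy reading is valid
print(interpret_date("03/04/2009"))  # two different dates: ambiguous
```

When both readings succeed and differ, the locale of the source document is the only reliable tie-breaker, which is one reason a system may need retraining or reconfiguration per locale.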

56. Citation and reference formats

  • Many citation/reference formats exist, with different standards for most research fields
  • For example:
    • APA - social sciences
    • MLA - literature and the arts
    • AMA - biology
    • Turabian - multi-field
    • Chicago - standard publications
    • Harvard, Numerical, MHRA - multi-field

63. Tools

  • Automated metadata extraction is a workflow that involves several interconnected software systems
  • Helps to overcome standards heterogeneity

65. Examples of Tools

  • Examples of existing tools:
    • DC-dot (variety of doc/web formats -> DC metadata)
    • DepositPlait (var. format metadata -> metadata repository)
    • DataFountains (var. format -> metadata)
    • paperBase (prototype concentrating on eprint documents)

69. Issues

  • Full-text resource availability
  • Readability of the text
  • Legal issues
  • Engineering constraints - machine suggestions might be imperfect
  • Language & localization - need to retrain the system for each other locale

74. Recommendations

  • A robust system that is easy to retrain, with customizable input and output plugins
    • A potential gain:
      • Simplify (re)extraction of metadata, faster repository operations, validation
  • Making use of confidence level assigned to the metadata field
    • A potential gain:
      • Identifying possibly incorrect metadata records

75. Recommendations (2)

  • Make the full-text document available to the system
    • A potential gain:
      • Periodic re-exploration of the resource and updating of the metadata
  • Investigate the problem of analysing citations
    • A potential gain:
      • Assess level of similarity between papers
      • Classify paper nature
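The talk does not name a similarity measure. One common option, shown here purely as an assumption, is the Jaccard overlap of the two papers' reference lists: the more references they share, the closer the score is to 1. The reference identifiers below are made up.

```python
def citation_similarity(refs_a, refs_b):
    """Jaccard overlap of two reference lists, between 0.0 and 1.0."""
    a, b = set(refs_a), set(refs_b)
    if not a and not b:
        return 0.0  # no references on either side, so nothing to compare
    return len(a & b) / len(a | b)

# Hypothetical normalised reference identifiers for two papers.
paper_1 = {"ref:alpha", "ref:beta", "ref:gamma"}
paper_2 = {"ref:beta", "ref:gamma", "ref:delta"}
print(citation_similarity(paper_1, paper_2))  # 2 shared out of 4 distinct
```

The measure is only as good as the reference matching behind it, so references need to be normalised to a common identifier before the sets are compared.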

77. Q&A

  • Thank you for your attention