RefPars

Reference parsing

Introduction

The purpose of this page is to document a review of currently available tools.

Further searches for tools and papers are planned. Keywords to consider: bibliographic, citation, matching, parsing, reference.

Notes

Bibliography managers

A useful collection of tools for bibliography management with short reviews. This is not an exhaustive list, but mentions one or two tools that are not widely known.

Parsing tools

What's out there.

Reference Parser

See David Shorthouse's blog, http://ispiders.blogspot.com/2010/08/reference-parser-revived.html.

The tool itself is at http://refparser.shorthouse.net// where David describes it as:

"This jQuery plugin gives visitors of your pages quick access to a web service that parses verbatim journal article citations then gives them a link to the publisher's resource if the parsing is successful. It works by making a secondary web service call to CrossRef's OpenURL service. The plugin is especially useful if the reference citations you serve are user-generated, varied in format, or may be discoverable at some indefinite time in the future (e.g. a society's back issues scanned and later assigned DOIs)."
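The secondary call described in the quote can be sketched roughly as follows. This is a minimal illustration, not the plugin's own code: the endpoint URL and the parameter names (`pid`, `title`, `volume`, `spage`, `date`, `redirect`) are assumptions based on CrossRef's OpenURL conventions and should be checked against CrossRef's current documentation. The `parsed` dict and the function name are hypothetical. The sketch only builds the query URL and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint for CrossRef's OpenURL resolver (verify before use).
CROSSREF_OPENURL = "https://www.crossref.org/openurl"


def build_openurl_query(parsed, pid):
    """Build an OpenURL query URL from already-parsed citation fields.

    `parsed` is a hypothetical dict of the kind a citation parser might
    return; `pid` is the e-mail address the caller has registered.
    """
    # Only the first page of the range is sent as the start page.
    first_page = parsed["pages"].split("-")[0].strip()
    params = {
        "pid": pid,
        "title": parsed["journal"],  # in OpenURL, 'title' is the journal title
        "volume": parsed["volume"],
        "spage": first_page,
        "date": parsed["year"],
        "redirect": "false",  # assumed flag: return metadata, do not redirect
    }
    return CROSSREF_OPENURL + "?" + urlencode(params)


example = {
    "journal": "Bulletin of Zoological Nomenclature",
    "volume": "19",
    "year": "1962",
    "pages": "208-219",
}
url = build_openurl_query(example, pid="you@example.org")
print(url)
```

Journal, volume and start page are usually enough to identify an article, which is why the parse only needs to get those fields right for the lookup to stand a chance.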

The code components are downloadable from:

Biblio Citation Parser

Perl Biblio::Citation::Parser - http://search.cpan.org/~mjewell/Biblio-Citation-Parser-1.10/lib/Biblio/Citation/Parser/Jiao.pm

ParsCit

The homepage is at http://aye.comp.nus.edu.sg/parsCit/, where it states:

"This is the home page of the ParsCit project, which performs two tasks: 1) reference string parsing, sometimes also called citation parsing or citation extraction, and 2) logical structure parsing of scienfific [sic] documents. It is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service. The code contains both the training data, feature generator and shell scripts to connect the system to a web service (used on this web site)."

Originally developed jointly by Penn State and the National University of Singapore. Only the latter maintains it now, but it is still an active project.

Results

Generally ParsCit has the best performance of any parser tried so far.

ENDERLEIN, G. 1906c. Zehn neue aussereuropäische Copeognathen. Stettiner Entomologische Zeitung 67: 306-316, 1 fig.

  • author: G ENDERLEIN
  • volume: 67
  • date: 1906
  • title: Zehn neue aussereuropäische Copeognathen.
  • journal: Stettiner Entomologische Zeitung
  • pages: 306-316

Does a perfect job, but only after I cleaned up the data. On the first pass the en-dash between the page numbers was lost when the reference was transferred to ParsCit, resulting in 'fig' being identified as the pages. Replacing the en-dash with a hyphen gives the accurate results above.
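The clean-up step can be automated before references are submitted. The following is a minimal sketch, assuming the only problem is the dash character between two page numbers; the function name and the list of dash characters are choices made for this illustration, not part of ParsCit.

```python
import re

# Unicode dashes that commonly turn up in page ranges: hyphen,
# non-breaking hyphen, figure dash, en dash, em dash, minus sign.
_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"

# One or more dash characters (Unicode or ASCII), optionally padded with
# spaces, sitting directly between two digits.
_BETWEEN_DIGITS = re.compile(r"(?<=\d)\s*[" + _DASHES + r"-]+\s*(?=\d)")


def normalise_page_ranges(citation):
    """Replace any dash between two digits with a single ASCII hyphen."""
    return _BETWEEN_DIGITS.sub("-", citation)


raw = "Stettiner Entomologische Zeitung 67: 306\u2013316, 1 fig."
print(normalise_page_ranges(raw))
```

The same function also collapses spaced or doubled hyphens such as `208 - 219` and `208--219`, so page ranges reach the parser in one consistent form.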

P F Mattingly, A Stone, K L Knight (1962) Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so named. Z. N. (S.) 1216. Bulletin of Zoological Nomenclature 19: 208 - 219

  • author: P F Mattingly, A Stone, K L Knight
  • volume: 19
  • year: 1962
  • title: Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so named.
  • journal: Z. N. (S.) 1216. Bulletin of Zoological Nomenclature
  • pages: 208--219

Does a very good job, though it muddles the boundary between title and journal: the fragment 'Z. N. (S.) 1216.' ends up at the start of the journal field.

CiteSeerX

http://sourceforge.net/projects/citeseerx/

FreeCite

Available from http://freecite.library.brown.edu/

Where it is described as: 'FreeCite is an open-source application that parses document citations into fielded data. You can use it as a web application or a service. You can also download the source and run FreeCite on your own server. FreeCite is distributed under the MIT license.'

The FreeCite page has links to a dataset (the CORA dataset) that was used to train FreeCite.

Results

Generally FreeCite parses very well, but never perfectly.

ENDERLEIN, G. 1906c. Zehn neue aussereuropäische Copeognathen. Stettiner Entomologische Zeitung 67: 306-316, 1 fig.

  • authors: G ENDERLEIN
  • title: Zehn neue aussereuropäische Copeognathen. Stettiner Entomologische Zeitung 67
  • volume: 1
  • pages: 306-316
  • year: 1906

has conflated the title and journal name (possibly due to both being in German?) and used the fig[ure] count as the volume.


P F Mattingly, A Stone, K L Knight (1962) Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so named. Z. N. (S.) 1216. Bulletin of Zoological Nomenclature 19: 208 - 219

  • authors: P F Mattingly A Stone K L Knight named Z N 1216
  • title: Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so
  • journal: Bulletin of Zoological Nomenclature
  • volume: 19
  • pages: 208--209
  • year: 1962

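Has mangled the title: it is cut short, and its tail ('named Z N 1216') has ended up in the authors field. As recorded above, the end page also reads 209 rather than 219. Journal, volume and year are correct.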
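The eyeball comparisons above could be made repeatable with a small field-by-field check. The sketch below uses the two Enderlein outputs recorded on this page. The `expected` record is a hand-labelled reading of the citation string and is an assumption of this sketch, as are the function names and the choice of exact string equality as the test.

```python
# The tools name some fields differently (author/authors, date/year),
# so map them onto one vocabulary before comparing.
ALIASES = {"authors": "author", "date": "year"}


def canonical(record):
    """Return a copy of `record` with aliased keys renamed."""
    return {ALIASES.get(key, key): value for key, value in record.items()}


def field_report(expected, parsed):
    """Map each expected field to True if the parser matched it exactly."""
    parsed = canonical(parsed)
    return {
        field: parsed.get(field) == value
        for field, value in canonical(expected).items()
    }


# Hand-labelled reading of the citation (an assumption of this sketch).
expected = {
    "author": "G ENDERLEIN",
    "year": "1906",
    "title": "Zehn neue aussereuropäische Copeognathen.",
    "journal": "Stettiner Entomologische Zeitung",
    "volume": "67",
    "pages": "306-316",
}

# Parser outputs as recorded on this page.
parscit = {
    "author": "G ENDERLEIN",
    "volume": "67",
    "date": "1906",
    "title": "Zehn neue aussereuropäische Copeognathen.",
    "journal": "Stettiner Entomologische Zeitung",
    "pages": "306-316",
}
freecite = {
    "authors": "G ENDERLEIN",
    "title": "Zehn neue aussereuropäische Copeognathen. "
             "Stettiner Entomologische Zeitung 67",
    "volume": "1",
    "pages": "306-316",
    "year": "1906",
}

for name, output in [("ParsCit", parscit), ("FreeCite", freecite)]:
    report = field_report(expected, output)
    print(name, sum(report.values()), "of", len(report), "fields match")
# ParsCit 6 of 6 fields match
# FreeCite 3 of 6 fields match
```

Exact matching is deliberately strict. A looser comparison (case-folding, ignoring trailing punctuation) would be needed before scoring a larger sample fairly.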

ParaCite

Available at http://paracite.eprints.org/

Taken from http://paracite.eprints.org/about.html

'ParaCite is an experimental service, being designed at the University of Southampton, for the location of articles from raw references. When a reference is passed to the service, it is split into its component parts (e.g. author, title, year), and transferred to the search resource. Based on the subject area, and the data provided, a set of resources is presented that the system believes have the highest probability of providing the full text article at no charge.'

Its training data is available at http://paracite.eprints.org/cgi-bin/reflist.cgi.

Citation metadata extraction tool from the California Digital Library

This uses Hidden Markov Models. The code can be downloaded from http://gales.cdlib.org/~egh/hmm-citation-extractor/ and a presentation describing the tool is available at http://gales.cdlib.org/~egh/hmm-citation-extractor/jcdl2008-slides.pdf

Google Code

Gupta, D., Morris, B., Catapano, T. and Sautter, G., (2009), 'A new approach towards bibliographic reference identification, parsing and inline citation matching', Contemporary Computing: Communications in Computer and Information Science, 40(2), pp.93--102, DOI: 10.1007/978-3-642-03547-0_10, downloaded from http://193.27.218.161:8080/dspace/bitstream/10199/19094/1/GuptaEtAl.pdf, last accessed November 2011.

CrossRef's DOI retriever

CrossRef has a form for retrieving DOIs for bibliographic references. However, the simple text query form has usage limits to prevent bulk use. This will be a problem for us, but CrossRef state that other options are possible. The form is at http://www.crossref.org/SimpleTextQuery/

Paperbase

An old product from Wight Scientific, who no longer support it. It was ported to Apple II BASIC by Dave Roberts, who states: "…it was remarkably successful [at] getting probable hits, some false positives, but it missed very few things. It was intended to run on manuscripts prepared in a word-processor and link the in-text citation with its database. Then EndNote arrived."

We have Dave's source code, which could form the basis for an up-to-date tool.

Reformatting Tools

A catalogue of slightly different tools. These reformat references rather than parse them, so they have a role to play downstream in the workflow.

bibliograph.parsing

This is a suite of parsers: "Each parser accepts input from a given bibliographic reference format and outputs a list of python dictionaries, one for each entry listed in the input source." Downloadable from http://pypi.python.org/pypi/bibliograph.parsing/1.0.0
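To illustrate the "list of dictionaries, one per entry" shape, here is a minimal stand-alone sketch for RIS-style input. It does not use bibliograph.parsing and makes no claim about that package's API or key names. The function name is hypothetical, the raw RIS tags are kept as dictionary keys, and only the repeatable `AU` tag is gathered into a list.

```python
import re

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
_RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Parse RIS-style text into a list of dicts, one per entry."""
    entries = []
    current = {}
    for line in text.splitlines():
        match = _RIS_LINE.match(line.rstrip())
        if not match:
            continue  # skip blank or malformed lines
        tag, value = match.groups()
        if tag == "ER":  # end of record
            entries.append(current)
            current = {}
        elif tag == "AU":  # authors may repeat, so collect them in a list
            current.setdefault("authors", []).append(value)
        else:
            current[tag] = value
    return entries


sample = "\n".join([
    "TY  - JOUR",
    "AU  - Enderlein, G.",
    "PY  - 1906",
    "TI  - Zehn neue aussereuropäische Copeognathen",
    "JO  - Stettiner Entomologische Zeitung",
    "VL  - 67",
    "SP  - 306",
    "EP  - 316",
    "ER  - ",
])
entries = parse_ris(sample)
print(entries)
```

Input in this form is already fielded, which is why these tools sit downstream of the citation parsers reviewed above.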

bibutils

"The bibutils program set interconverts between various bibliography formats using a common MODS-format XML intermediate." Downloadable from http://sourceforge.net/p/bibutils/home/Bibutils/.

Conclusion

like it says…
