M719Report

M7.19 Review of pilot of reference de-duplication software

Due: 31 July 2013
Delivered: 30 July 2013
Purpose: To validate our solution to the currently unresolved problem of de-duplication of bibliographic references.
Benefit: To present users of RefBank with a smaller list of search results by removing duplicate and near duplicate entries.

Summary

The reference de-duplication software developed as part of the ViBRANT project is called RefConcile. It was tested on an archive of RefBank records. The archive has about 150,000 bibliographic references, of which about 20% are duplicates. RefConcile detects these duplicates with an F-measure of 95.6% at a precision of 99.7%. It takes less than one second to process a newly added reference.

Background

Working with community-contributed references to RefBank means that our repository will have a large number of duplicate references arising from:

  • free-text entry: because users load textual, as opposed to marked-up, references, a reference can be entered in any style (Harvard, Chicago, etc.) as laid out in the published source.
  • near-identical references varying only by a comma or space, caused by individual stylistic quirks of the contributors.
  • near-identical references varying only by typographical errors, whether in the original source or introduced later through re-keying the reference.

We consider it important to RefBank's success that there are as few barriers as possible to user contributions: users should simply upload references as they are, without having to reformat them specially to suit RefBank. This design decision leads to the problem of duplicate references; however, we consider it preferable to resolve the duplicates within RefBank after loading rather than to prevent the loading of these references at all, which would hinder the workflow of our potential contributors.

Approach

The problem of de-duplication is still unresolved within bibliographic reference management. We needed to develop a tool to automatically identify canonical forms of a reference from the many references loaded into RefBank.

Our approach is based on graph theory, with each reference forming a node in a graph and the emergent centroid being considered the canonical form. Various algorithms are used to calculate the centroid, decomposing the reference so that the most appropriate algorithm can be used. This canonical form of a reference will be returned in future searches. The other references will not be deleted but simply marked as unavailable to general searches. Manual curation is enabled so that a user can override RefBank's canonical form if necessary.
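As an illustration only, one simple reading of "centroid" is the medoid: the node whose total similarity to all other nodes in the group is highest. The sketch below assumes plain string similarity from Python's standard `difflib`; the function names `similarity` and `medoid` are hypothetical, and this is not necessarily how RefConcile computes its centroid.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings in [0, 1], as defined by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def medoid(references):
    """Return the reference with the highest total similarity to the others.

    Assumes a non-empty list of reference strings.
    """
    def total_similarity(i):
        return sum(
            similarity(references[i], other)
            for j, other in enumerate(references)
            if j != i
        )

    best = max(range(len(references)), key=total_similarity)
    return references[best]


# Toy strings standing in for reference variants: "abcd" differs from each of
# the other two by one character, while those two differ from each other by two.
variants = ["abcd", "abcx", "abyd"]
central = medoid(variants)
print(central)
```

A whole-string medoid can only return a reference that already exists in the group, whereas working on decomposed references allows each part to be treated separately.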

The first stage in the RefConcile algorithm is to group the references. This reduces the number of pairwise reference comparisons as far as possible, which speeds up processing. Grouping is not without its issues, however, so to generate meaningful groups RefConcile uses fuzzy grouping keys that cover a range of values, rather than single values composed from all the attributes of a bibliographic reference. RefConcile also employs standard natural language processing techniques, such as stop word deletion, to improve processing speed without undermining the accuracy of the results.
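The idea of fuzzy grouping keys can be sketched as follows. This is a minimal illustration under assumptions that are not in the report: the key is built from a four-character prefix of each significant title word plus a year within one year either side, and the stop word list is a small hypothetical one. RefConcile's actual key construction may differ.

```python
import re

# Hypothetical stop word list; a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "in", "on", "for", "to"})


def title_tokens(title):
    """Lower-case the title, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOP_WORDS]


def fuzzy_keys(title, year, year_slack=1):
    """Return grouping keys covering a range of years rather than one value.

    Each key pairs a 4-character prefix of a significant title word with one
    year from [year - year_slack, year + year_slack].
    """
    prefixes = {w[:4] for w in title_tokens(title)}
    years = range(year - year_slack, year + year_slack + 1)
    return {f"{p}|{y}" for p in prefixes for y in years}


a = fuzzy_keys("Bibliographical duplicates", 2012)
b = fuzzy_keys("The bibliographic duplicates", 2013)
c = fuzzy_keys("Graph theory", 2012)

# Two references fall into the same group if they share at least one key.
print(sorted(a & b))
print(sorted(a & c))
```

Because each reference produces several keys, two variants that disagree on the year by one, or on a word ending, can still land in the same group, while unrelated references are never compared.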

RefConcile's second stage is to analyse each group. If there are three or more duplicates in the group, this works by selecting the attribute values that are most frequent across the duplicate group, individually for each attribute. This yields a reconciled value for each attribute. The most frequent value need not be the correct value, of course, hence the ability to manually curate the final result. For groups containing only two duplicates, majority voting is not applicable. Instead, RefConcile selects the more recently added reference, on the rationale that the more recent duplicate may be a correction of the older one. Where necessary, RefConcile can construct new references, as in this test case:

Thor, AU, Cond, SE (2012) Bibliographical duplicates. Journal of TPDL 9: 8-16

Thor, AU, Corid, SE. Bibliographic duplicates. Journal of TBDL 8: 8-15, 2013

Thop, AU, Cond, SE. Bibliographic duplicates. Journal of TPDL 8 (2012): 9-15

Attribute level majority voting yields a new reference:

Thor, AU, Cond, SE. Bibliographic duplicates. Journal of TPDL 8 (2012): 8-15
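The test case above can be reproduced with a short sketch of attribute-level majority voting. It assumes the references have already been parsed into attributes, and that each author and each page number is voted on individually (no whole author string or page range occurs twice in the three inputs). The attribute names and the functions `reconcile` and `format_reference` are hypothetical.

```python
from collections import Counter

FIELDS = ("author1", "author2", "title", "journal",
          "volume", "year", "first_page", "last_page")


def reconcile(group):
    """Reconcile a non-empty duplicate group, ordered oldest first.

    Two duplicates: return the more recently added one.
    Otherwise: take the most frequent value of each attribute.
    """
    if len(group) == 2:
        return dict(group[-1])
    return {
        field: Counter(ref[field] for ref in group).most_common(1)[0][0]
        for field in FIELDS
    }


def format_reference(ref):
    """Render the attributes in the style of the reconciled test case."""
    return (f"{ref['author1']}, {ref['author2']}. {ref['title']}. "
            f"{ref['journal']} {ref['volume']} ({ref['year']}): "
            f"{ref['first_page']}-{ref['last_page']}")


def make(*values):
    """Build a reference dict from values given in FIELDS order."""
    return dict(zip(FIELDS, values))


group = [
    make("Thor, AU", "Cond, SE", "Bibliographical duplicates",
         "Journal of TPDL", "9", "2012", "8", "16"),
    make("Thor, AU", "Corid, SE", "Bibliographic duplicates",
         "Journal of TBDL", "8", "2013", "8", "15"),
    make("Thop, AU", "Cond, SE", "Bibliographic duplicates",
         "Journal of TPDL", "8", "2012", "9", "15"),
]

merged = reconcile(group)
print(format_reference(merged))
# Thor, AU, Cond, SE. Bibliographic duplicates. Journal of TPDL 8 (2012): 8-15
```

Note that the reconciled reference matches none of the three inputs exactly: every input contains at least one minority value, which is why a new reference has to be constructed.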

Use

RefConcile is automatic: the user does not need to invoke it explicitly to de-duplicate a RefBank result set.

Process

To simulate a continuously growing bibliographic reference data set, we started our evaluations with an empty data set and then added references one by one. Each addition of a reference prompted RefConcile to search for duplicates and to reconcile any that were found.
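The incremental procedure can be pictured with a small in-memory sketch. It assumes each reference arrives with a set of grouping keys, and that candidates are the previously added references sharing at least one key. The class `IncrementalIndex` and the key strings are hypothetical; the evaluation itself ran on top of PostgreSQL rather than an in-memory dictionary.

```python
from collections import defaultdict


class IncrementalIndex:
    """Maps grouping keys to the ids of references added so far."""

    def __init__(self):
        self._by_key = defaultdict(set)

    def add(self, ref_id, keys):
        """Return candidate duplicates of a new reference, then register it."""
        candidates = set()
        for key in keys:
            candidates |= self._by_key[key]
        for key in keys:
            self._by_key[key].add(ref_id)
        return candidates


# Start empty and add references one by one, as in the evaluation.
index = IncrementalIndex()
first = index.add(1, {"bibl|2012", "dupl|2012"})
second = index.add(2, {"bibl|2012", "bibl|2013"})
third = index.add(3, {"grap|2012"})
print(first, second, third)
```

Only the candidates returned by `add` would go on to the more expensive pairwise assessment, so the cost of each addition depends on the size of its groups rather than on the size of the whole data set.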

The experiments were conducted on a 4 x 2.0 GHz 64-bit machine with 8 GB of main memory running Ubuntu Linux, PostgreSQL 9.1, and Sun/Oracle's JVM 1.6.

Results

Table 1: RefConcile evaluation results - overall performance
Average number of candidates: 9.14
Average time for candidate retrieval: 5.9 µs / reference, 47.8 ms / candidate found
Average time for candidate assessment: 6.4 ms / candidate, 216.8 ms / duplicate found


Even with around 150,000 references in the database, the fuzzy blocking returns only 9.14 candidate duplicates per reference on average. The matching stage therefore has fewer than 10 possible duplicates to deal with on average, which underlines the scalability of RefConcile. Implemented on top of a relational database, the incremental blocking takes only 5.9 microseconds for each pairwise reference comparison. Matching takes an average of 6.4 milliseconds for each pair of references.
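As a rough consistency check, the Table 1 figures can be combined into an approximate cost per newly added reference. The calculation below assumes that the per-candidate averages can simply be multiplied by the average number of candidates, which is only an approximation, since a product of averages is not in general the average of the products.

```python
# Figures taken from Table 1.
AVG_CANDIDATES = 9.14
RETRIEVAL_MS_PER_CANDIDATE = 47.8
ASSESSMENT_MS_PER_CANDIDATE = 6.4

retrieval_ms = AVG_CANDIDATES * RETRIEVAL_MS_PER_CANDIDATE
assessment_ms = AVG_CANDIDATES * ASSESSMENT_MS_PER_CANDIDATE
total_ms = retrieval_ms + assessment_ms

print(f"retrieval ~{retrieval_ms:.0f} ms, "
      f"assessment ~{assessment_ms:.0f} ms, "
      f"total ~{total_ms:.0f} ms")
# retrieval ~437 ms, assessment ~58 ms, total ~495 ms
```

On this estimate a new reference costs roughly half a second, in line with the figure of less than one second per newly added reference given in the Summary.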

Table 2: RefConcile evaluation results - optimum accuracy
Precision: 99.7%
Recall: 91.9%
F-measure: 95.6%

The high precision of 99.7% indicates that RefConcile rarely labels a pair of references as duplicates wrongly. We aimed for high precision so that users of RefBank are not confronted with clearly non-duplicate results in their result set, which could undermine their confidence in the software.

The recall of 91.9% means that RefConcile correctly finds about 9 out of 10 duplicate relations. Recall can be increased, but only at the expense of precision, that is, by returning too many false duplicates.

The precision and recall results given above produce an F-measure of 95.6%, which suggests that these values are an acceptable compromise in overall detection accuracy.
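The F-measure follows from the precision and recall in Table 2, assuming the usual balanced form (the harmonic mean of the two, often written F1). The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f_measure(0.997, 0.919)
print(f"{f1:.1%}")
# 95.6%
```

Because the harmonic mean is pulled towards the lower of its two inputs, the 95.6% sits closer to the recall than the simple average of the two values would.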

To assess the quality of automated reference reconciliation, we manually inspected the generated cluster representatives. Only some 5-10% of them contained errors.

Conclusion

The pilot has shown that it is possible to achieve automatic duplicate recognition for more than nine out of ten duplicate records. Of the reconciled references generated for the detected duplicates, no more than one in ten contained errors. This indicates that RefConcile helps RefBank users to review query results.

The software is now being incorporated into the latest revision of RefBank.
