M727report

M7.27 – Publish ViBRANT NLP corpus

Due
31 October 2013
Delivered
24 October 2013†
Purpose
To develop, refine and assess our data mining work, and the similar work of others aiming to mine biodiversity texts, a substantial gold standard corpus is required.
Benefit
No such corpus currently exists. This milestone addresses the community need for this building block to enable the development and evaluation of text mining tools for legacy biodiversity literature.

† The date recorded here is when the milestone report was begun. The corpus has been available for some time.

Access

Our corpus is available from ViBRANT's git repository at https://git.scratchpads.eu/v/vibrantcorpus.git.

Screenshot of corpus log in ViBRANT's git repository.

It can be downloaded following the instructions at https://git.scratchpads.eu/v/.

Because it is hosted in git, the corpus can accept updates and additions from outside the ViBRANT project in a controlled manner. This supports ViBRANT's sustainability plan of keeping its resources available after the project is complete.

Licence

As with all content produced by the ViBRANT project, the corpus is released under the Creative Commons CC0 licence.

Further reading

For more information on the corpus, please consult the ReadMe file in the git repository. The file is available in two versions.

The creation and use of the ViBRANT corpus will form the basis for two journal articles to be published after the completion of the project.

Sample content

These are the first few lines of text from c136.txt in aves_v1 (downloadable from https://git.scratchpads.eu/v/vibrantcorpus.git/blob/HEAD:/aves_v1/c136.txt). The text is clean and re-keyed.

136 MNIOTILTIDÆ.
12. Dendrœca decora. (Tab. X. fig. 1.)
Dendrœca graciæ, var. decora, Ridgw. Am. Nat. vii. p. 6081; Baird, Brew. & Ridgw. N. Am. B. i. p. 2402; Coues, B. Col. Vall. i. p. 2923.
Dendrœca graciæ, Salv. Ibis, 1873, p. 4284; Lawr. Bull. U.S. Nat. Mus. no. 4, p. 1655.
Dendrœca decora, Salv. Cat. Strickl. Coll. p. 926.
Supra cinerea, pilei antici plumis in medio nigris; alis et cauda fusco-nigris cinereo limbatis, illis vix pallide cinereo bifasciatis, hujus rectricibus tribus utrinque externis plaga alba gradatim latius notatis; superciliis a naribus, ciliis ipsis, macula suboculari et gutture toto læte flavis; corpore reliquo lactescenti-albo, hypochondriis cinerascentibus vix nigro striatis; rostro nigricante, pedibus corylinis. Long. tota 4, alœ 2·2, caudæ 1·8, rostri a rictu 0·55, tarsi 0·6. (Descr. exempl. ex Guatemala. Mus. Acad. Cantabr.)
Hab. MEXICO, near Zapotitlan (Sumichrast5); BRITISH HONDURAS, Belize (C. Wood1 3), GUATEMALA (Constancia6, Mus. Soc. Econ.4).
Dendrœca decora is a near ally of D. graciæ, a species of New Mexico and Arizona discovered some years ago by Dr. Coues. The differences observable between the two birds are slight, and have been treated by American ornithologists as indicating that their possessors are varieties only one of another and not distinct species. This may prove to be the case; but at present no intermediate links have been discovered blending the two races, nor do we think it very probable that such now exist; and for this reason we prefer to treat D. decora as distinct.

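Each text is accompanied by an annotation file that identifies taxonomic names and their ranks. Each line gives an identifier, the rank, the start and end character offsets of the name in the text file, and the name itself. The matching file for the text above is c136.ann. Here are the relevant annotations for the sample text.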

T1 family 4 15 MNIOTILTIDÆ
T2 genus 21 29 Dendrœca
T3 specificepithet 30 36 decora
T4 genus 56 64 Dendrœca
T5 specificepithet 65 71 graciæ
T6 infraspecificrank 73 77 var.
T7 infraspecificepithet 78 84 decora
T8 genus 193 201 Dendrœca
T9 specificepithet 202 208 graciæ
T10 genus 280 288 Dendrœca
T11 specificepithet 289 295 decora
T12 genus 995 1003 Dendrœca
T13 specificepithet 1004 1010 decora
T14 genus-abbrev 1029 1031 D.
T15 specificepithet 1032 1038 graciæ
T16 genus-abbrev 1528 1530 D.
T17 specificepithet 1531 1537 decora
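
A minimal sketch, in Python, of how such annotation lines might be read and checked against the text. It assumes the fields are separated by whitespace as displayed on this page (the files in the repository may use tabs), that offsets count characters from the start of the file with the end offset exclusive, and that every annotation is a single contiguous span. The names Annotation, parse_ann_line and check_offsets are hypothetical and not part of the corpus.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    ident: str  # annotation identifier, e.g. "T1"
    rank: str   # taxonomic rank label, e.g. "family" or "genus"
    start: int  # offset of the first character of the name
    end: int    # offset one past the last character (assumed exclusive)
    text: str   # the name as it appears in the text file


def parse_ann_line(line: str) -> Annotation:
    """Parse one line such as 'T1 family 4 15 MNIOTILTIDÆ'."""
    ident, rank, start, end, text = line.strip().split(None, 4)
    return Annotation(ident, rank, int(start), int(end), text)


def check_offsets(text: str, annotations: List[Annotation]) -> List[Annotation]:
    """Return the annotations whose offsets do not select their recorded name."""
    return [a for a in annotations if text[a.start:a.end] != a.text]


# First two lines of the clean sample, with one newline character per line break.
clean_text = (
    "136 MNIOTILTIDÆ.\n"
    "12. Dendrœca decora. (Tab. X. fig. 1.)\n"
)
ann_lines = [
    "T1 family 4 15 MNIOTILTIDÆ",
    "T2 genus 21 29 Dendrœca",
    "T3 specificepithet 30 36 decora",
]

annotations = [parse_ann_line(line) for line in ann_lines]
mismatches = check_offsets(clean_text, annotations)
print(mismatches)  # an empty list means every offset matched its name
```

A check of this kind is a cheap way to confirm that a text file and its annotation file have not drifted apart, for example through a change of line endings or character encoding.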

The ViBRANT corpus is more than just another set of gold standard marked-up texts: for each of the re-keyed clean texts, it also includes the OCR text that is available for download from the BHL.

This is the OCR text for the sample text above, taken from d136.txt (d for dirty as opposed to c for clean).

136 MNIOTILimE.

12. Dendroeca decora. (Tab. X. fig. 1.)

Dendroeca grades, var. decora, Ridgw. Am. Nat. vii. p. 608
1
; Baird, Brew. & Ridgw. N. Am. B. i.

p. 240
2
; Cones, B. Col. Vail. i. p. 292
3
.

Dendroeca gratia, Salv. Ibis, 1873, p. 428
4
; Lawr. Bull. U.S. Nat. Mus. no. 4, p. 16
5

.

Dendroeca decora, Salv. Cat. Strickl. Coll. p. 92
6
.

Supra cinerea, pilei antici plumis in medio nigris ; alis et cauda fusco-nigris cinereo limbatis, illis vix pallide

cinereo bifasciatis, bujus rectricibus tribus utrinque externis plaga alba gradatim latius notatis ; supereiliis a

naribus, ciliis ipsis, macula suboculari et gutture toto laete flavis ; corpore reliquo lactescenti-albo, bypo-

cbondriis cinerascentibus vix nigro striatis ; rostro nigricante, pedibus corylinis. Long, tota 4, alae 2-2,

caudse 1*8, rostri a rictu 0-55, tarsi 0*6. (Descr. exempl. ex Guatemala. Mus. Acad. Cautabr.)

Hob. Mexico, near Zapotitlan (Sumichrast 5
) ; Bkitish Hondueas, Belize (C. Wood 13
),
Guatemala (Constancia
6
, Mus. Soc. Econ.*).

Dendroeca decora is a near ally of D. gracice, a species of New Mexico and Arizona

discovered some years ago by Dr. Coues. The differences observable between the two

birds are slight, and have been treated by American ornithologists as indicating that

their possessors are varieties only one of another and not distinct species. This may
prove to be the case ; but at present no intermediate links have been discovered blending

the two races, nor do we think it very probable that such now exist ; and for this reason
we prefer to treat D. decora as distinct.

These are the relevant annotations taken from annotation file d136.ann.

T1 family 8 18 MNIOTILimE
T2 genus 26 35 Dendroeca
T3 specificepithet 36 42 decora
T4 genus 64 73 Dendroeca
T5 specificepithet 74 80 grades
T6 infraspecificrank 82 86 var.
T7 infraspecificepithet 87 93 decora
T8 genus 218 227 Dendroeca
T9 specificepithet 228 234 gratia
T10 genus 316 325 Dendroeca
T11 specificepithet 326 332 decora
T12 genus 1067 1076 Dendroeca
T13 specificepithet 1077 1083 decora
T14 genus-abbrev 1102 1104 D.
T15 specificepithet 1105 1112 gracice
T16 genus-abbrev 1614 1616 D.
T17 specificepithet 1617 1623 decora

These files enable the development and evaluation of taxonomic name processing tools for both clean and dirty texts. This is especially important for ViBRANT because it opens up the legacy literature and allows us to consider, in future projects, processing OCR-digitised texts with the same toolkit as born-digital literature.
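
As an illustration of the kind of evaluation these files make possible, the following Python sketch scores a tool's output against gold standard annotations. It assumes strict matching, where a predicted name counts as correct only if its start offset, end offset and rank all agree with a gold annotation; other matching schemes are possible. The function name score and the predicted spans are hypothetical, invented purely for the example, and do not come from any real tool.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, rank)


def score(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict span-and-rank matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold spans for the first three annotations of the clean sample.
gold = [(4, 15, "family"), (21, 29, "genus"), (30, 36, "specificepithet")]

# Hypothetical tool output: two correct names, one wrong rank, one spurious name.
predicted = [
    (4, 15, "family"),
    (21, 29, "genus"),
    (30, 36, "genus"),
    (39, 42, "genus"),
]

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same scoring over the clean files and the dirty files would give a direct measure of how much a tool's performance degrades on OCR text.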

We can see the value of this corpus from an informal review of the sample. Consider the first annotated term, MNIOTILTIDÆ, which is rendered as MNIOTILimE in the OCR, and the second term, Dendrœca, which is rendered as Dendroeca. The problems are quickly apparent. In contrast, the third term, decora, is rendered accurately, and so on.
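
The informal comparison above can be made mechanical. The Python sketch below pairs the clean and dirty annotations shown on this page by their identifiers and lists the names that the OCR altered. Pairing by identifier is an assumption that holds for this sample, where T1 to T17 correspond in both files; it may not hold for every file in the corpus. The helper name load is hypothetical.

```python
from difflib import SequenceMatcher

CLEAN_ANN = """\
T1 family 4 15 MNIOTILTIDÆ
T2 genus 21 29 Dendrœca
T3 specificepithet 30 36 decora
T4 genus 56 64 Dendrœca
T5 specificepithet 65 71 graciæ
T6 infraspecificrank 73 77 var.
T7 infraspecificepithet 78 84 decora
T8 genus 193 201 Dendrœca
T9 specificepithet 202 208 graciæ
T10 genus 280 288 Dendrœca
T11 specificepithet 289 295 decora
T12 genus 995 1003 Dendrœca
T13 specificepithet 1004 1010 decora
T14 genus-abbrev 1029 1031 D.
T15 specificepithet 1032 1038 graciæ
T16 genus-abbrev 1528 1530 D.
T17 specificepithet 1531 1537 decora
"""

DIRTY_ANN = """\
T1 family 8 18 MNIOTILimE
T2 genus 26 35 Dendroeca
T3 specificepithet 36 42 decora
T4 genus 64 73 Dendroeca
T5 specificepithet 74 80 grades
T6 infraspecificrank 82 86 var.
T7 infraspecificepithet 87 93 decora
T8 genus 218 227 Dendroeca
T9 specificepithet 228 234 gratia
T10 genus 316 325 Dendroeca
T11 specificepithet 326 332 decora
T12 genus 1067 1076 Dendroeca
T13 specificepithet 1077 1083 decora
T14 genus-abbrev 1102 1104 D.
T15 specificepithet 1105 1112 gracice
T16 genus-abbrev 1614 1616 D.
T17 specificepithet 1617 1623 decora
"""


def load(block):
    """Map each annotation identifier to its (rank, name) pair."""
    result = {}
    for line in block.strip().splitlines():
        ident, rank, _start, _end, name = line.strip().split(None, 4)
        result[ident] = (rank, name)
    return result


clean, dirty = load(CLEAN_ANN), load(DIRTY_ANN)

# Names whose OCR rendering differs from the re-keyed text.
damaged = [
    (ident, clean[ident][1], dirty[ident][1])
    for ident in clean
    if ident in dirty and clean[ident][1] != dirty[ident][1]
]

for ident, clean_name, dirty_name in damaged:
    # Similarity ratio between 0 and 1; 1 would mean identical strings.
    ratio = SequenceMatcher(None, clean_name, dirty_name).ratio()
    print(f"{ident}: {clean_name} -> {dirty_name} (similarity {ratio:.2f})")

print(f"{len(damaged)} of {len(clean)} names altered by OCR")  # 9 of 17
```

In this sample the family name, every full genus name and every occurrence of graciæ are altered, while decora, var. and the abbreviated genus D. survive intact.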

The biodiversity informatics and natural language processing communities now have a reliable data source to work on.
