Download

About Revised corpus

Revised JNLPBA is a revised version of the JNLPBA corpus (Kim et al., 2004). Its statistics are shown in the table below. Because we removed one empty abstract and some duplicate sentences from JNLPBA, the numbers of abstracts, sentences, and tokens in Revised JNLPBA differ slightly from those of the original JNLPBA corpus. Revised JNLPBA has five biomedical named entity (BNE) types: cell line, cell type, DNA, protein, and RNA. We had domain experts recheck and revise the JNLPBA corpus to make the annotations more consistent and more useful for both entity linking and relation extraction tasks.

                  Training set                  Test set
Type              JNLPBA     Revised JNLPBA     JNLPBA     Revised JNLPBA
abstracts         2,000      1,999              404        402
sentences         18,546     18,546             3,856      3,842
unique tokens     20,009     20,009             8,785      8,758
tokens            492,551    492,551            101,039    100,693

                  Training set                  Test set
BNE type          JNLPBA     Revised JNLPBA     JNLPBA     Revised JNLPBA
cell lines        3,830      2,779              500        404
cell types        6,718      8,312              1,921      2,070
DNAs              9,534      6,648              1,056      808
proteins          30,269     25,379             5,067      5,256
RNAs              951        970                118        161

Download

Revised JNLPBA uses the same format as JNLPBA. You can download Revised JNLPBA here: Download.
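Since the format matches JNLPBA, a reader for the original corpus should also work on the revised files. The following is a minimal Python sketch, not an official loader. It assumes the commonly distributed JNLPBA layout: one token per line followed by a tab and an IOB2 tag, blank lines between sentences, and `###MEDLINE:` lines marking the start of each abstract. The function names, the sample lines, and the tag spellings (`DNA`, `cell_type`) are illustrative assumptions; check them against the downloaded files.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, IOB2 tag) pairs


def read_iob2(lines: Iterable[str]) -> List[Sentence]:
    """Group token/tag lines into sentences (assumed JNLPBA-style layout)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("###MEDLINE:"):
            # Abstract marker (assumed); it carries no token.
            continue
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def count_entities(sentences: List[Sentence]) -> Counter:
    """Count entity mentions per type; in IOB2 every mention starts with B-."""
    counts: Counter = Counter()
    for sentence in sentences:
        for _, tag in sentence:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


# Hypothetical sample lines, not taken from the corpus.
sample = [
    "###MEDLINE:00000000\n",
    "\n",
    "IL-2\tB-DNA\n",
    "gene\tI-DNA\n",
    "expression\tO\n",
    "in\tO\n",
    "T\tB-cell_type\n",
    "cells\tI-cell_type\n",
    "\n",
]
sentences = read_iob2(sample)
entity_counts = count_entities(sentences)
```

To run the same functions on a downloaded file, pass an open file handle instead of `sample`, for example `read_iob2(open(path, encoding="utf-8"))`, where `path` is wherever you saved the corpus. The per-type counts can then be compared with the table above.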

References