It is a revised version of the JNLPBA corpus (Kim et al., 2004). The statistics are shown in the table below. Because we removed one empty abstract and some duplicate sentences from JNLPBA, the numbers of abstracts, sentences, and tokens in Revised JNLPBA differ slightly from those of the original JNLPBA corpus. Revised JNLPBA has five biomedical named entity (BNE) types: cell line, cell type, DNA, protein, and RNA. We had domain experts recheck and revise the JNLPBA corpus, making the annotations more consistent and more useful for both entity linking and relation extraction tasks.
Type | JNLPBA (training) | Revised JNLPBA (training) | JNLPBA (test) | Revised JNLPBA (test) |
---|---|---|---|---|
abstracts | 2,000 | 1,999 | 404 | 402 |
sentences | 18,546 | 18,546 | 3,856 | 3,842 |
unique tokens | 20,009 | 20,009 | 8,785 | 8,758 |
tokens | 492,551 | 492,551 | 101,039 | 100,693 |
BNE type | JNLPBA (training) | Revised JNLPBA (training) | JNLPBA (test) | Revised JNLPBA (test) |
cell lines | 3,830 | 2,779 | 500 | 404 |
cell types | 6,718 | 8,312 | 1,921 | 2,070 |
DNAs | 9,534 | 6,648 | 1,056 | 808 |
proteins | 30,269 | 25,379 | 5,067 | 5,256 |
RNAs | 951 | 970 | 118 | 161 |
Revised JNLPBA uses the same format as JNLPBA. You can download Revised JNLPBA here: Download.
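As a rough illustration of working with this format, the sketch below counts entity mentions per BNE type. It assumes the usual JNLPBA shared-task layout, which is not spelled out on this page: one token per line followed by an IOB2 tag (for example `B-protein`, `I-protein`, `O`), blank lines between sentences, and `###MEDLINE:` lines marking the start of an abstract. The function name `count_entities` and the sample lines are hypothetical; check the downloaded files before relying on these assumptions.

```python
from collections import Counter


def count_entities(lines):
    """Count entity mentions per type in IOB2-tagged lines (assumed format)."""
    counts = Counter()
    prev_tag = "O"
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines separate sentences and '###MEDLINE:' lines
        # mark a new abstract; neither carries a token.
        if not line or line.startswith("###MEDLINE:"):
            prev_tag = "O"
            continue
        # Assumption: the tag is the last whitespace-separated field.
        tag = line.split()[-1]
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
        elif tag.startswith("I-") and prev_tag[2:] != tag[2:]:
            # An I- tag that does not continue a mention of the same type
            # is treated as the start of a new mention.
            counts[tag[2:]] += 1
        prev_tag = tag
    return counts


# Hypothetical sample data, made up for illustration only.
sample = """###MEDLINE:0000001

IL-2\tB-DNA
gene\tI-DNA
expression\tO
in\tO
T\tB-cell_type
cells\tI-cell_type
.\tO

NF-kappa\tB-protein
B\tI-protein
binds\tO
.\tO
""".splitlines()

entity_counts = count_entities(sample)
for entity_type, n in sorted(entity_counts.items()):
    print(entity_type, n)
```

Running the same function over a corpus file (for example via `open(path, encoding="utf-8")`) should give per-type totals comparable to the BNE rows of the table, provided the file follows the assumed layout.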