Data releases

Data release 1.2

In this data release, we provide the first annotations of referentially grounded texts, so-called reference texts, with referential relations to structured data and FrameNet-style semantic roles. The annotations target specific types of events, e.g. “mass shooting”, “disease outbreak”, “auto races”, and documents that refer to the same event instance of a given type. The data consists of:

  • structured data derived from Wikidata in JSON, following the SEM format
  • annotated reference texts in English and Dutch in XML following the NAF format
  • Dutch FrameNet lexicon in JSON format
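
As an illustration of how the NAF files listed above might be read, here is a minimal Python sketch that extracts frame annotations from a NAF document. The sample document is hand-written for this example and is not taken from the release. The element and attribute names (`term`, `predicate`, `externalRef`, `reference`) follow the general NAF layout, but whether the release files use exactly these layers and values is an assumption. `frames_in_naf` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, NAF-like sample (illustrative only; not from the release).
SAMPLE_NAF = """<NAF xml:lang="en" version="v3.1">
  <text>
    <wf id="w1" sent="1">Shots</wf>
    <wf id="w2" sent="1">were</wf>
    <wf id="w3" sent="1">fired</wf>
  </text>
  <terms>
    <term id="t3" lemma="fire" pos="VERB">
      <span><target id="w3"/></span>
    </term>
  </terms>
  <srl>
    <predicate id="pr1">
      <externalReferences>
        <externalRef resource="FrameNet" reference="Shoot_projectiles"/>
      </externalReferences>
      <span><target id="t3"/></span>
    </predicate>
  </srl>
</NAF>"""


def frames_in_naf(naf_xml: str) -> list:
    """Return (lemma, frame) pairs for every predicate in a NAF string.

    Assumes predicates point to terms via <span><target id=.../></span>
    and carry their frame label in <externalRef reference=...>.
    """
    root = ET.fromstring(naf_xml)

    # Look-up table from term id to lemma.
    lemma_by_term = {t.get("id"): t.get("lemma") for t in root.iter("term")}

    pairs = []
    for pred in root.iter("predicate"):
        # Use only the predicate's own span, not the spans of its roles.
        span = pred.find("span")
        target_ids = [t.get("id") for t in span.findall("target")] if span is not None else []
        lemma = " ".join(lemma_by_term.get(tid, "?") for tid in target_ids)

        # Frame labels attached directly to the predicate.
        for ref in pred.findall("externalReferences/externalRef"):
            pairs.append((lemma, ref.get("reference")))
    return pairs


if __name__ == "__main__":
    for lemma, frame in frames_in_naf(SAMPLE_NAF):
        print(lemma, "->", frame)
```

To apply the same function to a file from the release, read the file contents into a string first. The resource names and frame identifiers in the actual files should be checked before relying on this sketch.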

The annotated output consists of 326 annotated reference texts: 276 Dutch and 50 English. 27533 mentions were annotated with 9220 tokens of 2729 different lexical units, covering 574 different frames (avg. 16.06 annotations per frame). To enable correct frame annotation, 1840 (19.9%) mentions received a markable correction (avg. 5.6 per text): 699 multi-words and 1141 compounds. In addition, 7457 (27%) of these mentions were annotated with instance links.

You can download release 1.2 from v1.2

The data can be used freely under the Creative Commons CC BY 4.0 license. 

Data release 1.1

In this data release, we provide the first annotations of referentially grounded texts, so-called reference texts, with referential relations to structured data and FrameNet-style semantic roles. The annotations target specific types of events, e.g. “mass shooting”, “disease outbreak”, “auto races”, and documents that refer to the same event instance of a given type. The data consists of:

  • structured data derived from Wikidata in JSON, following the SEM format
  • annotated reference texts in English and Dutch in XML following the NAF format
  • Dutch FrameNet lexicon in JSON format

The annotated output consists of 214 annotated reference texts: 172 Dutch and 42 English. 18960 mentions were annotated with 6066 tokens of 1973 different lexical units, covering 486 different frames (avg. 12.5 annotations per frame). To enable correct frame annotation, 1205 (19.8%) mentions received a markable correction (avg. 5.5 per text): 393 multi-words and 812 compounds. In addition, 5068 (26.7%) of these mentions were annotated with instance links.

You can download release 1.1 from v1.1

The data can be used freely under the Creative Commons CC BY 4.0 license. 

Data release 1.0


@InProceedings{vossen-EtAl:2020:LREC,
author = {Vossen, Piek and Ilievski, Filip and Postma, Marten and Fokkens, Antske and Minnema, Gosse and Remijnse, Levi},
title = {Large-scale Cross-lingual Language Resources for Referencing and Framing},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {3162--3171},
url = {https://www.aclweb.org/anthology/2020.lrec-1.387}
}

Have you ever wondered how the same event is described in different languages? Then this dataset might be useful to you.
From Wikidata, we’ve selected 25 event types, e.g., military operation (see Section 6 of the paper for more information).
In total, we collected 19,979 Wikidata items that belong to these 25 event types.
For each Wikidata item, we attempted to retrieve the first paragraph of the Wikipedia page describing the Wikidata item.
We included English, Italian, and Dutch texts, which we processed using various NLP systems.
Also, we represent structured data about each Wikidata item, which facilitates research into the framing of events.

You can download it from here.