Data releases

Data release 2.1

In this data release, we provide extensive annotations of a subset of reference texts from Data Release 1.2, namely the Dutch texts describing the incidents MH17 (Wikidata Q17374096) and the Utrecht shooting (Wikidata Q62090804). The annotations show referential relations to structured data and FrameNet-style semantic roles. The data consists of:
  • structured data derived from Wikidata in JSON following SEM format;
  • annotated reference texts in Dutch, describing two incidents, in XML following the NAF format;
  • Dutch FrameNet lexicon in JSON format;
  • functions that extract relevant framing information from NAF;
  • a notebook in which the framing of participants per incident is visualized over time.
The annotated output consists of 159 annotated Dutch reference texts: 42 describing the Utrecht shooting and 117 describing MH17. In the Utrecht shooting subcorpus, 1459 in-text mentions were linked to 13 different entities, 1830 mentions received frame annotations, and 5807 mentions were assigned frame elements. For the MH17 subcorpus, 3390 in-text mentions were linked to 37 different entities, 3436 mentions received frame annotations, and 10978 mentions were assigned frame elements.
You can download release 2.1 from: data_release_participant_analysis
The data can be used freely under the Creative Commons CC BY 4.0 license.
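
As a rough illustration of what extracting framing information from NAF can look like, the sketch below reads frame and role labels from a NAF document with Python's standard library. It assumes the frames are stored in the NAF `srl` layer as `predicate` elements with `externalRef` children and `role` elements carrying a `semRole` attribute. The sample document, the label values, and the function name `extract_frames` are hypothetical; the functions shipped with the release may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NAF fragment; real files in the release
# contain more layers and attributes.
SAMPLE_NAF = """<NAF xml:lang="nl" version="v3.1">
  <srl>
    <predicate id="pr1">
      <externalReferences>
        <externalRef resource="FrameNet" reference="Killing"/>
      </externalReferences>
      <span><target id="t3"/></span>
      <role id="r1" semRole="Killing@Victim">
        <span><target id="t1"/></span>
      </role>
    </predicate>
  </srl>
</NAF>"""


def extract_frames(root):
    """Return one dict per predicate: its id, frame labels, and roles."""
    results = []
    for predicate in root.iterfind("srl/predicate"):
        # Only the predicate's own external references, not those of its roles.
        frames = [
            ref.get("reference")
            for ref in predicate.iterfind("externalReferences/externalRef")
        ]
        # Each role is reported as an (id, semantic role label) pair.
        roles = [
            (role.get("id"), role.get("semRole"))
            for role in predicate.iterfind("role")
        ]
        results.append(
            {"predicate": predicate.get("id"), "frames": frames, "roles": roles}
        )
    return results


root = ET.fromstring(SAMPLE_NAF)
frames = extract_frames(root)
print(frames)
```

To apply the same idea to a file on disk, `ET.parse(path).getroot()` can replace the `ET.fromstring` call, provided the file follows the layout assumed here.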

Data release 1.2

In this data release, we provide the first annotations of referentially grounded texts, so-called reference texts, with referential relations to structured data and FrameNet-style semantic roles. The annotations target specific types of events and documents that make reference to the same event instances of the same type, e.g. “mass shooting”, “disease outbreak”, “auto races”. The data consists of:

  • structured data derived from Wikidata in JSON following SEM format
  • annotated reference texts in English and Dutch in XML following the NAF format
  • Dutch FrameNet lexicon in JSON format

The annotated output consists of 326 annotated reference texts, 276 Dutch and 50 English. 27533 mentions were annotated with 9220 tokens of 2729 different lexical units, covering 574 different frames (avg. 16.06 annotations per frame). In order to enable correct frame annotation, 1840 (19.9%) mentions received markable correction (avg. 5.6 per text): 699 multi-words and 1141 compounds. Also, 7457 (27%) of these mentions were annotated with instance links.
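
The averages and percentages in the paragraph above can be approximately recomputed from the raw counts. The sketch below is only a sanity check: the choice of denominators (tokens for the markable-correction share, mentions for the instance-link share) is an assumption inferred from the arithmetic, not something the release states, and the variable names are illustrative.

```python
# Counts reported for data release 1.2.
texts = 326
mentions = 27533
tokens = 9220
frames = 574
markable_corrections = 1840
instance_links = 7457

# Assumed denominators; the release does not state them explicitly.
annotations_per_frame = tokens / frames
corrections_per_text = markable_corrections / texts
correction_share = 100 * markable_corrections / tokens
instance_link_share = 100 * instance_links / mentions

print(f"annotations per frame: {annotations_per_frame:.2f}")
print(f"corrections per text:  {corrections_per_text:.1f}")
print(f"markable corrections:  {correction_share:.2f}% of tokens")
print(f"instance links:        {instance_link_share:.1f}% of mentions")
```

Under these assumptions the results land close to the reported figures (16.06, 5.6, 19.9%, and 27%); small differences would come down to rounding.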

You can download release 1.2 from v1.2

The data can be used freely under the Creative Commons CC BY 4.0 license.

Data release 1.1

In this data release, we provide the first annotations of referentially grounded texts, so-called reference texts, with referential relations to structured data and FrameNet-style semantic roles. The annotations target specific types of events and documents that make reference to the same event instances of the same type, e.g. “mass shooting”, “disease outbreak”, “auto races”. The data consists of:

  • structured data derived from Wikidata in JSON following SEM format
  • annotated reference texts in English and Dutch in XML following the NAF format
  • Dutch FrameNet lexicon in JSON format

The annotated output consists of 214 annotated reference texts, 172 Dutch and 42 English. 18960 mentions were annotated with 6066 tokens of 1973 different lexical units, covering 486 different frames (avg. 12.5 annotations per frame). In order to enable correct frame annotation, 1205 (19.8%) mentions received markable correction (avg. 5.5 per text): 393 multi-words and 812 compounds. Also, 5068 (26.7%) of these mentions were annotated with instance links.

You can download release 1.1 from v1.1

The data can be used freely under the Creative Commons CC BY 4.0 license.

Data release 1.0


@InProceedings{vossen-EtAl:2020:LREC,
author = {Vossen, Piek and Ilievski, Filip and Postma, Marten and Fokkens, Antske and Minnema, Gosse and Remijnse, Levi},
title = {Large-scale Cross-lingual Language Resources for Referencing and Framing},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
month = {May},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {3162--3171},
url = {https://www.aclweb.org/anthology/2020.lrec-1.387}
}

Have you ever wondered how the same event is described in different languages? Then this dataset might be useful to you.
From Wikidata, we’ve selected 25 event types, e.g., military operation (see Section 6 of the paper for more information).
In total, we collected 19,979 Wikidata items that belong to these 25 event types.
For each Wikidata item, we attempted to retrieve the first paragraph of the Wikipedia page describing the Wikidata item.
We included English, Italian, and Dutch texts, which we processed using various NLP systems.
Also, we represent structured data about each Wikidata item, which facilitates research into the framing of events.

You can download it from.