Source Codes

Scripts and resources used at the 2020 TREC Podcasts Track

Source codes for the search and hyperlinking system SHAMUS

Data

Information Retrieval

LongEval Train Collection The collection serves as the official training collection for the 2023 LongEval Information Retrieval Lab organised at CLEF. It consists of queries and documents provided by the Qwant search engine. In total, the collection contains 672 train queries, with 9656 corresponding assessments coming from the Qwant click model, and 98 heldout queries. The set of documents consists of 1,570,734 downloaded, cleaned and filtered web pages.

LongEval Test Collection The collection serves as the official test collection for the 2023 LongEval Information Retrieval Lab organised at CLEF. The collection contains test datasets for two organized sub-tasks: short-term persistence (sub-task A) and long-term persistence (sub-task B). The data for the short-term persistence sub-task was collected over July 2022 and this dataset contains 1,593,376 documents and 882 queries. The data for the long-term persistence sub-task was collected over September 2022 and this dataset consists of 1,081,334 documents and 923 queries.

The 2024 versions of the LongEval collections: LongEval 2024 Train Collection and LongEval 2024 Test Collection.

Czech Cross-language Speech Retrieval Test Collection Malach Czech recordings provided by the Visual History Archive, which consists of interviews with Holocaust survivors. The archive consists of audio recordings, four types of automatic transcripts, manual annotations of selected topics and interview metadata. In total, the archive contains 353 recordings and 592 hours of interviews.

Digital Humanities

Medieval Charter Sections Corpus This package provides an evaluation framework and training and test data for semi-automatic recognition of sections of historical diplomatic manuscripts. The data collection consists of 57 Latin charters issued by the Royal Chancellery in the 14th century.

Machine Translation

Czech-Slovak Parallel Corpus Czech-Slovak parallel corpus consisting of several freely available corpora (Acquis, Europarl, Official Journal of the European Union and part of the OPUS corpus – EMEA, EUConst, KDE4 and PHP) and data mined from the European Commission website. The corpus contains 5.7M sentences.

English-Slovak Parallel Corpus English-Slovak parallel corpus consisting of several freely available corpora (Acquis, Europarl, Official Journal of the European Union and part of the OPUS corpus – EMEA, EUConst, KDE4 and PHP) and data mined from the European Commission website.

Czech-English Parallel Corpus 1.0 (CzEng 1.0) CzEng 1.0 is the fourth release of a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL), freely available for non-commercial research purposes. CzEng 1.0 contains 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources, automatically annotated at surface and deep layers of syntactic representation.

WMT 2011 Testing Set Test set from the WMT 2011 competition, manually translated from Czech and English into Slovak. The test set contains 3003 sentences in Czech, Slovak and English.

Manually Ranked Translation Outputs Manually ranked outputs of Czech-Slovak translations. Three annotators manually ranked outputs of five MT systems on three data sets.

Manually Classified Errors in En->Sk Translation Manual classification of errors in English-Slovak translation. 50 sentences randomly selected from the WMT 2011 test set were translated by 3 MT systems, and MT errors were manually marked and classified.

Manually Classified Errors in Cs->Sk Translation Manual classification of errors in Czech-Slovak translation. The first 50 sentences from the WMT 2010 test set were translated by 5 MT systems, and MT errors were manually marked and classified.

Petra Galuscakova
