Kodicare

Evaluating search systems requires setting up an environment: selecting a paradigm, metrics, a dataset, and so on. The choice of environment is rarely motivated objectively, and the impact of its variations (choosing one dataset over another, or altering a dataset) is rarely measured. Such objectivity comes from a quantifiable understanding of the differences between datasets, documents, or test queries. In Kodicare, we generically call such a difference a “knowledge delta”. Evaluating several environments whose knowledge deltas are known makes it possible to measure and qualify “result deltas”. Online systems require continuous evaluation in a stable and meaningful environment, which guarantees the reproducibility and explainability of system results. The environment and result deltas will support such continuous evaluation and provide explanations. The theoretical results will be tested against real cases defined by Qwant, a French company that deploys a web search engine. The project is funded by ANR.

MATERIAL (Machine Translation for English Retrieval of Information in Any Language)

The MATERIAL Program seeks to develop methods for finding speech and text content in low-resource languages that is relevant to domain-contextualized English queries. Such methods must use minimal training data and be rapidly deployable to new languages and domains. Content that is responsive to the queries will be returned from multiple genres along with succinct summaries in English. The overall end-to-end capability will enable monolingual triage of multilingual datasets. The project was funded by IARPA.

Interactive information retrieval in audiovisual dialogue corpora

The project aims at information retrieval in audiovisual corpora. The main focus was on research into interactive methods of information retrieval. These methods should enable users to interactively specify and modify their search queries. The possibilities of automatically generating topics from available metadata and from audio and visual information will be studied, as well as the clustering of topics into larger units. Automatically generated topics will then be available to users in the interactive interface. User corrections will then be used for evaluation and for further improvement of the methods. The project was supported by the Charles University Grant Agency.

CEMI (Center for large-scale multi-modal data interpretation)

The project aims at exploiting large collections of unlabeled multi-modal data, mainly video footage, to advance the state of the art in video, audio, and natural language understanding, interpretation, annotation, and retrieval by combining unsupervised and semi-supervised learning. It will address problems that are very difficult (some probably impossible) to solve in a single modality by adopting an interdisciplinary approach. Progress in the individual areas (vision, language, and speech) will be achieved by co-training and by exploiting the results of other modalities as cross-training data. For efficient processing of large data collections, the project will also address generic problems of organization, indexing, and similarity-based search, which are critical for building real-life applications. The consortium comprises internationally recognized groups with cutting-edge expertise in these research areas. The project will benefit from the sharing of expertise, data, and methodologies. Besides scientific results presented in publications, two demonstrators will be produced. The project was supported by the Czech Science Foundation.

AMALACH (ASR- and MT-based Access to a Large Archive of Cultural Heritage)

The aim of the AMALACH project is to design and implement software tools that facilitate access to a large collection of video interviews with Holocaust survivors. The archive, now hosted at the University of Southern California Shoah Foundation Institute, contains more than 110 thousand hours of recordings in 32 languages. About half of the interviews are in English, and the Czech portion amounts to approximately one thousand hours. Current access methods allow searching only for keywords listed in a pre-defined dictionary (thesaurus), because snippets of the recordings were manually tagged with these keywords. The coverage of this manual labelling is, however, insufficient, especially in the Czech part of the archive. The AMALACH project thus aims to use advanced methods of automatic speech recognition (ASR) and machine translation (MT) to enable search in at least all the Czech and English recordings. The project was supported by the Ministry of Culture of the Czech Republic.

EuroMatrix+

The project’s focus was on “bringing MT to the user”. The project further developed the open-source system Moses, the statistical machine translation (SMT) toolkit most widely adopted not only by academia but also by the translation industry. In addition to cutting-edge research in SMT and in hybrid approaches to MT, in which rule-based and statistical components are combined in various ways to benefit from the strengths of both, the project organized several “MT Marathons” and continued the annual evaluation campaigns with shared tasks on pressing issues, all widely recognized by the field. The organization of specialized workshops with industrial users, the release of resources and software, and a total of 196 scientific publications complement the success story of EuroMatrixPlus. The project was funded by the European Commission’s Seventh Framework Programme.