Corpus


Twitter Corpus

 

Since the end of September 2015, the platform has hosted a Twitter stream, representing 1% of global tweets.

About 500 GB of data (between 30 and 40 tweets per second) are collected each month and are available to research teams wishing to use this corpus.
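To give a sense of scale, the figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is only illustrative: it assumes a 30-day month, decimal gigabytes (1 GB = 10^9 bytes), and a constant collection rate, none of which is stated by the platform. The function names are hypothetical.

```python
# Rough estimate derived from the stated figures (500 GB/month, 30-40 tweets/s).
# Assumptions (not from the platform): 30-day month, 1 GB = 10**9 bytes,
# constant collection rate over the whole month.

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30
MONTHLY_VOLUME_BYTES = 500 * 10**9


def tweets_per_month(rate_per_second, days=DAYS_PER_MONTH):
    """Number of tweets collected in a month at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY * days


def bytes_per_tweet(rate_per_second):
    """Average stored size of one tweet implied by the monthly volume."""
    return MONTHLY_VOLUME_BYTES / tweets_per_month(rate_per_second)


if __name__ == "__main__":
    for rate in (30, 40):
        print(
            f"{rate} tweets/s: about {tweets_per_month(rate) / 1e6:.1f} million "
            f"tweets per month, about {bytes_per_tweet(rate) / 1e3:.1f} kB per tweet"
        )
```

Under these assumptions, the stream amounts to roughly 78 to 104 million tweets per month, at an average of about 5 to 6 kB of stored data per tweet.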


MS MARCO Corpus

 

Microsoft Machine Reading Comprehension (MS MARCO) is a large-scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real, anonymized user queries. The context passages, from which the answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated, written whenever the annotators could summarize an answer.

More information: http://www.msmarco.org/


 

GDR I3 BULLetin

 

We host the 2006-2016 archives of the 'GDR I3 BULLetin' mailing list, which is used to send newsletters about upcoming conferences, calls for papers, and funding and job announcements (postdocs, positions, ...). The BULLetin subscriber list brings together all the members of the GDR I3 (industry practitioners, researchers, lecturers, PhD students, ...): people from the Information, Intelligence and Interaction communities who work, through the GDR's various working groups, on the issues at the heart of these research fields.


To access these corpora, feel free to contact us.