Open Data Sets


Research in general, and machine learning in particular, depends on big data. Red Hen Lab seeks to create Open Data Sets and to list here open data sets that might be useful for research in multimodal communication. For Red Hen, "open" does not necessarily mean "public." Some data sets may be available only to certain researchers, such as Red Hens, and only under certain research licenses.

Related pages

  • Access to Red Hen Tools and Data. NB this principle: "If you access Red Hen data to create a tagged dataset for your research, the default expectation is that you will return to Red Hen a tagged dataset useful to others interested in doing such work. Typically, this dataset will be released as an Open Data Set. In some cases, depending on the content, the dataset will be suitable for release to only other Red Hens, as allowed by the Red Hen Access Board."

Open Data Sets

  • ViMELF - The Corpus of Video-Mediated English as a Lingua Franca Conversations, Version 1.0.
    • Dataset: ViMELF contains 20 fully transcribed Skype conversations, annotated for gestures and pragmatic elements, between 40 speakers from Germany (20 speakers), Spain (5), Italy (5), Finland (5), and Bulgaria (5), totaling 744.5 minutes (ca. 12.5 hours), with an average conversation length of 37.23 minutes. The corpus comprises 113 670 words in the plain text version and 152 472 items in the annotated version. The transcripts are available as .docx and .txt files; the anonymized videos are in MPEG-4 format. Several versions are available: the fully annotated pragmatic version as text and XML (XTranscript, Gee 2018), a lexical version (XTranscript, Gee 2018), and a POS-tagged version (auto-tagged with the CLAWS C7 tagset).
    • Website and further info: http://umwelt-campus.de/case
    • Access: ViMELF transcripts are freely available for non-commercial research purposes. If you would like to use the dataset, please register via the project website – you will then receive download instructions. The video and audio data is available separately for viewing/listening via a dedicated university server.
    • Project coordination: Stefan Diemer & Marie-Louise Brunner, Language & Communication, Trier University of Applied Sciences, Germany
    • Citation: To cite ViMELF in your own research, please use the following citation:
      ViMELF. 2018. Corpus of Video-Mediated English as a Lingua Franca Conversations. Birkenfeld: Trier University of Applied Sciences. Version 1.0. The CASE project [http://umwelt-campus.de/case].
    • Contact: sk@umwelt-campus.de
  • Red Hen Interview Gesture Collection (RHIGC)
    • Dataset: The RHIGC is based on 20 interviews from The Ellen DeGeneres Show that were hand-annotated for gesture by Suwei Wu and Yao Tong at VU Amsterdam for their PhD projects under the supervision of Prof. Alan Cienki. It will contain video snippets of hand gestures (and possibly of similar shots without hand gestures). An alternative version with pre-annotated data generated with OpenPose may also be made available.
    • Project coordination: Yao Tong (VU Amsterdam) & Peter Uhrig (Universität Osnabrück/FAU Erlangen-Nürnberg).