Last updated: 2018-07-26
Red Hen Lab coordinates research on multimodal communication. This page describes its vision, principles, illustrative ongoing projects, and prospective projects.
Human beings have evolved for elaborate multimodal communication. Cultures support this power. Communicating seems easy to human beings, just as seeing seems easy. But it is immensely complex, involving not only vision but also movement, sound, interpersonal interaction, dynamic coordination across agents, conceiving of the intentions of other agents, and so on. Unlike vision, advanced multimodal communication is found only in human beings; there are no good animal models. Red Hen seeks to gather and develop mathematical, computational, statistical, and technical tools to help advance research into multimodal communication.
The study of multimodal communication made advances with the advent of online corpora such as the British National Corpus, the Russian National Corpus, and the Corpus of Contemporary American English, but the limitations of these corpora were sharp: the data were mostly text, with limited and dated holdings.
The principles of the Red Hen program are as follows:
As a research program, Red Hen functions as a cooperative exchange of research agendas and priorities, domain expertise, datasets, funding, and funding opportunities. Red Hen projects mostly arise when teams of Red Hens form to take responsibility for a specific project, and much of Red Hen’s operation is designed to foster the development of such teams. Here are some examples of completed and ongoing projects within Red Hen’s program.
Holdings. Red Hen gathers and connects datasets of many different kinds: text, photographs of paintings and sculpture, and audio, video, or audiovisual recordings. In principle, any record of human communication is of interest to Red Hen, but above all, Red Hen needs massive datasets in consistent formats with time-correlated image, text, and audio data on which to develop computational and statistical tools. Accordingly, her largest holding by far consists of recordings of TV broadcast news. Such recording and archiving for the purpose of research is protected by section 108 of the U.S. Copyright Act. “News” includes any sort of broadcast in which current events are a topic, and so includes talk shows, interview programs, and so on. The TV holdings at present include about 350,000 hours of recordings, in an expanding variety of languages (English, Spanish, French, both European and Brazilian Portuguese, Italian, Norwegian, Swedish, Danish, German, Arabic, Russian, Polish, Czech, Chinese). Each day, roughly 150 hours of news are ingested robotically. This dataset of course has special features, and it is crucial in research always to keep in mind the nature of the data being used: TV news is not pillow talk. But such recordings, which now stretch back to the 1970s thanks to the digitization of analog holdings, include not only scripted speakers but vast footage of people being interviewed, having conversations, making presentations to crowds, operating in public spaces, or being recorded without their knowledge (such as surveillance footage). It also includes advertisements. Beyond TV news, Red Hen connects datasets of photographs, art works, texts, illuminated manuscripts, lab recordings of human beings as subjects of experiments, video conference communication, YouTube videos, Twitter data, cartoons and graphic novels, and so on, with new datasets routinely being located and networked. Having a variety of datasets makes it possible not only to locate differences in communication across these genres (e.g., the way deictics like “here” and “now” are used in TV news versus the way they are used in personal letters or Skype) but also regularities: the TV news and the pillow talk might both be in English, for example, and both offer evidence for the use of various grammatical patterns.
Data Structure. Data are stored in flat files. A given record consists of a set of files sharing the same base file name, which indicates absolute start time, location, and event. Each file contains image, text, or audio data with precise start and end times, so that all aspects of the data and metadata can be kept in millisecond registration.
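As a rough illustration of this flat-file scheme, the sketch below assumes a basename of the form date_time_location_event (the exact field layout and file extensions here are hypothetical); every file belonging to one record shares that basename, so parsing it recovers the absolute start time and the remaining metadata fields.

```python
# Minimal sketch, not Red Hen's actual code: parse a hypothetical record
# basename such as "2016-01-01_0000_US_CNN_Newsroom", which the video, text,
# and audio files of one record would all share.
from datetime import datetime, timezone
from pathlib import Path

def parse_record_name(path: str) -> dict:
    """Split a record basename into start time and remaining metadata fields."""
    stem = Path(path).stem
    date, time, *rest = stem.split("_")
    start = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H%M")
    return {"start": start.replace(tzinfo=timezone.utc),
            "fields": rest,           # e.g. location and event/program name
            "basename": stem}

record = parse_record_name("2016-01-01_0000_US_CNN_Newsroom.txt")
print(record["start"].isoformat(), record["fields"])
```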
Tagging. Red Hen works with a global network of developers to add new features to existing tools for extracting metadata and annotating text and images. We work with CCExtractor to support text extraction from Brazilian, Russian, Czech, and Chinese television; we work with Stanford NLP to improve sentence splitting in single-case (all-uppercase) text. Our deployment of SEMAFOR to annotate texts with FrameNet frames has generated by far the largest frame-annotated dataset in the world to date; we are working with researchers on improvements. Within the Red Hen program, there is significant potential for developing targeted test datasets for specific problems. Red Hens are working on creating test datasets for partially occluded timeline gestures, discourse management strategies, scare quote usage, and more.
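To make the single-case problem concrete, here is a toy sentence splitter; it is illustrative only (the actual pipeline relies on Stanford NLP with custom code) and shows why all-uppercase caption text defeats naive, punctuation-based splitting.

```python
import re

# Toy heuristic only: split on sentence-final punctuation followed by whitespace.
# All-caps captions offer no capitalization cues, so abbreviations like "MR."
# produce spurious breaks -- exactly the case an improved splitter must handle.
def naive_split(caption_text: str) -> list:
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", caption_text) if s.strip()]

print(naive_split("SPEAKER SAYS MR. SMITH ARRIVED. THE CROWD CHEERED."))
# -> ['SPEAKER SAYS MR.', 'SMITH ARRIVED.', 'THE CROWD CHEERED.']
```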
Pipelines. Red Hen develops automated processing pipelines hosted on high-performance computing clusters, with the capacity to process hundreds of thousands of hours of video, audio, and text. Incoming recordings from around the world are picked up by UCLA’s Hoffman2 cluster, where on-screen text is retrieved in twelve languages via optical character recognition, using screenshots fed to custom versions of Tesseract, and the video is compressed. The text is split into words and sentences at the University of Erlangen, using custom code with Stanford NLP. The two streams come together again at Case Western Reserve University, in the audio pipeline for forced alignment, speaker diarization, gender identification, and speaker recognition, and in the video pipeline for shot characterization and gesture detection. For the web site http://viz2016.com, joint text and image analyses are used for speaker and location detection and for topic detection and clustering in television data, joined with Twitter data. These pipeline projects and others are open-ended. For example, Red Hen is mentoring her third consecutive Google Summer of Code team in the development of these pipelines, to include elements such as emotion detection and characterization, controversy detection and sentiment tagging, and word-based multi-dimensional audio analysis.
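As a simplified view of the on-screen-text stage, the sketch below runs Tesseract over screenshots that are assumed to have already been extracted from a recording (the directory layout and the ffmpeg step are hypothetical); the production pipeline uses custom versions of Tesseract on the Hoffman2 cluster.

```python
# Minimal OCR sketch, not Red Hen's pipeline code. It assumes video frames
# have already been exported as PNG screenshots, e.g.
#   ffmpeg -i recording.mp4 -vf fps=1 frames/%06d.png
# and that the relevant Tesseract language data is installed.
from pathlib import Path
from PIL import Image
import pytesseract  # thin Python wrapper around the Tesseract OCR engine

def ocr_frames(frame_dir: str, lang: str = "eng") -> dict:
    """Run Tesseract over every screenshot in a directory."""
    results = {}
    for frame in sorted(Path(frame_dir).glob("*.png")):
        text = pytesseract.image_to_string(Image.open(frame), lang=lang)
        results[frame.name] = text.strip()
    return results

# print(ocr_frames("frames", lang="eng"))  # hypothetical directory
```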
Search. Red Hen has developed both command-line search utilities for text and tags, built from a variety of *nix calls, and web search interfaces for text, metadata, and visual features. CQPweb, which is based on the software used to search the British National Corpus, is available for searching for patterns in Red Hen’s English-language holdings. Development of such search tools is open-ended. Red Hen’s search reports are optimized for analysis using the statistical software package R.
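In the spirit of those *nix command-line utilities, the following sketch (not one of Red Hen’s actual search tools) scans flat text files for a regular expression and writes a tab-separated report that R can load directly with read.delim().

```python
# Illustrative corpus search: regex over flat .txt files, tab-separated output.
import csv
import re
import sys
from pathlib import Path

def search_corpus(corpus_dir: str, pattern: str, out_path: str) -> None:
    rx = re.compile(pattern, re.IGNORECASE)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["file", "line_number", "match", "line"])
        for txt in sorted(Path(corpus_dir).glob("*.txt")):
            for n, line in enumerate(txt.open(errors="replace"), start=1):
                for m in rx.finditer(line):
                    writer.writerow([txt.name, n, m.group(0), line.strip()])

if __name__ == "__main__":
    # e.g.: python search.py /path/to/corpus "\bthe \w+er\b" hits.tsv
    search_corpus(sys.argv[1], sys.argv[2], sys.argv[3])
```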
Machine Learning. Data tagged using such open-source tools as ELAN and Red Hen’s Rapid Annotator are ingested into Red Hen’s metadata, in part so that tagging done by individual researchers is no longer withheld from the global research community. That “ground truth” data, manually tagged by experts, is then made available to machine learning teams for the training of recognizers and classifiers. These machine-learning tools are then used to tag the Red Hen data automatically, thereby helping researchers in multimodal communication.
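The loop from expert tagging to automatic tagging can be sketched as follows; the snippets, labels, features, and model here are placeholder assumptions, not Red Hen’s production recognizers, but they show how manually tagged “ground truth” becomes a classifier that labels new data.

```python
# Toy version of the ground-truth -> classifier -> automatic-tags loop.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical manual annotations exported from ELAN or Rapid Annotator.
snippets = ["the closer we get the better it looks",
            "stocks fell sharply on tuesday",
            "the more you practice the easier it gets",
            "the senate passed the bill"]
labels = ["comparative_correlative", "other",
          "comparative_correlative", "other"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(snippets, labels)

# The trained model can then tag unseen Red Hen text automatically.
print(model.predict(["the longer they wait the worse it becomes"]))
```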
Multimodal constructions and co-speech gesture. To know a language is to know a vast relational network of form-meaning pairs (called by linguists “constructions”) and how they can blend. Nearly all of the massive research in linguistics on constructions takes text as its data, but form-meaning pairs can include aspects of speech, gesture, the manipulation of material affordances in the environment, and so on. Researchers can locate large numbers of uses of even infrequent constructions in Red Hen data because the dataset is so large, diverse, and easily searched. Constructions that have been researched in Red Hen include comparative correlatives (“The closer you come, the more I hear”), XYZ (“Causation is the cement of the universe”), kinds of “absolute” constructions (“Absent diplomacy, this will fail”), conditional constructions, and many others. The researcher can see not only the text but the full human performance of the communication, including voice, gesture, and so on (Turner 2015).
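As a very rough illustration of how even a simple pattern can pull candidate constructions out of large text holdings, the regular expression below targets English comparative correlatives; real searches would use the interfaces described above, and this pattern is deliberately over-simple.

```python
import re

# Crude pattern for comparative correlatives ("the Xer ..., the Yer ..."),
# for illustration only; it will both over- and under-match in real data.
COMP_CORR = re.compile(
    r"\bthe\s+(\w+er|more|less)\b[^.?!]*,\s*the\s+(\w+er|more|less)\b",
    re.IGNORECASE,
)

for sentence in ["The closer you come, the more I hear.",
                 "Causation is the cement of the universe."]:
    print(sentence, "->", bool(COMP_CORR.search(sentence)))
```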
Errors. “Errors" in expression are not random; they instead indicate cognitive processing. But such mistakes often go unrecognized by the human hearer, who “accommodates” mentally, and those mistakes that are detected in text are typically eliminated under editing. Red Hen data, however, frequently include such communicative performances. The researcher can predict such patterns, and check the predictions against the dataset (Turner 2017).
Deictics in different contexts. How are deictic expressions (e.g. “here,” “now,” “there,” “then”) used in different communicative environments and in different languages with different structures of deictic expressions? Nesset et al. (2013) used Red Hen to explore this topic in English versus Russian.
Multi-language frame detection. Red Hen already tags its entire English subset for conceptual frames using FrameNet. But there are FrameNet projects for other languages, e.g., Spanish. It is a natural but new extension for Red Hen to seek to detect frames in a variety of languages, which would not only produce better metadata for the researcher in that language but also provide insight for cross-linguistic frame resources.
Prosody indicating viewpoint. Speakers express their viewpoint, attitude, or perspective on the meaning of what they are saying, often framing its source. The stance a speaker or a listener adopts towards some content can be expressed wordlessly, with a shrug, a stare, a gasp, a wave of the hand, a smack, a tearful eye, a hollow laugh, or prosodically, through delicate modulations of the speed, pitch, and quality of the voice. A speaker in the act of presenting claims may for instance indicate epistemic distance—that is, a viewpoint of doubt or distrust—from these claims. This epistemic distance is crucial to the communication, but is often irretrievably lost in a mere verbal transcript. Red Hen is launching a project on the automatic detection of viewpoint as expressed by prosody.
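One way to begin, sketched below under clearly labelled assumptions (the audio file name and feature choices are hypothetical), is to extract simple prosodic measurements such as a pitch track, whose median and variability could then be aligned with the transcript and fed to a stance classifier.

```python
# Illustrative prosodic feature extraction, not Red Hen's detector.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=16000)  # hypothetical audio clip
f0, voiced_flag, _ = librosa.pyin(y,
                                  fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"),
                                  sr=sr)

pitch = f0[voiced_flag]  # keep voiced frames only
print("median pitch (Hz):", float(np.nanmedian(pitch)))
print("pitch variability (Hz):", float(np.nanstd(pitch)))
# Features like these, time-aligned with the words being spoken, could train
# a classifier on examples labelled for epistemic stance.
```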
Automatic gesture recognition. Human faces exhibit a wide variety of expressions and emotions. Building on such facial signals, Joo et al. (2015) developed a hierarchical model that automatically judges the perceived personalities of politicians from their facial photographs and detected traits. Many similar projects could be pursued within Red Hen. Red Hen is also beginning to produce automatic classifiers for arm and hand gestures, for example co-speech gestures used for timelines.
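A first step toward such classifiers, shown below as a minimal sketch (the frame file name is hypothetical, and this is face localization only, not the hierarchical model of Joo et al.), is simply to find faces in a video frame before any trait or gesture classification is attempted.

```python
# Face localization with OpenCV's bundled Haar cascade -- illustration only.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame_000123.png")  # hypothetical screenshot
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
print(f"detected {len(faces)} face(s)")
```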
Red Hen deploys the contributions of researchers from complementary fields, from AI and statistics to linguistics and political communication, to create rich datasets of parsed and intelligible multimodal communication and to develop tools to process these data and any other data susceptible to such analysis. Red Hen’s social organization and computational tools are designed for reliable and cumulative progress in a dynamic and extremely challenging field: the systematic understanding of the full complexity of human multimodal communication. The study of how human beings make meaning and interpret forms depends upon such collaboration.