Google Summer of Code 2018 Ideas Page

The deadline for submitting proposals to Google for GSoC 2018 has passed.


Red Hen Google Summer of Code 2018

Students, Apply! Propose a project. Code the summer away. Achieve ultimate glory. And get a nice paycheck!

How to Apply

Red Hen Lab is an international consortium of researchers in multimodal communication, with mentors spread around the globe. Together, we have crafted this Ideas page, which offers some information about the Red Hen dataset of multimodal communication (see some sample data here and here) and a long list of tasks.

To succeed in your collaboration with Red Hen, the first step is to orient yourself carefully in the relevant material. The Red Hen Lab website that you are currently visiting is voluminous. Please explore it carefully. There are many extensive introductions and tutorials on aspects of Red Hen research. Make sure you have at least an overarching concept of our mission, the nature of our research, our data, and the range of the example tasks we have provided to guide your imagination. Having contemplated the Red Hen research program on multimodal communication, come up with a task that is suitable for Red Hen and that you might like to embrace or propose. Many illustrative tasks are sketched below. Orient in this landscape, and decide where you want to go.

The second step is to formulate a pre-proposal sketch of 1-3 pages that outlines your project idea. In your proposal, you should spell out in detail what kind of data you need for your input and the broad steps of your process through the summer, including the basic tools you propose to use. Give careful consideration to your input requirements; in some cases, we will be able to provide annotations for the feature you need, but in other cases successful applicants will craft their own metadata, or work with us to recruit help to generate it.

Send your pre-proposal to redhenlab@gmail.com. Being able to generate a meaningful pre-proposal is a requirement for joining the team; if you require more hand-holding to get going, Red Hen Lab is probably not the right organization for you this year. We want to work with you at a high level, and this requires initiative on your part and the ability to orient in a complex environment.

When we receive your pre-proposal, we will assess it and attempt to locate a suitable mentor; if we succeed, that mentor will get back to you with feedback to help you develop a fully fledged proposal to submit to GSoC 2018. The deadline for submitting your final proposal to Google is 27 March 2018, noon EDT. Your proposal must be submitted directly to the Google Summer of Code site for Google to recognize your submission.

We are excited to be working with you and look forward to your pre-proposals.

Add your name and email to our Red Hen GSoC Student Mailing List. (You can change that information at any time by resubmitting.)


Background Information

Red Hen Lab participated in Google Summer of Code in 2015, 2016, and 2017, working with brilliant students and expert mentors from all over the world. Each year, Red Hen has mentored students in developing and deploying cutting-edge techniques of multimodal data mining, search, and visualization, with an emphasis on tagging for natural language, co-speech gesture, paralinguistic elements, and a great variety of behavioral forms used in human communication. With significant contributions from Google Summer of Code students from all over the world, Red Hen has constructed tagging pipelines for linguistic, audio, and video elements. These pipelines are undergoing continuous development, improvement, and extension. Red Hens have excellent access to high-performance computing clusters at UCLA, Case Western Reserve University, and FAU Erlangen; for massive jobs Red Hen Lab has an open invitation to apply for time on NSF's XSEDE network.

Red Hen's largest dataset is a vast collection of more than 400,000 television news programs, initiated by UCLA's Department of Communication, developed in collaboration with Red Hens from around the world, and curated by the UCLA Library, with processing pipelines at Case Western Reserve University, FAU Erlangen, and UCLA. Red Hen develops and tests tools on this dataset that can be used on a great variety of data—texts, photographs, audio and audiovisual recordings. Red Hen also acquires big data of many kinds in addition to television news, and is open to the acquisition of data needed for particular projects. Red Hen creates tools that are useful for generating a semantic understanding of big data collections of multimodal data, opening them up for scientific study, search, and visualization. See Overview of Research for a description of Red Hen datasets.

In 2015, Red Hen's principal focus was audio analysis; see the Google Summer of Code 2015 Ideas page. Red Hen students created a modular series of audio signal processing tools, including forced alignment, speaker diarization, gender detection, and speaker recognition (see the 2015 reports, extended 2015 collaborations, and github repository). This audio pipeline is currently running on Case Western Reserve University's high-performance computing cluster, which gives Red Hen the computational power to process the hundreds of thousands of recordings in the Red Hen dataset. With the help of GSoC students and a host of other participants, the organization continues to enhance and extend the functionality of this pipeline. Red Hen is always open to new proposals for high-level audio analysis.

In 2016, Red Hen's principal focus was deep learning techniques in computer vision; see the Google Summer of Code 2016 Ideas page and Red Hen Lab page on the Google Summer of Code 2016 site. Talented Red Hen students, assisted by Red Hen mentors, developed an integrated workflow for locating, characterizing, and identifying elements of co-speech gestures, including facial expressions, in Red Hen's massive datasets, this time examining not only television news but also ancient statues; see the Red Hen Reports from Google Summer of Code 2016 and code repository. This computer vision pipeline is also deployed on CWRU's HPC in Cleveland, Ohio, and was demonstrated at Red Hen's 2017 International Conference on Multimodal Communication. Red Hen is planning a number of future conferences and training institutes. Red Hen GSoC students from previous years typically continue to work with Red Hen to improve the speed, accuracy, and scope of these modules, including recent advances in pose estimation.

In 2017, Red Hen invited proposals from students for components for a unified multimodal processing pipeline, whose purpose is to extract information about human communicative behavior from text, audio, and video. Students developed audio signal analysis tools, extended the Deep Speech project with Audio-Visual Speech Recognition, engineered a large-scale speaker recognition system, made progress on laughter detection, and developed Multimodal Emotion Detection in videos. Focusing on text input, students developed techniques for show segmentation, neural network models for studying news framing, and controversy and sentiment detection and analysis tools (see Google Summer of Code 2017 Reports). Rapid development in convolutional and recurrent neural networks is opening up the field of multimodal analysis to a slew of new communicative phenomena, and Red Hen is in the vanguard.

In large part thanks to Google Summer of Code, Red Hen Lab has been able to create a global open-source community devoted to computational approaches to parsing, understanding, and modeling human multimodal communication. This year, the organization is aiming to make inroads in China, a country in which Red Hen has so far had difficulty recruiting, since Google is blocked from Chinese locations. Red Hen now has a new Center for Cognitive Science at Hunan Normal University in Hunan Province, directed by Red Hen Lab Co-Director Mark Turner, which gives the organization a new platform for outreach in China. With continued support from Google, Red Hen would like to bring top students from China into the open-source community.

What kind of Red Hen are you?

More About Red Hen

Our mentors

- Francis Steen, UCLA (https://www.linkedin.com/in/ffsteen)
- Mark Turner, Case Western Reserve University (http://markturner.org)
- Inés Olza, University of Navarra (https://sites.google.com/site/inesolza/home)
- Jacek Wozny, University of Wroclaw (https://sites.google.com/site/wrocifa/)
- Shruti Gullapuram, UMass Amherst
- Weixin Li, Beihang University (http://www.cs.ucla.edu/~lwx/)
- Mehul Bhatt, University of Bremen and Örebro University (http://www.mehulbhatt.org/)
- Jakob Suchan, University of Bremen
- Peter Uhrig, FAU Erlangen (https://www.anglistik.phil.fau.de/staff/uhrig/)
- Peter Broadwell (http://www.linkedin.com/in/PeterMBroadwell)
- Vera Tobin, Case Western Reserve University (http://cognitivescience.case.edu/faculty/vera-tobin/)
- Anna Pleshakova
- Kai Chan, UCLA
- Javier Valenzuela Manzanares, University of Murcia (http://www.um.es/lincoing/jv/index.htm)
- Cristóbal Pagán Cánovas, University of Navarra (https://sites.google.com/site/cristobalpagancanovas/)
- Anna Bonazzi, UCLA and Technische Universität Dresden
- Heiko Schuldt, University of Basel
- Abhinav Shukla, International Institute of Information Technology
- José Fonseca, Polytechnic Higher Education Institute of Guarda

Additional mentor profiles: http://www.jsjoo.com, http://engineering.case.edu/profiles/sxr358

The profiles of mentors not listed above are linked from their names in the project descriptions below.

Guidelines for project ideas

Your project should be in the general area of multimodal communication, whether it involves tagging, parsing, analyzing, searching, or visualizing. Red Hen is particularly interested in proposals that make a contribution to integrative cross-modal feature detection tasks. These are tasks that exploit two or even three different modalities, such as text and audio or audio and video, to achieve higher-level semantic interpretations or greater accuracy. You could work on one or more of these modalities. Red Hen invites you to develop your own proposals in this broad and exciting field.

Red Hen studies all aspects of human multimodal communication, such as the relation between verbal constructions and facial expressions, gestures, and auditory expressions. Examples of concrete proposals are listed below, but Red Hen wants to hear your ideas! What do you want to do? What is possible? You might focus on a very specific type of gesture, or facial expression, or sound pattern, or linguistic construction; you might train a classifier using machine learning, and use that classifier to identify the population of this feature in a large dataset. Red Hen aims to annotate her entire dataset, so your application should include methods of locating as well as characterizing the feature or behavior you are targeting. Contact Red Hen for access to existing lists of features and sample clips. Red Hen will work with you to generate the training set you need, but note that your project proposal might need to include time for developing the training set.

Red Hen develops a multi-level set of tools as part of an integrated research workflow, and invites proposals at all levels. Red Hen is excited to be working with the Media Ecology Project to extend the Semantic Annotation Tool, making it more precise in tracking moving objects. The "Red Hen Rapid Annotator" is also ready for improvements. Red Hen is open to proposals that focus on a particular communicative behavior, examining a range of communicative strategies utilized within that particular topic. See for instance the ideas "Tools for Transformation" and "Multimodal rhetoric of climate change". Several new deep learning projects are on the menu, from "Emotion Detection and Characterization" to "Explainable Visual Perception" and "Multimodal Egocentric Perception". On the search engine front, Red Hen also has several candidates: the "Development of a Query Interface for Parsed Data", "Multimodal CQPweb", and a mobile app for visual search, "Augmented Reality: Android-based Mobile App for Linking Photos and Videos to the Real World". Red Hen welcomes visualization proposals; see for instance the "Semantic Art from Big Data" idea below.

The latest frontier for open source software is the sophisticated Chinese DTMB television standard; Red Hen is now capturing television in China and is happy to provide shared datasets and joint mentoring with our partners CCExtractor, which provides the vital tools for text extraction in several television standards, and Joker-TV, an open-source global television capture platform.

When you plan your proposal, bear in mind that your project should result in a working application. For Red Hen, that means it finds its place within the integrated research workflow. The application may simply need to talk to the data repository, or it may be a module that is installed on Red Hen's high-performance computing clusters, fully tested, with clear instructions, and ready to be deployed to process a massive dataset. The architecture of your project should be designed so that it is clear and understandable for coders who come after you, and fully documented, so that you and others can continue to make incremental improvements. Your module should be accompanied by a Python application programming interface (API) that specifies the input and output, to facilitate the development of a unified multimodal processing pipeline for extracting information from text, audio, and video. Red Hen prefers projects that use C/C++ and Python and run on Linux. For some of the ideas listed, but by no means all, it's useful to have prior experience with deep learning tools.
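
As a purely illustrative sketch of what such an API wrapper might look like (the module layout, function signature, and output fields are hypothetical, not an existing Red Hen interface):

    # Hypothetical sketch of a pipeline-module API; the function signature and
    # output fields are illustrative only, not an existing Red Hen interface.
    import json
    import sys

    def annotate(media_path, transcript_path, output_path):
        """Read one recording and its transcript; write JSON Lines annotations."""
        results = []
        # ... run the actual detector over media_path / transcript_path here,
        # appending dicts such as {"start": 1.0, "end": 2.5, "label": "..."} ...
        with open(output_path, "w", encoding="utf-8") as out:
            for record in results:
                out.write(json.dumps(record, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        annotate(sys.argv[1], sys.argv[2], sys.argv[3])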

Your project should be scaled to the appropriate level of ambition, so that at the end of the summer you have a working product. Be realistic and honest with yourself about what you think you will be able to accomplish in the course of the summer. Provide a detailed list of the steps you believe are needed, the tools you propose to use, and a weekly schedule of milestones. Choose a task you care about, in an area where you want to grow. The most important thing is that you are passionate about what you are going to work on with us. Red Hen looks forward to welcoming you to the team!

Ideas for Projects

1. NLP Pipeline for English v2

Mentored by Peter Uhrig, Francis Steen, Mark Turner

Red Hen has a working NLP pipeline for all incoming English recordings, but some of the tools are outdated, and some have already been superseded by our new system. Your task is to get the latest versions of a range of programs to run and adapt existing software to our new data format. Specifically, these include:

- Commercial detection (existing software, needs to be integrated with the new file format)

- Frame annotation (we would like to switch to PathLSTM)

- Sentiment annotation

- Time expressions

- Possibly: coreference resolution

For this task, you should

- be good at getting software to run in Linux, which will include Bash scripting, compiling with weird tools, dependency management, ...

- be good at transforming data from one textual format to another (tables, XML, JSON, proprietary formats); a minimal example of this kind of transformation is sketched after this list.

- ideally be able to modify simple I/O code in various programming languages (Python, Perl, Java)
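
As a minimal illustration of the kind of format transformation involved (the pipe-separated input layout shown is an assumption for this example, not a specification of Red Hen's actual file format):

    # Convert pipe-separated caption lines such as
    #   20180101120000.000|20180101120007.000|CC1|SOME CAPTION TEXT
    # into JSON Lines. The input layout is an assumption for this sketch only.
    import json
    import sys

    def convert(in_path, out_path):
        with open(in_path, encoding="utf-8") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                parts = line.rstrip("\n").split("|")
                if len(parts) < 4:
                    continue  # skip header or malformed lines
                start, end, tag = parts[0], parts[1], parts[2]
                text = "|".join(parts[3:])
                dst.write(json.dumps({"start": start, "end": end, "tag": tag,
                                      "text": text}, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        convert(sys.argv[1], sys.argv[2])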

2. Semantic Art from Big Data

Mentored by Heiko Schuldt and Francis Steen

Vast collections of multimodal data are becoming raw materials for artistic expressions and visualizations that are both informative and esthetically appealing. Red Hen is collaborating with vitrivr (https://vitrivr.org) to develop an application for semantically meaningful large-scale visualizations of multimodal data. The tools will support visualizations along a range of scalar dimensions and arrays, utilizing Red Hen's deep analysis of hundreds of thousands of hours of news videos: clusters of event categories over time, the distribution of emotions across nations and networks, the emotional intensity of a single event cascading through the international news landscape. Red Hen is also interested in making these visualizations serve as browsing tools for exploring large collections of images and videos in novel and creative ways.

For examples of Red Hen big data visualizations, see the Viz2016 project, which provides visualizations of some dimensions of US Presidential elections.

Successful applicants for this task will familiarize themselves with the vitrivr stack, including Cineast, a multi-feature content-based multimedia retrieval engine. Java and web programming skills are required.

3. Chinese Pipeline

Mentored by Weixin Li, Yao Tong, and Kai Chan

Red Hen has recently begun acquiring massive audiovisual data in Chinese and wants both to extend that collection and to add other kinds of Chinese data (text, audio). This task includes developing tools for tagging, parsing, annotating, analyzing, searching, etc. the Chinese data. Red Hen now directs a new Center for Cognitive Science at Hunan Normal University dedicated to this project and to related work on multimodal communication. Areas of work in this project might include:

    1. Extracting captions
      1. For OCR, Google Tesseract, in collaboration with CCExtractor
    2. NLP of various kinds: word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, etc. (a minimal segmentation sketch follows this list). Resources known to Red Hen include:
      1. A curated list of resources for Chinese NLP
      2. NLPIR/ICTCLAS Chinese segmentation software: a python wrapper is available at https://github.com/tsroten/pynlpir
      3. FudanNLP toolkit: https://github.com/FudanNLP/fnlp
      4. Stanford NLP toolkit's Chinese module: https://nlp.stanford.edu/projects/chinese-nlp.shtml
      5. For speech-to-text, many speech recognition packages, such as CMUSphinx and Baidu Deep Speech, which is based on TensorFlow
    3. Forced alignment of Chinese
    4. Multi-program transport stream splitting for Joker-TV (collaboration with Abylay Ospan)
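
For the NLP subtask, a minimal word-segmentation sketch using the pynlpir wrapper listed above might look as follows; it assumes the NLPIR data and licence files are installed, and the sample sentence is arbitrary:

    # Minimal Chinese word segmentation and POS tagging sketch using pynlpir
    # (https://github.com/tsroten/pynlpir). Assumes the NLPIR data and licence
    # files are installed; the sample sentence is arbitrary.
    import pynlpir

    pynlpir.open()                      # initialise the underlying NLPIR/ICTCLAS engine
    sentence = "红母鸡实验室研究多模态交际。"
    for word, pos in pynlpir.segment(sentence, pos_tagging=True):
        print(word, pos)
    pynlpir.close()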

Red Hen is collaborating with CCExtractor on text extraction and OCR; successful candidates will have a mentor from both organizations.

4. Emotion Detection and Characterization

Mentored by Mehul Bhatt and Vera Tobin

Develop and deploy emotion-detection tools in language, voice qualities, gestures, and/or facial expressions to achieve a more complex, nuanced, and integrated characterization of emotions. It will be useful to focus on a subset of emotions; the system should be constructed so that it can be extended.

The components may include natural language processing tools, audio frequency analysis, and/or deep learning techniques. The API should be a Python script that specifies the audio/video and text input conditions and produces output as JSON Lines annotations.
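
As an illustration of the intended output format only (the field names and the emotion label set are hypothetical), each detected segment could be emitted as one JSON Lines record:

    # Illustrative only: one JSON Lines record per detected emotion segment.
    # Field names and the emotion label set are hypothetical.
    import json

    record = {
        "start": 12.4,            # seconds from start of recording
        "end": 15.9,
        "modality": "audio+video",
        "emotion": "anger",
        "confidence": 0.82,
    }
    print(json.dumps(record, ensure_ascii=False))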

This project extends the project initiated during GSoC 2017; see Multimodal Emotion Detection on Videos using CNN-RNN.

5. System Integration of Existing Tools Into a New Multimodal Pipeline

Mentored by Shruti Gullapuram and Abhinav Shukla

Red Hen is integrating multiple separate processing pipelines into a single new multimodal pipeline. Orchestrating the processing of hundreds of thousands of videos on a high-performance computing cluster along multiple dimensions is a challenging design task. The winning design for this task will be flexible, but at the same time make efficient use of CPU cycles and file accesses, so that it can scale. Pipelines to be integrated include:

    1. Shot detection
    2. Commercial detection
    3. Speaker recognition
    4. Frame annotation (for English)
    5. Text and story segmentation
    6. Sentiment analysis
    7. Emotion detection
    8. Gesture detection

This infrastructure task requires familiarity with Linux, bash scripting, and a range of programming languages such as Java, Python, and Perl, used in the different modules.
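
As a deliberately simplified sketch of the orchestration idea (the module file names and the per-module command-line convention are invented for this illustration), one might chain the detectors listed above so each video is handled in a single pass:

    # Simplified orchestration sketch: run each processing module once per video.
    # Module names and the per-module command-line convention are hypothetical.
    import subprocess
    import sys

    MODULES = [
        "shot_detection.py",
        "commercial_detection.py",
        "speaker_recognition.py",
        "sentiment_analysis.py",
    ]

    def process(video_path):
        for module in MODULES:
            # Each module is assumed to read the video once and append its own
            # annotations to a sidecar file next to the video.
            subprocess.run([sys.executable, module, video_path], check=True)

    if __name__ == "__main__":
        process(sys.argv[1])

A real design would additionally have to schedule these jobs across cluster nodes and avoid re-reading the same file for every module, which is exactly the efficiency question this task poses.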

6. Semantic Annotation Tool

Mentored by John P. Bell, Mark Williams, Heiko Schuldt, Mehul Bhatt, and Francis Steen

Red Hen provides an integrated research workflow, from manual annotation to machine learning and data visualization. The Semantic Annotation Tool (SAT) is a next-generation annotation tool developed by Red Hen's collaborator The Media Ecology Project. SAT is a jQuery plugin and Ruby on Rails server that adds an annotation interface to HTML5 videos. For machine learning, it is essential that semantic annotations be spatially located within the picture frames of the video, so that the algorithms focus on the correct features. SAT supports associating tags and text bodies with time- and geometry-delimited media fragments using W3C Web Annotation ontologies. One limitation of the current tool, however, and of the Web Annotation spec more generally, is that there is no support for moving a geometric annotation target within a frame over time. For example, a baseball thrown from the left side of the frame to the right would force an annotator to choose whether they want their annotation to target the ball's location on the first frame it appears, the last frame, or even the entire path of the ball. No matter what they decide, the annotation target will necessarily be inaccurate.

Red Hen wants to add the ability to tween geometric annotation targets over time. These areas are currently defined as a single array of points. The new feature would redefine geometric targets to include multiple arrays of points for the starting location, the ending location, and an arbitrary number of keyframes; add interface tools to the jQuery plugin that allow all of these locations to be entered by a user; add support for graphically tweening the geometric area in sync with playback of the video; and extend the current data API (client and server) to support the new geometric data format.
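
To make the intended data model concrete, here is a hedged sketch, in Python rather than the plugin's JavaScript, of linearly interpolating a polygonal annotation target between two keyframes; the keyframe structure shown is illustrative, not the SAT or W3C Web Annotation format:

    # Linear tweening of a polygonal annotation target between two keyframes.
    # The keyframe structure is illustrative, not the SAT / W3C Web Annotation format.

    def tween(points_a, points_b, t):
        """Interpolate two equal-length point lists at t in [0, 1]."""
        return [((1 - t) * xa + t * xb, (1 - t) * ya + t * yb)
                for (xa, ya), (xb, yb) in zip(points_a, points_b)]

    # Ball annotated on the left at 10.0 s and on the right at 12.0 s:
    key_start = {"time": 10.0, "points": [(50, 200), (70, 200), (70, 220), (50, 220)]}
    key_end   = {"time": 12.0, "points": [(500, 180), (520, 180), (520, 200), (500, 200)]}

    def target_at(time):
        t = (time - key_start["time"]) / (key_end["time"] - key_start["time"])
        return tween(key_start["points"], key_end["points"], min(max(t, 0.0), 1.0))

    print(target_at(11.0))   # polygon halfway along the ball's path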

Successful applicants should have a strong background in the jQuery javascript library and familiarize themselves with the current codebase of both the client and the server.

7. Development of a Query Interface for Parsed Data

Mentored by Peter Uhrig's team

This infrastructure task is to create a new and improved version of a graphical user interface for graph-based search on dependency-annotated data. The new version should have all functionality provided by the prototype plus a set of new features. The back-end is already in place.

Develop current functionality:

      • add nodes to the query graph
      • offer choice of dependency relation, PoS/word class based on the configuration in the database (the database is already there)
      • allow for use of a hierarchy of dependencies (if supported by the grammatical model)
      • allow for word/lemma search
      • allow one node to be a "collo-item" (i.e. collocate or collexeme in a collostructional analysis)
      • color nodes based on a finite list of colors
      • paginate results
      • export xls of collo-items
      • create a JSON object that represents the query to pass it on to the back-end (a hypothetical example follows this list)
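
As a purely hypothetical illustration of such a query object (all field names are invented for this sketch; the actual schema is defined by the existing back-end), a search for a verb governing a nominal object might be represented as:

    # Hypothetical query-graph object for the graph-based search back-end.
    # All field names are invented for this sketch; the real schema is
    # defined by the existing back-end.
    import json

    query = {
        "nodes": [
            {"id": 1, "pos": "VERB", "lemma": None, "collo_item": False},
            {"id": 2, "pos": "NOUN", "lemma": None, "collo_item": True},
        ],
        "edges": [
            {"from": 1, "to": 2, "relation": "dobj"},
        ],
    }
    print(json.dumps(query, indent=2))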

Develop new functionality:

      • allow for removal of nodes
      • allow for query graphs that are not trees
      • allow for specification of the order of the elements
      • pagination of search results should be possible even if several browser windows or tabs are open.
      • configurable export to csv for use with R
      • compatibility with all major Web Browsers (IE, Firefox, Chrome, Safari) [currently, IE is not supported]
      • parse of example sentence can be used as the basis of a query ("query by example")

Steps:

    1. Visit http://www.treebank.info and play around with the interface (user: gsoc2018, password: redhen) [taz is a German corpus, the other two are English]
    2. In consultation with Red Hen, decide on a suitable JavaScript Framework, such as reactJS paired with jQuery
    3. Think about HTML representation. Red Hen probably prefers HTML5/CSS3, but it is unclear whether its requirements can be met without major work on <canvas>, or whether sensible widgets are possible without digging into the <canvas> tag.

Contact Peter Uhrig <peter.uhrig@fau.de> to discuss details or to ask for clarification on any point.

8. Opening the Digital Silo: Multimodal Show Segmentation

Mentored by Anna Bonazzi, Tim Groeling, Kai Chan, and Luca Rossetto. Read more about the project here.

Libraries and research institutions in the humanities and social sciences often have large collections of legacy video tape recordings that they have digitized but cannot usefully access -- this is known as the "digital silo" problem. Red Hen is working with several university libraries on this problem, and several of this year's ideas contribute to the solution. A basic task Red Hen needs to solve is television program segmentation. The UCLA Library, for instance, is digitizing its back catalog of several hundred thousand hours of news recordings, from the Watergate Hearings in 1973 to 2006. Each digitized file contains eight hours of programming that must be segmented at the natural program boundaries.

Red Hen welcomes proposals for a segmentation pipeline. An optimal approach is to use a combination of text, audio, and visual cues to detect the show and episode boundaries. Your project should assemble multiple cues associated with these boundaries, from recurring phrases, theme music, and opening visual sequences, and then develop robust statistical methods to locate the most probable spot where one show ends and another begins. Red Hen is open to your suggestions for how to solve this challenge.
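
One hedged way to think about the statistical combination (the cue names, weights, and threshold below are invented for the illustration) is to score each second of a recording with a weighted sum of per-cue boundary evidence and then pick local peaks:

    # Sketch: combine per-second boundary evidence from several cues and pick peaks.
    # Cue names, weights, and the threshold are invented for this illustration.

    WEIGHTS = {"recurring_phrase": 0.4, "theme_music": 0.4, "opening_visuals": 0.2}

    def boundary_scores(cues):
        """cues: dict mapping cue name -> list of per-second scores in [0, 1]."""
        length = len(next(iter(cues.values())))
        return [sum(WEIGHTS[name] * cues[name][i] for name in WEIGHTS)
                for i in range(length)]

    def pick_boundaries(scores, threshold=0.6):
        """Return the indices (seconds) of local maxima above the threshold."""
        return [i for i in range(1, len(scores) - 1)
                if scores[i] >= threshold
                and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]]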

Red Hen is open to developing segmentation tools in collaboration with vitrivr (https://vitrivr.org), which already contains functionality to search for a frame within a video, a sequence of music within a sound track, and clusters of words in a text. Successful proposals for this infrastructure project will use all three modalities to triangulate the optimal video boundaries.

Find further info here.

9. Explainable Visual Perception

Mentored by Mehul Bhatt and Jakob Suchan

Explanatory reasoning in general is one of the hallmarks of general human reasoning ability; robust explainable visual perception particularly stands out as a foundational functional capability within the human visuo-spatial perception faculty. This project will address a notion of explainability that is driven by the ability to support (computational) commonsense, semantic question-answering over dynamic visuo-spatial imagery within a declarative KR setting (for details, refer to the following publication: "Visual Explanation by High-Level Abduction: On Answer-Set Programming Driven Reasoning about Moving Objects", AAAI 2018). Methods: Deep learning, Statistical Relational Learning, Constraints.

10. Tools for Transformation: Big Data Research on Worldwide Communication about the Future of Humankind

Mentored by Mark Turner and Anna Pleshakova

President Xi Jinping of China has observed, "Since ancient times, peaceful development has been a shared goal of mankind. Today's world is filled with uncertainties. People have hopes for the future but at the same time feel perplexed. Some lands once prosperous and bustling are now synonymous with difficulty, conflict, and crisis." He refers to "the shared dream of people worldwide for peace and development," and urges leaders to "build a community that sees a shared future for mankind." (https://www.youtube.com/watch?v=hNKTbMx8PFk ).

There are many causes of conflict and community. One of the most powerful lies in the ways that people think about and communicate about the future. Human beings are a very unusual species—their thoughts can arch over vast and extended patterns of causation, agency, time, and space. How can they do so? What mental operations and communicative structures are available to them for doing so? How do those operations and structures influence our ability to think about the future? What are the limits and constraints that come from those ways of thinking and communicating, and how should Red Hen deal with those mental and communicative limits?

Many researchers — in the humanities and social sciences, in linguistics and media studies, in politics and economics — have studied aspects of how human beings think and talk about the future. But just lately, this research has entered a new phase. First, electronic media have made it possible for mass communication to flow constantly around the world, in news reports, social media, and audiovisual recordings. And this mass communication is constantly focused on the future. Second, big data science has developed computational, statistical, and technical tools that make it possible, through new methods never previously imagined, to find patterns and test hypotheses in this communication. Research into human multimodal communication can now happen on a scale that is completely unprecedented.

One of the most powerful ways that human beings have of conceiving of futures and communicating about them is by blending thoughts about the future with thoughts about the past. The past is drawn on selectively to help constitute a blended conception of the future. Communicative forms that are used to express ideas about the past are repurposed to express elements of this conception of the future. Accordingly, the study of thinking about the future and communicating about the future includes studying the way people think and talk about the past, and, most important, studying how people make conceptual blends that draw on the past to imagine futures and to talk about them.

This project involves developing tools for tagging and detecting communicative patterns for representing the future of the species, distinguishing what is shared from what is not, analyzing global variation using the statistical package R, developing visualization and prediction tools for trends, and so on. If you are interested, write to Mark Turner at redhenlab@gmail.com.

11. Multimodal CQPweb

Mentored by Peter Uhrig's team

CQPweb (http://cwb.sourceforge.net/cqpweb.php) is a web-based corpus analysis system used by linguists. Red Hen is involved in extending its capabilities to handle multimodal forms of data. Red Hen is open to proposals that accomplish the following tasks:

    1. Phonetic search
    2. Integration with EMU
    3. Menu-based search assistance for gesture search, shot detection, speaker, etc.
    4. Direct integration of video player

Successful applicants for this infrastructure task will familiarize themselves with the existing codebase.

12. Forced Alignment

Mentored by Francis Steen, Peter Uhrig, and Kai Chan

The Red Hen Lab dataset contains television news transcripts in multiple languages, including American English, British English, Danish, French, Italian, German, Norwegian, European and Brazilian Portuguese, Russian, European and Mexican Spanish, and Swedish. These transcripts are timestamped, but the timestamps are delayed by a variable number of seconds relative to the audio and video. To bring them into alignment, we have so far used the Gentle aligner to align English-language text. Last spring, the Montreal Forced Aligner (MFA) was released, with pre-trained acoustic models for 21 languages.

The task is to create an automated pipeline on a high-performance computing cluster that deploys MFA on the languages present in the Red Hen dataset, performs quality checks, and improves the quality of the alignment for hundreds of thousands of hours of transcripts. For parallel corpora, see Europarl.
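
As a hedged sketch of one step in such a pipeline, the alignment output could be used to estimate the caption delay and shift the transcript timestamps; the alignment is represented here as a plain list of word timings, whereas in practice it would be parsed from the aligner's output files:

    # Sketch: estimate the caption delay from forced-alignment word timings and
    # shift the caption timestamps accordingly. The alignment is passed in here
    # as a plain list of (word, start_seconds) pairs; in practice it would be
    # parsed from the aligner's output.
    from statistics import median

    def estimate_offset(caption_word_times, aligned_word_times):
        """Both arguments: lists of (word, start_seconds), in order of appearance."""
        deltas = [a_start - c_start
                  for (c_word, c_start), (a_word, a_start)
                  in zip(caption_word_times, aligned_word_times)
                  if c_word.lower() == a_word.lower()]
        return median(deltas) if deltas else 0.0

    def shift_captions(captions, offset):
        """captions: list of dicts with 'start' and 'end' in seconds."""
        return [{**c, "start": c["start"] + offset, "end": c["end"] + offset}
                for c in captions]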

For background, see http://linguistics.berkeley.edu/plab/guestwiki/index.php?title=Forced_alignment. Improving the quality of alignment will involve developing methods for cleaning up non-speech content in the transcripts, adding words to a local dictionary, and evaluating accuracy.

Successful applicants will familiarize themselves with the Montreal Forced Aligner, which already has support for multiple languages and is built on the Kaldi toolkit. Mentors will be available for each language.

13. Multi-speaker speech-to-text

Current speech-to-text tools are excellent when trained to a specific speaker, but poor with untrained voices. Our audio pipeline, built by GSoC students, provides efficient methods of forced alignment in English (see also project idea 12 above) and speaker recognition. We welcome proposals for projects that extend these capabilities into a system for semi-supervised or automated speaker training for multi-speaker speech-to-text.

It would be possible, for instance, to develop a system that uses shows with good transcripts and named speakers to automate the training of classifiers for speaker recognition; this could be used to improve the speech-to-text. Existing Red Hen tools for speaker diarization and automated visual recognition of faces could be combined to improve results in shows with no transcripts. Another interesting cross-modal option is to combine computer vision lip reading with audio signal processing.

In GSoC 2017, Divesh Pandey completed a project on a Speech Recognition System based on Deep Speech, see report and github. We invite proposals that extend and improve this code.

We also invite a production implementation of the ESPnet system on our Case Western Reserve University High-Performance Computing Cluster.

Red Hen is also interested in approaches that make use of imperfect transcripts (e.g., Driven Decoding). This approach can be used to improve current transcripts and possibly to provide acoustic frames for speech recordings without transcripts.

14. Red Hen Rapid Annotator

Mentored by Peter Uhrig, Vera Tobin, and Kai Chan

This task is aimed at extending the Red Hen Rapid Annotator. The following subtasks can be identified:

- Create an administrative interface for

  - setting up the experiment

  - monitoring progress

  - harvesting the results

- Allow new sorts of data

  - text

  - audio

- Allow for textual cues on non-text data (i.e. usually some sort of transcript and possibly an automatically determined annotation)

- Integrate with CQPweb, Google Docs? (and treebank.info?)

- Allow for multiple levels of annotation on the same piece of data

- Add "Undo" feature and/or personal history

15. Augmented Reality: Android-based Mobile App for Linking Photos and Videos to the Real World

Mentored by Luca Rossetto and Peter Broadwell

Red Hen is collaborating with vitrivr (https://vitrivr.org) to develop a native Android-based app that will enhance your experience of reality, whether you are visiting a neighborhood in your own city or a historical site. By pointing your camera at a building or street scene, you will be able to call up recent news videos, historical photographs, and works of art relevant to this location.

Successful applicants for this task will familiarize themselves with the vitrivr stack, including Cineast, a multi-feature content-based multimedia retrieval engine. The languages used are Java and Kotlin.

16. Constructions for Epistemic Stance

Mentored by Jakob Suchan and Inés Olza

Multimodal registers of communication often function to strengthen the credibility of, or cast doubt on, what is being said, or to qualify what is being said as fantastic, wildly improbable, or merely slightly implausible. A proposal should focus on specific examples of evidentiary qualifications, but be built in such a way that it can be extended.

The components may include linguistic elements, tone of voice, eye direction and head direction (such as a side eye), and gestures. The aim is to show how new meanings emerge from the combination of features. For some examples, see this video with comments.

Red Hen will provide an annotated dataset of multimodal epistemic stance constructions in the Red Hen dataset. Successful applicants will be familiar with deep learning tools such as Caffe, Scikit-Learn, Scikit-Flow, or TensorFlow.

17. Multimodal Rhetoric of Climate Change

Mentored by Vera Tobin and Francis Steen

The science of climate change has become a politically highly contentious issue, with significant resources being devoted to convincing audiences that this is or is not a real and serious phenomenon. Red Hen is open to innovative proposals for approaching the rhetoric of climate change. What are the communicative strategies that are being utilized for effective persuasion? Red Hen asks for proposals for identifying strategies of persuasion, characterizing them, and locating them in massive multimodal datasets.

18. Multimodal Egocentric Perception (with video, audio, eyetracking data)

Mentored by Mehul Bhatt, Jakob Suchan

This project will focus on multimodal visual and auditory cue perception in embodied interaction scenarios in social and professional contexts (everyday life, work, education etc). A large-scale dataset (involving egocentric video, audio, and eye-tracking) will be made available for the project. Programming proposals are open with respect to methods adopted. We ask for innovative ways to make sense of the multimodal egocentric data at the interface of language, logic, and cognition.

19. Kaldi-based Automatic Speech Recognition

Mentored by Karan Singla

This project will focus on building a large-scale and robust Kaldi-based ASR system with performance comparable to the state of the art. It will include using and modifying existing Kaldi recipes/scripts and benchmarking on TV-news corpora for different evaluation metrics, mainly Word Error Rate (WER). We also plan to add augmented noisy data at different SNRs (signal-to-noise ratios) to make the ASR more robust for different environments.
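
As a small illustration of the noise-augmentation step (a sketch only, assuming mono floating-point numpy arrays that are already loaded), noise can be rescaled so the mixture hits a target SNR:

    # Mix noise into a clean signal at a target signal-to-noise ratio (in dB).
    # Sketch only; assumes mono floating-point numpy arrays.
    import numpy as np

    def mix_at_snr(clean, noise, snr_db):
        clean = clean.astype(np.float64)
        noise = np.resize(noise, clean.shape).astype(np.float64)
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        # Scale the noise so that 10 * log10(p_clean / p_scaled_noise) == snr_db
        scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
        return clean + scale * noise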

Skills required: familiarity with the Kaldi environment, bash, and some scripting language (preferably Python). The project will involve handling large chunks of audio and text data.

20. Cockpit - Red Hen Monitoring System

Mentored by José Fonseca and Francis Steen

Currently, Red Hen has more than a dozen remote capture stations around the world that send their data to a central repository. The growing number of capture stations dramatically increases the complexity of managing them from a central entity. The stations should be online, able to record the media signals based on a schedule, and able to send the recordings, or allow them to be downloaded, to a central repository. If any of these tasks fails, the local person responsible for that station must be warned in order to fix the problem.

This project proposes to automate the task of sensing the health of the stations and taking appropriate actions according to the problems detected. Some other routine operations may also be performed, such as automatic backup generation and configuration of new stations. The project should provide a responsive dashboard, called Cockpit, that lets the central administrator monitor the capture stations, obtain uptime statistics, and take corrective actions from a web browser or a smartphone.
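
A deliberately minimal sketch of the health-sensing idea (the station names, heartbeat convention, and warning step are all hypothetical) could poll the last reported heartbeat of each station and flag the silent ones:

    # Minimal health-sensing sketch for capture stations. The station list,
    # heartbeat convention, and warning step are hypothetical.
    import time

    # station name -> unix timestamp of its last reported heartbeat
    last_heartbeat = {
        "station-ucla": time.time() - 120,
        "station-erlangen": time.time() - 7200,
    }

    MAX_SILENCE_SECONDS = 3600  # warn if no heartbeat for an hour

    def check_stations(heartbeats):
        now = time.time()
        problems = {name: now - ts for name, ts in heartbeats.items()
                    if now - ts > MAX_SILENCE_SECONDS}
        for name, silence in problems.items():
            print(f"WARNING: {name} silent for {silence / 60:.0f} minutes")
        return problems

    check_stations(last_heartbeat)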

21. Blended Conversation: Recognition of Elements of Blended Classic Joint Attention Using Machine Learning

See Blended Classic Joint Attention for further details.

22. Russian tickertape OCR

Mentored by Abhinav Shukla, Carlos Fernandez, and Anna Pleshakova

The Russian television network NTV uses a tickertape style caption -- similar to how breaking news or stock quotes are sometimes shown, traveling from right to left across the bottom of the screen. We are working with CCExtractor on this task. We have made progress on the basic OCR task (see the CCExtractor code), but this process leaves the text in fragments.

Because the OCR process is itself error-prone, the task of deduplicating the overlapping parts of the captured tickertape OCR is more complicated than the Levenshtein distance we have used in other contexts to solve a deduplication task. First of all, Tesseract's Russian OCR has a 10% higher error rate than its English OCR (cf. https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf), even assuming the word images that we pass to it are perfect. Secondly, in the tickertape, a character may be rolling off the screen during image capture, resulting in character fragments that confuse Tesseract. These combined effects make it harder to stitch together the actual text moving across the bottom of the screen.
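
One hedged way to approach the stitching problem (a sketch only; the scoring and threshold are illustrative and a production version would need to be far more robust to OCR errors) is to look for the best fuzzy overlap between the tail of the text assembled so far and the head of each new fragment:

    # Sketch: stitch overlapping tickertape OCR fragments by finding the best
    # fuzzy suffix/prefix overlap. Scoring and threshold are illustrative only.
    from difflib import SequenceMatcher

    def best_overlap(assembled, fragment, min_overlap=5):
        best_len, best_score = 0, 0.0
        for k in range(min_overlap, min(len(assembled), len(fragment)) + 1):
            score = SequenceMatcher(None, assembled[-k:], fragment[:k]).ratio()
            if score > best_score:
                best_len, best_score = k, score
        return best_len if best_score > 0.8 else 0

    def stitch(fragments):
        text = ""
        for frag in fragments:
            k = best_overlap(text, frag) if text else 0
            text += frag[k:]
        return text

    print(stitch(["ПРЕЗИДЕНТ ЗАЯВ", "НТ ЗАЯВИЛ, ЧТО", "ИЛ, ЧТО ПЕРЕГОВОРЫ"]))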

We are looking for innovative solutions to this problem and will provide excellent mentorship from Russian language experts and the people who wrote the existing code.
