Google Summer of Code 2015 Ideas Page

Please note that this GSoC 2015 ideas page is no longer current. See our GSoC 2016 ideas page.

About us

What is Red Hen?

Red Hen is an international consortium for research on multimodal communication. We are developing open-source tools for joint parsing of text, audio/speech, and video, using a very large international dataset of television news.

Who is behind Red Hen?

Faculty, staff, and students at several universities around the world, including UCLA, FAU Erlangen, Case Western Reserve University, Centro Federal de Educação Tecnológica in Rio, Oxford University, Universidad de Navarra, University of Southern Denmark, and more. You can get more details here.

Exactly what software is used in Red Hen?

Red Hen uses 100% open source software. In fact, not just the software but everything else—including recording nodes—is shared in the consortium.

Among other tools, we use CCExtractor, ffmpeg, and OpenCV (they have all been part of GSoC in the past). Since we depend completely on established open source software, there are many opportunities for cross-project collaboration. We are happy to see our students submit ideas to improve any of the programs on which we rely.

Of course, we also have our own software; all of it is also available, even if some parts are very specific to our work at this point.

Who uses Red Hen's infrastructure? Can I have access to it?

Section 108 of the U.S. Copyright Act permits Red Hen, as a library archive, to record news broadcasts from all over the world and to loan recorded materials to researchers engaged in projects monitored by the Red Hen directors and senior personnel. Section 108 restrictions apply only to the corpus of recordings, not to the software. Because everything we do is open source, anyone can replicate our infrastructure.

Participants in the Summer of Code will have full access to the main NewsScape dataset at UCLA and other datasets that have been added to Red Hen, and applicants have full access to sample datasets.

During the GSoC application period, Red Hen is opening the archive (see credentials for access immediately below) so students who are considering joining Red Hen can check out what we do and decide whether they would like to spend the summer working with us.


User: gsoc

Password: ideaspage

What's the Red Hen Corpus?

The Corpus is a huge archive of TV programming. The stats as of February 2015 are:

Total networks: 38

Total series: 2,283

Total duration in hours: 237,321

Total metadata files (CC, OCR, TPT): 633,496

Total words in metadata files (CC, OCR, TPT): 2.95 billion

Total caption files: 305,287

Total words in caption files: 1.95 billion

Total OCR files: 298,899

Total TPT files: 29,310

Total words in OCR files: 652.73 million

Total words in TPT files: 347.12 million

Total video files: 305,137

Total thumbnail images: 85,435,389

Storage used for core data: 75.26 terabytes

General considerations

  • We're interested in novel and innovative solutions to a wide range of text, speech, and image parsing tasks; what we suggest below are just starter concepts. We invite you to expand on them considerably to make them your own, or to suggest ideas we didn't think of.
  • Some of the ideas below spell out tasks that cannot possibly be fully implemented in the space of one summer, but you can make an honorable start on any of them. We will work with you to define just the right level of challenge for you; this is an important part of the whole process.
  • You are free to use any open source tool for any task, or to write your own code. If you improve something that is not part of Red Hen, you are encouraged to submit your work to the official maintainers. Build on the work of others and prepare your work so that others can build upon it.
  • If you are interested and have questions, the sooner you contact us the better. We want to help you prepare a great proposal. Don't be shy.
  • You can find links to source code next to the ideas where relevant. Because each organization and university contributes a part, the repository is not yet centralized; an index page where all source code can easily be found is being prepared.
  • See the Personal Software Process (PSP) (details).

Contacting us

All mentors are available by email. Please start the subject line with "GSoC-student", your name, and the proposed topic so we can prioritize GSoC emails. We reply as soon as possible.

During the student selection stage (when you are looking for the organization that best matches your interests), someone will be available via IRC as well, on the channel #redhenlab on freenode.

During the actual GSoC, the IRC channel will remain open (students are encouraged to join and talk with mentors and fellow students). Mentors will be available on Skype and/or Google Hangouts. Red Hen has at its disposal a variety of videoconferencing systems, including Cisco Telepresence and Cisco Jabber, Scopia, WebEx, and Adobe Connect. It is called the "Distributed Little Red Hen Lab" because we operate across many nations and many time zones, often running lab meetings through multipoint videoconferencing. You can also connect with us via LinkedIn; when requesting the connection, please mention GSoC.

Sister project: CCExtractor

One of the tools we use in Red Hen is CCExtractor. It's a tool that takes media files and produces a transcript of the subtitles. We use the output of CCExtractor to 'follow the chain' and start the language analysis.

It turns out that CCExtractor is also in Summer of Code. We encourage students to also check their ideas page. We share some resources (a couple of mentors and even hardware) with them, so there are many opportunities for connections between the two projects.

Project ideas

We have separated the ideas into audio analysis, image and video analysis, and text and language analysis, but there are many interesting tasks that benefit from joint parsing and we encourage innovative proposals.

Don't let the range or scope of ideas overwhelm you. We have many things to do because Red Hen is a large organization with an ambitious agenda.

Be honest about what you think you will be able to accomplish over the summer. We are aware that some of the ideas may take more than a month, and that not all of them will be completed over the summer.

Remember that as a student you are not only allowed but encouraged to bring your own ideas. The most important thing is that you are passionate about what you are going to do over the summer.

Media analysis (any combination of audio, video and subtitles)

Audio analysis

We use ffmpeg to extract the AAC audio track from the video and convert it to WAV format, and various tools to analyze it.
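The extraction step above can be sketched as follows. This is an illustrative sketch, not Red Hen's actual pipeline code; the filenames are placeholders, and the 16 kHz mono target is an assumption (a common input format for speech tools).

```python
# Build the ffmpeg command line that pulls the audio track out of a
# recording and converts it to mono 16 kHz 16-bit WAV.
import subprocess

def extract_audio_cmd(video_path, wav_path, sample_rate=16000):
    """Return the ffmpeg argument list for audio extraction."""
    return [
        "ffmpeg",
        "-i", video_path,          # input recording (e.g. MPEG-TS with AAC audio)
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # decode to 16-bit linear PCM
        "-ar", str(sample_rate),   # resample (16 kHz is typical for speech tools)
        "-ac", "1",                # downmix to mono
        wav_path,
    ]

cmd = extract_audio_cmd("news.mp4", "news.wav")
# subprocess.run(cmd, check=True)  # uncomment on a machine with ffmpeg installed
```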


Forced alignment of English transcript and audio in a multi-speaker environment (bonus for improving the caption text first)

Newscast captions and transcripts are imperfect and typically lag the audio stream of speech by a variable number of seconds. Forced aligners synchronize the transcript to the audio.

Expected outcome: Produce synchronized transcripts for a sample dataset.
Starting points: kaldi, sail_align, Prosodylab-Aligner, and Praat. The challenge is to generate consistently good results at a reasonable speed from incomplete captions and in a multi-speaker environment.
Mentors: Inés, Mark, Francis
Difficulty: Medium to hard

Forced alignment of Spanish, German, or French transcript and audio

As for English above. Expected outcome: Produce synchronized transcripts for a sample dataset.
Mentors: Inés, Javier
Difficulty: Medium to hard

Forced alignment of Norwegian, Danish, or Swedish transcript and audio

As for English above. Expected outcome: Produce synchronized transcripts for a sample dataset.
Difficulty: Medium to hard

Detect non-vocal audio features

Detect non-voice features like ambient noise, silence, music, explosions, cars, weather. Consider which non-voice features are informative for news audiences, even if audiences are not fully conscious of hearing them.
Expected outcome: Annotate the text with the detected audio features.
Mentors: Carlos / Cristobal
Difficulty: Easy to medium

Detect vocal features

Detect audible breathing, sighs, laughter, shouting. Low-level vocal features can be significant in themselves or play a role in higher-level characterization.
Expected outcome: Annotate the text with the detected audio features.
Mentors: Carlos / Cristobal

Identify male and female speakers

Use audio feature extraction techniques to identify male and female speakers.
Expected outcome: Identify the gender of speakers in a relatively straightforward sample dataset.
Starting points: the pitch detection in Praat.
Mentors: Peter B / Cristobal
Difficulty: Easy to medium

Speaker diarization (turn-taking)

Use audio feature extraction techniques to identify when one speaker stops speaking and another starts to speak. There will often not be a pause between speakers, and there may be cross-talk.
Expected outcome: Annotate speaker turns.
Starting points: see the overview; SHoUT, LIUM, and idiap are possible starting points.
Mentors: Peter B / Cristobal
Difficulty: Medium to hard

Characterize tone of voice

Characterize voice qualities such as commanding, seductive, joking, or ironic. Start with something tractable, such as the "decisive" or authoritative voice common in political speeches.
Expected outcome: Annotate the transcript with the detected voice characterization.
Mentors: Carlos / Cristobal
Difficulty: Medium to hard

Audio feature extraction: emotions

Detect expressions of emotion: anger, sadness, exasperation, tiredness, happiness.
Expected outcome: Annotate the transcript with the detected emotion.
Starting points: openSMILE and openEAR. Start with a single person expressing a range of different emotions. Cf. the sample dataset.
Mentors: Carlos / Cristobal / Kai
Difficulty: Medium to hard

Speaker identification

Select a limited number of recurring speakers and identify them from their voices.
Expected outcome: A short list of recognized speakers (major news anchors and politicians) and a framework for expanding the list.
Starting points: bob.spear and ALIZE.
Mentors: Carlos / Cristobal
Difficulty: Medium to hard

Improve caption text

Use audio analysis to improve the caption text. The tools used in semafor may be helpful; we are not looking for a full-fledged speech-to-text application.
Expected outcome: Detect errors in the caption text, such as misspelled or missing words.
Mentors: Carlos / Cristobal / Peter B
Difficulty: Medium to hard
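To make the forced-alignment idea concrete, here is a toy illustration of just the matching step: given caption words and word-level timestamps from a speech recognizer, align the two sequences and carry the recognizer's timings over to the caption. Real aligners such as kaldi or Prosodylab-Aligner work acoustically; the recognizer output and timings here are invented sample data.

```python
# Align caption words to ASR words and transfer the ASR timestamps.
from difflib import SequenceMatcher

def align_captions(caption_words, asr_words, asr_times):
    """Return (word, start_time) pairs for caption words matched in the ASR output."""
    matcher = SequenceMatcher(a=[w.lower() for w in caption_words],
                              b=[w.lower() for w in asr_words])
    aligned = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned.append((caption_words[block.a + k], asr_times[block.b + k]))
    return aligned

captions = ["good", "evening", "from", "washington"]
asr      = ["good", "evening", "uh", "from", "washington"]  # ASR heard a filler word
times    = [0.0, 0.4, 1.1, 1.3, 1.6]                        # start time of each ASR word
print(align_captions(captions, asr, times))
# → [('good', 0.0), ('evening', 0.4), ('from', 1.3), ('washington', 1.6)]
```

The hard part in practice, as noted above, is that captions are incomplete and paraphrased, speakers overlap, and the recognizer itself makes errors; this sketch only shows how timestamps propagate once two word sequences exist.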

Text and language analysis

Your project should include some testing, parameter optimization, and evaluation.

New Idea: Development of a Query Interface for Parsed Data

The task is to create a new and improved version of a graphical user interface for graph-based search on dependency-annotated data.

The new version should have all functionality provided by the prototype plus a set of new features. The back-end is already in place.

Current functionality:

- add nodes to the query graph

- offer choice of dependency relation, PoS/word class based on the configuration in the database (the database is already there)

- allow for use of a hierarchy of dependencies (if supported by the grammatical model)

- allow for word/lemma search

- allow one node to be a "collo-item" (i.e. collocate or collexeme in a collostructional analysis)

- color nodes based on a finite list of colors

- paginate results

- export xls of collo-items

- create a JSON object that represents the query to pass it on to the back-end
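A hypothetical sketch of what such a query JSON might look like: every field name here (nodes, edges, relation, collo, and so on) is an illustrative assumption, not the actual back-end schema, which the existing prototype defines.

```python
# Serialize a two-node dependency query ("any verb with the object 'decision',
# where the object slot is the collo-item") for hand-off to a back-end.
import json

query = {
    "nodes": [
        {"id": 1, "word": None, "pos": "VERB"},
        {"id": 2, "lemma": "decision", "pos": "NOUN", "collo": True},
    ],
    "edges": [
        {"from": 1, "to": 2, "relation": "dobj"},  # dependency relation between nodes
    ],
}

payload = json.dumps(query)          # string sent to the back-end
assert json.loads(payload) == query  # round-trips cleanly
```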

New functionality:

- allow for removal of nodes

- allow for query graphs that are not trees

- allow for specification of the order of the elements

- pagination of search results should be possible even if several browser windows or tabs are open.

- configurable export to csv for use with R

- compatibility with all major web browsers (IE, Firefox, Chrome, Safari) [currently, IE is not supported]

- parse of example sentence can be used as the basis of a query ("query by example")
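The "configurable export to csv for use with R" item above could be as simple as the following sketch; the column names and the shape of a collo-item are illustrative assumptions.

```python
# Write a chosen subset of collo-item fields as CSV text that R's read.csv
# (or read.csv2, with sep=";") can load directly.
import csv
import io

def export_collo_items(items, columns, sep=","):
    """Render collo-items (dicts) as CSV with a configurable column set and separator."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=sep, lineterminator="\n")
    writer.writerow(columns)                               # header row for R
    for item in items:
        writer.writerow([item.get(c, "") for c in columns])
    return buf.getvalue()

items = [{"lemma": "decision", "freq": 412, "score": 3.7}]
print(export_collo_items(items, ["lemma", "freq", "score"]))
# → lemma,freq,score
#   decision,412,3.7
```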

Suggested steps:

1. Play around with the demo interface (user: gsoc2015, password: redhen) [taz is a German corpus; the other two are English]

2. Decide on a suitable JavaScript framework (we'd suggest React paired with jQuery, or something along those lines; this will have to be discussed)

3. Think about the HTML representation. We would like it to be HTML5/CSS3, but at the moment we are not sure whether we can build sensible widgets without major work on the <canvas> element.

4. Contact Peter Uhrig to discuss details or ask for clarification on any point.


Difficulty: Medium to hard.