c. Summer of Code 2015 - ideas page

Please note that this GSoC 2015 ideas page is no longer current. See our GSoC 2016 ideas page.

About us

What is Red Hen?
Red Hen is an international consortium for research on multimodal communication. We are developing open-source tools for joint parsing of text, audio/speech, and video, using a very large international dataset of television news.

Who is behind Red Hen?
Faculty, staff, and students at several universities around the world, including UCLA, FAU Erlangen, Case Western Reserve University, Centro Federal de Educação Tecnológica in Rio, Oxford University, Universidad de Navarra, University of Southern Denmark, and more. You can get more details here.

Exactly what software is used in Red Hen?
Red Hen uses 100% open source software. In fact, not just the software but everything else—including recording nodes—is shared in the consortium.

Among other tools, we use CCExtractor, ffmpeg, and OpenCV (they have all been part of GSoC in the past). Since we depend completely on established open source software, there are many opportunities for cross-project collaboration. We are happy to see our students submit ideas to improve any of the programs on which we rely.

Of course, we also have our own software; all of it is also available, even if some parts are very specific to our work at this point.

Who uses Red Hen's infrastructure? Can I have access to it?
Section 108 of the U.S. Copyright Act permits Red Hen, as a library archive, to record news broadcasts from all over the world and to loan recorded materials to researchers engaged in projects monitored by the Red Hen directors and senior personnel. Section 108 restrictions apply only to the corpus of recordings, not to the software. Because everything we do is open source, anyone can replicate our infrastructure.

Participants in the Summer of Code will have full access to the main NewsScape dataset at UCLA and other datasets that have been added to Red Hen, and applicants have full access to sample datasets.

During the GSoC application period, Red Hen is opening the archive (see credentials for access immediately below) so students who are considering joining Red Hen can check out what we do and decide whether they would like to spend the summer working with us.

User: gsoc
Password: ideaspage

What's the Red Hen Corpus?

The Corpus is a huge archive of TV programming. The stats as of February 2015 are:
  • Total networks: 38
  • Total series: 2,283
  • Total duration in hours: 237,321
  • Total metadata files (CC, OCR, TPT): 633,496
  • Total words in metadata files (CC, OCR, TPT): 2.95 billion
  • Total caption files: 305,287
  • Total words in caption files: 1.95 billion
  • Total OCR files: 298,899
  • Total TPT files: 29,310
  • Total words in OCR files: 652.73 million
  • Total words in TPT files: 347.12 million
  • Total video files: 305,137
  • Total thumbnail images: 85,435,389
  • Storage used for core data: 75.26 terabytes
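The file and word counts above are internally consistent: the caption, OCR, and TPT figures sum to the metadata totals. A quick sanity check:

```python
# Sanity check: caption + OCR + TPT counts should sum to the metadata totals.
caption_files, ocr_files, tpt_files = 305_287, 298_899, 29_310
total_metadata_files = caption_files + ocr_files + tpt_files
print(total_metadata_files)  # 633496, matching the reported total

caption_words = 1.95e9
ocr_words = 652.73e6
tpt_words = 347.12e6
total_words = caption_words + ocr_words + tpt_words
print(round(total_words / 1e9, 2))  # 2.95 (billion), matching the reported total
```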

General considerations

  • We're interested in novel and innovative solutions to a wide range of text, speech, and image parsing tasks; what we suggest below are just starter concepts. We invite you to expand on them considerably to make them your own, or to suggest ideas we didn't think of.
  • Some of the ideas below spell out tasks that cannot possibly be perfectly implemented in the space of one summer, but you might make an honorable start on any of them. We will work with you to define just the right level of challenge for you; this is an important part of the whole process.
  • You are free to use any open source tool for any task, or write your own code. If you use something that is not part of Red Hen, you are encouraged to submit all your work to the official maintainers. Build on the work of others and prepare your work so that others can build upon it. 
  • If you are interested and have questions, the sooner you contact us about them the better. We want to help you prepare a great proposal. Don't be shy.
  • You can find links to source code next to the ideas where it's relevant. Because each organization and university contributes a part, the repositories are not yet centralized. An index page where all source code can easily be found is being prepared.
  • See the Personal Software Process (PSP) (details)

Contacting us

All mentors are available by email. Please write "GSoC-student", your name, and the proposed topic at the start of the subject so we can prioritize GSoC emails. We reply as soon as possible.
During the student selection stage (when you are looking for the organization that best matches your interests), someone will be available via IRC as well, on the channel #redhenlab on freenode.
During the actual GSoC, the IRC channel will remain open (students are encouraged to join and discuss with mentors and fellow students). Mentors will be available on Skype and/or Google Hangouts. Red Hen has at its disposal a variety of videoconferencing systems, including Cisco Telepresence and Cisco Jabber, Scopia, WebEx, and Adobe Connect. Red Hen is called the "Distributed Little Red Hen Lab" because we operate across many nations and many time zones, often running lab meetings through multipoint videoconferencing. You can also connect with us via LinkedIn. When requesting the connection, please mention GSoC.

Mentor | University / Organization | Email | LinkedIn or personal website | Languages
Inés Olza | Universidad de Navarra (Spain) | iolzamor@unav.es | https://sites.google.com/site/inesolza/home | English, Spanish, French, German
Carlos Ramisch | CNRS and Aix-Marseille Université | carlos.ramisch@lif.univ-mrs.fr | http://pageperso.lif.univ-mrs.fr/~carlos.ramisch/ and mwetoolkit | French, English
Cristóbal Pagán Cánovas | Universidad de Navarra (Spain) | cpaganc@unav.es | https://sites.google.com/site/cristobalpagancanovas/ | English, Spanish, French, Greek
Javier Valenzuela | Universidad de Murcia (Spain) | jvalen@um.es | http://www.um.es/lincoing/jv/index.htm | English, Spanish
Peter Broadwell | University of California, Los Angeles (United States) | broadwell@library.ucla.edu | http://www.linkedin.com/in/PeterMBroadwell | English, German
Carlos Fernández | CCExtractor | carlos@ccextractor.org | es.linkedin.com/pub/carlos-fernandez-sanz/0/1a/575/en | English, Spanish
Jungseock Joo | UCLA Computer Science | jungseock at gmail.com | http://cs.ucla.edu/~joo/ | English, Korean
Weixin Li | UCLA Computer Science | lexiwzx at gmail.com | http://www.cs.ucla.edu/~lwx/ | English, Chinese
Francis Steen | University of California, Los Angeles (United States) | fsteen at ucla.edu | https://www.linkedin.com/in/ffsteen | English, Norwegian, French, German, Italian
Peter Uhrig | FAU Erlangen-Nuremberg (Germany) | peter.uhrig@fau.de | http://www.anglistik.phil.uni-erlangen.de/institut/personen/englinguistics/uhrig/uhrig.php | German, English, French
Mark Turner | Case Western Reserve University | turner@case.edu | http://markturner.org | English, French, Spanish, Italian
Kai Chan | UCLA Social Science Computing | kai@ssc.ucla.edu | http://www.staffassembly.ucla.edu/board/1415/Kai.html | English, Chinese

Sister project: CCExtractor

One of the tools we use in Red Hen is CCExtractor. It's a tool that takes media files and produces a transcript of the subtitles. We use the output of CCExtractor to 'follow the chain' and start the language analysis.

It turns out that CCExtractor is also in Summer of Code. We encourage students to also check their ideas page. We share some resources (a couple of mentors and even hardware) with them, so there are many opportunities for connections between the two projects. 
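CCExtractor commonly emits SubRip (.srt) subtitle files. As a minimal sketch of "following the chain" from subtitles into language analysis, here is a small parser; the sample subtitle text is invented for illustration:

```python
import re

def parse_srt(text):
    """Parse SubRip (.srt) subtitle text into (start, end, caption) tuples."""
    entries = []
    # Each block: index line, "HH:MM:SS,mmm --> HH:MM:SS,mmm", caption lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        m = re.match(r"(\S+) --> (\S+)", lines[1])
        if m:
            entries.append((m.group(1), m.group(2), " ".join(lines[2:])))
    return entries

# Invented sample for illustration:
sample = """\
1
00:00:01,000 --> 00:00:03,500
GOOD EVENING, I'M A NEWS ANCHOR.

2
00:00:03,600 --> 00:00:06,000
TONIGHT'S TOP STORY...
"""

for start, end, caption in parse_srt(sample):
    print(start, end, caption)
```

The resulting (start, end, caption) tuples are the natural input to downstream alignment and tagging steps.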

Project ideas

We have separated the ideas into audio analysis, image and video analysis, and text and language analysis, but there are many interesting tasks that benefit from joint parsing and we encourage innovative proposals.

Don't let the range or scope of ideas overwhelm you. We have many things to do because Red Hen is a large organization with an ambitious agenda.

Be honest about what you think you will be able to accomplish over the summer. We are aware that some of the ideas may take more than a month, and that not all of them will be completed in one summer.

Remember that as a student you are not only allowed but encouraged to bring your own ideas. The most important thing is that you are passionate about what you are going to do over the summer.

Media analysis (any combination of audio, video and subtitles)

Task: Detect commercials
Details: There are a few approaches to detecting commercials inside a video stream. Most of them are more or less naive implementations that work only under certain conditions. The goal here is to figure out a way (using any tool) to reliably detect commercials. You can do it in any way you want. Consider having a database of commercials (for example, you could keep an archive of audio) that is easy to train.
Tips: Two tools you can check out are ComSkip and MythTV (which has its own commercial detection module). They are heuristic-based, though, so they will probably never be perfect. Requiring some manual maintenance is OK for your approach; the goal is the best possible results.
Mentors: Carlos. Difficulty: Hard.
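One database-driven approach can be sketched as follows: reduce each known commercial to a coarse per-second loudness fingerprint and scan the broadcast's fingerprint for near-matches. The fingerprints below are invented for illustration; a real system would use spectral hashes rather than raw loudness:

```python
def near_match(a, b, tol=2):
    """True if two equal-length fingerprints agree within a tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def find_commercial(stream_fp, commercial_fp, tol=2):
    """Scan a broadcast fingerprint for a known commercial; return offsets."""
    n = len(commercial_fp)
    return [i for i in range(len(stream_fp) - n + 1)
            if near_match(stream_fp[i:i + n], commercial_fp, tol)]

# Invented per-second loudness fingerprints (0-100 scale):
commercial = [80, 85, 82, 90]
broadcast = [40, 42, 81, 84, 83, 91, 45, 44]
print(find_commercial(broadcast, commercial))  # [2]
```

The appeal of this design is that "training" is just adding a fingerprint to the database, which matches the idea of an easy-to-maintain commercial archive.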

Audio analysis

We use ffmpeg to extract the AAC audio track from the video and convert it to WAV format, and various tools to analyze it.
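The extraction step is easy to script. This sketch only builds the ffmpeg argument list; 16 kHz mono 16-bit PCM is a common choice for speech tools, but your aligner may expect different parameters:

```python
import subprocess

def build_ffmpeg_cmd(video_path, wav_path, rate=16000, channels=1):
    """Build an ffmpeg command that extracts the audio track to WAV."""
    return [
        "ffmpeg",
        "-i", video_path,        # input video with AAC audio
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, i.e. plain WAV audio
        "-ar", str(rate),        # sample rate expected by most speech tools
        "-ac", str(channels),    # mono
        wav_path,
    ]

cmd = build_ffmpeg_cmd("news.mp4", "news.wav")
print(" ".join(cmd))
# To actually run it (requires ffmpeg installed):
# subprocess.run(cmd, check=True)
```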

Task: Forced alignment of English transcript and audio in a multi-speaker environment (bonus for improving the caption text first)
Details: Newscast captions and transcripts are imperfect and typically lag the audio stream of speech by a variable number of seconds. Forced aligners synchronize the transcript to the audio. Expected outcome: synchronized transcripts for a sample dataset.
Tips: kaldi, sail_align, Prosodylab-Aligner, and Praat are good starting points. The challenge is to generate consistently good results at a reasonable speed from incomplete captions and in a multi-speaker environment.
Mentors: Inés, Mark, Francis. Difficulty: Medium to hard.

Task: Forced alignment of Spanish transcript and audio. Expected outcome: synchronized transcripts for a sample dataset; see the tips above. Mentors: Inés, Javier. Difficulty: Medium to hard.

Task: Forced alignment of German transcript and audio. Expected outcome: synchronized transcripts for a sample dataset; see the tips above. Mentors: Inés, Javier. Difficulty: Medium to hard.

Task: Forced alignment of French transcript and audio. Expected outcome: synchronized transcripts for a sample dataset; see the tips above. Mentors: Inés, Javier. Difficulty: Medium to hard.

Task: Forced alignment of Norwegian transcript and audio. Expected outcome: synchronized transcripts for a sample dataset; see the tips above. Mentors: Francis. Difficulty: Medium to hard.

Task: Forced alignment of Danish transcript and audio. Expected outcome: synchronized transcripts for a sample dataset; see the tips above. Mentors: Francis. Difficulty: Medium to hard.

Task: Forced alignment of Swedish transcript and audio. Expected outcome: synchronized transcripts for a sample dataset; see the tips above. Mentors: Francis. Difficulty: Medium to hard.

Task: Detect non-vocal audio features
Details: Detect non-voice features such as ambient noise, silence, music, explosions, cars, and weather. Expected outcome: annotate the text with the detected audio features.
Tips: Consider which non-voice features are informative for news audiences, even if audiences are not fully conscious of hearing them.
Mentors: Carlos / Cristobal. Difficulty: Easy to medium.

Task: Detect vocal features
Details: Detect audible breathing, sighs, laughter, and shouting. Expected outcome: annotate the text with the detected audio features.
Tips: Low-level vocal features can be significant in themselves or play a role in higher-level characterization.
Mentors: Carlos / Cristobal. Difficulty: Medium.

Task: Identify male and female speakers
Details: Use audio feature extraction techniques to identify male and female speakers. Expected outcome: identify the gender of speakers in a relatively straightforward sample dataset.
Tips: A good starting point is the pitch detection in Praat.
Mentors: Peter B / Cristobal. Difficulty: Easy to medium.

Task: Speaker diarization (turn-taking)
Details: Use audio feature extraction techniques to identify when one speaker stops speaking and another starts. Expected outcome: annotate speaker turns.
Tips: See the overview. SHoUT, LIUM, and idiap are possible starting points. There will often not be a pause between speakers, and there may be cross-talk.
Mentors: Peter B / Cristobal. Difficulty: Medium to hard.

Task: Characterize tone of voice
Details: Characterize voice qualities such as commanding, seductive, joking, or ironic. Expected outcome: annotate the transcript with the detected voice characterization.
Tips: Start with something tractable, such as the "decisive" or authoritative voice common in political speeches.
Mentors: Carlos / Cristobal. Difficulty: Medium to hard.

Task: Audio feature extraction (emotions)
Details: Detect expressions of emotions: anger, sadness, exasperation, tiredness, happiness. Expected outcome: annotate the transcript with the detected emotion.
Tips: openSMILE and openEAR are good starting points. Start with a single person expressing a range of different emotions. Cf. the sample dataset.
Mentors: Carlos / Cristobal / Kai. Difficulty: Medium to hard.

Task: Speaker identification
Details: Select a limited number of recurring speakers and identify them from their voices. Expected outcome: a short list of recognized speakers (major news anchors and politicians) and a framework for expanding the list.
Tips: Good starting points are bob.spear and ALIZE.
Mentors: Carlos / Cristobal. Difficulty: Medium to hard.

Task: Improve caption text
Details: Use audio analysis to improve the caption text. Expected outcome: detect errors in the caption text, such as misspelled or missing words.
Tips: The tools used in semafor may be helpful for improving the caption text. We are not looking for a full-fledged speech-to-text application.
Mentors: Carlos / Cristobal / Peter B. Difficulty: Medium to hard.
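As a toy illustration of the pitch-based approach to gender identification (Praat's pitch tracker is far more robust), here is a zero-crossing frequency estimate on a synthesized tone. The 165 Hz threshold is a commonly cited, but by no means definitive, male/female boundary:

```python
import math

def estimate_f0_zero_crossings(samples, sample_rate):
    """Crude fundamental frequency estimate: half the zero-crossing rate.
    Works only for clean, voiced, roughly sinusoidal signals."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration = len(samples) / sample_rate
    return crossings / (2 * duration)

def classify_gender(f0, boundary=165.0):
    """Assumption: ~165 Hz as a rough male/female pitch boundary."""
    return "male" if f0 < boundary else "female"

sr = 16000
# One second of a synthetic 120 Hz tone, standing in for a male voice:
tone = [math.sin(2 * math.pi * 120 * t / sr) for t in range(sr)]
f0 = estimate_f0_zero_crossings(tone, sr)
print(round(f0), classify_gender(f0))  # approximately 120, "male"
```

Real newscast audio has harmonics, music, and overlapping speech, which is why an autocorrelation-based tracker like Praat's is the suggested starting point.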


Text and language analysis 

Your project should include some testing, parameter optimization, and evaluation.

Task: Create a tool for tagging conceptual frames in Spanish
Details: Semafor (Semantic Analysis of Frame Representations) uses FrameNet to annotate text. The task is to port this application to use Spanish FrameNet. Expected outcome: the ability to tag Spanish caption texts with the categories in Spanish FrameNet.
Tips: Red Hen is already using Semafor; see the sample English annotated data (FRM_01). The individual components are available for Spanish but will need to be adapted and trained; see the sample Spanish data.
Mentors: Francis / Mark. Difficulty: Hard.

Task: Deploy Spanish part-of-speech annotation to the Red Hen corpus
Details: Assess a variety of PoS taggers for use with Red Hen, then adapt the best candidate to the Red Hen corpus data format. Expected outcome: annotation of Spanish for the Red Hen corpus.
Mentors: Francis / Mark. Difficulty: Easy.

Task: Web-based front end for the mwetoolkit multiword expression tagger
Details: Design and create a web-based front end where advanced users can enter structured multiword expressions. The backend takes the terms entered and uses the mwetoolkit to tag the dataset. Expected outcome: a user-friendly front end that accepts structured lists of multiword expressions and uses them to generate a tagged version of a corpus of transcripts.
Tips: See the detailed suggestions for the backend and the sample corpus; work with us to design the front end. Appropriate toolkits include jQuery and React.
Mentors: Francis / Mark / Peter U. Difficulty: Medium.

Task: Spanish support for the web-based front end to the mwetoolkit multiword expression tagger
Details: Building on the previous task, add support for Spanish. Expected outcome: your front end should accept multiword expressions in Spanish and tag a Spanish corpus.
Tips: See Etiquetador de expresiones multipalabra for detailed suggestions; see above for the web-based front end.
Mentors: Carlos / Cristobal. Difficulty: Easy (once the previous task is completed).

Task: Adapt CWBtreebank to the Red Hen corpus format and implement new heuristics to optimize query performance
Details: While CWBtreebank exists as a piece of open-source software, it currently caters to the needs of the treebank.info project and lacks support for other data formats (particularly multiple graph annotations on the same sentence). Furthermore, queries could be optimized by checking the frequencies of the items included in the query and evaluating its parts in the appropriate order. Expected outcome: a version of CWBtreebank that supports Red Hen's corpus format with multiple graphs and runs faster thanks to optimized queries.
Tips: The software is available at https://launchpad.net/cwb-treebank. You can see it in action at http://treebank.info (just request a test account from Peter Uhrig).
Mentors: Peter U. Difficulty: Easy to medium.

New Idea: Development of a Query Interface for Parsed Data

The task is to create a new and improved version of a graphical user interface for graph-based search on dependency-annotated data.
The new version should have all functionality provided by the prototype plus a set of new features. The back-end is already in place.

Current functionality:
- add nodes to the query graph
- offer choice of dependency relation, PoS/word class based on the configuration in the database (the database is already there)
- allow for use of a hierarchy of dependencies (if supported by the grammatical model)
- allow for word/lemma search
- allow one node to be a "collo-item" (i.e. collocate or collexeme in a collostructional analysis)
- color nodes based on a finite list of colors
- paginate results
- export xls of collo-items
- create a JSON object that represents the query to pass it on to the back-end
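As an illustration of what such a query object might look like, a two-node query with one dependency edge could be serialized as below. The field names are invented for illustration; the actual schema is defined by the existing back-end:

```python
import json

# Hypothetical query graph: a verb node plus an object noun marked as the
# collo-item. Field names are illustrative only; match the real schema.
query = {
    "nodes": [
        {"id": 1, "word": "give", "pos": "V", "collo": False, "color": "red"},
        {"id": 2, "pos": "N", "collo": True, "color": "blue"},
    ],
    "edges": [
        {"from": 1, "to": 2, "relation": "dobj"},
    ],
    "page": 1,
}

payload = json.dumps(query, sort_keys=True)
print(payload)
# The front end would pass this JSON object on to the back-end for execution.
```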

New functionality:
- allow for removal of nodes
- allow for query graphs that are not trees
- allow for specification of the order of the elements
- pagination of search results should be possible even if several browser windows or tabs are open.
- configurable export to csv for use with R
- compatibility with all major web browsers (IE, Firefox, Chrome, Safari) [currently, IE is not supported]
- parse of example sentence can be used as the basis of a query ("query by example")

Suggested steps:
1. Go to http://www.treebank.info and play around with the interface (user: gsoc2015, password: redhen) [taz is a German corpus, the other two are English]
2. Decide on a suitable JavaScript framework (we'd suggest React paired with jQuery or something along these lines; this will have to be discussed).
3. Think about the HTML representation. We would like it to be HTML5/CSS3, but at the moment we are not sure whether we can meet the requirements without major work on <canvas>, or whether we can build sensible widgets without having to dig into the <canvas> tag.
4. Contact Peter Uhrig to discuss details or ask for clarification on any point.

Difficulty:
Medium to hard.