The Distributed Little Red Hen Lab is an international consortium for research on multimodal communication. We develop open-source tools for joint parsing of text, audio/speech, and video, using datasets of various sorts, most centrally a very large dataset of international television news called the UCLA Library Broadcast NewsScape.
Faculty, staff, and students at several universities around the world contribute to Red Hen Lab, including UCLA, FAU Erlangen, Case Western Reserve University, Centro Federal de Educação Tecnológica in Rio de Janeiro, Oxford University, Universidad de Navarra, University of Southern Denmark, and more. Red Hen uses 100% open source software. In fact, not just the software but everything else—including recording nodes—is shared in the consortium.
Among other tools, we use CCExtractor, ffmpeg, and OpenCV (they have all been part of GSoC in the past).
WHO USES RED HEN'S INFRASTRUCTURE?
Section 108 of the U.S. Copyright Act permits Red Hen, as a library archive, to record news broadcasts from all over the world and to loan recorded materials to researchers engaged in projects monitored by the Red Hen directors and senior personnel. Section 108 restrictions apply only to the corpus of recordings, not to the software. Because everything we do is open source, anyone can replicate our infrastructure.
WHAT ARE WE WORKING WITH?
The Red Hen archive is a huge repository of recordings of TV programming, processed in a range of ways to produce derived products useful for research, expanded daily, and supplemented by various sets of other recordings. Our challenge is to create tools that give us access to the audio, visual, and textual (closed-captioning) information in the corpus by making it possible to search, parse, and analyze the video files. Because the archive is very large, processes that scan the entire dataset are time-consuming and often carry a margin of error.
The stats as of 2015-08-23 are:
Our ideas page for GSoC 2015 challenged students to assist in a number of projects, including some that have successfully improved our ability to search, parse, and extract information from the archive.
WHAT HAVE WE ACCOMPLISHED?
Google Summer of Code 2015 Projects:
Ekaterina Ageeva - Multiword Expression Search and Tagging
References: MWEtoolkit / Interface Demo / Initial Proposal
A linguistic corpus is a large collection of annotated texts. The most common annotation includes part-of-speech, grammatical and syntactic information. However, other information may be specified depending on the purpose of a given corpus.
This project aims at facilitating a specific corpus annotation task, namely, tagging of multiword expressions. Such expressions include, for example, English phrasal verbs (look forward to, stand up for) or other expressions with a fixed structure (e.g. unit of time + manner-of-motion verb: time flies, the years roll slowly).
The goal of the project is to develop an integrated, language-agnostic pipeline from user input of multiword expressions to a fully annotated corpus. The annotation will be performed by an existing tool, the mwetoolkit. In the course of the project the following components will be developed: the utility scripts that perform input and output conversion, the backend that communicates with the mwetoolkit, and the frontend that allows users to customize their tagging task while minimizing the amount of interaction with the tagger. As a result, the multiword expression tagging task will be automated to the extent possible, and more corpora will benefit from an additional level of annotation.
The interface allows us to do the following:
- Upload, manage, and share corpora
A typical workflow goes as follows:
1. Create an account (used to store search history and manage corpus permissions)
The interface is built with Python/Django. It currently supports operations on corpora tagged with the Stanford CoreNLP parser and can be extended to other formats supported by MWEtoolkit. The system uses part-of-speech and syntactic dependency information to find the expressions. Users may rely on various frequency metrics to obtain the most relevant search results.
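As a rough illustration of how part-of-speech information can drive a multiword expression search, here is a minimal sketch. It is not the mwetoolkit API or the project's code: the `Token` and `Slot` types and the `find_mwe` function are hypothetical names introduced for this example, the pattern only matches contiguous tokens, and syntactic dependency information is left out for brevity.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (Penn Treebank style assumed here)


class Slot(NamedTuple):
    """One position in a pattern; None means 'no constraint' (hypothetical type)."""
    lemma: Optional[str] = None
    pos_prefix: Optional[str] = None


def slot_matches(slot: Slot, token: Token) -> bool:
    if slot.lemma is not None and token.lemma != slot.lemma:
        return False
    if slot.pos_prefix is not None and not token.pos.startswith(slot.pos_prefix):
        return False
    return True


def find_mwe(tokens: List[Token], pattern: List[Slot]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans where the pattern matches contiguously."""
    if not pattern:
        return []
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(slot_matches(s, t) for s, t in zip(pattern, window)):
            hits.append((start, start + len(pattern)))
    return hits


# Hand-tagged example sentence; a real corpus would come from a parser.
sentence = [
    Token("She", "she", "PRP"),
    Token("looked", "look", "VBD"),
    Token("forward", "forward", "RB"),
    Token("to", "to", "TO"),
    Token("the", "the", "DT"),
    Token("trip", "trip", "NN"),
]

# "look forward to": any verb form of "look", then "forward", then "to".
look_forward_to = [Slot("look", "VB"), Slot("forward"), Slot("to")]
spans = find_mwe(sentence, look_forward_to)
```

Matching on lemmas rather than surface forms lets one pattern cover "looks", "looked", and "looking"; a dependency-aware matcher would additionally catch non-contiguous uses.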
Owen He - Automatic Speaker Recognition System
Owen used a reservoir computing method called conceptors together with traditional Gaussian Mixture Models (GMM) to distinguish the voices of different speakers. He also used a method proposed by Microsoft Research last year at the Interspeech Conference, which uses a Deep Neural Network (DNN) and an Extreme Learning Machine (ELM) to recognize speech emotions. The DNN was trained to extract segment-level (256 ms) features, and the ELM was trained to make decisions based on the statistics of these features at the utterance level.
Owen's project focused on applying this to detect male and female speakers, specific speakers, and emotions by collecting training samples from different speakers and audio signals with different emotional features. He then preprocessed the audio signals, and created the statistical models from the training dataset. Finally, he computed the combined evidence in real time and tuned the apertures for the conceptors so that the optimal classification performance could be reached. You can check out the summary of results on GitHub.
His method is as follows:
1. Create a single, small (N=10 units) random reservoir network.
2. Drive the reservoir, in two independent sessions, with 100 preprocessed training samples of each gender, and create conceptors C_male, C_female respectively from the network response.
3. In exploitation, a preprocessed sample s from the test set is fed to the reservoir, and the induced reservoir states x(n) are recorded and transformed into a single vector z. For each conceptor, the positive evidence quantities z’ C_male z and z’ C_female z are then computed. The gender of the speaker is identified by the higher positive evidence, i.e. the speaker is classified as male if z’ C_male z > z’ C_female z, and as female otherwise. The idea behind this procedure is that if the reservoir is driven by a signal from a male speaker, the resulting response signal z will be located in a linear subspace of the reservoir state space whose overlap with the ellipsoid given by C_male is larger than that with the ellipsoid given by C_female.
4. In order to further improve the classification quality, we also compute NOT(C_male) and NOT(C_female). This leads to negative evidence quantities z’NOT(C_male)z and z’NOT(C_female)z.
5. By adding the positive evidence z’ C_male z and the negative evidence z’NOT(C_female)z, a combined evidence is obtained, which can be paraphrased as “this test sample seems to be from a male speaker and seems not to be from a female speaker”.
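The evidence computation in steps 2 to 5 can be sketched with NumPy as follows. This is an illustration under stated assumptions, not Owen's implementation: it uses the standard conceptor definition C = R (R + aperture^-2 I)^-1 with R the correlation matrix of the response vectors, and NOT(C) = I - C; the reservoir itself is omitted, and the response vectors z are replaced by synthetic data in which each class occupies a different subspace. All function names are hypothetical.

```python
import numpy as np


def conceptor(Z, aperture=1.0):
    """Conceptor from an N x n matrix Z holding one response vector per column."""
    n_units, n_samples = Z.shape
    R = Z @ Z.T / n_samples  # correlation matrix of the responses
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(n_units))


def negate(C):
    """Boolean NOT on conceptors: NOT(C) = I - C."""
    return np.eye(C.shape[0]) - C


def combined_evidence(z, C_own, C_other):
    """Positive evidence for one class plus negative evidence against the other."""
    positive = z @ C_own @ z
    negative = z @ negate(C_other) @ z
    return positive + negative


def classify(z, C_male, C_female):
    e_male = combined_evidence(z, C_male, C_female)
    e_female = combined_evidence(z, C_female, C_male)
    return "male" if e_male > e_female else "female"


# Synthetic stand-in for reservoir responses (assumption): "male" responses
# vary mostly along the first three dimensions, "female" along the last three.
rng = np.random.default_rng(0)
N = 10  # reservoir size from step 1
scale_male = np.full(N, 0.05)
scale_male[:3] = 1.0
scale_female = np.full(N, 0.05)
scale_female[-3:] = 1.0

Z_male = rng.standard_normal((N, 100)) * scale_male[:, None]
Z_female = rng.standard_normal((N, 100)) * scale_female[:, None]

C_male = conceptor(Z_male)
C_female = conceptor(Z_female)

label_for_male_like = classify(scale_male, C_male, C_female)
label_for_female_like = classify(scale_female, C_male, C_female)
```

The `aperture` argument is the quantity that is tuned for classification performance: larger values push the conceptor's eigenvalues toward 1, smaller values toward 0.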
Owen replicated this to detect male and female speakers, specific speakers, and emotion by collecting training samples from male and female speakers, specific speakers to be recognized, and audio signals with different emotional features. He then preprocessed the audio signals, and created the conceptors from the training dataset. He then computed the combined evidence in real time and tuned the apertures for the conceptors so that the optimal classification performance could be reached.
Vasanth Kalingeri - Commercial Detection System
Vasanth Kalingeri built a system for detecting commercials in television programs from any country and in any language. The system detects the location and the content of ads in any stream of video, regardless of the content being broadcast and other transmission noise in the video. In tests, the system achieved 100% detection of commercials. An online interface was built along with the system to allow regular inspection and maintenance.
Audio fingerprinting of commercial segments was used to detect commercials. Fingerprint matching is highly accurate for audio: even severely distorted audio can be recognized reliably. The major problem was that the system had to be generic for all TV broadcasts, regardless of audio/video quality, aspect ratio, and other transmission-related errors. Audio fingerprinting provides a solution to all of these problems. After implementation, several of the theoretical approaches to detecting commercials suggested in the proposal turned out to be less accurate than audio fingerprinting, so audio fingerprinting stayed in the final version of the system.
Initially the user starts with a set of hand-tagged commercials. The system detects this set of commercials in the TV segment. On detecting these commercials, it divides the entire broadcast into blocks. Each of these blocks can be viewed and tagged as a commercial by the user. This constitutes the maintenance of the system. There is a set of 60 hand-labelled commercials to work with. This process takes about 10-30 minutes for a 1-hour TV segment, depending on the number of commercials that have to be tagged.
When the database has an appreciable number of commercials (usually around 30 per channel), we can use it to recognize commercials in any unknown TV segment.
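To make the idea of matching a known commercial against an unknown segment concrete, here is a toy fingerprinting sketch. The report does not say which fingerprinting algorithm or library the system uses, so everything below is an assumption for illustration: the fingerprint is simply the dominant FFT bin of each non-overlapping frame, hashes are triples of consecutive peaks, and the test signal is made of synthetic, frame-aligned tones. Real systems typically use overlapping windows and several peaks per frame.

```python
from collections import Counter

import numpy as np

FRAME = 1024  # samples per analysis frame (assumed value)


def peak_bins(signal):
    """Dominant FFT bin of each non-overlapping frame."""
    n_frames = len(signal) // FRAME
    frames = signal[:n_frames * FRAME].reshape(n_frames, FRAME)
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return spectrum.argmax(axis=1)


def fingerprint(signal):
    """Map each hash (three consecutive peak bins) to the frames where it starts."""
    peaks = peak_bins(signal)
    index = {}
    for i in range(len(peaks) - 2):
        key = (int(peaks[i]), int(peaks[i + 1]), int(peaks[i + 2]))
        index.setdefault(key, []).append(i)
    return index


def locate(ad_index, stream):
    """Vote on the frame offset at which the ad best aligns with the stream."""
    votes = Counter()
    for key, stream_frames in fingerprint(stream).items():
        for i in ad_index.get(key, []):
            for j in stream_frames:
                votes[j - i] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]  # (offset_in_frames, vote_count)


def tones(bins):
    """Synthetic audio: one pure tone per frame, centred on the given FFT bin."""
    t = np.arange(FRAME)
    return np.concatenate([np.sin(2 * np.pi * k * t / FRAME) for k in bins])


rng = np.random.default_rng(1)
ad = tones(rng.integers(10, 200, size=20))         # the "known commercial"
before = tones(rng.integers(200, 400, size=15))    # surrounding programme
after = tones(rng.integers(200, 400, size=15))
stream = np.concatenate([before, ad, after])
stream = stream + 0.1 * rng.standard_normal(stream.shape)  # transmission noise

offset, votes = locate(fingerprint(ad), stream)
```

Voting on the offset rather than requiring every hash to match is what makes this style of matching tolerant of distortion: a commercial is reported when enough hashes agree on the same alignment.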
Running the system on a video produces an output file in the following format:
00:00:00 - 00:00:38 = Unclassified
00:00:38 - 00:01:20 = ad by Jeopardy
00:01:21 - 00:02:20 = ad by Jerome’s
… and so on
Each line gives the location of a segment in the video and, for commercials, what is being advertised.
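An output file in this format is easy to post-process. The following sketch parses such lines into start and end times in seconds; it assumes the format is exactly as in the sample above (`HH:MM:SS - HH:MM:SS = label`) and that commercial labels begin with "ad by ", which is inferred from the sample rather than from a specification. The `Segment` type and `parse_segments` function are hypothetical names.

```python
import re
from typing import List, NamedTuple

# HH:MM:SS - HH:MM:SS = label
LINE_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2})\s*-\s*(\d{2}):(\d{2}):(\d{2})\s*=\s*(.+)$"
)


class Segment(NamedTuple):
    start: int  # seconds from the beginning of the video
    end: int
    label: str


def parse_segments(text: str) -> List[Segment]:
    """Parse detector output, skipping lines that do not fit the format."""
    segments = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        h1, m1, s1, h2, m2, s2 = (int(g) for g in m.groups()[:6])
        start = h1 * 3600 + m1 * 60 + s1
        end = h2 * 3600 + m2 * 60 + s2
        segments.append(Segment(start, end, m.group(7).strip()))
    return segments


sample = """\
00:00:00 - 00:00:38 = Unclassified
00:00:38 - 00:01:20 = ad by Jeopardy
00:01:21 - 00:02:20 = ad by Jerome's
"""

segments = parse_segments(sample)
# Total seconds of detected advertising in the sample.
ad_seconds = sum(s.end - s.start for s in segments if s.label.startswith("ad by "))
```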
If the system misses a few commercials, one can edit this file through a web interface.
When changes are made through the web interface, the system updates its database with the new or edited commercials. The web interface can also be used to view the detected commercials.
This system is extremely useful for people with the following interest: