Red Hen Lab - Google Summer of Code 2015 Report
WHAT IS RED HEN?
The Distributed Little Red Hen Lab is an international consortium for research on multimodal communication. We develop open-source tools for joint parsing of text, audio/speech, and video, using datasets of various sorts, most centrally a very large dataset of international television news called the UCLA Library Broadcast NewsScape.
Faculty, staff, and students at several universities around the world contribute to Red Hen Lab, including UCLA, FAU Erlangen, Case Western Reserve University, Centro Federal de Educação Tecnológica in Rio de Janeiro, Oxford University, Universidad de Navarra, University of Southern Denmark, and more. Red Hen uses 100% open source software. In fact, not just the software but everything else—including recording nodes—is shared in the consortium.
Among other tools, we use CCExtractor, ffmpeg, and OpenCV (they have all been part of GSoC in the past).
WHO USES RED HEN'S INFRASTRUCTURE?
Section 108 of the U.S. Copyright Act permits Red Hen, as a library archive, to record news broadcasts from all over the world and to loan recorded materials to researchers engaged in projects monitored by the Red Hen directors and senior personnel. Section 108 restrictions apply to only the corpus of recordings, not the software. Because everything we do is open source, anyone can replicate our infrastructure.
WHAT ARE WE WORKING WITH?
The Red Hen archive is a huge repository of recordings of TV programming, processed in a range of ways to produce derived products useful for research, expanded daily, and supplemented by various other sets of recordings. Our challenge is to create tools for searching, parsing, and analyzing the video files, so that the audio, visual, and textual (closed-captioning) information in the corpus can be accessed in various ways. However, the archive is very large, so processes that scan the entire dataset are time-consuming and prone to a margin of error.
The stats as of 2015-08-23 are:
Total networks: 38
Total series: 2,335
Total duration in hours: 257,184 (224,988)
Total metadata files (CC, OCR, TPT): 684,646 (601,983)
Total words in metadata files (CC, OCR, TPT): 3.18 billion, 3,181,042,472 exactly (2.81)
Total caption files: 330,978 (289,463)
Total words in caption files: 2.10 billion, 2,104,276,527 exactly (1.86)
Total OCR files: 322,504 (284,482)
Total TPT files: 31,164 (28,038)
Total words in OCR files: 707.41 million, 707,405,543 exactly (619.10)
Total words in TPT files: 369.36 million, 369,360,402 exactly (331.77)
Total video files: 330,821 (289,315)
Total thumbnail images: 92,586,368 (80,995,756)
Storage used for core data: 81.42 terabytes (71.64)
Our ideas page for GSoC 2015 challenged students to assist in a number of projects, including some that have successfully improved our ability to search, parse, and extract information from the archive.
WHAT HAVE WE ACCOMPLISHED?
Google Summer of Code 2015 Projects:
- Ekaterina Ageeva - Multiword Expression Tagger & Frontend
- Owen He - Detecting Male and Female Speakers
- Vasanth Kalingeri - Commercial Detection
Ekaterina Ageeva - Multiword Expression Search and Tagging
A linguistic corpus is a large collection of annotated texts. The most common annotation includes part-of-speech, grammatical and syntactic information. However, other information may be specified depending on the purpose of a given corpus.
This project aims at facilitating a specific corpus annotation task, namely, tagging of multiword expressions. Such expressions include, for example, English phrasal verbs (look forward to, stand up for) or other expressions with a fixed structure (e.g. unit of time + manner-of-motion verb: time flies, the years roll slowly).
The goal of the project is to develop an integrated language-agnostic pipeline from user input of multiword expressions to a fully annotated corpus. The annotation is performed by an existing tool, the mwetoolkit. In the course of the project the following components were developed: utility scripts that perform input and output conversion, a backend that communicates with the mwetoolkit, and a frontend that allows the user to customize their tagging task while minimizing the amount of interaction with the tagger. As a result, the multiword expression tagging task is automated to the extent possible, and more corpora can benefit from an additional level of annotation.
The project builds on the multiword expressions toolkit (mwetoolkit), a tool for detecting multiword units (e.g. phrasal verbs or idiomatic expressions) in large corpora. The toolkit operates via a command-line interface. To ease access and expand the toolkit's audience, Ekaterina developed a web-based interface, which builds on and extends the toolkit's functionality. The interface allows us to do the following:
- Upload, manage, and share corpora
- Create XML patterns which define constraints on multiword expressions
- Search the corpora using the patterns
- Filter search results by occurrence and frequency measures
- Tag the corpora with obtained search results
A typical workflow goes as follows:
1. Create an account (used to store search history and manage corpus permissions)
2. Upload a corpus. Directly after upload, the corpus is indexed with the mwetoolkit to speed up search and tagging.
3. Create search patterns using the Pattern builder
4. Search and/or tag the corpus with the patterns
5. Download the modified corpus file after tagging.
The interface is built with Python/Django. It currently supports operations on corpora tagged with the Stanford CoreNLP parser, with the possibility of extending to other formats supported by the mwetoolkit. The system uses part-of-speech and syntactic dependency information to find the expressions. Users may rely on various frequency metrics to obtain the most relevant search results.
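The frequency metrics mentioned above are association measures; a common one for ranking multiword-expression candidates is pointwise mutual information (PMI). The following is an illustrative sketch, not the interface's actual code, and the counts are invented for the example:

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information of a two-word candidate.

    count_xy: co-occurrence count of the pair; count_x, count_y:
    individual word counts; n: corpus size in tokens.  High PMI means
    the words co-occur far more often than chance predicts.
    """
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: suppose "look forward" occurs 50 times in a
# 1,000,000-token corpus, "look" 2,000 times, and "forward" 500 times.
score = pmi(50, 2000, 500, 1_000_000)  # strongly positive: a good candidate
```

Ranking candidates by such scores lets the user surface the most relevant results first, rather than wading through every pattern match.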
Owen He - Automatic Speaker Recognition System
Owen used a reservoir computing method called conceptors, together with traditional Gaussian Mixture Models (GMM), to distinguish the voices of different speakers. He also used a method proposed by Microsoft Research at the 2014 Interspeech conference, which combines a Deep Neural Network (DNN) and an Extreme Learning Machine (ELM) to recognize speech emotions: the DNN is trained to extract segment-level (256 ms) features, and the ELM is trained to make decisions based on the statistics of these features at the utterance level. Owen's project focused on applying this to detect male and female speakers, specific speakers, and emotions by collecting training samples from different speakers and audio signals with different emotional features. He then preprocessed the audio signals and created the statistical models from the training dataset. Finally, he computed the combined evidence in real time and tuned the apertures of the conceptors so that optimal classification performance could be reached. You can check out the summary of results on GitHub.
His method is as follows:
1. Create a single, small (N=10 units) random reservoir network.
2. Drive the reservoir, in two independent sessions, with 100 preprocessed training samples of each gender, and create conceptors C_male, C_female respectively from the network response.
3. In exploitation, a preprocessed sample s from the test set is fed to the reservoir, and the induced reservoir states x(n) are recorded and transformed into a single vector z. For each conceptor, the positive evidence quantities z’ C_male z and z’ C_female z are then computed. We can now identify the gender of the speaker by taking the higher positive evidence: the speaker is classified as male if z’ C_male z > z’ C_female z, and vice versa. The idea behind this procedure is that if the reservoir is driven by a signal from a male speaker, the resulting response z will be located in a linear subspace of the reservoir state space whose overlap with the ellipsoid given by C_male is larger than its overlap with the ellipsoid given by C_female.
4. In order to further improve the classification quality, we also compute NOT(C_male) and NOT(C_female). This leads to negative evidence quantities z’NOT(C_male)z and z’NOT(C_female)z.
5. By adding the positive evidence z’ C_male z and the negative evidence z’ NOT(C_female) z, a combined evidence is obtained, which can be paraphrased as “this test sample seems to be from a male speaker and seems not to be from a female speaker”.
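The evidence computation in steps 2-5 can be sketched as follows. This is an illustrative reconstruction of the standard conceptor formulas (C = R(R + aperture^-2 I)^-1 and NOT(C) = I - C) on random stand-in data, not Owen's actual code; the aperture value and dimensions are arbitrary:

```python
import numpy as np

def conceptor(X, aperture):
    """Conceptor matrix C = R (R + aperture^-2 I)^-1, where R is the
    correlation matrix of the reservoir states X (one column per step)."""
    dim, n = X.shape
    R = X @ X.T / n
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(dim))

def evidence(z, C):
    """Quadratic evidence z' C z for a response vector z under conceptor C."""
    return float(z @ C @ z)

# Toy "reservoir responses" for the two classes (random stand-ins for
# the states induced by the male/female training samples).
rng = np.random.default_rng(0)
X_male = rng.normal(size=(10, 100))
X_female = rng.normal(size=(10, 100))
C_male = conceptor(X_male, aperture=10.0)
C_female = conceptor(X_female, aperture=10.0)

# Combined evidence for a test response z: positive evidence for "male"
# plus negative evidence against "female" via NOT(C_female) = I - C_female.
z = X_male[:, 0]
combined_male = evidence(z, C_male) + evidence(z, np.eye(10) - C_female)
```

The classifier then picks the class whose combined evidence is largest; the aperture controls how tightly each conceptor's ellipsoid hugs its class's state subspace, which is why tuning it matters.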
Owen replicated this procedure to detect male and female speakers, specific speakers, and emotions, collecting the corresponding training samples for each task: recordings from male and female speakers, from the specific speakers to be recognized, and audio signals with different emotional features. After preprocessing the audio, he created the conceptors from the training dataset, computed the combined evidence in real time, and tuned the conceptor apertures for optimal classification performance.
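The segment-to-utterance step of the DNN+ELM emotion pipeline mentioned above can be sketched as follows; the particular statistics (mean, max, min) and the toy numbers are illustrative, not the exact feature set used:

```python
import numpy as np

def utterance_statistics(segment_probs):
    """Collapse segment-level class scores (one row per ~256 ms segment,
    as produced by the DNN) into a fixed-length utterance-level vector.
    The ELM then classifies the utterance from these statistics."""
    p = np.asarray(segment_probs, dtype=float)
    return np.concatenate([p.mean(axis=0), p.max(axis=0), p.min(axis=0)])

# Toy example: 4 segments, 3 emotion classes -> a 9-dimensional vector
# that has the same length regardless of how long the utterance is.
feats = utterance_statistics([[0.7, 0.2, 0.1],
                              [0.6, 0.3, 0.1],
                              [0.8, 0.1, 0.1],
                              [0.5, 0.4, 0.1]])
```

The point of the statistics is to turn a variable number of segments into a fixed-size input, which is what lets a simple classifier like an ELM make one decision per utterance.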
Vasanth Kalingeri - Commercial detection system
Vasanth Kalingeri built a system for detecting commercials in television programs from any country and in any language. The system detects the location and the content of ads in any stream of video, regardless of the content being broadcast and other transmission noise in the video. In tests, the system achieved 100% detection of commercials. An online interface was built along with the system to allow regular inspection and maintenance.
Audio fingerprinting of commercial segments was used for the detection of commercials. Fingerprint matching has very high accuracy when dealing with audio; even severely distorted audio can be recognized reliably. The major problem was that the system had to be generic across all TV broadcasts, regardless of audio/video quality, aspect ratio, and other transmission-related errors. Audio fingerprinting provides a solution to all of these problems. After implementation it turned out that several theoretical ways of detecting commercials suggested in the proposal were not as accurate as audio fingerprinting, and hence audio fingerprinting remained in the final version of the system.
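To illustrate the fingerprinting idea, here is a simplified Shazam-style landmark sketch: pick spectrogram peaks, pair nearby peaks, and hash (freq1, freq2, time-delta) triples. This is an assumption-laden toy, not the project's implementation (real systems use proper peak picking and database-backed offset matching):

```python
import hashlib
import numpy as np

def fingerprint(samples, win=1024, hop=512):
    """Toy landmark fingerprinter: returns (hash, frame_time) pairs.
    Matching a clip against a commercial database then reduces to
    counting hash collisions that share a consistent time offset."""
    # Magnitude spectrogram via a windowed short-time FFT.
    frames = [samples[i:i + win] * np.hanning(win)
              for i in range(0, len(samples) - win, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1))
    # Crude peak picking: strongest frequency bin in each frame.
    peaks = [(t, int(np.argmax(frame))) for t, frame in enumerate(spec)]
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 4]:  # pair with a few nearby peaks
            key = f"{f1}|{f2}|{t2 - t1}".encode()
            hashes.append((hashlib.sha1(key).hexdigest()[:10], t1))
    return hashes

# A distorted copy of a commercial still shares many of these hashes
# with the stored original, which is what makes matching robust.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
```

Because the hashes encode relative frequencies and time gaps rather than raw audio, they survive volume changes and moderate transmission noise, which is exactly the robustness the system needed.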
Initially, the user works from a set of hand-tagged commercials; a set of 60 hand-labelled commercials is provided to start with. The system detects these commercials in the TV segment and, on finding them, divides the entire broadcast into blocks. Each block can then be viewed and, where appropriate, tagged as a commercial by the user. This constitutes the maintenance of the system, and takes about 10-30 minutes for a 1-hour TV segment, depending on the number of commercials that have to be tagged.
Once the database holds an appreciable number of commercials (usually around 30 per channel), it can be used to recognize commercials in any unknown TV segment.
On running the system on any video, one can expect the following format of output file:
00:00:00 - 00:00:38 = Unclassified
00:00:38 - 00:01:20 = ad by Jeopardy
00:01:21 - 00:02:20 = ad by Jerome’s
… and so on
The output lists the location of each commercial in the video and the content it advertises.
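For downstream use, the listing format above is straightforward to parse. A small helper along these lines (illustrative, not part of the system) converts it into (start, end, label) tuples with times in seconds:

```python
import re

# One line per segment: "HH:MM:SS - HH:MM:SS = label"
_ROW = re.compile(r"(\d+):(\d+):(\d+) - (\d+):(\d+):(\d+) = (.+)")

def parse_ad_listing(text):
    """Parse the detector's output listing into (start_s, end_s, label)."""
    rows = []
    for line in text.splitlines():
        m = _ROW.match(line.strip())
        if not m:
            continue  # skip trailers like "... and so on"
        h1, m1, s1, h2, m2, s2 = (int(g) for g in m.groups()[:6])
        rows.append((h1 * 3600 + m1 * 60 + s1,
                     h2 * 3600 + m2 * 60 + s2,
                     m.group(7)))
    return rows

listing = """00:00:00 - 00:00:38 = Unclassified
00:00:38 - 00:01:20 = ad by Jeopardy"""
rows = parse_ad_listing(listing)
```

With the segments in this form, cutting commercials out of a recording or tallying them per channel becomes a few lines of scripting.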
In case the system missed a few commercials, one can edit this file through a web interface. On making changes in the web interface, the system updates its database with the new or edited commercials. The web interface can also be used for viewing the detected commercials.
This system is extremely useful for people with the following interests:
- People who despise commercials by a particular company, or who despise commercials altogether, can use it to detect the location of commercials. With the help of the script ffmpeg.py (part of the system), one can remove all detected commercials or selectively choose to keep some. In this way, it can be used as an alternative to TiVo (with a few exceptions, though).
- Those working on building a fully automatic commercial detection system can use it to reliably build training data. This method promises to be much faster and more effective than the usual approach of building the entire training set by manually tagging content.
- TV broadcasters who wish to employ location-based targeted ads can use it too.