f. Red Hen Lab - Google Summer of Code 2015 Report
Red Hen Summer of Code 2015
Eight students worked in or with Red Hen Summer of Code 2015, seven of them on Audio Signal Processing. Five were supported by Google (three from Red Hen and two from another GSoC-managed organization) and the other three through other Red Hen funding, all on effectively identical terms. Red Hen has a small army of mentors for these students and maintains a wiki for coordination. The eight students and projects are:
Ekaterina Ageeva. Front-end for the mwetoolkit. Moscow.
Mattia Cerrato. Foundation for a hierarchical audio analysis tool. Torino.
Aditya Dalwani. OpenEMO. Bangalore.
Sri Harsha. Paralinguistics. Hyderabad.
Sai Krishna Rallabandi. International Institute of Information Technology, Hyderabad, India. Multi-language Forced Alignment in a Heterogenous Corpus. Update 2016-06-26: Sai Krishna has been admitted to the PhD program at the Language Technologies Institute at Carnegie Mellon University.
Karan Singla. Diarization. Hyderabad. Update 2016-06-26: Karan Singla has been admitted to the PhD program at the Signal Analysis and Interpretation Lab (SAIL) in the Viterbi School of Engineering at the University of Southern California.
Owen He. Conceptors. Bremen.
Vasanth Kalingeri. Commercial detection. Bangalore.
WHAT IS RED HEN?
The Distributed Little Red Hen Lab is an international consortium for research on multimodal communication. We develop open-source tools for joint parsing of text, audio/speech, and video, using datasets of various sorts, most centrally a very large dataset of international television news called the UCLA Library Broadcast NewsScape.
Faculty, staff, and students at several universities around the world contribute to Red Hen Lab, including UCLA, FAU Erlangen, Case Western Reserve University, Centro Federal de Educação Tecnológica in Rio de Janeiro, Oxford University, Universidad de Navarra, University of Southern Denmark, and more. Red Hen uses 100% open source software. In fact, not just the software but everything else—including recording nodes—is shared in the consortium.
Among other tools, we use CCExtractor, ffmpeg, and OpenCV (they have all been part of GSoC in the past).
WHO USES RED HEN'S INFRASTRUCTURE?
Section 108 of the U.S. Copyright Act permits Red Hen, as a library archive, to record news broadcasts from all over the world and to loan recorded materials to researchers engaged in projects monitored by the Red Hen directors and senior personnel. Section 108 restrictions apply only to the corpus of recordings, not to the software. Because everything we do is open source, anyone can replicate our infrastructure.
Participants in the Summer of Code have had full access to the main NewsScape dataset at UCLA. Applicants also had full access to sample datasets.
WHAT ARE WE WORKING WITH?
The Red Hen archive is a huge repository of recordings of TV programming, processed in a range of ways to produce derived products useful for research, expanded daily, and supplemented by various sets of other recordings. Our challenge is to create tools for searching, parsing, and analyzing the video files, so that the audio, visual, and textual (closed-captioning) information in the corpus can be accessed in various ways. Because the archive is so large, however, processes that scan the entire dataset are time consuming and carry a margin of error.
The stats as of 2015-11-08 are:
Total networks: 38
Total series: 2,354
Total duration in hours: 264,829 (224,988)
Total metadata files (CC, OCR, TPT): 704,381 (601,983)
Total words in metadata files (CC, OCR, TPT): 3.27 billion, 3,269,446,453 exactly (2.81)
Total caption files: 340,863 (289,463)
Total words in caption files: 2.16 billion, 2,163,787,674 exactly (1.86)
Total OCR files: 331,706 (284,482)
Total TPT files: 31,812 (28,038)
Total words in OCR files: 728.78 million, 728,779,461 exactly (619.10)
Total words in TPT files: 376.88 million, 376,879,318 exactly (331.77)
Total video files: 340,705 (289,315)
Total thumbnail images: 95,338,404 (80,995,756)
Storage used for core data: 83.82 terabytes (71.64)
Our ideas page for GSoC 2015 challenged students to assist in a number of projects, all of which have successfully improved our ability to search, parse, and extract information from the archive.
WHAT HAVE WE ACCOMPLISHED?
Google Summer of Code 2015 Projects:
Ekaterina Ageeva - Multiword Expression Tagger & Frontend (For Red Hen Lab)
Karan Singla - Speaker Diarization (For CCExtractor)
Sai Krishna - Forced Alignment (For CCExtractor)
Owen He - Audio Analysis Using Conceptors (For Red Hen Lab)
Vasanth Kalingeri - Commercial Detection (For Red Hen Lab)
Aditya Dalwani - OpenEMO Emotion Detection (For Red Hen Lab)
Red Hen Summer of Code 2015 Projects:
Red Hen Lab Summer of Code 2015 projects were supported by funding from the Alexander von Humboldt Foundation - Anneliese Maier Research Prize.
EKATERINA AGEEVA - MULTIWORD EXPRESSION TOOLKIT
References: MWEtoolkit / Interface Demo / Initial Proposal
A linguistic corpus is a large collection of annotated texts. The most common annotation includes part-of-speech, grammatical, and syntactic information; other information may be specified depending on the purpose of a given corpus.
This project aims at facilitating a specific corpus annotation task, namely the tagging of multiword expressions. Such expressions include, for example, English phrasal verbs (look forward to, stand up for) or other expressions with a fixed structure (e.g. unit of time + manner-of-motion verb: time flies, the years roll slowly).
The goal of the project is to develop an integrated, language-agnostic pipeline from user input of multiword expressions to a fully annotated corpus. The annotation will be performed by an existing tool, the mwetoolkit. In the course of the project the following components will be developed: utility scripts that perform input and output conversion, a backend that communicates with the mwetoolkit, and a frontend that allows the user to customize their tagging task while minimizing the amount of interaction with the tagger. As a result, the multiword expression tagging task will be automated to the greatest possible extent, and more corpora will benefit from an additional level of annotation.
The mwetoolkit is a tool for detecting multi-word units (e.g. phrasal verbs or idiomatic expressions) in large corpora, and it operates via a command-line interface. To ease access and expand the toolkit's audience, Ekaterina developed a web-based interface that builds on and extends the toolkit's functionality.
The interface allows us to do the following:
- Upload, manage, and share corpora
- Create XML patterns which define constraints on multiword expressions
- Search the corpora using the patterns
- Filter search results by occurrence and frequency measures
- Tag the corpora with obtained search results
A typical workflow goes as follows:
1. Create an account (used to store search history and manage corpus permissions)
2. Upload a corpus. Directly after upload, the corpus is indexed with the mwetoolkit to speed up search and tagging.
3. Create search patterns using the Pattern builder
4. Search and/or tag the corpus with the patterns
5. Download the modified corpus file after tagging.
The interface is built with Python/Django. It currently supports operations on corpora tagged with the Stanford CoreNLP parser, with the possibility of extending to other formats supported by the mwetoolkit. The system uses part-of-speech and syntactic dependency information to find the expressions. Users may rely on various frequency metrics to obtain the most relevant search results.
KARAN SINGLA - SPEAKER DIARIZATION
References: GitHub / Blog / Initial Proposal
According to UC Berkeley's ICSI Lab, speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions using an unsupervised algorithm. Given an audio track of a group in conversation, a speaker-diarization system will (1) identify speaker turns, the points at which one speaker stops speaking and another begins; (2) identify the number of speakers in the audio track; and (3) label those speakers as A, B, C, etc., and recognize them throughout the conversation. This requires both speech and non-speech detection, as well as overlap detection and resolution.
(Image Reference: http://multimedia.icsi.berkeley.edu/speaker-diarization)
Karan Singla built a model using pyCASP that identifies speaker turns in the audio of the broadcast speech corpus. Karan's work has shown 65% accuracy on the entire NewsScape. His project originally embarked on this task using LIUM, but by the end of the project pyCASP proved more accurate.
His method and results are as follows:
1. Data Pre-Processing
News data is generally multi-channel; it needs to be converted to a single channel and would then normally require beamforming. However, we do not need beamforming here, as all the data we receive is single-channel.
2. Differentiating Speech & Non-Speech
As seen from the news corpus, it contains large portions that should be regarded as music, advertisements, etc. The system therefore needs to separate the speakers' speech from such material, so that it knows which segments to cluster when learning a speaker profile.
3. Audio Segmentation
This is generally done using popular algorithms such as BIC clustering, i-vector clustering, etc.
4. Hierarchical Speaker Clustering
Single-show: once we have segments that mark speaker changes, these segments are hierarchically clustered so that segments spoken by the same speaker are merged under the same label.
Cross-show: once we have speaker segments for one show (news session), the speakers are again hierarchically clustered to merge speakers who are common across various shows on the same network.
Example: if various news sessions are anchored by Natalie Allen, she will be recognized as Natalie Allen across all of those shows.
Why must this be hierarchical?
We do not know the number of speakers in advance, and no such annotation is available for the data.
Weeks 1 - 2, Work Results:
By week 2, Karan had built the data pre-processing module to do the following (a sketch of these steps follows below):
1. Extract the audio from ".mp4" files into ".wav" using ffmpeg
2. Use the sox command to convert the file to ".sph" format and down-mix the data to a single channel
3. Save the data in the same data folder, organized by network (this will help with cross-show diarization)
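Karan's actual pre-processing module lives in his repository; the following is only a minimal sketch of the same three steps, assuming ffmpeg and sox are installed and that the network name can be read from the NewsScape file name (fourth underscore-separated field, e.g. "CNN").

import os
import subprocess

def preprocess(mp4_path, out_root):
    """Sketch: mp4 -> mono 16 kHz wav (ffmpeg) -> sph (sox), saved per network."""
    base = os.path.splitext(os.path.basename(mp4_path))[0]
    network = base.split("_")[3]                 # assumption: NewsScape naming convention
    out_dir = os.path.join(out_root, network)
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)

    wav = os.path.join(out_dir, base + ".wav")
    sph = os.path.join(out_dir, base + ".sph")

    # 1. extract the audio track, down-mixed to one channel at 16 kHz
    subprocess.check_call(["ffmpeg", "-y", "-i", mp4_path,
                           "-vn", "-ac", "1", "-ar", "16000", wav])
    # 2. convert the wav to NIST SPHERE format
    subprocess.check_call(["sox", wav, sph])
    return sph

# preprocess("2014-01-01_0000_US_CNN_Erin_Burnett_Out_Front.mp4", "audio_by_network")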
*A sample LIUM run takes the pre-processing output as its input.
Weeks 3 - 4, Work Results:
In weeks 3-4, Karan wrote a parser that inserts the speaker turn and speaker label information (label, M/F) into the closed-caption file. He also wrote a reverse parser that normalizes ".tpt" files (the ones marked with speaker information and speaker turns) to look like LIUM output and makes them ready for evaluation.
*ALERT: TPT files have inconsistencies, so it may be better to stick to the NIST TV news diarization dataset.
Observations Karan made about the LIUM output:
1. LIUM consistently recognizes more speakers than are present.
2. The SMS (speech/music/silence) recognition module always returned "U" (unknown), which led to some music and silence being marked as separate clusters. (This will be considered in further evaluation by marking SMS with an external tool and then discarding those segments.)
THE BIG QUESTION: Is LIUM producing false positives?
Answer: Not really. Some of them may be correctly clustered segments (commercials, music, etc.) that should have been ignored, but we do not have a module that separates them out (there is no data to train such a model).
So what can be done?
It is hard to detect noise using LIUM, as that requires one of two things:
1. We would need data for every speaker (e.g. to train a GMM model for each speaker). However, this is not possible.
2. The other option is to provide speech/music GMM models, or to do something like "audio fingerprinting," which Mattia is working on.
Karan's solution?
He found a demo of the ALIZE speaker identification toolkit, which has pre-trained GMM models for music, music+speech, and speech. He used these to discard audio segments classified as music and then generated the segmentation and clustering, ending with a speaker label for each segment.
Use "spk_det.sh" in the "ALIZE_spk_det" folder to generate output for an mp4 file.
Making Ground-Truth Files
What is a ground-truth file?
In simple words, a file with 100% correct speaker boundaries, which ignores all types of noise that should not belong to any speaker.
Currently all types of noise (music, commercials) are part of the cluster IDs. Karan removed the cluster IDs covering segments that fall within the duration of "♪" markers in the ".tpt" files, which somewhat reduces the number of cluster IDs for each show (see the results document on the LIUM output, which lists the number of cluster IDs obtained for each file). But the input files contain many other noises that are not really speakers and that are not discarded using the TPT files.
We examined a ".tpt" file corresponding to an audio file and observed the following:
1. The "NER01" tag can give you the speaker boundaries.
2. BUT these boundaries have a variable delay.
3. A timestamp of 1330 in a TPT file means 13:30 minutes, not 1330 seconds.
4. Many short noise segments are ignored and treated as part of the neighboring audio stamps, adding to the delays.
Therefore, Karan wrote a script called "tpt2rttm.sh" (found in the scripts folder) to convert ".tpt" files to ".rttm" format so that they can be used as ground truth for evaluation purposes.
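tpt2rttm.sh itself is a shell script; the snippet below is just an illustrative Python sketch of the target format, assuming the speaker turns have already been parsed out of the ".tpt" file as (start, end, speaker) tuples in seconds. It writes the NIST RTTM "SPEAKER" records that scoring tools such as md-eval expect.

def turns_to_rttm(turns, file_id, rttm_path):
    """Write speaker turns as RTTM 'SPEAKER' records.

    turns: iterable of (start_sec, end_sec, speaker_label) tuples,
           assumed to be already extracted from the .tpt file.
    """
    with open(rttm_path, "w") as out:
        for start, end, speaker in turns:
            duration = end - start
            # fields: type file channel onset duration ortho stype name conf slat
            out.write("SPEAKER {} 1 {:.2f} {:.2f} <NA> <NA> {} <NA> <NA>\n".format(
                file_id, start, duration, speaker))

# turns_to_rttm([(12.0, 25.5, "Natalie_Allen")],
#               "2014-01-01_0000_US_CNN_Erin_Burnett_Out_Front",
#               "ground_truth.rttm")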
Results?
The results below are for one file, "2014-01-01_0000_US_CNN_Erin_Burnett_Out_Front.mp4", for which the output of the module above was taken as ground truth:
TOTAL SPK = 53 (original 43, as shown by the TPT file)
EVAL TIME = 3572.00 secs
EVAL SPEECH = 3468.75 secs ( 97.1 percent of evaluated time)
SCORED TIME = 901.61 secs ( 25.2 percent of evaluated time)
SCORED SPEECH = 810.60 secs ( 89.9 percent of scored time)
---------------------------------------------
MISSED SPEECH = 13.14 secs ( 1.5 percent of scored time)
FALARM SPEECH = 91.01 secs ( 10.1 percent of scored time)
---------------------------------------------
SCORED SPEAKER TIME = 810.60 secs (100.0 percent of scored speech)
MISSED SPEAKER TIME = 13.14 secs ( 1.6 percent of scored speaker time)
FALARM SPEAKER TIME = 91.01 secs ( 11.2 percent of scored speaker time)
SPEAKER ERROR TIME = 445.53 secs ( 55.0 percent of scored speaker time)
---------------------------------------------
OVERALL SPEAKER DIARIZATION ERROR = 67.81 percent (ALL)
There can be many possible reasons for bad results:
1. In the TPT files there is a variable delay on each speaker turn, which can lead to missed speech and particularly to false alarms.
2. The SAD (speech activity detection) module needs to be improved to remove segments containing commercials, music, or other types of noise. We are trying to figure out whether we can get the audio segments with commercials and music marked separately in the ".seg" files, so that we can train our own GMM models to remove these types of segments during clustering.
3. Bad clustering: Karan is trying additional clustering methods, such as ILP- and BIC-based clustering, to check whether they produce better results.
Diarization using pyCASP
In the last leg of GSoC, we experimented with PyCASP on the HPC cluster at CWRU to adapt it for speaker diarization on our NewsScape corpus.
PyCASP is designed to run with CUDA as its backend, which allows it to exploit GPUs and parallel programming to make the system fast.
Diarization using PyCASP vs LIUM
Diarization on the same file referred to above attained better results, and PyCASP detects speaker boundaries nearly perfectly.
Evaluation with LIUM on file 2014-01-01_0000_US_CNN_Erin_Burnett_Out_Front.mp4 (Time taken: 1 hour 52 minutes)
EVAL TIME = 3572.00 secs
EVAL SPEECH = 3468.75 secs ( 97.1 percent of evaluated time)
SCORED TIME = 901.61 secs ( 25.2 percent of evaluated time)
SCORED SPEECH = 810.60 secs ( 89.9 percent of scored time)
---------------------------------------------
MISSED SPEECH = 13.14 secs ( 1.5 percent of scored time)
FALARM SPEECH = 91.01 secs ( 10.1 percent of scored time)
---------------------------------------------
SCORED SPEAKER TIME = 810.60 secs (100.0 percent of scored speech)
MISSED SPEAKER TIME = 13.14 secs ( 1.6 percent of scored speaker time)
FALARM SPEAKER TIME = 91.01 secs ( 11.2 percent of scored speaker time)
SPEAKER ERROR TIME = 445.53 secs ( 55.0 percent of scored speaker time)
---------------------------------------------
OVERALL SPEAKER DIARIZATION ERROR = 67.81 percent (ALL)
Evaluation on the same file using PyCASP (Time taken: 7 minutes 32 seconds)
EVAL TIME = 3571.72 secs
EVAL SPEECH = 3469.12 secs ( 97.1 percent of evaluated time)
SCORED TIME = 3571.72 secs (100.0 percent of evaluated time)
SCORED SPEECH = 3469.12 secs ( 97.1 percent of scored time)
---------------------------------------------
MISSED SPEECH = 0.39 secs ( 0.0 percent of scored time)
FALARM SPEECH = 102.60 secs ( 2.9 percent of scored time)
---------------------------------------------
SCORED SPEAKER TIME = 5205.92 secs (150.1 percent of scored speech)
MISSED SPEAKER TIME = 1736.37 secs ( 33.4 percent of scored speaker time)
FALARM SPEAKER TIME = 103.39 secs ( 2.0 percent of scored speaker time)
SPEAKER ERROR TIME = 989.01 secs ( 19.0 percent of scored speaker time)
---------------------------------------------
OVERALL SPEAKER DIARIZATION ERROR = 54.34 percent (ALL)
Here, however, PyCASP recognized just two speakers; if the "initial_clusters" parameter in "03_diarizer_cfg.py" is increased, the results are quite good:
200 initial clusters : 14 speakers : 38.24% diarization error : 3 hours 14 min
300 initial clusters : 18 speakers : 23.24% diarization error : 6 hours 32 min
400 initial clusters : 24 speakers : 18.24% diarization error : 9 hours 51 min
Karan experimented with increasing the initial number of clusters to check its effect on the diarization output; it was seen that high-quality speaker diarization can be done using PyCASP, but the time complexity increases many fold.
The scripts can also be downloaded from the GitHub repo (pycasp folder). The repository also has scripts for using the ALIZE and LIUM diarization toolkits on the NewsScape corpus.
Topics for future work:
1. Ways to use multiple GPUs, and other ways to balance quality and time complexity
2. Making it part of the bigger pipeline so that diarization can be used for other tasks
3. Adding a speech/non-speech module so that non-speech patterns can be filtered out before the audio is passed to PyCASP to diarize
4. Making a comparison on a bigger dataset
SAI KRISHNA - MULTI-LANGUAGE FORCED ALIGNMENT OF PHONIC AND WORD LABELS IN A HETEROGENEOUS CORPUS
References: GitHub / Summary of Results in PDF / Blog / Initial Proposal
Sai Krishna has successfully built a forced alignment system that reduced the word error rate when aligning broadcast speech with its closed captioning, by developing a method for data pruning based on a phone confidence measure.
This allowed the calculation of a confidence measure, based on the estimated posterior probability, to prune out bad data. The posterior probability, by definition, reflects the correctness or confidence of a classification. In speech recognition, the posterior probability of a phone or word hypothesis w given a sequence of acoustic feature vectors O_1^T = O_1 O_2 ... O_T is computed (as in the equation below) as the likelihood of all paths passing through that particular phone/word (in roughly the same time region) in the lattice, normalized by the total likelihood of all paths in the lattice. It is computed using the forward-backward algorithm over the lattice. His model improved the word error rate by ~14%.
Let Ws and We respectively denote the word sequences preceding and succeeding the word w whose posterior probability is to be computed, and let W' denote the word sequence (Ws w We). Then

    p(w | O_1^T) = [ Σ_{Ws, We} p(O_1^T | W') p(W') ] / p(O_1^T)

In the equation above, p(O_1^T) in the denominator is approximated as the sum of the likelihoods of all paths in the lattice. As can be seen, the posterior score also has a contribution from the LM (the term p(W') in the numerator, which is the LM likelihood). Hence, we consider the score after the LM contribution is nullified, so that the posterior score purely reflects the acoustic match. The posterior score also depends on the acoustic model, on the (mis)match between training and test conditions, and on the error rate. To reduce the dependence of the posterior on one single acoustic model, we re-compute the posteriors after rescoring the same lattices with a different acoustic model, such as one trained on articulatory features. Such system combination makes the posteriors more reliable.
The Sequence of Algorithmic Steps which gave the best result:
> Monophone Decoding
> Triphone Decoding - Pass 1 ( Deltas and Delta-Deltas and Decoding)
> Triphone Decoding - Pass 2 ( LDA + MLLT Training and Decoding )
> Triphone Decoding - Pass 3 ( LDA + MLLT + SAT + MPE Training and Decoding)
> Subspace Gaussian Mixture Model Training
> MMI + SGMM Training and Decoding
> DNN Hybrid Training and Decoding
> System Combination ( DNN + SGMM)
Module Overview:
Steps to Train and Test:
Steps Added to Increase Efficiency:
1. Add out-of-vocabulary words to the dictionary and obtain phonetic sequences for them using Sequitur G2P
2. Rebuild the language model
3. Decode
Explanation:
Using free audiobooks from Librivox that contain read speech of books recorded from Project Gutenberg, Sai Krishna first used the audiobooks without their corresponding text and tried to obtain accurate text (transcripts and timestamps) using the open-source Librispeech acoustic and language models available at kaldi-asr.org. According to Sai Krishna, "The LM plays an important role in the search space minimization and also in the quality of the competing alternative hypotheses in the lattice. Here, the Librispeech LM is used, as the task is to decode audiobooks available at Librivox. The Librispeech LM won't be that useful if we're to decode the audio of a lecture or a speech which encapsulates a specific topic and vocabulary. In such a case, it would be better to prepare a new LM based on the text related to the topic in the audio and interpolate it with an LM prepared on the Librispeech corpus."
Why did we use Librispeech?
Librispeech is the largest open-source continuous speech corpus in English, with approximately 1000 hours available for use. It consists mainly of two parts: 460 hours of clean data and 500 hours of noisier, more challenging data. Because our goal is to decode audiobooks and consequently develop a USS system, using models trained on Librispeech data seemed to be the best choice.
Preparing Librispeech data for decoding:
Each broadcast news item consists of an approximate transcription file and a video file from which the audio needs to be extracted. The audio files were downloaded and converted to 16 kHz WAV format. These wavefiles were then chopped at silence intervals of 0.3 seconds or more to create phrasal chunks averaging 15 seconds in length, and the chunks were power-normalized. An average chunk length of 15 seconds is sufficient to capture intonation variation and does not create memory problems during Viterbi decoding. Decoding is also observed to be faster and more accurate than when it is performed on much longer chunks.
Obtaining Accurate Hypotheses and Timestamps:
To prune out noisy, disfluent, or unintelligible regions from the audio, we also need a confidence (or posterior probability) score that reflects the acoustics reliably. In addition, the confidence score should not reflect the language model score; it should reflect purely the acoustic likelihood.
1. Feature extraction: The first step is to extract features from audiobooks. 39 dimensional acoustic feature vectors (12 dimensional MFCC and normalized power, with their deltas and double-deltas) are computed. Cepstral mean and variance normalization is applied. The feature vectors are then spliced to form a context window of seven frames (three frames on either side) on which linear and discriminative transformation such as Linear Discriminant Analysis is applied which helps achieve dimensionality reduction.
2. Decoding the audiobook using speaker adapted features: We use the p-norm DNN-HMM speaker independent acoustic model trained on 460 hours of clean data from Librispeech corpus for decoding. The decoding is carried out in two passes. In the first pass, an inexpensive LM such as pruned trigram LM is used to constrain the search space and generate a lattice for each utterance. The alignments obtained from the lattices are used to estimate speaker dependent fMLLR or feature-space MLLR transforms. In the second pass of decoding, an expensive model such as unpruned 4-gram LM is used to rescore the lattice, and obtain better LM scores.
Combination of phone and word decoding: The lattices generated in the previous step don’t simply contain word hypotheses, but instead contain a combination of phone and word hypotheses. Phone decoding, in tandem with word decoding helps reduce errors by a significant proportion in the occurrence of out-of-vocabulary words or different pronunciations of in-vocabulary words. A combination of phone and word decoding can be performed by simply including the phones in the text from which the LM is prepared. An example to highlight the use of this technique is as below. We can observe the sequence of phones hypothesized because of the difference in pronunciations of the uttered word and its pronunciation in the lexicon.
Example:
Bayers b ey er z (pronunciation in the lexicon)
Reference: Performed by Catherine Bayers
Hypothesis: Performed by Catherine b ay er z
3. Improving decoded transcripts using LM interpolation: We find the 1-best transcripts from the lattices generated in the previous step. These 1-best transcriptions encapsulate the specific vocabulary, topic and style of the book. As a result, a LM computed purely from the decoded text is expected to be different and more relevant for recognition compared to a LM prepared from all Librispeech text. We exploit this fact to further improve the decoding by creating a new LM which is a linear interpolation of LM prepared on decoded text and LM prepared on entire Librispeech text. The LM interpolation weight for the decoded text is set to 0.9 to create a strong bias towards the book text. A new lattice is then generated using the new LM.
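In practice the interpolation is done with an LM toolkit (SRILM, for instance, supports mixing models via ngram -mix-lm -lambda); the toy sketch below only illustrates the arithmetic of the 0.9/0.1 linear interpolation, with the two callables standing in for the book-specific LM and the Librispeech LM.

def interpolated_prob(p_book, p_libri, word, history, lam=0.9):
    """Linear interpolation: lam * P_book(w|h) + (1 - lam) * P_libri(w|h)."""
    return lam * p_book(word, history) + (1.0 - lam) * p_libri(word, history)

# Toy stand-ins for the two LMs (real models would come from an LM toolkit):
p_book = lambda w, h: {"peppers": 0.02}.get(w, 1e-6)
p_libri = lambda w, h: {"peppers": 0.0001}.get(w, 1e-6)
print(interpolated_prob(p_book, p_libri, "peppers", ()))  # heavily biased towards the book LM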
4. Nullifying LM likelihood before computing posteriors: Our goal is to obtain a 1-best hypothesis and associated posterior scores that match well with the acoustics, and have little or no influence of language on them, for USS task. Every word/phone hypothesis in a particular lattice of an utterance carries an acoustic and language model likelihood score. In a pilot experiment, we tried generating a lattice based upon pure acoustic score in the following way. We prepared a unigram LM from text containing a unique list of in-vocabulary words and phones. Just one single occurrence of each word and phone in the text made sure that frequency, and consequently the unigram probability of each word and phone is the same. It was found that the 1-best hypothesis produced by this method was nowhere close to the reference sequence of words. This outcome was understandable as we had not put any language constraints, and the decoder was tied down to choose between several words (~200,000) in the lexicon just on the basis of acoustics. Understandably, the phone hypothesis obtained by lexicon look up was also worse. The example below demonstrates the large error in the hypothesis when a unigram LM was used.
Reference: RECORDING BY RACHEL FIVE LITTLE PEPPERS AT SCHOOL BY MARGARET SIDNEY.
Unigram LM: EDINGER RACHEL FADLALLAH PEPPERS SAT SQUALL PRIMER GRITZ SIDNEY.
We therefore resorted to the following approach. Rather than using the above-mentioned unigram LM from the start, i.e. for the generation of lattices, it proved more useful to rescore the lattices (obtained in the previous step after LM rescoring and LM interpolation), which contain alternative hypotheses much closer to the sequence of reference words. The posteriors thus obtained also reflect pure acoustics. The sentences below show the 1-best output of the lattice from the previous step (after LM rescoring and interpolation), and the 1-best output after rescoring the same lattice with a unigram LM having equal unigram probabilities for each in-vocabulary word.
4-gram LM: READING BY RACHEL FIVE LITTLE PEPPERS AT SCHOOL BY MARGARET SIDNEY.
Nullified LM: READING MY RACHEL SIL FILE IT ILL PEPPERS AT SCHOOL BY MARGARET SIDNEY.
The second hypothesis is closer to the acoustics. Differences between the two hypotheses are italicized. It is clear that the 1-best transcription is better and closer to the acoustics when the unigram LM is used to rescore the lattice generated in the previous step, rather than to generate the lattice from scratch. Consequently, the phone-level transcripts are also better, and the posteriors purely reflect the acoustic match.
5. Articulatory rescoring and system combination: The lattices generated in the previous step are rescored using a pnorm DNN-HMM acoustic model, trained on articulatory features, and speaker adapted articulatory features to yield a new lattice. This new rescored lattice is then combined with the original lattice to form a combined lattice. Pure articulatory feature based recognition is not as robust, and hence lattices are not generated using the acoustic model trained on articulatory features, and it is rather used for rescoring the lattice generated using acoustic model trained on MFCC. Lattice combination provides the advantage that two lattices scored with two different models and features contain complementary information, which yields a lattice with more robust acoustic scores. The 1-best hypothesis obtained from the above lattice is also more accurate. Word lattices are then converted to phone lattices. The 1-best phone sequence from the phone lattice along with the posteriors is what we use for building USS system.
Owen He also assisted in this project by:
1. Shifting from Python 3 to Python 2
2. Increasing the speed of training by replacing the GMM from Sklearn with the GMM from pyCASP
3. Adding functions to recognize features directly, so that the system is ready for the shared features from the pipeline
4. Returning the log likelihood of each prediction, so that one can reject untrained classes and filter out unreliable prediction results; this can also be used to search for speakers by looking for predicted speakers with high likelihood
5. Incorporating Karan's speaker diarization results
6. Making the output file format consistent with other Red Hen output files; an example output file produced from the same video that Karan used can be found here.
Additional information about Owen He's work on Speaker Identification can be found here, under the files "Pipeline" and "Speaker."
OWEN HE - AUDIO ANALYSIS BY CONCEPTORS / EMOTION RECOGNITION AND TONE CHARACTERIZATION BY DNN + ELM
References: GitHub / Summary of Results / Initial Proposal
Owen He used a reservoir computing method called conceptors, together with traditional Gaussian Mixture Models (GMM), to distinguish the voices of different speakers. He also used a method proposed by Microsoft Research last year at the Interspeech conference, which uses a Deep Neural Network (DNN) and an Extreme Learning Machine (ELM) to recognize speech emotions: the DNN is trained to extract segment-level (256 ms) features, and the ELM is trained to make decisions based on the statistics of these features at the utterance level. Owen's project focused on applying this to detect male and female speakers, specific speakers, and emotions by collecting training samples from different speakers and from audio signals with different emotional features. He then preprocessed the audio signals and created the statistical models from the training dataset. Finally, he computed the combined evidence in real time and tuned the apertures for the conceptors so that the optimal classification performance could be reached. You can check out the summary of results on GitHub.
His method is as follows (a brief numerical sketch of the classification step appears after the list):
1. Create a single, small (N=10 units) random reservoir network.
2. Drive the reservoir, in two independent sessions, with 100 preprocessed training samples of each gender, and create conceptors C_male, C_female respectively from the network response.
3. In exploitation, a preprocessed sample s from the test set is fed to the reservoir; the induced reservoir states x(n) are recorded and transformed into a single vector z. For each conceptor, the positive evidence quantities z' C_male z and z' C_female z are computed. We can then identify the gender of the speaker by taking the higher positive evidence, i.e. the speaker is male if z' C_male z > z' C_female z, and vice versa. The idea behind this procedure is that if the reservoir is driven by a signal from a male speaker, the resulting response signal z will be located in a linear subspace of the reservoir state space whose overlap with the ellipsoid given by C_male is larger than its overlap with the ellipsoid given by C_female.
4. In order to further improve the classification quality, we also compute NOT(C_male) and NOT(C_female). This leads to negative evidence quantities z' NOT(C_male) z and z' NOT(C_female) z.
5. By adding the positive evidence z' C_male z and the negative evidence z' NOT(C_female) z, a combined evidence is obtained, which can be paraphrased as "this test sample seems to be from a male speaker and seems not to be from a female speaker".
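The sketch below is a minimal numerical illustration of steps 2-5, assuming the reservoir has already been driven and its response states collected column-wise in matrices X_male and X_female (the reservoir simulation, preprocessing, and aperture tuning are omitted). It follows the standard conceptor definitions C = R (R + aperture^-2 I)^-1 and NOT(C) = I - C.

import numpy as np

def conceptor(X, aperture):
    """Conceptor C = R (R + aperture^-2 I)^-1 from reservoir states X (N x T)."""
    N, T = X.shape
    R = X @ X.T / T                                   # state correlation matrix
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(N))

def classify_gender(z, C_male, C_female):
    """Combined evidence: positive evidence for one class plus negative
    evidence NOT(other class), as described in steps 3-5."""
    I = np.eye(len(z))
    ev_male = z @ C_male @ z + z @ (I - C_female) @ z
    ev_female = z @ C_female @ z + z @ (I - C_male) @ z
    return "male" if ev_male > ev_female else "female"

# X_male, X_female: 10 x T arrays of recorded reservoir states (N = 10 units)
# C_male = conceptor(X_male, aperture=10.0); C_female = conceptor(X_female, aperture=10.0)
# gender = classify_gender(z, C_male, C_female)   # z: response vector for a test sample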
Owen replicated this procedure to detect male and female speakers, specific speakers, and emotions by collecting training samples from male and female speakers, from the specific speakers to be recognized, and from audio signals with different emotional features. He preprocessed the audio signals, created the conceptors from the training dataset, computed the combined evidence in real time, and tuned the apertures for the conceptors so that the optimal classification performance could be reached.
The training was done using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database from USC's Viterbi School of Engineering, found here.
Owen incorporates the python_speech_features library to extract MFCC features from audio files; this can be replaced by the common feature extraction component once the Red Hen Lab audio analysis pipeline is established.
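A minimal sketch of that feature-extraction step, assuming 16 kHz mono WAV input; the exact window settings and any further post-processing in Owen's pipeline may differ.

import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

def extract_mfcc_features(wav_path):
    """13 MFCCs plus deltas and double-deltas -> one 39-dimensional vector per frame."""
    rate, signal = wavfile.read(wav_path)
    feats = mfcc(signal, samplerate=rate, numcep=13)   # frames x 13
    d1 = delta(feats, 2)                               # first-order deltas
    d2 = delta(d1, 2)                                  # second-order deltas
    return np.hstack([feats, d1, d2])                  # frames x 39

# features = extract_mfcc_features("sample_utterance.wav")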
VASANTH KALINGERI - COMMERCIAL DETECTION SYSTEM
References: GitHub / Summary of Results / Initial Proposal
Vasanth Kalingeri built a system for detecting commercials in television programs from any country and in any language. The system detects the location and the content of ads in any stream of video, regardless of the content being broadcast and other transmission noise in the video. In tests, the system achieved 100% detection of commercials. An online interface was built along with the system to allow regular inspection and maintenance.
Method employed
Audio fingerprinting of commercial segments was used for the detection of commercials. Fingerprint matching has very high accuracy on audio; even severely distorted audio can be recognized reliably. The major problem was that the system had to be generic across TV broadcasts, regardless of audio/video quality, aspect ratio, and other transmission-related errors. Audio fingerprinting provides a solution to all of these problems. After implementation it turned out that several theoretical ways of detecting commercials suggested in the proposal were not as accurate as audio fingerprinting, and hence audio fingerprinting remained in the final version of the system.
Work
Initially the user supplies a set of hand-tagged commercials. The system detects this set of commercials in the TV segment and, on detecting them, divides the entire broadcast into blocks. Each of these blocks can be viewed and tagged as a commercial by the user; this constitutes the maintenance of the system. A set of 60 hand-labelled commercials is available to start with. The process takes about 10-30 minutes for a 1-hour TV segment, depending on the number of commercials that have to be tagged.
When the database has an appreciable number of commercials (usually around 30 per channel), we can use it to recognize commercials in any unknown TV segment.
Results
On running the system on any video, one can expect the following format of output file:
00:00:00 - 00:00:38 = Unclassified
00:00:38 - 00:01:20 = ad by Jeopardy
00:01:21 - 00:02:20 = ad by Jerome’s
… and so on
The above is the location of the commercials and the content it advertises in the video.
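Because the output is a plain "HH:MM:SS - HH:MM:SS = label" listing, it is easy to post-process; the sketch below (the file name and helper are hypothetical) parses it into (start, end, label) tuples.

def parse_commercial_log(path):
    """Parse 'HH:MM:SS - HH:MM:SS = label' lines into (start_sec, end_sec, label)."""
    def to_seconds(ts):
        h, m, s = (int(x) for x in ts.split(":"))
        return 3600 * h + 60 * m + s

    segments = []
    with open(path) as f:
        for line in f:
            if "=" not in line:
                continue
            span, label = line.split("=", 1)
            start, end = (to_seconds(t.strip()) for t in span.split("-"))
            segments.append((start, end, label.strip()))
    return segments

# Keep only the detected commercials:
# ads = [s for s in parse_commercial_log("detections.txt") if s[2] != "Unclassified"]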
In case the system missed detecting a few commercials, one can edit this file through a web interface which looks as follows:
On making changes to the web interface, the system updates its db with new/edited commercials. This web interface can be used for viewing the detected commercials as well.
Applications
This system is extremely useful for people with the following interests:
People who despise commercials by a particular company, or who despise commercials altogether, can use it to detect the location of commercials. With the help of the script ffmpeg.py (part of the system), one can remove all detected commercials or selectively choose to keep some. In this way it can be used as an alternative to TiVo (with a few exceptions).
Those who are working on building a very effective automatic commercial detection system can use it to reliably build training data. This method promises to be much faster and more effective than the usual way of building the entire training set by manually tagging content.
TV broadcasters who wish to employ targeted ads based on location can use it too.
MATTIA CERRATO - AUDIO FINGERPRINTING
References: GitHub / Final Report / Blog / Initial Proposal
Mattia Cerrato worked on developing audio fingerprinting solutions that adapt well to the Red Hen corpus of TV news broadcasts. He developed two applications, broadcastsegmentor and clip-find; he also contributed code that extracts the audio novelty feature.
Methods Employed
Broadcastsegmentor uses the Panako library to create and query an audio fingerprint database that is made up of commercials. After querying the database, recognition times are used to find commercial breaks in the input audio file, which is assumed to be taken from a TV News broadcast.
Clip-find also uses Panako to create and query a fingerprint database, but its usage is focused on simply finding whether, where, and when the input audio clip was broadcast.
Audio novelty is a measure introduced by Foote in a 1999 paper. Its original purpose was to find note onsets in a solo instrument performance, but it has a very broad definition that can be useful to the Red Hen Lab (that is, for saying something meaningful about a TV news corpus).
Roughly speaking, audio novelty is a measure of how surprising the audio we are currently observing is compared with the audio before and after it. It is a spectral feature: we calculate and compare the spectrum of the audio to make such a judgement. This definition is very broad, since many changes in audio can be deemed surprising (a change of speakers, a commercial starting or ending); monitoring the peaks in audio novelty throughout an audio file can find moments that relate to different audio analysis tasks.
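As a rough illustration of the idea (not Mattia's implementation, which builds on Panako and its own feature extractors), the sketch below computes a Foote-style novelty curve: a short-time magnitude spectrum, a cosine self-similarity matrix between frames, and a checkerboard kernel correlated along the matrix diagonal. The frame size and kernel width are arbitrary choices here.

import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

def novelty_curve(wav_path, kernel_size=32):
    """Foote-style audio novelty: peaks mark 'surprising' spectral changes."""
    rate, x = wavfile.read(wav_path)
    if x.ndim > 1:
        x = x.mean(axis=1)                     # down-mix to mono
    _, _, Z = stft(x, fs=rate, nperseg=1024)
    S = np.abs(Z)                              # magnitude spectrogram (freq x frames)

    # cosine self-similarity between all pairs of frames
    norms = np.linalg.norm(S, axis=0) + 1e-9
    sim = (S.T @ S) / np.outer(norms, norms)

    # checkerboard kernel: +1 within past/future blocks, -1 across them
    half = kernel_size // 2
    sign = np.ones(kernel_size)
    sign[half:] = -1
    kernel = np.outer(sign, sign)

    novelty = np.zeros(sim.shape[0])
    for i in range(half, sim.shape[0] - half):
        patch = sim[i - half:i + half, i - half:i + half]
        novelty[i] = np.sum(patch * kernel)
    return novelty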
Results and Future Developments
When broadcastsegmentor is trained with every commercial in the TV broadcast, it can detect commercial breaks with very good accuracy and timing; however, building a complete “commercial database” remains a challenge to be undertaken.
[image 1: a typical broadcastsegmentor file output: the program has been fully trained in this example]
Clip-find can be used to build a complete fingerprint database out of the Red Hen Corpus: given ~10 seconds of audio taken from a TV news program, it can find it even if the database is very large.
Future developers could be interested in developing a new fingerprint generation and matching algorithm that is faster, so that the application can work back through the corpus more quickly, or that has stronger noise resistance.
[image 2: clip-find outputs if and when the input audio (first file) was found in the fingerprinted Red Hen corpus]
The best usage for the audio novelty feature extractor would be in a machine learning approach, as a boolean feature with high recall (interesting moments are often novelty peaks) but low precision.
[image 3: In this audio file, taken from an Al Jazeera America broadcast, the novelty peaks out at the ~90th analysis frame.]
SRI HARSHA - DETECTION OF NON-VERBAL EVENTS IN AUDIO FILE
References: GitHub / Updates / Blog / Initial Proposal
Funding for this Red Hen Summer of Code was provided by the Alexander von Humboldt Foundation, via an Anneliese Maier Research Prize awarded to Mark Turner to support his research into blending and multimodal communication. Red Hen is grateful to the Alexander von Humboldt Foundation for the support.
Sri Harsha has worked to develop a module for detecting non-verbal events in an audio file. Examples of "non-verbal events" are laughs, sighs, yawns, or any expression that does not use language. Verbal sounds are the sounds that make up the words of our language; each sound has its own rules, which are followed to produce that particular sound, and the meaning conveyed by our speech depends on the sequence of sounds we produce. In our daily conversations, apart from speaking, we laugh, cry, shout, and produce sounds such as deep breathing to show exhaustion, yawning, various sighs, coughs, etc. All of these sounds/speech segments are produced by humans and do not convey any meaning explicitly, but they do provide information about the physical, mental, and emotional state of a person. These speech segments are referred to as non-verbal speech sounds.
One interesting difference between verbal and non-verbal speech sounds is that a person cannot understand the verbal content without knowing the language, whereas irrespective of the language we can perceive non-verbal speech sounds and gain some knowledge of the emotional or physical state of the speaker.
Approach:
An algorithm for detecting laughter segments was developed by considering the features and common patterns exhibited by most laughter segments.
Main steps involved in the algorithm are:
1. Pre-processing
2. Feature extraction
3. Decision logic
4. Post-processing
1. Pre-processing
Most of the laughter segments are voiced. Hence, the first step of the algorithm involves the voiced non-voiced segmentation of the speech signal. After the voiced non-voiced segmentation, only the voiced segments are considered for further analysis.
The pre-processing step involves the voiced non-voiced (VNV) segmentation of the given audio signal and also extraction of epoch locations.
VNV segmentation is performed using the energy of the signal obtained as the output of a zero-frequency resonator (ZFR). The VNV segmentation follows the method explained in [6]. The steps involved in this method are:
The given audio signal is passed through a cascade of two zero-frequency resonators (ZFR).
The energy of the ZFR output signal is calculated.
The VNV decision is obtained by placing a threshold on the energy of the ZFR output signal.
An epoch is the location of a significant excitation of the vocal tract system. All excitation-source features in this algorithm are extracted around the epoch locations, because epochs are regions with a high signal-to-noise ratio and are more robust to environmental degradation than other regions of the speech signal. In this algorithm, epoch locations are obtained using the method explained here. (A sketch of this pre-processing stage is given below.)
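A compact numpy sketch of this pre-processing stage, following the usual zero-frequency-filtering formulation: difference the signal, pass it through a cascade of resonators at 0 Hz, remove the slowly varying trend with a moving average, threshold the short-term energy for the VNV decision, and take positive-going zero crossings of the filtered signal as epoch candidates. The window lengths and the energy threshold here are illustrative assumptions, not Sri Harsha's exact settings.

import numpy as np

def zero_frequency_filter(x, fs, trend_win_ms=10):
    """Zero-frequency filtered signal (trend-removed output of two 0-Hz resonators)."""
    x = np.asarray(x, dtype=float)
    y = np.diff(x, prepend=x[0])                      # remove DC offset
    win = max(3, int(fs * trend_win_ms / 1000) | 1)   # odd moving-average length
    kernel = np.ones(win) / win
    for _ in range(2):                                # two ideal resonators at 0 Hz
        y = np.cumsum(np.cumsum(y))
        y = y - np.convolve(y, kernel, mode="same")   # remove the growing trend
    return y

def vnv_and_epochs(x, fs, frame_ms=20, energy_frac=0.1):
    """Frame-level voiced/non-voiced decisions and epoch locations (sample indices)."""
    zff = zero_frequency_filter(x, fs)
    frame = int(fs * frame_ms / 1000)
    n_frames = len(zff) // frame
    energy = np.array([np.mean(zff[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n_frames)])
    voiced = energy > energy_frac * energy.max()              # threshold on ZFR energy
    epochs = np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0]     # positive zero crossings
    return voiced, epochs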
2. Feature extraction
In this step, only voiced segments obtained in the first step are considered. Acoustic features based on the excitation source and vocal tract system characteristics of laughter segments are extracted for detection.
The source and system characteristics of laugh signals are analyzed using features like pitch period (T0 ), strength of excitation (α), amount of breathiness and some parameters derived from them which are explained below in detail.
Pitch period ( T0 ):
The pitch period (T0) is computed as the difference between two successive epoch locations; T0 values are thus extracted at each epoch location. The reciprocal of T0 gives the fundamental frequency (F0).
It is observed that the fundamental frequency for laughter is higher than that for normal speech. For normal speech the fundamental frequency typically ranges between 80 Hz and 200 Hz for male speakers and 200 Hz to 400 Hz for female speakers, whereas for laughter the mean fundamental frequency for males is above 250 Hz, and for females it is above 400 Hz [1].
Strength of Excitation (α):
The strength of excitation (α) at every epoch is computed as the difference between two successive samples of the zero-frequency resonator output signal in the vicinity of the epoch, i.e., the difference between the values of the ZFR signal samples just before and after the epoch location.
Since a large amount of air pressure builds up in the case of laughter (as large amounts of air are exhaled), the closing phase of the vocal folds is very fast. This results in an increase in the strength of excitation [7].
Duration of the opening phase (β)
Since the closing phase of the vocal folds is fast for laughter, the corresponding opening phase will be larger in duration [7]. We therefore use the ratio (β) of the strength of excitation (α) at the epoch location to the pitch period (T0) as an approximate measure of the relative duration of the opening phase [4].
β = α / T0
Slope of T0 (δT0):
The pitch period contour of laughter has a unique pattern of rising rapidly at the end of a call, so we use the slope of the pitch period contour to capture this pattern.
Extraction of slope of T0:
First the pitch period contour is normalized between 0 and 1.
At every epoch location the slope of the pitch period contour is obtained using a window width of 5 successive epochs. The slope is calculated by dividing the difference between the maximum and minimum of the 5 pitch period values within each window by the duration of the window.
We denote this slope by δT0. This value will be almost zero in the first half of a laughter call, and approximately equal to the slope of the contour during the rising phase.
Slope of α (δα)
As in the case of the pitch period, the strength of excitation at epochs also changes rapidly. It rises rapidly to some maximum value and again falls at the same rate.
Hence the slope of the normalized strengths is calculated by dividing the difference between the maximum and minimum of the normalized strength values within a 5-epoch window by the duration of the window. We denote this slope by δα. (A sketch of the two slope features is given below.)
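A small numpy sketch of the two slope features (δT0 and δα), assuming the epoch times (in seconds) and the corresponding T0 or strength-of-excitation contour are already available; the 5-epoch window follows the description above.

import numpy as np

def contour_slope(values, epoch_times, win=5):
    """Slope feature per epoch: (max - min) of the normalized contour within a
    window of `win` successive epochs, divided by the window duration."""
    v = np.asarray(values, dtype=float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-9)        # normalize to [0, 1]
    slopes = np.zeros(len(v))
    for i in range(len(v) - win + 1):
        window = v[i:i + win]
        duration = epoch_times[i + win - 1] - epoch_times[i]
        slopes[i] = (window.max() - window.min()) / (duration + 1e-9)
    return slopes

# delta_T0 = contour_slope(T0_values, epoch_times)        # slope of the pitch-period contour
# delta_alpha = contour_slope(alpha_values, epoch_times)  # slope of the strength contour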
Loudness parameter (η)
Because of the high amount of airflow, laughter is typically accompanied by some amount of breathiness. Breathiness is produced with the vocal folds vibrating loosely, so that more air escapes through the vocal tract than in a modally voiced sound [10]. This type of phonation is also called glottal frication and is reflected as high-frequency noise (a non-deterministic component) in the signal. A breathy signal will typically have less loudness and a larger non-deterministic (noise) component.
Measures based on the Hilbert envelope (HE) are used for calculating the loudness and the proportion of the non-deterministic component in the signal. Loudness is defined as the rate of closure of the vocal folds at the glottal closure instant (GCI) [11]. This can be computed from the Hilbert envelope of the excitation signal (the residual) obtained by inverse filtering the signal (LP analysis).
Dominant resonance frequency (DRF)
The dominant resonance frequency (DRF) represents the dominant resonance of the vocal tract system. DRF values are obtained by computing the DFT over a short window and taking the frequency with the maximum amplitude; this frequency is called the dominant resonance frequency (DRF).
Dominant resonance strength (γ)
The dominant resonance strength (γ) is the amplitude of the DFT at the DRF, i.e., the maximum amplitude of the DFT values obtained in a frame. DRS values are higher for laughter than for neutral speech.
3. Decision logic
A decision is obtained for every feature extracted in step 2. This decision is made by placing a threshold on the feature value (the threshold is different for each feature).
After extracting the above features at every epoch location, a decision has to be made for the voiced segment based on these values. Note that in this algorithm only voiced laughter is considered; unvoiced laughter is not.
For every feature, a decision is first made for each epoch in the segment. This is performed by placing a threshold on the feature value (different for each feature), called the "value threshold" (vt) for that feature. If the feature value at an epoch satisfies this value threshold, the epoch is considered to belong to laughter according to that feature. A decision is then made for the segment by applying a threshold called the "fraction threshold" (ft), which determines the percentage of epochs that must satisfy the value threshold for the segment to be a laughter segment. After applying the two thresholds, separate binary decisions on the segment are obtained for all the features. Finally, the segment is considered laughter if at least 50% of the features give a positive decision.
The false alarm rate (FAR) and false rejection rate (FRR) are computed on the training set for various threshold values. The threshold value for which the FAR and FRR are minimal is selected as the threshold for that feature. Thresholds are also computed for different combinations of features to verify which combination of features and thresholds gives the best performance. (A sketch of the decision logic is given below.)
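A compact sketch of this two-level thresholding and majority vote; the feature names, value thresholds (vt), and fraction thresholds (ft) below are placeholders to be set from the FAR/FRR analysis, and any feature that uses a "below threshold" criterion would need the comparison reversed.

def is_laughter_segment(feature_values, thresholds, majority=0.5):
    """feature_values: dict mapping feature name -> per-epoch values for one voiced segment.
    thresholds: dict mapping feature name -> (value_threshold, fraction_threshold).
    A feature votes 'laughter' if at least ft of its epochs exceed vt; the segment
    is laughter if at least 50% of the features vote yes."""
    votes = 0
    for name, values in feature_values.items():
        vt, ft = thresholds[name]
        fraction = sum(1 for v in values if v > vt) / float(len(values))
        votes += fraction >= ft
    return votes >= majority * len(feature_values)

# Placeholder thresholds, chosen per feature from the FAR/FRR curves:
# thresholds = {"F0": (250.0, 0.6), "alpha": (0.4, 0.5), "beta": (0.02, 0.5)}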
4. Post-processing
This step is used to obtain the boundaries of the laughter segments based on the decision obtained in step 3.
The regions of the speech signal considered as laughter in the decision logic step are further processed to obtain the final boundaries of the laughter regions.
This step has two parts:
First, if two regions hypothesized as laughter are separated by less than 50 ms, the two segments are connected, i.e., a gap of less than 50 ms between two regions hypothesized as laughter in the decision logic step is also considered laughter.
In the second part, hypothesized laughter segments shorter than 50 ms are eliminated, since laughter segments are generally not shorter than 50 ms. (A sketch of this post-processing step is given below.)
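A small sketch of these two post-processing rules, with laughter hypotheses represented as (start, end) tuples in seconds (the 50 ms values are taken from the description above).

def postprocess(segments, min_gap=0.05, min_len=0.05):
    """segments: time-sorted list of (start, end) laughter hypotheses in seconds."""
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)     # bridge gaps shorter than 50 ms
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len]   # drop segments under 50 ms

# postprocess([(1.00, 1.20), (1.23, 1.60), (3.00, 3.02)])
# -> [(1.00, 1.60)]   (the 30 ms gap is bridged; the 20 ms segment is dropped)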