Audio processing pipeline

Introduction

Red Hen Lab's Summer of Code 2015 students worked mainly on audio. Graduate student Owen He has now assembled several of their contributions into an integrated audio processing pipeline that will process the entire NewsScape dataset. This page describes the current pipeline along with some design instructions; for the code itself, see our GitHub account.


Related resources

Red Hen processing pipelines

Red Hen is developing the following automated processing pipelines:
  1. Capture and text extraction (multiple locations around the world)
  2. OCR and video compression (Hoffman2)
  3. Text annotation (Red Hen server, UCLA)
  4. Audio parsing (Case HPC)
  5. Video parsing (Hoffman2)
The first three are in production; the task here is to create the fourth. Dr. Jungseock Joo and graduate student Weixin Li have started on the fifth. The new audio pipeline is largely based on the audio parsing work done over the summer, but Owen He has also added some new code. The pipeline may be extended to let video analysis and text contribute to the results; the data is multimodal, so we aim eventually to develop fully multimodal pipelines.

Audio pipeline design

Candidate extensions

  1. Temporal windows in gentle -- see https://github.com/lowerquality/gentle/issues/103
  2. Speech to text -- cf. https://github.com/gooofy/py-kaldi-simple
Temporal windows could usefully be combined with speech to text. For some of our video files, especially those digitized from tapes, the transcript is very poor. We could use a dictionary to count the proportion of valid words, and run speech to text on passages where the proportion falls below a certain threshold.
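A rough sketch of that dictionary check in Python; the word-list path and the 0.8 threshold are illustrative assumptions, not part of the pipeline:

import re

def valid_word_ratio(passage, dictionary):
    """Return the fraction of alphabetic tokens in a transcript passage found in the dictionary."""
    words = re.findall(r"[a-z']+", passage.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / float(len(words))

def needs_speech_to_text(passage, dictionary, threshold=0.8):
    """Flag a passage for speech to text when too few of its words are valid."""
    return valid_word_ratio(passage, dictionary) < threshold

# Illustrative usage with a plain word list (path is an assumption):
# dictionary = set(w.strip().lower() for w in open("/usr/share/dict/words"))
# if needs_speech_to_text(transcript_passage, dictionary): queue this window for speech to text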

Current implementation by Owen He

Below I briefly describe each component of the pipeline. You can find detailed documentation as an executable main script at https://github.com/RedHenLab/Audio/blob/master/Pipeline/Main.ipynb

1. Python Wrapper: All the audio processing tools from last summer are wrapped into a Python module called "AudioPipe". In addition, I fixed a subtle, hard-to-detect bug in the diarization code, yielding a roughly threefold efficiency improvement.

2. Shared Preprocessing: The preprocessing parts of the pipeline (media format conversion, feature extraction, etc.) are also wrapped as Python modules (features, utils) in "AudioPipe".

3. Data Storage: Data output is stored in this folder (https://github.com/RedHenLab/Audio/tree/master/Pipeline/Data), where you can also find the results from testing the pipeline on a sample video (media files are .gitignored because they are too large). Note that the speaker recognition algorithm is now able to detect impostors (tagged as "Others"). The subfolder "Model" is the Model Zoo, where future machine learning algorithms should store their model configurations and README files. The result data are stored in Red Hen format (the computational metadata are in .json).

4. Data Management: To manipulate the data, we use the abstract syntax specified in the data management module. Places where data are stored are abstracted as "Nodes", and computational processes are abstracted as "Flows" from one Node to another.

5. Main Script: As the main script (https://github.com/RedHenLab/Audio/blob/master/Pipeline/Main.ipynb) shows, by deploying the data management module the syntax becomes so concise that every step in the pipeline boils down to just two lines of code. This makes the audio pipeline very convenient for non-developers to use.

Design targets

The new audio processing pipeline will be implemented on Case Western Reserve University's High-Performance Computing Cluster. Design elements:
  • Core pipeline is automated, processing all NewsScape videos via GridFTP from UCLA's Hoffman2 cluster
    • Incoming videos -- around 120 a day, or 100 hours
    • Archived videos -- around 330,000, or 250,000 hours (will take months to complete)
  • Extensible architecture that facilitates the addition of new functions, perhaps in the form of conceptors and classifiers
The pipeline should have a clear design, with an overall functional structure that emphasizes core shared functions and a set of discrete modules. For instance, we could build a core system that ingests the videos and extracts the features needed by the different modules, or take a 'digestive system' approach in which each stage feeds the next.
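As an illustration of the first option, here is a minimal sketch in which the core extracts shared features once and each discrete module consumes them; it borrows the mfcc and wav-reading imports shown in the guide below, the mfcc(signal, rate) signature is an assumption, and the module callables are hypothetical placeholders.

import scipy.io.wavfile as wav
from AudioPipe.features import mfcc  # shared feature extraction from the GSoC 2015 code

def run_modules(audio_path, modules):
    """Core: read the audio and compute features once, then hand them to each discrete module."""
    rate, signal = wav.read(audio_path)
    features = mfcc(signal, rate)  # assuming an mfcc(signal, rate) signature
    return {name: module(features) for name, module in modules.items()}

# Hypothetical usage: each module is any callable that accepts the shared features.
# results = run_modules("Data/Audio/clip.wav",
#                       {"gender": gender_model.predict, "speaker": speaker_model.predict})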

The primary focus for the first version of the pipeline is an automated system that ingests all of our videos and texts and processes them in ways that yield acceptable quality output with no further training or user feedback. There shouldn't be any major problems completing this core task, as the code is largely written and it's a matter of creating a good processing architecture.

Audio pipeline modules

The audio processing pipeline should tentatively include at least the following modules, using code from our GSoC 2015 projects:
  1. Forced alignment (Gentle, using Kaldi)
  2. Speaker diarization (Karan Singla)
  3. Gender detection (Owen He)
  4. Speaker identification (Owen He -- a pilot sample and a clear procedure for adding more people)
  5. Paralinguistic signal detection (Sri Harsha -- two or three examples)
  6. Emotion detection and identification (pilot sample of a few very clear emotions)
  7. Acoustic fingerprinting (Mattia Cerrato -- a pilot sample of recurring audio clips)
The last four modules should be implemented with a small number of examples, as a proof of concept and to provide basic functionality open to expansion.

Audio pipeline output

The output is a series of annotations in JSON-Lines and also in Red Hen's data format, with timestamps, primary tags indicating data type, and field=value pairs.
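As a rough sketch of how such a line might be assembled (modeled on the GEN_01 samples below; the helper names are illustrative, not the pipeline's own):

from datetime import datetime, timedelta

def redhen_time(video_start, offset_seconds):
    """Convert a segment offset (seconds into the video) to the YYYYMMDDHHMMSS.mmm format."""
    t = video_start + timedelta(seconds=offset_seconds)
    return t.strftime("%Y%m%d%H%M%S") + ".%03d" % (t.microsecond // 1000)

def annotation_line(video_start, seg_start, seg_end, tag, fields):
    """Build one pipe-separated annotation: timestamps, primary tag, field=value pairs."""
    pairs = "|".join("%s=%s" % (k, v) for k, v in fields)
    return "|".join([redhen_time(video_start, seg_start), redhen_time(video_start, seg_end), tag, pairs])

# Reproduces the first gender line in the samples below:
# annotation_line(datetime(2015, 8, 7, 0, 50, 2), 0.0, 7.5, "GEN_01",
#                 [("Gender", "Male"), ("Log Likelihood", -19.6638865771)])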

Output samples

Here are the outputs from running the pipeline on the sample video 
2015-08-07_0050_US_FOX-News_US_Presidential_Politics.mp4, in sections below.

Diarization

Speaker diarization using Karan Singla's code:

SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 0.0 7.51 <NA> <NA> speaker_1.0 <NA>
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 7.5 2.51 <NA> <NA> speaker_0.0 <NA>
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 10.0 5.01 <NA> <NA> speaker_1.0 <NA>

Gender identification

Gender Identification based on Speaker Diarization results:

GEN_01|2016-03-28 16:38|Source_Program=SpeakerRec.py Data/Model/Gender/gender.model|Source_Person=He Xu
20150807005002.000|20150807005009.500|GEN_01|Gender=Male|Log Likelihood=-19.6638865771
20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474

Gender Identification without Speaker Diarization, but based on 5-second segments:

Speaker recognition

Speaker Recognition (of Donald Trump) based on Speaker Diarization results:

SPK_01|2016-03-28 16:57|Source_Program=SpeakerID.py Data/Model/Speaker/speaker.model|Source_Person=He Xu
20150807005002.000|20150807005009.500|SPK_01|Name=Other|Log Likelihood=-19.9594622598
20150807005009.500|20150807005012.000|SPK_01|Name=Other|Log Likelihood=-20.9657984337
20150807005012.000|20150807005017.000|SPK_01|Name=Other|Log Likelihood=-20.7527012621

Speaker Recognition without Speaker Diarization, but based on 5-second segments:

Acoustic fingerprinting

Acoustic Fingerprinting using Mattia Cerrato's code (stored as Panako database files):

The video, audio and feature files are not pushed to GitHub due to their large sizes, but they are part of the pipeline outputs as well.

An alternative tool for audio fingerprinting is the open-source tool dejavu on GitHub.

Training

We may have to do some training to complete the sample modules. It would be very useful if you could identify what is still needed to complete a small number of classifiers for modules 4-6, so that we can recruit students to generate the datasets. We can use ELAN, the video coding interface developed at the MPI in Nijmegen, to code some emotions (see Red Hen's integrated research workflow).

We have several thousand tpt files, and I suggest we use them to build a library of trained models for recurring speakers. The tpt files must first be aligned, since they inherit their timestamps from the txt files and are therefore inaccurate. We can then
  1. read the tpt file for boundaries
  2. extract the speech segments for every speaker
  3. concatenate the segments from the same speaker, so that we have at least two minutes of training data for each person
  4. feed these training data to the speaker recognition algorithm to get the models we want
This way, the entire training process can be automated.
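A hedged sketch of steps 2-4, assuming the aligned tpt files have already been parsed into (speaker, start, end) tuples in seconds; the helper names and the ffmpeg-based cutting are illustrative, not the pipeline's actual code.

import subprocess
from collections import defaultdict

MIN_TRAINING_SECONDS = 120  # at least two minutes per speaker

def segments_by_speaker(turns):
    """Group (speaker, start, end) turns by speaker, keeping speakers with enough total speech."""
    grouped = defaultdict(list)
    for speaker, start, end in turns:
        grouped[speaker].append((start, end))
    return {spk: segs for spk, segs in grouped.items()
            if sum(e - s for s, e in segs) >= MIN_TRAINING_SECONDS}

def concatenate_training_audio(audio_path, segments, out_path):
    """Cut each segment with ffmpeg and join the pieces into one training file."""
    parts = []
    for i, (start, end) in enumerate(segments):
        part = "%s.part%d.wav" % (out_path, i)
        subprocess.check_call(["ffmpeg", "-y", "-i", audio_path,
                               "-ss", str(start), "-to", str(end), part])
        parts.append(part)
    with open(out_path + ".list", "w") as listing:
        listing.writelines("file '%s'\n" % p for p in parts)
    subprocess.check_call(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                           "-i", out_path + ".list", "-c", "copy", out_path])

The concatenated files per speaker would then be fed to the speaker recognition training routine (step 4).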

A simple, automated method to select which speakers to train for would be to extract the unique speakers from each tpt file and then count how often they recur. I did this in the script cartago:/usr/local/bin/speaker-list; it generates this output:

tna@cartago:/tmp$ l *tpt
-rw-r--r-- 1 tna tna  125138 Apr  7 08:39 2006-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna  305251 Apr  7 08:36 2006-Speakers.tpt
-rw-r--r-- 1 tna tna  468137 Apr  7 08:40 2007-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1352569 Apr  7 08:28 2007-Speakers.tpt
-rw-r--r-- 1 tna tna  403465 Apr  7 08:40 2008-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1370985 Apr  7 08:30 2008-Speakers.tpt
-rw-r--r-- 1 tna tna  442405 Apr  7 08:40 2009-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1294787 Apr  7 08:31 2009-Speakers.tpt
-rw-r--r-- 1 tna tna  375668 Apr  7 08:40 2010-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1015746 Apr  7 08:32 2010-Speakers.tpt
-rw-r--r-- 1 tna tna  336958 Apr  7 08:40 2011-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1024899 Apr  7 08:33 2011-Speakers.tpt
-rw-r--r-- 1 tna tna  277164 Apr  7 08:40 2012-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna  833007 Apr  7 08:34 2012-Speakers.tpt
-rw-r--r-- 1 tna tna  342556 Apr  7 08:40 2013-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna  940021 Apr  7 08:35 2013-Speakers.tpt
-rw-r--r-- 1 tna tna  328208 Apr  7 08:40 2014-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna 1106093 Apr  7 08:37 2014-Speakers.tpt
-rw-r--r-- 1 tna tna  283859 Apr  7 08:40 2015-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna  980963 Apr  7 08:38 2015-Speakers.tpt
-rw-r--r-- 1 tna tna   75845 Apr  7 08:40 2016-Recurring-Speakers.tpt
-rw-r--r-- 1 tna tna  242861 Apr  7 08:38 2016-Speakers.tpt

So the /tmp/$YEAR-Recurring-Speakers.tpt files list how many shows a person appears in, by year. If we want more granularity, we could run this by month instead, to track who moves in and out of the news. The script tries to clean up the output a bit, though we may want to do more.
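The counting step itself is simple. Here is a hedged Python sketch of the idea (not the speaker-list script itself), assuming a hypothetical extract_speakers() that returns the cleaned speaker names found in one tpt file:

import glob
from collections import Counter

def recurring_speakers(tpt_pattern, extract_speakers):
    """Count, for each speaker name, the number of shows (tpt files) it appears in."""
    counts = Counter()
    for path in glob.glob(tpt_pattern):
        counts.update(set(extract_speakers(path)))  # count each speaker once per show
    return counts

# e.g. sorted(recurring_speakers("2016-*.tpt", extract_speakers).items(), key=lambda kv: kv[1])
# would yield a ranked list like the 2016 sample below.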

If you look at the top speakers so far in 2016, it's a fascinating list:

     32 Alison Kosik
     32 Margaret Hoover
     32 Nic Robertson
     32 Polo Sandoval
     32 Sen. Rand Paul
     33 Jean Casarez
     33 Nima Elbagir
     33 Sarah Palin
     33 Victor Blackwell
     34 CAMEROTA
     34 CLINTON
     34 Sara Sidner
     34 Sen. Lindsey Graham
     35 CUOMO
     35 Kate Bolduan
     35 QUESTION
     36 Chad Myers
     36 Sen. Bernie Sanders (Vt-i)
     37 Andy Scholes
     37 Katrina Pierson
     38 Jeffrey Toobin
     38 Nick Paton Walsh
     39 Brian Todd
     39 Frederik Pleitgen
     39 Hillary Rodham Clinton
     39 Nancy Grace
     39 Sara Ganim
     40 Matt Lewis
     40 Michelle Kosinski
     40 Paul Cruickshank
     42 Bakari Sellers
     42 Bill Clinton
     42 PEREIRA
     43 Ana Navarro
     43 Clarissa Ward
     43 S.E. Cupp
     43 Van Jones
     44 Ben Ferguson
     44 Jason Carroll
     45 CLIENT
     46 AVO
     46 Evan Perez
     46 Maeve Reston
     47 David Chalian
     47 David Gergen
     47 Errol Louis
     47 Ron Brownstein
     48 SFX
     49 Dr. Sanjay Gupta
     50 CRUZ
     51 Amanda Carpenter
     51 John King
     52 Miguel Marquez
     53 FLO
     54 Kayleigh Mcenany
     55 Poppy Harlow
     58 Athena Jones
     59 Bernie Sanders
     59 Coy Wire
     60 FARMER
     61 Don Lemon
     61 Nick Valencia
     62 Barbara Starr
     63 Brian Stelter
     65 Jim Sciutto
     67 Gov. Chris Christie
     68 VO
     70 Erin Burnett
     71 Carol Costello
     72 Joe Johns
     72 Pamela Brown
     74 Ashleigh Banfield
     74 Jeffrey Lord
     75 Manu Raju
     76 Chris Frates
     81 Phil Mattingly
     88 Christine Romans
     90 TRUMP
     93 Mark Preston
     95 Brooke Baldwin
     96 Gloria Borger
    101 Jake Tapper
    103 Jeb Bush
    107 SANDERS
    112 Jim Acosta
    122 Michaela Pereira
    127 PRESIDENTIAL CANDIDATE
    131 Brianna Keilar
    133 Gov. John Kasich
    134 Sunlen Serfaty
    135 Anderson Cooper
    143 Dana Bash
    144 Sen. Bernie Sanders (I-vt)
    149 Sara Murray
    150 John Berman
    157 Wolf Blitzer
    160 Barack Obama
    163 Alisyn Camerota
    166 Sen. Bernie Sanders
    168 Jeff Zeleny
    180 Chris Cuomo
    258 Sen. Marco Rubio
    326 Hillary Clinton
    382 Sen. Ted Cruz
    522 Donald Trump

You see Sanders as Sen. Bernie Sanders (I-vt), Sen. Bernie Sanders (Vt-i), Sen. Bernie Sanders, and SANDERS, and Trump as Donald Trump and TRUMP, so we should fold multiple names for the same person together when extracting the training data.
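A small sketch of how name variants could be folded together; the alias table is illustrative and would be curated by hand or seeded from the recurring-speaker lists above.

def canonical_name(name, aliases):
    """Map a tpt speaker label to a canonical name; unknown labels pass through unchanged."""
    return aliases.get(name.strip(), name.strip())

# Illustrative alias table for the cases noted above:
ALIASES = {
    "SANDERS": "Bernie Sanders",
    "Sen. Bernie Sanders": "Bernie Sanders",
    "Sen. Bernie Sanders (I-vt)": "Bernie Sanders",
    "Sen. Bernie Sanders (Vt-i)": "Bernie Sanders",
    "TRUMP": "Donald Trump",
}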

For systematic disambiguation, it may be possible to use the Library of Congress Name Authority File (LCNAF). It contains 8.2 million name authority records (6 million personal, 1.4 million corporate, 180,000 meeting, and 120,000 geographic names, plus 0.5 million titles). As a publicly supported U.S. Government institution, the Library generally does not own rights in its collections or in what is posted on its website. Its guidelines note: "Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. The Library also reserves the right to terminate programs that require more than 24 hours to complete." For an example record, see:

  http://lccn.loc.gov/n94112934

The virtue of using this database is that it is likely to be accurate. However, the records are impoverished relative to Wikipedia; arguably, it is Wikipedia that should be linking into the LCNAF. It is also unclear whether the LCNAF has an API that facilitates machine searches; see LoC SRW for leads.

Pete Broadwell writes on 17 April 2016,

In brief, I think the best way to disambiguate named persons would be to set up our own local DBpedia Spotlight service:
We’ve discussed Spotlight briefly in the past; it’s trivial to set up a basic local install via apt, but (similar to Gisgraphy), I think more work will be necessary to download and integrate the larger data sets that would let us tap into the full potential of the software.

In any case, this is something Martin and I have planned to do for the library for quite some time now. I suggest that we first try installing it on babylon, with the data set and index files stored on the Isilon (which is what we do for Gisgraphy) — we could move it somewhere else if babylon is unable to handle the load. We can also see how well it does matching organizations and places (the latter could help us refine the Gisgraphy matches), though of course places and organizations don’t speak. 

I share your suspicion that the LCNAF isn’t necessarily any more extensive, accurate or up-to-date than DBpedia/Wikipedia, especially for people who are in the news. It also doesn’t have its own API as far as I can tell; the suggested approach is to download the entire file as RDF triples and set up our own Apache Jena service (http://jena.apache.org/) to index them. Installing Spotlight likely would be a better use of our time.
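If we do set up a local Spotlight service, querying it for a speaker name could look roughly like this; the host and port are assumptions, and the /rest/annotate parameters follow Spotlight's documented JSON interface.

import requests

def spotlight_candidates(name, endpoint="http://localhost:2222/rest/annotate"):
    """Return candidate DBpedia URIs for a speaker name from a local Spotlight service."""
    resp = requests.get(endpoint,
                        params={"text": name, "confidence": 0.4},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    return [r["@URI"] for r in resp.json().get("Resources", [])]

# e.g. spotlight_candidates("Bernie Sanders") should include
# http://dbpedia.org/resource/Bernie_Sanders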

If we set the cutoff at speakers who have appeared in at least 32 shows, we would get a list of a hundred common speakers. But it may be useful to go much further. Even people who appear in a couple of shows could be of interest; I recognize a lot of the names. That would give us thousands of speakers:

tna@cartago:/tmp$ for YEAR in {2006..2016} ; do echo -en "$YEAR: \t" ; grep -v '^      1 ' $YEAR-Recurring-Speakers.tpt | wc -l ; done
2006:   2121
2007:   8441
2008:   6922
2009:   7360
2010:   6112
2011:   5442
2012:   4438
2013:   5416
2014:   5189
2015:   4434
2016:   1149

It's likely 2007 is high simply because we have a lot of tpt files from that year. Give some thought to this; the first step is to get the alignment going. Once we have that, we should have a large database of recurring speakers we can train with.

Efficiency coding

The PyCASP project makes an interesting distinction between efficiency coding and application coding. We have a set of applications; your task is to integrate them and make them run efficiently in an HPC environment. PyCASP is installed on the Case HPC if you would like to use it. The Berkeley team at ICSI who developed it also has some related projects, and we have good contacts with this team.
Please assess whether the efficiency coding framework could be useful in the pipeline design. It's important to bear in mind that we want this design to be clear, transparent, and easy to maintain; it's possible that introducing the PyCASP infrastructure would make the pipeline more difficult to extend, in which case we should not use it.

Integrating the training stage

To the extent there's time, I'd also like us to consider a somewhat more ambitious project that integrates the training stage. Could you, for instance, sketch an outline of how we might create a processing architecture that integrates deep learning for some tasks and conceptors for others? There are a lot of machine learning tools out there; RHSoC2015 used SciPy and Kaldi. Consider Google's TensorFlow project -- it is a candidate deep learning approach for integrated multimodal data. We see this as a longer-term project.


Red Hen Audio Processing Pipeline Guide


Dependencies:


To run the main processing script, the following modules should be loaded on the HPC cluster:


module load boost/1_58_0
module load cuda/7.0.28
module load pycasp
module load hdf5
module load ffmpeg


The required Python modules are installed in a virtual environment, which can be activated with the following command:


. /home/hxx124/myPython/virtualenv-1.9/ENV/bin/activate



Python Wrapper:


Although we welcome audio processing tools implemented in any language, to make it easier to integrate several audio tools into one unified pipeline, we strongly recommend that developers wrap their code as a Python module, so that the pipeline can include their work simply by importing the corresponding module. All audio-related work from GSoC 2015 has been wrapped into a Python module called "AudioPipe". See below for some examples:


# the AudioPipe Python Module
import AudioPipe.speaker.recognition as SR # Speaker Recognition Module
import AudioPipe.fingerprint.panako as FP # Acoustic Fingerprinting Module
from AudioPipe.speaker.silence import remove_silence # tool for removing silence from the audio (not needed here)
import numpy as np
from AudioPipe.features import mfcc # Feature Extraction Module, part of the shared preprocessing
import scipy.io.wavfile as wav
from AudioPipe.speaker.rec import dia2spk, getspk # Speaker Recognition using diarization results
from AudioPipe.utils.utils import video2audio # Format conversion module, part of the shared preprocessing
import commands, os
from AudioPipe.diarization.diarization import Diarization # Speaker Diarization Module
import AudioPipe.data.manage as DM # Data Management Module


Pipeline Abstract Syntax:


To specify a new pipeline, one can use the abstract syntax defined in the Data Management Module, where a Node is a place to store data and a Flow is a computational process that transforms input data to output data.


For instance, one can create a Node for videos as follows:


# Select the video file to be processed
Video_node = DM.Node("Data/Video/",".mp4")
name = "2015-08-07_0050_US_FOX-News_US_Presidential_Politics"


where DM.Node(dir, ext) is a Node constructor that takes two arguments: a directory (dir) and a file extension (ext).


And to construct a small pipeline that converts video to audio, one can do the following:

# Convert the video to audio
Audio_node = DM.Node("Data/Audio/", ".wav")
audio = Video_node.Flow(video2audio, name, Audio_node, [Audio_node.ext])


which first creates a Node for audio outputs and then flows the specified file from the video Node to the audio Node through the computational process video2audio. The name argument specifies exactly which file is to be processed, and it is also used as the output file name. The last argument of .Flow() is a list of the arguments required by the computational process (video2audio in this case).


The following code gives a more complex pipeline example:

# Select the file for the meta information
Meta_node = DM.Node("Data/RedHen/",".seg")
meta = Meta_node.Pick(name)


# Store the fingerprint of the video
FP_node = DM.Node("Data/Fingerprint/")
output, err, exitcode = Video_node.Flow(FP.Store, name, FP_node, [])


# Run speaker diarization on the audio

Dia_node = DM.Node("Data/Diarization/", ".rttm")
args = dict(init_cluster=20, dest_mfcc='Data/MFCC', dest_cfg="Data/Model/DiaCfg")
dia = Audio_node.Flow(Diarization, name, Dia_node, args)


# Gender Identification based on Speaker Diarization
Gen_node = DM.Node("Data/Gender/",".gen")
gen = Audio_node.Flow(dia2spk, name, Gen_node, [model_gender, dia, meta, Gen_node.ext])

# Speaker Recognition based on Speaker Diarization
Spk_node = DM.Node("Data/Speaker/",".spk")
spk = Audio_node.Flow(dia2spk, name, Spk_node, [model_speaker, dia, meta, Spk_node.ext])


Basically, this pipeline does the following:

It takes the video and stores its fingerprints; takes the audio and produces diarization results; identifies the gender of each speaker from the audio, using the boundary information provided by the diarization results; and, in the same way, recognizes the speakers from the audio.


       

 
