Audio processing pipeline


Red Hen Lab's Summer of Code 2015 students worked mainly on audio. Graduate student Owen He has now assembled several of their contributions into an integrated audio processing pipeline that will process the entire NewsScape dataset. This page describes the current pipeline and gives some design instructions; for the code itself, see our GitHub account.

Related resources

Red Hen processing pipelines

Red Hen is developing the following automated processing pipelines:

    1. Capture and text extraction (multiple locations around the world)
    2. OCR and video compression (Hoffman2)
    3. Text annotation (Red Hen server, UCLA)
    4. Audio parsing (Case HPC)
    5. Video parsing (Hoffman2)

The first three are in production; the task is to create the fourth. Dr. Jungseock Joo and graduate student Weixin Li have started on the fifth. The new audio pipeline is largely based on the audio parsing work done over the summer, but Owen He has also added some new code. The pipeline may be extended to allow video analysis and text to contribute to the results. The data is multimodal, so we're aiming to eventually develop fully multimodal pipelines.

Audio pipeline design

Candidate extensions

    1. Temporal windows in Gentle -- see
    2. Speech to text -- cf.
    3. Acoustic and language models at CMU Sphinx

Temporal windows could usefully be combined with speech to text. For some of our video files, especially those digitized from tapes, the transcript is very poor. We could use a dictionary to count the proportion of valid words, and run speech to text on passages where the proportion falls below a certain level.
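The dictionary check described above could be sketched roughly as follows. This is only an illustration: the function names, the toy word list, and the 0.6 threshold are assumptions chosen for the example, and a real run would load a full dictionary and tune the threshold on known-bad transcripts.

```python
import re

def valid_word_ratio(text, dictionary):
    """Return the fraction of word tokens in `text` that occur in `dictionary`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in dictionary)
    return float(hits) / len(words)

def passages_needing_stt(passages, dictionary, threshold=0.6):
    """Return the indices of passages whose valid-word ratio is below `threshold`."""
    return [i for i, passage in enumerate(passages)
            if valid_word_ratio(passage, dictionary) < threshold]

# Toy data for illustration only (hypothetical word list and passages).
dictionary = {"the", "senator", "spoke", "about", "budget", "today"}
passages = [
    "The senator spoke about the budget today",
    "Th3 s3nxtor spxke abvut thq bvdgxt",
]
flagged = passages_needing_stt(passages, dictionary)
print(flagged)
```

Only the flagged passages would then be sent to speech to text, which keeps the expensive step confined to the stretches where the existing transcript is poor.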

Current implementation by Owen He

Below I briefly describe each component of the pipeline. You can find detailed documentation as an executable main script at

1. Python Wrapper: All the audio processing tools from last summer are wrapped into a Python module called "AudioPipe". In addition, I fixed a hard-to-detect bug in the diarization code, which made it about three times faster.

2. Shared Preprocessing: The preprocessing part of the pipeline (media format conversion, feature extraction, etc.) is also wrapped as Python modules (features, utils) in "AudioPipe".

3. Data Storage: Data output is stored in this folder, where you can also find the results from testing the pipeline on a sample video (media files are .gitignored because they are too large). Note that the speaker recognition algorithm is now able to detect impostors (tagged as "Other"). The subfolder "Model" is the Model Zoo, where future machine learning algorithms should store their model configurations and README files. The result data are stored in Red Hen format (but the metadata for computation are in .json).

4. Data Managing: For manipulating the data, we use abstract syntax specified in the data managing module. The places where data are stored are abstracted as "Node", and the computation processes are abstracted as "Flow" from one Node to the other.

5. Main Script: As you can see in the main script, deploying the data managing module makes the syntax so concise that every step in the pipeline boils down to only two lines of code. This will be very convenient for non-developers who want to use the audio pipeline.

Design targets

The new audio processing pipeline will be implemented on Case Western Reserve University's High-Performance Computing Cluster. Design elements:

    • Core pipeline is automated, processing all NewsScape videos via GridFTP from UCLA's Hoffman2 cluster
      • Incoming videos -- around 120 a day, or 100 hours
      • Archived videos -- around 330,000, or 250,000 hours (will take months to complete)
    • Extensible architecture that facilitates the addition of new functions, perhaps in the form of conceptors and classifiers
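To get a feel for why the archive will take months, here is a back-of-the-envelope sketch. The archive size and daily intake come from the list above; the number of parallel workers and the processing speed relative to real time are placeholder assumptions, not measurements of the actual pipeline.

```python
def days_to_clear_backlog(archive_hours, incoming_hours_per_day,
                          workers, speedup):
    """Estimate the days needed to process the archive while keeping up with new video.

    Each of the `workers` parallel jobs is assumed to process `speedup`
    hours of video per wall-clock hour.
    """
    capacity_per_day = workers * speedup * 24.0
    net_per_day = capacity_per_day - incoming_hours_per_day
    if net_per_day <= 0:
        raise ValueError("capacity does not exceed the incoming volume")
    return archive_hours / net_per_day

# 250,000 archived hours and 100 incoming hours per day are from the text;
# 20 workers at 5x real time are hypothetical values.
days = days_to_clear_backlog(250000, 100, workers=20, speedup=5)
print(round(days, 1))  # 108.7
```

Under these assumed numbers the backlog clears in roughly three and a half months; halving the worker count would roughly double that.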

The pipeline should have a really clear design, with an overall functional structure that emphasizes core shared functions and a bunch of discrete modules. For instance, we could think of a core system that ingests the videos and extracts the features needed by the different modules. Alternatively, we could take a 'digestive system' approach in which each stage contributes to the subsequent stage.

The primary focus for the first version of the pipeline is an automated system that ingests all of our videos and texts and processes them in ways that yield acceptable quality output with no further training or user feedback. There shouldn't be any major problems completing this core task, as the code is largely written and it's a matter of creating a good processing architecture.

Audio pipeline modules

The audio processing pipeline should tentatively have at least these modules, using the code from our GSoC 2015:

    1. Forced alignment (Gentle, using Kaldi)
    2. Speaker diarization (Karan Singla)
    3. Gender detection (Owen He)
    4. Speaker identification (Owen He -- a pilot sample and a clear procedure for adding more people)
    5. Paralinguistic signal detection (Sri Harsha -- two or three examples)
    6. Emotion detection and identification (pilot sample of a few very clear emotions)
    7. Acoustic fingerprinting (Mattia Cerrato -- a pilot sample of recurring audio clips)

The last four modules should be implemented with a small number of examples, as a proof of concept and to provide basic functionality open to expansion.

Audio pipeline output

The output is a series of annotations in JSON-Lines and also in Red Hen's data format, with timestamps, primary tags indicating data type, and field=value pairs.
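As an illustration of the timestamp, tag, and field=value layout, the sketch below parses one data line copied from the gender identification sample further down. It is a minimal reader written for this page, not part of AudioPipe; the function names are hypothetical, and it handles only data lines, not the header line that names the model.

```python
from datetime import datetime

def parse_timestamp(stamp):
    """Parse a timestamp such as 20150807005002.000 into a datetime."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")

def parse_annotation(line):
    """Split one annotation line into start, end, primary tag, and fields."""
    parts = line.rstrip("\n").split("|")
    start, end, tag = parts[0], parts[1], parts[2]
    # Every remaining column is a field=value pair.
    fields = dict(pair.split("=", 1) for pair in parts[3:])
    return {
        "start": parse_timestamp(start),
        "end": parse_timestamp(end),
        "tag": tag,
        "fields": fields,
    }

line = ("20150807005002.000|20150807005009.500|GEN_01|"
        "Gender=Male|Log Likelihood=-19.6638865771")
record = parse_annotation(line)
print(record["tag"], record["fields"]["Gender"],
      (record["end"] - record["start"]).total_seconds())
```

Splitting each pair on the first "=" only keeps values that themselves contain an equals sign intact.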

Output samples

Here are the outputs from running the pipeline on the sample video 2015-08-07_0050_US_FOX-News_US_Presidential_Politics.mp4, in the sections below.


Speaker diarization using Karan Singla's Code:

SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 0.0 7.51 <NA> <NA> speaker_1.0 <NA>
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 7.5 2.51 <NA> <NA> speaker_0.0 <NA>
SPEAKER 2015-08-07_0050_US_FOX-News_US_Presidential_Politics 1 10.0 5.01 <NA> <NA> speaker_1.0 <NA>
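The lines above have nine whitespace-separated columns: a record type, the file name, a channel, the segment onset and duration in seconds, two unused columns, the speaker label, and a final unused column. A downstream module could read them along these lines; this is a sketch based only on the sample shown here, and the names Segment, parse_rttm_line, and speaking_time are hypothetical.

```python
from collections import defaultdict, namedtuple

Segment = namedtuple("Segment", "file channel onset duration speaker")

def parse_rttm_line(line):
    """Parse one SPEAKER line of the diarization output into a Segment."""
    parts = line.split()
    if len(parts) != 9 or parts[0] != "SPEAKER":
        raise ValueError("unexpected diarization line: %r" % line)
    return Segment(parts[1], int(parts[2]), float(parts[3]),
                   float(parts[4]), parts[7])

def speaking_time(segments):
    """Sum the segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)

name = "2015-08-07_0050_US_FOX-News_US_Presidential_Politics"
sample = [
    "SPEAKER %s 1 0.0 7.51 <NA> <NA> speaker_1.0 <NA>" % name,
    "SPEAKER %s 1 7.5 2.51 <NA> <NA> speaker_0.0 <NA>" % name,
    "SPEAKER %s 1 10.0 5.01 <NA> <NA> speaker_1.0 <NA>" % name,
]
segments = [parse_rttm_line(line) for line in sample]
totals = speaking_time(segments)
print(totals)
```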

Gender identification

Gender Identification based on Speaker Diarization results:

GEN_01|2016-03-28 16:38| Data/Model/Gender/gender.model|Source_Person=He Xu
20150807005002.000|20150807005009.500|GEN_01|Gender=Male|Log Likelihood=-19.6638865771
20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474

Gender Identification without Speaker Diarization, but based on 5-second segments:

Speaker recognition

Speaker Recognition (of Donald Trump) based on Speaker Diarization results:

SPK_01|2016-03-28 16:57| Data/Model/Speaker/speaker.model|Source_Person=He Xu
20150807005002.000|20150807005009.500|SPK_01|Name=Other|Log Likelihood=-19.9594622598
20150807005009.500|20150807005012.000|SPK_01|Name=Other|Log Likelihood=-20.9657984337
20150807005012.000|20150807005017.000|SPK_01|Name=Other|Log Likelihood=-20.7527012621

Speaker Recognition without Speaker Diarization, but based on 5-second segments:

Acoustic fingerprinting

Acoustic Fingerprinting using Mattia Cerrato's Code (as panako database files):

The video, audio and feature files are not pushed to GitHub due to their large sizes, but they are part of the pipeline outputs as well.

An alternative for audio fingerprinting is the open-source tool dejavu, available on GitHub.


We may have to do some training to complete the sample modules. It would be very useful if you could identify what is still needed to complete a small number of classifiers for modules 4-6, so that we can recruit students to generate the datasets. We can use ELAN, the video coding interface developed at the MPI in Nijmegen, to code some emotions (see Red Hen's integrated research workflow).

We have several thousand tpt files, and I suggest we use them to build a library of trained models for recurring speakers. The tpt files must first be aligned; they inherit their timestamps from the txt files, so they are inaccurate. We can then

  1. read the tpt file for boundaries
  2. extract the speech segments for every speaker
  3. concatenate the segments from the same speaker, so that we have at least 2 minutes of training data for each speaker
  4. feed these training data to the speaker recognition algorithm to get the models we want

This way, the entire training process can be automated.
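Steps 2 and 3 could be organized roughly as in the sketch below. It assumes the aligned tpt files have already been read into (speaker, start, end) tuples with times in seconds; that tuple layout, the function names, and the toy data are assumptions for illustration, and cutting and concatenating the audio itself is left out.

```python
from collections import defaultdict

MIN_TRAINING_SECONDS = 120.0  # the 2-minute minimum mentioned above

def group_segments(segments):
    """Group (speaker, start, end) tuples by speaker, sorted by start time."""
    by_speaker = defaultdict(list)
    for speaker, start, end in segments:
        by_speaker[speaker].append((start, end))
    for spans in by_speaker.values():
        spans.sort()
    return dict(by_speaker)

def trainable_speakers(segments, min_seconds=MIN_TRAINING_SECONDS):
    """Return {speaker: spans} for speakers with at least `min_seconds` of speech."""
    grouped = group_segments(segments)
    return {speaker: spans for speaker, spans in grouped.items()
            if sum(end - start for start, end in spans) >= min_seconds}

# Toy boundaries for illustration only.
segments = [
    ("Wolf Blitzer", 0.0, 70.0),
    ("Dana Bash", 70.0, 100.0),
    ("Wolf Blitzer", 100.0, 160.0),
]
selected = trainable_speakers(segments)
print(sorted(selected))
```

Speakers who fall short of the minimum in one file could still qualify once segments from several shows are pooled, so in practice the grouping would run across files rather than per file.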

A simple, automated method to select which speakers to train for would be to extract the unique speakers from each tpt file and then count how often they recur. I did this in the script cartago:/usr/local/bin/speaker-list; it generates this output:

tna@cartago:/tmp$ l *tpt

-rw-r--r-- 1 tna tna 125138 Apr 7 08:39 2006-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 305251 Apr 7 08:36 2006-Speakers.tpt

-rw-r--r-- 1 tna tna 468137 Apr 7 08:40 2007-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 1352569 Apr 7 08:28 2007-Speakers.tpt

-rw-r--r-- 1 tna tna 403465 Apr 7 08:40 2008-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 1370985 Apr 7 08:30 2008-Speakers.tpt

-rw-r--r-- 1 tna tna 442405 Apr 7 08:40 2009-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 1294787 Apr 7 08:31 2009-Speakers.tpt

-rw-r--r-- 1 tna tna 375668 Apr 7 08:40 2010-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 1015746 Apr 7 08:32 2010-Speakers.tpt

-rw-r--r-- 1 tna tna 336958 Apr 7 08:40 2011-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 1024899 Apr 7 08:33 2011-Speakers.tpt

-rw-r--r-- 1 tna tna 277164 Apr 7 08:40 2012-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 833007 Apr 7 08:34 2012-Speakers.tpt

-rw-r--r-- 1 tna tna 342556 Apr 7 08:40 2013-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 940021 Apr 7 08:35 2013-Speakers.tpt

-rw-r--r-- 1 tna tna 328208 Apr 7 08:40 2014-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 1106093 Apr 7 08:37 2014-Speakers.tpt

-rw-r--r-- 1 tna tna 283859 Apr 7 08:40 2015-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 980963 Apr 7 08:38 2015-Speakers.tpt

-rw-r--r-- 1 tna tna 75845 Apr 7 08:40 2016-Recurring-Speakers.tpt

-rw-r--r-- 1 tna tna 242861 Apr 7 08:38 2016-Speakers.tpt

So the /tmp/$YEAR-Recurring-Speakers.tpt files list how many shows a person appears in, by year. If we want more granularity, we could run this by month instead, to track who moves in and out of the news. The script tries to clean up the output a bit, though we may want to do more.
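The counting idea can be sketched in a few lines. This is not the speaker-list script itself, which is not reproduced on this page; it assumes each show has already been reduced to a list of speaker names, and the function names and toy data are hypothetical.

```python
from collections import Counter

def count_show_appearances(speakers_per_show):
    """Count, for each speaker, the number of shows they appear in."""
    counts = Counter()
    for speakers in speakers_per_show:
        counts.update(set(speakers))  # count each speaker once per show
    return counts

def recurring(counts, min_shows=2):
    """Return (count, name) pairs for speakers in at least `min_shows` shows, ascending."""
    return sorted((n, name) for name, n in counts.items() if n >= min_shows)

# Toy input: one list of speaker names per show.
shows = [
    ["Jake Tapper", "Dana Bash", "Jake Tapper"],
    ["Dana Bash"],
    ["Jake Tapper", "Jim Acosta"],
]
result = recurring(count_show_appearances(shows))
for n, name in result:
    print(n, name)
```

Running the same function on shows grouped by month rather than by year would give the finer granularity mentioned above.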

If you look at the top speakers so far in 2016, it's a fascinating list:

32 Alison Kosik

32 Margaret Hoover

32 Nic Robertson

32 Polo Sandoval

32 Sen. Rand Paul

33 Jean Casarez

33 Nima Elbagir

33 Sarah Palin

33 Victor Blackwell



34 Sara Sidner

34 Sen. Lindsey Graham


35 Kate Bolduan


36 Chad Myers

36 Sen. Bernie Sanders (Vt-i)

37 Andy Scholes

37 Katrina Pierson

38 Jeffrey Toobin

38 Nick Paton Walsh

39 Brian Todd

39 Frederik Pleitgen

39 Hillary Rodham Clinton

39 Nancy Grace

39 Sara Ganim

40 Matt Lewis

40 Michelle Kosinski

40 Paul Cruickshank

42 Bakari Sellers

42 Bill Clinton


43 Ana Navarro

43 Clarissa Ward

43 S.E. Cupp

43 Van Jones

44 Ben Ferguson

44 Jason Carroll


46 AVO

46 Evan Perez

46 Maeve Reston

47 David Chalian

47 David Gergen

47 Errol Louis

47 Ron Brownstein

48 SFX

49 Dr. Sanjay Gupta


51 Amanda Carpenter

51 John King

52 Miguel Marquez

53 FLO

54 Kayleigh Mcenany

55 Poppy Harlow

58 Athena Jones

59 Bernie Sanders

59 Coy Wire


61 Don Lemon

61 Nick Valencia

62 Barbara Starr

63 Brian Stelter

65 Jim Sciutto

67 Gov. Chris Christie

68 VO

70 Erin Burnett

71 Carol Costello

72 Joe Johns

72 Pamela Brown

74 Ashleigh Banfield

74 Jeffrey Lord

75 Manu Raju

76 Chris Frates

81 Phil Mattingly

88 Christine Romans


93 Mark Preston

95 Brooke Baldwin

96 Gloria Borger

101 Jake Tapper

103 Jeb Bush


112 Jim Acosta

122 Michaela Pereira


131 Brianna Keilar

133 Gov. John Kasich

134 Sunlen Serfaty

135 Anderson Cooper

143 Dana Bash

144 Sen. Bernie Sanders (I-vt)

149 Sara Murray

150 John Berman

157 Wolf Blitzer

160 Barack Obama

163 Alisyn Camerota

166 Sen. Bernie Sanders

168 Jeff Zeleny

180 Chris Cuomo

258 Sen. Marco Rubio

326 Hillary Clinton

382 Sen. Ted Cruz

522 Donald Trump

Sanders appears as Sen. Bernie Sanders (I-vt), Sen. Bernie Sanders (Vt-i), Sen. Bernie Sanders, and SANDERS, and Trump appears as Donald Trump and TRUMP, so we should include multiple names for the same person when extracting the training data.
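A first pass at merging name variants could strip titles and party suffixes before counting, as in the sketch below. The regular expressions cover only the titles that occur in the 2016 list (Sen., Gov., Dr.), and the function names are hypothetical. Surname-only labels such as SANDERS, or variants such as Hillary Rodham Clinton and Hillary Clinton, would still need an explicit alias table.

```python
import re
from collections import Counter

TITLE = re.compile(r"^(?:Sen|Gov|Dr)\.\s+")   # leading title
PARTY = re.compile(r"\s*\([^)]*\)\s*$")       # trailing "(I-vt)" and similar

def canonical_name(label):
    """Strip a trailing party suffix and a leading title from a speaker label."""
    name = PARTY.sub("", label.strip())
    return TITLE.sub("", name)

def merge_counts(counts):
    """Add up the show counts of all labels that share a canonical name."""
    merged = Counter()
    for label, n in counts.items():
        merged[canonical_name(label)] += n
    return merged

# Counts copied from the 2016 list above.
counts = {
    "Sen. Bernie Sanders (I-vt)": 144,
    "Sen. Bernie Sanders (Vt-i)": 36,
    "Sen. Bernie Sanders": 166,
    "Bernie Sanders": 59,
    "Donald Trump": 522,
}
merged = merge_counts(counts)
print(merged["Bernie Sanders"], merged["Donald Trump"])  # 405 522
```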

For systematic disambiguation, it may be possible to use the Library of Congress Name Authority File (LCNAF). It contains 8.2 million name authority records (6 million personal names, 1.4 million corporate names, 180,000 meeting names, 120,000 geographic names, and 0.5 million titles). As a publicly supported U.S. Government institution, the Library generally does not own rights in its collections and what is posted on its website. "Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. The Library also reserves the right to terminate programs that require more than 24 hours to complete." For an example record, see:

The virtue of using this database is that it is likely accurate. However, the records are impoverished relative to Wikipedia; arguably, it's Wikipedia that should be linking into LCNAF. It is also unclear whether the LCNAF has an API that facilitates machine searches; see LoC SRW for leads.

Pete Broadwell writes on 17 April 2016,

In brief, I think the best way to disambiguate named persons would be to set up our own local DBpedia Spotlight service:

We’ve discussed Spotlight briefly in the past; it’s trivial to set up a basic local install via apt, but (similar to Gisgraphy), I think more work will be necessary to download and integrate the larger data sets that would let us tap into the full potential of the software.

In any case, this is something Martin and I have planned to do for the library for quite some time now. I suggest that we first try installing it on babylon, with the data set and index files stored on the Isilon (which is what we do for Gisgraphy) — we could move it somewhere else if babylon is unable to handle the load. We can also see how well it does matching organizations and places (the latter could help us refine the Gisgraphy matches), though of course places and organizations don’t speak.

I share your suspicion that the LCNAF isn’t necessarily any more extensive, accurate or up-to-date than DBpedia/Wikipedia, especially for people who are in the news. It also doesn’t have its own API as far as I can tell; the suggested approach is to download the entire file as RDF triples and set up our own Apache Jena service to index them. Installing Spotlight likely would be a better use of our time.

If we set the cutoff at speakers who have appeared in at least 32 shows, we would get a list of a hundred common speakers. But it may be useful to go much further. Even people who appear in a couple of shows could be of interest; I recognize a lot of the names. That would give us thousands of speakers:

tna@cartago:/tmp$ for YEAR in {2006..2016} ; do echo -en "$YEAR: \t" ; grep -v '^ 1 ' $YEAR-Recurring-Speakers.tpt | wc -l ; done

2006: 2121
2007: 8441
2008: 6922
2009: 7360
2010: 6112
2011: 5442
2012: 4438
2013: 5416
2014: 5189
2015: 4434
2016: 1149

It's likely 2007 is high simply because we have a lot of tpt files from that year. Give some thought to this; the first step is to get the alignment going. Once we have that, we should have a large database of recurring speakers we can train with.

Efficiency coding

The PyCASP project makes an interesting distinction between efficiency coding and application coding. We have a bunch of applications; your task is to integrate them and make them run efficiently in an HPC environment. PyCASP is installed at the Case HPC if you would like to use it. The Berkeley team at ICSI who developed it also have some related projects; we have good contacts with this team:

Please assess whether the efficiency coding framework could be useful in the pipeline design. It's important to bear in mind that we want this design to be clear, transparent, and easy to maintain; it's possible that introducing the PyCASP infrastructure will make it more difficult to extend, in which case we should not use it.

Integrating the training stage

To the extent there's time, I'd also like us to consider a somewhat more ambitious project that integrates the training stage. Could you for instance sketch an outline of how we might create a processing architecture that integrates deep learning for some tasks and conceptors for others? There are a lot of machine learning tools out there; RHSoC2015 used SciPy and Kaldi. Consider Google's project TensorFlow -- this is a candidate deep learning approach for integrated multimodal data. We see this as a longer-term project.

Red Hen Audio Processing Pipeline Guide


To run the main processing script, the following modules should be loaded on the HPC cluster:

module load boost/1_58_0
module load cuda/7.0.28
module load pycasp
module load hdf5
module load ffmpeg

The Python-related modules are installed in a virtual environment, which can be activated with the following command:

. /home/hxx124/myPython/virtualenv-1.9/ENV/bin/activate

Python Wrapper:

Although we welcome audio processing tools implemented in any language, we strongly recommend that developers wrap their code as a Python module, so that the pipeline can include their work by simply importing the corresponding module. This makes it much easier to integrate several audio tools into one unified pipeline. All audio-related work from GSoC 2015 has been wrapped into a Python module called "AudioPipe". See below for some examples:

# the AudioPipe Python Module

import AudioPipe.speaker.recognition as SR # Speaker Recognition Module

import AudioPipe.fingerprint.panako as FP # Acoustic Fingerprinting Module

from AudioPipe.speaker.silence import remove_silence # tool for removing silence from the audio (not needed here)

import numpy as np

from AudioPipe.features import mfcc # Feature Extraction Module, part of the shared preprocessing

import as wav

from AudioPipe.speaker.rec import dia2spk, getspk # Speaker Recognition using diarization results

from AudioPipe.utils.utils import video2audio # Format converting module, part of the shared preprocessing

import commands, os

from AudioPipe.diarization.diarization import Diarization # Speaker Diarization Module

import as DM # Data Management Module

Pipeline Abstract Syntax:

To specify a new pipeline, one can use the abstract syntax defined in the Data Management Module, where a Node is a place to store data and a Flow is a computational process that transforms input data to output data.

For instance, one can create a Node for videos as follows:

# Select the video file to be processed

Video_node = DM.Node("Data/Video/",".mp4")

name = "2015-08-07_0050_US_FOX-News_US_Presidential_Politics"

where DM.Node(dir, ext) is the constructor for a Node; it takes two arguments: a directory (dir) and a file extension (ext).

And to construct a small pipeline that converts video to audio, one can do the following:

# Convert the video to audio

Audio_node = DM.Node("Data/Audio/", ".wav")

audio = Video_node.Flow(video2audio, name, Audio_node, [Audio_node.ext])

This first creates a node for the audio outputs, and then flows a specific file from the video node to the audio node through the computational process called video2audio. The name argument specifies exactly which file is to be processed, and it is also used as the output file name. The last argument of .Flow() is a list of the arguments required by the computational process (video2audio in this case).
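To make the Node and Flow idea concrete, here is a toy re-implementation. It is not the AudioPipe data management module: the class body, the convention that a process is called as process(source_path, target_path, ...), and the to_upper example are all assumptions made for illustration, and the real module may pass arguments and return values differently.

```python
import os
import tempfile

class Node(object):
    """A place where files of one type are stored (toy version)."""

    def __init__(self, directory, ext=""):
        self.dir = directory
        self.ext = ext

    def Pick(self, name):
        """Return the path of the file called `name` in this node."""
        return os.path.join(self.dir, name + self.ext)

    def Flow(self, process, name, target, args):
        """Run `process` on this node's file and write the result into `target`."""
        source_path = self.Pick(name)
        target_path = target.Pick(name)
        if isinstance(args, dict):
            process(source_path, target_path, **args)
        else:
            process(source_path, target_path, *args)
        return target_path

def to_upper(source_path, target_path):
    """Stand-in for a real process such as video2audio."""
    with open(source_path) as f_in, open(target_path, "w") as f_out:
        f_out.write(f_in.read().upper())

# Build two nodes in a temporary directory and flow one file between them.
root = tempfile.mkdtemp()
text_node = Node(os.path.join(root, "text"), ".txt")
upper_node = Node(os.path.join(root, "upper"), ".txt")
os.makedirs(text_node.dir)
os.makedirs(upper_node.dir)
with open(text_node.Pick("sample"), "w") as f:
    f.write("hello")

result = text_node.Flow(to_upper, "sample", upper_node, [])
with open(result) as f:
    print(f.read())
```

The point of the abstraction is that each pipeline step needs only a target Node and one Flow call, regardless of what the process does internally.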

The following code gives a more complex pipeline example:

# Select the file for the meta information

Meta_node = DM.Node("Data/RedHen/",".seg")

meta = Meta_node.Pick(name)

# Store the fingerprint of the video

FP_node = DM.Node("Data/Fingerprint/")

output, err, exitcode = Video_node.Flow(FP.Store, name, FP_node, [])

# Run speaker diarization on the audio

Dia_node = DM.Node("Data/Diarization/", ".rttm")

args = dict(init_cluster=20, dest_mfcc='Data/MFCC', dest_cfg="Data/Model/DiaCfg")

dia = Audio_node.Flow(Diarization, name, Dia_node, args)

# Gender Identification based on Speaker Diarization

Gen_node = DM.Node("Data/Gender/",".gen")

gen = Audio_node.Flow(dia2spk, name, Gen_node, [model_gender, dia, meta, Gen_node.ext])

# Speaker Recognition based on Speaker Diarization

Spk_node = DM.Node("Data/Speaker/",".spk")

spk = Audio_node.Flow(dia2spk, name, Spk_node, [model_speaker, dia, meta, Spk_node.ext])

In short, this pipeline stores the fingerprint of the video, runs speaker diarization on the audio, identifies the gender of each speaker using the segment boundaries provided by the diarization results, and recognizes the speakers from the audio in the same way.