How to monitor the Rosenthal pipeline

Introduction

Hoffman2 is a high-performance computing (HPC) cluster at UCLA, administered by the Institute for Digital Research and Education (IDRE) and used by the NewsScape project for multimodal research in Red Hen. The Rosenthal pipeline was set up in February and March of 2016 to process the files generated by Tim Groeling's team as it digitizes the analog back catalog of the Communication Studies Archive (the Rosenthal Collection), created from 1973 to 2006 by Paul Rosenthal.

This scroll documents how the pipeline is designed and how the monitoring screens are set up to watch over the pipeline and ensure files are processed efficiently. For developing the pipeline, see Video processing pipelines.


Daily tasks

While the Rosenthal pipeline is robust and handles most cases automatically, it needs periodic attention to ensure that everything is operating as intended.

Check in on the pipeline

To confirm the pipeline is healthy:
  • Log in to groeling@cartago
    • in screen window 1, repeat the command to count the number of completed files (see Count completed files)
  • Log in to the active file servers (currently wd3, wd4, and wd5)
    • press 'd' to see the current downloads to hoffman2 (max three)
    • press 'm' to move completed files (expect 10-25 a day -- see Monitoring the NAS)
  • Log in to groeling@hoffman2
    • adjust the number of nodes to request depending on the number of active file servers (see screen window 4 work)
Of course, if files are not picked up from the file servers and removed, or the daily report fails to report anything, you know there is a problem.

Re-digitize failed files

Check the daily e-mail reports for files that failed to repair or lack closed captions. Files that failed to repair typically have degraded video as well as degraded closed captions, so even if the word count for the captions looks good, the text may be junk.

If a tape is assessed to be inherently problematic, give the file the name ${FIL}_BE for "best effort" -- for instance

     2006-08-09_0000_US_00001025_V4_MB9_VHSP4_H14_GG_BE

This signals that it's not worthwhile to try to achieve better results.

If a second or third attempt is made to digitize a tape recording, give the file a numerical extension, for instance

     2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2

Increment this number for each attempt. Never re-use the same file name; this will result in the digitized file being ignored by the inventory script, which interprets it to be the same file as the one that was processed earlier.
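
The numbering rule can be sketched as a small shell helper. This is an illustration only: next_attempt_name is a hypothetical function, not part of the pipeline, and it assumes that a retry suffix is always a trailing underscore followed by digits.

```shell
# Hypothetical helper: print the file name to use for the next digitizing attempt.
# Assumes a retry suffix is a trailing underscore followed by digits only.
next_attempt_name() {
    name=$1
    suffix=${name##*_}          # text after the last underscore
    case $suffix in
        ''|*[!0-9]*) echo "${name}_2" ;;                  # no numeric suffix yet: second attempt
        *)           echo "${name%_*}_$((suffix + 1))" ;; # bump the existing attempt number
    esac
}

next_attempt_name 2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG
next_attempt_name 2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2
```

The first call prints the name with _2 appended, and the second replaces _2 with _3, so no earlier name is ever re-used.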

Pipeline design

The starting point is that a file is digitized in the Public Affairs lab and exported to one of two file servers, WD3 and WD4. A file containing the show cut points may or may not be manually created. In broad strokes, the current pipeline on hoffman2 then runs as follows: 
  1. On a login node, the script fetch-Rosenthal-daemon-local.sh queries each file server for inventory
    • Prefer files that have a cutpoints file (.cuts.txt)
    • Skip files that are newer than two hours
    • Create a $FIL.len file in ~/tv2/pond for each file that is ready to be processed
    • Maintain a pool of 25 .len files at all times, evenly divided from the current file servers
  2. On a login node, the script work-daemon-local-01g.sh requests new jobs from the grid engine when needed
    • Aim to maintain a steady pool of grid-engine jobs at all times (see section 4 work for how to make adjustments)
  3. On a compute node, the script node-daemon-local-14g.sh orchestrates the processing of each file
    • Starts fetch2node, repair, textract, and ts2mp4 
  4. On the compute node, fetch2node-21g.sh transfers the 25GB file
    • Checks first if the file server is already transferring three files -- if it is, deletes the reservation and moves on
    • Copies the file from its source on the file server to the local drive on the compute node
  5. On the compute node, runs repair-06g.sh to repair the file
    • Without repair, CCExtractor cannot get the metadata dump, and the text is often degraded
    • Uses project-x first and then dvbt-ts to ensure a consistent file
  6. On the compute node, runs textract-02.sh to extract the text from the digitized file
    • CCExtractor is used to retrieve the metadata dump, the srt file, and the txt2 file
  7. On the compute node, runs ts2mp4-16g.sh to compress the file
    • The video is compressed from mpeg2 to h264 and the audio from s16p to aac, in an mp4 container
  8. On the compute node, copies the completed files to storage
    • The files are copied to UCLA's Isilon storage, as mounted on cartago, our developer machine
  9. On cartago, split the files on the Isilon, using the cutpoint files
    • Also cut the text files to match
The pipeline is now relatively robust. The main bottlenecks are file quality and the availability of hoffman2 compute nodes. Transfer speeds are down to around 10 minutes for non-simultaneous transfers.
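
The selection rules in step 1 (skip files newer than two hours, prefer files with a cutpoints file) can be sketched as follows. This is not the code of fetch-Rosenthal-daemon-local.sh; list_ready is a hypothetical helper, and it assumes each capture is stored as $FIL.mpg with an optional $FIL.cuts.txt next to it.

```shell
# Sketch of the inventory selection rules, not the actual fetch script.
# Assumes each capture is $FIL.mpg with an optional $FIL.cuts.txt beside it.
list_ready() {
    dir=$1
    # Only consider .mpg files last modified more than 120 minutes ago
    find "$dir" -maxdepth 1 -name '*.mpg' -mmin +120 | sort | while read -r f; do
        fil=${f##*/} ; fil=${fil%.mpg}
        # Rank 0 = has a cutpoints file (preferred), rank 1 = no cutpoints file
        if [ -f "$dir/$fil.cuts.txt" ]; then echo "0 $fil"; else echo "1 $fil"; fi
    done | sort -k1,1n -k2 | cut -d' ' -f2
}

# Assumed usage on a file server:
#   list_ready /mnt/HD/HD_a2/Comm/VHS | head -n 25
```

The output is one base name per line, with the files that have cutpoints listed first, which is the order in which .len files would be created under these assumptions.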


Design economy

The Rosenthal pipeline attempts to maintain a good balance between the availability of files to process on the one hand and compute nodes to do the work on the other. This balance is maintained by two scripts: a budgeting script and an inventory script. 

The budgeting script -- work-daemon-local-01g.sh -- will attempt to provide an optimal number of compute nodes for the pipeline; you tell it to stay below a certain number, and it will dynamically scale up to that limit depending on the need for compute nodes and their availability. The script may not succeed, since it merely makes requests for nodes, and the requests aren't always granted. Sometimes the budgeting script has to wait for hours to get a node, or may fail altogether. Nevertheless, it patiently persists in making requests. It's safe to set the request to a high number, say 30; it will ask for more nodes only if they are needed. It will submit up to four requests at once, and then wait until at least one is granted.
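
As a rough sketch of this rule, the number of new requests can be computed from four figures. The function jobs_to_request and its arguments are hypothetical, and the exact way the real script combines the limits is an assumption based on the description above.

```shell
# Hypothetical sketch of the budgeting rule; the real work daemon may differ.
# Args: waiting .len files, running jobs, queued (not yet granted) requests, ceiling.
jobs_to_request() {
    waiting=$1 ; running=$2 ; queued=$3 ; ceiling=$4
    n=$((waiting - queued))                # one request per waiting file not yet covered
    room=$((ceiling - running - queued))   # stay within the ceiling given on the command line
    burst=$((4 - queued))                  # at most four outstanding requests at once
    if [ "$room" -lt "$n" ]; then n=$room; fi
    if [ "$burst" -lt "$n" ]; then n=$burst; fi
    if [ "$n" -lt 0 ]; then n=0; fi
    echo "$n"
}

jobs_to_request 10 5 0 30    # many files, plenty of headroom: limited by the burst of four
jobs_to_request 2 5 1 30     # two waiting files, one request already queued
jobs_to_request 10 29 0 30   # only one slot left under the ceiling
```

Setting a high ceiling is harmless in this model because the first term caps the requests at the number of waiting files.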

The availability of files to be processed is tracked by the inventory script, fetch-Rosenthal-daemon-local.sh. Its job is to figure out where files are available for download. It looks at the file servers and creates $FIL.len files in the pond directory on hoffman2, which the compute nodes use to find new jobs. The inventory script in turn has to be told how often to check for files and where to look -- so for instance, you could tell it to only check WD3, or to check WD3 three times as often as WD4, or to try to maintain an equal number of files from each source. The default is to check the four file servers from WD1 to WD4 every half hour.

Both the budgeting script and the inventory script are currently set to scale dynamically up to the limit you provide. When there are more files available for processing on the file servers, the inventory script will create more .len files, and the budget script will ask for more nodes. Since each node is requested for 24 hours, we will get overshoot (idle nodes) if the inventory drops. The goal is to avoid both unnecessary waiting and idle nodes; this may take some further tweaking of the flow logic in the budget and inventory scripts.

While this system will flexibly take advantage of available nodes, its throughput is still constrained by the time required to transfer files from the iMac file servers. The transfers do not start until we have reserved a node, and we transfer no more than three files concurrently from each iMac. This limits our ability to take advantage of rapid swings in node availability.
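
The cap of three concurrent transfers can be illustrated with a short sketch. It counts rsync processes in the same way as the d shortcut on the file servers; whether fetch2node uses exactly this test is an assumption, and count_transfers and can_transfer are hypothetical names.

```shell
# Sketch only: counts active rsync transfers in `ps x` output read from stdin,
# using the same filters as the d shortcut on the file servers.
count_transfers() {
    grep rsync | grep -v grep | grep -v "sh -c" | wc -l | tr -d ' '
}

# Succeeds (exit 0) when a server with $1 active transfers can accept another one.
# The limit defaults to three and can be overridden with a second argument.
can_transfer() {
    [ "$1" -lt "${2:-3}" ]
}

# Assumed usage from a compute node (host alias wd3 as in ~/.ssh/config):
#   n=$(ssh wd3 ps x | count_transfers)
#   can_transfer "$n" || echo "wd3 is busy; release the reservation and move on"
```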

Screen windows

Monitoring takes place on the command line, and uses GNU screen to create a suite of windows on a particular hoffman2 login node (typically login1) that persist across logins. When hoffman2 login nodes are rebooted, which happens a few times a year, the screen windows must be reconstituted, since the pipeline is driven by commands that run in the screen windows. A typical screen session looks like this:

0 bin  1 pond  2 stages  3 fetch  4 work  5 jobs  6 day  7 logs  8 downloads  9 compressions  10 processes  11 files

These named windows within screen are made possible by the configuration file ~/.screenrc. Screen windows are named by Ctrl-a A, and you navigate by C-a n (next) and C-a p (previous) or by number: C-a 8. To reach numbers above 9, use C-a ' and enter the two-digit number. Here is what is done in the different screen windows:
  0. bin   -- contains the scripts that control the pipeline processes
  1. pond  -- contains the reservations for the videos that are currently being processed
  2. stages -- displays the log of the current processes
  3. fetch -- runs the inventory script, which creates the marker .len files used to track files available for processing
  4. work -- runs the budget script, which requests new jobs from the grid engine when needed
  5. jobs -- lists current jobs and job queue
  6. day -- a directory tree that tracks which files have been processed
  7. logs -- used to query the pipeline processing logs
  8. downloads -- show the progress of files currently being downloaded
  9. compressions -- show the progress of files currently being compressed
  10. processes -- show the processes running on all active nodes
  11. files -- show the files being created on all active nodes
These are the typical commands used in each screen window:
  0. bin   -- used to edit bash scripts (.sh files) and SGI grid engine jobs (.cmd files)
  1. pond -- cd ~/tv2/pond; f (to list all files that are being or have been fetched); r (ditto for reserved files)
  2. stages -- cd ~/tv2/logs ; tail -n 500 -f reservations.$( date +%F )
  3. fetch --  while true ; do fetch-Rosenthal-daemon-local.sh 25 ; echo -e "\n\tResting since `date`" ; sleep 1800 ; done
  4. work -- work-daemon-local-01g.sh 18
  5. jobs -- myjobs
  6. day -- day 4 ; l ; day $FIL ; l $FIL*  (where $FIL is some file name without extension)
  7. logs -- F=$FIL ; grep $F res* -h (where $FIL is some file name without extension)
  8. downloads -- j2
  9. compressions -- j3
  10. processes -- i
  11. files -- j
Several of these commands -- f, r, day, j2, j3, i, and j -- are defined in ~/.bashrc and work only in the context of the given screen window. Below are the typical outputs of the commands in each screen window.

0 bin

See section Modifying the scripts below.

1 pond

The pond is where the fetch daemon deposits its .len files, which tell the compute nodes which files are ready for processing. It's used to monitor the overall progress:

Rosenthal pipeline -- Output of f in GNU screen window ~/tv2/pond

Other useful commands in 1 pond:
  • e -- remove expired reservations (automated by fetch-daemon-local)
  • r -- to see all reserved files
  • grep ^wd *len -- to see the source of the .len files

2 stages

The main command is "cd ~/tv2/logs ; tail -n 500 -f reservations.$( date +%F )", which gives a continuous output like this:

Rosenthal pipeline on Hoffman2: continuous log in the stages window

For other useful commands in the 2 stages window, see Speed and output monitoring below.

3 fetch

This is the inventory script, which tracks which video files are available on the file servers. The standard command "while true ; do fetch-Rosenthal-daemon-local.sh 12 ; echo -e "\n\tResting since `date`" ; sleep 1800 ; done" gives this kind of output:

Rosenthal pipeline -- fetch files (screen window 3)

The fetch-Rosenthal-daemon-local.sh script calls the script "m" on the file servers, which in turn moves completed files out of the way, and deletes them after a couple of weeks. Note that the inventory script does not copy any video files; it just creates .len files that the compute nodes use to locate the video files. It is the compute nodes that actually download the files from the file servers.

4 work

This is the budget script, which asks for compute nodes when it detects that the inventory script has added tracking files. The standard command "work-daemon-local-01g.sh 8" gives this kind of output:

Work daemon window in the Rosenthal pipeline

The maximum level of jobs to request is set manually; jobs will scale dynamically up to that ceiling. A modest ceiling is a safeguard, though nodes will be requested only when files are waiting. Once there are no waiting files, all jobs will terminate, freeing up the compute nodes. Note that scaling up is gradual, a new job being requested for each waiting file, but scaling down is catastrophic, all jobs terminating when no further files are waiting. This skews the system slightly in favor of the waiting files, at the expense of momentarily idling nodes, yet avoids the large-scale waste of keeping nodes idling until they expire.

5 jobs

The standard command "myjobs" gives this kind of output:

Rosenthal pipeline on Hoffman2: myjobs output

In the State column, r means running and qw means queue wait.

6 day

Typical usage includes this sort of command sequence -- first navigate to the right day in the day directory tree and then list the files belonging to a particular captured file:

Rosenthal pipeline on Hoffman2: day directory tree

You can also navigate in the day tree using "day <days ago>", for instance "day 24". The day directory tree is used by the fetch-Rosenthal-daemon-local.sh script to determine if a captured file has already been converted.

7 logs

This typical command sequence:

  F=2005-05-11_0000_US_00000188_V2_MB12_VHS13_H2_JK
  grep $F res* -h

gives the processing history of a captured file:

Rosenthal pipeline on Hoffman2: commands in the logs screen window

Other useful ways to query the logs include the commands "group <days ago>" -- for instance "group 1" -- and searches for job IDs.

If a file has failed to process, the pipeline may make continual new attempts. These attempts may result in files being left stranded on the compute nodes. To clean up, issue
        purge $FIL
for instance
        purge 2005-03-04_0000_US_00006063_V13_VHSP12_MB20_H17_ES
The purge script will identify all the compute nodes that made a processing attempt and delete the stranded files.

8 downloads

The alias j2, defined in ~/.bashrc, displays the progress of downloading files, which are typically 25GB in size:

Rosenthal pipeline on Hoffman2: download progress

Downloads are faster when there are fewer simultaneous downloads from one source.

9 compressions

The alias j3, defined in ~/.bashrc, displays the progress of compressing mpg files to mp4, typically to around 2GB in size:

Rosenthal pipeline on Hoffman2: file compression progress

It may be possible to speed up compression without quality loss by using single-pass variable bitrate, but our early experiments have not been promising.

10 processes

The main processes on all active nodes can be listed with the shortcut i:

Rosenthal pipeline on Hoffman2: main processes on all active nodes

This is useful to detect active nodes waiting for work (this is also automatically tracked by the budget script), or to see the stages of files on the various nodes.

11 files

The main files on all active nodes can be listed with the shortcut j:

Rosenthal pipeline on Hoffman2: main files on active nodes

These shortcuts can be improved and modified at will in ~/.bashrc.

Speed and output monitoring

Here are some commands for monitoring processing speed and quality.

Download times

Our download times from our current file servers WD3 and WD4 are excellent -- typically ten minutes on a single download. The download times from the WDMs are surprisingly poor and highly variable, so we have stopped using them. If you look at the column with values like "WD1 Dd time 2", this gives you the source and the number of simultaneous downloads from that source. You'll see the slowest downloads come from WD1 with three simultaneous jobs, and the fastest from WD2 with a single job. The number of jobs can vary during the course of the download; the number is just the starting condition. Our modal value from WD2 is an hour and a half, and roughly double that for WD1:

Rosenthal pipeline on Hoffman2: download speeds

Repair times

Repair times are quite predictable, typically around 40 minutes:

Rosenthal pipeline on Hoffman2: repair times

Text extraction

Text extraction takes only a couple of minutes, so we don't time it. We're typically getting more than ten thousand lines of text in every eight-hour file. The quality so far is good and often not clearly inferior to live capture. At the same time, there are many examples of severely degraded text.

Rosenthal pipeline on Hoffman2: text extraction counts

Compression times

Compression times show a surprisingly large range, from two to seven hours with a mode around five.

Rosenthal pipeline on Hoffman2: compression times

Completion times

Completion times range from three to eleven hours. Our modal value used to be around eight, but has now dropped closer to five.

Rosenthal pipeline on Hoffman2: completion times

Nudging the total processing time down even a little bit helps, since we can then fit three and sometimes four files in one 24-hour job. They used to time out after two.

CPU usage

In May 2016, the Rosenthal pipeline used 12,611.13 CPU hours to process 492 files, or 26 CPU hours per file. Since the processing is done on four CPUs, that means we're averaging 6.4 hours per file.
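
The arithmetic behind these averages can be reproduced on the command line. The helper cpu_summary is a hypothetical name used only for this illustration; the input figures are the ones quoted above.

```shell
# Reproduce the per-file averages from the monthly totals quoted above.
# Args: total CPU hours, number of files, CPUs per job.
cpu_summary() {
    awk -v h="$1" -v f="$2" -v c="$3" 'BEGIN {
        per_file = h / f                                   # CPU hours per file
        printf "CPU hours per file:  %.1f\n", per_file
        printf "Wall hours per file: %.1f\n", per_file / c # elapsed hours on c CPUs
    }'
}

cpu_summary 12611.13 492 4
```

This prints 25.6 CPU hours per file (rounded to 26 above) and 6.4 wall-clock hours per file.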


Modifying the scripts

The bash processing scripts in ~/bin can and should be improved when possible. Scripts are numbered, and modifications should be made by copying the current script to an incremental number. The name of the currently active script is kept in a .name file:
  • fetch2node.name
  • node-daemon-local.name
  • repair.name
  • textract.name
  • ts2mp4.name
 To verify the name of the currently active script, check the contents of the corresponding name file:

groeling@login1:~/bin$ cat ts2mp4.name
ts2mp4-16g.sh

Then copy that file before you make changes:

cp ts2mp4-16g.sh ts2mp4-17g.sh

Make your changes to the script, test it, and then activate it by updating the contents of the .name file. The pipeline will check the .name file next time the script is called. This avoids disrupting scripts while they are running -- a potentially very messy situation that can take days to recover from, so be careful.
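
The indirection through the .name file can be sketched like this. The helper active_script is hypothetical and only illustrates the idea; it assumes each .name file holds a single script file name, as in the ts2mp4.name example above.

```shell
# Sketch of the indirection, assuming each .name file holds one script file name.
# active_script is a hypothetical helper, not one of the pipeline scripts.
active_script() {
    bindir=$1 ; stage=$2
    name=$(cat "$bindir/$stage.name") || return 1
    # Refuse to hand back a script that does not exist (e.g. a typo in the .name file)
    if [ ! -f "$bindir/$name" ]; then
        echo "missing script: $name" >&2
        return 1
    fi
    echo "$bindir/$name"
}

# Assumed usage inside a caller such as the node daemon:
#   "$(active_script ~/bin ts2mp4)" "$FIL"
```

Because the name is looked up each time a stage starts, a job that is already running keeps using the script it started with, and only later calls pick up the new version.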

Monitoring and configuring the file servers

We use iMacs with attached RAID as file servers -- they receive digitized files from the digitizing stations, and these files are then picked up by the hoffman2 scripts. 

Monitoring transfers

When you ssh into an iMac file server, you can use these commands to monitor and intervene:
  • p  -- navigate to the working directory
  • d  -- to list current downloads to hoffman2 (set to max 3 per iMac in fetch2node.sh)
  • l1  -- list current files by modification date
  • m -- move completed files into the subdirectory mp4-only or (if a cuts file is present) mp4+cuts
The m command is automated on the iMacs and does not need to be entered manually, but it does no harm.

Remote desktop

If it's necessary to use a remote desktop to the file servers, use VNC through an ssh tunnel. 
Make sure to turn off Apple Remote Desktop when it's not in use, using the shortcut commands sson and ssoff.

Configuring the file structure

  • create symlinks for the path /mnt/HD/HD_a2/Comm/VHS to the working RAID directory
  • create the subdirectories mp4-only mp4+cuts Bad
Since we're no longer using the NAS for file servers, we could simplify the path to VHS, but this arrangement keeps the scripts compatible with the NAS.

Copy configuration files

  • copy the script m to /usr/local/bin -- it moves completed files into the mp4-only directory and cleans up
  • copy over ~/.bashrc ~/.screenrc ~/.nanorc

Macports

For those of our file servers that are OS X desktops, we install macports and run some scripts.
  • port install coreutils gsed bash findutils wget mp4v2 ossp-uuid dos2unix alpine moreutils fail2ban ffmpeg
The scripts require the GNU utilities as default, along with bash 4. To configure,
    • add path to ~/.profile: export PATH="/opt/local/libexec/gnubin:/opt/local/bin:/opt/local/sbin:$PATH"
    • add "/opt/local/bin/bash" to /etc/shells and issue "chsh -s /opt/local/bin/bash"

Change the ssh port

We can save ourselves lots of attacks just by moving the ssh port. To change ssh to use a different port, modify ssh's plist; see instructions. DCL02 has a new ssh.plist -- see /System/Library/LaunchDaemons/ssh.plist.new and /System/Library/LaunchDaemons/ssh.plist.orig. Copy that file (do a diff first to be safe, and save the original), or manually replace the Sockets stanza to read like this:

        <key>Sockets</key>
        <dict>
                <key>Listeners</key>
                <dict>
                        <key>SockServiceName</key>
                        <string>9876</string>
                        <key>SockFamily</key>
                        <string>IPv4</string>
                        <key>Bonjour</key>
                        <array>
                                <string>9876</string>
                                <string>sftp-ssh</string>
                        </array>
                </dict>
        </dict>

-- where 9876 is the newly assigned port. It may even be that we don't need to disclose the new port to Bonjour. Then issue
  • launchctl unload /System/Library/LaunchDaemons/ssh.plist
  • launchctl load /System/Library/LaunchDaemons/ssh.plist
This can be made transparent to the user by modifying ~/.ssh/config on incoming machines as needed -- say

Host wd87
        User csa
        Port 9876
        Hostname fileserver7.ucla.edu

We can then reach the server with a simple "ssh wd87".

Configure fail2ban

For Linux on a Raspberry Pi, see How to set up a Red Hen capture station.

fail2ban @0.9.3 (security, python) -- http://www.fail2ban.org/

Description:  Fail2ban scans log files (e.g. /var/log/apache/error_log) and bans IPs that show the malicious signs -- too many password failures, seeking for exploits, etc. Generally Fail2Ban is then used to update firewall rules to reject the IP addresses for a specified amount of time, although any arbitrary other action (e.g. sending an email, or ejecting CD-ROM tray) could also be configured. Out of the box, Fail2Ban comes with filters for various services (apache, courier, ssh, etc).


To configure, issue

cp /opt/local/etc/fail2ban/fail2ban.conf /opt/local/etc/fail2ban/fail2ban.local
cp /opt/local/etc/fail2ban/jail.conf /opt/local/etc/fail2ban/jail.local

Edit /opt/local/etc/fail2ban/jail.local and add in some IP ranges for friendly computers:

        ignoreip = 127.0.0.1/8 164.67.171.0/24 164.67.183.179

After changing the configuration files, first unload and then load:

sudo port unload fail2ban
sudo port load fail2ban

The log is /var/log/fail2ban.log:

# cat /var/log/fail2*g
2016-05-21 08:37:17,288 fail2ban.server    [99336]: INFO    Changed logging target to /var/log/fail2ban.log for Fail2ban v0.9.3
2016-05-21 08:37:17,289 fail2ban.database       [99336]: INFO    Connected to fail2ban persistent database '/opt/local/var/run/fail2ban/fail2ban.sqlite3'
2016-05-21 08:37:18,089 fail2ban.database       [99336]: WARNING New database created. Version '2'

To block incoming traffic from a specific IP, add a line like this to /etc/pf.conf:
block in from 221.0.213.154
Then re-load the config file:
pfctl -e -f /etc/pf.conf
Not fully tested.

On Hoffman2

Once the new file server is configured, add it to hoffman2's ~/.ssh/config, fetch-Rosenthal-daemon-local.sh, and fetch2node-xxg.sh.

Configuring and monitoring the NAS

We have stopped using the two NAS units WD1 and WD2 as file servers, as their transfer rates are painfully slow. However, they continue to be used for file transport and cutpoint generation.

Configuring the NAS

We first need to set up a new NAS, such as WD1 and WD2, with RSA keys to accept automatic ssh logins. 
First, establish the RSA keys. Log in with the password and issue:

ssh-keygen -t rsa

Copy the freshly generated key:

cat /home/root/.ssh/id_rsa.pub

Add this key to the authorized keys of an already configured NAS. 
Then copy the list of keys and hosts from that WDM:

scp sshd@<IP address>:/home/root/.ssh/authorized_keys /home/root/.ssh/
scp sshd@<IP address>:/home/root/.ssh/config /home/root/.ssh/

Verify remote access is working. 
Second, create the configuration backup directory:

mkdir /mnt/HD/HD_a2/system

and back up the .ssh directory:

cp -rp /home/root/.ssh /mnt/HD/HD_a2/system/ssh-root

Check that /mnt/HD/HD_a2/comm/VHS exists. If the directory is called /mnt/HD/HD_a2/Comm with an uppercase C, create a symlink, as the scripts assume lowercase:

cd /mnt/HD/HD_a2 ; ln -s Comm comm

Third, copy the shortcut files that make navigation and recurring tasks simpler:

cd /mnt/HD/HD_a2/system
scp -p wd1:`pwd`/* .

Tweak the shortcut file 'hi' and copy it into position:

vi hi  (make any needed changes, such as the system name)
cp -p hi /home/root

Finally, activate the shortcut file, copy over the command list, and create the tracking folders:

cd
source hi
scp -p wd1:`pwd`/commands .
mkdir mp4+cuts mp4-only Bad

Change the file ownership to something appropriate to the workflow, for instance:

chown -R admin:share /mnt/HD/HD_a2/comm/VHS

The system is now configured.

Shortcut files


The shortcut files are kept in /mnt/HD/HD_a2/system and currently include the following:

root@WD1 system # cat hi
clear
cd /mnt/HD/HD_a2/comm/VHS
echo -e "\n   Welcome to WD1\n"
echo -e "\tl  -- list files"
echo -e "\tl1 -- by modification time"
echo -e "\td  -- show files downloading"
echo -e "\tp  -- go to video file directory"
echo -e "\tm  -- move completed files to mp4+cuts\n"
ENV=/mnt/HD/HD_a2/system/bashrc ash

After logging into WD1 or WD2 by ssh, type "source hi" to activate the custom environment. You also get a menu with some of the functionality that is added by the bashrc file. This is easy to expand as needed.

root@WD1 system # cat bashrc 

alias nano='vi'
alias bashrc='vi /mnt/HD/HD_a2/system/bashrc'
alias sbashrc='source /mnt/HD/HD_a2/system/bashrc'
alias p='cd /mnt/HD/HD_a2/Comm/VHS'
HISTFILE=/mnt/HD/HD_a2/system/.ashfile
alias history='cat $HISTFILE'
alias ls='ls --color=always'
alias l='ls -Ll'
alias md='mkdir'
alias rd='rmdir'
alias ..='cd ..'
alias d='date ; ps x | grep rsync | grep -v grep | grep -v "sh -c"'
alias ll='ls -goh'
alias la='ls -lA'
alias l1='ls -lt -r'    # sort by date, most recent last
alias l2='ls -l -S -r'  # sort by file size, largest last
alias l3='ls -lt'       # sort by date, most recent first
alias m='for i in mp4-only/*done ; do j=${i##*/} ; if [ -f ${j%%.*}*.txt ] ; then echo -e "mp4+cuts \t ${j%%.*}" ; mv $i ${j%%.*}*.txt ${j%%.*}.mpg mp4-only/${j%%.*}.mpg mp4+cuts 2>/dev/null ; elif [ -f ${j%%.*}.mpg ] ; then echo -e "mp4-only \t ${j%%.*}" ; mv ${j%%.*}.mpg mp4-only 2>/dev/null ; fi ; done'
fix_hyphen () { cd /mnt/HD/HD_a2/Comm/VHS/ ; for i in `find . -maxdepth 1 -regextype sed -regex '.*\/[0-9\-]\{10\}_0000_US_[A-Za-z0-9]\{7,8\}_.*\.\(txt\|mpg\)'` ; do e=${i#*_} ; if [ "$( echo $e | grep '\-' )" ] ; then mv -vn $i ${i%%_*}_${e//-/_} ; fi ; done }

We could do more here -- this bashrc file gives us the ability to define shortcuts for routine tasks, such as navigating or moving files around according to fixed criteria. Note the "m" alias, which moves files into mp4+cuts when ready. Hoffman2 puts a marker for converted files in mp4-only, and m moves the ones that also have a cutpoints file into mp4+cuts, along with the mpg file. We could also put this on a crontab on the WDM, or integrate it into a hoffman2 script. Similarly, the "fix_hyphen" function removes stray hyphens from filenames.
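
The "m" alias is dense, so here are the two parameter expansions it relies on, shown in isolation. The marker name follows the $FIL.mp4.done pattern used elsewhere on this page; the specific file name is just an example.

```shell
# How the m alias derives the base file name from a marker in mp4-only/.
i=mp4-only/2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG.mp4.done
j=${i##*/}      # strip the directory part, leaving $FIL.mp4.done
fil=${j%%.*}    # strip everything from the first dot, leaving $FIL
echo "$fil"
```

The alias then uses this base name to look for a matching .txt cutpoints file and the .mpg file. Note that the second expansion assumes the base name itself contains no dots.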

Finally, system contains the directory ssh-root, which has the RSA keys. 

Reestablishing the configuration after a reboot


When the WDM is reset -- and this doesn't take much; a reboot is enough -- the /home/root directory is wiped (very user friendly). To reconstitute, log in with the password, cd /mnt/HD/HD_a2/system, and

cp -rp ssh-root /home/root/.ssh
cp -p hi /home/root/

Log out and back in; the keys will now work and you can enter "source hi". 

Monitoring the NAS

When you ssh into WD1 or WD2, type "source hi".  You can then use these commands to monitor and intervene:
  • p  -- navigate to the working directory
  • d  -- to list current downloads to hoffman2 (set to max 3 per NAS in fetch2node.sh)
  • l1  -- list current files by modification date
  • m -- move completed files into the subdirectory mp4+cuts
The m command is currently not automated and should be entered manually once a day.

To determine the number of files that are waiting to be processed on hoffman2, count the files in the working directory:

l *mpg | wc -l

and subtract the number of files in the mp4-only directory:

l mp4-only/*mp4.done | wc -l

When the difference is zero, you should switch exporting files to this WDM. Alternatively, run this command:

root@WD2 VHS # for i in *mpg ; do ls -l mp4-only/${i%.*}.mp4.done ; done

If you get "No such file or directory", that file has not been processed.
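
A quieter variant of the same check lists only the files that still lack a marker, so the result can be counted directly. The function unprocessed is a hypothetical helper, and it assumes the markers are named mp4-only/$FIL.mp4.done as in the loop above.

```shell
# Sketch: list captures in the working directory that have no .mp4.done marker yet.
# Assumes markers are named mp4-only/$FIL.mp4.done, as in the loop above.
unprocessed() {
    dir=$1
    for f in "$dir"/*.mpg ; do
        [ -e "$f" ] || continue              # the glob matched nothing
        fil=${f##*/} ; fil=${fil%.mpg}
        if [ ! -e "$dir/mp4-only/$fil.mp4.done" ]; then
            echo "$fil"                      # still waiting to be processed
        fi
    done
}

# Assumed usage on a NAS:
#   unprocessed /mnt/HD/HD_a2/comm/VHS | wc -l
```

When the command prints nothing, every file in the working directory has been processed.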

Hoffman2 retrieves files twice as fast from a machine that is idle (not used for exporting from the capture machines). It also retrieves files twice as fast from two WDMs as from one. A good arrangement would be to have three WDMs, so that one is always used for file exports from the recording machines, and the other two are used to copy files to hoffman2. The time it takes to copy files to hoffman2 is still a major constraint on our ability to process files, since we can only copy three files at a time from each WDM.

Monitoring the Isilon storage

Completed files are transferred to the Isilon storage at cartago: /mnt/ifs/NewsScape/Rosenthal. Cartago produces a daily and a weekly report, which are sent to all e-mail addresses listed in /mnt/ifs/NewsScape/Rosenthal/logs/z-mail-recipients. The person running the digitizing lab can also monitor the results by logging into user groeling on cartago.

Daily reports

Cartago generates a daily e-mail report that looks like this:

	Processing Report from the Rosenthal Pipeline on Hoffman2 for May 28, 2016


			 ~ 2004 ~

	2 files from 2004 completed processing on 2016-05-28

	56447 	08:14:56.90 	2004-06-02_0000_US_00001015_V1_MB11_VHS12_H6_GG
	48307 	08:14:17.66 	2004-06-25_0000_US_00001014_V6_MB6_VHS8_H10_GG

	The following files failed to repair, so the quality of the video and text is likely
        degraded:

		All files repaired successfully

	The following files have no closed captions:

		All files have captions

			 ~ 2005 ~

	5 files from 2005 completed processing on 2016-05-28

	43572 	08:12:57.82 	2005-06-24_0000_US_00000846_V1_MB11_VHS12_H6_LS
	49077 	08:12:30.22 	2005-08-08_0000_US_00001016_V6_M1_VHS13_H3_GG
	36228 	08:01:00.19 	2005-11-21_0000_US_00000956_V3_MB10_VHSP1_H15_GG_BE
	1755 	00:23:01.51 	2005-11-25_0000_US_00000956_V3_MB17_VHSP3_H13_DB
	56065 	08:14:22.08 	2005-12-09_0000_US_00001010_V15_MB5_VHS5_H7_GG

	The following files failed to repair, so the quality of the video and text is likely
        degraded:

		All files repaired successfully

	The following files have no closed captions:

		All files have captions

			 ~ 2006 ~

	10 files from 2006 completed processing on 2016-05-28

	63300 	08:13:33.29 	2006-01-06_0000_US_00001017_V11_M2_VHS10_H4_GG_BE
	53580 	08:13:35.57 	2006-02-24_0000_US_00001011_V10_MB3_VHS3_H9_GG
	54957 	08:13:56.35 	2006-03-29_0000_US_00001008_V12_MB1_VHS4_E1_GG
	53320 	08:14:28.63 	2006-05-19_0000_US_00001007_V8_MB2_VHS2_H8_GG
	50459 	08:13:24.50 	2006-05-26_0000_US_00000958_V8_MB2_VHS2_H8_LS
	47819 	08:13:27.62 	2006-06-02_0000_US_00000995_V8_MB2_VHS2_H8_DB
	55632 	08:13:11.66 	2006-06-12_0000_US_00001009_V10_MB7_VHS7_H11_GG
	49384 	08:14:06.96 	2006-06-16_0000_US_00000981_V8_MB2_VHS2_H8_MS
	49785 	07:10:02.16 	2006-08-10_0000_US_00001012_V4_MB9_VHSP4_H5_GG
	18129 	03:01:02.28 	2006-08-18_0000_US_00001004_V14_MB15_VHSP6_E2_DB

	The following files failed to repair, so the quality of the video and text is likely
        degraded:

		All files repaired successfully

	The following files have no closed captions:

		All files have captions

	Summary Report for 2016-05-28:

		17 files completed processing
		   787816 words of closed captioning
		   452027 seconds of video (125:33:47 hours)
		17 files processed correctly
		All files repaired successfully
		All files have captions

	Command: groeling@cartago:/usr/local/bin/Rosenthal-daily-report 1 s m

	This is an automatically generated daily report for the UCLA Communication Studies
        Archive digitizing project.
	To add or remove recipients, edit 
        groeling@cartago:/mnt/ifs/NewsScape/Rosenthal/logs/z-mail-recipients

Use this information to reprocess whatever files failed because of some error in the digitizing process.
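
To triage a report by script rather than by eye, the per-file lines can be parsed into caption word count, duration, and file name. The following is a minimal Python sketch, not part of the pipeline: the function names and the threshold of 2000 caption words per hour are assumptions chosen for illustration, and files ending in _BE are skipped because they are already marked as best effort.

```python
import re

# Per-file report line: "<caption words> <HH:MM:SS.ss> <file name>"
LINE = re.compile(r"^\s*(\d+)\s+(\d+):(\d\d):(\d\d(?:\.\d+)?)\s+(\S+)")


def parse_report_line(line):
    """Return (words, seconds, name), or None if the line is not a file entry."""
    m = LINE.match(line)
    if not m:
        return None
    words = int(m.group(1))
    seconds = int(m.group(2)) * 3600 + int(m.group(3)) * 60 + float(m.group(4))
    return words, seconds, m.group(5)


def needs_review(line, min_words_per_hour=2000):
    """Flag entries whose caption density is below an assumed threshold."""
    parsed = parse_report_line(line)
    if parsed is None:
        return False
    words, seconds, name = parsed
    if name.endswith("_BE"):
        return False  # already accepted as best effort
    return seconds > 0 and words / (seconds / 3600) < min_words_per_hour
```

A low word count relative to the duration is only a hint; a file can have a normal word count and still contain junk text, so the daily failure lists remain the primary signal.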

Weekly reports

Cartago also generates a weekly e-mail report that looks like this:

        Summary Processing Report from the Rosenthal Pipeline on Hoffman2

                          May 19, 2016 to May 26, 2016

                          93 files completed processing

        14640   08:12:14.78     2006-06-27_0000_US_00000041_V12_M1_VHSP3_H3_JK_BE                    
        21472   08:08:57.70     2004-11-02_0000_US_00000378_V7_MB9_VHSP4_H5_JK_BE                    
        7115    03:01:22.71     2005-08-26_0000_US_00000687_V6_MB9_VHSP4_H5_CG_BE                    
        59954   08:13:19.42     2006-01-25_0000_US_00000830_V11_M2_VHS10_H4_CG               
        44661   08:13:56.11     2006-05-03_0000_US_00000831_V12_MB1_VHS4_E1_CG               
        51721   08:13:45.00     2006-04-27_0000_US_00000842_V10_MB3_VHS3_H9_DB               
        766     03:53:15.29     2005-12-02_0000_US_00000858_V3_M1_VHSP3_H3_JK_BE                     
        57370   08:14:27.12     2006-01-18_0000_US_00000859_V11_M2_VHS10_H4_JK               
        53509   08:14:46.94     2006-04-11_0000_US_00000861_V12_MB2_VHS2_H8_JK               
        55703   08:02:31.97     2006-06-09_0000_US_00000863_V8_MB9_VHSP4_H5_JK_BE                    
        35642   08:14:45.43     2006-04-26_0000_US_00000864_V12_MB1_VHS4_E1_JK               
        55213   08:14:16.30     2006-05-25_0000_US_00000866_V10_MB3_VHS3_H9_JK               
        16106   05:45:45.86     2005-11-29_0000_US_00000868_V3_M1_VHSP3_H3_JK_BE                     
        54526   08:14:43.94     2005-06-17_0000_US_00000871_V1_MB5_VHS5_H7_JK                
        54487   08:14:06.12     2006-05-26_0000_US_00000872_V10_MB7_VHS7_H11_JK              
        41534   08:14:08.19     2006-04-27_0000_US_00000873_V12_MB8_VHSP5_EB1_JK                     
        54256   08:13:41.04     2005-06-16_0000_US_00000875_V1_MB10_VHS6_H12_JK              
        54534   08:14:35.71     2005-05-27_0000_US_00000876_V6_MB6_VHS8_H10_JK               
        8440    08:03:43.61     2005-12-01_0000_US_00000877_V3_MB9_VHSP4_H5_JK               
        53868   08:12:31.56     2006-04-13_0000_US_00000878_V5_MB13_VHS14_H1_JK              
        51669   08:13:52.99     2005-06-15_0000_US_00000879_V1_MB1_VHS12_H6_JK               
        54764   08:14:33.02     2005-06-14_0000_US_00000880_V1_MB15_VHSP6_E2_JK              
        54572   08:14:50.40     2005-06-10_0000_US_00000881_V1_MB5_VHS5_H7_JK                
        50890   08:14:04.73     2005-06-13_0000_US_00000882_V1_MB10_VHS6_H12_JK              
        53989   08:13:54.79     2006-06-06_0000_US_00000883_V10_MB7_VHS7_H11_JK              
        39192   08:14:27.41     2006-04-25_0000_US_00000884_V12_MB1_VHS4_E1_JK               
        60956   08:14:04.34     2006-01-23_0000_US_00000885_V11_M2_VHS10_H4_JK               
        25107   08:14:52.18     2006-05-01_0000_US_00000886_V12_MB8_VHSP5_EB1_JK  Failed repair
        53856   08:14:31.66     2006-06-05_0000_US_00000887_V10_MB3_VHS3_H9_JK               
        53489   08:14:30.53     2005-06-10_0000_US_00000888_V6_MB6_VHS8_H10_JK               
        22463   08:14:36.22     2005-11-30_0000_US_00000889_V3_MB12_VHS13_H2_JK              
        56384   08:14:02.71     2006-04-14_0000_US_00000890_V5_MB13_VHS14_H1_JK              
        51060   07:46:15.38     2005-06-09_0000_US_00000891_V1_MB11_VHS12_H6_JK              
        49630   08:14:19.42     2005-06-08_0000_US_00000892_V1_MB15_VHSP6_E2_JK              
        34431   08:10:12.48     2005-11-28_0000_US_00000893_V3_MB12_VHS13_H2_JK              
        58474   08:07:30.10     2006-04-17_0000_US_00000894_V5_MB13_VHS14_H1_JK              
        55348   08:13:43.39     2005-05-20_0000_US_00000897_V6_MB6_VHS8_H10_CG               
        52284   08:17:00.10     2006-05-30_0000_US_00000898_V10_MB3_VHS3_H9_CG               
        57645   08:12:59.26     2005-06-07_0000_US_00000899_V1_MB10_VHS6_H12_CG              
        55756   08:13:22.37     2005-06-06_0000_US_00000900_V1_MB5_VHS5_H7_CG                
        53420   08:13:06.86     2006-06-01_0000_US_00000901_V10_MB7_VHS7_H11_CG              
        50972   08:13:37.20     2006-04-21_0000_US_00000902_V12_MB1_VHS4_E1_CG               
        55377   08:13:31.32     2006-04-20_0000_US_00000903_V12_MB2_VHS2_H8_CG               
        51449   08:13:14.04     2006-04-19_0000_US_00000904_V12_MB8_VHSP5_EB1_CG                     
        55512   08:13:07.70     2006-01-17_0000_US_00000905_V11_M2_VHS10_H4_CG               
        51051   08:10:00.65     2006-05-29_0000_US_00000906_V10_MB7_VHS7_H11_JK              
        60547   08:12:24.62     2005-05-31_0000_US_00000907_V1_MB5_VHS5_H7_JK                
        60287   08:13:22.15     2005-06-01_0000_US_00000908_V1_MB10_VHS6_H12_JK              
        29937   08:14:27.22     2006-04-18_0000_US_00000909_V12_MB1_VHS4_E1_JK               
        35218   08:14:05.59     2006-04-17_0000_US_00000910_V12_MB8_VHSP5_EB1_JK                     
        55220   08:14:14.71     2006-04-14_0000_US_00000911_V12_M2_VHS2_H8_JK                
        60239   08:14:30.14     2005-10-31_0000_US_00000912_V11_M2_VHS10_H4_JK               
        51001   08:05:33.48     2005-05-06_0000_US_00000914_V6_MB6_VHS8_H10_JK               
        54991   08:13:47.30     2006-05-24_0000_US_00000915_V10_MB3_VHS3_H9_JK               
        36772   08:06:50.69     2005-11-25_0000_US_00000916_V3_MB12_VHS13_H2_MS              
        53838   08:14:36.38     2005-05-26_0000_US_00000917_V1_MB15_VHSP6_E2_MS              
        50397   08:14:04.39     2005-05-25_0000_US_00000918_V1_MB11_VHS12_H6_MS              
        2150    04:39:29.02     2005-05-30_0000_US_00000919_V1_MB9_VHSP4_H5_CG_BE                    
        57340   08:14:18.17     2005-05-27_0000_US_00000920_V1_MB5_VHS5_H7_MS                
        32026   08:14:28.68     2006-04-13_0000_US_00000922_V12_MB2_VHS2_H8_MS               
        40648   08:14:01.63     2006-04-12_0000_US_00000924_V12_MB8_VHSP5_EB1_MS                     
        57427   08:14:58.68     2006-01-20_0000_US_00000925_V11_M2_VHS10_H4_MS               
        51778   08:16:34.03     2006-05-22_0000_US_00000926_V10_MB3_VHS3_H9_MS               
        56007   08:10:50.47     2005-04-29_0000_US_00000927_V6_MB6_VHS8_H10_MS               
        47631   08:05:02.88     2004-11-12_0000_US_00000928_V6_M1_VHSP3_H3_MS                
        54078   08:14:08.62     2006-04-07_0000_US_00000929_V12_MB1_VHS4_E1_MS               
        50752   08:14:13.15     2006-05-19_0000_US_00000930_V10_MB7_VHS7_H11_MS              
        53292   08:16:20.57     2005-05-18_0000_US_00000931_V1_MB10_VHS6_H12_MS              
        62988   08:13:30.43     2006-01-19_0000_US_00000933_V11_M2_VHS10_H4_CG               
        35492   08:13:20.09     2006-04-10_0000_US_00000934_V12_MB8_VHSP5_EB1_CG                     
        48363   08:13:13.18     2006-04-05_0000_US_00000935_V12_MB2_VHS2_H8_CG               
        57832   08:13:02.66     2006-03-14_0000_US_00000937_V10_MB7_VHS7_H11_CG              
        55861   08:12:55.58     2005-05-24_0000_US_00000938_V1_MB10_VHS6_H12_CG              
        39935   08:07:57.10     2005-05-23_0000_US_00000939_V1_MB5_VHS5_H7_CG                
        52774   08:16:31.01     2006-03-13_0000_US_00000941_V10_MB3_VHS3_H9_MS               
        45066   08:12:57.98     2004-11-05_0000_US_00000943_V6_MB6_VHS8_H10_CG               
        53242   08:12:46.25     2006-04-07_0000_US_00000944_V5_MB13_VHS14_H1_CG              
        26862   08:13:08.98     2005-11-24_0000_US_00000945_V3_MB12_VHS13_H2_CG              
        58106   08:13:25.80     2006-01-13_0000_US_00000946_V11_M2_VHS10_H4_CG               
        55951   08:13:37.63     2006-04-11_0000_US_00000947_V5_MB13_VHS14_H1_CG              
        51274   08:02:30.07     2005-11-23_0000_US_00000949_V3_MB8_VHSP5_EB1_CG              
        58276   08:13:24.62     2006-01-16_0000_US_00000950_V11_MB2_VHS2_H8_CG               
        56108   08:13:33.89     2005-12-14_0000_US_00000952_V15_MB5_VHS5_H7_CG               
        55711   08:13:34.66     2005-12-14_0000_US_00000952_V15_MB7_VHS7_H11_CG              
        57123   08:11:10.46     2004-11-02_0000_US_00000963_V6_MB6_VHS8_H10_CG               
        48335   08:12:52.90     2005-05-17_0000_US_00000964_V1_MB11_VHS12_H6_CG              
        44373   08:12:47.88     2005-08-12_0000_US_00000966_V2_M1_VHS13_H3_CG                
        17599   08:15:58.87     2005-08-10_0000_US_00000971_V2_MB1_VHS4_E1_MS                
        56268   08:13:46.51     2006-04-06_0000_US_00000972_V15_MB5_VHS5_H7_MS               
        54823   08:14:22.94     2006-02-03_0000_US_00000973_V10_MB10_VHS6_H12_MS                     
        65302   08:15:39.12     2006-01-10_0000_US_00000974_V11_M2_VHS10_H4_MS               
        51418   08:14:51.91     2004-10-29_0000_US_00000976_V6_MB6_VHS8_H10_MS               
        51847   08:13:59.52     2004-06-09_0000_US_00000978_V1_MB11_VHS12_H9_MS              

        Totals 2016-05-19 to 2016-05-26:

                93 files completed processing
                   4417762 words of closed captioning
                   2693514 seconds of video (748:11:54 hours)
                92 files processed correctly
                1 files failed repair
                All files have captions

        Command: groeling@cartago:/usr/local/bin/Rosenthal-periodic-report 10 7 s

Check inventory

You can also check the inventory directly. To navigate in the Rosenthal tree, use "day $FIL" -- for instance "day 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12" -- or "day $DATE" -- for instance "day 2006-07-04".

Each captured file should have these generated files:

groeling@cartago: $ day 2006-10-23 ; l 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12*
-rw-r--r-- 1 groeling staff 2893853    Feb 13 17:41 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.ccx.bin
-rw-r--r-- 1 groeling staff 549        Feb 21 04:13 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.cuts
-rw-r--r-- 1 groeling staff 2055491965 Feb 13 20:44 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.mp4
-rw-r--r-- 1 groeling staff 804538     Feb 23 07:50 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.txt3

The cuts file is not mandatory. See also Checking the cutpoint files for the fetch-cutfiles script, which copies in missing files.
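
The same inventory check can be sketched in Python. This is an illustrative helper, not an existing script; the function names are assumptions, and the list of expected extensions is taken from the listing above, with .cuts left out because it is optional.

```python
from pathlib import Path

# Extensions from the listing above; .cuts is optional and not checked.
EXPECTED = (".ccx.bin", ".mp4", ".txt3")


def missing_extensions(names, base):
    """Return the expected extensions for which base + ext is not in names."""
    present = set(names)
    return [ext for ext in EXPECTED if base + ext not in present]


def missing_in_directory(directory, base):
    """Run the same check against the files present in a day directory."""
    return missing_extensions((p.name for p in Path(directory).iterdir()), base)
```

For example, missing_in_directory(".", "2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN") returns an empty list when all three files are present. A missing .ccx.bin on its own usually points to a failed repair rather than a failed transfer.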

Identify partial failures

The file cartago:/home/groeling/commands shows how to list failed repairs and failed text extractions:

# List failed repairs
cd /mnt/ifs/NewsScape/Rosenthal/2006 && grep 'Repair failed' */*/*3

# List failed text extractions
cd /mnt/ifs/NewsScape/Rosenthal/2006 && grep 'Text extraction failed' */*/*3

Run these commands daily (or use the information from the daily report) and reprocess whatever files failed because of some error in the recording process.

Count completed files

On cartago you can also query how many files have been processed per day:

groeling@cartago:/mnt/ifs/NewsScape/Rosenthal/2006$ for i in {1..31} ; do echo -en "March $i:\t" ; l1 */*/*4 | grep -P "Mar\s*$i\ " -c ; done
March 1:        33
March 2:        19
March 3:        25
March 4:        10
March 5:        34
March 6:        54
March 7:        63
March 8:        35
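
A month-independent variant of this count can be sketched in Python. It is an assumption-laden alternative to the one-liner above, not an existing script: it groups the .mp4 files under a year directory by modification date in local time, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path


def count_by_day(mtimes):
    """Count datetime objects per calendar day (ISO date string -> count)."""
    return Counter(t.date().isoformat() for t in mtimes)


def completed_per_day(year_dir):
    """Count .mp4 files under year_dir (month/day/file layout) by mtime."""
    files = Path(year_dir).glob("*/*/*.mp4")
    return count_by_day(datetime.fromtimestamp(f.stat().st_mtime) for f in files)
```

Calling completed_per_day("/mnt/ifs/NewsScape/Rosenthal/2006") and printing the sorted items gives one line per day on which files arrived.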

Creating cutpoint files

Each digitized file contains several news programs. The start of each program is marked by a cutpoint, and cutpoints can be generated either before or after file compression.

Generating cutpoints before compression

In the digitizing lab, videos are digitized on 10-20 Macs and the files exported to a file server, currently one of several iMacs with a RAID. Staff and students then create cutpoint files; thanks to a system Groeling devised, this takes around five minutes per 8-hour file.

Hoffman2 picks up the files with cutpoints first. The cutpoint files sometimes have a final two-letter code that differs from that of the mpg file, which the scripts can handle. The pipeline then extracts the metadata and compresses the video and audio to mp4. After processing, the mp4 files are copied to the Isilon at ca:/mnt/ifs/NewsScape/Rosenthal, along with the extracted text and commercial segmentation ($FIL.txt3), the cutpoint files ($FIL.cuts), the srt file ($FIL.srt), and the metadata dump ($FIL.ccx.bin).

If no cutpoint file has been created from the mpg file, we will need to create the cutpoints from the mp4 file instead.

Generating cutpoints after compression

The compressed files can be copied in large numbers onto the WDMs, which students can take home and use to generate cutpoint files. The system designed for rapid splitting should work just as well on the mp4 files, and these files are of course smaller and more tractable.

Checking the cutpoint files

The cutpoint files are currently being created on the WDMs. If they are ready by the time Hoffman2 transfers the file, the cutpoint file will be copied to Hoffman2 and to the Isilon. On the WDMs, the cutpoint files variably have the extension .txt or .cuts.txt; on Hoffman2 and the Isilon they are given the extension .cuts. Again, note that the final two letters in the file name of the .cuts file may differ from that of the .mpg file.

In parallel, the script fetch-cutfiles runs once a day on groeling@cartago, copying any new cutpoint files into the Rosenthal tree; the logs are in /mnt/ifs/NewsScape/Rosenthal/logs. Note that cutpoint files already present in the tree do not get updated by the script -- the version on the tree is treated as master, as it may need to be reformatted before it is fed into the cut script. To trigger an update, just remove or change the extension of the .cuts file.
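
The copy rule described above (normalize the extension to .cuts, never overwrite a file already in the tree) can be sketched as follows. This is not the fetch-cutfiles script; the function names are hypothetical, and the sketch does not attempt to reconcile a differing final two-letter code.

```python
def cuts_name(wdm_name):
    """Map a WDM cutpoint file name (.txt, .cuts.txt, .cuts) to its .cuts name."""
    # Check the longest suffix first so ".cuts.txt" is not treated as ".txt".
    for ext in (".cuts.txt", ".txt", ".cuts"):
        if wdm_name.endswith(ext):
            return wdm_name[: -len(ext)] + ".cuts"
    return None  # not a cutpoint file


def files_to_copy(wdm_names, tree_names):
    """Return {source name: target name} for cutpoint files missing from the tree."""
    existing = set(tree_names)
    plan = {}
    for name in wdm_names:
        target = cuts_name(name)
        if target is not None and target not in existing:
            plan[name] = target  # the tree copy is master, so never overwrite
    return plan
```

Because existing targets are skipped, renaming or removing a .cuts file in the tree is what makes the next run pick up the newer version.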

The cutpoint files may have format anomalies that require pre-processing. Provide feedback to Groeling and his staff if the format is inconsistent or unsuited to machine processing.

Splitting the files into news shows

The next task is to split the files into news shows according to the cutpoint files. We still don't have any experience doing this, so the method will need to be refined. Our assumption is that we will simply split the .mp4 files and the accompanying .txt3 files, but this section also explores some other possibilities that should be attempted.

Splitting will take a steady hand and some manual checking, but the premise is that the cutpoint files will be correct and the process can be fully automated. 

Selecting which file to split

In some cases, there is more than one version of the digitized file. This may happen because the first attempt resulted in a file that failed to repair, or lacked closed captions. 

If a tape is assessed to be inherently problematic, the file is given the name $FIL_BE for "best effort" -- for instance

     2006-08-09_0000_US_00001025_V4_MB9_VHSP4_H14_GG_BE

This is used to signal that it's not worthwhile trying to achieve better results.

If a second or third attempt is made to digitize a tape recording, the resulting file is given a numerical suffix, for instance

     2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2

If there is more than one version of the file present, the version that has an accompanying $FIL.ccx.bin file should always be selected, since the presence of this file means that the original transport stream repaired successfully, and the file is thus almost certainly of better quality.

If there is no $FIL.ccx.bin, we should typically use the version that generated the most caption text, though it may be worth comparing the amount of non-words in the captions. 
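
The selection rule can be written down as a small Python sketch. The record layout (name, has_ccx, caption_words) is a hypothetical one chosen for illustration, and breaking ties among several repaired versions by caption word count is an assumption that goes beyond the rule stated above.

```python
def pick_version(candidates):
    """Pick the file version to split.

    candidates: list of dicts with keys 'name' (str), 'has_ccx' (bool, whether
    a .ccx.bin exists) and 'caption_words' (int, words in the .txt3 file).
    """
    if not candidates:
        raise ValueError("no candidate versions given")
    # Prefer versions whose transport stream repaired (a .ccx.bin is present).
    repaired = [c for c in candidates if c["has_ccx"]]
    pool = repaired or candidates
    # Within the pool, fall back on the amount of caption text.
    return max(pool, key=lambda c: c["caption_words"])["name"]
```

The sketch does not look at the share of non-words in the captions, which may still be worth comparing by hand when two versions are close.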

Embedding text

The Hoffman2 pipeline creates .srt files where possible. This file may be usable for embedding the closed captioning into the mp4 file. It's possible this should be done before the files are split.

Splitting the metadata dump

An interesting question is whether the metadata dump file ($FIL.ccx.bin) can be split using the cutpoint file -- if so, you could just extract the text per show from the pieces. A relatively small number of files failed to repair, which also leads to a failed metadata dump, though interestingly normal text extraction and file compression typically work. These files may get redigitized, and should receive the extension _2.mpg. If splitting the metadata dump files works, that would be a lot simpler than splitting the mpg files.

Splitting the mp4 files

We could just call /usr/local/bin/clip in the script that splits the mp4 files, or copy the commands. The clip script uses ffmpeg and should deliver frame-accurate clips. It can also split the txt3 files, though not the .srt files.
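
If the commands are copied rather than calling clip, one show could be cut out along these lines. This is a hedged sketch and not the contents of the clip script: the function name is hypothetical, the start and end times are assumed to have already been read from the cutpoint file as seconds from the start of the recording, and the libx264 and aac encoders are assumed to be available in the local ffmpeg build. It re-encodes instead of stream-copying, which is generally what frame-accurate cuts require.

```python
def ffmpeg_clip_args(src, start, end, dst):
    """Build an ffmpeg argument list that cuts [start, end) seconds out of src."""
    if end <= start:
        raise ValueError("end must be later than start")
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",        # seek to the start of the show
        "-i", src,
        "-t", f"{end - start:.3f}",   # duration of the show
        "-c:v", "libx264",            # re-encode video (assumed encoder)
        "-c:a", "aac",                # re-encode audio (assumed encoder)
        dst,
    ]
```

The list can be passed to subprocess.run(args, check=True); building it separately makes it easy to print and inspect the commands before running anything.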

Splitting the txt3 files

If splitting the metadata dump files doesn't work, we need to write a script that shifts the base time of the current .txt3 files. The script ca:/usr/local/bin/cc-update-timestamps is a good starting point and may only need tweaking. This is likely the simplest way forward. All times are currently local, so file names will need to be calculated for UTC. Groeling has a spreadsheet with file names and modal broadcast times. 
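
Shifting the base time amounts to adding a fixed offset to the two timestamps at the start of each line. The following Python sketch shows the idea; it is not cc-update-timestamps, the function names are hypothetical, and the line layout (start|end|tag|text with YYYYMMDDHHMMSS.mmm timestamps) is inferred from the caption samples in this scroll.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H%M%S.%f"


def parse_ts(ts):
    """Parse a timestamp such as 20040903020018.078."""
    return datetime.strptime(ts, FMT)


def format_ts(dt):
    """Format a datetime back to YYYYMMDDHHMMSS.mmm."""
    return dt.strftime("%Y%m%d%H%M%S.") + f"{dt.microsecond // 1000:03d}"


def shift_line(line, offset):
    """Add offset (a timedelta) to both timestamps; leave other lines untouched."""
    fields = line.split("|")
    if len(fields) < 3:
        return line
    try:
        start = format_ts(parse_ts(fields[0]) + offset)
        end = format_ts(parse_ts(fields[1]) + offset)
    except ValueError:
        return line  # header or other line without two timestamps
    return "|".join([start, end] + fields[2:])
```

The offset is left to the caller: the same function covers both moving a show's captions to a new base time and converting local times to UTC, once the correct offset for the broadcast date is known.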

Caption credit lines mark most show boundaries:

20040903020018.078|20040903020021.413|CC1|    -- Captions by VITAC --
20040903020018.078|20040903020021.413|CC1|         www.vitac.com

20060629063151.821|20060629063152.321|CC1|CAPTIONS BY VITAC

20060627033042.997|20060627033047.501|CC1| CAPTIONS PAID FOR BY ABC, INC.

20060627063145.147|20060627063145.948|CC1|CAPTIONS PAID FOR BY
20060627063146.082|20060627063150.419|CC1|NBC STUDIOS

20060627060109.415|20060627060111.983|CC1|      CAPTIONS PAID FOR BY
20060627060109.415|20060627060111.983|CC1|  PARAMOUNT DOMESTIC TELEVISION

There's typically a gap of 5-10 seconds before the first caption line of the next show appears:

20060623033040.494|20060623033044.297|CC1|WINNING.
20060623033044.365|20060623033048.769|CC1| CAPTIONS PAID FOR BY ABC, INC.
20060623033054.541|20060623033055.842|SEG_00|Type=Story start
--
20060623035807.739|20060623035811.310|CC1|AND GOOD NIGHT.
20060623035811.411|20060623035816.014|CC1|        Captions by VITAC
20060623035954.881|20060623035958.083|CC1|y     THIS IS "JEOPARDY!"
--
20060623043555.105|20060623043600.810|SEG_00|Type=Commercial
20060623043555.105|20060623043558.508|CC1| CAPTIONS PAID FOR BY ABC, INC.
20060623043600.810|20060623043602.579|SEG_00|Type=Story start
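
The gap after each credit line can be measured with a short Python sketch. The regular expression and function names are assumptions made for illustration, not part of any existing script; the gap is taken from the end of the credit line to the start of the next line that begins at or after that end.

```python
import re
from datetime import datetime

FMT = "%Y%m%d%H%M%S.%f"
# Matches "Captions by ...", "CAPTIONS PAID FOR BY ..." and similar credits.
CREDIT = re.compile(r"captions?\s+(paid\s+for\s+)?by", re.IGNORECASE)


def parse_line(line):
    """Return (start, end, tag, text), or None for lines in another format."""
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) < 4:
        return None
    try:
        start = datetime.strptime(parts[0], FMT)
        end = datetime.strptime(parts[1], FMT)
    except ValueError:
        return None
    return start, end, parts[2], parts[3]


def gaps_after_credits(lines):
    """Return the gap in seconds following each caption credit line."""
    rows = [r for r in map(parse_line, lines) if r is not None]
    gaps = []
    for i, (_, end, tag, text) in enumerate(rows):
        if tag.startswith("CC") and CREDIT.search(text):
            for next_start, _, _, _ in rows[i + 1:]:
                if next_start >= end:
                    gaps.append(round((next_start - end).total_seconds(), 3))
                    break
    return gaps
```

Gaps that are much longer or shorter than the typical range are candidates for manual checking of the show boundary.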

The script cc-keyword-spacing outputs the relative time and spacing between these show boundary markers. It accepts multiple search terms, for instance:

$ cc-keyword-spacing $F 'CAPTION|Caption'

Spacing of keywords in 2006-06-06/2006-06-06_0000_US_00000157_V11_M2_VHS10_H4_JK.txt3 (08:13:56):

        CAPTION|Caption         00:00:20        00:00:20
        CAPTION|Caption         00:59:44        00:59:24
        CAPTION|Caption         02:00:42        01:00:58
        CAPTION|Caption         02:59:39        00:58:57
        CAPTION|Caption         04:59:35        01:59:56
        CAPTION|Caption         06:00:38        01:01:03
        CAPTION|Caption         06:59:56        00:59:18
        CAPTION|Caption         07:59:32        00:59:36

That tells you the previous show ends 20 seconds into the recording, and then there's a series of one-hour shows -- the second column gives the time relative to the start of the recording and the third the interval since the previous marker, which approximates the duration of each show. Some shows lack the boundary marker -- at the end of the fifth hour is just this:

20060606035847.746|20060606035848.814|CC1|MELODY ♪
20060606040136.248|20060606040138.550|SEG_00|Type=Story start

There's a tiny bit of logic added to handle premature caption receipts, as they sometimes double up. In one case (Oprah), the caption credit appeared a few seconds after the show had started, so late credits do sometimes occur.

With very minor modifications, the script will write a tag like this:

        20060606035847.746|20060606035848.814|SEG|Type=Show boundary

The second timestamp in this case is not informative.
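
The spacing report and the boundary tag can be sketched together in Python. This is not the cc-keyword-spacing script: the function names are hypothetical, the recording start time is assumed to be supplied by the caller, and the min_gap rule (ignore a hit less than 60 seconds after the previous one) is a stand-in for whatever logic the script uses for doubled-up credits.

```python
import re
from datetime import datetime

FMT = "%Y%m%d%H%M%S.%f"


def hhmmss(seconds):
    """Format a number of seconds as HH:MM:SS, dropping fractions."""
    seconds = int(seconds)
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def keyword_spacing(lines, pattern, recording_start, min_gap=60):
    """Return (relative time, spacing, boundary tag) for each keyword hit."""
    regex = re.compile(pattern)
    results = []
    previous = None  # relative time of the previous accepted hit, in seconds
    for line in lines:
        parts = line.rstrip("\n").split("|", 3)
        if len(parts) < 4 or not regex.search(parts[3]):
            continue
        try:
            start = datetime.strptime(parts[0], FMT)
        except ValueError:
            continue
        relative = (start - recording_start).total_seconds()
        if previous is not None and relative - previous < min_gap:
            continue  # doubled-up credit, keep only the first
        spacing = relative if previous is None else relative - previous
        tag = f"{parts[0]}|{parts[1]}|SEG|Type=Show boundary"
        results.append((hhmmss(relative), hhmmss(spacing), tag))
        previous = relative
    return results
```

Called with the pattern 'CAPTION|Caption', this yields the two time columns shown in the report above plus a ready-made tag line for each marker.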

Integrating with NewsScape

After splitting, the new files should be merged into the regular NewsScape tv trees on cartago and IFS. Hoffman2 can then pick them up for on-screen text extraction, and the NLP pipeline can operate on the text files. We will need to program this separately, but it's a matter of giving some date parameters to existing routines.

For files from the fall of 2006, which is most files to date, there will be an issue about possible duplicates in the digital collection, but the overlap is not likely to be very extensive. The caption text may at times be better in the digitized files.

