— How to monitor the Rosenthal pipeline

Introduction

Hoffman2 is a high-performance computing (HPC) cluster at UCLA, administered by the Institute for Digital Research and Education (IDRE) and used by the NewsScape project for multimodal research in Red Hen. The Rosenthal pipeline was set up in February and March of 2016 to process the files generated as Tim Groeling's team digitizes the analog back catalog of the Communication Studies Archive (the Rosenthal Collection), created from 1973 to 2006 by Paul Rosenthal.

This scroll documents how the pipeline is designed and how the monitoring screens are set up to watch over the pipeline and ensure files are processed efficiently. For developing the pipeline, see Video processing pipelines.

Related links

Daily tasks

While the Rosenthal pipeline is robust and handles most cases automatically, it needs periodic attention to ensure that everything is operating as intended.

Check in on the pipeline

To confirm the pipeline is healthy:

    • Login to groeling@cartago

      • login to the Hoffman2 node where the pipeline commands are running (usually node 1, with command g 1)

      • access the suite of screens where the pipeline commands are being run with command gr

      • in screen window 1, repeat the command to count the number of completed files (see Count completed files)

    • Login to the active file servers (currently wd2, wd3, wd4, and wd5)

      • press 'd' to see the current downloads to hoffman2 (max three)

      • press 'm' to move completed files (expect 10-25 a day -- see Monitoring the file servers)

    • Login to groeling@hoffman2

      • adjust the number of nodes to request depending on the number of active file servers (see screen window 4 work)

Of course, if files are not picked up from the file servers and removed, or the daily report fails to report anything, you know there is a problem.

Re-digitize failed files

Check the daily e-mail reports for files that failed to repair or lack closed captions. Files that failed to repair typically have both degraded video and closed captions, so even if the word count for the captions looks good, the text may be junk.

If a tape is assessed to be inherently problematic, give the file the name $FIL_BE for "best effort" -- for instance

2006-08-09_0000_US_00001025_V4_MB9_VHSP4_H14_GG_BE

This signals that it's not worthwhile to try to achieve better results.

If a second or third attempt is made to digitize a tape recording, give the file a numerical extension, for instance

2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2

Increment this number for each attempt. Never re-use the same file name; this will result in the digitized file being ignored by the inventory script, which interprets it to be the same file as the one that was processed earlier.
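As an illustration of the numbering rule, a small helper along these lines (hypothetical -- not part of the pipeline) prints the name to use for the next digitization attempt:

```shell
# Hypothetical helper: print the file name for a new digitization attempt,
# appending _2 on the first retry or incrementing an existing numeric suffix.
next_attempt_name () {
  local base="$1" last="${1##*_}"
  case "$last" in
    *[!0-9]*) printf '%s_2\n' "$base" ;;                  # no numeric suffix yet
    *)        printf '%s_%d\n' "${base%_*}" $((last+1)) ;;
  esac
}

next_attempt_name 2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG
# prints: 2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2
next_attempt_name 2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2
# prints: 2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_3
```

This keeps every attempt's file name unique, so the inventory script never mistakes a retry for the already-processed file.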

Pipeline design

The starting point is that a file is digitized in the Public Affairs lab and exported to one of four file servers, WD2 through WD5. A file containing the show cut points may or may not be manually created. In broad strokes, the current pipeline on hoffman2 then runs as follows:

    1. On a login node, the script fetch-Rosenthal-daemon-local.sh queries each file server for inventory

      • Prefers files that have a cutpoints file (.cuts.txt)

      • Skips files that are newer than two hours

      • Creates a $FIL.len file in ~/tv2/pond for each file that is ready to be processed

      • Maintains a pool of 25 .len files at all times, evenly divided among the current file servers

    2. On a login node, the script work-daemon-local-01g.sh requests new jobs from the grid engine when needed

      • Aims to maintain a steady pool of grid-engine jobs at all times (see section 4 work for how to make adjustments)

    3. On a compute node, the script node-daemon-local-14g.sh orchestrates the processing of each file

      • Starts fetch2node, repair, textract, and ts2mp4

    4. On the compute node, fetch2node-21g.sh transfers the 25GB file

      • Checks first if the file server is already transferring three files -- if it is, deletes the reservation and moves on

      • Copies the file from its source on the file server to the local drive on the compute node

    5. On the compute node, repair-06g.sh repairs the file

      • Without repair, CCExtractor cannot get the metadata dump, and the text is often degraded

      • Uses project-x first and then dvbt-ts to ensure a consistent file

    6. On the compute node, textract-02.sh extracts the text from the digitized file

      • CCExtractor is used to retrieve the metadata dump, the srt file, and the txt2 file

    7. On the compute node, ts2mp4-16g.sh compresses the file

      • The video is compressed from mpeg2 to h264 and the audio from s16p to aac, in an mp4 container

    8. On the compute node, frames-49g.sh extracts frames at one-second intervals

      • This takes a long time if the starting point is an mpg file; it's much faster if you start with an mp4 file

    9. On the compute node, jpg2ocr-78g.sh performs optical character recognition on the text written on the frames

      • The output is an ocr file with timestamps and position information for the text

    10. On the compute node, the completed files are copied to storage

      • The files are copied to UCLA's NetApp storage, as mounted on cartago, our developer machine

    11. On cartago, the files are split on the NetApp, using the cutpoint files

      • The text files are also cut to match

The pipeline is now relatively robust. The main bottlenecks are file quality and the availability of hoffman2 compute nodes. Transfer times are down to around 10 minutes for non-simultaneous transfers.
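In outline, the per-file flow can be sketched as a driver like the following (stage names come from the scripts above; each stage is stubbed as a logging function here, since the real numbered .sh scripts run on the cluster):

```shell
# Driver sketch of the per-file flow. The stages are stand-ins that only log;
# on hoffman2 the work is done by the numbered scripts (fetch2node-21g.sh etc.).
stage () { echo "[$1] $2" ; }
process_file () {
  FIL="$1"
  stage fetch2node "$FIL" || return 1   # copy the ~25GB mpg from the file server
  stage repair     "$FIL" || return 1   # project-x, then dvbt-ts
  stage textract   "$FIL"               # CCExtractor: metadata dump, srt, txt2
  stage ts2mp4     "$FIL"               # mpeg2 -> h264, s16p -> aac, mp4 container
  stage frames     "$FIL"               # one frame per second
  stage jpg2ocr    "$FIL"               # OCR of the on-screen text
}
process_file 2006-08-09_0000_US_00001025_V4_MB9_VHSP4_H14_GG
```

The early returns reflect the fact that repair and transfer failures make the remaining stages pointless for that file.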

The frame extraction and OCR were added to the pipeline in February 2018. In order to process files that have already passed through the pipeline, the fetch-Rosenthal-daemon-local.sh script now also imports mp4 and txt3 files from the NetApp server. These files pass through the same pipeline, skipping the text extraction and file compression steps.

Design economy

The Rosenthal pipeline attempts to maintain a good balance between the availability of files to process on the one hand and compute nodes to do the work on the other. This balance is maintained by two scripts: a budgeting script and an inventory script.

The budgeting script -- work-daemon-local-01g.sh -- will attempt to provide an optimal number of compute nodes for the pipeline; you tell it to stay below a certain number, and it will dynamically scale up to that limit depending on the need for compute nodes and their availability. The script may not succeed, since it merely makes requests for nodes, and the requests aren't always granted. Sometimes the budgeting script has to wait for hours to get a node, or may fail altogether. Nevertheless, it patiently persists in making requests. It's safe to set the request to a high number, say 30; it will ask for more nodes only if they are needed. It will submit up to four requests at once, and then wait until at least one is granted.

The availability of files to be processed is tracked by the inventory script, fetch-Rosenthal-daemon-local.sh. Its job is to figure out where there are files available for download. It looks at the file servers and creates $FIL.len files in the pond directory on hoffman2, which the compute nodes use to find new jobs. The inventory script in turn has to be told how often to check for files and where to look -- so for instance, you could tell it to only check WD3, or to check WD3 three times as often as WD4, or to try to maintain an equal number of files from each source. The default is to check the four file servers from WD2 to WD5 every half hour.

Both the budgeting script and the inventory script are currently set to scale dynamically up to the limit you provide. When there are more files available for processing on the file servers, the inventory script will create more .len files, and the budget script will ask for more nodes. Since each node is requested for 24 hours, we will get overshoot (idle nodes) if the inventory drops. The goal is to avoid both unnecessary waiting and idle nodes; this may take some further tweaking of the flow logic in the budget and inventory scripts.
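The scaling rule described above can be sketched as follows (a toy model, not the actual logic in work-daemon-local-01g.sh): request enough jobs to cover the waiting .len files, never exceed the ceiling, and submit at most four requests per round.

```shell
# Toy sketch of the budgeting decision: given the number of waiting .len
# files, running jobs, and the manual ceiling, how many jobs to request now.
jobs_to_request () {
  local waiting="$1" running="$2" ceiling="$3"
  local need=$(( waiting - running ))
  local room=$(( ceiling - running ))
  [ "$need" -gt "$room" ] && need=$room   # never exceed the ceiling
  [ "$need" -gt 4 ] && need=4             # at most four requests at once
  [ "$need" -lt 0 ] && need=0
  echo "$need"
}

jobs_to_request 25 18 18   # ceiling reached -> prints 0
jobs_to_request 10 3 18    # need 7 more, capped at 4 -> prints 4
```

The real script then waits until at least one request is granted before submitting more, since grants can take hours or fail altogether.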

While this system will flexibly take advantage of available nodes, its throughput is still constrained by the time required to transfer files from the iMac file servers. The transfers do not start until we have reserved a node, and we transfer no more than three files concurrently from each iMac. This limits our ability to take advantage of rapid swings in node availability.

Restart the pipeline

If the pipeline is down, these are the steps to restart it:

    • Login to groeling@cartago

    • Login to Hoffman2 login node 1. This is the node where the monitoring of the pipeline typically takes place. All the commands that constitute the pipeline run here.

      • Type g 1 in the command line to access login node 1. In case of problems accessing node 1 from Cartago, login to node 2 instead (type g 2) and then go into node 1 from there by typing ssh login1

    • Now you are in Hoffman2 login node 1. The pipeline is managed through a suite of 12 windows, created with the screen application, which persist across logins.

      • To access the suite of screens, type gr. If the pipeline is working, this command will connect you to the existing set of screens. If the pipeline is down, this command will create an empty screen which looks like this:

    • Now you need to reconstitute the set of screens by creating and renaming each of the 12 screens (automatically numbered 0 to 11).

      • To create a new screen, type ctrl-a c

      • To rename the screen you are on, type ctrl-a A and then edit the name provided (press enter to confirm the name). Call each screen with the names shown in the picture below (bin, pond, stages, fetch, work, jobs, day, logs, downloads, compressions, processes, files). Check the Screen Windows section for details about each of these screens.

      • To move to a different screen, type ctrl-a n (n like "next") or ctrl-a p (p like "previous").

      • After creating all the screens, the bottom of your terminal window should look like this:

      • To restart the functions of the pipeline, go to each screen (type ctrl-a n for "next screen" or ctrl-a p for "previous screen") and type in it the commands provided in the Screen Windows section. Be sure to type the appropriate commands in each screen (they will not work if they are typed into the wrong screen).

        • Some commands will produce continuous output: a stream of information keeps appearing on the screen. The process will not stop unless you stop it forcibly with the command ctrl-c. Typically, these commands do not need to be stopped or repeated once you start them.

        • Other commands will produce a finite list or finite result: you can repeat these commands every time you connect to the screens suite to check the status of the pipeline.

      • To leave the screens suite and let the pipeline do its work, type ctrl-a d (d like "disconnect"). This does not mean you are closing the program: you will be disconnected, but the screens continue to work, and you can access them again with the above command gr

Screen windows

Monitoring takes place on the command line, and uses GNU screen to create a suite of windows on a particular hoffman2 login node (typically login1) that persist across logins. When hoffman2 login nodes are rebooted, which happens a few times a year, the screen windows must be reconstituted, since the pipeline is driven by commands that run in the screen windows. A typical screen session looks like this:

0 bin 1 pond 2 stages 3 fetch 4 work 5 jobs 6 day 7 logs 8 downloads 9 compressions 10 processes 11 files

These named windows within screen are made possible by the configuration file ~/.screenrc. Screen windows are named by Ctrl-a A, and you navigate by C-a n (next) and C-a p (previous) or by number: C-a 8. To reach numbers above 9, use C-a ' and enter the two-digit number. Here is what is done in the different screen windows:

    1. bin -- contains the scripts that control the pipeline processes

    2. pond -- contains the reservations for the videos that are currently being processed

    3. stages -- displays the log of the current processes

    4. fetch -- runs the inventory script, which creates the marker .len files used to track files available for processing

    5. work -- runs the budget script, which requests new jobs from the grid engine when needed

    6. jobs -- lists current jobs and the job queue

    7. day -- a directory tree that tracks which files have been processed

    8. logs -- used to query the pipeline processing logs

    9. downloads -- shows the progress of files currently being downloaded

    10. compressions -- shows the progress of files currently being compressed

    11. processes -- shows the processes running on all active nodes

    12. files -- shows the files being created on all active nodes
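The named windows above can be pre-created with a ~/.screenrc along these lines (a sketch; the actual configuration file on login1 may differ):

```
# Sketch of a ~/.screenrc that recreates the twelve named windows
# and keeps the window list visible on the bottom line.
hardstatus alwayslastline "%-w%{=b}%n %t%{-}%+w"
screen -t bin 0
screen -t pond 1
screen -t stages 2
screen -t fetch 3
screen -t work 4
screen -t jobs 5
screen -t day 6
screen -t logs 7
screen -t downloads 8
screen -t compressions 9
screen -t processes 10
screen -t files 11
```

With a file like this in place, a single screen invocation rebuilds the whole suite after a reboot, instead of creating and renaming each window by hand.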

These are the typical commands used in each screen window:

    1. bin -- used to edit bash scripts (.sh files) and Sun Grid Engine jobs (.cmd files)

    2. pond -- cd ~/tv2/pond; f (to list all files that are being or have been fetched); r (ditto for reserved files)

    3. stages -- cd ~/tv2/logs ; tail -n 500 -f reservations.$( date +%F )

    4. fetch -- while true ; do fetch-Rosenthal-daemon-local.sh 25 ; echo -e "\n\tResting since `date`" ; sleep 1800 ; done

    5. work -- work-daemon-local-01g.sh 18

    6. jobs -- myjobs

    7. day -- day 4 ; l ; day $FIL ; l $FIL* (where $FIL is some file name without extension)

    8. logs -- F=$FIL ; grep $F res* -h (where $FIL is some file name without extension)

    9. downloads -- j2

    10. compressions -- j3

    11. processes -- i

    12. files -- j

Several of these commands -- f, r, day, j2, j3, i, and j -- are defined in ~/.bashrc and work only in the context of the given screen window. Below are the typical outputs of the commands in each screen window.

0 bin

See section Modifying the scripts below.

1 pond

The pond is where the fetch daemon deposits its .len files, which tell the compute nodes which files are ready for processing. It's used to monitor the overall progress:

Rosenthal pipeline -- Output of f in GNU screen window ~/tv2/pond

Other useful commands in 1 pond:

  • e -- remove expired reservations (automated by fetch-daemon-local)

  • r -- to see all reserved files

  • grep ^wd *len -- to see the source of the .len files
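A hypothetical illustration of how reservations can work at the file level -- mv is atomic, so only one claimant can succeed for a given .len file (this is a sketch, not the pipeline's actual code):

```shell
# Sketch of file-level reservations in a throwaway pond directory: renaming
# a .len file to .len.reserved succeeds for exactly one claimant.
pond=$(mktemp -d)
touch "$pond/2006-08-09_0000_US_00001025.len"
claim () {
  for f in "$pond"/*.len ; do
    mv "$f" "$f.reserved" 2>/dev/null && { echo "claimed ${f##*/}" ; return ; }
  done
}
claim   # prints: claimed 2006-08-09_0000_US_00001025.len
```

A second call to claim finds no .len files left and prints nothing, which is why expired reservations need the e command to be cleared.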

2 stages

The main command is "cd ~/tv2/logs ; tail -n 500 -f reservations.$( date +%F )", which gives a continuous output like this:

Rosenthal pipeline on Hoffman2: continuous log in the stages window

For other useful commands in the 2 stages window, see Speed and output monitoring below.

3 fetch

This is the inventory script, which tracks which video files are available on the file servers. The standard command "while true ; do fetch-Rosenthal-daemon-local.sh 12 ; echo -e "\n\tResting since `date`" ; sleep 1800 ; done" gives this kind of output:

Rosenthal pipeline -- fetch files (screen window 3)

The fetch-Rosenthal-daemon-local.sh script calls the script "m" on the file servers, which in turn moves completed files out of the way, and deletes them after a couple of weeks. Note that the inventory script does not copy any video files; it just creates .len files that the compute nodes use to locate the video files. It is also the compute nodes that actually download the files from the file servers.

4 work

This is the budget script, which asks for compute nodes when it detects that the inventory script has added tracking files. The standard command "work-daemon-local-01g.sh 8" gives this kind of output:

Work daemon window in the Rosenthal pipeline

The maximum level of jobs to request is set manually; jobs will scale dynamically up to that ceiling. A modest ceiling is a safeguard, though nodes will be requested only when files are waiting. Once there are no waiting files, all jobs will terminate, freeing up the compute nodes. Note that scaling up is gradual, a new job being requested for each waiting file, but scaling down is catastrophic, all jobs terminating when no further files are waiting. This skews the system slightly in favor of the waiting files, at the expense of momentarily idling nodes, yet avoids the large-scale waste of keeping nodes idling until they expire.

The number of nodes requested is determined by the final number of the standard command: work-daemon-local-01g.sh 8. In this case the script requests 8 nodes from the cluster and will stop requesting new nodes when 8 are running, even if more are needed to process the queue.

This number can be upped, but this should be done gradually, to avoid clogging the pipes to the file servers with requests that won't be honored -- the scripts have a limit set to copy one file at a time from each file server. If needed, the limit can be progressively increased until the queue has been processed. To do that, press ctrl-c to interrupt the script and then run the main command again with a higher limit, e.g. 12: work-daemon-local-01g.sh 12 . Check with Prof. Francis Steen concerning the change of node requests.

5 jobs

The standard command "myjobs" gives this kind of output:

Rosenthal pipeline on Hoffman2: myjobs output

In the State column, r means running and qw means queue wait.

6 day

Typical usage includes this sort of command sequence -- first navigate to the right day in the day directory tree and then list the files belonging to a particular captured file:

Rosenthal pipeline on Hoffman2: day directory tree

You can also navigate in the day tree using "day <days ago>", for instance "day 24". The day directory tree is used by the fetch-Rosenthal-daemon-local.sh script to determine if a captured file has already been converted.

7 logs

This typical command sequence:

F=2005-05-11_0000_US_00000188_V2_MB12_VHS13_H2_JK

grep $F res* -h

gives the processing history of a captured file:

Rosenthal pipeline on Hoffman2: commands in the logs screen window

Other useful ways to query the logs include the commands "group <days ago>" -- for instance "group 1" -- and searches for job IDs.

If a file has failed to process, the pipeline may make continual new attempts. These attempts may result in files being left stranded on the compute nodes. To clean up, issue

purge $FIL

for instance

purge 2005-03-04_0000_US_00006063_V13_VHSP12_MB20_H17_ES

The purge script will identify all the compute nodes that made a processing attempt and delete the stranded files.

8 downloads

The alias j2, defined in ~/.bashrc, displays the progress of downloading files, which are typically 25GB in size:

Rosenthal pipeline on Hoffman2: download progress

Downloads are faster when there are fewer simultaneous downloads from one source.

9 compressions

The alias j3, defined in ~/.bashrc, displays the progress of compressing mpg files to mp4, typically to around 2GB in size:

Rosenthal pipeline on Hoffman2: file compression progress

It may be possible to speed up compression without quality loss by using single-pass variable bitrate, but our early experiments have not been promising.

10 processes

The main processes on all active nodes can be listed with the shortcut i:

Rosenthal pipeline on Hoffman2: main processes on all active nodes

This is useful to detect active nodes waiting for work (this is also automatically tracked by the budget script), or to see the stages of files on the various nodes.

11 files

The main files on all active nodes can be listed with the shortcut j:

Rosenthal pipeline on Hoffman2: main files on active nodes

These shortcuts can be improved and modified at will in ~/.bashrc.

Speed and output monitoring

Here are some commands for monitoring processing speed and quality.

Download times

Our download times from our current file servers WD2 through WD5 are excellent -- typically ten minutes on a single download. If you look at the column with values like "WD2 Dd time 2", this gives you the source and the number of simultaneous downloads from that source. The number of jobs can vary during the course of the download; the number is just the starting condition. Our modal value is around 15 minutes from the file servers and 5 minutes from NetApp; in this old screen shot we see it could take several hours:

Rosenthal pipeline on Hoffman2: download speeds

Repair times

Repair times are quite predictable, typically around 40 minutes:

Rosenthal pipeline on Hoffman2: repair times

Text extraction

Text extraction takes only a couple of minutes, so we don't time it. We're typically getting more than ten thousand lines of text in every eight-hour file. The quality so far is good and often not clearly inferior to live capture. At the same time, there are many examples of severely degraded text.

Rosenthal pipeline on Hoffman2: text extraction counts

Compression times

Compression times show a surprisingly large range, from two to seven hours with a mode around five.

Rosenthal pipeline on Hoffman2: compression times

Completion times

Completion times range from three to eleven hours. Our modal value used to be around eight, but has now dropped closer to five.

Rosenthal pipeline on Hoffman2: completion times

Nudging the total processing time down even a little bit helps, since we can then fit three and sometimes four files in one 24-hour job. They used to time out after two.

CPU usage

In May 2016, the Rosenthal pipeline used 12,611.13 CPU hours to process 492 files, or 26 CPU hours per file. Since the processing is done on four CPUs, that means we're averaging 6.4 hours per file.

Modifying the scripts

The bash processing scripts in ~/bin can and should be improved when possible. Scripts are numbered, and modifications should be made by copying the current script to an incremental number. The name of the currently active script is kept in a .name file:

  • fetch2node.name

  • node-daemon-local.name

  • repair.name

  • textract.name

  • ts2mp4.name

To verify the name of the currently active script, check the contents of the corresponding name file:

groeling@login1:~/bin$ cat ts2mp4.name

ts2mp4-16g.sh

Then copy that file before you make changes:

cp ts2mp4-16g.sh ts2mp4-17g.sh

Make your changes to the script, test it, and then activate it by updating the contents of the .name file. The pipeline will check the .name file next time the script is called. This avoids disrupting scripts while they are running -- a potentially very messy situation that can take days to recover from, so be careful.
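The indirection can be tried out in a throwaway directory (a sketch of the mechanism, using ts2mp4 as the example): the caller resolves the active script through the .name file on every invocation, so a new version goes live only when the .name file is updated.

```shell
# Sketch of the .name indirection in a temporary bin directory.
bin=$(mktemp -d)
printf '#!/bin/sh\necho "ts2mp4 version 16g on $1"\n' > "$bin/ts2mp4-16g.sh"
chmod +x "$bin/ts2mp4-16g.sh"
echo ts2mp4-16g.sh > "$bin/ts2mp4.name"

# Caller: look up the active script each time, then run it.
run_current () { "$bin/$(cat "$bin/ts2mp4.name")" "$@" ; }

run_current FILE1   # prints: ts2mp4 version 16g on FILE1
```

Running instances keep using the script they started with; only calls made after the .name file changes pick up the new version.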

Monitoring and configuring the file servers

We use iMacs with attached RAID as file servers -- they receive digitized files from the digitizing stations, and these files are then picked up by the hoffman2 scripts.

Monitoring transfers

When you ssh into an iMac file server, you can use these commands to monitor and intervene:

  • p -- navigate to the working directory

  • d -- to list current downloads to hoffman2 (set to max 3 per iMac in fetch2node.sh)

  • l1 -- list current files by modification date

  • m -- move completed files into the subdirectory mp4-only or (if a cuts file is present) mp4+cuts

The m command is automated on the iMacs and does not need to be entered manually, but it does no harm.

Remote desktop

If it's necessary to use a remote desktop to the file servers, use VNC through an ssh tunnel.

Make sure to turn off Apple Remote Desktop when it's not in use, using the shortcut commands sson and ssoff.

Configuring the file structure

    • create symlinks for the path /mnt/HD/HD_a2/Comm/VHS to the working RAID directory

    • create the subdirectories mp4-only mp4+cuts Bad

Since we're no longer using the NAS for file servers, we could simplify the path to VHS, but this arrangement keeps the scripts compatible with the NAS.

Copy configuration files

  • copy the script m to /usr/local/bin -- it moves completed files into the mp4-only directory and cleans up

  • copy over ~/.bashrc ~/.screenrc ~/.nanorc

Macports

For those of our file servers that are OS X desktops, we install macports and run some scripts.

    • port install coreutils gsed bash findutils wget mp4v2 ossp-uuid dos2unix alpine moreutils fail2ban ffmpeg

The scripts require the GNU utilities as default, along with bash 4. To configure,

    • add the path to ~/.profile: export PATH="/opt/local/libexec/gnubin:/opt/local/bin:/opt/local/sbin:$PATH"

    • add "/opt/local/bin/bash" to /etc/shells and issue "chsh -s /opt/local/bin/bash"

Change the ssh port

We can save ourselves lots of attacks just by moving the ssh port. To change ssh to use a different port, modify ssh's plist; see instructions. DCL02 has a new ssh.plist -- see /System/Library/LaunchDaemons/ssh.plist.new and /System/Library/LaunchDaemons/ssh.plist.orig. Copy that file (do a diff first to be safe, and save the original), or manually replace the Sockets stanza to read like this:

<key>Sockets</key>

<dict>

<key>Listeners</key>

<dict>

<key>SockServiceName</key>

<string>9876</string>

<key>SockFamily</key>

<string>IPv4</string>

<key>Bonjour</key>

<array>

<string>9876</string>

<string>sftp-ssh</string>

</array>

</dict>

</dict>

-- where 9876 is the newly assigned port. It may even be that we don't need to disclose the new port to Bonjour. Then issue

  • launchctl unload /System/Library/LaunchDaemons/ssh.plist

  • launchctl load /System/Library/LaunchDaemons/ssh.plist

This can be made transparent to the user by modifying ~/.ssh/config on incoming machines as needed -- say

Host wd87

User csa

Port 9876

Hostname fileserver7.ucla.edu

We can then reach the server with a simple "ssh wd87".

Configure fail2ban

For Linux on a Raspberry Pi, see How to set up a Red Hen capture station.

fail2ban @0.9.3 (security, python) -- http://www.fail2ban.org/

Description: Fail2ban scans log files (e.g. /var/log/apache/error_log) and bans IPs that show the malicious signs—too many password failures, seeking for exploits, etc. Generally Fail2Ban is then used to update firewall rules to reject the IP addresses for a specified amount of time, although any arbitrary other action (e.g. sending an email, or ejecting CD-ROM tray) could also be configured. Out of the box, Fail2Ban comes with filters for various services (apache, curier, ssh, etc).

To configure, issue

cp /opt/local/etc/fail2ban/fail2ban.conf /opt/local/etc/fail2ban/fail2ban.local

cp /opt/local/etc/fail2ban/jail.conf /opt/local/etc/fail2ban/jail.local

Edit /opt/local/etc/fail2ban/jail.local and add in some IP ranges for friendly computers:

ignoreip = 127.0.0.1/8 164.67.171.0/24 164.67.183.179

After changing the configuration files, first unload and then load:

sudo port unload fail2ban

sudo port load fail2ban

The log is /var/log/fail2ban.log:

# cat /var/log/fail2*g

2016-05-21 08:37:17,288 fail2ban.server [99336]: INFO Changed logging target to /var/log/fail2ban.log for Fail2ban v0.9.3

2016-05-21 08:37:17,289 fail2ban.database [99336]: INFO Connected to fail2ban persistent database '/opt/local/var/run/fail2ban/fail2ban.sqlite3'

2016-05-21 08:37:18,089 fail2ban.database [99336]: WARNING New database created. Version '2'

To block incoming traffic from a specific IP, add a line like this to /etc/pf.conf:

block in from 221.0.213.154

Then re-load the config file:

pfctl -e -f /etc/pf.conf

Not fully tested.

To unban, use iptables:

iptables -L --line-numbers (list the rules -- very slow)

iptables -L -n

iptables -D INPUT 1 (remove a rule by number)

iptables -L INPUT -v -n | grep 45.49.133.177 (search for an address -- very fast)

iptables -D INPUT -s 45.49.133.177 -j DROP (remove a rule by IP address -- repeat for multiple rules)

iptables -D fail2ban-ssh -s <banned_ip> -j DROP -- remove an ssh rule by IP address

See also Unbanning IP addresses in fail2ban.

On Hoffman2

Once the new file server is configured, add it to hoffman2's ~/.ssh/config, fetch-Rosenthal-daemon-local.sh, and fetch2node-xxg.sh.

Configuring and monitoring the NAS

We have stopped using the two NAS units WD1 and WD2 as file servers, as their transfer rates are painfully slow. However, they continue to be used for file transport and cutpoint generation.

Configuring the NAS

We first need to set up a new NAS, such as WD1 and WD2, with RSA keys to accept automatic ssh logins.

First, establish the RSA keys. Log in with the password and issue:

ssh-keygen -t rsa

Copy the freshly generated key:

cat /home/root/.ssh/id_rsa.pub

Add this key to the authorized keys of an already configured NAS.

Then copy the list of keys and hosts from that WDM:

scp sshd@<IP address>:/home/root/.ssh/authorized_keys /home/root/.ssh/

scp sshd@<IP address>:/home/root/.ssh/config /home/root/.ssh/

Verify remote access is working.

Second, create the configuration backup directory:

mkdir /mnt/HD/HD_a2/system

and back up the .ssh directory:

cp -rp /home/root/.ssh /mnt/HD/HD_a2/system/ssh-root

Check that /mnt/HD/HD_a2/comm/VHS exists. If the directory is called /mnt/HD/HD_a2/Comm with an uppercase C, create a symlink, as the scripts assume lowercase:

cd /mnt/HD/HD_a2 ; ln -s Comm comm

Third, copy the shortcut files that make navigation and recurring tasks simpler:

cd /mnt/HD/HD_a2/system

scp -p wd1:`pwd`/* .

Tweak the shortcut file 'hi' and copy it into position:

vi hi (make any needed changes, such as the system name)

cp -p hi /home/root

Finally, activate the shortcut file, copy over the command list, and create the tracking folders:

cd

source hi

scp -p wd1:`pwd`/commands .

mkdir mp4+cuts mp4-only Bad

Change the file ownership to something appropriate to the workflow, for instance:

chown -R admin:share /mnt/HD/HD_a2/comm/VHS

The system is now configured.

Shortcut files

The shortcut files are kept in /mnt/HD/HD_a2/system and currently include the following:

root@WD1 system # cat hi

clear

cd /mnt/HD/HD_a2/comm/VHS

echo -e "\n Welcome to WD1\n"

echo -e "\tl -- list files"

echo -e "\tl1 -- by modification time"

echo -e "\td -- show files downloading"

echo -e "\tp -- go to video file directory"

echo -e "\tm -- move completed files to mp4+cuts\n"

ENV=/mnt/HD/HD_a2/system/bashrc ash

After logging into WD1 or WD2 by ssh, type "source hi" to activate the custom environment. You also get a menu with some of the functionality that is added by the bashrc file. This is easy to expand as needed.

root@WD1 system # cat bashrc

alias nano='vi'

alias bashrc='vi /mnt/HD/HD_a2/system/bashrc'

alias sbashrc='source /mnt/HD/HD_a2/system/bashrc'

alias p='cd /mnt/HD/HD_a2/Comm/VHS'

HISTFILE=/mnt/HD/HD_a2/system/.ashfile

alias history='cat $HISTFILE'

alias ls='ls --color=always'

alias l='ls -Ll'

alias md='mkdir'

alias rd='rmdir'

alias ..='cd ..'

alias d='date ; ps x | grep rsync | grep -v grep | grep -v "sh -c"'

alias ll='ls -goh'

alias la='ls -lA'

alias l1='ls -lt -r' # sort by date, reversed

alias l2='ls -l -S -r' # sort by file size, reversed

alias l3='ls -lt' # sort by date, most recent last

alias m='for i in mp4-only/*done ; do j=${i##*/} ; if [ -f ${j%%.*}*.txt ] ; then echo -e "mp4+cuts \t ${j%%.*}" ; mv $i ${j%%.*}*.txt ${j%%.*}.mpg mp4-only/${j%%.*}.mpg mp4+cuts 2>/dev/null ; elif [ -f ${j%%.*}.mpg ] ; then echo -e "mp4-only \t ${j%%.*}" ; mv ${j%%.*}.mpg mp4-only 2>/dev/null ; fi ; done'

fix_hyphen () { cd /mnt/HD/HD_a2/Comm/VHS/ ; for i in `find . -maxdepth 1 -regextype sed -regex '.*\/[0-9\-]\{10\}_0000_US_[A-Za-z0-9]\{7,8\}_.*\.\(txt\|mpg\)'` ; do e=${i#*_} ; if [ "$( echo $e | grep '\-' )" ] ; then mv -vn $i ${i%%_*}_${e//-/_} ; fi ; done }

We could do more here -- this bashrc file gives us the ability to define shortcuts for routine tasks, such as navigating or moving files around according to fixed criteria. Note the "m" alias, which moves files into mp4+cuts when ready. Hoffman2 puts a marker for converted files in mp4-only, and m moves the ones that also have a cutpoints file into mp4+cuts, along with the mpg file. We could also put this on a crontab on the WDM, or integrate it into a hoffman script. Similarly, the "fix_hyphen" function removes stray hyphens from filenames.
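For readability, the one-line m alias can be unrolled into an equivalent function (same logic, reformatted; paths are relative to the VHS working directory):

```shell
# The m alias unrolled: for each conversion marker in mp4-only, move the
# source mpg (and its cuts file, if any) into the matching done-directory.
m_unrolled () {
  for i in mp4-only/*done ; do
    j=${i##*/}      # marker file name, e.g. FILE.mp4.done
    b=${j%%.*}      # base name without extensions, e.g. FILE
    if [ -f "$b"*.txt ] ; then
      # A cuts file exists: move marker, cuts file, and mpg into mp4+cuts
      echo -e "mp4+cuts \t $b"
      mv "$i" "$b"*.txt "$b".mpg mp4-only/"$b".mpg mp4+cuts 2>/dev/null
    elif [ -f "$b".mpg ] ; then
      # No cuts file: park the converted mpg in mp4-only
      echo -e "mp4-only \t $b"
      mv "$b".mpg mp4-only 2>/dev/null
    fi
  done
}
```

The 2>/dev/null on the mv calls quietly skips whichever of the listed files happen not to exist for a given base name, just as the alias does.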

Finally, system contains the directory ssh-root, which has the RSA keys.

Reestablishing the configuration after a reboot

When the WDM is reset -- and this doesn't take much; a reboot is enough -- the /home/root directory is wiped (very user friendly). To reconstitute, log in with the password, cd /mnt/HD/HD_a2/system, and

cp -rp ssh-root /home/root/.ssh

cp -p hi /home/root/

Log out and back in; the keys will now work and you can enter "source hi".
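The two copy commands can also be wrapped in a small re-runnable helper. This is only a sketch; the defaults are the WDM paths above, and the function name is made up:

```shell
restore_wdm () {
  # $1 = system dir, $2 = root's home; defaults match the WDM paths above
  sys=${1:-/mnt/HD/HD_a2/system}
  dest=${2:-/home/root}
  cp -rp "$sys/ssh-root" "$dest/.ssh" &&
  cp -p "$sys/hi" "$dest/" &&
  echo "restored -- log out and back in, then: source hi"
}
```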

Monitoring the file servers

When you ssh into the file servers, named wd2 through wd5, you can use these commands to monitor and intervene:

  • p -- navigate to the working directory

  • d -- list current downloads to hoffman2 (set to max 3 per file server in fetch2node.sh)

  • l1 -- list current files by modification date

  • m -- move completed files into the subdirectory mp4+cuts

The m command is automatically triggered by the fetch-Rosenthal-daemon-local.sh script running on Hoffman2.

To determine how many files are still waiting to be processed on hoffman2, count the files in the working directory:

l *mpg | wc -l

and subtract the number of files in the mp4-only directory:

l mp4-only/*mp4.done | wc -l

When the difference reaches zero, all files on this server have been processed, and you can switch exporting new files to this file server. Alternatively, run this command:

for i in *mpg ; do ls -l mp4-only/${i%.*}.mp4.done ; done

If you get "No such file or directory", that file has not been processed.
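The counting and per-file checks above can be combined into a single pass. A sketch, assuming the working directory layout described here (count_pending is a made-up helper name):

```shell
count_pending () {
  # Run inside a file server working directory (e.g. after the p alias).
  # Counts mpg files that have no matching .mp4.done marker in mp4-only/
  local n=0 i
  for i in *.mpg ; do
    [ -e "$i" ] || continue                      # no mpg files at all
    [ -f "mp4-only/${i%.*}.mp4.done" ] || n=$((n+1))
  done
  echo "$n"
}
```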

The file transfer management is now fully automated and is not giving us problems.

Monitoring the NetApp storage

Completed files are transferred to the NetApp storage at cartago: /mnt/netapp/NewsScape/Rosenthal. Cartago produces a daily and a weekly report that is sent to all e-mails listed in /mnt/netapp/NewsScape/Rosenthal/logs/z-mail-recipients. The person running the digitizing lab can also monitor the results by logging into user groeling on cartago.

Daily reports

Cartago generates a daily e-mail report that looks like this:


Processing Report from the Rosenthal Pipeline on Hoffman2 for August 16, 2021 UTC

~ 1994 ~
15 files from 1994 completed processing on 2021-08-16 UTC
0 06:03:02.18 1994-09-23_0000_US_NA047376_V4_VHS35_MB5_H9_JM.mp4
10206 06:03:08.45 1994-09-25_0000_US_NA047390_V18_VHS41_MB30_H34_AD.mp4
5205 06:02:50.86 1994-09-25_0000_US_NA047392_V20_VHS52_MB4_H31_JM.mp4
806 02:02:23.18 1994-09-26_0000_US_NA047386_V14_VHS47_MB56_E11_JM.mp4
21462 06:03:42.74 1994-09-26_0000_US_NA047393_V21_VHSP21_MB57_H49_JM.mp4
4038 06:02:52.30 1994-09-26_0000_US_NA047401_V4_VHSP25_MB22_H26_JM.mp4
9926 02:04:00.79 1994-09-26_0000_US_NA047406_V9_VHSP18_MB2_H24_JM.mp4
3979 02:06:13.01 1994-09-26_0000_US_NA047407_V10_VHS42_M25_H37_JM.mp4
13254 02:04:35.11 1994-09-26_0000_US_NA047408_V11_VHS64_MB27_H32_JM.mp4
6318 02:03:47.06 1994-09-26_0000_US_NA047409_V12_VHSP16_MB11_H2_JM.mp4
5798 02:04:06.84 1994-09-26_0000_US_NA047411_V14_VHS60_MB28_E10_JM.mp4
13611 02:03:49.13 1994-09-26_0000_US_NA047414_V19_VHSP20_MB23_H33_JM.mp4
9165 02:03:42.36 1994-09-27_0000_US_NA047410_V13_VHS61_MB6_H43_JM.mp4
1136 02:04:13.20 1994-09-27_0000_US_NA047412_V15_VHS51_MB8_H16_JM.mp4
2901 02:04:02.64 1994-09-28_0000_US_NA047415_V18_VHS65_MB26_H17_JM.mp4
The following files failed to repair, so the quality of the video and text is likely degraded:
All files repaired successfully
The following files have no closed captions:
1994-09-23_0000_US_NA047376_V4_VHS35_MB5_H9_JM.mpg
Summary Report for 2021-08-16:
15 files completed processing
107805 words of closed captioning
183384 seconds of video (50:56:24 hours)
15 files processed correctly
All files repaired successfully
1 files have no captions
Command: /usr/local/bin/Rosenthal-daily-report 1 s m
This is an automatically generated daily report for the UCLA Communication Studies Archive digitizing project. To add or remove recipients, edit groeling at cartago in the file /mnt/netapp/Rosenthal/logs/z-mail-recipients

Use this information to reprocess whatever files failed because of some error in the digitizing process.

Weekly reports

Cartago also generates a weekly e-mail report that looks like this:


Summary Processing Report from the Rosenthal Pipeline on Hoffman2
August 7, 2021 to August 15, 2021 UTC
334 files completed processing
* 04396 02:04:04.10 1989-07-02_0000_US_NA023980_V1_VHS61_MB6_H43_JM
13908 02:03:50.47 1989-07-19_0000_US_NA024306_V15_VHS42_M25_H37_JM
* 00191 02:06:10.10 1989-07-22_0000_US_NA024323_V1_VHSP18_MB2_H24_AD
* 00000 02:05:44.14 1990-04-23_0000_US_NA028123_V9_VHSP15_MB13_E12_LA
* 01048 05:13:57.55 1990-08-11_0000_US_NA029114_V4_VHS40_MB16_H35_JM
* 01356 06:03:33.46 1990-09-27_0000_US_NA029592_V8_VHS48_MB9_H23_JM
* 00000 03:40:49.42 1990-10-04_0000_US_NA029658_V5_VHSP20_MB23_H33_JM Failed repair
* 00000 06:01:52.99 1990-10-05_0000_US_NA029660_V7_VHS41_MB30_H34_AD Failed repair
* 00000 02:08:51.86 1990-10-05_0000_US_NA029661_V8_VHSP21_MB57_H49_JM Failed repair
10335 02:04:27.34 1990-10-18_0000_US_NA029792_V14_VHSP25_MB22_H26_LA
* 00000 06:02:51.46 1990-11-06_0000_US_NA029950_V8_VHS50_MB24_H5_JM Failed repair
* 00000 06:03:03.60 1990-11-20_0000_US_NA030066_V5_VHS64_MB27_H32_JM Failed repair
* 04923 06:07:47.42 1991-01-24_0000_US_NA030593_V12_VHS50_MB24_H5_LA
* 21000 06:17:58.99 1991-01-28_0000_US_NA030669_V7_VHS52_MB4_H31_LA
* 01351 06:03:38.04 1991-01-30_0000_US_NA030726_V1_VHS38_M36_H20_JM
* 00696 02:03:59.81 1991-03-22_0000_US_NA031315_V10_VHSP16_MB11_H2_JM
* 04039 02:03:38.35 1991-04-26_0000_US_NA031595_V4_VHS60_MB28_E10_JM
* 00432 02:03:26.35 1991-05-28_0000_US_NA031868_V1_VHS65_MB26_H17_JM
14783 02:03:34.68 1991-06-19_0000_US_NA032072_V15_VHS51_MB8_H16_JM
09285 02:03:31.99 1991-08-11_0000_US_NA032551_V7_VHSP25_MB22_H26_JM
...
* 04465 02:03:24.10 1994-09-25_0000_US_NA047397_V0_VHSP16_MB11_H2_JM
* 07099 02:03:44.83 1994-09-26_0000_US_NA047398_V1_VHS61_MB6_H43_JM
* 00828 02:04:20.40 1994-09-26_0000_US_NA047399_V2_VHS60_MB28_E10_JM
* 00189 00:59:58.01 1994-09-27_0000_US_NA047400_V3_VHS51_MB8_H16_JM
08550 02:04:04.32 1994-09-27_0000_US_NA047402_V5_VHS48_MB9_H23_JM
14954 02:04:04.73 1994-09-27_0000_US_NA047403_V6_VHSP20_MB23_H33_JM
13702 02:04:10.80 1994-09-27_0000_US_NA047404_V7_VHS64_MB27_H32_JM
15461 02:03:39.96 1994-09-27_0000_US_NA047405_V8_VHS65_MB26_H17_JM
Totals 2021-08-07 to 2021-08-15:
334 files completed processing
2632949 words of closed captioning
2944687 seconds of video (817:58:07 hours)
322 files processed correctly
12 files failed repair
19 files failed to extract captions
Command: /usr/local/bin/Rosenthal-periodic-report 9 8 s m

Check inventory

You can also check the inventory directly. To navigate in the Rosenthal tree, use "day $FIL" -- for instance "day 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12" -- or "day $DATE" -- for instance "day 2006-07-04".

Each captured file should have these generated files:

groeling@cartago: $ day 2006-10-23 ; l 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12*

-rw-r--r-- 1 groeling staff 2893853 Feb 13 17:41 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.ccx.bin

-rw-r--r-- 1 groeling staff 549 Feb 21 04:13 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.cuts

-rw-r--r-- 1 groeling staff 2055491965 Feb 13 20:44 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.mp4

-rw-r--r-- 1 groeling staff 804538 Feb 23 07:50 2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN.txt3

The cuts file is not mandatory. See also Checking the cutpoint files for the fetch-cutfiles script, which copies in missing files.

Identify partial failures

The file cartago:/home/groeling/commands shows how to query for failed repairs and failed text extractions:

# List failed repairs

cd /mnt/netapp/NewsScape/Rosenthal/2006 && grep 'Repair failed' */*/*3

# List failed text extractions

cd /mnt/netapp/NewsScape/Rosenthal/2006 && grep 'Text extraction failed' */*/*3

Run these commands daily (or use the information from the daily report) and reprocess whatever files failed because of some error in the recording process.

Count completed files

On cartago you can also query how many files have been processed per day:

groeling@cartago:/mnt/netapp/NewsScape/Rosenthal/2006$ for i in {1..31} ; do echo -en "March $i:\t" ; l1 */*/*4 | grep -P "Mar\s*$i\ " -c ; done

March 1: 33

March 2: 19

March 3: 25

March 4: 10

March 5: 34

March 6: 54

March 7: 63

March 8: 35
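The same per-day count can be done without parsing ls output, assuming GNU find and GNU date are available (count_day is a made-up helper name and the date is illustrative):

```shell
count_day () {
  # $1 = YYYY-MM-DD; counts mp4 files last modified on that day
  local next
  next=$(date -d "$1 + 1 day" +%F)
  find . -name '*.mp4' -newermt "$1" ! -newermt "$next" | wc -l
}
# count_day 2016-03-07
```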

Creating cutpoint files

Each digitized file contains several news programs. The start of each program is identified by cut points, either before or after file compression.

Generating cutpoints before compression

In the digitizing lab, videos are digitized on 10-20 Macs and the files are exported to a file server, currently one of several iMacs with a RAID. Staff and students then create cutpoint files; thanks to a system Groeling devised, they can do this quickly, spending around five minutes per 8-hour file.

Hoffman2 picks up the files with cutpoints first. These sometimes have a final two-letter code that is different from the mpg file, which the scripts can handle. The pipeline then extracts the metadata and compresses the video and audio to mp4. After processing, the mp4 files are copied to the NetApp server mounted at ca:/mnt/netapp/NewsScape/Rosenthal, along with the extracted text and commercial segmentation ($FIL.txt3), the cutpoint files ($FIL.cuts), the srt file ($FIL.srt), and the metadata dump ($FIL.ccx.bin).

If no cutpoints file has been created from the mpg file, we will need to create the cut points for the mp4 file instead.

Generating cutpoints after compression

The compressed files can be copied in large numbers onto the WDMs, which students can take home and use to generate cutpoint files. The system designed for rapid splitting should work just as well on the mp4 files, and these files are of course smaller and more tractable.

Checking the cutpoint files

The cutpoint files are currently being created on the WDMs. If they are ready by the time Hoffman2 transfers the file, the cutpoint file will be copied to Hoffman2 and to the NetApp server. On the WDMs, the cutpoint files variably have the extension .txt or .cuts.txt; on Hoffman2 and NetApp they are given the extension .cuts. Again, note that the final two letters in the file name of the .cuts file may differ from that of the .mpg file.

In parallel, the script fetch-cutfiles runs once a day on groeling@cartago, copying any new cutpoint files into the Rosenthal tree; the logs are in /mnt/netapp/NewsScape/Rosenthal/logs. Note that cutpoint files already present in the tree do not get updated by the script -- the version on the tree is treated as master, as it may need to be reformatted before it is fed into the cut script. To trigger an update, just remove or change the extension of the .cuts file.

The cutpoint files may have format anomalies that require pre-processing. Provide feedback to Groeling and his staff if the format is inconsistent or unsuited to machine processing.

Splitting the files into news shows

The next task is to split the files into news shows according to the cutpoint files. We still don't have any experience doing this, so the method will need to be refined. Our assumption is that we will simply split the .mp4 files and the accompanying .txt3 files, but this section also explores some other possibilities worth attempting.

Splitting will take a steady hand and some manual checking, but the premise is that the cutpoint files will be correct and the process can be fully automated.

Selecting which file to split

In some cases, there is more than one version of the digitized file. This may happen because the first attempt resulted in a file that failed to repair, or lacked closed captions.

If a tape is assessed to be inherently problematic, the file is given the name $FIL_BE for "best effort" -- for instance

2006-08-09_0000_US_00001025_V4_MB9_VHSP4_H14_GG_BE

This is used to signal that it's not worthwhile trying to achieve better results.

If a second or third attempt is made to digitize a tape recording, they will be given a numerical extension, for instance

2006-06-13_0000_US_00000705_V10_MB7_VHS7_H11_CG_2

If there is more than one version of the file present, the version that has an accompanying $FIL.ccx.bin file should always be selected, since the presence of this file means that the original transport stream repaired successfully, and the file is thus almost certainly of better quality.

If there is no $FIL.ccx.bin, we should typically use the version that generated the most caption text, though it may be worth comparing the amount of non-words in the captions.
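The selection rule can be sketched as a small function: prefer any version with a .ccx.bin, otherwise take the version whose caption file has the most words. This is only a sketch; pick_version is a made-up name, the _2/_3 suffix handling follows the naming convention above, and wc -w is a rough measure that does not yet compare non-words:

```shell
pick_version () {
  local base=$1 best= best_words=-1 v words
  # First preference: a version whose transport stream repaired (.ccx.bin)
  for v in "$base" "$base"_2 "$base"_3 ; do
    [ -f "$v.ccx.bin" ] && { echo "$v" ; return ; }
  done
  # Otherwise: the version with the most caption text
  for v in "$base" "$base"_2 "$base"_3 ; do
    [ -f "$v.txt3" ] || continue
    words=$(wc -w < "$v.txt3")
    [ "$words" -gt "$best_words" ] && { best=$v ; best_words=$words ; }
  done
  echo "$best"
}
```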

Embedding text

The Hoffman2 pipeline creates .srt files where possible. This file may be usable for embedding the closed captioning into the mp4 file. It's possible this should be done before the files are split.
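One plausible way to do the embedding is to mux the .srt into the .mp4 as a soft subtitle track with ffmpeg, using mov_text (the mp4 subtitle codec). This is untested on these files, so the sketch only prints the command it would run, with an example base name:

```shell
# Construct an ffmpeg command that stream-copies audio/video and adds the
# srt as a mov_text subtitle track; printed rather than executed here
F=2006-10-23_0000_US_Archive_V2_MB10_VHS6_H12_JN   # example base name
cmd="ffmpeg -i $F.mp4 -i $F.srt -c copy -c:s mov_text $F.sub.mp4"
echo "$cmd"
```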

Splitting the metadata dump

An interesting question is whether the metadata dump file ($FIL.ccx.bin) can be split using the cutpoints file -- if so, you could just extract the text per show from the pieces. A relatively small number of files failed to repair, which also leads to a failed metadata dump, though interestingly normal text extraction and file compression typically still work. These files may get redigitized, and should receive the extension _2.mpg. If splitting the metadata dump files works, that would be a lot simpler than splitting the mpg files.

Splitting the mp4 files

We could just call /usr/local/bin/clip in the script that splits the mp4 files, or copy the commands. The clip script uses ffmpeg and should deliver frame-accurate clips. It can also split the txt3 files, though not the .srt files.
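If we copy the commands rather than call clip, the split loop might look like the sketch below. The two-column start/end cuts format is an assumption (the real .cuts format may differ), split_mp4 is a made-up name, and the ffmpeg commands are printed rather than executed:

```shell
split_mp4 () {
  # $1 = source mp4, $2 = cuts file with "start end" timestamps per line
  local src=$1 cuts=$2 n=0 start end
  while read -r start end ; do
    n=$((n+1))
    # Output-side seeking with stream copy; one show per segment
    echo "ffmpeg -i $src -ss $start -to $end -c copy ${src%.mp4}_show$n.mp4"
  done < "$cuts"
}
```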

Splitting the txt3 files

If splitting the metadata dump files doesn't work, we need to write a script that shifts the base time of the current .txt3 files. The script ca:/usr/local/bin/cc-update-timestamps is a good starting point and may only need tweaking. This is likely the simplest way forward. All times are currently local, so file names will need to be calculated for UTC. Groeling has a spreadsheet with file names and modal broadcast times.
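Whatever form the final script takes, the core operation is shifting each caption timestamp by a fixed offset. A minimal sketch with bash and GNU date follows; shift_ts is a made-up helper, and cc-update-timestamps itself may work differently (the fractional part is carried over unchanged):

```shell
shift_ts () {
  # $1 = timestamp like 20040903020018.078, $2 = offset in seconds
  local ts=$1 offset=$2 frac=${1#*.} epoch
  epoch=$(date -u -d "${ts:0:8} ${ts:8:2}:${ts:10:2}:${ts:12:2}" +%s)
  echo "$(date -u -d "@$((epoch + offset))" +%Y%m%d%H%M%S).$frac"
}
```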

Caption credit lines reliably mark show boundaries:

20040903020018.078|20040903020021.413|CC1| -- Captions by VITAC --

20040903020018.078|20040903020021.413|CC1| www.vitac.com

20060629063151.821|20060629063152.321|CC1|CAPTIONS BY VITAC

20060627033042.997|20060627033047.501|CC1| CAPTIONS PAID FOR BY ABC, INC.

20060627063145.147|20060627063145.948|CC1|CAPTIONS PAID FOR BY

20060627063146.082|20060627063150.419|CC1|NBC STUDIOS

20060627060109.415|20060627060111.983|CC1| CAPTIONS PAID FOR BY

20060627060109.415|20060627060111.983|CC1| PARAMOUNT DOMESTIC TELEVISION

There's typically a gap of 5-10 seconds before the first caption line of the next show appears:

20060623033040.494|20060623033044.297|CC1|WINNING.

20060623033044.365|20060623033048.769|CC1| CAPTIONS PAID FOR BY ABC, INC.

20060623033054.541|20060623033055.842|SEG_00|Type=Story start

--

20060623035807.739|20060623035811.310|CC1|AND GOOD NIGHT.

20060623035811.411|20060623035816.014|CC1| Captions by VITAC

20060623035954.881|20060623035958.083|CC1|y THIS IS "JEOPARDY!"

--

20060623043555.105|20060623043600.810|SEG_00|Type=Commercial

20060623043555.105|20060623043558.508|CC1| CAPTIONS PAID FOR BY ABC, INC.

20060623043600.810|20060623043602.579|SEG_00|Type=Story start

The script cc-keyword-spacing outputs the relative time and spacing between these show boundary markers. It accepts multiple search terms, for instance:

$ cc-keyword-spacing $F 'CAPTION|Caption'

Spacing of keywords in 2006-06-06/2006-06-06_0000_US_00000157_V11_M2_VHS10_H4_JK.txt3 (08:13:56):

CAPTION|Caption 00:00:20 00:00:20

CAPTION|Caption 00:59:44 00:59:24

CAPTION|Caption 02:00:42 01:00:58

CAPTION|Caption 02:59:39 00:58:57

CAPTION|Caption 04:59:35 01:59:56

CAPTION|Caption 06:00:38 01:01:03

CAPTION|Caption 06:59:56 00:59:18

CAPTION|Caption 07:59:32 00:59:36

That tells you the previous show ends 20 seconds into the recording, followed by a series of one-hour shows -- the second column gives the relative time and the third the duration of each show. Some shows lack the boundary marker -- at the end of the fifth hour there is just this:

20060606035847.746|20060606035848.814|CC1|MELODY ♪

20060606040136.248|20060606040138.550|SEG_00|Type=Story start

There's a tiny bit of logic added to handle premature caption credits, as they sometimes double up. I've seen one case (Oprah) where the caption credits appear a few seconds after the show has started, so that does sometimes happen.

With very minor modifications, the script will write a tag like this:

20060606035847.746|20060606035848.814|SEG|Type=Show boundary

The second timestamp in this case is not informative.
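As a sketch of that modification, an awk pass over a txt3 file could append the tag after each credit line. tag_boundaries is a hypothetical helper, and the credit patterns are only the ones shown in the examples above; real data will need more of them:

```shell
tag_boundaries () {
  # Print every line; after a caption-credit line, also emit a
  # SEG|Type=Show boundary line reusing that line's timestamps
  awk -F'|' '{ print }
    $4 ~ /Captions by VITAC|CAPTIONS BY VITAC|CAPTIONS PAID FOR BY/ {
      print $1 "|" $2 "|SEG|Type=Show boundary"
    }' "$1"
}
```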

Integrating with NewsScape

After splitting, the new files should be merged into the regular NewsScape tv trees on cartago and NetApp. Hoffman2 can then pick them up for on-screen text extraction, and the NLP pipeline can operate on the text files. We will need to program this separately, but it's a matter of giving some date parameters to existing routines.

For files from the fall of 2006, which accounts for most files to date, there may be duplicates already present in the digital collection, but the overlap is not likely to be very extensive. The caption text may at times be better in the digitized files.