FileMaker Import system

Documentation of FileMaker data import pipeline

The system described below is an automated process that imports video-related data from the Cartago server into FileMaker, providing information about digitized videos to support the auditing process of the UCLA CommStudies lab (Prof. Groeling).

For information, contact Anna Bonazzi at annabonazzi@ucla.edu.

Background

The NewsScape archive lab digitizes numerous videotapes every day and sends the resulting videos to the Cartago server, where they are processed through the Rosenthal Pipeline (the process includes compression, quality check, and separation of closed captioning/subtitles). In Fall 2018, the lab started an audit process to conduct a final quality check on the videos digitized so far, before the videotapes are permanently stored with the UCLA Library. The auditors go through each videotape and check whether the digital videos extracted from it show any problems, errors, or quality issues that can potentially be improved with a new digitization attempt. Auditors work with a database of digitized videos in FileMaker and have access to data about each video’s format, quality, and the presence of problems either in the video itself or in the video’s filename.

FileMaker pipeline

The pipeline described below gathers specific data about the videos that are added daily to the server, sends the data to a folder accessible by FileMaker on the local CommStudies lab computer, and has FileMaker automatically import the data to provide updated information for the auditors.

Introduction

This section describes the scripts and files/folders required to run this process.

The main script used for this task is video_data_server.py. This script and its related files are currently stored on Cartago in /home/groeling/audit/.

The folder containing the script must also contain the following:

  1. subfolders: temp_logs and audit_logs. Both folders contain an extra backup copy of each new file output by the script. The file’s title is in the format 2019-06-17_all_videos_data.tsv (the date changes). Each of these files is a copy of the final file sent to FileMaker daily with data about the newly digitized videos.
  2. file: all_videos_yyyy-mm-dd.txt (for example all_videos_2018-10-25.txt). This file contains a list of all videos present on the server on a given date. The script uses this list to identify which new videos have been added to the server after a specific date (see the sketch after this list). The file must only be provided once, when the script is run for the first time; when the script is set to run automatically every day, it updates the list on its own. For the first run, the file can be copied from the backup folder described below, or it can be generated with the commands:
    • cd /mnt/netapp/Rosenthal
    • ls */*/*/*mp4 > /home/groeling/audit/all_videos_2019-01-01.txt (use the current date and the destination folder where the main script is located).
  3. subfolder: updated_all_videos. This folder contains backup copies of each all_videos_yyyy-mm-dd.txt file; each file gets progressively longer as new videos are added to the server every day. Any of these files can be used as a checkpoint to restart the script from scratch after errors or interruptions in the pipeline: by copying one of these files into the main script folder, the script will start collecting data for all videos added to the server after that date.
  4. script: bad_character_function.py. This is a dependency of the main script and contains a function to calculate the proportion of junk characters present in a video’s CC file, when available. The presence of junk characters indicates poor-quality subtitles and is a good indicator of the quality of the video itself.
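
The script uses the all_videos file as its checkpoint in roughly the following way: it lists the current contents of the Rosenthal tree, compares that listing with the most recent all_videos_yyyy-mm-dd.txt, and treats the difference as the new videos to process. The sketch below only illustrates this idea; the function and variable names (find_new_videos, ROSENTHAL_ROOT) are hypothetical and do not necessarily match the actual code in video_data_server.py.

    # Illustrative sketch of the new-video detection step, not the actual code.
    import glob
    import os
    from datetime import date

    ROSENTHAL_ROOT = "/mnt/netapp/Rosenthal"

    def find_new_videos(audit_dir, previous_list):
        """Compare the current server listing against the stored all_videos file."""
        with open(previous_list) as f:
            known = set(line.strip() for line in f if line.strip())
        # Same pattern as the manual command: ls */*/*/*mp4 inside the Rosenthal tree
        current = set(
            os.path.relpath(path, ROSENTHAL_ROOT)
            for path in glob.glob(os.path.join(ROSENTHAL_ROOT, "*/*/*/*mp4"))
        )
        new_videos = sorted(current - known)
        # Save an updated, dated list so the next run can use it as its baseline
        updated = os.path.join(audit_dir, "all_videos_%s.txt" % date.today().isoformat())
        with open(updated, "w") as f:
            f.write("\n".join(sorted(current)) + "\n")
        return new_videos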

The script requires one argument, i.e. the name of the main folder where the necessary subdirectories and files are stored. If the script is to be run from /home/groeling/audit/, the command to run it is python video_data_server.py /home/groeling/audit/.
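
For reference, a minimal way to consume that single argument in Python looks like this (illustrative only; the actual argument handling in video_data_server.py may differ):

    import os
    import sys

    # The one required argument is the folder containing the subfolders and files listed above.
    if len(sys.argv) != 2:
        sys.exit("usage: python video_data_server.py /path/to/audit/folder/")
    audit_dir = sys.argv[1]
    if not os.path.isdir(audit_dir):
        sys.exit("argument must be an existing folder, e.g. /home/groeling/audit/")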

This script is scheduled to run automatically every day at 9 am with cron. To view or edit the schedule (e.g. change the starting time), type the command crontab -e and scroll down to a line like the following:

00 09 * * * python /home/groeling/audit/video_data_server.py /home/groeling/audit/

The format of the above line means: "execute the python script located at this path (/home/groeling/audit/video_data_server.py) with the given argument (/home/groeling/audit/) at minute 00, hour 09, every day of the month, every month, every day of the week".

Steps

In summary, these are the main steps of the system:

    1. The video_data_server.py script on Cartago finds the videos that have been added to the server since the previous run.
    2. It matches these videos with information from the daily reports issued by the Rosenthal Pipeline, which processes videos sent in after their digitization. The reports indicate: a) whether a video “failed to repair”, that is, it had problems while being processed (this might indicate poor video quality); b) whether a video has no CC file (this might mean that the video did not have CC in the first place, or that it was impossible to extract CC from it during processing, again a possible quality problem).
    3. It gets the size and the processing date and time of each new video.
    4. It parses the filename of new videos to identify possible naming errors. These are the errors that the script looks for (a simplified sketch of these checks appears at the end of this section):
      • mismatch between the video recorder name and the videotape name (e.g. if a VHS recorder was associated with a Betamax tape or vice versa);
      • date errors (incorrect format or likely typos; the script cannot catch every possible mistake);
      • barcode errors (the barcode is not 8 digits, or is missing altogether);
      • misplaced filename components (e.g. the MacBook/computer code appears before the video recorder code when it should not);
      • errors in the numbering/naming of filename components like recorder, tape, encoder, computer;
      • bad filename altogether (multiple or unspecified naming errors; obligatory filename components are missing; the filename does not look as it should).
    5. It quantifies the proportion of junk characters in a video’s CC file to predict the quality of the video itself: values below 2% generally indicate a good video, while values above 10% usually indicate a bad one (see the sketch at the end of this section).
    6. It outputs a spreadsheet with the following data for each video: 1) Filename, 2) link to online player, 3) video size, 4) processing date, 5) processing time, 6) missing CC (optional), 7) CC problems (optional), 8) percentage of bad characters, 9) “failed repair” status (optional), 10) video duration, 11) word count (from CC), 12) pull date, 13) tape barcode, 14) V number (recorder schedule number), 15) videotape type, 16) computer code (MacBook), 17) encoder code, 18) person doing the digitization, 19) “Best Effort” status (optional), 20) description of errors (optional)
    7. It sends the spreadsheet to the local computer serving as FileMaker host in the CommStudies lab (Altair, unix name csna.sscnet.ucla.edu, IP 128.97.231.4). The output file is saved daily in several locations:
      • A copy named all_videos_data.tsv goes to a folder accessible to FileMaker (/Library/FileMaker Server/Data/Documents) on the local lab computer. This copy is not dated, so its name stays the same while its content is updated daily; this allows FileMaker to import the file through a fixed reference path. !! It is important for the file to be saved in a folder accessible to FileMaker: while FileMaker can usually import files from any location when the import is done manually, it seems unable to do so when the import is started automatically by a FileMaker Server schedule. In that case, the file to be imported must be inside the FileMaker environment.
      • A dated copy (e.g. 2019-05-10_all_videos_data.tsv) and an undated copy (all_videos_data.tsv) go to the lab computer, under a user account created specifically for this purpose, tna (/Users/tna/Documents/logs and /Users/tna/Documents respectively). This account was first created as a safe exchange location between the server and the lab computer: the public RSA key of groeling@cartago was added to user tna on the FileMaker machine, and user tna’s public key (/Users/tna/.ssh/id_rsa.pub) was added on Cartago. It was later found that FileMaker is unable to automatically import files from this location, so the account now remains only as a backup location. If necessary, it can be deleted.
      • Extra backup copies go to folders on Cartago: a dated copy goes among other logs on Netapp in the Rosenthal tree (/mnt/netapp/Rosenthal/audit_logs), and two copies (a dated and an undated one) go into a subfolder of the script environment (temp_logs/). These two can be deleted if redundant, but they currently do not occupy much space.
    8. Once the spreadsheet is on the lab computer, it is imported into FileMaker. A FileMaker script, import_updated_records, imports the data from the spreadsheet into FileMaker’s HoffmanReports table; the Import command on line 3 of the script reads all_videos_data.tsv from the FileMaker documents folder (/Library/FileMaker Server/Data/Documents) described above.
    9. A schedule was created on FileMaker Server (hosted on the same computer) to run the import_updated_records script once a day at 1:40 am. Other times are fine too, but they should be in the early morning so that the database does not get updated while auditors are working on it, and they should not clash with the time of scheduled backups. The schedule requires a FileMaker user name and password to run the script; the current setting is username Anna with Anna’s password. Any other FM account will work too, as long as it has full permission to edit the database and execute scripts (so, for example, the user tna that was created for the file exchange will not work, as it has limited access).
    10. If the FileMaker script does not run for some reason or the scheduler fails, an error message is emailed to groelinglab@gmail.com. This notification option was set up on FileMaker Server.
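
The exact filename convention behind the checks in step 4 is not spelled out in this document, so the following is only a simplified illustration of two of the rules mentioned above (an 8-digit barcode and a well-formed date); the function name, its parameters, and the yyyy-mm-dd date format are assumptions, not the actual implementation in video_data_server.py.

    import re
    from datetime import datetime

    def check_components(barcode, pull_date):
        """Return a list of human-readable naming errors (empty if none are found)."""
        errors = []
        # Barcode errors: missing altogether, or not exactly 8 digits
        if not barcode:
            errors.append("barcode missing")
        elif not re.fullmatch(r"\d{8}", barcode):
            errors.append("barcode is not 8 digits")
        # Date errors: wrong format or impossible date (not every typo can be caught)
        try:
            datetime.strptime(pull_date, "%Y-%m-%d")
        except (TypeError, ValueError):
            errors.append("date has an unexpected format")
        return errors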
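
Similarly, the junk-character measure in step 5 comes from bad_character_function.py, whose exact definition of a “junk” character is not reproduced here. As a rough sketch under the assumption that junk means anything outside the printable character set, the percentage could be computed like this:

    import string

    PRINTABLE = set(string.printable)

    def bad_character_percentage(cc_path):
        """Percentage of characters in a CC file that fall outside the printable set."""
        with open(cc_path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        if not text:
            return 0.0
        junk = sum(1 for ch in text if ch not in PRINTABLE)
        return 100.0 * junk / len(text)

    # Interpretation used by the auditors: below 2% is good, above 10% is probably a bad video.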