Video processing pipelines

Introduction

There are currently two video processing pipelines on Hoffman2: one to process incoming files and one to process files from the digitizing project (the Rosenthal pipeline). They both compress the files; the first also extracts the on-screen text using OCR. This scroll provides the development perspective; for the monitoring perspective, see How to monitor the Rosenthal pipeline on Hoffman2.

Compression tools

(Peter Uhrig, 14 Jan 2016)

Audio compression

I have taken a look at the AAC Encoding guide:

https://trac.ffmpeg.org/wiki/Encode/AAC

It turns out that the latest version of ffmpeg has a default built-in AAC encoder that is likely better than libfdk_aac. This seems to be a very recent development; the guide notes:

Note: -strict experimental was previously required for this encoder, but is unneeded since December 2015.

Results are usually as good or better than libfdk_aac at 128kbps but will occasionally sound worse below 96kbps.

Since we are AT 96kbps and not below, the built-in encoder may be adequate. However, libfdk_aac is still described as the best AAC encoder.
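To compare the two encoders on the same material, the audio part of the command line can be generated for either one. The following is only a sketch: the helper name `aac_audio_args` is made up for illustration, the 96 kbps default comes from the paragraph above, and the channel count and sample rate are taken from the ffmpeg command in the video compression section. Older ffmpeg builds (before the December 2015 change quoted above) would additionally need `-strict experimental` for the built-in encoder.

```python
def aac_audio_args(encoder="aac", bitrate_kbps=96, channels=2, sample_rate=44100):
    """Build the audio part of an ffmpeg argument list (hypothetical helper).

    encoder is either "aac" (ffmpeg's built-in encoder) or "libfdk_aac"
    (only available in builds configured with --enable-libfdk-aac).
    """
    if encoder not in ("aac", "libfdk_aac"):
        raise ValueError(f"unsupported AAC encoder: {encoder}")
    return [
        "-c:a", encoder,               # which AAC encoder to use
        "-b:a", f"{bitrate_kbps}k",    # target audio bitrate
        "-ac", str(channels),          # number of audio channels
        "-ar", str(sample_rate),       # audio sample rate in Hz
    ]


# Print both variants so they can be pasted into otherwise identical commands.
for name in ("aac", "libfdk_aac"):
    print(name, "->", " ".join(aac_audio_args(name)))
```

Encoding the same clip once with each variant and listening to the results would be the simplest way to check whether the built-in encoder is adequate at our bitrate.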

Video compression

I read a bit more about one-pass with Constant Rate Factor (CRF) vs two-pass. The basic point is that if the file size is the same, the quality of the two methods will be pretty much the same.

So if we use two-pass, we can target a bitrate and thus keep the file size per minute constant. If the video is very easy to compress, this will lead to higher quality; if it is harder to compress, to lower quality. The method is particularly suitable if you want to fit a video onto exactly one DVD or CD. With two-pass you can make sure that the resulting mp4 is exactly 4.7 GB or 700 MB and you get the best possible quality for that size. Thus you do not waste space on the CD/DVD, which would be lost anyway.
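The arithmetic behind "compress to exactly one CD" is simple enough to sketch. The function name `two_pass_video_kbps` is made up, the 96 kbps audio default is the figure from the audio section, and container overhead is ignored, so this is an approximation rather than an exact recipe. The duration used in the example is that of the sample file whose ffmpeg output appears later in this scroll.

```python
def two_pass_video_kbps(target_bytes, duration_s, audio_kbps=96.0):
    """Video bitrate in kbit/s that fills target_bytes over duration_s.

    Ignores container overhead, so real files come out slightly larger.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    total_kbps = target_bytes * 8 / 1000 / duration_s   # bits -> kbit/s
    video_kbps = total_kbps - audio_kbps                 # what is left for video
    if video_kbps <= 0:
        raise ValueError("target size is too small for the audio track alone")
    return video_kbps


duration = 30 * 60 + 55.19        # 00:30:55.19, in seconds
cd_bytes = 700 * 1000 * 1000      # assumption: "700 MB" read as decimal megabytes
print(f"{two_pass_video_kbps(cd_bytes, duration):.0f} kbit/s for the video stream")
```

The resulting number is what one would pass as the video bitrate in both passes of a two-pass encode.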

If we choose a CRF, we target a certain quality level and keep that constant. If the file is harder to compress, this will lead to a larger file; if it is easier to compress, to a smaller file. This means we will not ruin the quality of a video that is particularly difficult to compress, or spend bits on unnecessarily good quality for a video that is easy to compress, both of which may happen with the fixed bitrate of a two-pass encoding. The only trouble is that file size will vary, but I do not see an actual issue here. As long as we choose a CRF that results in the same AVERAGE file size as our current choice of bitrate, we should get more stable quality without extra demands on storage. Also, since one-pass CRF is faster than two-pass (but apparently not by a factor of two, if I read that correctly), one could choose a slower preset and get better image quality at the same size and in the same encoding time.
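One way to check the "same AVERAGE file size" condition is to encode a batch of test files with a candidate CRF and compare their overall bitrate with the fixed bitrate. A minimal sketch, with made-up function names and made-up sample numbers (the source does not state our current bitrate, so `fixed_kbps` is a placeholder):

```python
def average_kbps(files):
    """Overall bitrate in kbit/s of (size_bytes, duration_s) pairs.

    Total bits are divided by total seconds, so long files weigh more than
    short ones, which is what matters for storage.
    """
    files = list(files)
    total_bits = sum(size_bytes * 8 for size_bytes, _ in files)
    total_seconds = sum(duration_s for _, duration_s in files)
    if total_seconds <= 0:
        raise ValueError("total duration must be positive")
    return total_bits / total_seconds / 1000


def storage_ratio(files, fixed_kbps):
    """Storage used by the CRF encodes relative to a fixed-bitrate encode."""
    return average_kbps(files) / fixed_kbps


# Hypothetical CRF test encodes: (file size in bytes, duration in seconds).
crf_files = [(110_000_000, 1855.19), (95_000_000, 1800.0), (260_000_000, 3600.0)]
fixed_kbps = 500.0   # placeholder, not our actual setting
print(f"CRF batch uses {storage_ratio(crf_files, fixed_kbps):.2f}x the fixed-bitrate storage")
```

A ratio above 1 suggests raising the CRF value, a ratio below 1 leaves room to lower it.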

Given all that, I recommend we follow the advice from the encoding guide (https://trac.ffmpeg.org/wiki/Encode/H.264) and use a CRF. Probably, a CRF around 25 or so (higher numbers give lower bitrates) would give us a good file size in conjunction with the “veryslow” preset:

ffmpeg -analyzeduration 2G -probesize 2G -i $FIL.mpg -y $FFMAP -c:v libx264 -r 29.97 -preset:v veryslow -crf 25 -vf scale="640:trunc(ow/a/2)*2" -c:a libfdk_aac -vbr 3 -ac 2 -ar 44100 $FIL.mp4

ffmpeg version 2.6.5 Copyright (c) 2000-2015 the FFmpeg developers
  built with gcc 4.9.2 (Debian 4.9.2-10)
  configuration: --prefix=/usr --extra-cflags='-g -O2 -fstack-protector-strong -Wformat -Werror=format-security ' --extra-ldflags='-Wl,-z,relro' --cc='ccache cc' --enable-shared --enable-libmp3lame --enable-gpl --enable-nonfree --enable-libvorbis --enable-pthreads --enable-libfaac --enable-libxvid --enable-postproc --enable-x11grab --enable-libgsm --enable-libtheora --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libx264 --enable-libspeex --enable-nonfree --disable-stripping --enable-libvpx --enable-libschroedinger --disable-encoder=libschroedinger --enable-version3 --enable-libopenjpeg --enable-librtmp --enable-avfilter --enable-libfreetype --enable-libvo-aacenc --disable-decoder=amrnb --enable-libvo-amrwbenc --enable-libaacplus --libdir=/usr/lib/x86_64-linux-gnu --disable-vda --enable-libbluray --enable-libcdio --enable-gnutls --enable-frei0r --enable-openssl --enable-libass --enable-libopus --enable-fontconfig --enable-libpulse --disable-mips32r2 --disable-mipsdspr1 --disable-mipsdspr2 --enable-libvidstab --enable-libzvbi --enable-avresample --disable-htmlpages --disable-podpages --enable-libutvideo --enable-libfdk-aac --enable-libx265 --enable-libiec61883 --enable-vaapi --enable-libdc1394 --disable-altivec --shlibdir=/usr/lib/x86_64-linux-gnu
  libavutil      54. 20.100 / 54. 20.100
  libavcodec     56. 26.100 / 56. 26.100
  libavformat    56. 25.101 / 56. 25.101
  libavdevice    56. 4.100 / 56. 4.100
  libavfilter    5. 11.102 / 5. 11.102
  libavresample  2. 1. 0 / 2. 1. 0
  libswscale     3. 1.101 / 3. 1.101
  libswresample  1. 1.100 / 1. 1.100
  libpostproc    53. 3.100 / 53. 3.100
Input #0, mpegts, from '2016-03-31_0300_US_ComedyCentral_Daily_Show.mpg':
  Duration: 00:30:55.19, start: 61751.464967, bitrate: 3360 kb/s
  Program 2 
    Stream #0:0[0x200]: Video: mpeg2video (Main) ([2][0][0][0] / 0x0002), yuv420p(tv), 720x480 [SAR 8:9 DAR 4:3], 2500 kb/s, 29.97 fps, 29.97 tbr, 90k tbn, 59.94 tbc
    Stream #0:1[0x203]: Audio: ac3 ([129][0][0][0] / 0x0081), 48000 Hz, stereo, fltp, 320 kb/s
[libx264 @ 0x1017b80] using SAR=71/80
[libx264 @ 0x1017b80] using cpu capabilities: MMX2 SSE2Fast LZCNT
[libx264 @ 0x1017b80] profile High, level 3.1
[libx264 @ 0x1017b80] 264 - core 146 - H.264/MPEG-4 AVC codec - Copyleft 2003-2015 - http://www.videolan.org/x264.html - options: cabac=1 ref=16 deblock=1:0:0 analyse=0x3:0x133 me=umh subme=10 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=24 chroma_me=1 trellis=2 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=6 lookahead_threads=1 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=8 b_pyramid=2 b_adapt=2 b_bias=0 direct=3 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=25 scenecut=40 intra_refresh=0 rc_lookahead=60 rc=crf mbtree=1 crf=25.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
[libfdk_aac @ 0x1018f60] Note, the VBR setting is unsupported and only works with some parameter combinations
Output #0, mp4, to '2016-03-31_0300_US_ComedyCentral_Daily_Show.mp4':
  Metadata:
    encoder         : Lavf56.25.101
    Stream #0:0: Video: h264 (libx264) ([33][0][0][0] / 0x0021), yuv420p, 640x426 [SAR 71:80 DAR 4:3], q=-1--1, 29.97 fps, 11988 tbn, 29.97 tbc
    Metadata:
      encoder         : Lavc56.26.100 libx264
    Stream #0:1: Audio: aac (libfdk_aac) ([64][0][0][0] / 0x0040), 44100 Hz, stereo, s16
    Metadata:
      encoder         : Lavc56.26.100 libfdk_aac
Stream mapping:
  Stream #0:0 -> #0:0 (mpeg2video (native) -> h264 (libx264))
  Stream #0:1 -> #0:1 (ac3 (native) -> aac (libfdk_aac))
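
The output size and sample aspect ratio in this log follow directly from the scale expression in the command. The sketch below redoes that arithmetic in Python. The function names are made up, and it relies on `a` in the scale filter being the input width divided by the input height, as described in the ffmpeg filter documentation.

```python
import math
from fractions import Fraction


def scaled_height(in_w, in_h, out_w=640):
    """Evaluate trunc(ow/a/2)*2, where a = iw/ih and ow is the output width."""
    a = in_w / in_h
    return math.trunc(out_w / a / 2) * 2   # round down to an even height


def output_sar(dar, out_w, out_h):
    """Sample aspect ratio needed to keep display aspect ratio dar."""
    return dar / Fraction(out_w, out_h)


height = scaled_height(720, 480)                  # input is 720x480
sar = output_sar(Fraction(4, 3), 640, height)     # input DAR is 4:3
print(f"640x{height}, SAR {sar.numerator}:{sar.denominator}")
```

For the 720x480 input with DAR 4:3 this reproduces the 640x426 frame size and the SAR of 71:80 that libx264 reports above. The height is forced to an even number because libx264 with yuv420p requires even dimensions.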

This is still experimental; so far we've used a constant bitrate encoding. The preliminary verdict is that this method gives us files that have the target quality and a file size that is 25% smaller, but it is too slow: it encodes at 30 frames a second or worse, which for 29.97 fps source material is real time at best.

On-screen text extraction

Our main video processing pipeline on Hoffman2 extracts the on-screen text from images at one-second intervals, in several languages: English, Spanish, French, German, Italian, Danish, Norwegian, and Swedish.

We just activated on-screen text OCR for Portuguese. This is the first-stage output from tesseract, with added screen location information:

PRESlDENTE 247 585 25 151
Cavaco 73 632 28 74 Silva 156 632 28 50 diz 215 632 28 28 que 250 639 28 36 não 295 632 28 35 vai 338 632 28 29 abdicar 375 632 28 78 dos 461 632 28 35 puderes 504 632 35 83
constitucionais 73 670 28 162
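
To work with this first-stage output, each line can be split into words with their boxes. The sketch below assumes that every word is followed by four integers in the order x, y, height, width. That order is not documented here; it is inferred from the sample (each word's x plus its last number lands just before the next word's x). The function names are made up.

```python
def parse_first_stage(line):
    """Split 'word x y h w word x y h w ...' into a list of word dicts.

    Assumes words contain no whitespace and the field order is x, y, h, w.
    """
    tokens = line.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected groups of five fields per word")
    words = []
    for i in range(0, len(tokens), 5):
        x, y, h, w = (int(t) for t in tokens[i + 1:i + 5])
        words.append({"text": tokens[i], "x": x, "y": y, "w": w, "h": h})
    return words


def line_box(words):
    """Bounding box (x, y, w, h) that encloses all words on a line."""
    left = min(w["x"] for w in words)
    top = min(w["y"] for w in words)
    right = max(w["x"] + w["w"] for w in words)
    bottom = max(w["y"] + w["h"] for w in words)
    return left, top, right - left, bottom - top


sample = ("Cavaco 73 632 28 74 Silva 156 632 28 50 diz 215 632 28 28 "
          "que 250 639 28 36 não 295 632 28 35 vai 338 632 28 29 "
          "abdicar 375 632 28 78 dos 461 632 28 35 puderes 504 632 35 83")
words = parse_first_stage(sample)
print(" ".join(w["text"] for w in words), line_box(words))
```

The box computed this way is close to, but not identical with, the one in the final output, so the later stages evidently do more than take the union of the word boxes.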

Here's the final interpretation of this image from 2015-11-26_2000_PT_RTP-1_Telejornal.ocr:

Portuguese OCR

20151126200001.000|20151126200026.999|OCR1|000002|115 51 113 21|RTP 1525

20151126200001.001|20151126200234.999|OCR1|000002|76 585 322 24|OS AVISOS DO PRESIDENTE

20151126200001.002|20151126200142.999|OCR1|000002|72 629 515 34|Cavaco Silva diz que nÃo vai abdicar dos poder poderes

20151126200001.003|20151126200142.999|OCR1|000002|73 670 162 28|constitucionais

20151126200003.000|20151126200005.999|OCR1|000004|718 267 125 81|XII

20151126200003.001|20151126200006.999|OCR1|000004|712 360 223 56|GOVERNO

This is correct, with the exception of the capital Ã, which should have been lowercase. Even the blue "XXI Governo" on a low-contrast background is captured.
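
For computational use, the records in the .ocr file can be parsed along these lines. This is a sketch based only on the sample above: the fields are taken to be start time, end time, tag, a six-digit number whose meaning is not stated here (called `seq` below), the box, and the text. The box order x, y, width, height is an assumption inferred from comparing the sample with the first-stage output. All names are made up.

```python
from datetime import datetime
from typing import NamedTuple, Tuple

TIME_FORMAT = "%Y%m%d%H%M%S.%f"   # e.g. 20151126200001.002


class OcrRecord(NamedTuple):
    start: datetime
    end: datetime
    tag: str
    seq: str                       # meaning not documented in this scroll
    box: Tuple[int, int, int, int]  # assumed order: x, y, width, height
    text: str


def parse_ocr_record(line):
    """Parse one 'start|end|tag|seq|x y w h|text' record."""
    # Split on the first five bars only, so a bar in the text is preserved.
    start, end, tag, seq, box, text = line.rstrip("\n").split("|", 5)
    x, y, w, h = (int(v) for v in box.split())
    return OcrRecord(
        datetime.strptime(start, TIME_FORMAT),
        datetime.strptime(end, TIME_FORMAT),
        tag, seq, (x, y, w, h), text,
    )


record = parse_ocr_record(
    "20151126200003.001|20151126200006.999|OCR1|000004|712 360 223 56|GOVERNO"
)
print(record.text, record.box, (record.end - record.start).total_seconds())
```

The difference between end and start gives how long a piece of text stayed on screen.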

Here is the teletext:

20151126200001.280|20151126200004.040|CC1|recordou que o único poder que já não tem é o de

20151126200004.080|20151126200007.120|CC1|dissolver a assembleia, chefe de Estado fez questão ainda

20151126200007.160|20151126200010.120|CC1|sublinhar que as dúvidas em relação a

20151126200010.160|20151126200013.360|CC1|durabilidade novo Governo. Ainda não foram dissipadas.

20151126200014.600|20151126200017.760|CC1|Trouxe a promessa que todos os presidentes fazem aos governos

20151126200017.800|20151126200021.000|CC1|que acabam de chegar. Lealdade

20151126200021.040|20151126200024.120|CC1|institucional, mas trouxe também

20151126200024.160|20151126200027.240|CC1|um aviso. Não abdica.
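
The caption lines break sentences wherever the display line ends, so for reading or searching it helps to join them into running text. A minimal sketch, assuming the four-field layout seen above (start, end, tag, text); the function name is made up:

```python
def caption_text(lines, tag="CC1"):
    """Join the text of all caption lines with the given tag into one string."""
    parts = []
    for line in lines:
        fields = line.strip().split("|", 3)   # start, end, tag, text
        if len(fields) == 4 and fields[2] == tag:
            parts.append(fields[3].strip())
    return " ".join(parts)


captions = [
    "20151126200017.800|20151126200021.000|CC1|que acabam de chegar. Lealdade",
    "20151126200021.040|20151126200024.120|CC1|institucional, mas trouxe também",
    "20151126200024.160|20151126200027.240|CC1|um aviso. Não abdica.",
]
print(caption_text(captions))
```

Lines with a different tag or too few fields are skipped rather than treated as errors.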

The on-screen text provides a valuable complement to the captions. It can be used to search for specific content, as is done by the Edge2 search engine, and it can be used computationally, for instance to determine topic boundaries.
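
As an illustration of how the two sources complement each other, the sketch below searches the on-screen text for a term and then looks up the captions that were spoken while that text was visible. It is not how Edge2 works; it only uses the line formats shown above and made-up function names. It relies on the timestamps being fixed-width, so that comparing them as strings gives chronological order, and it assumes the text field contains no bar character.

```python
def fields(line):
    """Return (start, end, tag, text) for an OCR or caption line."""
    parts = line.rstrip("\n").split("|")
    return parts[0], parts[1], parts[2], parts[-1]


def search(lines, term):
    """Lines whose text contains term, ignoring case."""
    needle = term.casefold()
    return [line for line in lines if needle in fields(line)[3].casefold()]


def overlapping(hit, lines):
    """Lines whose time span overlaps the time span of hit."""
    hit_start, hit_end, _, _ = fields(hit)
    result = []
    for line in lines:
        start, end, _, _ = fields(line)
        if start <= hit_end and end >= hit_start:
            result.append(line)
    return result


ocr_lines = [
    "20151126200001.001|20151126200234.999|OCR1|000002|76 585 322 24|OS AVISOS DO PRESIDENTE",
    "20151126200003.001|20151126200006.999|OCR1|000004|712 360 223 56|GOVERNO",
]
caption_lines = [
    "20151126200001.280|20151126200004.040|CC1|recordou que o único poder que já não tem é o de",
    "20151126200004.080|20151126200007.120|CC1|dissolver a assembleia, chefe de Estado fez questão ainda",
    "20151126200007.160|20151126200010.120|CC1|sublinhar que as dúvidas em relação a",
]

for hit in search(ocr_lines, "governo"):
    print(fields(hit)[3])
    for caption in overlapping(hit, caption_lines):
        print("   ", fields(caption)[3])
```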