CCExtractor

Introduction

CCExtractor is an open-source project, led by Carlos Fernandez, that collaborates closely with Red Hen. CCExtractor extracts closed captioning, teletext, and other metadata from television transport streams. See their project page at http://ccextractor.com.

Instructions

To use the new OCR capabilities, see Abhinav Shukla's GSoC2016 Report.

Installation

We should use the latest version, which is on GitHub:

https://github.com/CCExtractor/ccextractor

The latest version is often not yet on SourceForge or http://ccextractor.com, and it typically has new features we want.

To download it, issue this command in Linux (or Mac):

wget https://github.com/CCExtractor/ccextractor/archive/master.zip

This command will download the software in a zipped (compressed) format in a file called master.zip. To unzip (decompress) the file, issue

unzip master.zip
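
A stock Mac does not ship with wget, so the download step may need curl instead. The following is a minimal sketch, not part of the CCExtractor instructions; the function name fetch_url is hypothetical, and it simply uses whichever of the two tools is installed.

```shell
# Download a URL to a file with whichever of wget or curl is available.
# fetch_url is a hypothetical helper name, not part of CCExtractor.
fetch_url() {
    local url="$1" out="$2"
    if command -v wget >/dev/null 2>&1; then
        wget -O "$out" "$url"
    elif command -v curl >/dev/null 2>&1; then
        # -L follows the redirect that GitHub issues for archive downloads
        curl -L -o "$out" "$url"
    else
        echo "fetch_url: neither wget nor curl found" >&2
        return 1
    fi
}

# Usage (commented out so that loading this sketch downloads nothing):
# fetch_url https://github.com/CCExtractor/ccextractor/archive/master.zip master.zip
# unzip master.zip
```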

The files will be unzipped into a directory (folder) called ccextractor-master. Rename it to the current version number (which keeps incrementing):

mv ccextractor-master ccextractor_0.84

Walk into the directory:

cd ccextractor_0.84

You'll see the file raspberrypi.md -- read it for the simple instructions to build ccextractor on a Raspberry Pi. Typically, you'll need these packages:

sudo apt-get install libleptonica-dev libtesseract-dev libcurl4-gnutls-dev tesseract-ocr

You'll also see several subdirectories, including one called "linux" and one called "mac". Walk into the appropriate subdirectory:

cd linux
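
If the same steps are scripted for both Linux and Mac machines, the subdirectory can be picked from the output of uname. This is only a sketch; the function name build_subdir is hypothetical, and it assumes the two subdirectory names mentioned above.

```shell
# Print the build subdirectory that matches the current operating system.
# build_subdir is a hypothetical helper name.
build_subdir() {
    case "$(uname -s)" in
        Linux)  echo linux ;;
        Darwin) echo mac ;;
        *)      echo "build_subdir: unsupported system: $(uname -s)" >&2
                return 1 ;;
    esac
}

# Usage (commented out so that loading this sketch changes nothing):
# cd "$(build_subdir)"
```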

You'll see a file called "build". Run it like this:

./build

This compiles (builds) the CCExtractor program; it can take anywhere from a few seconds to a couple of minutes, depending on how fast your computer is.

The build command creates a file that's always called 'ccextractor'. Rename it to track which version you just built (the examples here use 0.78; substitute the version you actually built):

mv ccextractor ccextractor-0.78

Copy that file into your program directory:

sudo cp ccextractor-0.78 /usr/local/bin

Now test the new ccextractor version (e.g. ccextractor-0.78) for both previous and new functionality. When you are satisfied, walk into your program directory and create a symbolic link to the new version:

cd /usr/local/bin

sudo ln -sf ccextractor-0.78 ccextractor

In the list of files (ls -l), you should see something like this:

lrwxrwxrwx 1 root staff 16 Oct 2 05:48 ccextractor -> ccextractor-0.78

-rwxr-xr-x 1 root staff 1687840 Oct 2 05:47 ccextractor-0.78

The program is now fully installed.
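
The copy-then-link routine above can be sketched as two small functions, so that testing can still happen between them. The names stage_version and activate_version are hypothetical. The sketch assumes you run it from a shell that can write to the target directory; for /usr/local/bin that usually means a root shell, because sudo cannot call a shell function directly.

```shell
# Copy a freshly built binary into a directory under a versioned name.
# Arguments: path to the built binary, version string, target directory.
stage_version() {
    local binary="$1" version="$2" bindir="$3"
    cp "$binary" "$bindir/ccextractor-$version"
}

# Point the plain 'ccextractor' name at an already staged version.
# The link target is relative, matching the ls -l listing above.
activate_version() {
    local version="$1" bindir="$2"
    if [ ! -f "$bindir/ccextractor-$version" ]; then
        echo "activate_version: ccextractor-$version is not staged" >&2
        return 1
    fi
    ( cd "$bindir" && ln -sf "ccextractor-$version" ccextractor )
}

# Usage (commented out so that loading this sketch changes nothing):
# stage_version ./ccextractor 0.78 /usr/local/bin
# ... test ccextractor-0.78 here ...
# activate_version 0.78 /usr/local/bin
```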

Locate the captions or teletext in a transport stream

ccextractor 0.65 and up -- please update as needed

Instructions for locating the closed captions or teletext in a transport stream:

1. Select a file (no extension) and a binary (plain teletext is usually fine, or we can use a bleeding-edge version). Set CX to one of the two binaries shown:

F=2012-12-04_0358_US_WEWS_NewsChannel_5_at_11pm

CX=ccextractor-0.67-a08
CX=~/software/ccextractor-0.63-kai/mac/ccextractor

2. Look for teletext tracks -- so-called program numbers (leave out -pn <number>):

$CX -debug -ts -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg

You'll get an output like "1 2 3 4 5 6 7 8" -- one number per line. These are candidate teletext tracks.

3. Use each program number candidate to check for a live teletext:

$CX -debug -ts -pn 1 -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
$CX -debug -ts -pn 2 -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg

and so forth -- or use this system:

PN=1
$CX -debug -ts -pn $PN -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg

and then change the value of PN each iteration until you've tested all the candidates.
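
The manual iteration above can also be written as a loop. This is a rough sketch under stated assumptions: the function name probe_program_numbers and the per-candidate output and log file names are hypothetical, and it assumes that a wrong program number leaves the output file empty or missing, which you should confirm against the verbose log. It uses the CX and F variables set earlier.

```shell
# Run the binary in $CX once per candidate program number and print the
# numbers that produced a non-empty output file.
probe_program_numbers() {
    local pn out
    for pn in "$@"; do
        out="$F.pn$pn.ccx.out"    # hypothetical per-candidate output name
        "$CX" -debug -ts -pn "$pn" -noru -out=ttxt -utf8 -o "$out" "$F.mpg" \
            > "$F.pn$pn.log" 2>&1
        # Assumption: no teletext found means an empty or missing file.
        if [ -s "$out" ]; then
            echo "$pn"
        fi
    done
}

# Usage (commented out so that loading this sketch runs nothing):
# probe_program_numbers 1 2 3 4 5 6 7 8
```

The verbose output of each run lands in its own log file, so it can still be inspected by hand afterwards.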

The output will be verbose, but it won't show teletext unless you've found the right program number. In most teletext shows, the correct program number is the first one. With US closed captions, the debug output typically names the correct program number even when you select the wrong one, like this:

New PID found: 512 (MPEG-2 video), belongs to program: 2
New PID found: 515 (AC3 audio), belongs to program: 2

4. Once you've found the right program number, add it to the case statement in /usr/local/bin/cc-extract-cmm (or cc-extract-mmm etc.):

# Program ID
case $NWK in
 WEWS ) PID=1 ;;
 WKYC ) PID=1 ;;
 WOIO ) PID=2 ;;
 WUAB ) PID=7 ;;
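
For experimenting outside the production script, the same lookup can be written as a self-contained function. This is an illustration only and not the code of cc-extract-cmm; the function name program_number_for is hypothetical, and the table holds just the four entries shown above.

```shell
# Print the program number recorded for a network code, or fail if the
# network has not been added yet.
program_number_for() {
    case "$1" in
        WEWS) echo 1 ;;
        WKYC) echo 1 ;;
        WOIO) echo 2 ;;
        WUAB) echo 7 ;;
        *)    echo "program_number_for: no entry for $1" >&2
              return 1 ;;
    esac
}

# Usage (commented out so that loading this sketch prints nothing):
# PID=$(program_number_for "$NWK")
```

Failing loudly for an unknown network is a design choice of this sketch: it keeps a missing entry from silently falling back to a wrong program number.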

We should always prefer programs that have closed captioning.

Locate the DVB bitmap captions

$CX -debug -ts -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg

Improvement for DVB bitmap captions against a semi-transparent textbox

Abhinav Shukla changed CCExtractor 0.85 to improve its recognition of such captions, which are used in, e.g., the evening news on FR2.

FIL=2017-07-14_1800_FR_FR2_Journal_20h00
CX=ccextractor_0.85e

export TESSDATA_PREFIX=/usr/share/tesseract-ocr
$CX $FIL.mpg -datets -ttxt -UCLA -noru -utf8 -unixts 0 -delay 1500055200000
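
The -delay value in this example equals the date and time in the file name (2017-07-14 18:00), read as UTC and expressed in milliseconds since the Unix epoch. Assuming that is how the value was chosen, it can be derived from the file name instead of being typed by hand. The sketch below relies on GNU date (its -d option is not available in the stock Mac date), and the function name delay_from_name is hypothetical.

```shell
# Turn a file name that starts with YYYY-MM-DD_HHMM into milliseconds
# since the Unix epoch, treating the time as UTC. Requires GNU date.
delay_from_name() {
    local name="$1" day rest hhmm secs
    day=${name%%_*}       # e.g. 2017-07-14
    rest=${name#*_}
    hhmm=${rest%%_*}      # e.g. 1800
    secs=$(date -u -d "$day ${hhmm%??}:${hhmm#??}" +%s) || return 1
    echo "${secs}000"
}

# Usage (commented out so that loading this sketch runs nothing):
# $CX $FIL.mpg -datets -ttxt -UCLA -noru -utf8 -unixts 0 -delay "$(delay_from_name "$FIL")"
```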