CCExtractor
Introduction
CCExtractor is an open-source project, led by Carlos Fernandez, that collaborates closely with Red Hen. CCExtractor extracts closed captioning, teletext, and other metadata from television transport streams. See their project page at http://ccextractor.com.
Instructions
To use the new OCR capabilities, see Abhinav Shukla's GSoC2016 Report.
Installation
We should use the latest version, which is on GitHub:
https://github.com/CCExtractor/ccextractor
It typically has new features we want, and it is often not yet on SourceForge or http://ccextractor.com.
To download it, issue this command in Linux (or Mac):
wget https://github.com/CCExtractor/ccextractor/archive/master.zip
This command will download the software in a zipped (compressed) format in a file called master.zip. To unzip (decompress) the file, issue
unzip master.zip
The files will be unzipped into a directory (folder) called ccextractor-master. Rename it to the current version number (which keeps incrementing):
mv ccextractor-master ccextractor_0.84
Walk into the directory:
cd ccextractor_0.84
You'll see the file raspberrypi.md -- read it for the simple instructions to build ccextractor on Raspberry Pi devices. Typically, you'll need these packages:
sudo apt-get install libleptonica-dev libtesseract-dev libcurl4-gnutls-dev tesseract-ocr
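Before building, it can help to confirm that the command-line tools used in these steps are on the PATH. The following is a minimal sketch; the function name missing_tools is hypothetical and not part of CCExtractor.

```shell
# Sketch only: missing_tools is a hypothetical helper, not part of CCExtractor.
# Prints the name of every listed command that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v is the POSIX way to ask whether a command exists
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, missing_tools wget unzip tesseract prints nothing when all three are installed, and otherwise lists the ones still to be installed.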
You'll also see several subdirectories, including one called "linux" and one called "mac". Walk into the appropriate subdirectory:
cd linux
You'll see a file called "build". Run it like this:
./build
This compiles (builds) the CCExtractor program; it can take anywhere from a few seconds to a couple of minutes, depending on how fast your computer is.
The build command creates a file that's always called 'ccextractor'. Rename it to track which version you just built (0.78 is the example used from here on; substitute your own version number):
mv ccextractor ccextractor-0.78
Copy that file into your program directory:
sudo cp ccextractor-0.78 /usr/local/bin
Now test the new ccextractor version (e.g. ccextractor-0.78) for both previous and new functionality. Once you are satisfied, walk into your program directory and create a symbolic link to the new version:
cd /usr/local/bin
sudo ln -sf ccextractor-0.78 ccextractor
In the list of files (ls -l), you should see something like this:
lrwxrwxrwx 1 root staff 16 Oct 2 05:48 ccextractor -> ccextractor-0.78
-rwxr-xr-x 1 root staff 1687840 Oct 2 05:47 ccextractor-0.78
The program is now fully installed.
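The copy-and-link steps above can be sketched as one small function. This is only an illustration: the name install_versioned and its three arguments are assumptions, and it would have to run in a root shell to write to /usr/local/bin.

```shell
# Sketch only: install_versioned is a hypothetical helper.
# Usage: install_versioned BINARY VERSION DESTDIR
install_versioned() {
    bin=$1
    version=$2
    dest=$3
    # Keep each build under its own versioned name
    cp "$bin" "$dest/ccextractor-$version" || return 1
    # Point the plain name at the new version with a relative link,
    # matching the "ccextractor -> ccextractor-0.78" listing
    ( cd "$dest" && ln -sf "ccextractor-$version" ccextractor )
}
```

Because older versioned binaries stay in place, rolling back is a matter of re-pointing the link at a previous version.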
Locate the captions or teletext in a transport stream
ccextractor 0.65 and up -- please update as needed
Instructions for locating the closed captions or teletext in a transport stream:
1. Select a file (no extension) and a binary (plain teletext is usually fine, or we can use a bleeding-edge version). Set CX to one of the two alternatives shown:
F=2012-12-04_0358_US_WEWS_NewsChannel_5_at_11pm
CX=ccextractor-0.67-a08
CX=~/software/ccextractor-0.63-kai/mac/ccextractor
2. Look for teletext tracks -- so-called program numbers (leave out -pn <number>):
$CX -debug -ts -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
You'll get an output like "1 2 3 4 5 6 7 8" -- one number per line. These are candidate teletext tracks.
3. Use each program number candidate to check for a live teletext:
$CX -debug -ts -pn 1 -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
$CX -debug -ts -pn 2 -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
and so forth -- or use this system:
PN=1
$CX -debug -ts -pn $PN -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
and then change the value of PN each iteration until you've tested all the candidates.
The output will be verbose, but it won't show teletext unless you've found the right program number. In most teletext shows, the correct program number is the first one. With US closed captions, if you select the wrong program number, the output typically reveals the correct one, like this:
New PID found: 512 (MPEG-2 video), belongs to program: 2
New PID found: 515 (AC3 audio), belongs to program: 2
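The trial-and-error in this step can be sketched as a loop, together with a filter that pulls program numbers out of "belongs to program" lines like the two above. Both function names are hypothetical. Writing one output file per candidate is an assumption made here so that runs do not overwrite each other, and the filter assumes the debug lines reach standard output (add 2>&1 to the call if they do not).

```shell
# Sketch only: both helpers are hypothetical, not part of CCExtractor.

# Usage: try_program_numbers CX FILE PN...
# Runs the command from this step once per candidate program number.
try_program_numbers() {
    cx=$1
    f=$2
    shift 2
    for pn in "$@"; do
        echo "== program number $pn =="
        # Assumption: a separate output file per candidate ($f.ccx.$pn.out)
        "$cx" -debug -ts -pn "$pn" -noru -out=ttxt -utf8 -o "$f.ccx.$pn.out" "$f.mpg"
    done
}

# Reads debug output on stdin and prints each distinct program number
# mentioned in a "belongs to program: N" line.
list_programs() {
    sed -n 's/.*belongs to program: \([0-9][0-9]*\).*/\1/p' | sort -nu
}
```

A call such as try_program_numbers "$CX" "$F" 1 2 3 | list_programs would then print only the program numbers that the debug output reports.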
4. Once you've found the right program number, add it to the case statement in /usr/local/bin/cc-extract-cmm (or cc-extract-mmm etc.):
# Program ID
case $NWK in
WEWS ) PID=1 ;;
WKYC ) PID=1 ;;
WOIO ) PID=2 ;;
WUAB ) PID=7 ;;
We should always prefer programs that have closed captioning.
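The excerpt above shows only part of the case statement. A complete statement of the same shape might look like the sketch below. The function name program_number_for and the failure branch for unknown networks are assumptions; the four entries are the ones listed above.

```shell
# Sketch only: program_number_for is a hypothetical wrapper; the real
# scripts set a variable called PID inside their own case statement.
program_number_for() {
    case $1 in
        WEWS ) echo 1 ;;
        WKYC ) echo 1 ;;
        WOIO ) echo 2 ;;
        WUAB ) echo 7 ;;
        # Assumption: fail rather than guess for a network not yet tested
        * ) return 1 ;;
    esac
}
```

With this shape, PID=$(program_number_for "$NWK") sets the program number for a known network and leaves a non-zero exit status for an unknown one.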
Locate the DVB bitmap captions
$CX -debug -ts -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
Improvement for DVB bitmap captions against a semi-transparent textbox
Abhinav Shukla changed CCExtractor 0.85 to improve its recognition of such captions, which are used, for example, in the evening news on FR2.
FIL=2017-07-14_1800_FR_FR2_Journal_20h00
CX=ccextractor_0.85e
export TESSDATA_PREFIX=/usr/share/tesseract-ocr
$CX $FIL.mpg -datets -ttxt -UCLA -noru -utf8 -unixts 0 -delay 1500055200000
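The -delay value in this example, 1500055200000, equals the date and time in the file name (2017-07-14 18:00) read as UTC and expressed in milliseconds since the Unix epoch. Assuming that relationship is intended, the value could be derived from the file name as sketched below. The function name start_ms_from_name is hypothetical, and the -d option used here is specific to GNU date (it is not available in the stock macOS date).

```shell
# Sketch only: start_ms_from_name is a hypothetical helper.
# Assumes names of the form YYYY-MM-DD_HHMM_... with the time in UTC,
# and GNU date for the -d option.
start_ms_from_name() {
    name=$1
    day=${name%%_*}      # text before the first underscore: 2017-07-14
    rest=${name#*_}      # everything after the first underscore
    hhmm=${rest%%_*}     # next field: 1800
    hh=${hhmm%??}        # first two digits: 18
    mm=${hhmm#??}        # last two digits: 00
    secs=$(date -u -d "$day $hh:$mm:00" +%s) || return 1
    # Seconds to milliseconds by appending three zeros
    echo "${secs}000"
}
```

With the FIL value above, start_ms_from_name "$FIL" prints 1500055200000, which could then be passed to -delay.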