CCExtractor

Introduction

CCExtractor is an open-source project, led by Carlos Fernandez, that collaborates closely with Red Hen. CCExtractor extracts closed captioning, teletext, and other metadata from television transport streams. See their project page at http://ccextractor.com.

Instructions

To use the new OCR capabilities, see Abhinav Shukla's GSoC2016 Report.

Installation

We should use the latest version, which is on GitHub:

       https://github.com/CCExtractor/ccextractor

The GitHub version is often not yet available on SourceForge or http://ccextractor.com, and it typically has new features we want.

To download it, issue this command in Linux (or Mac):

    wget https://github.com/CCExtractor/ccextractor/archive/master.zip

This command will download the software in a zipped (compressed) format in a file called master.zip. To unzip (decompress) the file, issue

    unzip master.zip

The files will be unzipped into a directory (folder) called ccextractor-master. Rename it to the current version number (which keeps incrementing):

    mv ccextractor-master ccextractor_0.84

Walk into the directory:

    cd ccextractor_0.84

You'll see the file raspberrypi.md; read it for simple instructions on building ccextractor for Raspberry Pi devices. Typically, you'll need these packages:

    sudo apt-get install libleptonica-dev libtesseract-dev libcurl4-gnutls-dev tesseract-ocr

You'll also see several subdirectories, including one called "linux" and one called "mac". Walk into the appropriate subdirectory:

    cd linux

You'll see a file called "build". Run it like this:

    ./build

This compiles (builds) the CCExtractor program; it can take anywhere from a few seconds to a couple of minutes, depending on how fast your computer is.
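The choice of build subdirectory described above can be sketched as a small shell function. This is only an illustration: build_subdir is a hypothetical helper name, and it assumes that `uname -s` prints "Linux" on Linux and "Darwin" on a Mac.

```shell
# Hypothetical helper: map a kernel name (as printed by `uname -s`)
# to the CCExtractor build subdirectory named in the text above.
# With no argument, it asks uname for the current system.
build_subdir() {
    case "${1:-$(uname -s)}" in
        Linux)  echo linux ;;
        Darwin) echo mac ;;
        *)      echo "unsupported system: ${1:-unknown}" >&2; return 1 ;;
    esac
}
```

From the top of the source tree, something like `cd "$(build_subdir)" && ./build` would then walk into the matching subdirectory and start the build.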

The build command creates a file that's always called 'ccextractor'. Rename it to track which version you just built (the examples below use 0.78; substitute your own version number):

    mv ccextractor ccextractor-0.78

Copy that file into your program directory:

    sudo cp ccextractor-0.78 /usr/local/bin
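The rename and copy steps above can be folded into one function. This is a minimal sketch: install_versioned is a hypothetical name, and the destination defaults to /usr/local/bin as in the text. The function does not call sudo itself, so writing to a system directory would still need elevated permissions.

```shell
# Hypothetical helper: copy a freshly built binary into a program
# directory under a version-suffixed name, e.g. ccextractor-0.78.
# Prints the path of the installed file on success.
install_versioned() {
    binary=$1
    version=$2
    destdir=${3:-/usr/local/bin}
    [ -f "$binary" ] || { echo "no such file: $binary" >&2; return 1; }
    target="$destdir/$(basename "$binary")-$version"
    cp "$binary" "$target" && chmod 755 "$target" && echo "$target"
}
```

For a trial run, a destination such as "$HOME/bin" avoids touching the system directory until the new build has been tested.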

Now test the new ccextractor version (e.g. ccextractor-0.78) for both previous and new functionality. When you are satisfied, walk into your program directory and create a symbolic link to the new version:

    cd /usr/local/bin

    sudo ln -sf ccextractor-0.78 ccextractor

In the list of files (ls -l), you should see something like this:

    lrwxrwxrwx 1 root staff          16 Oct  2 05:48 ccextractor -> ccextractor-0.78
    -rwxr-xr-x 1 root staff     1687840 Oct  2 05:47 ccextractor-0.78

The program is now fully installed. 
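The symbolic-link step and the check of its result can also be scripted. The sketch below uses a hypothetical function name, activate_version, and relies on `readlink` to read back the link target instead of inspecting the ls -l output by eye.

```shell
# Hypothetical helper: point the generic name (default: ccextractor)
# at a versioned binary inside the given directory, then confirm the
# link resolves to that version.
activate_version() {
    dir=$1
    versioned=$2
    link=${3:-ccextractor}
    [ -e "$dir/$versioned" ] || { echo "missing: $dir/$versioned" >&2; return 1; }
    # Create the link from inside the directory so the target stays relative.
    ( cd "$dir" && ln -sf "$versioned" "$link" ) || return 1
    [ "$(readlink "$dir/$link")" = "$versioned" ]
}
```

Because the link is replaced in place, going back to an earlier build is a matter of calling the function again with the older versioned name.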

Locate the teletext

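With $CX set to the ccextractor binary and $F set to the base name of the recording (without the .mpg extension), run:

    $CX -debug -ts -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg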

Improvement for DVB bitmap captions against a semi-transparent textbox

Abhinav Shukla changed CCExtractor 0.85 to improve its recognition of such captions, which are used, for example, in the evening news on FR2.

    FIL=2017-07-14_1800_FR_FR2_Journal_20h00
    CX=ccextractor_0.85e
    export TESSDATA_PREFIX=/usr/share/tesseract-ocr
    $CX $FIL.mpg -datets -ttxt -UCLA -noru -utf8 -unixts 0 -delay 1500055200000
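The value passed to -delay in the command above equals the start time encoded in the file name (2017-07-14 18:00), read as UTC and expressed in milliseconds since the Unix epoch. Assuming that is how the value was chosen, it could be derived from the file name as sketched below. delay_ms_from_name is a hypothetical helper, and the `-d` option it uses is specific to GNU date, so this would not work unchanged with the stock date command on a Mac.

```shell
# Hypothetical helper: turn a name like
#   2017-07-14_1800_FR_FR2_Journal_20h00
# into milliseconds since the Unix epoch, treating the leading
# date and HHMM fields as UTC. Requires GNU date (for -d).
delay_ms_from_name() {
    name=$1
    day=${name%%_*}      # text before the first underscore: 2017-07-14
    rest=${name#*_}      # everything after the first underscore
    hhmm=${rest%%_*}     # next field: 1800
    hh=${hhmm%??}        # drop the last two characters: 18
    mm=${hhmm#??}        # drop the first two characters: 00
    secs=$(date -u -d "$day $hh:$mm:00" +%s) || return 1
    echo "${secs}000"    # seconds to milliseconds
}
```

For the file name used above, this prints 1500055200000, matching the -delay argument in the command.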