Mails from Francis - Chapter 3

installing software, install.sh, creating directories

Dear Jacek,

It will take some time to install Raspbian on vila, so this is looking ahead. Once you have the Raspbian OS working and generally configured (I can help), try this:

· apt-get install build-essential ffmpeg tesseract-ocr

· apt-get install tesseract-ocr-pol tesseract-ocr-ces (Polish and Czech -- it would be nice if you also added the other Red Hen languages next, but we can also wait)

· apt-get install tesseract-ocr-rus tesseract-ocr-eng tesseract-ocr-por tesseract-ocr-dan tesseract-ocr-nor tesseract-ocr-swe tesseract-ocr-ita tesseract-ocr-deu tesseract-ocr-fra

· mkdir -p ~/system/software/git/ccextractor

· cd ~/system/software/git/ccextractor

· wget https://github.com/CCExtractor/ccextractor/archive/master.zip

· unzip master.zip

· mv ccextractor-master ccextractor-0.82a

· cd ccextractor-0.82a/linux

What we want is DVB subtitle support, which CCExtractor calls HARDSUBX. So issue:

· make ENABLE_HARDSUBX=yes

· ./build

The build has been tested with ffmpeg version 3.1.0 and tesseract 3.04. Raspbian has ffmpeg 3.0.2, which is likely close enough, and tesseract 3.04.

This should give us a version of CCExtractor capable of reading DVB subtitles. However, it's possible the task is too CPU-intensive to be usefully done on the RPi. In any case we'll learn.

I was thinking of just eyeballing it to see how good it is. It's probably not worth trying to get a player to read the txt file as a subtitle file, though of course a converter would be handy. CCExtractor will produce .srt files if you ask for it; they will play in VLC. Of course cc-update-timestamps-fix-jumps won't work on srt files.

In the Edge search engine we may develop the ability to display the text as the video plays.

Best,

Francis
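Before starting the build, it can help to see which of the tools from the mail are already on the machine. The lines below are only a sketch: have is a made-up helper name, and the list of tools is my reading of the mail (build-essential brings gcc and make; wget and unzip are used in the download steps).

```shell
#!/usr/bin/env bash
# have is a hypothetical helper, not part of any Red Hen script.
# It succeeds when the named command can be found on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Report each tool the mail relies on.
for tool in gcc make ffmpeg tesseract wget unzip; do
    if have "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done

# Print the first line of each version banner when the tool exists.
# 2>&1 is there in case a banner is written to stderr.
if have ffmpeg; then
    ffmpeg -version 2>&1 | head -n 1
fi
if have tesseract; then
    tesseract --version 2>&1 | head -n 1
fi
```

The version lines can then be compared by eye with the ffmpeg 3.1.0 and tesseract 3.04 that Francis says the build was tested with.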

I have had vila for a month now and I still haven't done any of this. But we did other things, and vila now does some important work (see the webgrab++ appendix). So Francis is telling me to install some software manually. I have only recently realized how much (excellent, open source, well documented) software is used in Red Hen capture stations. I managed to run one of Francis's huge scripts, called install.sh, which is used to set up a new capture station (please, be brave and click on it). It is 'just' 5 pages of text, but this script calls other scripts that install tons and tons of software, and the installation goes on for hours. I managed to capture just a fragment of the magnificent screen output of install.sh (go on, please, have a look, it's 'only' 48 pages of text).

So apart from apt-get install we have, for example, mkdir -p ~/system/software/git/ccextractor. This creates a directory and (with the -p parameter) any missing 'parent' directories. The ~ character stands for the user's home directory; for user csa@dola it is /home/csa. We already know what cd (change directory) does (see Chapter 1). And when Francis says This should give us a version of CCExtractor capable of reading DVB subtitles, he means that the subtitles (closed captions) are hidden inside the video files (DVB stands for Digital Video Broadcasting) and that this extractor program could pull them out.
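To see what -p and ~ do without touching the real home directory, here is a small sketch that works in a throwaway directory (the variable name demo is mine):

```shell
#!/usr/bin/env bash
# Work in a scratch directory so nothing in the real home is touched.
demo=$(mktemp -d)

# Without -p this would fail, because "system", "software" and "git"
# do not exist yet; with -p every missing parent is created as well.
mkdir -p "$demo/system/software/git/ccextractor"

# Running it again is harmless: -p does not complain about
# directories that already exist.
mkdir -p "$demo/system/software/git/ccextractor"

# Show what was created, one directory per line.
find "$demo" -type d | sort

# An unquoted ~ is replaced by the shell with the home directory,
# normally the same value as $HOME.
echo ~
echo "$HOME"
```

For csa on dola the last two lines should both print /home/csa.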

repairing closed captions

Dear Jacek,

Here is a small detail in the saga of capture, still experimental, but one that we might eventually incorporate into the automated system.

Every so often, the text extraction succeeds, but at some points the timestamps go awry. A glitch in the reception has reset the time in the middle of the show.

Consider for instance 2016-09-30_1700_CZ_ČT1_Události.txt. Around line 804 we see this:

20160930174440.800|20160930174444.560|CC1|Později vedl brněnskou městskou policii.

20160930174444.600|20160930174447.760|CC1|Koncem roku 2012 ho ale policie obvinila ze zneužití pravomoci

20160930174447.800|20161001201534.347|CC1|při zahlazování přestupků.

As you can see, the timestamp suddenly skips from 17:44:47 on September 30 to 20:15:34 on October 1, and then continues normally, but now with the wrong time. A single error that propagates.

And in another mail Francis adds:

Red Hen has had occasional files with timestamp jumps for years; I've only just now developed a solution that seems pretty good. The problem has several dimensions. The script (called cc-update-timestamps-fix-jumps; you can see it on dola) focuses on blatant errors -- that is to say, timestamps that exceed the temporal window defined by the name and duration of the file.
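The idea of a "temporal window defined by the name and duration of the file" can be tried on the three lines quoted above. This is only my own sketch and not cc-update-timestamps-fix-jumps: it assumes that the time in the file name is UTC, that the show lasts one hour, and that GNU date is installed. It only reports suspicious lines; it repairs nothing.

```shell
#!/usr/bin/env bash
# Sketch only: this is NOT cc-update-timestamps-fix-jumps.
# Assumptions: file name time is UTC, duration is one hour, GNU date.
name="2016-09-30_1700_CZ_ČT1_Události.txt"
duration=3600                       # seconds (assumed)

# Pull the date and the start time out of the file name.
day=${name%%_*}                     # 2016-09-30
rest=${name#*_}
hhmm=${rest%%_*}                    # 1700

# Window start and end in the same YYYYMMDDHHMMSS form as the captions.
start_epoch=$(date -u -d "$day ${hhmm:0:2}:${hhmm:2:2}:00" +%s)
win_start=$(date -u -d "@$start_epoch" +%Y%m%d%H%M%S)
win_end=$(date -u -d "@$((start_epoch + duration))" +%Y%m%d%H%M%S)

# The three lines quoted in the mail, in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
20160930174440.800|20160930174444.560|CC1|Později vedl brněnskou městskou policii.
20160930174444.600|20160930174447.760|CC1|Koncem roku 2012 ho ale policie obvinila ze zneužití pravomoci
20160930174447.800|20161001201534.347|CC1|při zahlazování přestupků.
EOF

# Flag every caption whose start or end lies outside the window.
flagged=$(awk -F'|' -v s="$win_start" -v e="$win_end" '
    $1 ~ /^[0-9]+\.[0-9]+$/ {
        t1 = substr($1, 1, 14) + 0
        t2 = substr($2, 1, 14) + 0
        if (t1 < s + 0 || t2 > e + 0) print NR ": " $0
    }' "$sample")

echo "window: $win_start to $win_end"
echo "$flagged"
rm -f "$sample"
```

Only the third line is reported, because its end time falls on the next day, far outside the one-hour window.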

Oh yes, he calls it a small detail in the saga of capture - 'small' indeed. I am only beginning to get a first idea of how much hard (and creative) work it takes to capture video and text and convert it into a searchable corpus / database format. I always thought: well - just set up your Sony VCR and Bob's your uncle.

redhen.config, timezone, hostname

Red Hen's search engine, Edge, runs on php, which is more restrictive on its timezones than the Linux utility GNU date. So while our newly captured files now show up in the search engine, they generate an error on the timezone, which makes them inaccessible. On dola, the timezone designation is set in /nest/cfg/redhen.config, and I've modified it to read:

# Red Hen configuration file

location=Wroclaw, Poland

hostname=dola.pl

timezone=Europe/Warsaw

I also changed the txt files:

sed -i 's/SRC|Poland/SRC|Wroclaw, Poland/' *{_CZ_,_PL_}*.txt

sed -i 's/Europe\/Wroclaw/Europe\/Warsaw/' *{_CZ_,_PL_}*.txt

However, this little bump means we won't see the files in the search engine until tomorrow.

PHP is a programming language, Google says, and /nest/cfg/redhen.config is one of the config files that Francis's huge install.sh uses when setting up a station. So this is where its name (dola, invented by Mark) was decided. And this name captures the essence (100% proof) of Polishness, I say.
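The two sed commands from the mail can be tried safely on made-up files. In this sketch the file names and the two lines inside each file are invented for the demo (they are not real Red Hen headers), and GNU sed is assumed for the -i option.

```shell
#!/usr/bin/env bash
# Scratch directory with two invented sample files.
work=$(mktemp -d)
for f in demo_CZ_news.txt demo_PL_news.txt; do
    printf 'SRC|Poland\nzone Europe/Wroclaw\n' > "$work/$f"
done

# Brace expansion turns one pattern into two globs:
#   *{_CZ_,_PL_}*.txt  ->  *_CZ_*.txt *_PL_*.txt
echo "$work"/*{_CZ_,_PL_}*.txt

# The same substitutions as in the mail. The second one uses # as the
# delimiter, so the slashes need no backslashes in front of them.
sed -i 's/SRC|Poland/SRC|Wroclaw, Poland/' "$work"/*{_CZ_,_PL_}*.txt
sed -i 's#Europe/Wroclaw#Europe/Warsaw#' "$work"/*{_CZ_,_PL_}*.txt

cat "$work/demo_CZ_news.txt"
```

Running the two sed lines a second time changes nothing, because after the first run the old text is no longer there to be matched.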

recording quality

The Czech recording has 21 thousand transport errors and 800 sequence errors -- that's quite high and may result in lost text or degraded video. The idea is just to use this information to see if you can improve the signal.

Francis is speaking of the video logs we already mentioned in Chapter 1, and I did improve the signal (see the TV antenna appendix).

transmitting files to central storage

Dear Mark and Jacek,

I've activated the high-performance computing cluster at UCLA to pick up the files from dola and compress the video from mpeg2 to mp4. The files will then be transmitted to Red Hen's central storage; the first files should be available in the edge search engine tomorrow.

Oh, yes - the Eagle has landed, high performance clusters deployed, one small step for a man... and we are exporting Polish (mój kochany język, Polski, Polska, Polskę, Polsce, Polskimi, Polską, Polsko, Polskich, Polskiego, etc.) and Czech together with their funny letters (ąęśćżźół) and incredibly varied inflection to California, UCLA, USA. Yes.

automatic scheduling, bash scripting (4), scp and ssh

Hi Mark,

Jacek found the huro grabber in FreeGuide; it has the Czech schedules. I trimmed the /home/csa/.freeguide/xmltv-configs/huro.conf file to include only ČT1 -- see /home/csa/.freeguide/xmltv-configs/huro.conf-orig for the full list of available networks. The network name has the accented C, which we now also use in the lineup.

The Czech xmltv file is in the iso-8859-2 character set, so we have to convert it to UTF-8 first; the download script now does this. I also made some changes to the schedule script so that it displays a sorted list of all unique matching shows, and then adds the exact matches only to crontab. Of course if you leave out the accents, as Jacek found, we match nothing. This means we have fully automated scheduling of Czech news. In addition, because we've already worked on Czech files, the timestamps and text both look perfect.
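The character set conversion and the trouble with missing accents can both be seen in a tiny experiment. This is a sketch with iconv, which is my assumption; the mail does not say which tool the download script uses. The channel line is invented for the demo.

```shell
#!/usr/bin/env bash
# Build a tiny ISO-8859-2 file by hand: byte 0xC8 (octal 310) is the
# letter Č in that character set. The <channel> line is invented.
latin2=$(mktemp)
utf8=$(mktemp)
printf '<channel id="\310T1"/>\n' > "$latin2"

# Convert from ISO-8859-2 to UTF-8.
iconv -f ISO-8859-2 -t UTF-8 "$latin2" > "$utf8"
cat "$utf8"

# Matching is exact: the accented name is found, the plain one is not.
# "|| true" keeps a zero count from looking like a failed command.
with_accent=$(grep -c 'ČT1' "$utf8" || true)
without_accent=$(grep -c 'CT1' "$utf8" || true)
echo "ČT1 matches: $with_accent, CT1 matches: $without_accent"
rm -f "$latin2"
```

The accented name matches one line and the unaccented one matches none, which is the "we match nothing" effect Francis describes.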

So for a time we had Czech scheduling automated (the schedule command in crontab, see Chapter 2) and Polish scheduling manual (or semi-automatic). But now the problem is solved. Francis (with a little contribution from Cpt. Dummy here) has created a great automated system in which one of my RPIs (vila) captures the TV schedule for Poland, Czechia and Russia (and even one US station). This is what it looks like in crontab on vila:

One of Francis's scripts - grab-tv-schedule.sh - grabs the TV schedule, taking a 2-letter country code as a parameter (one of my additions - PROUD), and puts it in a directory where it waits for other RPIs around the world to import it. Let's see how they do it, using dola's crontab as an example (to see what's in crontab, issue crontab -l):

Another of Francis's scripts - xmltv-convert-04.sh - takes the xmltv schedule from vila, massages it a little (Francis's words), and puts it in the correct directory on dola, where it waits to be used by the schedule script. And it's fully automated. We don't have to do a thing now - just party wildly, sing in the shower, go tree hugging, etc. - the usual stuff. Let's look inside the xmltv-convert-04.sh script, because there is much to be learned from it. Here's a little fragment:

# Input file or system

if [ -z "$1" ]

then echo -e "\n\tPlease supply an input file name, optionally \"pl\", \"cz\", etc.\n" ; exit

else INFIL=$1

fi

$1 is the first input parameter and [ -z "$1" ] is how you check whether it has zero length. If it does, the script will shout at you and then exit in a huff. We will not analyze the whole script, just one more line that can be very useful:
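The same test can be played with in a terminal. In this sketch need_input is a made-up function standing in for the top of the script; it uses return instead of exit so that trying it does not close the terminal.

```shell
#!/usr/bin/env bash
# need_input is a hypothetical stand-in for the start of the script.
# It returns instead of exiting so it is safe to try interactively.
need_input() {
    if [ -z "$1" ]; then
        echo "Please supply an input file name" >&2
        return 1
    fi
    INFIL=$1
    echo "input is: $INFIL"
}

# No parameter at all: the complaint is printed and the status is 1.
need_input || echo "status: $?"

# One parameter: accepted, status 0.
need_input pl && echo "status: $?"

# A parameter that is present but empty still has zero length,
# so it is rejected too. The quotes around $1 keep the test valid here.
need_input "" || echo "status: $?"
```

The first and third calls print the complaint and "status: 1"; the second prints "input is: pl" and "status: 0".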

scp $FIL ca:/tmp ; ssh ca "rm -f /tmp/$($DAT -d "-1 day" +%F)-$CC.xmltv"

What it does first (scp) is copy a file (its name is in $FIL) from Poland (dola RPI) to California (cartago) and put it in the /tmp directory there. Then (ssh) it runs a command on cartago that removes another file (yesterday's schedule, to clean up). Let's play with scp and ssh a little. I have this unimportant file called jwresults.csv in my home directory on dola:

So with scp I copied it to the /tmp directory on my other RPI (vila). Then I went there with ssh vila (the full format would be ssh csa@vila), changed directory (cd) to /tmp and listed (ls) its contents. And now I am jumping up and down, shouting: It worked, har har, it worked! Because the file jwresults.csv is indeed there. See what it says in the welcome message: "Jacek's developer machine". Mark and Francis decided I am so underdeveloped that they invented this machine to improve me. And it works: I was a dummy and now I am Cpt. Dummy, see?
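The date trick inside Francis's scp and ssh line can be tried without copying or deleting anything. This is a dry-run sketch under my own assumptions: that DAT holds the name of the GNU date command, that CC is the 2-letter country code, and that today's file follows the same name pattern as yesterday's. The commands are only printed, never run.

```shell
#!/usr/bin/env bash
# Assumptions: DAT names the GNU date command, CC is the country code.
# Both are set by hand here; the real script sets them elsewhere.
DAT=date
CC=pl

# Today's schedule file (name pattern assumed) and yesterday's copy.
FIL="$($DAT +%F)-$CC.xmltv"
OLD="/tmp/$($DAT -d "-1 day" +%F)-$CC.xmltv"

# Dry run: show the two commands instead of executing them.
echo "would run: scp $FIL ca:/tmp"
echo "would run: ssh ca \"rm -f $OLD\""
```

One detail worth knowing: because the $( ) part sits inside double quotes, the date is worked out on the local machine, and the remote machine only receives the finished file name.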

bash aliases

Let's talk now about one of the greatest inventions of mankind, called bash aliases:

Dear Jacek,

For simple functions, you can use bash aliases; they're easier to track than lots of small scripts.

Check out the file /home/csa/.bash_aliases. It defines a series of shortcuts and functions. Try this on vila:

nano ~/.bash_aliases

Add this line:

alias d='ssh dola'

Save and issue

source ~/.bash_aliases

Then press d. You don't need to include the user name (csa@dola) if you're connecting to the same user on the remote system.

Francis
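Here is a small sketch of what such a file can contain. The d alias is the one from the mail; the ls alias is a common one; mkcd is a made-up function, added only to show something an alias cannot do, namely take a parameter.

```shell
#!/usr/bin/env bash
# Two aliases: a short name on the left, the full command on the right.
alias d='ssh dola'
alias ls='ls --color=auto'

# Asking for an alias by name prints its definition back.
alias d
alias ls

# A function can use its parameters ($1), which an alias cannot.
# mkcd is a hypothetical example: make a directory and step into it.
mkcd() {
    mkdir -p "$1" && cd "$1"
}
```

After saving such lines in ~/.bash_aliases and sourcing the file, the new names behave like ordinary commands in the terminal; unalias d removes one again.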

So here is a fragment of the list of aliases (shortcuts) on dola:

Have you ever wondered (I know you did) why my ls displays are so elegantly coloured? Because on my list you will find: alias ls='ls --color=auto'. Which means that every time I use the ls (list) command it does ls --color=auto. And instead of cd'ing 5 times up the directory tree to get out of some remote subdirectory in the woods I can just go ...... . I bet even Mark and Francis don't have so many clever aliases on their lists. To continue, go to Chapter 4.