treetagger

by Jacek Woźny (jacekifa@gmail.com)

TreeTagger is a program that tells you what part of speech (POS) each word in a text is. Yes, some peculiar people, the yawn-lot I call them (linguists), won't get a wink of sleep until they know what POS a word is. They just need everything to be neatly categorized - orderly. "Part" is a noun, "of" is a preposition, and "speech" is a noun again - good night. Well, in fact, TreeTagger could be very useful for the Slavic-language (Polish, Czech, Russian) closed captions (CCs, the subtitles) that are captured by the dola and rusalka stations, because the Slavic text is not tagged (parsed) yet (as of Nov. 20th, 2016). And we want researchers to be able to search for sequences such as "noun is the noun of noun" (X is the Y of Z - ask Mark Turner and he will tell you all there is to know about it).
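To make the "X is the Y of Z" search a bit more concrete, here is a minimal sketch of how such a sequence could be pulled out of tagged text. It assumes the tagger output has one token per line with tab-separated word, tag and lemma columns, and it uses English Penn-style tags (NN for nouns) purely for illustration; the function name find_xyz and the sample sentence are made up, and the Slavic parameter files use their own tagsets.

```shell
# Hypothetical helper: read tagged lines (word<TAB>tag<TAB>lemma) on stdin
# and print every six-token window that looks like "NOUN is the NOUN of NOUN".
find_xyz() {
  awk -F'\t' '
    { w[NR] = $1; t[NR] = $2 }   # remember each word and its tag
    END {
      for (i = 1; i + 5 <= NR; i++)
        if (t[i] ~ /^NN/ && w[i+1] == "is" && w[i+2] == "the" &&
            t[i+3] ~ /^NN/ && w[i+4] == "of" && t[i+5] ~ /^NN/)
          print w[i], w[i+1], w[i+2], w[i+3], w[i+4], w[i+5]
    }'
}

# Made-up sample input in the assumed three-column layout.
printf 'Necessity\tNN\tnecessity\nis\tVBZ\tbe\nthe\tDT\tthe\nmother\tNN\tmother\nof\tIN\tof\ninvention\tNN\tinvention\n' | find_xyz
```

For the Slavic captions the literal words "is", "the" and "of" would of course have to be replaced by whatever tags and forms the Polish or Russian parameter files produce.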

Basic information about TreeTagger can be found at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ and there we will learn that TreeTagger can do this, for example:

Clicking the link above, we will find detailed instructions on how to download and install the program on Linux, Windows or Mac OS. And, as we know, the Raspberry Pi - the atom of the Red Hen network - is a Linux machine. So it would seem we can just follow the instructions for Linux and all will be well. Well, of course it isn't. I mean - is it ever? I know next to nothing about Linux or Raspberries, so I just followed the installation instructions from the above link and installed TreeTagger on a Raspberry Pi. I created a directory (md name) and uploaded all the components to it using WinSCP - just dragged the files from my Windows desktop to the folder on the RPi. And then, when I tried to use the program to tag a simple sentence, "hello world", I got this, a lot of fluffy errors:

I tried and retried various installation options - no success. I installed TreeTagger on Windows and it worked there, but not on the RPi. And then Francis swooped upon me and asked me to join his session using GNU screen (a terminal multiplexer, a software application that can be used to multiplex several virtual consoles, allowing a user to access multiple separate login sessions inside a single terminal window, says Wikipedia). I joined it by typing screen -x in the console and then navigated the various screens Francis was working on by pressing Ctrl-A SPACE. One of the screens was used for chat - Francis was working on many screens at once and explaining on the chat screen. It works like this: you take turns typing your comments at the shell prompt, but instead of pressing Enter (bash would then try to run your comment as a command and complain with "command not found") you press Ctrl-C. Here's an example:

Francis: csa@vila:~ $ It's very complicated - see https://github.com/opener-project/kaf/w

Jacek: csa@vila:~ $ I have complicated for breakfast!^C

Jacek: csa@vila:~ $ they have Linguistic Processors - i could use one:)^C

Jacek: csa@vila:~ $ Jeez - so much work there!^C

Jacek: csa@vila:~ $ and they say quantum physics is difficult^C

Jacek: csa@vila:~ $ Hello - Mark? - please come and make a small spatial story out of t

Francis: csa@vila:~ $ Ha ha - so kind of you " Jacek and I are looking at treetagger and

Jacek: csa@vila:~ $ But It was a good work (yours I mean) we HAVE a working tagger on C

Francis: csa@vila:~ $ OK, so treetagger works on cartago. Will you take it from here? ^C

Jacek: csa@vila:~ $ Certainly - where?^C

Francis: csa@vila:~ $ ::-))^C

and so on

I invented a fast response method there - just type I and press Ctrl-C, and the line reads: I^C. Here's an example:

csa@vila:~ $ OK I'll show you a script in 4 ^C

csa@vila:~ $ Are you seeing the python script? The way I've used python is quit different from bash -- basically, python does its work by calling in specialized modules

csa@vila:~ $ So you see at the top, "import sys, os.path, datetime, re, codecs" -- these are python modules we activate for this script.

csa@vila:~ $ I^C (should have rather written: err, ahem, Uh-huh, WHAT???? - but there wasn't time)

And so I was jumping between the various screens frantically, trying to follow what Francis was doing in them. And he was executing looong commands, editing files and chatting (explaining). AND documenting. Yes, he was writing down in files what he was doing, step by step. For example:

wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.1.tar.gz

These components work fine, but obviously can do nothing without tree-tagger itself

wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/install-tagger.sh

wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz

See http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/#parfiles for supported languages

wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/polish-par-linux-3.2-utf8.bin.gz

wget http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/russian-par-linux-3.2-utf8.bin.gz

See http://treetaggerwrapper.readthedocs.io/en/latest/ and https://github.com/hltfbk/Excitement-Open-Platform/wiki/Step-by-Step,-TreeTagger-Installation

just install python-six

just install ant maven

Jessie's ant is 1.9.4, which has an ftp problem, so activate stretch and install 1.9.7:

just install ant/stretch

Get the build file

wget https://raw.githubusercontent.com/dkpro/dkpro-core/master/dkpro-core-langdetect-asl/src/scripts/build.xml

It complains

Unable to locate tools.jar. Expected to find it in /usr/lib/jvm/java-7-openjdk-armhf/lib/tools.jar

but installs successfully. If we have problems, upgrade java:

oracle-java8-jdk

It has tools.jar -- we would then need to run the build script again. But let's try java 7 first.

etc.

And he was doing it all at once - blast beat, multiple drum strokes, perfect rhythm. I am not exaggerating - it simply was a concert. I can't explain it in any other way.

After running some diagnostics, Francis concluded that in this case the binaries provided by the developer are compiled for the x86 platform, while the RPi is on the armhf platform. So who is surprised it doesn't work -- we need to compile the code ourselves on an RPi, or have the developer do it for us.

I^C - so there are those 'platforms' for which you compile (I almost get it), and the RPi is a different platform, and that's why it wasn't working. I also learned that if I am looking to install something on an RPi, it's better to include 'raspbian' or 'debian' in the search (both are Linux distributions; Raspbian is a version of Debian) to get something that might work on a Raspberry Pi.
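For anyone who, like me, did not know which platform they were on, here is a small sketch of how one might check before downloading binaries. It only relies on the standard uname command; the helper name arch_family and the list of machine names it recognizes are my own assumptions, not an exhaustive table.

```shell
# Hypothetical helper: map a machine name (as printed by `uname -m`)
# to a rough platform family.
arch_family() {
  case "$1" in
    x86_64|i386|i486|i586|i686) echo "x86" ;;
    armv6l|armv7l|armv8l|aarch64|arm64) echo "arm" ;;
    *) echo "unknown" ;;
  esac
}

# Report what the current machine is.
machine=$(uname -m)
echo "This machine: $machine -> $(arch_family "$machine")"

# To see what a downloaded binary was built for, the `file` command prints
# its target architecture. The path is an assumption about where the
# archive was unpacked, so the line is left commented out:
# file bin/tree-tagger
```

If the family printed for the machine does not match what file reports for the binary, the program cannot run there, which is the situation Francis diagnosed.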

And it's really AMAZING what you can find if you do. Take this page, for example (google "debian treetagger"):

https://github.com/opener-project/opener-tree-tagger/blob/master/task/tree_tagger.rake

etc. etc.

The (holy) spirit of generosity, good will, sharing - it's not lost in this world of ours. One just has to learn where to look for it. And this is one of those places.

Anyway, Francis did make TreeTagger run on Linux, but on cartago. And until yesterday I thought cartago was a Raspberry Pi, but Francis told me that in fact it is a huge Linux server with 50 drives. Shows you how much I know, doesn't it? So he made it run there. We have it now. It's working. He did it. There is hope for Slavic tagging.

Thank you - to Francis, the master bash drummer, and to you, dear reader.