SyntaxNet and Parsey McParseface on Red Hen
Google open-sourced SyntaxNet on 12 May 2016, as described in its research blog post. Red Hen runs a variety of Natural Language Processing systems on its data to tag for grammar. Can we run SyntaxNet on the Red Hen dataset? Would you like to help with this task? If so, write to
and we will try to connect you with a mentor.
Update 19 Dec 2016: Parsey McParseface is now installed in a Singularity image and can be used to parse English text. The image is available from Peter Uhrig.
This tutorial details how to set up the tools needed to build a multilingual text-processing pipeline. We first show how to set up Singularity containers for SyntaxNet and TreeTagger so that they can be run on servers even without root access.
Begin by installing Singularity and debootstrap following the official installation instructions.
Download the syntaxnet.def file and build a Singularity container for SyntaxNet with the following commands. Note that you will need about 20 GB of free space on your machine.
singularity create --size 20000 syntaxnet.img
sudo singularity bootstrap syntaxnet.img syntaxnet.def
This installs SyntaxNet inside the container and also downloads parameter files for several languages. You can enter the container using:
singularity shell -w --cleanenv syntaxnet.img
You should now be able to run the different SyntaxNet models after unzipping them. To unzip the files, go to the directory where they were downloaded (/opt/models/syntaxnet/syntaxnet/models/other_language_models in our case) and run unzip <filename>.
You should tokenize your text before passing it to the parser. Tokenization separates punctuation marks from words, which increases the accuracy of the parsers. SyntaxNet provides tokenizers for some languages; these can be found in the TensorFlow models repository on GitHub. If a tokenizer is available for your language, it can be run as follows:
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY > output.txt
Here, sentences.txt is the input file and the tokenizer's output is written to output.txt. In our case, MODEL_DIRECTORY was /opt/models/syntaxnet/syntaxnet/models/other_language_models/<insert-language-name>
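To see why this step matters, here is a stand-alone illustration of what tokenization does. The real tokenize.sh uses a trained model; the naive sed rule below (our own stand-in, not part of SyntaxNet) merely splits punctuation off the preceding word, which is the basic effect the parser depends on:

```shell
# Sample input standing in for real text.
printf 'Red Hen parses text, quickly.\n' > sentences.txt

# Naive rule-based tokenization: insert a space before each punctuation mark.
# (Illustration only; the model-driven tokenizer handles far more cases.)
sed 's/\([.,;:!?]\)/ \1/g' sentences.txt > output.txt

cat output.txt
# -> Red Hen parses text , quickly .
```

With punctuation split off, "text," and "text" are no longer treated as two different words by the parser.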
An already tokenized file can be parsed as follows:
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.txt
You should now have the parsed output from syntaxnet in output.txt.
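The parser writes one token per line in tab-separated CoNLL format (assuming the standard 10-column layout: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with a blank line between sentences. As a sketch, the awk call below pulls out just the word form and the coarse POS tag; the two sample lines stand in for real parser output:

```shell
# Two CoNLL-style lines standing in for real parser output.
printf '1\tRed\tred\tADJ\tJJ\t_\t2\tamod\t_\t_\n2\tHen\then\tNOUN\tNN\t_\t0\troot\t_\t_\n' > output.txt

# Keep column 2 (word form) and column 4 (coarse POS tag).
awk -F'\t' 'NF >= 10 { print $2 "\t" $4 }' output.txt
```

The same pattern extracts any other column, e.g. $7 and $8 for the dependency head and relation.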
Here we describe how to build a Singularity container for TreeTagger. TreeTagger is a part-of-speech tagger that also provides lemmatization (reducing words to their base forms), which SyntaxNet does not. The process is similar to the one for SyntaxNet.
Download the treetagger.def file and build the container as follows:
singularity create --size 5000 treetagger.img
sudo singularity bootstrap treetagger.img treetagger.def
Enter the container using:
singularity shell -w --cleanenv treetagger.img
The treetagger.def file already contains commands to download parameter files for a few languages. Parameter files for additional languages can be downloaded via the corresponding links on the TreeTagger website. Note that you must run install-tagger.sh after downloading new parameter files before you can use them.
To run the tagger, go to the directory where TreeTagger was installed (/opt in our case) and run:
cat input.txt | cmd/tree-tagger-<insert language name>
This outputs the tagged text with the corresponding lemma for each word.
The tree-tagger-<language> files contain the commands that read the input text, tokenize it, and tag it. The output sometimes contains “<unknown>” in the lemma column for words the tagger does not recognize. To output the word itself in the lemma column instead, add the “-no-unknown” flag to OPTIONS in the corresponding tree-tagger file in the cmd directory:
OPTIONS="-no-unknown -token -lemma -sgml -pt-with-lemma"
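If you already have tagged output and do not want to rerun the tagger, the same fix can be applied after the fact. TreeTagger emits one word<TAB>tag<TAB>lemma line per token, so a short awk pass (our own sketch, not part of TreeTagger) can copy the word into the lemma column wherever it reads <unknown>; the two sample lines below stand in for real tagger output:

```shell
# Sample TreeTagger-style output: word<TAB>tag<TAB>lemma.
printf 'Casablanca\tNP\t<unknown>\nis\tVBZ\tbe\n' > tagged.txt

# Replace <unknown> lemmas with the word itself, leave other lines alone.
awk -F'\t' 'BEGIN{OFS="\t"} $3=="<unknown>"{$3=$1} {print}' tagged.txt > tagged_fixed.txt

cat tagged_fixed.txt
```

This leaves correctly lemmatized lines (like "is / VBZ / be") untouched.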
Note for Portuguese: While running the script on Portuguese text, we noticed that both tree-tagger-portuguese and tree-tagger-portuguese-finegrained stop at the first special character, producing neither further output nor any error. It turns out the script contains a 'grep' intended to remove blank lines, which was also eliminating the text after a special character. This can be avoided by commenting out line 23 in the file:
#grep -v '^$' |