This page is based on Prannoy Mupparaju's 2017 GSoC project. Since then, we have identified some drawbacks in the approach we chose back then, and we suggest a modified pipeline that builds on Prannoy's work and on the work of Edward Seley, a student at CWRU. The following paragraphs give an overview of how to approach this project.
The first version of the pipeline is available on GitHub. We are going to keep it stable for now and instead work on the newly created repository for version 2. So even though that repository is currently quite empty, please fork it, work on it, and then submit pull requests. If you work on this for an extended period, we can also give you rights on the repository itself; please write to Peter Uhrig for this.
One advantage of the new pipeline over the old one is that it makes use of only relatively portable software, so we will probably be able to do this without the overhead of a Singularity container. You will need the pragmatic segmenter and UDPipe, including the full set of models.
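As a rough sketch of what the prerequisites look like (installation details vary by system, and the model file name below is only an example):
gem install pragmatic_segmenter   # Ruby gem for rule-based sentence splitting
# UDPipe binaries are available from https://github.com/ufal/udpipe/releases;
# the trained models (e.g. english-ud-2.0-170801.udpipe) can be downloaded from the
# LINDAT/CLARIN repository linked in the UDPipe documentation.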
Processing steps:
As you can see, most of this is relatively straightforward. The only aspect that may appear strange is that we create two versions of the text, one with and one without timestamps, and then merge them at the end. The reason for this is that UDPipe (and possibly also the pragmatic segmenter) cannot deal with XML-style annotations. Since UDPipe splits up words that other software would regard as a single word, we cannot simply count or look up the words from the original text in the processed text. This is why we keep the timestamps in the text as "words": we then know exactly where they end up in UDPipe's output. However, since having these unknown "words" in the text negatively affects the analyses of UDPipe (or any other NLP system), we cannot use the linguistic annotations from the output with timestamps; they would have an unnecessarily high error rate.
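As a hedged illustration of this two-pass idea (the model file name and the input/output file names are placeholders, and the merge script still has to be written):
udpipe --tokenize --tag --parse english-ud-2.0.udpipe with_timestamps.txt > with_timestamps.conllu   # timestamps survive as pseudo-words, so we know their token positions
udpipe --tokenize --tag --parse english-ud-2.0.udpipe without_timestamps.txt > clean.conllu   # clean text yields reliable lemmas, tags and dependencies
# A merge step then aligns the two outputs token by token, taking the timestamp positions
# from the first run and the linguistic annotations from the second.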
[End of Update October 2018]
This tutorial details how to set up the tools needed to build a multilingual text processing pipeline.
We first show how to set up Singularity containers for SyntaxNet and TreeTagger so that they can be run on servers even without root access.
Install Singularity and debootstrap as described here.
Download the syntaxnet.def file and build a Singularity container for SyntaxNet using the following commands. Note that you will need about 20 GB of free space on your machine.
singularity create --size 20000 syntaxnet.img
sudo singularity bootstrap syntaxnet.img syntaxnet.def
This installs SyntaxNet inside the container in the /opt directory and also downloads parameter files for some languages.
You can enter the container using:
sudo singularity shell -w --cleanenv syntaxnet.img   # for write access
singularity shell --cleanenv syntaxnet.img   # for read access and testing without elevated user rights
You should now be able to run the different SyntaxNet models after unzipping them. To unzip the files, go to the directory where they were downloaded (/opt/models/syntaxnet/syntaxnet/models/other_language_models in our case) and run unzip <filename>.
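For example (the file name depends on the language model you downloaded):
cd /opt/models/syntaxnet/syntaxnet/models/other_language_models
unzip German.zip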
You should tokenize your text before passing it to the parser. This separates the punctuation marks from the words, thereby increasing the accuracy of the parser. SyntaxNet provides tokenizers for some languages; these can be found on the website. If a tokenizer is available for your language, it can be run as follows:
cd /opt/models/syntaxnet
MODEL_DIRECTORY=/where/you/unzipped/the/model/files
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY > output.txt
Here, sentences.txt is the input file and the output of the tokenizer will be in output.txt. MODEL_DIRECTORY in our case was /opt/models/syntaxnet/syntaxnet/models/other_language_models/<insert-language-name>.
An already tokenized file can be parsed as follows:
cd /opt/models/syntaxnet
MODEL_DIRECTORY=/where/you/unzipped/the/model/files
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.txt
You should now have the parsed output from SyntaxNet in output.txt.
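The output is a CoNLL-style table with one token per line and tab-separated columns (token index, word form, lemma, coarse and fine POS tag, morphological features, head index, dependency relation, ...). A single line might look roughly like this (values are purely illustrative):
1	Der	_	DET	DET	Case=Nom|Gender=Masc|Number=Sing	2	det	_	_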
Here we describe how to make a Singularity container for TreeTagger. TreeTagger is a part-of-speech tagger that also provides lemmatization (reducing words to their root forms), which SyntaxNet does not. The process is similar to what we did with SyntaxNet.
Download the treetagger.def file and build the container as follows:
singularity create --size 5000 treetagger.img
sudo singularity bootstrap treetagger.img treetagger.def
Enter the container using:
singularity shell -w --cleanenv treetagger.img
The treetagger.def file already contains scripts to download parameter files for a few languages. New languages can be added by getting the corresponding link from the website. Note that you have to run install-tagger.sh after downloading new parameter files in order to use them.
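For instance, an additional parameter file could be fetched and installed roughly like this (the link is a placeholder; copy the actual URL from the TreeTagger download page):
cd /opt
wget <link-copied-from-the-TreeTagger-website>
sh install-tagger.sh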
To run TreeTagger, go to the directory where it was installed (in our case /opt) and run:
cat input.txt | cmd/tree-tagger-<insert language name>
This will output the tagged text with the corresponding lemmas.
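The output is tab-separated with one token per line: word form, POS tag, lemma. An illustrative line from the English model (tagsets differ between languages):
dogs	NNS	dog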
The tree-tagger-<language> files contain the commands to take the input text, tokenize it, and tag it. The output sometimes contains "<unknown>" in the lemma column for words the tagger does not recognize. This can be changed so that the word form itself is output in the lemma column by adding the -no-unknown option to OPTIONS in the corresponding tree-tagger file in the cmd directory:
OPTIONS="-no-unknown -token -lemma -sgml -pt-with-lemma"
Note for Portuguese: While running the script on Portuguese text, we noticed that both tree-tagger-portuguese and tree-tagger-portuguese-finegrained stop at the first special character and give neither output after that point nor any error. It turned out that the script contains a grep for removing blank lines, which was somehow eliminating the text after a special character. We found this could be avoided by commenting out line 23 in the file:
#grep -v '^$' |
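If you prefer not to edit the script by hand, a sed one-liner such as the following would comment out that line (this assumes the grep is still on line 23 in your copy):
sed -i '23 s/^/#/' cmd/tree-tagger-portuguese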
The aim of this pipeline is to take as input one of the files in the NewsScape dataset and output an XML-style file with sentence splits as well as lemma, POS, and dependency information for each word. The pipeline can be summarized in 5 major steps:
The following figure shows the sequence of operations in the pipeline from input to output.