Multilingual Corpus Pipeline

Update (October 2018)

This page is based on Prannoy Mupparaju's 2017 GSoC project. In the meantime we have identified some drawbacks in the approach we chose back then and we suggest a modified pipeline building on Prannoy's work and the work of Edward Seley, a student at CWRU. The following paragraphs give an overview of how to approach this project.

The first version of the pipeline is available on GitHub. We are going to keep it stable for now and instead work on the newly-created repository for version 2. So even though this is currently quite empty, please fork this repository, work on it and then submit pull requests. If you work on this for longer, we can also give you rights on the repository itself. Please write to Peter Uhrig for this.

One advantage of the new pipeline over the old one is that it makes use of relatively portable software only, so we will probably be able to do this without the overhead of a singularity container. You will need the pragmatic segmenter and UDPipe including the full set of models.

Processing steps:

  1. Extract the text from the NewsScape TXT files in two versions: one with timestamps and one without. Prannoy's preprocess.py from the first version of the pipeline can be used for this without modifications (see run.sh there for how to call it).
  2. Apply Edward Seley's quotefix.rb (this is the only file already found in the new repository) to both versions. This script resolves a problem we have with the pragmatic segmenter when quotation marks open but do not close.
    1. Run the results through the pragmatic segmenter. You can use the existing ss.rb script from the first version of the pipeline.
  3. For languages/programmes for which we find only upper-case captions (e.g. Brazilian Portuguese), it probably makes sense to lowercase everything except the first letter in a sentence to improve the accuracy of the tagging and parsing in subsequent steps. We may also need to evaluate the performance of the pragmatic segmenter for such data.
  4. Run the results through UDPipe; for the version with timestamps, it is sufficient to run the tokenizer; tagging and parsing would probably just be a waste of time (but double check how forms such as "am" (see next step) behave).
  5. Now use the version without timestamps and add the timestamps to it from the version with timestamps, so we obtain a document in the target format. You can modify Prannoy's script parser.py to work with UDPipe's output instead of SyntaxNet's output (the two are very similar). The old version also takes in output from the separate lemmatizer, but this is no longer necessary since lemma information is already present in UDPipe's output. The target format is a tabular format; you can find a Russian example file attached to this website. Ideally, we keep all columns found in UDPipe's output: even if some are not used for a specific language, they may be used in another. In this step, language-specific adjustments may be necessary for languages that feature contractions, cliticization, or similar phenomena. For instance, UDPipe outputs the word "am" but also the separate forms "an dem" when parsing German text. We still have to decide how to cope with such cases.
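The lowercasing in step 3 can be sketched as follows. This is only a minimal illustration, assuming the text has already been sentence-split; it does not restore capitalization of proper nouns, which would require a truecasing model:

```python
def sentence_case(sentence: str) -> str:
    """Lowercase an all-caps caption sentence, keeping only the first
    character upper-case. Proper nouns are NOT restored."""
    if not sentence:
        return sentence
    lowered = sentence.lower()
    return lowered[0].upper() + lowered[1:]

# e.g. for an all-caps Brazilian Portuguese caption:
# sentence_case("ESTE PROGRAMA COMEÇA AGORA.") -> "Este programa começa agora."
```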

As you can see, most of this is relatively straightforward. The only aspect that may appear strange is the fact that we create two versions, one with and one without timestamps, and then merge them at the end. The reason for this is that UDPipe (and possibly also the pragmatic segmenter) cannot deal with XML-style annotations. Since processing in UDPipe will split up words that would be regarded as one word by other software, we cannot simply count or find the words from the original text in the processed text. This is why we keep the timestamps in as "words": then we know exactly where they end up in UDPipe's output. However, since having these unknown "words" in the text negatively affects the analyses of UDPipe (or any other NLP system), we cannot use the linguistic annotations from the output with timestamps. (They would have an unnecessarily high error rate.)
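A minimal sketch of this merge is given below. The timestamp token shape is a hypothetical placeholder (the real NewsScape timestamp format differs), and the sketch assumes a one-to-one correspondence between the two token streams; as noted above, contractions like German "am" vs. "an dem" break that assumption and need special handling:

```python
import re

# Hypothetical timestamp token shape, e.g. "|12.5|"; the real
# NewsScape timestamp format differs.
TIMESTAMP = re.compile(r"^\|\d+(\.\d+)?\|$")

def align_timestamps(tokens_with_ts, annotated_tokens):
    """Walk the tokenized with-timestamp stream and the linguistically
    annotated no-timestamp stream in parallel. Whenever a timestamp
    'word' appears, attach it to the next real token."""
    annotated = iter(annotated_tokens)
    pending = None
    merged = []
    for tok in tokens_with_ts:
        if TIMESTAMP.match(tok):
            pending = tok
        else:
            merged.append((pending, next(annotated)))
            pending = None
    return merged
```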

[End of Update October 2018]


This tutorial details how to set up the tools needed to build a multilingual text processing pipeline.

We first show how to set up singularity containers for syntaxnet and Treetagger, so that they can be run on servers even without root access.

Install singularity and debootstrap as described here.


Syntaxnet Container

Download the syntaxnet.def file and build a singularity container for syntaxnet using the following commands. Note that you will need about 20 GB of free space on your machine.

singularity create --size 20000 syntaxnet.img
sudo singularity bootstrap syntaxnet.img syntaxnet.def

This installs syntaxnet inside the container in the /opt directory and also downloads parameter files for some languages.

You can enter the container using:

sudo singularity shell -w --cleanenv syntaxnet.img    # for write access
singularity shell --cleanenv syntaxnet.img            # for read access and testing without elevated user rights

You should now be able to run the different syntaxnet models after unzipping them. To unzip the files, go to the directory where they were downloaded (/opt/models/syntaxnet/syntaxnet/models/other_language_models in our case) and run unzip <filename>.

You should tokenize your text before passing it to the parser. This separates punctuation marks from words, thereby increasing the accuracy of the parser. Syntaxnet provides tokenizers for some languages; these can be found on the website. If a tokenizer is available for your language, it can be run as follows:

cd /opt/models/syntaxnet
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY > output.txt

Here, sentences.txt is the input file, and the tokenizer's output will be in output.txt. MODEL_DIRECTORY in our case was /opt/models/syntaxnet/syntaxnet/models/other_language_models/<insert-language-name>.

An already tokenized file can be parsed as follows:

cd /opt/models/syntaxnet
cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.txt

You should now have the parsed output from syntaxnet in output.txt.
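Syntaxnet's output is CoNLL-style: one tab-separated token per line, with blank lines separating sentences. A small reader for downstream processing might look like this (the exact column layout should be verified against your actual output):

```python
def read_conll(lines):
    """Group CoNLL-style lines (one tab-separated token per line,
    blank line between sentences) into a list of sentences, each a
    list of column lists."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split("\t"))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences
```

The same reader works for any file object, since it only iterates over lines.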

Treetagger Container

Here we describe how to make a singularity container for Treetagger. Treetagger is a part-of-speech tagger and also provides lemmatization (getting root words), which syntaxnet doesn't. The process is similar to what we did with syntaxnet.

Download the treetagger.def file and build the container as follows:

singularity create --size 5000 treetagger.img
sudo singularity bootstrap treetagger.img treetagger.def

Enter the container using:

singularity shell -w --cleanenv treetagger.img

The treetagger.def file already contains scripts to download parameter files for a few languages. Parameter files for additional languages can be downloaded via the corresponding links on the website. Note that you have to run install-tagger.sh after downloading new parameter files to be able to use them.

To run the tagger, go to the directory where treetagger was installed (in our case /opt) and run:

cat input.txt | cmd/tree-tagger-<insert language name>

This will output the tagged text with corresponding lemmas.

The tree-tagger-<language> files contain commands to take the input text, tokenize it and tag it. The output sometimes contains "<unknown>" in the lemma column for words the tagger doesn't recognize. You can make it output the word itself in the lemma column instead by adding the "-no-unknown" flag to OPTIONS in the corresponding tree-tagger file in the cmd directory:

OPTIONS="-no-unknown -token -lemma -sgml -pt-with-lemma"
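If rerunning Treetagger is not an option, the same substitution can be applied to existing output after the fact. This sketch assumes the usual three-column output (word, tag, lemma, tab-separated):

```python
def fix_unknown_lemmas(tsv_lines):
    """Replace '<unknown>' lemmas in Treetagger's word<TAB>tag<TAB>lemma
    output with the surface word itself (same effect as -no-unknown)."""
    fixed = []
    for line in tsv_lines:
        parts = line.split("\t")
        if len(parts) == 3 and parts[2] == "<unknown>":
            parts[2] = parts[0]
        fixed.append("\t".join(parts))
    return fixed
```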

Note for Portuguese: While running the script on Portuguese, we noticed that both tree-tagger-portuguese and tree-tagger-portuguese-finegrained stop at the first special character, producing no further output and no error message. It turned out that the script contained a 'grep' for removing blank lines which was eliminating the text after a special character. This can be avoided by commenting out line 23 of the script:

#grep -v '^$' |

Pipelines for different languages

The aim of this pipeline is to take as input one of the files in the NewsScape dataset and output an XML-style file with sentence splits and lemma, POS, and dependency information for each word. The pipeline can be summarized in five major steps:

  • Extracting useful text from the input file - using a custom Python script
  • Sentence splitting - using Pragmatic Segmenter
  • Tokenization - using Syntaxnet for supported languages (German, Portuguese, Polish, Swedish) and Treetagger for some others (Russian)
  • POS Tagging and Dependency Parsing - using Syntaxnet
  • Lemmatizing - using CST's Lemmatizer for supported languages (German, Portuguese, Polish, Russian) and Treetagger for some others (Swedish)
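The final assembly into XML-style output can be sketched as below. The tag and attribute names here are illustrative only, not the actual NewsScape output schema:

```python
def to_xml(sentences):
    """Wrap each sentence in <s>...</s>, one <w> element per token.
    Each token is a (form, lemma, pos) tuple; tag and attribute names
    are illustrative, not the real target schema."""
    lines = []
    for sent in sentences:
        lines.append("<s>")
        for form, lemma, pos in sent:
            lines.append(f'<w lemma="{lemma}" pos="{pos}">{form}</w>')
        lines.append("</s>")
    return "\n".join(lines)
```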

The following figure shows the sequence of operations in the pipeline from input to output.