Russian NLP

What are the available Natural Language Processing tools for Russian? What are the existing functionalities, and which projects are currently active?

Active projects

  • myStem is one of the most popular tools for text tokenization, lemmatization and morphological analysis.
  • FreeLing is an open-source language analysis tool suite with a Russian language component. Its Russian dictionary contains over 1,630,000 forms corresponding to more than 510,000 lemma-PoS combinations. An online demo is available.
  • TreeTagger is a language-independent part-of-speech tagger that annotates text with part-of-speech and lemma information; it can be used to tag Russian. See also treetagger.

Inactive projects

  • RussNet project (last site update 2005): RussNet is a lexical semantic database for the Russian language
  • AGFL Grammar and Lexicon for Russian (last site update unknown): aims to create the Rus4IR system (Russian parser for Information Retrieval), a natural language processing tool that generates parses from texts written in Russian.
  • Seman (last site update 2013): Linguistic Environment
  • TTC: Terminology Extraction, Translation Tools and Comparable Corpora (2013), Russian talk 
  • Surrey Morphology Group (2011): past projects on the Russian language from the University of Surrey
More to come ...

Russian in NewsScape

Red Hen has recently made some test recordings, which have been integrated into the NewsScape collection. As of mid-March 2016, the collection holds TVC (TV Center) news broadcasts from the second week of February 2016. We believe we now have full support for Cyrillic teletext in CCExtractor:

Russian news -- Cyrillic teletext

We can use Google Translate in the browser:

We can search the collection in Cyrillic:

We do not currently have any NLP tools applied to the Russian language collection.
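
Until one of the tools listed above is applied, a rough first pass over Cyrillic caption text needs only the Python standard library. The following is a minimal sketch, not NewsScape's search implementation or the output of any listed tool: the function names (normalize, tokenize, find_lines) and the sample sentences are hypothetical, and it covers only tokenization and case-insensitive word matching, with no lemmatization or morphological analysis.

```python
import re
import unicodedata

# Runs of Russian Cyrillic letters (including Ё/ё), with optional internal hyphens.
CYRILLIC_WORD = re.compile(r"[А-Яа-яЁё]+(?:-[А-Яа-яЁё]+)*")


def normalize(text: str) -> str:
    """NFC-normalize and casefold so that composed and decomposed
    forms of the same letter (e.g. й) compare equal regardless of case."""
    return unicodedata.normalize("NFC", text).casefold()


def tokenize(text: str) -> list:
    """Return the lowercase Cyrillic word tokens found in text."""
    return CYRILLIC_WORD.findall(normalize(text))


def find_lines(lines, query: str) -> list:
    """Return the lines that contain every Cyrillic word of the query."""
    wanted = tokenize(query)
    if not wanted:
        return []
    return [line for line in lines if all(w in tokenize(line) for w in wanted)]


if __name__ == "__main__":
    # Hypothetical caption lines, invented for illustration only.
    captions = [
        "В Москве прошёл саммит.",
        "Погода в Санкт-Петербурге",
        "Hello world",
    ]
    print(tokenize(captions[0]))
    print(find_lines(captions, "МОСКВЕ"))
```

Because this matches exact word forms, a query for "Москва" would not find "Москве"; closing that gap is what a lemmatizer such as myStem is for.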