What are the available Natural Language Processing tools for Russian? What are the existing functionalities, and which projects are currently active?
- myStem is one of the most popular tools for text tokenization, lemmatization and morphological analysis.
- Freeling is an open-source language analysis tool suite with a Russian language component. The Russian dictionary contains over 1,630,000 forms corresponding to more than 510,000 lemma-PoS combinations. See the online demo at http://nlp.lsi.upc.edu/freeling/demo/demo.php
- TreeTagger, a language-independent part-of-speech tagger: a tool for annotating text with part-of-speech and lemma information. It can be used to tag Russian; see also treetagger
- RU Syntax (2015): a service that provides syntactic parsing for Russian.
- Tesseract, an open source OCR engine for multiple scripts and languages: can estimate x-height in Cyrillic text.
- LinguaGrid ... to investigate
- The Tower of Babel (2015): an etymological database project
- Google (2013): Russian Stress Prediction using Maximum Entropy Ranking
- Morphology analyser (Czech and several other languages)
- Excitement Open Platform (EOP) -- textual inference
- The CMU cross-lingual metaphor detector - a toolkit for identifying instances of figurative language in English and any other language for which a bilingual dictionary is available (Russian is available).
- RussNet project (last site update 2005): RussNet is a lexical semantic database for the Russian language
- AGFL Grammar and Lexicon for Russian (last site update?): aims to create the Rus4IR system (Russian parser for Information Retrieval), a natural language processing tool that generates parses from texts written in Russian.
- Seman (last site update 2013): Linguistic Environment
- TTC: Terminology Extraction, Translation Tools and Comparable Corpora (2013), Russian talk
- Surrey Morphology Group (2011): past projects on the Russian language from the University of Surrey
More to come ...
Russian in NewsScape
Red Hen has recently made some test recordings, which have been integrated into the NewsScape collection. As of mid-March 2016, the collection holds TVC (TV Center) news broadcasts from the second week of February 2016. We believe we now have full support for Cyrillic teletext in CCExtractor.
We can use Google Translate in the browser.
We can search the collection in Cyrillic.
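Matching Cyrillic text reliably usually calls for Unicode normalization and case folding, so that differently encoded or differently cased forms of the same word compare equal. The following is a minimal sketch of that idea using only the Python standard library. It assumes caption lines are available as plain strings; the function names and the sample lines are hypothetical and do not describe how the NewsScape search interface is implemented.

```python
import unicodedata


def normalize(text: str) -> str:
    """Compose characters (NFC) and fold case so equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text).casefold()


def find_matches(lines, query):
    """Return the lines that contain the query, ignoring case and composition form."""
    needle = normalize(query)
    return [line for line in lines if needle in normalize(line)]


# Hypothetical caption lines, for illustration only.
captions = [
    "Новости на канале ТВ Центр",
    "Прогноз погоды на завтра",
    "НОВОСТИ дня",
]

matches = find_matches(captions, "новости")
for line in matches:
    print(line)
```

The NFC step matters for letters such as "й", which can be stored either as a single code point or as "и" followed by a combining breve; without it, a query typed one way could miss text stored the other way.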
We do not currently have any NLP tools applied to the Russian language collection.