Tagging Spanish text

Can we build an archive of Spanish literary texts and tag it with Stanford CoreNLP, so that Red Hen utilities can then process it?

See http://www.foldl.me/2014/spanish-summarizer-corenlp/.

For full information, see http://nlp.stanford.edu/software/corenlp.shtml.

Would you like to accomplish all or part of this task?

If so, write to

and we will try to connect you with a mentor.

Some additional information:

Download CoreNLP and the Spanish models for it from here:


Then run it with the following command line (shown for Windows; on Linux or macOS, replace the ";" between the two jar names with ":"):

java -cp stanford-corenlp-3.5.2.jar;stanford-spanish-corenlp-2015-01-08-models.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file hola.txt
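To tag a whole archive rather than a single file, the same command can be assembled in a script. The following is a minimal sketch in Python, not part of any Red Hen tool: the function name build_corenlp_command and its parameters are hypothetical, and it only uses the jar names, main class and flags that appear in the command line above. It picks the classpath separator for the current platform, so the same script should work on Windows and on Linux or macOS.

```python
import os

# Main class taken from the command line above.
CORENLP_MAIN = "edu.stanford.nlp.pipeline.StanfordCoreNLP"


def build_corenlp_command(jars, input_file,
                          annotators=("tokenize", "ssplit"),
                          heap="2g", sep=os.pathsep):
    """Return the java argument list for one CoreNLP run.

    jars       -- jar files to put on the classpath
    input_file -- text file to annotate (passed to -file)
    annotators -- annotator names, joined with commas for -annotators
    heap       -- value for the JVM -Xmx option
    sep        -- classpath separator (";" on Windows, ":" elsewhere)
    """
    return [
        "java",
        "-cp", sep.join(jars),
        "-Xmx" + heap,
        CORENLP_MAIN,
        "-annotators", ",".join(annotators),
        "-file", input_file,
    ]


# Example: rebuild the command shown above for the current platform.
cmd = build_corenlp_command(
    ["stanford-corenlp-3.5.2.jar",
     "stanford-spanish-corenlp-2015-01-08-models.jar"],
    "hola.txt",
)
print(" ".join(cmd))
# To actually launch Java, pass the list to subprocess.run(cmd, check=True).
```

Returning a list instead of a single string lets the arguments go to subprocess.run without any shell quoting. The listing that follows is the output of the original command line as typed above, not of this script.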

This is the output:

Adding annotator tokenize

TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.

Adding annotator ssplit

Ready to process: 1 files, skipped 0, total 1

Processing file D:\stanford-corenlp-full-2015-04-20\hola.txt ... writing to D:\stanford-corenlp-full-2015-04-20\hola.txt.out {

Annotating file D:\stanford-corenlp-full-2015-04-20\hola.txt

} [0.170 seconds]

Processed 1 documents

Skipped 0 documents, error annotating 0 documents

Annotation pipeline timing information:

TokenizerAnnotator: 0,0 sec.

WordsToSentencesAnnotator: 0,0 sec.

TOTAL: 0,0 sec. for 5 tokens at 147,1 tokens/sec.

Pipeline setup: 0,0 sec.

Total time for StanfordCoreNLP pipeline: 0,2 sec.

D:\stanford-corenlp-full-2015-04-20>cat hola.txt.out

Sentence #1 (2 tokens):

Hola.

[Text=Hola CharacterOffsetBegin=0 CharacterOffsetEnd=4]

[Text=. CharacterOffsetBegin=4 CharacterOffsetEnd=5]

Sentence #2 (3 tokens):

Que tal.

[Text=Que CharacterOffsetBegin=6 CharacterOffsetEnd=9]

[Text=tal CharacterOffsetBegin=10 CharacterOffsetEnd=13]

[Text=. CharacterOffsetBegin=13 CharacterOffsetEnd=14]
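For other utilities to work with this output, the text listing has to be turned into structured records. The following is a minimal sketch in Python of one way to do that, assuming the output looks exactly like the listing above (a "Sentence #N (M tokens):" header, an optional line with the sentence text, then one bracketed line per token). The function name parse_corenlp_text_output and the record layout are hypothetical. Only the tokenize and ssplit output shown here has been considered; the pattern tolerates extra fields after CharacterOffsetEnd but does not extract them.

```python
import re

# "Sentence #1 (2 tokens):"
SENTENCE_RE = re.compile(r"^Sentence #(\d+) \((\d+) tokens\):$")
# "[Text=Hola CharacterOffsetBegin=0 CharacterOffsetEnd=4]"
# Any further fields before the closing bracket are skipped.
TOKEN_RE = re.compile(
    r"^\[Text=(.*?) CharacterOffsetBegin=(\d+) CharacterOffsetEnd=(\d+)(?: .*)?\]$"
)


def parse_corenlp_text_output(text):
    """Parse a CoreNLP text listing into a list of sentence records.

    Each record is a dict with the keys "index" (int), "text" (str,
    possibly empty) and "tokens" (list of (word, begin, end) tuples).
    """
    sentences = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        header = SENTENCE_RE.match(line)
        if header:
            sentences.append(
                {"index": int(header.group(1)), "text": "", "tokens": []}
            )
            continue
        if not sentences:
            continue  # ignore anything before the first sentence header
        current = sentences[-1]
        token = TOKEN_RE.match(line)
        if token:
            current["tokens"].append(
                (token.group(1), int(token.group(2)), int(token.group(3)))
            )
        elif not current["tokens"]:
            # A line between the header and the first token is sentence text.
            current["text"] = (current["text"] + " " + line).strip()
    return sentences


# Sample modeled on the listing above.
SAMPLE = """\
Sentence #1 (2 tokens):
Hola.
[Text=Hola CharacterOffsetBegin=0 CharacterOffsetEnd=4]
[Text=. CharacterOffsetBegin=4 CharacterOffsetEnd=5]
Sentence #2 (3 tokens):
Que tal.
[Text=Que CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=tal CharacterOffsetBegin=10 CharacterOffsetEnd=13]
[Text=. CharacterOffsetBegin=13 CharacterOffsetEnd=14]
"""

for sentence in parse_corenlp_text_output(SAMPLE):
    print(sentence["index"], sentence["text"], sentence["tokens"])
```

The character offsets are kept as integers so that each token can be mapped back to its position in the original file. A sentence whose text line is missing still parses; its "text" field is simply left empty.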