Tagging Spanish text
Can we build an archive of Spanish literary texts and tag it with Stanford CoreNLP, so that the tagged archive can then be processed by Red Hen utilities?
See http://www.foldl.me/2014/spanish-summarizer-corenlp/.
For full information, see http://nlp.stanford.edu/software/corenlp.shtml.
Would you like to accomplish all or part of this task?
If so, write to
and we will try to connect you with a mentor.
Some additional information:
Download CoreNLP and the Spanish models for it from here:
http://nlp.stanford.edu/software/corenlp.shtml
Then run it with the following command line:
java -cp stanford-corenlp-3.5.2.jar;stanford-spanish-corenlp-2015-01-08-models.jar -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file hola.txt
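Note that the `;` between the two jar files is the Windows classpath separator; on Linux or macOS it would be `:`. As a minimal sketch, the same command could be assembled in a platform-independent way from Python. The helper name `corenlp_command` and its parameters are hypothetical, chosen only for illustration; the jar names, main class, and flags are the ones from the command above.

```python
import os


def corenlp_command(jars, input_file, annotators="tokenize,ssplit", heap="2g"):
    """Build the argument list for the CoreNLP command line shown above.

    `jars` is a list of jar file paths; they are joined with the
    platform's classpath separator (';' on Windows, ':' elsewhere).
    """
    classpath = os.pathsep.join(jars)
    return [
        "java",
        "-cp", classpath,
        "-Xmx" + heap,
        "edu.stanford.nlp.pipeline.StanfordCoreNLP",
        "-annotators", annotators,
        "-file", input_file,
    ]


if __name__ == "__main__":
    cmd = corenlp_command(
        [
            "stanford-corenlp-3.5.2.jar",
            "stanford-spanish-corenlp-2015-01-08-models.jar",
        ],
        "hola.txt",
    )
    # Print the command; to actually run it, pass `cmd` to subprocess.run
    # once Java and both jar files are available in the working directory.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if a file name contains spaces.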
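This is the output (note that, since no tokenizer type was provided, CoreNLP falls back to PTBTokenizer, its English tokenizer):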
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Ready to process: 1 files, skipped 0, total 1
Processing file D:\stanford-corenlp-full-2015-04-20\hola.txt ... writing to D:\stanford-corenlp-full-2015-04-20\hola.txt.out {
Annotating file D:\stanford-corenlp-full-2015-04-20\hola.txt
} [0.170 seconds]
Processed 1 documents
Skipped 0 documents, error annotating 0 documents
Annotation pipeline timing information:
TokenizerAnnotator: 0,0 sec.
WordsToSentencesAnnotator: 0,0 sec.
TOTAL: 0,0 sec. for 5 tokens at 147,1 tokens/sec.
Pipeline setup: 0,0 sec.
Total time for StanfordCoreNLP pipeline: 0,2 sec.
D:\stanford-corenlp-full-2015-04-20>cat hola.txt.out
Sentence #1 (2 tokens):
Hola.
[Text=Hola CharacterOffsetBegin=0 CharacterOffsetEnd=4]
[Text=. CharacterOffsetBegin=4 CharacterOffsetEnd=5]
Sentence #2 (3 tokens):
Que tal.
[Text=Que CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=tal CharacterOffsetBegin=10 CharacterOffsetEnd=13]
[Text=. CharacterOffsetBegin=13 CharacterOffsetEnd=14]
D:\stanford-corenlp-full-2015-04-20>
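For other utilities to use this output, it first has to be read back into a structured form. The following is a rough sketch of how the text output above might be parsed in Python. It assumes only the format shown here (the `tokenize,ssplit` annotators, with `Text`, `CharacterOffsetBegin`, and `CharacterOffsetEnd` as the only token fields); the function name `parse_corenlp_text` and the dictionary layout are hypothetical, and output produced with more annotators would need a more permissive pattern.

```python
import re

# Patterns for the two kinds of lines we care about in the .out file.
SENTENCE_RE = re.compile(r"^Sentence #(\d+) \((\d+) tokens\):$")
TOKEN_RE = re.compile(
    r"^\[Text=(.*) CharacterOffsetBegin=(\d+) CharacterOffsetEnd=(\d+)\]$"
)


def parse_corenlp_text(output):
    """Parse CoreNLP text output into a list of sentence dictionaries.

    Each sentence has its number, the token count declared in its header,
    and a list of (text, begin, end) tuples. Lines that match neither
    pattern (such as the raw sentence text) are ignored.
    """
    sentences = []
    for line in output.splitlines():
        line = line.strip()
        header = SENTENCE_RE.match(line)
        if header:
            sentences.append({
                "index": int(header.group(1)),
                "declared": int(header.group(2)),
                "tokens": [],
            })
            continue
        token = TOKEN_RE.match(line)
        if token and sentences:
            sentences[-1]["tokens"].append(
                (token.group(1), int(token.group(2)), int(token.group(3)))
            )
    return sentences


# The contents of hola.txt.out as shown above.
SAMPLE = """Sentence #1 (2 tokens):
Hola.
[Text=Hola CharacterOffsetBegin=0 CharacterOffsetEnd=4]
[Text=. CharacterOffsetBegin=4 CharacterOffsetEnd=5]
Sentence #2 (3 tokens):
Que tal.
[Text=Que CharacterOffsetBegin=6 CharacterOffsetEnd=9]
[Text=tal CharacterOffsetBegin=10 CharacterOffsetEnd=13]
[Text=. CharacterOffsetBegin=13 CharacterOffsetEnd=14]
"""

if __name__ == "__main__":
    for sentence in parse_corenlp_text(SAMPLE):
        words = [text for text, begin, end in sentence["tokens"]]
        print(sentence["index"], words)
```

In the sample, each end offset minus its begin offset equals the token's length, so the offsets can be used as slice boundaries into the original text to recover each token.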