Current state of text tagging
Overview
Red Hen has developed a joint text- and image-engineering framework for parsing the semantics of its television news dataset, using Natural Language Processing (NLP) tools to annotate the caption text. These automated tools complement the online tagging interface and ELAN for manual annotations.
NLP tools in Python and Java include several types of sentiment detection, named entity recognition for persons, places, and calendar times, transcript integration, and syntactic parsers with lemmatization, all deployed on the whole collection. Conceptual frame annotation with FrameNet is in place and can be extended to Spanish FrameNet. Visual tools in C++ include multi-language on-screen text extraction (deployed) and shot boundary detection and face analysis (developed and being deployed). Joint text-image tools include story segmentation and topic clustering; both are deployed and under continuous development. The framework is designed for rapid integration of new tools.
For command-line searches of the annotated files, see Command line access.
Our open-source repository is RedHenLab (github). For working with Python, see the Python Tutorial.
Related pages
Current state of text tagging (this page)
Update the FrameNet tagger to OpenSesame
LASER: Language-Agnostic SEntence Representations (external)
spaCy open source natural language processing: official site, Wikipedia
Tagging pipeline
Incoming .txt
Commercial detection
Sentence splitting (.seg)
Story segmentation
Sentiment detection (two kinds)
Parts of Speech (two kinds for English, one each for French, German, and Spanish)
Named Entity Recognition
FrameNet Parsing
Annotation types
The Primary Tags identify the annotation type:
CC
Closed Captioning
Inside file .txt as CC1, CC2, etc.
Czech, Danish, English, German, Italian, Norwegian, Pashto, Portuguese, Spanish, Swedish
All files and incoming
SEG
automated commercial detection from caption styles (SEG_00)
automated commercial detection from context (SEG_01)
automated topic detection
All English files and incoming
NER
Named Entity Recognition (7 categories)
Inside file .seg as NER_03
Stanford NER—NER-StanfordNLP-annotate.py
All English files and incoming
POS
Parts of speech
Inside file .seg
MBSP—PartsOfSpeech-MBSP-annotate.py, with primary tag POS_01
Stanford POS—PartsOfSpeech-StanfordNLP-annotate.py, with primary tag POS_02
CLiPS pattern.de—PartsOfSpeech-pattern_de.py, German parts of speech POS_03
CLiPS pattern.fr—PartsOfSpeech-pattern_fr.py, French parts of speech POS_04
CLiPS pattern.es—PartsOfSpeech-pattern_es.py, Spanish parts of speech POS_05
All files and incoming
The parts-of-speech annotations use the Penn Treebank II tag set
SMT
Sentiment detection
Inside file .seg
Pattern.en sentiment detection (SMT_01)
SentiWordNet positivity and subjectivity (as SMT_02)
All files and incoming
Preliminary analysis by Babar Ali
OCR
Using file extension .ocr
Custom tesseract-ocr for on-screen text at one-second intervals
Performed on the Hoffman2 cluster
Danish, English, German, Italian, Norwegian, Portuguese, Spanish, and Swedish
All files and incoming
TPT
Using file extension .tpt
CNN transcripts integrated with the timestamps from the caption files
We're getting transcripts for CNN, FOX-News, and MSNBC
FRM
Using file extension .frm
Semafor (currently 3.0-alpha4) from the ARK group selects frames from FrameNet, using automatic semantic role labeling (ASRL), frame identification, and argument identification
FrameNet-06.py converts the json output to the RedHen format with primary tag FRM_01
Tagging has been completed for files from 2012 through 2014 using FrameNet 1.5
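The primary tags listed above can be collected into a lookup table and used to tally annotations in a .seg file. The following is a minimal sketch, not the project's own code. The tag names come from this page, but the line layout is an assumption: the separator `|` and the position of the primary tag (third field) are hypothetical defaults that this page does not specify, so both are exposed as parameters. The sample lines are made up for illustration.

```python
from collections import Counter

# Primary tags named on this page, with a short description of each.
PRIMARY_TAGS = {
    "SEG_00": "commercial detection from caption styles",
    "SEG_01": "commercial detection from context",
    "NER_03": "Stanford NER",
    "POS_01": "MBSP parts of speech",
    "POS_02": "Stanford parts of speech",
    "POS_03": "pattern.de German parts of speech",
    "POS_04": "pattern.fr French parts of speech",
    "POS_05": "pattern.es Spanish parts of speech",
    "SMT_01": "Pattern.en sentiment",
    "SMT_02": "SentiWordNet positivity and subjectivity",
    "FRM_01": "Semafor FrameNet frames",
}


def count_primary_tags(lines, sep="|", tag_field=2):
    """Count how often each known primary tag occurs in an iterable of lines.

    ASSUMPTION: fields are separated by `sep` and the primary tag sits at
    index `tag_field`. Adjust both to match the actual file layout.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip lines that are too short or carry a tag not listed above.
        if len(fields) > tag_field and fields[tag_field] in PRIMARY_TAGS:
            counts[fields[tag_field]] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "20150212230001.000|20150212230003.000|POS_02|Good/JJ evening/NN",
    "20150212230001.000|20150212230003.000|NER_03|LOCATION=Cleveland",
    "20150212230003.000|20150212230005.000|POS_02|We/PRP begin/VBP",
    "a line without any tag",
]
for tag, n in sorted(count_primary_tags(sample).items()):
    print(tag, n, PRIMARY_TAGS[tag])
```

Because `count_primary_tags` accepts any iterable of strings, an open file handle can be passed in place of the sample list.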
See Examples of .seg files to see how data are tagged. Each .seg file corresponds to a .txt file. The .seg file contains the tagging. All files for a given holding in the Red Hen archive have the same title and coordinated timestamps, but different extensions. E.g.,
2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.mp4
2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.txt
2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.ocr
2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.seg
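Since all files for a holding share one base name and differ only in extension, the companion files can be derived from any one of them. The sketch below assumes the input path already carries one of the extensions. The function names `sibling_files` and `existing_siblings` are hypothetical helpers, not part of any Red Hen tool. The extension list is taken from the extensions mentioned on this page.

```python
from pathlib import Path

# Extensions mentioned on this page for a single holding.
EXTENSIONS = ("mp4", "txt", "ocr", "seg", "tpt", "frm")


def sibling_files(path, extensions=EXTENSIONS):
    """Map each extension to the path of the companion file.

    ASSUMPTION: `path` ends in a file extension, which is swapped out
    for each entry in `extensions`.
    """
    base = Path(path)
    return {ext: base.with_suffix("." + ext) for ext in extensions}


def existing_siblings(path, extensions=EXTENSIONS):
    """Keep only the companion files that are actually present on disk."""
    return {
        ext: p
        for ext, p in sibling_files(path, extensions).items()
        if p.exists()
    }


files = sibling_files("2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.txt")
print(files["seg"])
```

Not every holding has every file type (for example, .tpt exists only where a transcript was obtained), which is why `existing_siblings` checks the disk instead of assuming all companions are present.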
See Command line access for details on command-line searches.
Developments
NewsReader (demo) -- IXA pipeline -- OpenNER -- qtleap
Possible collaboration
Slavic languages
NLP support in TreeTagger and pattern CLiPS
MetaNet
We are currently discussing with the MetaNet team the possibility of tagging the Red Hen archive according to MetaNet data structures.
"Chickenfeed"
Red Hen plans to develop a data format compatible with both FrameNet and MetaNet, so that Red Hen researchers can build new data structures and use them to tag the Red Hen archive. Some Red Hens are calling these potential new data structures "Chickenfeed." Chickenfeed might be thought of as a fork of both FrameNet and MetaNet in a format compatible with them, so that the open-source software already developed for tagging from FrameNet and MetaNet queries would work seamlessly on Chickenfeed. See our open-source repository at RedHenLab (github) for such software.
Resources
We are currently exploring the development of a treebank.info-like GUI front end for searching Red Hen text tagged with the Stanford full parser. Principal: Peter Uhrig.
Guides
Guide to datamining -- Resources
Mining of Massive Datasets -- book
Audio
Kaldi toolkit
Gentle aligner
HIPSTAS -- High Performance Sound Technologies for Access and Scholarship
Frameworks
NewsReader (demo) -- IXA pipeline -- OpenNER -- qtleap
TreeTagger - for tagging Polish, Russian and more -- wrapper -- treetagger_plain2naf
Morphology analyser (Czech and several other languages)
Excitement Open Platform (EOP) -- textual inference