Current state of text tagging

Overview

Red Hen has developed a joint text- and image-engineering framework for parsing the semantics of its television news dataset, using Natural Language Processing (NLP) tools to annotate the caption text. These automated tools complement the online tagging interface and ELAN for manual annotations.

NLP tools in Python and Java include several types of sentiment detection, named entity recognition for persons, places, and calendar times, transcript integration, and syntactic parsers with lemmatization, all deployed on the whole collection. Conceptual frame annotation with FrameNet is in place and can be extended to Spanish FrameNet. Visual tools in C++ include multi-language on-screen text extraction (deployed) as well as shot boundary detection and face analysis (developed and being deployed). Joint text-image tools include story segmentation and topic clustering; both are deployed and under continuous development. The framework is designed for rapid integration of new tools.

For command-line searches of the annotated files, see Command line access.

Our open-source repository is RedHenLab (github). For working with Python, see the Python Tutorial.

- Current state of text tagging (this page)
- Multilingual Corpus Pipeline
- Audio processing pipeline
- Machine learning
- Red Hen data format
- Video processing pipelines
- Update the FrameNet tagger to OpenSesame
- LASER. Language-Agnostic SEntence Representations. (external)
- spaCy open source natural language processing: official site, Wikipedia

Tagging pipeline

1. Incoming .txt
2. Commercial detection
3. Sentence splitting (.seg)
4. Story segmentation
5. Sentiment detection (two kinds)
6. Parts of speech (two kinds for English; one each for French, German, and Spanish)
7. Named Entity Recognition
8. FrameNet parsing
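
The ordering of the steps above can be sketched as a simple chain of stages. This is only an illustration of the sequencing, not Red Hen's actual driver code: the stage names come from the numbered list, while `make_stage`, `run_pipeline`, and the dictionary layout are hypothetical placeholders.

```python
from typing import Callable, Dict, List

# Stage names taken from the numbered list above, in order.
# Step 1 (the incoming .txt file) is the input rather than a processing stage.
PIPELINE_STAGES: List[str] = [
    "commercial detection",
    "sentence splitting",
    "story segmentation",
    "sentiment detection",
    "parts of speech",
    "named entity recognition",
    "FrameNet parsing",
]

Stage = Callable[[Dict], Dict]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage (hypothetical; a real stage would add annotations)."""
    def stage(doc: Dict) -> Dict:
        # Record that the stage ran, leaving the caption text untouched.
        return {**doc, "history": doc["history"] + [name]}
    return stage


def run_pipeline(caption_text: str) -> Dict:
    """Pass the caption text through every stage in the listed order."""
    doc: Dict = {"text": caption_text, "history": []}
    for name in PIPELINE_STAGES:
        doc = make_stage(name)(doc)
    return doc


if __name__ == "__main__":
    result = run_pipeline("Example caption text.")
    print(result["history"])
```

Each placeholder could be swapped for a call to the corresponding annotation script; the point of the sketch is only that later stages see the output of earlier ones.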

Annotation types

The Primary Tags identify the annotation type:

CC
- Closed Captioning
- Inside file .txt as CC1, CC2, etc.
- Czech, Danish, English, German, Italian, Norwegian, Pashto, Portuguese, Spanish, Swedish
- All files and incoming
SEG
- automated commercial detection from caption styles (SEG_00)
- automated commercial detection from context (SEG_01)
- automated topic detection
- All English files and incoming
NER
- Named Entity Recognition (7 categories)
- Inside file .seg as NER_03
- Stanford NER (NER-StanfordNLP-annotate.py)
- All English files and incoming
POS
- Parts of speech
- Inside file .seg
- MBSP (PartsOfSpeech-MBSP-annotate.py), with primary tag POS_01
- Stanford POS (PartsOfSpeech-StanfordNLP-annotate.py), with primary tag POS_02
- CLiPS pattern.de (PartsOfSpeech-pattern_de.py), German parts of speech, with primary tag POS_03
- CLiPS pattern.fr (PartsOfSpeech-pattern_fr.py), French parts of speech, with primary tag POS_04
- CLiPS pattern.es (PartsOfSpeech-pattern_es.py), Spanish parts of speech, with primary tag POS_05
- All files and incoming
- The parts-of-speech annotations use the Penn Treebank II tag set
SMT
- Sentiment detection
- Inside file .seg
- Pattern.en sentiment detection (SMT_01)
- SentiWordNet positivity and subjectivity (SMT_02)
- All files and incoming
- Preliminary analysis by Babar Ali
OCR
- Using file extension .ocr
- Custom tesseract-ocr for on-screen text at one-second intervals
- Performed on the Hoffman2 cluster
- Danish, English, German, Italian, Norwegian, Portuguese, Spanish, and Swedish
- All files and incoming
TPT
- Using file extension .tpt
- CNN transcripts integrated with the timestamps from the caption files
- We're getting transcripts for CNN, FOX-News, and MSNBC
FRM
- Using file extension .frm
- Semafor (currently 3.0-alpha4) from the ARK group selects frames from FrameNet, using automatic semantic role labeling (ASRL), frame identification, and argument identification
- FrameNet-06.py converts the json output to the RedHen format with primary tag FRM_01
- Tagging has been completed for files from 2012 to 2014, using FrameNet 1.5
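
To illustrate how the primary tags listed above might be handled in a script, here is a minimal sketch that splits a tag such as `POS_02` or `CC1` into its annotation type and variant number. The tag names are taken from this page; the `ANNOTATION_TYPES` table, the `PrimaryTag` type, and `parse_primary_tag` are hypothetical helpers written for this example and are not part of the Red Hen codebase.

```python
import re
from typing import NamedTuple, Optional

# Annotation types as described on this page (descriptions paraphrased).
ANNOTATION_TYPES = {
    "CC": "closed captioning",
    "SEG": "commercial and topic detection",
    "NER": "named entity recognition",
    "POS": "parts of speech",
    "SMT": "sentiment detection",
    "OCR": "on-screen text",
    "TPT": "transcript integration",
    "FRM": "FrameNet frames",
}


class PrimaryTag(NamedTuple):
    kind: str               # e.g. "POS"
    variant: Optional[str]  # e.g. "02", or None when no number is given


# Upper-case letters, then an optional number with an optional underscore,
# which covers both "CC1" and "NER_03".
_TAG_RE = re.compile(r"^([A-Z]+)(?:_?(\d+))?$")


def parse_primary_tag(tag: str) -> PrimaryTag:
    """Split a primary tag into its type and variant (hypothetical helper)."""
    match = _TAG_RE.match(tag)
    if match is None or match.group(1) not in ANNOTATION_TYPES:
        raise ValueError(f"unrecognized primary tag: {tag!r}")
    return PrimaryTag(match.group(1), match.group(2))


if __name__ == "__main__":
    for tag in ("CC1", "SEG_00", "NER_03", "POS_02", "SMT_01", "FRM_01"):
        parsed = parse_primary_tag(tag)
        print(tag, "->", ANNOTATION_TYPES[parsed.kind], parsed.variant)
```

Keeping the variant as a string preserves the leading zero, so `POS_02` and `POS_2` are not silently treated as the same tag.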
See Examples of .seg files for how the data are tagged. Each .seg file corresponds to a .txt file and contains the tagging. All files for a given holding in the Red Hen archive have the same title and coordinated timestamps, but different extensions. For example:
- 2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.mp4
- 2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.txt
- 2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.ocr
- 2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.seg
See Command line access for details on command-line search.
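
Because all files for a holding share one base name, the companion files can be derived from any one of them. The sketch below shows one way to do this with `pathlib`. The field names in `Holding` (date, time, country, network, title) are an assumption read off the example file names above, not a documented specification, and `parse_holding` and `sibling_files` are hypothetical helpers.

```python
from pathlib import Path
from typing import Dict, NamedTuple

# Extensions shown in the example list above; .tpt and .frm are also
# mentioned on this page and could be added in the same way.
EXTENSIONS = (".mp4", ".txt", ".ocr", ".seg")


class Holding(NamedTuple):
    # Field meanings are assumed from the example file names.
    date: str
    time: str
    country: str
    network: str
    title: str


def parse_holding(filename: str) -> Holding:
    """Split a base name into its underscore-separated leading fields."""
    stem = Path(filename).stem
    # Split on the first four underscores only, so the title keeps its own.
    parts = stem.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unexpected file name layout: {filename!r}")
    return Holding(*parts)


def sibling_files(filename: str) -> Dict[str, Path]:
    """Map each known extension to the companion file with the same base name."""
    path = Path(filename)
    return {ext: path.with_suffix(ext) for ext in EXTENSIONS}


if __name__ == "__main__":
    name = "2015-02-12_2300_US_WEWS_NewsChannel_5_at_6pm.seg"
    print(parse_holding(name))
    for ext, sibling in sibling_files(name).items():
        print(ext, sibling.name)
```

The helpers only manipulate names; whether a given companion file actually exists for a holding would still need to be checked on disk.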

Developments

- NewsReader (demo) -- IXA pipeline -- OpenNER -- qtleap
- Possible collaboration

Slavic languages
- Russian NLP
- NLP support in TreeTagger and pattern CLiPS
- Universal Dependencies project
MetaNet
- We are discussing with the MetaNet team the opportunities for tagging the Red Hen archive according to MetaNet data structures
"Chickenfeed"
- Red Hen plans to develop a data format compatible with both FrameNet and MetaNet, so that Red Hen researchers can build new data structures and use them to tag the Red Hen archive. Some Red Hens are calling these potential new data structures "Chickenfeed." Chickenfeed might be thought of as a fork of both FrameNet and MetaNet in a format compatible with them, so that the open-source software already developed for tagging with FrameNet and MetaNet would work seamlessly on Chickenfeed. See our open-source repository at RedHenLab (github) for such software.

Resources