Red Hen data format
Red Hen aims to facilitate collaborative work across different types of expertise, geographical locations, and time. To support this, we have developed a shared data format, specified below. To simplify interoperability and ensure we can scale, we currently rely on flat files rather than databases, and on metadata stored in time-stamped single lines rather than in multi-line hierarchical formats such as XML.
Related
- Overview of research (with dataset description)
- Red Hen corpus data format
- Edge Search Engine Documentation (lists searchable tags)
- How to use the Edge2 search engine
- Current state of text tagging
File names
The basic unit in the Red Hen dataset is a video file, typically a one-hour news program, though it could also be a one-minute campaign ad. A series of files is then created around this video file, for instance:
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.frm.json
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.img
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.jpg
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.json
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.mp4
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.ocr
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.seg
2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.txt
The text files here are .ocr (on-screen text from optical character recognition), .seg (metadata from NLP), and .txt (closed captioning or teletext from the television transport stream).
Main file types
- Caption text (txt) (extracted from the television transport stream)
- Online transcript (tpt) (downloaded and mechanically aligned)
- On-screen text (ocr) (created through optical character recognition)
- Annotated text (seg) (automated and manual tags)
- Thumbnails (img directory) (extracted at ten-second intervals)
- Image montages (jpg) (assembled from thumbnails)
- Video (mp4) (compressed and resized)
Text file data structure
The data in the text files is structured as follows:
- A header with file-level information
- A legend with information about the different modules that have been run on the file
- The data section
Header block
For instance, in a .seg file, the header contains these field names and values (see also the example .seg files):
TOP|20151125153900|2015-11-25_2000_PT_RTP-1_Telejornal
COL|Communication Studies Archive, UCLA
UID|697bf308-939a-11e5-92bc-4f7a6b3fa3a8
DUR|0:59:53.62
VID|576x432|720x576
SRC|IPG Portugal
CMT|
LAN|POR
TTP|885
LBT|2015-11-25 15:39:00 Europe/Lisbon
A campaign ad example:
TOP|20151121000000|2015-11-21_0000_US_CampaignAds_Americans_for_Prosperity
COL|Communication Studies Archive, UCLA
UID|591e3ed8-8ff0-11e5-9a62-003048ce8836
AQD|2015-11-21 01:36:55 UTC
DUR|0:00:30.07
VID|1280x720
TTL|We are AFP Alaska!
URL|http://youtube.com/watch?v=g-5BPZt5uPA
TTS|Youtube English machine transcript 2015-11-21 0137 UTC
CMT|
HED|
LBT|2015-11-20 19:00:00 America/New_York
The header always starts with TOP and ends with LBT (local broadcast time); it's generated on capture.
These are the fields used in the header block:
TOP -- contains the starting timestamp and the file name
COL -- contains the collection name
UID -- a unique ID for the recording
PID -- the show's episode (EP) or show (SH) ID
AQD -- the time of acquisition
DUR -- the duration of the recording in hours:minutes:seconds.hundredths of a second
VID -- the picture size of the compressed video and of the original video
TTL -- the title of the event if applicable, or of the series, if it contains non-ASCII characters
URL -- the web source if applicable
TTS -- the type of transcript if applicable
SRC -- the recording location
CMT -- a comment added by the person scheduling the recording, or to indicate its quality ("Garbled captions")
LAN -- three-letter ISO language code (see list below)
TTP -- the teletext page
HED -- the header if available, typically with summary information about the content
OBT -- the original broadcast time, when it differs from the local broadcast time
"OBT|Estimated" is used in digitized files when the precise broadcast time is unknown
LBT -- the local broadcast time, with time zone
END -- [at the end of the file] contains end timestamp and filename
Add any missing fields.
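Because every header line is a simple KEY|VALUE pair, the block can be read mechanically. Below is a minimal Python sketch, assuming the header lines sit at the top of the file and use only the three-letter codes listed above; the function name and the stopping heuristic are illustrative, not part of the Red Hen tool set:

def parse_header(path):
    """Collect the pipe-delimited header fields of a Red Hen text file into a dict.

    Header lines have the form KEY|VALUE, e.g. "DUR|0:59:53.62"; reading stops
    at the first line whose key is not a three-letter upper-case header code.
    """
    header = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("|")
            if len(key) == 3 and key.isalpha() and key.isupper():
                header[key] = value
            else:
                break  # credit block (e.g. FRM_01) or data section reached
    return header

# Hypothetical usage:
# parse_header("2015-11-25_2000_PT_RTP-1_Telejornal.seg")["DUR"]  ->  "0:59:53.62"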
The language fields currently used include the following, using ISO 639-3 (some may need to be updated, and we may want to start using combination codes for dialect identification, cf. ISO 639-1):
AFR -- Afrikaans
ARA -- Arabic
CES -- Czech
CMN -- Mandarin Chinese
DAN -- Danish
ENG -- English
FRA -- French
DEU -- German
HRV -- Croatian
HUN -- Hungarian
GLG -- Galician
ITA -- Italian
NLD -- Dutch
NOR -- Norwegian
PER -- Persian
POL -- Polish
POR -- Portuguese
PUS -- Pashto
RUS -- Russian
SME -- Sami
SPA -- Spanish
ES-MX -- Mexican Spanish
SWE -- Swedish
ZHO -- Chinese
Credit block
In the case of .seg files, the header is followed by the credit block, which also contains a codebook legend:
FRM_01|2015-10-22 06:34|Source_Program=FrameNet 1.5, Semafor 3.0-alpha4, FrameNet-06.py|Source_Person=Charles Fillmore, Dipanjan Das, FFS|Codebook=Token|Position|Frame name|Semantic Role Labeling|Token|Position|Frame element
NER_03|2015-10-22 22:55|Source_Program=stanford-ner 3.4, NER-StanfordNLP-02.py|Source_Person=Jenny Rose Finkel, FFS|Codebook=Category/Entity
POS_01|2015-10-22 06:32|Source_Program=MBSP 1.4, PartsOfSpeech-MBSP-05.py|Source_Person=Walter Daelemans, FFS|Codebook=Treebank II
POS_02|2015-10-22 22:56|Source_Program=stanford-postagger 3.4, PartsOfSpeech-StanfordNLP-02.py|Source_Person=Kristina Toutanova, FFS|Codebook=Treebank II
SEG_02|2015-10-22 01:46|Source_Program=RunTextStorySegmentation.jar|Source_Person=Rongda Zhu
SMT_01|2015-10-22 06:32|Source_Program=Pattern 2.6, Sentiment-02.py|Source_Person=Tom De Smedt, FFS|Codebook=polarity, subjectivity
SMT_02|2015-10-22 06:32|Source_Program=SentiWordNet 3.0, Sentiment-02.py|Source_Person=Andrea Esuli, FFS|Codebook=polarity, subjectivity
You see the syntax: each line starts with a primary tag, say SEG_02. The primary tags are numbered to allow different techniques to generate the same kind of data; for instance, two methods of speaker diarization might use DIA_01 and DIA_02. The next field is the date the module was run, in the form YYYY-MM-DD HH:MM. The field separator is the pipe symbol. Next are fields for Source_Program, Source_Person, and Codebook, with labels for each field in the annotation. It's critical that we have good information in the legends, so that the significance of each column in the main data section is systematically tracked.
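Legend lines can be split in the same pipe-delimited way. A minimal sketch, assuming the first two fields are always the primary tag and the run date, and that any pipe-separated continuation belongs to the Codebook (the helper name is illustrative):

def parse_credit_line(line):
    """Split a credit-block line into its tag, run date, and key=value fields."""
    fields = line.rstrip("\n").split("|")
    entry = {"tag": fields[0], "run_date": fields[1]}
    for field in fields[2:]:
        key, _, value = field.partition("=")
        if value:  # Source_Program=..., Source_Person=..., Codebook=...
            entry[key] = value
        else:      # continuation of a multi-part Codebook such as Token|Position|Frame name
            entry["Codebook"] = entry.get("Codebook", "") + "|" + field
    return entry

# Hypothetical usage:
# parse_credit_line("SMT_01|2015-10-22 06:32|Source_Program=Pattern 2.6, Sentiment-02.py|"
#                   "Source_Person=Tom De Smedt, FFS|Codebook=polarity, subjectivity")
# -> {'tag': 'SMT_01', 'run_date': '2015-10-22 06:32', ..., 'Codebook': 'polarity, subjectivity'}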
The full specification of the data structure of a particular annotation is provided in the Edge2 Search Engine Definitions. These definitions are dynamically read by the Edge2 search engine at startup.
Main body
Again in .seg files, the main body of data follows the legends, using the primary tags they define. Each line has an absolute start time and end time in Universal Time (UTC), a primary tag, and some content:
20151020220010.209|20151020220013.212|CC1|>>> "SPECIAL REPORT" IS NEXT.
20151020220010.209|20151020220013.212|POS_02|"SPECIAL/JJ|REPORT"/NN|IS/VBZ|NEXT./NNP|
20151020220010.209|20151020220013.212|FRM_01|REPORT|2-3|Statement
20151020220010.209|20151020220013.212|POS_01|"/``/I-NP/O/"|special/JJ/I-NP/O/special|report/NN/I-NP/O/report|"/NN/I-NP/O/"|is/VBZ/I-VP/O/be|next/JJ/I-ADJP/O/next|././O/O/.
20151020220010.209|20151020220013.212|SMT_01|0.178571428571|0.285714285714|special|0.357142857143|0.571428571429|next|0.0|0.0
20151020220010.209|20151020220013.212|SMT_02|SPECIAL|0.0|0.0|NEXT|0.0|0.0
20151020220013.212|20151020220017.516|CC1|>>> OUTGOING HOUSE SPEAKER JOHN BOEHNER LIVE ON THE COMPLICATED RACE TO SUCCEED HIM.
20151020220013.212|20151020220017.516|POS_02|OUTGOING/JJ|HOUSE/NNP|SPEAKER/NNP|JOHN/NNP|BOEHNER/NNP|LIVE/VB|ON/IN|THE/DT|COMPLICATED/JJ|RACE/NN|TO/TO|SUCCEED/VB|HIM./NNP|
20151020220013.212|20151020220017.516|NER_03|ORGANIZATION/HOUSE|PERSON/JOHN BOEHNER
20151020220013.212|20151020220017.516|FRM_01|OUTGOING|0-1|Sociability
20151020220013.212|20151020220017.516|FRM_01|HOUSE|1-2|Buildings|SRL|HOUSE|1-2|Building
20151020220013.212|20151020220017.516|FRM_01|RACE|9-10|Type|SRL|RACE|9-10|Subtype
20151020220013.212|20151020220017.516|FRM_01|SUCCEED|11-12|Success_or_failure|SRL|RACE TO|9-11|Agent
20151020220013.212|20151020220017.516|POS_01|outgoing/JJ/I-NP/O/outgoing|house/NN/I-NP/O/house|speaker/NN/I-NP/O/speaker|john/NN/I-NP/O/john|boehner/NN/I-NP/O/boehner|live/VB/I-VP/O/live|on/IN/I-PP/B-PNP/on|the/DT/I-NP/I-PNP/the|complicated/VBN/I-NP/I-PNP/complicate|race/NN/I-NP/I-PNP/race|to/TO/I-VP/O/to|succeed/VB/I-VP/O/succeed|him/PRP/I-NP/O/him|././O/O/.
20151020220013.212|20151020220017.516|SMT_01|-0.181818181818|0.75|live|0.136363636364|0.5|complicated|-0.5|1.0
20151020220013.212|20151020220017.516|SMT_02|OUTGOING|0.0|0.0|LIVE|0.0|0.0|ON|0.0|0.0|COMPLICATED|-0.5|0.75
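Main-body lines share a fixed prefix of start time, end time, and primary tag, so they can be parsed with a small helper. A minimal sketch, assuming the timestamps are UTC with millisecond precision as in the examples above (the function name is illustrative):

from datetime import datetime, timezone

def parse_data_line(line):
    """Split a main-body line into UTC start/end times, the primary tag, and the payload fields."""
    start, end, tag, *payload = line.rstrip("\n").split("|")
    def to_utc(stamp):
        # timestamps look like 20151020220010.209 (UTC, millisecond precision)
        return datetime.strptime(stamp, "%Y%m%d%H%M%S.%f").replace(tzinfo=timezone.utc)
    # note: some annotation lines end with a trailing pipe, leaving an empty final payload field
    return to_utc(start), to_utc(end), tag, payload

# Hypothetical usage:
# start, end, tag, payload = parse_data_line(
#     '20151020220010.209|20151020220013.212|CC1|>>> "SPECIAL REPORT" IS NEXT.')
# tag == "CC1"; (end - start).total_seconds() == 3.003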
Each primary tag must be explained in the legend section. For instance, if we add speech-to-text lines like this:
20150703230055.590|20150703230056.459|CC1|A LITTLE BIT OF RAIN ON THE
20150703230055.590|20150703230056.459|S2T_01|A little bit of rain on the
20150703230056.559|20150703230057.859|CC1|RADAR TONIGHT.
20150703230056.559|20150703230057.859|S2T_01|radar tonight.
20150703230057.959|20150703230058.698|CC1|YUST SOUTH OF DISTURB JUST --
20150703230057.959|20150703230058.698|S2T_01|Just south of us,
20150703230057.959|20150703230058.067|S2T_01|mainly south of Cleveland, into the east of 71.
20150703230059.167|20150703230100.968|CC1|THIS KEEPS DRYING UP AS IT
20150703230059.167|20150703230100.968|S2T_01|This keeps drying up as it
then we need a legend that defines the new primary tag, such as
S2T_01|2015-07-07 08:32|Source_Program=KaldiPipeline.py|Source_Person=Sai Krishna Rallabandi
In this case there's no codebook, just the text.
Primary tag inventory
These are the primary tags currently used in the NewsScape collection:
- Text tags
- CCO -- spoken language in US broadcasts -- either English or Spanish (interpolated timestamps)
- CC1 -- spoken language in US broadcasts -- either English or Spanish
- CC2 -- translated Spanish text provided by the network in US broadcasts (before 2012)
- CC3 -- translated Spanish text provided by the network in US broadcasts (after 2012)
- OCR1 -- on-screen text, obtained through optical character recognition
- TIC1 -- tickertape text, ideally captions, obtained through CCExtractor's OCR functionality
- TR0 -- English transcripts downloaded from the web (typically official transcripts from the network)
- TR1 -- English transcript, typically generated by machine transcription (speech to text) in Youtube
- TR4 -- Russian transcript, typically generated by machine transcription (speech to text) in Youtube
- ASR_01 -- Chinese transcript, generated by DeepSpeech2 (speech to text) in Red Hen's Chinese pipeline
- XDS -- metadata sent through Extended Data Services
- 888 -- teletext page in European transmissions (any three digits) (no longer used)
- To be completed
- Annotation tags
- FRM_01 -- linguistic frames, from FrameNet 1.5 via Semafor 3.0-alpha4
- GES_02 -- 160 timeline gestures tagged manually by the Spanish gesture research group
- GES_03 -- gestures tagged manually with ELAN
- NER_03 -- named entities, using the Stanford NER tagger 3.4
- POS_01 -- English parts of speech with dependencies, using MBSP 1.4
- POS_02 -- English parts of speech, using the Stanford POS tagger 3.4
- POS_03 -- German parts of speech, using the parser in Pattern.de
- POS_04 -- French parts of speech, using the parser in Pattern.fr
- POS_05 -- Spanish parts of speech, using the parser in pattern.es
- SEG -- Story boundaries by Weixin Li, UCLA
- SEG_00 -- Commercial boundaries, using caption type information from CCExtractor 0.74
- SEG_01 -- Commercial Detection by Weixin Li
- SEG_02 -- Story boundaries by Rongda Zhu, UIUC
- SMT_01 -- Sentiment detection, using Pattern 2.6
- SMT_02 -- Sentiment detection, using SentiWordNet 3.0
- DEU_01 -- German to English machine translation
- To be completed
Frame tags
The Frame tags are added with Semafor using FrameNet 1.5. The fields under the FRM_01 tag have the following structure:
- Token
- Token position
- Frame Name
- SRL (invariant label)
- Semantic Role Label
- SRL token position
- Frame Element
-- where the four-field SRL block (SRL, Semantic Role Label, SRL token position, Frame Element) may occur more than once.
As an example, consider
FRM_01|name|11-12|Being_named|SRL|of Mr. Davies|12-15|Name|SRL|a physics teacher|6-9|Entity
Here we see the structure:
- Token is "name"
- Token position is 11-12
- Frame Name is "Being_named"
- First SRL
- Semantic Role is "of Mr. Davies"
- SRL token position is 12-15
- Frame Element is "Name"
- Second SRL
- Semantic Role is "a physics teacher"
- SRL token position is 6-9
- Frame Element is "Entity"
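Since the SRL block repeats with a fixed width of four fields, it can be read off mechanically. A minimal sketch, assuming the FRM_01 payload has already been split on the pipe symbol (the helper name is illustrative):

def parse_frm(fields):
    """Parse a FRM_01 payload: token, token span, frame name, then zero or more SRL blocks."""
    result = {"token": fields[0], "span": fields[1], "frame": fields[2], "roles": []}
    rest = fields[3:]
    while rest and rest[0] == "SRL":
        # each block: the invariant label "SRL", the role text, its token span, the frame element
        result["roles"].append({"role": rest[1], "span": rest[2], "element": rest[3]})
        rest = rest[4:]
    return result

# Hypothetical usage:
# parse_frm("name|11-12|Being_named|SRL|of Mr. Davies|12-15|Name|"
#           "SRL|a physics teacher|6-9|Entity".split("|"))
# -> frame "Being_named" with two roles, Frame Elements "Name" and "Entity"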
In the Edge2 search engine, we leave out the position information and invariant labels in the search interface, which leaves these fields to search under FRM_01:
- Token
- Frame Name
- Semantic Role (one or more SRL fields)
- Frame Element (one or more FE fields)
Parts of speech tags
The Treebank II page at http://www.clips.ua.ac.be/pages/mbsp-tags lists the tags used by the MBSP (Memory-Based Shallow Parser) engine, which generates our POS_01 tags. Each word is encoded with the original form, a part of speech tag, two chunk tags with relation tags, and a lemma. So for instance the caption line "CC1|GET RID OF PREPAID PROBLEMS." is encoded like this:
POS_01|get/VB/I-VP/O/get|rid/JJ/I-ADJP/O/rid|of/IN/I-PP/B-PNP/of|prepaid/JJ/I-NP/I-PNP/prepaid|problems/NNS/I-NP/I-PNP/problem|././O/O/.
In each word field, the first element is the original token and the last is the lemma; the second element is the part-of-speech tag. The two remaining tags are so-called chunk tags that provide information about the word's syntactic relations -- its role in the sentence or phrase.
This is entirely systematic -- each word follows a pipe symbol, and there are always exactly four annotations for each word, including for the final period, which is treated as if it were its own word. The annotations are separated by forward slashes. Empty chunk values are indicated by a capital O.
To make these searchable, we might give them names as follows:
- POS_01 word
- POS_01 part of speech
- POS_01 relation 1
- POS_01 relation 2
- POS_01 lemma
This should make all of these entries searchable. The user would of course need to know the Codebook; the help screens and tutorial could refer them to the Treebank II reference page for MBSP.
POS_02 is much simpler, since it just has the first two fields.
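Given the fixed word/POS/chunk/chunk/lemma layout, a POS_01 payload can be split into the five named fields above. A minimal sketch, assuming the payload fields have already been split on the pipe symbol (the helper name is illustrative):

def parse_pos01(fields):
    """Split each POS_01 word field into word, part of speech, two chunk tags, and lemma."""
    keys = ("word", "pos", "relation_1", "relation_2", "lemma")
    return [dict(zip(keys, field.split("/"))) for field in fields if field]

# Hypothetical usage:
# parse_pos01("get/VB/I-VP/O/get|rid/JJ/I-ADJP/O/rid|of/IN/I-PP/B-PNP/of".split("|"))
# -> [{'word': 'get', 'pos': 'VB', 'relation_1': 'I-VP', 'relation_2': 'O', 'lemma': 'get'}, ...]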
Audio Pipeline Tags
Red Hen's audio pipeline automatically tags audio features in recordings. Red Hen's work on audio detection began during her Google Summer of Code 2015, when a number of open-source student coders, mentored by more senior Red Hens, tackled individual projects in audio detection. Their code was integrated into a single pipeline by Xu He in early 2016. Red Hen is grateful to the Alexander von Humboldt Foundation for the funding to employ Xu He during this period. The Audio Pipeline produces a line in the Credit Block (see above), but otherwise creates tags in the main body, as follows.
20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474
where GEN_01 is the primary tag. The other audio primary tags are:
- GEN_01
- GENR_01
- SPK_01
- SPKR_01
GEN_01 and SPK_01 are results computed within the speaker boundaries given by the speaker diarization algorithm, whereas GENR_01 and SPKR_01 are results produced for fixed 5-second segments.
The Log Likelihood value (here -21.5807774474) is the natural logarithm of the likelihood under a mixture of Gaussians; it indicates how confident the algorithm is about its recognition result: the higher the likelihood, the more confident it is.
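Since the audio tags carry simple key=value pairs, they can be turned into a dictionary directly. A minimal sketch, assuming the payload fields have already been split on the pipe symbol (the helper name is illustrative):

def parse_audio_fields(fields):
    """Turn the key=value payload of a GEN_01/GENR_01/SPK_01/SPKR_01 line into a dict."""
    return {key: value for key, _, value in (f.partition("=") for f in fields)}

# Hypothetical usage:
# parse_audio_fields(["Gender=Male", "Log Likelihood=-21.5807774474"])
# -> {'Gender': 'Male', 'Log Likelihood': '-21.5807774474'}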
End tag
Finally, the last line of the file should have this kind of information, with an end timestamp:
END|20150703232959|2015-07-03_2300_US_WKYC_Channel_3_News_at_7
The end timestamp is derived from the start time plus the video duration. This is useful for running quick checks that the entire file was processed and not truncated.
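Such a check can be automated by comparing the END timestamp against the TOP timestamp plus the DUR value from the header. A minimal sketch, assuming the header has been parsed as in the earlier sketch; the tolerance allows for the END timestamp being rounded to whole seconds (the helper name is illustrative):

from datetime import datetime, timedelta

def file_is_complete(header, end_line, tolerance_seconds=2):
    """Check that the END timestamp is roughly the TOP timestamp plus the DUR duration."""
    start = datetime.strptime(header["TOP"].split("|")[0], "%Y%m%d%H%M%S")
    end = datetime.strptime(end_line.split("|")[1], "%Y%m%d%H%M%S")
    hours, minutes, seconds = header["DUR"].split(":")  # e.g. "0:59:53.62"
    duration = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return abs((end - start) - duration) <= timedelta(seconds=tolerance_seconds)

# Hypothetical usage:
# file_is_complete(parse_header("example.seg"),
#                  "END|20150703232959|2015-07-03_2300_US_WKYC_Channel_3_News_at_7")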
We need a very high level of consistency in the output files, since they need to be reliably parsed by our statistical tools and search engines.
Not implemented
The following tags may be derived from the downloaded CNN transcripts and integrated into .tpt files:
- ANC -- ANCHOR
- BEG -- (BEGIN VIDEO CLIP)
- BET -- (BEGIN VIDEOTAPE)
- CMB -- (COMMERCIAL BREAK)
- CRT -- (CROSSTALK)
- ENC -- (END VIDEO CLIP)
- ENT -- (END VIDEOTAPE)
- LAU -- (LAUGHTER)
- NER -- Named entity
- OCM -- (on camera)
- PER -- [A-Z]{3,}: (ideally $NAME|$ROLE|$PARTY)
- TR0 -- Transcript text, default language
- TR1 -- Transcript text, second language
ELAN eaf files
In the summer of 2017, Peter Uhrig at FAU Erlangen created some 300,000 .eaf files (the file format used by ELAN) for English-language files from 2007 to 2016. These files have now been added to the Red Hen dataset. They integrate the output of the Gentle forced aligner with Sergiy Turchyn's computer-vision-based gesture detection code. They contain precise timestamps for the beginning and end of each word, in this case the word "the":
<ANNOTATION>
<ALIGNABLE_ANNOTATION ANNOTATION_ID="a974" TIME_SLOT_REF1="ts1948" TIME_SLOT_REF2="ts1949">
<ANNOTATION_VALUE>the</ANNOTATION_VALUE>
</ALIGNABLE_ANNOTATION>
</ANNOTATION>
TIME_SLOT_REF1 gives the onset time as 1948 seconds after the start of the recording, and TIME_SLOT_REF2 the end time as 1949 seconds.
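In a standard EAF file, the TIME_SLOT_REF attributes point to TIME_SLOT entries in the file's TIME_ORDER section, whose TIME_VALUE attributes give the times in milliseconds. A minimal sketch for extracting word timings with Python's standard library, under that assumption (the function name and file name are illustrative):

import xml.etree.ElementTree as ET

def word_timings(eaf_path):
    """Yield (word, onset_seconds, offset_seconds) for each alignable annotation in an .eaf file."""
    root = ET.parse(eaf_path).getroot()
    # Map time-slot ids (e.g. "ts1948") to their times in seconds.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE")) / 1000.0
             for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE") is not None}
    for ann in root.iter("ALIGNABLE_ANNOTATION"):
        word = ann.findtext("ANNOTATION_VALUE", default="")
        yield word, slots.get(ann.get("TIME_SLOT_REF1")), slots.get(ann.get("TIME_SLOT_REF2"))

# Hypothetical usage:
# for word, onset, offset in word_timings("example.eaf"):
#     print(word, onset, offset)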
Read into ELAN, these files allow researchers to take advantage of annotations that result from the automatic detection of a suite of features.
These files should be used to add further annotations to Red Hen.