Red Hen data format

Red Hen aims to facilitate collaborative work across different types of expertise, geographical locations, and time. To support this, we have developed a shared data format, specified below. To simplify interoperability and ensure we can scale, we currently rely on flat files rather than databases, and on metadata stored in time-stamped single lines rather than in multi-line hierarchical systems like XML.


File names

The basic unit in the Red Hen dataset is a video file, typically a one-hour news program, though it could also be a one-minute campaign ad. A series of files is then created around this video file.

The text files among them are .ocr (on-screen text from optical character recognition), .seg (metadata from NLP), and .txt (closed captioning or teletext from the television transport stream). 

Main file types

  • Caption text (txt) (extracted from the television transport stream)
  • Online transcript (tpt) (downloaded and mechanically aligned)
  • On-screen text (ocr) (created through optical character recognition)
  • Annotated text (seg) (automated and manual tags)
  • Thumbnails (img directory) (extracted at ten-second intervals)
  • Image montages (jpg) (assembled from thumbnails)
  • Video (mp4) (compressed and resized)

Text file data structure

The data in the text files is structured as follows:
  1. A header with file-level information
  2. A legend with information about the different modules that have been run on the file
  3. The data section

Header block

For instance, the header of a .seg file contains field names and values like these (see also the example .seg files):

COL|Communication Studies Archive, UCLA
SRC|IPG Portugal
LBT|2015-11-25 15:39:00 Europe/Lisbon

A campaign ad example:

COL|Communication Studies Archive, UCLA
AQD|2015-11-21 01:36:55 UTC
TTL|We are AFP Alaska!
TTS|Youtube English machine transcript 2015-11-21 0137 UTC
LBT|2015-11-20 19:00:00 America/New_York

The header always starts with TOP and ends with LBT (local broadcast time); it's generated on capture. 
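
As a minimal sketch of how a header block could be read, the Python function below collects `KEY|value` lines up to and including the LBT line. The name `read_header` is hypothetical, and splitting on the first pipe only is an assumption, made so that values containing further pipes stay intact. The sample lines are taken from the campaign ad example above, followed by one credit-block line to show where reading stops.

```python
def read_header(lines):
    """Collect header fields from an iterable of lines, stopping after LBT.

    Assumes each header line has the form 'KEY|value' and that the LBT
    line closes the header.
    """
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        key, _, value = line.partition("|")  # split on the first pipe only
        header[key] = value
        if key == "LBT":
            break
    return header


sample = [
    "COL|Communication Studies Archive, UCLA",
    "AQD|2015-11-21 01:36:55 UTC",
    "TTL|We are AFP Alaska!",
    "TTS|Youtube English machine transcript 2015-11-21 0137 UTC",
    "LBT|2015-11-20 19:00:00 America/New_York",
    "FRM_01|2015-10-22 06:34|Source_Program=FrameNet 1.5",
]
header = read_header(sample)
print(header["TTL"])
```

A plain dictionary keeps only the last value for a repeated field name; if a field can occur more than once in a header, a list of pairs would be the safer container.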

These are the fields used in the header block:

TOP -- contains the starting timestamp and the file name
COL -- contains the collection name
UID -- a unique ID for the collection
PID -- the show's episode (EP) or show (SH) ID
ACQ -- the time of acquisition
DUR -- the duration of the recording in hours:minutes:seconds.hundredths of a second
VID -- the picture size of the compressed video and of the original video
TTL -- the title of the event if applicable, or of the series if the series title contains non-ASCII characters
URL -- the web source if applicable
TTS -- the type of transcript if applicable
SRC -- the recording location
CMT -- a comment added by the person scheduling the recording, or to indicate its quality ("Garbled captions")
LAN -- three-letter ISO language code (see list below)
TTP -- the teletext page
HED -- the header if available, typically with summary information about the content
OBT -- the original broadcast time, when it differs from the local broadcast time
           "OBT|Estimated" is used in digitized files when the precise broadcast time is unknown
LBT -- the local broadcast time, with time zone

        END -- [at the end of the file] contains end timestamp and filename
Add any missing fields.

The language codes currently used in the LAN field include the following, using ISO 639-2/T:

        ARA -- Arabic
        CES -- Czech
        DAN -- Danish
        ENG -- English
        FRA -- French
        DEU -- German
        ITA -- Italian
        NLD -- Dutch
        NOR -- Norwegian
        PER -- Persian
        POL -- Polish
        POR -- Portuguese
        PUS -- Pashto
        RUS -- Russian
        SME -- Sami
        SPA -- Spanish
        SWE -- Swedish
        ZHO -- Chinese

Credit block

In the case of .seg files, the header is followed by the credit block, which also contains a codebook legend:

FRM_01|2015-10-22 06:34|Source_Program=FrameNet 1.5, Semafor 3.0-alpha4,|
    Source_Person=Charles Fillmore, Dipanjan Das, FFS|Codebook=Token|Position|Frame name|
    Semantic Role Labeling|Token|Position|Frame element
NER_03|2015-10-22 22:55|Source_Program=stanford-ner 3.4,|
     Source_Person=Jenny Rose Finkel, FFS|Codebook=Category/Entity
POS_01|2015-10-22 06:32|Source_Program=MBSP 1.4,|Source_Person=Walter Daelemans,
    FFS|Codebook=Treebank II
POS_02|2015-10-22 22:56|Source_Program=stanford-postagger 3.4,|Source_Person=Kristina Toutanova, FFS|Codebook=Treebank II
SEG_02|2015-10-22 01:46|Source_Program=RunTextStorySegmentation.jar|Source_Person=Rongda Zhu
SMT_01|2015-10-22 06:32|Source_Program=Pattern 2.6,|Source_Person=Tom De Smedt,
    FFS|Codebook=polarity, subjectivity
SMT_02|2015-10-22 06:32|Source_Program=SentiWordNet 3.0,|Source_Person=Andrea Esuli,
    FFS|Codebook=polarity, subjectivity

The syntax is as follows. Each line starts with a primary tag, say SEG_02. The primary tags are numbered so that different techniques can generate the same kind of data; for instance, two methods of speaker diarization might use DIA_01 and DIA_02. The next field is the date the module was run, in the form YYYY-MM-DD HH:MM. The field separator is the pipe symbol. Next come fields for Source_Program, Source_Person, and Codebook, the last giving a label for each field in the annotation. It is critical that the legends carry good information, so that the significance of each column in the main data section is systematically tracked.
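
The legend syntax described above can be sketched as a small Python parser. This is an illustration rather than Red Hen's own code: the function name `parse_legend` is hypothetical, and it assumes that each legend sits on one physical line, that the Codebook field comes last (so its labels may themselves be separated by pipes, as in the FRM_01 legend), and that the trailing commas seen after some values can be dropped.

```python
from datetime import datetime


def parse_legend(line):
    """Split one credit-block line into tag, run date, and labeled fields."""
    # Everything after '|Codebook=' is treated as pipe-separated codebook labels.
    head, found, codebook = line.strip().partition("|Codebook=")
    fields = head.split("|")
    entry = {
        "tag": fields[0],
        "run": datetime.strptime(fields[1], "%Y-%m-%d %H:%M"),
    }
    for field in fields[2:]:
        name, equals, value = field.partition("=")
        if equals:  # empty fields (as in '||') are skipped
            entry[name] = value.strip().rstrip(",")
    entry["Codebook"] = codebook.split("|") if found else []
    return entry


legend = parse_legend(
    "POS_02|2015-10-22 22:56|Source_Program=stanford-postagger 3.4,|"
    "Source_Person=Kristina Toutanova, FFS|Codebook=Treebank II"
)
print(legend["tag"], legend["Source_Program"], legend["Codebook"])
```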

Main body

Again in .seg files, the main body of data follows the legends, using the primary tags they define. Each line has an absolute start time and end time in Universal time, a primary tag, and some content:

20151020220010.209|20151020220013.212|CC1|>>> "SPECIAL REPORT" IS NEXT.
20151020220013.212|20151020220017.516|NER_03|ORGANIZATION/HOUSE|PERSON/JOHN BOEHNER
20151020220013.212|20151020220017.516|FRM_01|SUCCEED|11-12|Success_or_failure|SRL|RACE TO|9-11|Agent
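
A main-body line can be split into its parts with a few lines of Python. The sketch below is a hedged illustration: `parse_body_line` is a hypothetical name, and the timestamp layout (YYYYmmddHHMMSS followed by a fractional second) is inferred from the sample lines above.

```python
from datetime import datetime, timezone

STAMP = "%Y%m%d%H%M%S.%f"  # inferred from the samples, e.g. 20151020220013.212


def parse_body_line(line):
    """Split a data line into start time, end time, primary tag, and content fields."""
    start, end, tag, *content = line.rstrip("\n").split("|")
    return {
        # The timestamps are in Universal time, so attach the UTC time zone.
        "start": datetime.strptime(start, STAMP).replace(tzinfo=timezone.utc),
        "end": datetime.strptime(end, STAMP).replace(tzinfo=timezone.utc),
        "tag": tag,
        "content": content,
    }


row = parse_body_line(
    "20151020220013.212|20151020220017.516|NER_03|ORGANIZATION/HOUSE|PERSON/JOHN BOEHNER"
)
print(row["tag"], row["end"] - row["start"], row["content"])
```

For text tags such as CC1, the content is a single field of caption text; if a caption ever contained a pipe, the content fields would need to be joined back together.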

Each primary tag must be explained in the legend section. For instance, if we add speech-to-text lines like this:

20150703230055.590|20150703230056.459|CC1|A LITTLE BIT OF RAIN ON THE
20150703230055.590|20150703230056.459|S2T_01|A little bit of rain on the
20150703230056.559|20150703230057.859|CC1|RADAR TONIGHT.
20150703230056.559|20150703230057.859|S2T_01|radar tonight.
20150703230057.959|20150703230058.698|CC1|YUST SOUTH OF DISTURB JUST --
20150703230057.959|20150703230058.698|S2T_01|Just south of us,
20150703230057.959|20150703230058.067|S2T_01|mainly south of Cleveland, into the east of 71.
20150703230059.167|20150703230100.968|CC1|THIS KEEPS DRYING UP AS IT
20150703230059.167|20150703230100.968|S2T_01|This keeps drying up as it

then we need a legend that defines the new primary tag, such as

         S2T_01|2015-07-07 08:32||Source_Person=Sai Krishna Rallabandi

In this case there's no codebook, just the text. 
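
The rule that each primary tag needs a legend could be checked mechanically. The sketch below is one possible check, not an existing Red Hen tool. It assumes that only numbered annotation tags (three characters, an underscore, two digits) require a legend line, since text tags such as CC1 do not appear in the credit block examples; `undefined_tags` is a hypothetical name.

```python
import re

# Assumption: annotation tags look like S2T_01, NER_03, FRM_01, and so on.
NUMBERED_TAG = re.compile(r"^[A-Z0-9]{3}_\d{2}$")


def undefined_tags(legend_lines, body_lines):
    """Return numbered tags used in the body that have no legend line."""
    defined = {line.split("|", 1)[0].strip() for line in legend_lines}
    used = set()
    for line in body_lines:
        parts = line.split("|")
        # The primary tag is the third field, after the two timestamps.
        if len(parts) >= 3 and NUMBERED_TAG.match(parts[2]):
            used.add(parts[2])
    return sorted(used - defined)


legend_lines = ["S2T_01|2015-07-07 08:32||Source_Person=Sai Krishna Rallabandi"]
body_lines = [
    "20150703230055.590|20150703230056.459|CC1|A LITTLE BIT OF RAIN ON THE",
    "20150703230055.590|20150703230056.459|S2T_01|A little bit of rain on the",
    "20151020220013.212|20151020220017.516|NER_03|ORGANIZATION/HOUSE|PERSON/JOHN BOEHNER",
]
missing = undefined_tags(legend_lines, body_lines)
print(missing)
```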

Primary tag inventory

These are the primary tags currently used in the NewsScape collection:
  • Text tags
    • CCO -- spoken language in US broadcasts -- either English or Spanish (interpolated timestamps)
    • CC1 -- spoken language in US broadcasts -- either English or Spanish
    • CC2 -- translated Spanish text provided by the network in US broadcasts (before 2012)
    • CC3 -- translated Spanish text provided by the network in US broadcasts (after 2012)
    • TR0 -- transcripts downloaded from the web (typically official transcripts from the network)
    • TR1 -- transcript, typically generated by machine transcription (speech to text) in YouTube
    • 888 -- teletext page in European transmissions (any three digits) (no longer used)
    • OCR1 -- on-screen text, obtained through optical character recognition
    • XDS -- metadata sent through Extended Data Services
    • To be completed
  • Annotation tags
    • FRM_01 -- linguistic frames, from FrameNet 1.5 via Semafor 3.0-alpha4
    • GES_02 -- timeline gestures by Javier Valenzuela and Cristobal Pagan Canovas
    • NER_03 -- named entities, using the Stanford NER tagger 3.4
    • POS_01 -- English parts of speech with dependencies, using MBSP 1.4
    • POS_02 -- English parts of speech, using the Stanford POS tagger 3.4
    • POS_03 -- German parts of speech, using the parser in
    • POS_04 -- French parts of speech, using the parser in
    • POS_05 -- Spanish parts of speech, using the parser in
    • SEG -- Story boundaries by Weixin Li, UCLA
    • SEG_00 -- Commercial boundaries, using caption type information from CCExtractor 0.74
    • SEG_01 -- Commercial Detection by Weixin Li
    • SEG_02 -- Story boundaries by Rongda Zhu, UIUC
    • SMT_01 -- Sentiment detection, using Pattern 2.6
    • SMT_02 -- Sentiment detection, using SentiWordNet 3.0
    • To be completed

Frame tags

The Frame tags are added with Semafor using FrameNet 1.5. The fields under the FRM_01 tag have the following structure:
  • Token
  • Token position
  • Frame Name
    1. SRL (invariant label)
    2. Semantic Role Label
    3. SRL token position
    4. Frame Element
-- where there may be more than one instance of the 1-4 block.

As an example, consider

    FRM_01|name|11-12|Being_named|SRL|of Mr. Davies|12-15|Name|SRL|a physics teacher|6-9|Entity

Here we see the structure:
  • Token is "name"
  • Token position is 11-12
  • Frame Name is "Being_named"
  • First SRL
    • Semantic Role is "of Mr. Davies"
    • SRL token position is 12-15
    • Frame Element is "Name"
  • Second SRL
    • Semantic Role is "a physics teacher"
    • SRL token position is 6-9
    • Frame Element is "Entity"
In the Edge2 search engine, we leave out the position information and invariant labels in the search interface, which leaves these fields to search under FRM_01:
  • Token
  • Frame Name
  • Semantic Role (one or more SRL fields)
  • Frame Element (one or more FE fields)
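
The FRM_01 field structure described above can be unpacked as follows. This is a minimal Python sketch under the stated layout (three leading fields, then any number of four-field SRL blocks); the function name `parse_frame` and the dictionary keys are hypothetical.

```python
def parse_frame(fields):
    """Parse the fields that follow the FRM_01 tag on a data line."""
    token, position, frame_name, *rest = fields
    if len(rest) % 4:
        raise ValueError("SRL blocks must have exactly four fields each")
    roles = []
    # Each block is: the invariant label 'SRL', the role text,
    # the token position of that text, and the frame element.
    for i in range(0, len(rest), 4):
        label, text, span, element = rest[i:i + 4]
        if label != "SRL":
            raise ValueError(f"expected 'SRL', found {label!r}")
        roles.append({"text": text, "position": span, "frame_element": element})
    return {
        "token": token,
        "position": position,
        "frame": frame_name,
        "roles": roles,
    }


line = "FRM_01|name|11-12|Being_named|SRL|of Mr. Davies|12-15|Name|SRL|a physics teacher|6-9|Entity"
frame = parse_frame(line.split("|")[1:])
print(frame["frame"], [role["frame_element"] for role in frame["roles"]])
```

Dropping the `position` values and the `SRL` labels from this result leaves the four searchable field types listed above.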

Parts of speech tags

The Treebank II page lists the tags used by the MBSP (Memory-Based Shallow Parser) engine, which generates our POS_01 tags. Each word is encoded with the original form, a part-of-speech tag, two chunk tags with relation tags, and a lemma. So for instance the caption line "CC1|GET RID OF PREPAID PROBLEMS." is encoded like this:


The original tokens are in bold black, and the lemmas in bold blue. The bold red is the part-of-speech tag. The two remaining tags are so-called chunk tags that provide information about the word's syntactic relations -- its roles in the sentence or phrase.

This is entirely systematic -- each key word follows a pipe symbol, and there are always exactly three annotations for each word, including for the final period, which is treated as if it were its own word. Each annotation is separated by a forward slash. Empty chunk values are indicated by a capital O.

To make these searchable, we might give them names as follows:
  1. POS_01 word
  2. POS_01 part of speech
  3. POS_01 relation 1
  4. POS_01 relation 2
  5. POS_01 lemma

This should make all of these entries searchable. The user would of course need to know the Codebook; the help screens and tutorial could refer them to the Treebank II reference page for MBSP.

POS_02 is much simpler, since it just has the first two fields.

Audio Pipeline Tags

Red Hen's audio pipeline automatically tags audio features in recordings. Red Hen's work on audio detection began with her Google Summer of Code 2015, in which a number of open-source student coders, mentored by more senior Red Hens, tackled individual projects in audio detection. Their code was integrated into a single pipeline by Xu He in early 2016. Red Hen is grateful to the Alexander von Humboldt Foundation for the funding to employ Xu He during this period. The Audio Pipeline produces a line in the Credit Block (see above), but otherwise creates tags in the main body, as follows. 

20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474

Here GEN_01 is the primary tag; the pipeline writes other primary tags as well.

GEN and SPK are results with the speaker boundaries given by the Speaker Diarization algorithm, whereas GENR and SPKR are results produced for 5-second segments.

Log Likelihood=-21.5807774474

is the natural logarithm of the likelihood under a Mixture of Gaussians. It indicates how confident the algorithm is in its recognition result: the higher the likelihood, the more confident it is.
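
To make the log likelihood concrete, the Python sketch below reads the `Name=value` fields of the GEN_01 sample line and undoes the natural logarithm with `math.exp`. The helper name `parse_audio_fields` is hypothetical, and treating every field after the tag as a `Name=value` pair is an assumption based on that one sample.

```python
import math


def parse_audio_fields(fields):
    """Turn 'Name=value' fields into a dict, converting numeric values to float."""
    result = {}
    for field in fields:
        name, _, value = field.partition("=")
        try:
            result[name] = float(value)
        except ValueError:
            result[name] = value  # keep non-numeric values such as 'Male' as text
    return result


line = "20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474"
audio = parse_audio_fields(line.split("|")[3:])

# The stored value is ln(likelihood), so exp() recovers the likelihood itself.
likelihood = math.exp(audio["Log Likelihood"])
print(audio["Gender"], audio["Log Likelihood"], likelihood)
```

Because the exponential function is increasing, comparing two log likelihoods directly gives the same ordering as comparing the likelihoods, so the conversion is rarely needed when ranking results.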

End tag

Finally, the last line of the file should be an END line, which holds the end timestamp and the file name.

The end timestamp is derived from the start time plus the video duration. This is useful for running quick checks that the entire file was processed and not truncated.
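
The arithmetic behind the end timestamp can be sketched in Python as below. The values are hypothetical and chosen only for illustration; the sketch assumes the start timestamp has the form YYYYmmddHHMMSS and that the duration follows the DUR layout of hours:minutes:seconds.hundredths. The function name `expected_end` is likewise hypothetical.

```python
from datetime import datetime, timedelta

STAMP = "%Y%m%d%H%M%S"  # assumed layout of the start and end timestamps


def expected_end(start, duration):
    """Add a DUR value (hours:minutes:seconds.hundredths) to a start timestamp."""
    hours, minutes, seconds = duration.split(":")
    span = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return datetime.strptime(start, STAMP) + span


# Hypothetical start time and duration, not taken from a real file.
end = expected_end("20151020220000", "0:59:59.82")
print(end.strftime(STAMP))
```

A quick truncation check could then compare this computed value with the timestamp on the file's final line, and with the end time of the last data line.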

We need a very high level of consistency in the output files, since they need to be reliably parsed by our statistical tools and search engines.

Not implemented

Tags may be derived from the downloaded CNN transcript, integrated in .tpt files.
NER  Named entity
OCM  (on camera)
PER  [A-Z]{3,}:    (ideally $NAME|$ROLE|$PARTY)
TR0  Transcript text, default language
TR1  Transcript text, second language