Red Hen data format

Red Hen aims to facilitate collaborative work across different types of expertise, geographical locations, and time. To support this, we have developed a shared data format, specified below. To simplify interoperability and ensure we can scale, we currently rely on flat files rather than databases, and on metadata stored in time-stamped single lines rather than in multi-line hierarchical formats such as XML.

File names

The basic unit in the Red Hen dataset is a video file, typically a one-hour news program, though it could also be a one-minute campaign ad. A series of files is then created around this video file, for instance:

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.frm.json

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.img

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.jpg

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.json

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.mp4

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.ocr

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.seg

2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.txt

The text files here are .ocr (on-screen text from optical character recognition), .seg (metadata from natural language processing), and .txt (closed captioning or teletext from the television transport stream).
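The file names themselves are structured: date, start time, country code, network, and show title, joined by underscores. A minimal sketch of splitting a name into these parts (the component list is inferred from the examples above and is an assumption, not a formal specification):

import re

FILENAME = "2015-10-20_2200_US_FOX-News_Special_Report_with_Bret_Baier.seg"

m = re.match(
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{4})_"
    r"(?P<country>[A-Z]{2})_(?P<network>[^_]+)_(?P<title>.+)\.(?P<ext>\w+)$",
    FILENAME,
)
if m:
    print(m.groupdict())
    # {'date': '2015-10-20', 'time': '2200', 'country': 'US', 'network': 'FOX-News',
    #  'title': 'Special_Report_with_Bret_Baier', 'ext': 'seg'}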

Main file types

  • Caption text (txt) (extracted from the television transport stream)
  • Online transcript (tpt) (downloaded and mechanically aligned)
  • On-screen text (ocr) (created through optical character recognition)
  • Annotated text (seg) (automated and manual tags)
  • Thumbnails (img directory) (extracted at ten-second intervals)
  • Image montages (jpg) (assembled from thumbnails)
  • Video (mp4) (compressed and resized)

Text file data structure

The data in the text files is structured as follows:

  1. A header with file-level information
  2. A legend with information about the different modules that have been run on the file
  3. The data section

Header block

For instance, in a .seg file we have these field names and values in the header (see also the example .seg files):

TOP|20151125153900|2015-11-25_2000_PT_RTP-1_Telejornal

COL|Communication Studies Archive, UCLA

UID|697bf308-939a-11e5-92bc-4f7a6b3fa3a8

DUR|0:59:53.62

VID|576x432|720x576

SRC|IPG Portugal

CMT|

LAN|POR

TTP|885

LBT|2015-11-25 15:39:00 Europe/Lisbon

A campaign ad example:

TOP|20151121000000|2015-11-21_0000_US_CampaignAds_Americans_for_Prosperity

COL|Communication Studies Archive, UCLA

UID|591e3ed8-8ff0-11e5-9a62-003048ce8836

AQD|2015-11-21 01:36:55 UTC

DUR|0:00:30.07

VID|1280x720

TTL|We are AFP Alaska!

URL|http://youtube.com/watch?v=g-5BPZt5uPA

TTS|Youtube English machine transcript 2015-11-21 0137 UTC

CMT|

HED|

LBT|2015-11-20 19:00:00 America/New_York

The header always starts with TOP and ends with LBT (local broadcast time); it's generated on capture.

These are the fields used in the header block:

TOP -- contains the starting timestamp and the file name

COL -- contains the collection name

UID -- a unique ID for the collection

PID -- the show's episode (EP) or show (SH) ID

AQD -- the time of acquisition

DUR -- the duration of the recording in hours:minutes:seconds.hundredths of a second

VID -- the picture size of the compressed video and of the original video

TTL -- the title of the event if applicable, or of the series if it contains non-ASCII characters

URL -- the web source if applicable

TTS -- the type of transcript if applicable

SRC -- the recording location

CMT -- a comment added by the person scheduling the recording, or to indicate its quality ("Garbled captions")

LAN -- three-letter ISO language code (see list below)

TTP -- the teletext page

HED -- the header if available, typically with summary information about the content

OBT -- the original broadcast time, when it differs from the local broadcast time

"OBT|Estimated" is used in digitized files when the precise broadcast time is unknown

LBT -- the local broadcast time, with time zone

END -- [at the end of the file] contains end timestamp and filename

Add any missing fields.
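A header block like this can be read into a simple mapping with a few lines of code. The following is a minimal sketch; the function name and file path are illustrative, not part of the specification:

def read_header(path):
    """Read 'KEY|value' lines up to and including LBT into a dictionary."""
    header = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("|")
            header[key] = value
            if key == "LBT":  # the header always ends with LBT
                break
    return header

# hdr = read_header("2015-11-25_2000_PT_RTP-1_Telejornal.seg")
# hdr["DUR"]  ->  "0:59:53.62"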

The language fields currently used include the following, using ISO 639-3 codes (some may need to be updated, and we may want to start using combination codes for dialect identification, cf. ISO 639-1):

AFR -- Afrikaans

ARA -- Arabic

CES -- Czech

CMN -- Mandarin Chinese

DAN -- Danish

ENG -- English

FRA -- French

DEU -- German

HRV -- Croatian

HUN -- Hungarian

GLG -- Galician

ITA -- Italian

NLD -- Dutch

NOR -- Norwegian

PER -- Persian

POL -- Polish

POR -- Portuguese

PUS -- Pashto

RUS -- Russian

SME -- Sami

SPA -- Spanish

ES-MX -- Mexican Spanish

SWE -- Swedish

ZHO -- Chinese

Credit block

In the case of .seg files, the header is followed by the credit block, which also contains a codebook legend:

FRM_01|2015-10-22 06:34|Source_Program=FrameNet 1.5, Semafor 3.0-alpha4, FrameNet-06.py|Source_Person=Charles Fillmore, Dipanjan Das, FFS|Codebook=Token|Position|Frame name|Semantic Role Labeling|Token|Position|Frame element

NER_03|2015-10-22 22:55|Source_Program=stanford-ner 3.4, NER-StanfordNLP-02.py|Source_Person=Jenny Rose Finkel, FFS|Codebook=Category/Entity

POS_01|2015-10-22 06:32|Source_Program=MBSP 1.4, PartsOfSpeech-MBSP-05.py|Source_Person=Walter Daelemans, FFS|Codebook=Treebank II

POS_02|2015-10-22 22:56|Source_Program=stanford-postagger 3.4, PartsOfSpeech-StanfordNLP-02.py|Source_Person=Kristina Toutanova, FFS|Codebook=Treebank II

SEG_02|2015-10-22 01:46|Source_Program=RunTextStorySegmentation.jar|Source_Person=Rongda Zhu

SMT_01|2015-10-22 06:32|Source_Program=Pattern 2.6, Sentiment-02.py|Source_Person=Tom De Smedt, FFS|Codebook=polarity, subjectivity

SMT_02|2015-10-22 06:32|Source_Program=SentiWordNet 3.0, Sentiment-02.py|Source_Person=Andrea Esuli, FFS|Codebook=polarity, subjectivity

You see the syntax: each line starts with a primary tag, say SEG_02. The primary tags are numbered to allow different techniques to generate the same kind of data; for instance, two methods of speaker diarization might use DIA_01 and DIA_02. The next field is the date the module was run, in the form YYYY-mm-DD HH:MM. The field separator is the pipe symbol. Next come fields for Source_Program, Source_Person, and Codebook, with labels for each field in the annotation. It is critical that the legends contain good information, so that the significance of each column in the main data section is systematically tracked.
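As a minimal sketch of this syntax, a credit-block line can be split into its named parts as follows (the helper function is illustrative and not part of any Red Hen tool):

def parse_legend(line):
    fields = line.rstrip("\n").split("|")
    entry = {"tag": fields[0], "run_date": fields[1]}
    for part in fields[2:]:
        if "=" in part:
            key, value = part.split("=", 1)
            entry[key] = value
        elif "Codebook" in entry:
            # labels after Codebook= name the remaining data columns
            entry["Codebook"] += "|" + part
    return entry

example = ("NER_03|2015-10-22 22:55|Source_Program=stanford-ner 3.4, "
           "NER-StanfordNLP-02.py|Source_Person=Jenny Rose Finkel, FFS|"
           "Codebook=Category/Entity")
print(parse_legend(example))
# {'tag': 'NER_03', 'run_date': '2015-10-22 22:55', 'Source_Program': '...', ...}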

The full specification of the data structure of a particular annotation is provided in the Edge2 Search Engine Definitions. These definitions are dynamically read by the Edge2 search engine at startup.

Main body

Again in .seg files, the main body of data follows the legends, using the primary tags they define. Each line has an absolute start time and an end time in Universal Time (UTC), a primary tag, and some content:

20151020220010.209|20151020220013.212|CC1|>>> "SPECIAL REPORT" IS NEXT.

20151020220010.209|20151020220013.212|POS_02|"SPECIAL/JJ|REPORT"/NN|IS/VBZ|NEXT./NNP|

20151020220010.209|20151020220013.212|FRM_01|REPORT|2-3|Statement

20151020220010.209|20151020220013.212|POS_01|"/``/I-NP/O/"|special/JJ/I-NP/O/special|report/NN/I-NP/O/report|"/NN/I-NP/O/"|is/VBZ/I-VP/O/be|next/JJ/I-ADJP/O/next|././O/O/.

20151020220010.209|20151020220013.212|SMT_01|0.178571428571|0.285714285714|special|0.357142857143|0.571428571429|next|0.0|0.0

20151020220010.209|20151020220013.212|SMT_02|SPECIAL|0.0|0.0|NEXT|0.0|0.0

20151020220013.212|20151020220017.516|CC1|>>> OUTGOING HOUSE SPEAKER JOHN BOEHNER LIVE ON THE COMPLICATED RACE TO SUCCEED HIM.

20151020220013.212|20151020220017.516|POS_02|OUTGOING/JJ|HOUSE/NNP|SPEAKER/NNP|JOHN/NNP|BOEHNER/NNP|LIVE/VB|ON/IN|THE/DT|COMPLICATED/JJ|RACE/NN|TO/TO|SUCCEED/VB|HIM./NNP|

20151020220013.212|20151020220017.516|NER_03|ORGANIZATION/HOUSE|PERSON/JOHN BOEHNER

20151020220013.212|20151020220017.516|FRM_01|OUTGOING|0-1|Sociability

20151020220013.212|20151020220017.516|FRM_01|HOUSE|1-2|Buildings|SRL|HOUSE|1-2|Building

20151020220013.212|20151020220017.516|FRM_01|RACE|9-10|Type|SRL|RACE|9-10|Subtype

20151020220013.212|20151020220017.516|FRM_01|SUCCEED|11-12|Success_or_failure|SRL|RACE TO|9-11|Agent

20151020220013.212|20151020220017.516|POS_01|outgoing/JJ/I-NP/O/outgoing|house/NN/I-NP/O/house|speaker/NN/I-NP/O/speaker|john/NN/I-NP/O/john|boehner/NN/I-NP/O/boehner|live/VB/I-VP/O/live|on/IN/I-PP/B-PNP/on|the/DT/I-NP/I-PNP/the|complicated/VBN/I-NP/I-PNP/complicate|race/NN/I-NP/I-PNP/race|to/TO/I-VP/O/to|succeed/VB/I-VP/O/succeed|him/PRP/I-NP/O/him|././O/O/.

20151020220013.212|20151020220017.516|SMT_01|-0.181818181818|0.75|live|0.136363636364|0.5|complicated|-0.5|1.0

20151020220013.212|20151020220017.516|SMT_02|OUTGOING|0.0|0.0|LIVE|0.0|0.0|ON|0.0|0.0|COMPLICATED|-0.5|0.75
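A data line like the ones above can be unpacked with a few lines of code. The following is a sketch only, with an illustrative function name, assuming the timestamp format shown (YYYYMMDDHHMMSS followed by milliseconds):

from datetime import datetime, timezone

def parse_data_line(line):
    start, end, tag, *content = line.rstrip("\n").split("|")
    to_dt = lambda ts: datetime.strptime(ts, "%Y%m%d%H%M%S.%f").replace(tzinfo=timezone.utc)
    return to_dt(start), to_dt(end), tag, content

line = '20151020220010.209|20151020220013.212|CC1|>>> "SPECIAL REPORT" IS NEXT.'
start, end, tag, content = parse_data_line(line)
print(tag, (end - start).total_seconds())  # CC1 3.003

Annotation lines (POS_02, FRM_01, and so on) keep their further pipe-separated fields in the content list, to be interpreted according to the codebook given in the legend.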

Each primary tag must be explained in the legend section. For instance, if we add speech-to-text lines like this:

20150703230055.590|20150703230056.459|CC1|A LITTLE BIT OF RAIN ON THE

20150703230055.590|20150703230056.459|S2T_01|A little bit of rain on the

20150703230056.559|20150703230057.859|CC1|RADAR TONIGHT.

20150703230056.559|20150703230057.859|S2T_01|radar tonight.

20150703230057.959|20150703230058.698|CC1|YUST SOUTH OF DISTURB JUST --

20150703230057.959|20150703230058.698|S2T_01|Just south of us,

20150703230057.959|20150703230058.067|S2T_01|mainly south of Cleveland, into the east of 71.

20150703230059.167|20150703230100.968|CC1|THIS KEEPS DRYING UP AS IT

20150703230059.167|20150703230100.968|S2T_01|This keeps drying up as it

then we need a legend that defines the new primary tag, such as

S2T_01|2015-07-07 08:32|Source_Program=KaldiPipeline.py|Source_Person=Sai Krishna Rallabandi

In this case there's no codebook, just the text.

Primary tag inventory

These are the primary tags currently used in the NewsScape collection:

    • Text tags
      • CCO -- spoken language in US broadcasts -- either English or Spanish (interpolated timestamps)
      • CC1 -- spoken language in US broadcasts -- either English or Spanish
      • CC2 -- translated Spanish text provided by the network in US broadcasts (before 2012)
      • CC3 -- translated Spanish text provided by the network in US broadcasts (after 2012)
      • OCR1 -- on-screen text, obtained through optical character recognition
      • TIC1 -- tickertape text, ideally captions, obtained through CCExtractor's OCR functionality
      • TR0 -- English transcripts downloaded from the web (typically official transcripts from the network)
      • TR1 -- English transcript, typically generated by machine transcription (speech to text) in Youtube
      • TR4 -- Russian transcript, typically generated by machine transcription (speech to text) in Youtube
      • ASR_01 -- Chinese transcript, generated by DeepSpeech2 (speech to text) in Red Hen's Chinese pipeline
      • XDS -- metadata sent through Extended Data Services
      • 888 -- teletext page in European transmissions (any three digits) (no longer used)
      • To be completed
    • Annotation tags
      • FRM_01 -- linguistic frames, from FrameNet 1.5 via Semafor 3.0-alpha4
      • GES_02 -- 160 timeline gestures tagged manually by the Spanish gesture research group
      • GES_03 -- gestures tagged manually with ELAN
      • NER_03 -- named entities, using the Stanford NER tagger 3.4
      • POS_01 -- English parts of speech with dependencies, using MBSP 1.4
      • POS_02 -- English parts of speech, using the Stanford POS tagger 3.4
      • POS_03 -- German parts of speech, using the parser in Pattern.de
      • POS_04 -- French parts of speech, using the parser in Pattern.fr
      • POS_05 -- Spanish parts of speech, using the parser in pattern.es
      • SEG -- Story boundaries by Weixin Li, UCLA
      • SEG_00 -- Commercial boundaries, using caption type information from CCExtractor 0.74
      • SEG_01 -- Commercial Detection by Weixin Li
      • SEG_02 -- Story boundaries by Rongda Zhu, UIUC
      • SMT_01 -- Sentiment detection, using Pattern 2.6
      • SMT_02 -- Sentiment detection, using SentiWordNet 3.0
      • DEU_01 -- German to English machine translation
      • To be completed

Frame tags

The Frame tags are added with Semafor using FrameNet 1.5. The fields under the FRM_01 tag have the following structure:

    • Token
    • Token position
    • Frame Name
      1. SRL (invariant label)
      2. Semantic Role Label
      3. SRL token position
      4. Frame Element

-- where there may be more than one instance of the 1-4 block.

As an example, consider

FRM_01|name|11-12|Being_named|SRL|of Mr. Davies|12-15|Name|SRL|a physics teacher|6-9|Entity

Here we see the structure:

    • Token is "name"
    • Token position is 11-12
    • Frame Name is "Being named"
    • First SRL
      • Semantic Role is "of Mr. Davies"
      • SRL token position is 12-15
      • Frame Element is "Name"
    • Second SRL
      • Semantic Role is "a physics teacher"
      • SRL token position is 6-9
      • Frame Element is "Entity"
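A sketch of unpacking a FRM_01 annotation into its token/frame block and a list of SRL blocks, following the structure just described (the function name is illustrative):

def parse_frm(fields):
    token, position, frame = fields[0], fields[1], fields[2]
    roles = []
    rest = fields[3:]
    # each SRL block is: "SRL", role text, token position, frame element
    for i in range(0, len(rest), 4):
        if rest[i] == "SRL" and i + 3 < len(rest):
            roles.append({"role": rest[i + 1],
                          "position": rest[i + 2],
                          "frame_element": rest[i + 3]})
    return {"token": token, "position": position, "frame": frame, "srl": roles}

example = ("FRM_01|name|11-12|Being_named|SRL|of Mr. Davies|12-15|Name|"
           "SRL|a physics teacher|6-9|Entity")
print(parse_frm(example.split("|")[1:]))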

In the Edge2 search engine, we leave out the position information and the invariant labels from the search interface, which leaves these fields to search under FRM_01:

    • Token
    • Frame Name
    • Semantic Role (one or more SRL fields)
    • Frame Element (one or more FE fields)

Parts of speech tags

The Treebank II page at http://www.clips.ua.ac.be/pages/mbsp-tags lists the tags used by the MBSP (Memory-Based Shallow Parser) engine, which generates our POS_01 tags. Each word is encoded with the original form, a part of speech tag, two chunk tags with relation tags, and a lemma. So for instance the caption line "CC1|GET RID OF PREPAID PROBLEMS." is encoded like this:

POS_01|get/VB/I-VP/O/get|rid/JJ/I-ADJP/O/rid|of/IN/I-PP/B-PNP/of|prepaid/JJ/I-NP/I-PNP/prepaid|problems/NNS/I-NP/I-PNP/problem|././O/O/.

In each entry, the first field is the original token and the last field is the lemma; the second field is the part-of-speech tag. The two remaining tags are so-called chunk tags that provide information about the word's syntactic relations -- its roles in the sentence or phrase.

This is entirely systematic -- each word's entry follows a pipe symbol and consists of exactly five fields separated by forward slashes (the word plus four annotations), including for the final period, which is treated as if it were its own word. Empty chunk values are indicated by a capital O.

To make these searchable, we might give them names as follows:

    1. POS_01 word
    2. POS_01 part of speech
    3. POS_01 relation 1
    4. POS_01 relation 2
    5. POS_01 lemma

This should make all of these entries searchable. The user would of course need to know the Codebook; the help screens and tutorial could refer them to the Treebank II reference page for MBSP.

POS_02 is much simpler, since it just has the first two fields.
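A sketch of splitting a POS_01 entry into the five fields named in the list above (word, part of speech, two chunk tags, lemma); in POS_02 entries only the first two of these fields are present:

pos_01 = ("get/VB/I-VP/O/get|rid/JJ/I-ADJP/O/rid|of/IN/I-PP/B-PNP/of|"
          "prepaid/JJ/I-NP/I-PNP/prepaid|problems/NNS/I-NP/I-PNP/problem|"
          "././O/O/.")

for entry in pos_01.split("|"):
    word, pos, chunk1, chunk2, lemma = entry.split("/")[:5]
    print(word, pos, chunk1, chunk2, lemma)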

Audio Pipeline Tags

Red Hen's audio pipeline automatically tags audio features in recordings. Red Hen's work on audio detection began during Google Summer of Code 2015, when a number of open-source student coders, mentored by more senior Red Hens, tackled individual projects in audio detection. Their code was integrated into a single pipeline by Xu He in early 2016. Red Hen is grateful to the Alexander von Humboldt Foundation for the funding to employ Xu He during this period. The audio pipeline produces a line in the credit block (see above), but otherwise creates tags in the main body, as follows:

20150807005009.500|20150807005012.000|GEN_01|Gender=Male|Log Likelihood=-21.5807774474

where GEN_01 is the primary tag. The other primary tags are:

GEN_01

GENR_01

SPK_01

SPKR_01

GEN and SPK results use the speaker boundaries given by the speaker diarization algorithm, whereas GENR and SPKR results are produced for fixed 5-second segments.

The Log Likelihood value (here -21.5807774474) is the natural logarithm of the likelihood under a mixture of Gaussians; it indicates how confident the algorithm is about its recognition result: the higher the likelihood, the more confident it is.
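A sketch of reading such a line into a small record, reusing the start time|end time|primary tag layout of the main body (the key/value handling is an assumption based on the example above):

line = ("20150807005009.500|20150807005012.000|GEN_01|"
        "Gender=Male|Log Likelihood=-21.5807774474")

start, end, tag, *annotations = line.split("|")
record = {"start": start, "end": end, "tag": tag}
for item in annotations:
    key, value = item.split("=", 1)
    record[key] = value

print(record["Gender"], float(record["Log Likelihood"]))  # Male -21.5807774474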

End tag

Finally, the last line of the file should have this kind of information, with an end timestamp:

END|20150703232959|2015-07-03_2300_US_WKYC_Channel_3_News_at_7

The end timestamp is derived from the start time plus the video duration. This is useful for running quick checks that the entire file was processed and not truncated.
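A sketch of that check, using hypothetical TOP and DUR values consistent with the END line shown above:

from datetime import datetime, timedelta

top = "20150703230000"   # from the TOP line (hypothetical value)
dur = "0:29:59.62"       # from the DUR line (hypothetical value)

h, m, s = dur.split(":")
expected_end = datetime.strptime(top, "%Y%m%d%H%M%S") + timedelta(
    hours=int(h), minutes=int(m), seconds=float(s))
print(expected_end.strftime("%Y%m%d%H%M%S"))  # 20150703232959, to compare with the END line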

We need a very high level of consistency in the output files, since they need to be reliably parsed by our statistical tools and search engines.

Not implemented

Tags may be derived from the downloaded CNN transcripts and integrated into .tpt files:

    • ANC -- ANCHOR
    • BEG -- (BEGIN VIDEO CLIP)
    • BET -- (BEGIN VIDEOTAPE)
    • CMB -- (COMMERCIAL BREAK)
    • CRT -- (CROSSTALK)
    • ENC -- (END VIDEO CLIP)
    • ENT -- (END VIDEOTAPE)
    • LAU -- (LAUGHTER)
    • NER -- Named entity
    • OCM -- (on camera)
    • PER -- [A-Z]{3,}: (ideally $NAME|$ROLE|$PARTY)
    • TR0 -- Transcript text, default language
    • TR1 -- Transcript text, second language

ELAN eaf files

In the summer of 2017, Peter Uhrig at FAU Erlangen created some 300,000 .eaf files (the file format used by ELAN) for English-language files from 2007 to 2016. These files have now been added to the Red Hen dataset. They integrate the output of the Gentle forced aligner with Sergiy Turchyn's computer-vision-based gesture detection code. They contain precise timestamps for the beginning and end of each word, in this case the word "the":

<ANNOTATION>

<ALIGNABLE_ANNOTATION ANNOTATION_ID="a974" TIME_SLOT_REF1="ts1948" TIME_SLOT_REF2="ts1949">

<ANNOTATION_VALUE>the</ANNOTATION_VALUE>

</ALIGNABLE_ANNOTATION>

</ANNOTATION>

TIME_SLOT_REF1 and TIME_SLOT_REF2 refer to time slots defined in the file's TIME_ORDER section, whose TIME_VALUE attributes give the onset and offset of the word in milliseconds from the start of the recording.
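A sketch of extracting word timings from an .eaf file with the Python standard library, assuming the usual EAF layout in which TIME_SLOT elements in the TIME_ORDER block carry TIME_VALUE attributes in milliseconds:

import xml.etree.ElementTree as ET

def word_timings(eaf_path):
    root = ET.parse(eaf_path).getroot()
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
             for ts in root.iter("TIME_SLOT") if ts.get("TIME_VALUE")}
    for ann in root.iter("ALIGNABLE_ANNOTATION"):
        word = ann.findtext("ANNOTATION_VALUE")
        start = slots.get(ann.get("TIME_SLOT_REF1"))
        end = slots.get(ann.get("TIME_SLOT_REF2"))
        yield word, start, end  # times in milliseconds

# for word, start, end in word_timings("example.eaf"):
#     print(word, start / 1000.0, end / 1000.0)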

Read into ELAN, these files allow researchers to take advantage of annotations that result from the automatic detection of a suite of features.

These files should be used for contributing additional annotations to Red Hen.