Red Hen corpus data format

What is this?

The Red Hen data format has certain drawbacks when used as a corpus; the Red Hen corpus data format is meant to remedy these. It follows a relatively standard and easy-to-process pattern called "vertical format" that is used as the input format by many corpus managers, e.g. CQP (the backend of CQPweb) or Manatee (the backend of the SketchEngine and NoSketchEngine).

Basic Concepts

There are two levels of representation in the Red Hen corpus data format:
  1. Token-level annotation, i.e. every word/punctuation mark has annotation, e.g. Part-of-Speech, lemma, etc. These are called p[ositional]-attributes in CQP.
  2. Annotation potentially spanning multiple words or not directly related to individual words, such as texts, sentences, pauses, gestures, etc. These are called s[tructural]-attributes in CQP.
Thus, in the following example, the lines containing the actual text have the p-attribute word in the first column, the p-attribute Part-of-Speech (in the Lancaster CLAWS5 tagset) in the second and the p-attribute lemma (i.e. the base form of the word) in the third column. The columns are not labeled in the file itself. The s-attributes in this snippet are text and s (s-unit, "a sentence-like division of a text", TEI). They can have multiple attributes (things like id, title, author, publisher) themselves.
<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s id="1">
The    AT0    the
cat    NN1    cat
sat    VVD    sit
on     PRP    on
the    AT0    the
mat    NN1    mat
.      PUN    .
</s>
<s id="2">
Its    DPS    its
name   NN1    name
is     VBZ    be
Pi     NP0    pi
.      PUN    .
</s>
</text>
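
Files in this format can be processed line by line: lines starting with "<" carry s-attributes, all other non-empty lines are token lines whose columns hold the p-attributes. The following is a minimal sketch in Python, assuming the three columns from the example above and tab-separated columns (the usual convention in vertical files); the function name is ours, not part of any corpus tool:

def read_vertical(lines):
    """Yield ('tag', line) for s-attribute lines and
    ('token', (word, pos, lemma)) for token lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            yield ("tag", line)
        else:
            # p-attribute columns: word, POS tag, lemma
            word, pos, lemma = line.split("\t")
            yield ("token", (word, pos, lemma))

with open("corpus.vrt") as f:
    for kind, value in read_vertical(f):
        print(kind, value)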

File-level metadata

The attributes within the <text> s-attribute correspond to the header information in the Red Hen data format. However, since every attribute can only have one value, fields with multiple values are distributed over multiple attributes (see the example of VID below).

Each header field maps onto one or more attributes of the <text> s-attribute:

TOP (contains the starting timestamp and the file name):
    file
    date [In addition, year, month and day are currently used due to restrictions in the search capabilities of CQPweb. Do not rely on them, as they may be removed in the future.]
    time [this is the time scheduled, not the exact time]
    timestamp [this is the exact time at which the recording started]
    country
    channel
    title [Please note that the title is taken from the filename with underscores replaced by spaces. Use event_title if available for a possibly more accurate and human-readable version.]
    [TODO: verify event_title on a large scale]
COL (contains the collection name): collection
UID (a unique ID for the collection): id [the original UID is modified in that a "t" and two underscores "__" are added in front and all hyphens are replaced by underscores, e.g. t__d3fd32de_e3e5_11e3_857a_001fc65c7848; there are technical reasons for this due to the way IDs are handled in CQPweb]
PID (the show's own program ID, when available): program_id
ACQ (the time of acquisition): acquisition_time
DUR (the duration of the recording in hours:minutes:seconds.hundredths of a second): duration
VID (the picture size of the compressed video and of the original video): video_resolution, video_resolution_original
TTL (the title of the event if applicable, or the series if it contains non-ASCII characters): event_title
URL (the web source, if applicable): url
TTS (the type of transcript, if applicable): transcript_type
SRC (the recording location): recording_location
CMT (a comment added by the person scheduling the recording): scheduler_comment
LAN (three-letter ISO language code): language
TTP (the teletext page): teletext_page
HED (the header if available, typically with summary information about the content): header
OBT (the original broadcast time, when it differs from the local broadcast time; "OBT|Estimated" is used in digitized files when the precise broadcast time is unknown): original_broadcast_date, original_broadcast_time, original_broadcast_timezone, potentially original_broadcast_estimated="true"
LBT (the local broadcast time, with time zone): local_broadcast_date, local_broadcast_time, local_broadcast_timezone
END (the end timestamp, derived from the start time plus the video duration and followed by a repetition of the filename): no corresponding attribute

[TODO: ADD EXAMPLE HERE]
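
In the meantime, the following hypothetical <text> tag may serve as an illustration of the attributes listed above. All values are invented; only the id reuses the UID example from the table:

<text id="t__d3fd32de_e3e5_11e3_857a_001fc65c7848" file="2004-03-01_1400_US_Example-Channel_Example_Show" date="2004-03-01" time="1400" timestamp="20040301140000" country="US" channel="Example-Channel" title="Example Show" collection="Example Collection" duration="00:58:00.00" language="ENG">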

Sentence-level metadata

The attributes within the <s> s-attribute are:

id: based on the UID of the text in the collection, modified as follows: "s" and two underscores "__" are added in front, all hyphens are replaced by underscores, and two underscores "__" are appended, followed by the running number of the sentence in the current text (see the example below - this is the first sentence in the file).
starttime: the time the current sentence starts being displayed.
endtime (currently omitted): the time the current sentence stops being displayed. Currently not in use, since this is practically always the time the next s-unit starts.
reltime (redundant): the number of seconds that have elapsed since the beginning of the recording. This information is of course redundant, since it can be inferred from starttime and the time property of the <text> s-attribute, so it may be omitted in the future. It is currently necessary to provide an easy way to link back to the video in the Edge search engine, where the relative time is needed as a parameter to jump to the right position in the video.
Example:
<s id="s__d3fd32de_e3e5_11e3_857a_001fc65c7848__1" starttime="20040301140315.450" reltime="195">

Further annotation with s-attributes

Story segmentation

[Detailed version may come in the future.]
There are various types of story segmentation in the archive, done by different people based on different cues. These are currently not consistent enough to be included in the corpus. Nonetheless, the SEG tags are used by the sentence tagger to make sure no sentence crosses a SEG boundary. This practice may have to be revised in the future, given that some of the SEG tags for commercials enclose only parts of sentences, as in the following example from the file 2008-05-05_2230_US_KCAL_Inside_Edition.txt.
20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT
20080505225946.000|20080505225946.714|CCO|
20080505225946.714|20080505225947.429|CCO|4:00.
20080505225947.429|20080505225948.143|CCO|
20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON
20080505225948.857|20080505225951.000|SEG_01|Type=Commercial
20080505225948.857|20080505225949.571|CCO|
20080505225949.571|20080505225950.286|CCO|www.kcal9.com.
20080505225950.286|20080505225951.000|CCO|
20080505225951.000|20080505225952.250|SEG_01|Type=Story start
For the moment, the various chevrons (the ">>>" and ">>" markers in the closed captions) are indicated formally by a standalone tag, i.e. <storyboundary /> and <turnboundary />.
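
A minimal Python sketch of how a caption line's leading chevron could be turned into such a tag, assuming the common closed-captioning convention that ">>>" introduces a new story and ">>" a new speaker turn (the mapping and function name are our assumptions, not a description of the actual pipeline):

import re

# Order matters: ">>>" must be tested before ">>".
CHEVRON_TAGS = [(re.compile(r"^>>>\s*"), "<storyboundary />"),
                (re.compile(r"^>>\s*"), "<turnboundary />")]

def tag_chevrons(text):
    """Replace a leading chevron with the corresponding standalone tag."""
    for pattern, tag in CHEVRON_TAGS:
        if pattern.match(text):
            return tag + "\n" + pattern.sub("", text, count=1)
    return text

print(tag_chevrons(">> I DON'T WANT YOU TO HURT"))
# <turnboundary />
# I DON'T WANT YOU TO HURT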

Speaker identification

It is quite common for the speaker to be identified in the closed captions, as in the following example from 2012-05-05_0000_US_KNBC_Channel_4_News.txt:
20120505000037.000|20120505000041.000|CC1|>> I DON'T WANT YOU TO HURT     
20120505000041.000|20120505000041.000|CC1|YOURSELF.                       
20120505000041.000|20120505000043.000|CC1|>> Reporter: FOUR MINUTES LATER,
20120505000043.000|20120505000046.000|CC1|PARAMEDICS ARRIVED.             
Currently we detect a range of these with a whitelist, including "Reporter:" and "Translator:". For now these are stored in a standalone tag such as <speakeridentification speaker="Translator" />.
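
A minimal Python sketch of such whitelist-based detection (the two-item whitelist is just the examples named above; the real list and the surrounding pipeline differ):

import re

SPEAKER_WHITELIST = ["Reporter", "Translator"]  # illustrative subset
SPEAKER_RE = re.compile(r">>\s*(" + "|".join(SPEAKER_WHITELIST) + r"):\s*")

def speaker_tag(line):
    """Return a <speakeridentification /> tag if the caption line
    names a whitelisted speaker, else None."""
    m = SPEAKER_RE.search(line)
    if m:
        return '<speakeridentification speaker="%s" />' % m.group(1)
    return None

print(speaker_tag(">> Reporter: FOUR MINUTES LATER,"))
# <speakeridentification speaker="Reporter" />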

Named Entity Recognition

[or should we do this as p-attributes, possibly similar to ditto tags?]

    