What is this?
The Red Hen data format has certain drawbacks when used as a corpus, which the Red Hen corpus data format is meant to remedy. The corpus data format follows a relatively standard and easy-to-process pattern called "vertical format", which many corpus managers use as their input format, e.g. CQP (the backend to CQPweb) or manatee (the backend to Sketch Engine and NoSketch Engine).
There are two levels of representation in the Red Hen corpus data format:
- Token-level annotation, i.e. every word/punctuation mark has annotation, e.g. Part-of-Speech, lemma, etc. These are called p[ositional]-attributes in CQP.
- Annotation potentially spanning multiple words or not directly related to individual words, such as texts, sentences, pauses, gestures, etc. These are called s[tructural]-attributes in CQP.
Thus, in the following example, the lines containing the actual text have the p-attribute word in the first column, the p-attribute Part-of-Speech (in the Lancaster CLAWS5 tagset) in the second and the p-attribute lemma (i.e. the base form of the word) in the third column. The columns are not labeled in the file itself. The s-attributes in this snippet are text and s (an s-unit, "a sentence-like division of a text", TEI). They can have multiple attributes (things like id, title, author, publisher) themselves.
<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s>
The AT0 the
cat NN1 cat
sat VVD sit
on PRP on
the AT0 the
mat NN1 mat
. PUN .
</s>
<s>
Its DPS its
name NN1 name
is VBZ be
Pi NP0 pi
. PUN .
</s>
</text>
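As a rough illustration of how such a file can be processed, the following Python sketch separates structural tags from token lines. It assumes that the columns are tab-separated (as CQP expects for vertical format) and appear in the order word, Part-of-Speech, lemma. The function name parse_vertical and the shape of the returned events are made up for this illustration and are not part of any Red Hen tool.

```python
import re

# Matches name="value" pairs inside an opening structural tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Split vertical-format lines into structural tags and tokens.

    Returns a list of events: ("open", name, attrs), ("close", name)
    or ("token", word, pos, lemma). Standalone tags such as
    <turnboundary /> are reported as "open" events without a "close".
    """
    events = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            events.append(("close", line[2:-1].strip()))
        elif line.startswith("<") and line.endswith(">"):
            body = line[1:-1].rstrip("/").strip()
            name = body.split(None, 1)[0]
            events.append(("open", name, dict(ATTR_RE.findall(body))))
        else:
            # Assumption: tab-separated columns in the order word, PoS, lemma.
            word, pos, lemma = line.split("\t")
            events.append(("token", word, pos, lemma))
    return events


sample = [
    '<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">',
    "The\tAT0\tthe",
    "cat\tNN1\tcat",
    "sat\tVVD\tsit",
    "</text>",
]
events = parse_vertical(sample)
lemmas = [event[3] for event in events if event[0] == "token"]
```

Keeping tags and tokens in one event list preserves their order, which is what makes it possible to tell which tokens fall inside which s-attribute.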
The attributes within the <text> s-attribute correspond to the header information in the Red Hen data format. However, since every attribute can only have one value, fields with multiple values are distributed over multiple attributes (see the example of VID below).
||attribute(s) within s-attribute <text>
||contains the starting timestamp and the file name
date [In addition, year, month and day are currently provided as well because of restrictions in the search capabilities of CQPweb. Do not rely on them, as they may be removed in the future.]
time [this is the time scheduled, not the exact time]
timestamp [this is the exact time at which the recording started]
title [Please note that the title is taken from the filename with underscores replaced by spaces. Use event_title if available for a possibly more accurate and human-readable version.]
TODO: VERIFY event_title ON LARGE SCALE
||contains the collection name
||a unique ID for the collection
||id [the original UID is modified: a "t" and two underscores "__" are added in front and all hyphens are replaced by underscores, e.g. t__d3fd32de_e3e5_11e3_857a_001fc65c7848. There are technical reasons for this due to the way IDs are handled in CQPweb.]
||the show's own program ID (when available)
||the time of acquisition
||the duration of the recording in hours:minutes:seconds.hundredths of a second
||the picture size of the compressed video and of the original video
||the title of the event, if applicable, or of the series if it contains non-ASCII characters
||the web source if applicable
||the type of transcript if applicable
||the recording location
||a comment added by the person scheduling the recording
||three-letter ISO language code
||the teletext page
||the header if available, typically with summary information about the content
||the original broadcast time, when it differs from the local broadcast time; "OBT|Estimated" is used in digitized files when the precise broadcast time is unknown
||the local broadcast time, with time zone
| END || The end timestamp is derived from the start time plus the video duration. It is followed by a repetition of the file name. || --- |
[TODO: ADD EXAMPLE HERE]
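The two derivations mentioned in the table, the id built from the UID and the end timestamp built from start time plus duration, can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the start timestamp is assumed to have the form YYYYMMDDHHMMSS, and writing the end timestamp without fractional seconds is an assumption as well.

```python
from datetime import datetime, timedelta


def text_id(uid):
    """Turn an original UID into the id attribute of <text>.

    A "t" and two underscores are added in front and all hyphens
    are replaced by underscores.
    """
    return "t__" + uid.replace("-", "_")


def end_timestamp(start, duration):
    """Add a duration to a start timestamp.

    start:    YYYYMMDDHHMMSS (assumed format)
    duration: hours:minutes:seconds.hundredths of a second
    """
    hours, minutes, seconds = duration.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    end = datetime.strptime(start, "%Y%m%d%H%M%S") + delta
    # Assumption: the end timestamp is written without fractional seconds.
    return end.strftime("%Y%m%d%H%M%S")


# The UID matches the id example in the table; start time and duration
# are invented values used only for illustration.
example_id = text_id("d3fd32de-e3e5-11e3-857a-001fc65c7848")
example_end = end_timestamp("20040301140000", "0:29:54.31")
```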
||attribute within s-attribute <s>
|Based on the UID of the text in the collection, modified as follows: "s" and two underscores "__" are added in front, all hyphens are replaced by underscores, and two underscores "__" followed by the running number of the sentence in the current text are appended (see example below; this is the first sentence in the file).|| id
|The time the current sentence starts being displayed.|| starttime
|The time the current sentence stops being displayed. Currently not in use since this is practically always the time the next s-unit starts.
|Relative time: The number of seconds that have elapsed since the beginning of the recording. This information is of course redundant since it can be inferred from starttime and the time property of the <text> s-attribute, so it may be omitted in the future. It is currently necessary to provide an easy way to link back to the video in the edge search engine, where the relative time is needed as a parameter to jump to the right position in the video.|| reltime (redundant)
<s id="s__d3fd32de_e3e5_11e3_857a_001fc65c7848__1" starttime="20040301140315.450" reltime="195">
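The sentence id and the relative time in this example can be derived roughly as in the following Python sketch. The function names are hypothetical, and two things are assumed rather than taken from the description: that the recording behind this example started at 20040301140000, and that the relative time is truncated to whole seconds.

```python
from datetime import datetime


def sentence_id(uid, number):
    """Build the id of an <s> from the text UID and the running sentence number."""
    return "s__" + uid.replace("-", "_") + "__" + str(number)


def parse_timestamp(stamp):
    """Parse YYYYMMDDHHMMSS with optional fractional seconds."""
    fmt = "%Y%m%d%H%M%S.%f" if "." in stamp else "%Y%m%d%H%M%S"
    return datetime.strptime(stamp, fmt)


def reltime(recording_start, starttime):
    """Seconds elapsed between the start of the recording and the sentence."""
    delta = parse_timestamp(starttime) - parse_timestamp(recording_start)
    # Assumption: fractions of a second are dropped.
    return int(delta.total_seconds())


first_id = sentence_id("d3fd32de-e3e5-11e3-857a-001fc65c7848", 1)
# "20040301140000" is an assumed recording start, chosen to fit the example.
seconds_in = reltime("20040301140000", "20040301140315.450")
```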
Further annotation with s-attributes
[Detailed version may come in the future.]
There are various types of story segmentation in the archive that have been done by different people based on different cues. These are currently not consistent enough to be included in the corpus. Nonetheless, the SEG tags are used by the sentence tagger to make sure no sentence crosses a SEG boundary. This practice may have to be revised in the future, given that some of the SEG tags for commercials enclose only parts of sentences, as in the following example from the file 2008-05-05_2230_US_KCAL_Inside_Edition.txt.
20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT
20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON
For the moment, the various chevrons are represented by standalone tags, i.e. <storyboundary /> and <turnboundary />.
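A minimal Python sketch of this conversion might look as follows. Which chevron sequence corresponds to which tag is not specified here, so the mapping used (">>>" for a story boundary, ">>" for a turn boundary) is an assumption based on common closed-captioning practice. The function names are hypothetical, and only chevrons at the start of the caption text are handled.

```python
def split_caption_line(line):
    """Split a caption line into start time, end time, type and text."""
    start, end, kind, text = line.split("|", 3)
    return start, end, kind, text


def mark_boundaries(text):
    """Replace leading chevrons by standalone tags.

    Assumption: ">>>" marks a story boundary and ">>" a turn boundary.
    Returns the list of tags and the remaining caption text.
    """
    tags = []
    stripped = text.lstrip()
    if stripped.startswith(">>>"):
        tags.append("<storyboundary />")
        stripped = stripped[3:]
    elif stripped.startswith(">>"):
        tags.append("<turnboundary />")
        stripped = stripped[2:]
    return tags, stripped.strip()


line = "20120505000037.000|20120505000041.000|CC1|>> I DON'T WANT YOU TO HURT"
start, end, kind, text = split_caption_line(line)
tags, remainder = mark_boundaries(text)
```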
It is quite common to identify the speaker in closed captions, as in the following example from 2012-05-05_0000_US_KNBC_Channel_4_News.txt:
20120505000037.000|20120505000041.000|CC1|>> I DON'T WANT YOU TO HURT
20120505000041.000|20120505000043.000|CC1|>> Reporter: FOUR MINUTES LATER,
Currently we detect a range of these with a whitelist that includes "Reporter:" and "Translator:". For now, each is stored in a standalone tag, e.g. <speakeridentification speaker="Translator" />.
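The whitelist approach can be sketched in Python as below. The whitelist here contains only the two labels named above, while the real list is longer. The regular expression and the function name are assumptions made for this illustration, not the actual implementation.

```python
import re

# Illustrative whitelist; only these two labels are named in the text.
SPEAKER_WHITELIST = {"Reporter", "Translator"}

# An optional ">>" followed by a single word and a colon.
SPEAKER_RE = re.compile(r"^(?:>>\s*)?([A-Za-z]+):\s*")


def identify_speaker(text):
    """Return a speaker tag (or None) and the caption text without the label.

    Leading chevrons are skipped here and dropped from the remaining text
    when a whitelisted speaker is found.
    """
    match = SPEAKER_RE.match(text)
    if match and match.group(1) in SPEAKER_WHITELIST:
        tag = '<speakeridentification speaker="%s" />' % match.group(1)
        return tag, text[match.end():]
    return None, text


tag, rest = identify_speaker(">> Reporter: FOUR MINUTES LATER,")
no_tag, unchanged = identify_speaker(">> I DON'T WANT YOU TO HURT")
```

Restricting matches to a whitelist avoids treating arbitrary words followed by a colon as speaker names, at the cost of missing speakers that are not on the list.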
Named Entity Recognition
[or should we do this as p-attributes, possibly similar to ditto tags?]