Red Hen corpus data format


The Red Hen data format has certain drawbacks when used as a corpus, which the Red Hen corpus data format is meant to remedy. It follows a relatively standard and easy-to-process pattern called "vertical format", which many corpus managers use as their input format, e.g. CQP (the backend to CQPweb) or Manatee (the backend to Sketch Engine and NoSketch Engine).


Basic Concepts

There are two levels of representation in the Red Hen corpus data format:

  1. Token-level annotation, i.e. every word/punctuation mark has annotation, e.g. Part-of-Speech, lemma, etc. These are called p[ositional]-attributes in CQP.
  2. Annotation potentially spanning multiple words or not directly related to individual words, such as texts, sentences, pauses, gestures, etc. These are called s[tructural]-attributes in CQP.

Thus, in the following example, the lines containing the actual text have the p-attribute word in the first column, the p-attribute Part-of-Speech (in the Lancaster CLAWS5 tagset) in the second column and the p-attribute lemma (i.e. the base form of the word) in the third column. The columns are not labeled in the file itself. The s-attributes in this snippet are text and s (s-unit, "a sentence-like division of a text", TEI). S-attributes can themselves carry multiple attributes (things like id, title, author, publisher).

<text id="file0001" title="Cat stories" author="Tom Cat" publisher="Feline Press">
<s id="1">
The    AT0    the
cat    NN1    cat
sat    VVD    sit
on     PRP    on
the    AT0    the
mat    NN1    mat
.      PUN    .
</s>
<s id="2">
Its    DPS    its
name   NN1    name
is     VBZ    be
Pi     NP0    pi
.      PUN    .
</s>
</text>
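
To make the two levels of representation concrete, the following Python sketch reads a vertical-format snippet like the one above into a list of tokens (p-attributes) and a list of regions (s-attributes). It is an illustration only, not part of the Red Hen tooling: the function name parse_vertical and the tuple layouts are assumptions, and it handles only the constructs shown in the snippet (opening tags, closing tags, three-column token lines).

```python
import re

# Opening tags such as <s id="1"> and closing tags such as </s>.
TAG_RE = re.compile(r'^<(/?)(\w+)((?:\s+\w+="[^"]*")*)\s*>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines):
    """Return (tokens, regions) for an iterable of vertical-format lines.

    tokens  : list of (word, pos, lemma) tuples, one per token line
    regions : list of (name, attrs, start, end) tuples, where start/end are
              token indices (end exclusive)
    """
    tokens = []
    regions = []
    open_regions = {}  # s-attribute name -> (attrs, index of first token)

    def close(name):
        if name in open_regions:
            attrs, start = open_regions.pop(name)
            regions.append((name, attrs, start, len(tokens)))

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TAG_RE.match(line)
        if match:
            closing, name, attr_text = match.groups()
            # Close on an explicit end tag, or implicitly when the same
            # s-attribute is opened again without having been closed.
            close(name)
            if not closing:
                attrs = dict(ATTR_RE.findall(attr_text))
                open_regions[name] = (attrs, len(tokens))
            continue
        # Columns are tab-separated in real files; fall back to any
        # whitespace for space-aligned examples.
        fields = line.split("\t") if "\t" in line else line.split()
        word, pos, lemma = fields
        tokens.append((word, pos, lemma))

    # Close whatever is still open at the end of the input.
    for name in list(open_regions):
        close(name)
    return tokens, regions
```

Called on the lines of the example, this yields twelve tokens, two s regions and one text region whose attrs dictionary holds id, title, author and publisher. Because regions are closed implicitly when the same tag is reopened or the input ends, the sketch gives the same spans whether or not the closing tags are present.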

File-level metadata

The attributes within the <text> s-attribute correspond to the header information in the Red Hen data format. However, since every attribute can only have one value, fields with multiple values are distributed over multiple attributes (see the example of VID below).
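
As a rough illustration of spreading a multi-valued field over several attributes, the Python sketch below builds a <text> line from a dictionary of header fields. The helper text_tag and the numbered naming scheme (name_1, name_2, ...) are hypothetical and are not taken from the Red Hen specification; the values in the usage note are placeholders.

```python
from xml.sax.saxutils import quoteattr


def text_tag(header):
    """Build a <text ...> line from a dict of header fields.

    A field whose value is a list or tuple is spread over numbered
    attributes (name_1, name_2, ...). This numbering scheme is a
    hypothetical choice made for the example.
    """
    parts = []
    for name, value in header.items():
        if isinstance(value, (list, tuple)):
            for i, item in enumerate(value, start=1):
                # quoteattr adds the surrounding quotes and escapes the value.
                parts.append(f"{name}_{i}={quoteattr(str(item))}")
        else:
            parts.append(f"{name}={quoteattr(str(value))}")
    return "<text " + " ".join(parts) + ">"
```

For instance, text_tag({"id": "file0001", "vid": ["a", "b"]}) produces a tag with the attributes id, vid_1 and vid_2, so that each attribute carries exactly one value.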


Sentence-level metadata


<s id="s__d3fd32de_e3e5_11e3_857a_001fc65c7848__1" starttime="20040301140315.450" reltime="195">
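
The attributes of such a tag can be read back with a few lines of Python. The sketch below assumes that starttime has the form YYYYMMDDhhmmss.fff, as in the example, and returns reltime as a plain number because its unit is not stated here; the function name parse_s_tag is an assumption made for the illustration.

```python
import re
from datetime import datetime

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_s_tag(line):
    """Return (id, starttime, reltime) for an <s ...> line.

    starttime is parsed into a naive datetime (no time zone is attached);
    reltime is returned as a float without interpreting its unit.
    """
    attrs = dict(ATTR_RE.findall(line))
    start = datetime.strptime(attrs["starttime"], "%Y%m%d%H%M%S.%f")
    return attrs["id"], start, float(attrs["reltime"])
```

Applied to the line above, the second element is a datetime for 1 March 2004, 14:03:15.450.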

Further annotation with s-attributes

Story segmentation

[Detailed version may come in the future.]

There are various types of story segmentation in the archive, done by different people on the basis of different cues. They are currently not consistent enough to be included in the corpus. Nonetheless, the sentence tagger uses the SEG tags to make sure that no sentence crosses a SEG boundary. This practice may have to be revised in the future, given that some of the SEG tags for commercials enclose only parts of sentences, as in the following example from the file 2008-05-05_2230_US_KCAL_Inside_Edition.txt.

20080505225945.286|20080505225946.000|CCO|WELCOME TO "KCAL 9 NEWS" AT
20080505225948.143|20080505225948.857|CCO|ALSO STREAMING LIVE ON
20080505225951.000|20080505225952.250|SEG_01|Type=Story start

For the moment, the chevrons that mark story and turn boundaries in the closed captions are each represented formally by a standalone tag, i.e. <storyboundary /> and <turnboundary />.
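
The rule that no sentence may cross a SEG boundary can be sketched in Python as follows. This is not the actual sentence tagger: the names CaptionLine, parse_caption_line and split_at_segments are assumptions, and the sketch only handles lines with the four pipe-separated fields seen in the example (start time, end time, line type, content).

```python
from typing import NamedTuple


class CaptionLine(NamedTuple):
    start: str
    end: str
    kind: str      # e.g. "CCO", "CC1" or "SEG_01"
    content: str


def parse_caption_line(line):
    """Split one pipe-delimited line into its four fields."""
    # maxsplit=3 keeps any further "|" characters inside the content field.
    start, end, kind, content = line.rstrip("\n").split("|", 3)
    return CaptionLine(start, end, kind, content)


def split_at_segments(lines):
    """Join caption text into chunks so that no chunk crosses a SEG line."""
    chunks, current = [], []
    for line in lines:
        entry = parse_caption_line(line)
        if entry.kind.startswith("SEG"):
            # A SEG line ends the current chunk; its own content is metadata.
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(entry.content.strip())
    if current:
        chunks.append(" ".join(current))
    return chunks
```

On the three lines of the example, the two CCO lines end up in a single chunk that stops at the SEG_01 line, which shows the problem described above: the chunk ends in the middle of a sentence.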

Speaker identification

It is quite common to identify the speaker in closed captions, as in the following example from 2012-05-05_0000_US_KNBC_Channel_4_News.txt:

20120505000037.000|20120505000041.000|CC1|>> I DON'T WANT YOU TO HURT     
20120505000041.000|20120505000043.000|CC1|>> Reporter: FOUR MINUTES LATER,
20120505000043.000|20120505000046.000|CC1|PARAMEDICS ARRIVED.             

Currently we detect a range of these speaker labels with a whitelist that includes "Reporter:" and "Translator:". For now, each detected label is stored in a standalone tag, e.g. <speakeridentification speaker="Translator" />.
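
A whitelist check of this kind might look like the following Python sketch. The whitelist contains only the two labels named above, and the regular expression and the function name tag_speaker are assumptions made for the illustration, not the code actually in use.

```python
import re

# Only the two labels mentioned in the text; the real whitelist is longer.
SPEAKER_WHITELIST = {"Reporter", "Translator"}

# ">>" followed by a label and a colon at the start of the caption text.
SPEAKER_RE = re.compile(r'^>>\s*([^:>]+):\s*')


def tag_speaker(caption_text):
    """Return (tag, remaining_text); tag is None if no listed label is found."""
    match = SPEAKER_RE.match(caption_text)
    if match:
        label = match.group(1).strip()
        if label in SPEAKER_WHITELIST:
            tag = f'<speakeridentification speaker="{label}" />'
            return tag, caption_text[match.end():]
    return None, caption_text
```

With the captions above, the line starting with ">> Reporter:" yields a speakeridentification tag plus the remaining text, while ">> I DON'T WANT YOU TO HURT" is returned unchanged because it contains no whitelisted label.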

Named Entity Recognition

[or should we do this as p-attributes, possibly similar to ditto tags?]