ELAN is a professional tool for the creation of complex annotations on video and audio resources, see https://tla.mpi.nl/tools/tla-tools/elan. It is a java-based desktop application that runs on Windows, OS X, and Linux. We are integrating ELAN into the Red Hen workflow by creating standard annotation templates, providing basic instructions to get started, writing export scripts that convert ELAN annotations into Red Hen formats, and writing import scripts that allow ELAN to read Red Hen files. Annotations from ELAN can in turn be made available to students and researchers in machine learning, who will use the annotations to create classifiers that aim to discover the same patterns in a larger dataset and thus automate certain aspects of the annotation task.
Use these instructions to get started annotating videos from the Red Hen datasets in ELAN. See also How to request clips.
Go to ELAN's site at https://tla.mpi.nl/tools/tla-tools/elan/download/ and select the latest version for your operating system. On a Mac, move the ELAN_4.9.X folder into the Applications folder, or on a Windows PC into the Program Files folder. Then open the folder and drag ELAN_4.9.X.app to your Dock, so that it's easy to start the program.
For your working video, audio, and annotation files, please create a dedicated folder called Elan_projects in Documents or on your Desktop. Keep all the media and annotation files in this folder.
A set of sample files has been created by Elisabeth Zima at the University of Freiburg and Francis Steen at UCLA. Download them to your working folder, Elan_projects:
If you have ffmpeg installed on your computer, you can generate the wav file yourself instead of downloading it, using this command:
ffmpeg -i 2014-10-13_1800_US_CNN_Newsroom_12-493.mp4 2014-10-13_1800_US_CNN_Newsroom_12-493.wav
Click on the 2014-10-13_1800_US_CNN_Newsroom_12-493.eaf file and ELAN will open with the associated video file 2014-10-13_1800_US_CNN_Newsroom_12-493.mp4 and the audio file 2014-10-13_1800_US_CNN_Newsroom_12-493.wav.
Click on the .eaf file you just downloaded to start ELAN. If the media files don't load, see troubleshooting; it's a known bug.
Expand the window to use your entire screen. Your opening view should look something like this:
Below the video image you find the player controls. Hover over each transport button to see its function. Flanking the play button are the controls for moving one pixel at a time, and next to them the buttons for moving one frame at a time -- video typically has 29.97 (US) or 25 (EU) frames per second. A major advantage of ELAN over online video annotation tools is that it gives you ultimate precision in controlling the video, with rapid response times. Try out each of the buttons and note the effect.
To the right of the main transport bar is a second transport bar for playing a selection. Click on a segment of the Brooke Baldwin transcript to highlight it, and then click on the leftmost S (Select Play) icon to play just that segment:
Try "Loop Mode" and notice how the waveform lines between the two timelines track the speech. The waveform is very useful for determining pauses and speech boundaries, and thus for defining a speech segment.
You can mark a selection using your mouse. Place the cursor anywhere on the timeline, and then click and hold the mouse button down while you drag to the left or right, as in a regular copy task. When you have marked a selection, press Select Play to play what you marked.
ELAN gives you multiple ways to view your data. If you select Grid view, you get a very useful list of annotations, with timecodes:
You can use the Grid view to navigate and then press Play or Play selection.
You can also select Text view, which shows the transcript as a continuous text:
Like Grid view, Text view can also be used for navigating.
To walk from one annotation to the next, start by marking an annotation, either in the timeline or in Grid or Text view. Press the "Play selection" button and then the right or left arrow button to move to the next selection. For each selection, press "Play selection" and then the arrow to move to the next annotation.
Let's start with the text transcript, adjusting boundaries, correcting an existing annotation, and adding a new annotation.
To adjust the start and/or end times of an annotation, first select the annotation. To get the right zoom level, use the slider at the bottom right corner:
Make sure you see the entire annotation and the area on both sides that you are adjusting into. Use the audio waveform to guide you in selecting the new start and end points. Click and hold from the chosen point in the timeline and mark the new extent of the annotation, growing or shrinking it on either side. Then right-click on the transcription text timeline, and from the context menu select "Modify Annotation Time". This function can be a bit finicky, and you sometimes don't get the right context menu, but it typically works after a couple of attempts. See ELAN's instructions on this for more help.
Note that if you extend an annotation into the neighboring annotation, to the right or left, your modification will push that annotation so that the two share a boundary. This is a helpful feature for getting the phrase boundaries right.
Let's say you notice a mistake in an annotation -- how do you correct it? The simplest method is to select the annotation by clicking on the timeline, and then right-clicking on the text. In the context menu, select "Modify Annotation Value". Edit the text and click somewhere outside of the annotation.
You'll note that the transcription follows a simple logic -- in brief, these are the GAT2 conventions used:
The GAT2 convention is easily readable, and adds some information that may be useful for characterizing co-speech gestures.
A more complex convention is the Santa Barbara Discourse Transcription convention.
For Red Hen projects, we may also decide to transcribe speech into regular English, rather than using GAT2 or the Santa Barbara convention. The reason for this is that we want annotated gestures to be searchable, and to be easily aligned with the closed captions. A separate project for audio analysis will provide acoustic information through automated feature extraction.
To create a new text annotation, you simply mark out the area by clicking and holding while you sweep over the timeline, as above. Select a short phrase with a natural break, maybe five to ten words. Once you've marked a segment, click the "Play selection" button to verify you have the phrase you want. Then right-click in the row of the speaker and select "New Annotation Here".
In the Grid view, select the speaker from the pulldown menu, say "Thomas Frieden -- speech transcript", to see the results. Verify by playing each transcribed segment.
Make sure you save the file when you're done.
The online version of the sample 2014-10-13_1800_US_CNN_Newsroom_12-493.eaf file may be updated with new features occasionally; you can check the last update time by viewing the file date in the download directory.
Elisabeth Zima, who annotated the clip for us, included handedness codes (1 for right hand and 2 for left) and Right Hand Gesture Phase codes PR (Preparation) and STRO (Stroke), decomposing the gesture into its components:
In the RH-Gesture Phase you see the detailed gesture component annotations, explained if you request "Modify Annotation Value":
We may get to this approach eventually, but this is not how we'll start. Instead we'll focus on labeling the meaning of the whole gesture.
Call up the clip in Elan and watch the first gesture annotation. You'll see that what Brooke Baldwin is doing with this gesture is establishing a "Distancing Other Space", which is used to place the idea that the CDC "is not so sure" the nurse really followed the right procedure. She's using space to create a discourse landscape or semantic topology, and locating the CDC's view on the matter far out in that landscape. This makes it clear not only that there's a significant difference between the version of events claimed by hospital officials on the one hand and the CDC on the other, but that the hospital's version (which Brooke presents without gestures) is the normative center, the default or reference view that the CDC is challenging. This normative claim is presented silently and by implication; there is no explicit endorsement. Yet the fact that she's creating the space with a fully extended arm signals that the CDC's view is controversial; it tacitly removes the CDC view from the normative center and into the fringe, portraying it as an outlying view that we by default and implicitly distance ourselves from. (Scott Liddell -- "real space" is the ground to hold and arrange conceptual space)
You can see another example of this Distancing Other Space gesture in Hillary Clinton's testimony to a Senate Hearing on media. As she is saying "Al Jazeera is winning, the Chinese have opened up a global English language and multi-language television network, the Russians have opened up an English-language network," she's gesturing with a fully extended arm to establish this "other" space far away from the normative center. That normative center is "we" and "the BBC", gesturing towards the center of her own body, in effect standing in for America and the West.
So the level we will be annotating at is to label the overall semantic meaning or communicative intent of an entire gesture. We won't be focusing on the details of how the gesture is composed, but we need to be good at the labeling. In addition, we will include gaze direction information. The gesture annotations might look like this:
You see I've labeled the second gesture "Precision Grip" -- this is how Brooke introduces "a source with direct knowledge of the case," thus adding credibility to this unnamed source. Our task will be to characterize gestures at this level, and create a collection of recognizable gestures classified by their communicative intent, or semantic meaning.
As an exercise, try adding an annotation for a gesture towards the end of the clip, at 00:05:51.8, where Brian Stelter says "I wish I knew".
Click Search in Elan's main menu and select "Go To..." In the pop-up, paste in 00:05:51.8 and press enter.
Mark the time from folded hands to folded hands, enclosing the entire gesture. Pay attention to when he says, “I wish I knew”, since that is the statement accompanying the gesture.
Make sure you are satisfied with your annotation boundaries. Under “Selection,” below the video and to the right, there is an "S" with a triangle arrow. You can click on this in order to hear and view the segment according to the specific boundaries you have set. You can also use the keyboard shortcut SHIFT+SPACE.
You can even set it on “loop” in order to view the selection over and over again until you are sure that this is where you want to set the boundaries.
Once you're satisfied with your boundaries, let's label the gesture. Position the cursor inside the marked area, on the Hand gestures line, and double-click. You can also right-click and select "New annotation here", or select "Annotation" from the main menu and then "New annotation here". You should see a white field open for the label. Let's label it "Positive Ignorance" -- the arm and hand movements occupy an expanding epistemic space that signals a positive attitude to learning something and acknowledging a complete absence of knowledge.
When studying gesture, you can either examine human multimodal communication and annotate what you see, or you can look for specific gestures in a large dataset. Both approaches are useful and necessary: we want to discover new gestures, but we also want to create ordered collections of the same gesture to study how it is used in different contexts. One method for creating such collections is to leverage manual annotations with machine learning.
An important dimension of gesture annotations in Red Hen is that they can be used for machine learning tasks. In machine learning, a computer program will examine our annotations of a certain gesture, say the Positive Ignorance gesture, learn something about it from our annotations, and then robotically find many more instances of the same gesture in other news program videos. In this way, your work of labeling a small number of gestures can be used to label a very large number.
Typically a training set for machine learning may require around a hundred annotations. The machine learning program uses the annotations in the training set to create a so-called "classifier." This classifier learns to recognize the gesture based on your annotations. To check our results, we can run the classifier on a small number of videos, until it finds some examples of the gesture it learned. We can then examine its candidate gestures and provide feedback on whether the gestures have been correctly identified. In this way, the classifier improves with our feedback.
Once we're satisfied with the performance of the classifier, we can use it to find thousands of instances of the same gesture in hundreds of thousands of video files -- far more than any human could accomplish.
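To make the classifier idea concrete, here is a toy sketch in Python. Everything in it is invented for illustration: the two-dimensional feature vectors stand in for the hundreds of features a real system would extract from video frames, and a simple nearest-centroid rule stands in for the far more sophisticated models used in practice.

```python
# Toy nearest-centroid classifier. The feature vectors below are invented
# stand-ins for features a real system would extract from video frames.
def train(examples):
    """examples: {label: [feature_vectors]} -> {label: centroid}"""
    centroids = {}
    for label, vectors in examples.items():
        n = len(vectors)
        centroids[label] = [sum(v[i] for v in vectors) / n
                            for i in range(len(vectors[0]))]
    return centroids

def classify(centroids, vector):
    """Return the label whose centroid is closest to the vector."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], vector))
    return min(centroids, key=dist)

# A tiny "training set" built from manual gesture annotations.
training = {
    "Positive Ignorance": [[0.9, 0.1], [0.8, 0.2], [0.95, 0.15]],
    "Precision Grip":     [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85]],
}
model = train(training)
print(classify(model, [0.85, 0.12]))  # a new, unlabeled gesture
```

The point of the sketch is the workflow, not the math: your annotations supply the labeled examples, the program summarizes them, and new video segments are then matched against those summaries.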
Now, how do we find many instances of the same gesture? One approach is to search for when speakers say the phrases that we find are associated with a particular gesture. For instance, the Positive Ignorance gesture goes with the phrase "I wish I knew". We can search for that phrase in the NewsScape search engine, and then check the search results to see if the speaker is using that particular gesture.
To access the NewsScape search engine, go to the Red Hen video annotation class and click on one of the Login buttons -- there's one in the middle and one in the top right corner. Once you've logged in to your UCLA account, enter the enrollment key you received from your professor. In the list of courses, click on the link Communication Studies Archive Edge Search Engine.
In the search window, enter the phrase you want to search for, select the search results to display 100 results at a time, and set the "from" date to the beginning of the Archive in the "Date and time" section:
The start date will be changing, as we are adding materials from the digitizing project managed by Prof Groeling; at the moment it is January 1, 2005. Press Search to see your results; you'll get around 930 hits.
Most of the search results do not have the gesture we are looking for. Scroll down until you get to this search result from Fox on November 9. Click on the thumbnail image in the search results to start the video:
You'll see O'Reilly makes a one-hand version of the "I wish I knew" gesture. The camera leaves out the start of the gesture, but the remaining parts are shown well.
To save this search hit, right-click on the | text | link next to the show title to open it in a different tab, and copy the name of the show from the TOP line, then close the tab with Ctrl-w. Paste this file name into an e-mail, adding the time of the gesture in the bottom left-hand corner of the player, in this case 34:17:
If you find several instances of the same footage, such as in a commercial, you don't need to tag every instance. Just tag the first one, and note that it recurs. Skip any clip that does not show a gesture.
For more information on the search interface, see How to use the NewsScape Edge search engine.
If you find an interesting gesture, add it here! In general, talk shows are the best source of clear views. Note that in the Edge video player, the number of frames is always shown as "Select frame at ... seconds". You can append this number to the end of the URL to link directly to the location of the gesture.
To turn a list of links such as the Positive ignorance gestures into clips, follow this procedure:
We're now also able to do this using just the permalinks by referencing the list of file names and UIDs; see the clip-bulk script for details.
To annotate these clips, we still need generic annotation templates.
This section walks you through how to create a new annotation project for a video clip. See also ELAN's How to create a new annotation document.
The section has been written for the Winter and Spring 2016 Work-Study project, but can be used by others.
To start a project with a new video clip, you need two files. First select the video clip you want to annotate from the web site you have been given, such as http://vrnewsscape.ucla.edu/elan-clips, and place it in your Desktop/Elan_projects folder. Second, download this Red Hen annotation template:
This template is designed for annotating a single gesture by one individual. Place it in your Desktop/Elan_projects folder. If the file gets the .xml extension when you save it, see Troubleshooting.
Next, start Elan and select File | New from the main menu. You will see this window:
First click on "Add Media File" and select the video clip you want to annotate from your Elan_projects folder -- the one you just downloaded.
Second, click on "Add Template File" and select Redhen-04-single.etf in your Elan_projects folder.
Third, give your project a name. Click File | Save As from the main menu, and then select the video file you just added -- for instance, 2007-03-07_1900_US_KTTV-FOX_Montel_Williams_Show_797-1277.mp4. You always want your Elan project to have the same name as the video file you are annotating.
Click on the file name in the "Save As" field and move to the end of the file name -- it will end in "mp4". Change that to "eaf", which is Elan's file format.
You should now see your video and the template looking something like this:
You have now set up your workspace and are ready to begin the annotation. We will walk through the steps one by one.
The clips in http://vrnewsscape.ucla.edu/elan-clips use the links you created in task #3, Positive ignorance gestures, adding two minutes before the time you indicated for the gesture, and three minutes after. This means the phrase "I wish I knew" is most likely around two minutes into the clip. To go there directly, click on Search | Go To in the main menu, and enter the minutes and seconds like this:
By using 00:01:45, you'll be taken to 15 seconds before the most likely point for the phrase and the co-speech gesture.
If it's not at the two-minute mark, try searching for it before and after until you find it.
While you watch the video looking for the phrase and gesture, think about how you can name or characterize the main speaker.
The first task is to define the start and end of the phrase and co-speech gesture.
Once you've found the right location, you may need to adjust the scale of the timeline with the slider at the bottom right-hand of the screen -- for details, see Adjust the boundaries of a text annotation. The gesture typically only lasts a second or two, so you'll probably want to zoom in:
The start and end times you create for the Speaker tier will be used for all the annotation categories, so it's important to be accurate. Use the Play Selection button to verify you've set the boundaries correctly.
To add annotations within your selection, you first pick the kind of annotation you want to add. Click on Grid and select Speaker:
As a confirmation, you'll see the word Speaker in the coding template is now underlined -- that tells you this is the currently active tier.
Note that if you lose the bounding box further down in your annotation work, you can get it back by selecting the Speaker tier in the Grid selector in this way.
To label the speaker, right-click within the selection on the Speaker line and select "New Annotation Here":
Inside the Speaker field you then enter either the name of the speaker, if you know it (in this case Margaret Cho), or a general description, such as Talk show guest:
Note: in ELAN 4.9.3, the "New Annotation Here" menu item is found under the "Annotation" menu at the top horizontal menu bar of ELAN:
After annotating the Speaker tier, select the Speech tier and add a GAT2 transcription, as described above.
The third tier in the RedHen-04-single annotation template is called Rectangle. This is a new feature. Its purpose is to draw an imaginary box around the gesturer, so that the machine learning programs can find the person on the screen.
When you just click on the paused video image in Elan, the location of your pointer is copied into your paste buffer. This means you can click on the video image to define one corner of a rectangle that defines where on the screen the speaker is located:
In this screenshot I show two pointers, from two successive clicks. In the lower left, when I click I can paste this value: 68,325. That means I marked a location on the screen that is 68 pixels from the left edge (x-coordinate) and 325 pixels from the top (y-coordinate).
In the upper right, after clicking I can paste this value: 423,8. That means I marked a location on the screen that is 423 pixels from the left edge, and 8 pixels from the top.
In this way, I can mark the opposite sides of an invisible rectangle that encloses the speaker. If we could see the bounding box, it would look like this:
The rectangle itself is never visible, but when you click and paste opposite corners like this, the program that processes your annotations for machine learning projects will know that you drew a box and can figure out where the speaker is located on the screen.
You see the cursor (pointer) in the upper right-hand corner of the video picture: when I click there, I can then paste (Ctrl-v, or right-click and "Paste") into the Rectangle field.
When you click and then paste the numbers into the Rectangle field, you'll also see this: [448.0,336.0]. These numbers give the width and the height of the video image, the picture size. When you're done, the whole field should look like this:
68,325 [448.0,336.0]-423,8 [448.0,336.0]
That's just the first coordinate point (with the picture size), then a hyphen, and then the second coordinate point (with the same picture size again). You see the result below:
Once you get the hang of it, this is very fast. Check the numbers you have pasted to verify that they are different and that they make sense in terms of the x- and y-coordinates.
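If you're curious how an export script might read this field, here is a sketch. The helper parse_rectangle is hypothetical, not part of ELAN or the Red Hen tools, but it shows how the two pasted corner points and the picture size can be turned into a well-ordered bounding box:

```python
import re

def parse_rectangle(field):
    """Parse a hypothetical Rectangle field like
    '68,325 [448.0,336.0]-423,8 [448.0,336.0]' into a bounding box."""
    points = re.findall(r'(\d+),(\d+)\s*\[([\d.]+),([\d.]+)\]', field)
    (x1, y1, w, h), (x2, y2, _, _) = [tuple(map(float, p)) for p in points]
    left, right = sorted((x1, x2))   # order the x-coordinates
    top, bottom = sorted((y1, y2))   # order the y-coordinates
    return {"left": left, "top": top, "right": right, "bottom": bottom,
            "frame_width": w, "frame_height": h}

box = parse_rectangle("68,325 [448.0,336.0]-423,8 [448.0,336.0]")
```

Note that the two corners can be pasted in either order; sorting the coordinates recovers the same box either way.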
Finally we get to the gesture. In the Gesture field, you want to simply describe the gesture. The gesture description should capture the communicative intent rather than the physical motion (which we describe below). In this case, I use "Mock aggression". The terms you use could describe the emotion associated with the gesture, or in other ways convey what the gesture means. In the case of the "I wish I knew" gestures, most of them should be described as "Positive ignorance". Use a brief phrase that characterizes the overall meaning of the gesture.
The next tier, Circle, is very similar to the Rectangle tier. However, in this case you are marking a circle instead of a box, and you are marking just the gesture you are describing, not the whole person.
Start by clicking in the center of the gesture, where you see the pointer in this picture. When I paste that point, I get 239,286.
Next, click on the outermost boundary of the gesture, as you see in the pointer near the bottom right corner. When I paste that point, I get 425,306.
If we could see the circle that these two points define, it would look like this:
This circle helps inform the machine learning program where the gesture is located.
The final result will look something like this -- as with the rectangle, you won't actually see the bounding circle, but the numbers tell us where the gesture is happening:
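As an illustration of how these two pasted points define a circle, here is a small sketch (parse_circle is a hypothetical helper, not part of any Red Hen script): the first point is the center, and the distance from it to the second point gives the radius.

```python
import math

def parse_circle(center_field, edge_field):
    """Two points pasted from ELAN, e.g. '239,286' (center of the gesture)
    and '425,306' (outermost boundary), define a circle."""
    cx, cy = map(int, center_field.split(","))
    ex, ey = map(int, edge_field.split(","))
    radius = math.hypot(ex - cx, ey - cy)  # Euclidean distance
    return (cx, cy), radius

center, radius = parse_circle("239,286", "425,306")
```

With the example points from above, the circle is centered at (239, 286) with a radius of about 187 pixels.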
Finally, briefly describe the physical movements that constitute the gesture. This is where I'm asking for your help: we need to develop a standardized set of terms -- a so-called 'controlled vocabulary'. For the moment, just add in a brief description on one or more of the three tiers.
If the gesture is mainly happening with the head, use the head tier. If it involves the whole body, add your description in the Body field.
In this example, the gesture is mainly in the arms and hands, so I use the description "Raise both fists in front of shoulders." You might have another description, something like "Spread lower arms and hands from center to sides". As we collect your suggestions, we'll develop standard descriptions.
The final annotation looks like this:
Make sure you save the annotation and submit the .eaf file as instructed.
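Because .eaf files are plain XML, the export scripts that convert ELAN annotations into Red Hen formats can read them with standard tools. Here is a minimal sketch; the embedded fragment is simplified (real files carry many more attributes), but the time-slot and tier structure is the same:

```python
import xml.etree.ElementTree as ET

# A simplified .eaf fragment: time slots map IDs to milliseconds,
# and tiers hold annotations that reference those time slots.
EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="351800"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="353400"/>
  </TIME_ORDER>
  <TIER TIER_ID="Gesture">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Positive Ignorance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

root = ET.fromstring(EAF)
times = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE"))
         for ts in root.iter("TIME_SLOT")}

rows = []
for tier in root.iter("TIER"):
    for ann in tier.iter("ALIGNABLE_ANNOTATION"):
        rows.append((tier.get("TIER_ID"),
                     times[ann.get("TIME_SLOT_REF1")] / 1000.0,  # seconds
                     times[ann.get("TIME_SLOT_REF2")] / 1000.0,
                     ann.findtext("ANNOTATION_VALUE")))

for tier_id, start, end, value in rows:
    print(f"{tier_id}\t{start:.1f}s-{end:.1f}s\t{value}")
```

This is why keeping the .eaf file name identical to the media file name matters: a conversion script only has the file name to tie the annotations back to the right video.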
List functions here that may be useful in special cases.
1. Identify a location or area in a picture or video frame
2. Video player zoom
3. Template design
Here are suggestions for how to work around various things that could go wrong; feel free to add.
The most common problem with ELAN 4.9.2 is that the media files do not load automatically. Make sure that you have a folder on your Desktop called Elan_projects, with all of your media files. Save the .eaf files to the same folder. If you always have the .eaf file in the same folder as your media (.mp4 and .wav) files, you should avoid the media loading problem.
You know you've run into the problem when you see a warning. On a Mac, you may get this error message:
This looks ominous, but it's just a bug in ELAN; it has nothing to do with QuickTime. To work around it, click OK and then in the main menu of ELAN select View | Media Player, and then the input file:
If the Media Player does not give you the option of loading the correct video file, select Edit | Linked files from the main menu, and then select the mp4 and wav files.
When you load the media files manually like this, you may find that when the video plays, you hear several instances of the same sound track. If this happens, click on "Controls" and the radio button "Solo" next to the first video file:
You may find that the media loading issue comes and goes; this workaround should always work.
Avoid changing the names of your media files; it's important for Red Hen conversion that they have the same name as your annotation file.
Some people report that when they save a template file from the web, such as Redhen-04-single.etf, the extension .xml gets tacked on, and Elan is unable to use the file. The solution is to remove the .xml extension from the file name. On a Mac, the safe way to do this is to right-click on the name in Finder and select "Get info". Under Name & Extension erase the .xml extension.