Blended Classic Joint Attention


This task focuses on communicative actions and events that occur in media in ways that differ from how they occur in face-to-face communication. These are events of Blended Classic Joint Attention (BCJA), which we explain below. Red Hen wants to be able to locate these events in its vast database of media. The goal is a computational system that automatically tags elements of BCJA, which is a problem for machine learning and machine recognition. To reach that goal, Red Hen needs "training sets" (also called "ground truth") of tagged data from which classifiers can be built through machine learning. A training set can be created by people tagging manually, and we have a system for that manual tagging. Red Hen can hire taggers, but they would be much more efficient if an unsupervised recognition system first proposed candidate tags for them to correct manually. The human researcher is the expert, and the expert's tags are ground truth; the computer offers candidate tags to the researcher, and the resulting ground truth can then be used to create classifiers through machine learning or other recognition systems. In other words, the computer comes to imitate the human tagger, though not perfectly. Red Hen therefore needs training sets of manually tagged elements in scenes of BCJA. Would you like to participate in this project? If so, write to

and we will connect you with a mentor.

More information

In joint attention, people attend jointly to something: they all know that they are doing so, they know that they all know this, and they know that they are engaging with each other through this joint attention, even if they are not communicating about it (e.g., Tomasello 1999). In classic joint attention (Thomas and Turner 2011), two (or so) people are jointly attending to something they both perceive in the same human environment, and they are communicating about it. They both know that they are attending to it, that they are engaging with each other by attending to it, and that they both know all of this. In classic joint attention, people point out objects or events to each other, and they communicate, even if minimally, about the object of their joint attention. Human beings are spectacularly equipped by evolution for classic joint attention. It is a foundation of our ability to teach, learn, and cooperate. Human communicative abilities, including language and gesture, are particularly dedicated to this basic scene of classic joint attention.

As Charles Fillmore has written, when we want to identify the most straightforward principles of communication, the language we study is “the language of people who are looking at each other or who are otherwise sharing some current experience and in which the hearer processes instantaneously what the speaker says” (Fillmore 1981: 165).

This is the scene of classic joint attention. In classic joint attention, participants have an understanding of “the ground”; that is, they understand a great deal about “the speech event, its setting, and its participants” (Langacker 1985: 113) without needing to refer to that understanding. Each of the participants understands that the other has a human mind; that each of them has a viewpoint on the conditions of space, time, participants, physical relationships, cultural situation, and so on; and that each knows that the other has such a viewpoint.

Classic joint attention is powerful and lets human beings feel comfortable and cooperative in many basic scenes of communication: a mother and child looking at a bird, companions watching the weather patterns in the sky, two people noticing a third approaching them on the road. These are all scenes of local experience, of joint and close attention, and of communication.

But human thought is remarkable for its ability to stretch across much more than such basic scenes. The sweep of human thought is vast, stretching over time, space, causation, and agency to create ideas that reach far beyond our local experience. We are able to exploit ideas that are both familiar to us and at a scale congenial to local experience. We can blend these familiar, local, congenial, experiential ideas with mental networks whose content stretches across very diffuse arrays, in ways that would be very difficult to grasp if we could not ground those diffuse arrays in familiar, experienced scenes. The mental network may not actually fit one of our familiar, experiential ideas, but we can blend that network with one of those ideas. We can make a compact mental blend based in familiar ideas, even though it includes other ideas that are not so familiar. For example, in the history of ideas, we often say that one scientist, writer, or philosopher was “trying to answer the question” that a previous thinker posed, or that the later thinker was “disputing” with the previous thinker. We talk about the “debate” between Lamarck and Darwin, or between Plato and Aristotle. But of course, these thinkers were not actually in a scene of classic joint attention where they were asking and answering, disputing, and debating. Our understanding of the vast network of agents and actions stretching over time and space is not actually a classic joint attention of conversation between two people, but we can blend this network with that familiar, experiential scene of conversation, and so understand it. In the blend, there is a debate between the two thinkers, who may not even have been alive at the same time. We are not fooled, but the blend is a very useful conceptual tool: it gives us a way to grasp the entire network of ideas.

Such a blend is intelligible, even though it treats elements stretching over time, space, causation, and agency, because it has some familiar, human-scale structure. In this case, it has the structure of conversation, even though it is not really a conversation: the earlier thinker, for example, cannot really reply to the questions posed by the later thinker, but we can take something the earlier thinker wrote and say, in the blend, that it is an “answer” to the question posed by the later thinker. Because of the familiar structure of such a blend, we can grasp the entire mental network. Mental networks grounded in compact blends often stretch far beyond what we would otherwise be able to conceive.

We often understand vast mental networks in part by blending them with the idea of classic joint attention, even though the network itself is not an example of classic joint attention. For example, the news anchor is not actually in a scene of classic joint attention with the viewer; terrorism is not a local object or event in a local scene that we can perceive directly; “here” for the participants in the news interaction is not actually a single shared space (“It’s good to have you here,” says the news anchor, but where is “here”?); and “now” for the participants need not be a particular moment. (“Now we have a special announcement coming up for you here,” says the news announcer. But perhaps the segment was recorded, and perhaps the announcer did not even know what the special announcement would be; who is “we,” and again, where is “here”?) Still, we can blend all these elements into a scene of blended classic joint attention, which is tractable and familiar because it draws on our understanding of classic joint attention. All the language available for running a scene of classic joint attention can be projected, adapted, and used for blended classic joint attention. Blended classic joint attention is a major cognitive resource, available perhaps only to cognitively modern human beings, roughly all human beings during the last 50,000 years or perhaps significantly longer. It is a scene we understand by blending the scene of classic joint attention with other things that do not in fact fit that scene.

Personal letters, telephone calls, walkie-talkies, writing, and many other technologies have led to common cultural scenes of blended classic joint attention (BCJA). In these scenes, it is not necessarily the case that those who are jointly attending are together in the same spatial or temporal environment, or even that they know of each other’s existence. One can keep a secret diary that one never means to show to anyone else, and yet, the concept of what we are doing in keeping that diary is formed partly by thinking of joint attention—even if the other intelligence paying attention in the blend is only imaginary, or is one of our future selves, or is a disembodied non-human intelligence. A letter we write can begin, “To Whom It May Concern.”

Broadcast news relies on a conception of a scene of BCJA that is extremely widespread. It is so common that fictional stories often give the hearer, reader, or viewer the backstory of the narrative in a fictional news clip: a film gets rolling by having one of the characters watch a quick fictional news broadcast, and then we, and the character, know what is going on. This is easy for us because we understand how the news works, and we understand how the news works largely because we are experts in blended classic joint attention.

Red Hen's purpose for this task is to find moments in news (serious news, daytime talk shows, late-night talk shows, interview shows, etc.) in which there are elements of scenes of BCJA, and to train computers to recognize those elements automatically. If Red Hen succeeds, she will be able to tag hundreds of thousands of hours of recordings for these elements by letting a machine-learning classifier do the tagging. Here is a beginning list of such moments:

    1. Gaze. News anchor alone, looking out of the screen. In CJA, people look at each other when talking. CJA is an input to the blend in which we are engaging with the news anchor. We are not really engaging with the anchor as we would be if we were talking with someone face-to-face; the idea of CJA is blended with many other inputs, and in the blend we are in a scene of blended classic joint attention. When the news anchor looks at the camera, we know, from the other inputs to the blend, that the anchor is not looking at us, does not know us, cannot see our response, and so on. The anchor is instead looking at a camera. But in the blend, we are interacting with the anchor.
    2. Gaze. Talk show host, standing or sitting alone, looking out of the screen. Same comment as under 1.
    3. Different gazes, both directed at the same viewer because of a camera switch. The speaker on screen looks from one camera to the other, addressing the viewer in both cases. Comment: in CJA, a speaker cannot move the listener by looking at a different spot. But in the news, this is exactly what happens, or at least it is what happens in the blend. The anchor switches topic and signals the switch by switching focus. Of course, magically, via technology, the viewer is the focus both before and after the shift.
    4. An example of the drop-in FaceTime anchor managing a conversation, with all participants looking out of the screen:
    5. Joint attenders both looking out of the screen, at each other and at the viewer. The anchor, looking out of the screen, starts a scene of blended classic joint attention with a reporter or respondent in the field. It is blended because, of course, the anchor and the reporter are not actually talking face-to-face. The reporter appears in another box on the screen. Both anchor and reporter are looking out of the screen, so the lines of their vision are parallel; yet we take this to mean that they can see each other, which is exactly what cannot happen in CJA.
      1. Handoff. The anchor then hands the stage off to the reporter or respondent, who takes up the whole screen, looking out of it at the viewer, of course, but also at the anchor, and addresses the anchor. Sometimes we hear the anchor's voice-over during the reporter's presentation from the field.
      2. Examples:
        1. A handoff 2 minutes into the video, in Edge at 66ee6902-b3b0-11e3-be98-089e01ba0338,2969
        2. A handoff 1:50 into the video, in Edge at 4f667890-d598-11e5-9c78-089e01ba0338,621
      3. Anchor returns to the screen, having been "present" while invisible. The anchor comes back on screen with the reporter, the reporter signs off, the reporter's box disappears, and the anchor addresses only the viewer.
    6. Drawing attention to something: in CJA, one points, or looks in the direction of the object of intended attention. There are tools to emphasize such pointing, such as a "laser pointer," or just a flashlight or searchlight. In the news, one often sees a visual change in an image: perhaps an arrow on the screen points to something in the photo, or a red circle is suddenly "drawn" around part of the image. Sometimes the emphasized part of the image is suddenly highlighted by lightening the circle in which the object of attention is depicted. This is not normal in CJA: if one points at an area of intended joint attention, that area of the environment is not suddenly highlighted. But here is an example in which lightening an area of the image is understood, in BCJA, as drawing attention to it:
      1. "Illumination" 1:50 into the video, in Edge at 66ee6902-b3b0-11e3-be98-089e01ba0338,2969
    7. Discourse management. In CJA, we can have a group face-to-face conversation and manage that discourse with speech or gesture, such as saying "Your turn" or just pointing. In BCJA, the participants may be in quite different places, using different technology. Suppose an anchor is managing such a blended conversation through language and gesture. How does the management differ in the BCJA scene from the CJA scene? For example, the anchor cannot actually point at someone who is not with the anchor, but may still point in some way to indicate who has the floor. Yet if that participant is talking by telephone, with no view of the other participants, the anchor is unlikely to expect that participant to respond to silent pointing. Etc.
    8. Reaction shots. In CJA, participants react, often in ways that the reactor means to be read and interpreted by one or more of the other participants. Reaction shots in BCJA are often extremely complicated. For example:
      1. The viewer of a broadcast or of a filmed play is often "shown" a reaction that could not be perceived, or was unlikely to be perceived, by one or more of the other participants. But the viewer understands the reaction as being shown to the viewer as a member of the scene of BCJA. Sometimes the "reactor" understands that this reaction is going to be "presented" to the viewer, and behaves so as to provide that reaction to the viewer, e.g., by looking at the camera or, in theater, at the audience.
      2. Etc.
    9. There are many such scenes in traditional art, sculpture, and architecture:
      1. John the Baptist looking out of the frame of the painting and pointing at Christ.
      2. Etc.
    10. There are many such scenes in print or video advertisements:
      1. WWII poster in which "Uncle Sam" says "I Want You For U.S. Army," creating a scene of BCJA with the viewer as a member. Analysis in Mark Turner (2014), "Blending in Language and Communication." See
      2. "When you ride alone, you ride with Hitler." Analysis in Mark Turner (2014), The Origin of Ideas, Oxford University Press, p. 101. See
      3. Etc.
    11. The speaker on camera says "right there" and points to and looks at the word "HERE" in the blended ground. See the tweet at
    12. A spy recording of an actual conversation might take in several elements of the discourse scene, but in television newscasts, where the camera's field of view takes in the entire discourse scene, speakers in the studio do many things they would not do in a scene of classic joint attention: they turn to speak to the camera, for example, or the viewpoint switches from one camera to another so that the viewer sees the front (rather than the back) of whoever happens to be speaking. Example:

Instances of BCJA also occur between characters in a comic, or between a character and the reader when the character becomes aware of its own fictional nature; this fairly common device is called "breaking the fourth wall." Example:

Instructions for obtaining clips

To request working clips, create a list below in the format specified. Mark each of the first three lines with a # symbol.


# Your name, in the form LastName_FirstName, e.g., Turner_Mark

# Topic

# Selection method or criteria—include your regex or CQPweb search if this is what generated your links
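The three header lines can be checked mechanically before a request is processed. The following is a minimal sketch, assuming a request is plain text with the three `#` lines first (blank lines ignored) and the clip lines after them; `split_request` is a hypothetical helper for illustration, not part of any Red Hen tool.

```python
from typing import List, Tuple


def split_request(text: str) -> Tuple[List[str], List[str]]:
    """Return (header_fields, clip_lines) for a clip request.

    Hypothetical helper: assumes the first three non-blank lines are the
    '#'-marked name, topic, and selection-method lines.
    """
    # Drop blank lines, since the template separates its lines with them.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header, clip_lines = lines[:3], lines[3:]
    if len(header) < 3 or not all(ln.startswith("#") for ln in header):
        raise ValueError("the first three lines must each start with '#'")
    name, topic, method = (ln.lstrip("#").strip() for ln in header)
    # The name is expected in the form LastName_FirstName, e.g. Turner_Mark.
    last, sep, first = name.partition("_")
    if not (last and sep and first):
        raise ValueError(f"name should look like LastName_FirstName: {name!r}")
    return [name, topic, method], clip_lines
```

The name check only looks for a non-empty part on each side of an underscore; it is a loose stand-in for the LastName_FirstName convention, not a full validation.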

List of clips

The list of clips can use any of these five formats:

2011-07-25_1200_US_FOX-News_Fox_and_Friends 00:00:30-00:00:44 (start and end times in hh:mm:ss)

2011-07-25_1200_US_FOX-News_Fox_and_Friends 30-44 (start and end times in seconds)

2007-12-24_0300_US_KCBS_60_Minutes 00:49:12 (single timestamp in hh:mm:ss)

66ee6902-b3b0-11e3-be98-089e01ba0338,2969 (UID with single timestamp in seconds)

,66ee6902-b3b0-11e3-be98-089e01ba0338,2969 (full permalink)

If you use a single timestamp, the default clip length is one minute, with the timestamp in the middle.
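To make the five formats and the one-minute default concrete, here is a minimal Python sketch of a parser for a single line of the clip list. It is an illustration under stated assumptions, not Red Hen's own tooling: `Clip`, `parse_clip_line`, and `centered_window` are hypothetical names, the UID pattern is inferred from the UIDs quoted on this page, and the permalink case assumes the link ends in the same `UID,seconds` pair as the UID format.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Per the instructions: a single timestamp gives a one-minute clip
# with the timestamp in the middle.
DEFAULT_CLIP_SECONDS = 60

HMS = r"\d{1,2}:\d{2}:\d{2}"
# Assumed UID shape, inferred from the UIDs on this page (8-4-4-4-12 hex).
UID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


@dataclass
class Clip:
    source: str  # recording name (e.g. 2007-12-24_0300_US_KCBS_60_Minutes) or a UID
    start: int   # seconds from the beginning of the recording
    end: int     # seconds from the beginning of the recording


def hms_to_seconds(text: str) -> int:
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def centered_window(t: int) -> Tuple[int, int]:
    """One-minute window with t in the middle; assumed clamp so it never starts below 0."""
    half = DEFAULT_CLIP_SECONDS // 2
    return max(0, t - half), t + half


def parse_clip_line(line: str) -> Optional[Clip]:
    """Parse one clip line; return None for blank or '#' lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # UID format and permalink: "UID,seconds", alone or at the end of a link.
    uid = re.search(rf"({UID}),(\d+)$", line)
    if uid:
        return Clip(uid.group(1), *centered_window(int(uid.group(2))))
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"expected '<recording> <times>': {line!r}")
    name, spec = parts
    if hms_range := re.fullmatch(rf"({HMS})-({HMS})", spec):  # hh:mm:ss-hh:mm:ss
        start = hms_to_seconds(hms_range.group(1))
        end = hms_to_seconds(hms_range.group(2))
    elif sec_range := re.fullmatch(r"(\d+)-(\d+)", spec):  # seconds-seconds
        start, end = int(sec_range.group(1)), int(sec_range.group(2))
    elif re.fullmatch(HMS, spec):  # single hh:mm:ss timestamp
        start, end = centered_window(hms_to_seconds(spec))
    else:
        raise ValueError(f"unrecognized time format: {spec!r}")
    if end <= start:
        raise ValueError(f"clip ends before it starts: {line!r}")
    return Clip(name, start, end)


if __name__ == "__main__":
    for example in [
        "2011-07-25_1200_US_FOX-News_Fox_and_Friends 00:00:30-00:00:44",
        "2011-07-25_1200_US_FOX-News_Fox_and_Friends 30-44",
        "2007-12-24_0300_US_KCBS_60_Minutes 00:49:12",
        "66ee6902-b3b0-11e3-be98-089e01ba0338,2969",
    ]:
        print(parse_clip_line(example))
```

Two choices here are assumptions rather than documented behavior: a single timestamp less than 30 seconds into a recording is clamped to start at 0, and compact entries such as `0150-0210` (used in some requests on this page) are read as plain seconds under the seconds format, so they would need to be written in `hh:mm:ss` form if minutes and seconds were intended.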

Clips to be obtained

Clips obtained


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 00:01:50-00:02:10

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 00:09:30-00:09:45

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 00:10:45-00:10:55

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 00:11:20-00:11:30

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 00:01:15-00:02:20

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 00:02:55-00:03:30

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 00:04:55-00:05:20


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 0150-0210

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 0930-0945

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 1045-1055

2016-06-27_2200_US_KNBC_The_Ellen_DeGeneres_Show 1120-1130


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 0115-0220

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 0255-0330

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 0455-0520


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

2016-06-28_0100_US_CNN_Anderson_Cooper_360 0315-0330

2016-06-28_0100_US_CNN_Anderson_Cooper_360 0700-0730

2016-06-27_2300_US_CNN_Erin_Burnett_Out_Front 0130-0150

2016-06-27_2300_US_FOX-News_On_the_Record_with_Greta_Van_Susteren 0025-0130


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

2016-06-27_0100_US_KABC_Eyewitness_News_6PM 130-230


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

2014-01-23_1800_US_MSNBC_News_Live 550-660


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine. (Needed some more data for the distance metric)

2016-05-13_0100_US_FOX-News_The_Kelly_File 00:00:00-00:59:53

2016-05-13_2300_US_KNBC_The_Ellen_DeGeneres_Show 00:00:00-00:59:54


#Blended Classic Joint Attention

#Browsing in the Edge Search Engine

f453c876-be6d-11dc-b3fd-1b0d42bc6500,960

2568fa14-1967-11e6-9467-089e01ba0326,1890

f453c876-be6d-11dc-b3fd-1b0d42bc6500,610

f453c876-be6d-11dc-b3fd-1b0d42bc6500,1250

3eb588d2-be6e-11dc-8f17-23d2d4b3d221,2170

f453c876-be6d-11dc-b3fd-1b0d42bc6500,1910

043a88b0-1261-11e6-8333-089e01ba0770,1190