Multimodal Television Show Segmentation


Opening the Digital Silo: Multimodal Television Show Segmentation

Mentored by Anna Bonazzi, Tim Groeling, Kai Chan, and Luca Rossetto

What we have

We have a large collection of videotape recordings from the ‘70s to 2006 that were digitized in UCLA’s NewsScape TV News Archive (22,544 .mp4 files as of February 2019). Each of these files is between 3 and 8 hours long and contains a varying number of TV shows, mostly news and talk shows, with commercials in between (see a short sample). Every .mp4 file comes with several other files, including a .txt3 file with the captions of the video (like this) and sometimes an .ocr file with the text appearing on screen during the video (like this). You can find longer samples of the dataset here.

We are currently working on manually annotating a sample of TV shows to identify and describe the starting boundaries of the shows and to reconstruct the weekly TV schedules, in order to provide "ground truth" data for possible machine approaches. You can view an example of the annotations here and a growing collection of show boundary screenshots here.

What we need

Ideally, we need a segmentation pipeline that 1) detects the start boundary of each new TV show in the 8-hour digitized recording; 2) recognizes what show it is; 3) splits the 8-hour video into smaller segments, one for each TV show, with annotations of show title, channel and showtime.
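To make step 3 concrete, here is a minimal sketch of how the final splitting stage could work once steps 1 and 2 have produced boundary timestamps and show annotations. The `ShowSegment` structure, the field names, and the example show title are illustrative assumptions, not part of any existing pipeline; the ffmpeg invocation uses stream copy to avoid re-encoding 8-hour files.

```python
from dataclasses import dataclass

@dataclass
class ShowSegment:
    """One detected show: hypothetical output of pipeline steps 1-2."""
    title: str    # recognized show title
    channel: str  # broadcast channel
    start: float  # start boundary, in seconds into the recording
    end: float    # start of the next show (or end of file)

def ffmpeg_split_command(source: str, seg: ShowSegment, out_path: str) -> list[str]:
    """Build an ffmpeg command that cuts one show out of the long recording.

    Stream copy (-c copy) avoids re-encoding, which matters at this scale;
    the cut then snaps to the nearest keyframe, so the actual boundary may
    be off by a second or two.
    """
    return [
        "ffmpeg", "-ss", f"{seg.start:.2f}", "-to", f"{seg.end:.2f}",
        "-i", source, "-c", "copy", out_path,
    ]

# Example: one show detected in an 8-hour tape (title/times are made up).
seg = ShowSegment(title="Evening News", channel="CNN", start=3600.0, end=5400.0)
cmd = ffmpeg_split_command(
    "2006-01-04_0000_US_00000063_V2_MB7_VHS7_H11_MS.mp4",
    seg, "2006-01-04_CNN_Evening_News.mp4",
)
```

The command list can then be executed with `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes the splitting logic easy to test without ffmpeg installed.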

How to achieve this

Well, that is the challenge, isn’t it? The task isn’t trivial, due to the magnitude and diversity of the material to be processed. However, we do have some starting points:

    • Feeding the video files into a machine learning algorithm that automatically recognizes show boundaries and patterns has been tried before, without success. One reason this approach gave poor results is the high imprecision and irregularity of the video files used for model training. Specifically, one boundary detection approach framed the problem as a binary classification task (boundary present: true/false) over 10-second segments of captioning text.
    • Recognizing TV shows based exclusively on manual annotations was also tried, and while it did work, it did not yield a reliable method that could be applied in the absence of manual annotations (which is the case for most of our videos).
    • An optimal approach is to use a combination of text, audio, and visual cues to detect the TV show boundaries, giving particular attention to the initial boundary (i.e. the transition between the end of the previous commercial or show and the beginning of the next show). Your project should assemble multiple cues associated with these boundaries, such as recurring phrases, theme music, and opening visual sequences, and then develop robust statistical methods to locate the most probable point where one show ends and another begins.
    • Consider that many of the videos share similar show boundary patterns. Videos were recorded on different recorders, each following a specific schedule (a specific channel or set of shows). This means that certain groups of videos all reproduce the same daily or weekly schedule for a few months. In a given year, you will find up to 15 of these schedules: the videos are marked with a "V" number (V1 to V15) according to the schedule they originally followed. This number is part of the videos' filenames, like 2006-01-04_0000_US_00000063_V2_MB7_VHS7_H11_MS.mp4. Although the show schedules vary over the years and sometimes over seasons, you will be able to find a repetitive pattern of start boundaries across consecutive videos with the same V number. We are currently trying to identify which V numbers have the most regular patterns.
    • After 2006, videos were recorded digitally, so they are available in show-sized segments (which is the goal we'd like to achieve with the born-analog videos as well). You could use these born-digital videos as a reference for repetitive TV show boundary patterns, to train a model, to identify logos, anchors' faces, music, etc.
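Since the V number is embedded in the filename, grouping recordings by schedule is a simple parsing step. Below is a minimal sketch; the filename layout up to the V number follows the sample name above (date, recording time, country code, numeric ID, V number), while the meaning of the trailing fields is not assumed.

```python
import re

# Fields up to the V number, as seen in names like
# 2006-01-04_0000_US_00000063_V2_MB7_VHS7_H11_MS.mp4
FILENAME_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{4})_(?P<country>[A-Z]{2})"
    r"_(?P<id>\d+)_V(?P<schedule>\d+)_"
)

def parse_recording_name(filename: str) -> dict:
    """Extract date, recording time, country code, and V (schedule) number."""
    m = FILENAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename format: {filename}")
    fields = m.groupdict()
    fields["schedule"] = int(fields["schedule"])
    return fields

info = parse_recording_name("2006-01-04_0000_US_00000063_V2_MB7_VHS7_H11_MS.mp4")
# info["schedule"] == 2: this tape followed the V2 recording schedule
```

Grouping all files with the same V number (within the same months) would then give you batches of tapes that should repeat the same daily or weekly boundary pattern.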

Red Hen is open to developing segmentation tools in collaboration with vitrivr, which already contains functionality to search for a frame within a video, a sequence of music within a soundtrack, and clusters of words in a text. Vitrivr performs shot segmentation based on visual content during video extraction. This will result in very similar or even identical video sequences for TV show intros, outros, etc. The visual/audio features extracted from them will therefore also be very similar to each other. Given one or more instances of such sequences as examples, vitrivr is capable of retrieving the other instances from within the collection. However, there is currently no logic to remember or export such boundaries as semantically meaningful entities.
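The retrieval idea described above rests on feature similarity: repeated intro sequences produce nearly identical feature vectors. Outside of vitrivr, the core mechanism can be sketched in a few lines, assuming per-shot feature vectors have already been extracted by some pipeline (the extraction itself is not shown and the 64-dimensional toy features below are synthetic).

```python
import numpy as np

def find_similar_shots(query: np.ndarray, shot_features: np.ndarray,
                       threshold: float = 0.95) -> list[int]:
    """Return indices of shots whose feature vectors are nearly identical
    to the query (e.g. other airings of the same show intro).

    shot_features: (n_shots, dim) matrix, one row of visual/audio
    features per shot, assumed precomputed elsewhere.
    """
    q = query / np.linalg.norm(query)
    f = shot_features / np.linalg.norm(shot_features, axis=1, keepdims=True)
    sims = f @ q  # cosine similarity of every shot to the query
    return [int(i) for i in np.flatnonzero(sims >= threshold)]

# Toy demo: shot 2 repeats the intro (shot 0) almost exactly.
rng = np.random.default_rng(0)
intro = rng.normal(size=64)
shots = np.stack([intro,
                  rng.normal(size=64),
                  intro + 0.01 * rng.normal(size=64)])
hits = find_similar_shots(intro, shots)  # shots 0 and 2 match
```

In practice the missing piece, as noted above, is not retrieval but bookkeeping: remembering which retrieved shots correspond to show starts and exporting those timestamps as boundaries.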

Where to look

Two concrete starting points for this project are:

  • The code developed in 2018 by Awani Mishra, a former GSoC student, who worked on distinguishing music from speech and recognizing logos in order to identify theme songs and potential TV show starting boundaries based on the available manual annotations;
  • The code currently being developed by Abdullah Elqaq, who is working on face recognition to identify TV show anchors or other main figures, whose appearance can help mark the beginning of a show.

You are encouraged to build upon this code or use sections of it (even though this is not a strict requirement) and think about how to combine these elements together. You could start working on a set of 2006 videos that have been manually annotated with information on each show’s starting point, showtime, duration, and visual or audio cues for boundary detection. However, please remember that the manual annotations we have are few, while the videos we would like to segment are many: a viable system should not rely permanently on the existence of manual annotations.

You could try triangulating information from different sources:

    • Video: what on-screen elements mark the beginning of a show? It could be a black frame marking a transition, a theme sequence, the face of one or more anchors, a large full-screen logo, or perhaps a small logo that appears at the beginning of the show in a corner of the screen and differentiates the current show from the previous one. You can find a collection of useful images (logos, full-screen images) here.
    • Audio: what sounds are likely to appear at the beginning of a show (and hopefully nowhere else)? It could be a theme song or perhaps a catchphrase used consistently by the host of a specific show. Can a show be recognized by its audio?
    • Text: what words (spoken or written) might mark the beginning of the show? You can search both in the captions and in the OCR text. A few examples:
      • “Caption”/“Captions”: many videos have a file with captions/subtitles that includes the words “Captions by…” near the beginning of each show, sometimes coinciding with the exact beginning, sometimes a few seconds off.
      • “Type=Story start” / “Type=Commercial”: many caption files use specific marker lines to introduce a block of lines coming from a commercial versus a block of lines coming from the actual show.
      • Does the captioning text include any show title, channel or host information? If yes, you could try to match the video with an existing TV schedule.
      • Are there phrases like “Welcome to”, “You’re watching”, “I’m”, “Welcome back”, that you can use to identify the show?
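A first pass over the text cues listed above can be a simple pattern scan of the caption files. The sketch below uses the cue phrases named in this section; the four-line caption sample is invented for illustration and is much simpler than a real .txt3 file, which also carries timestamps that you would need to map hits back to video time.

```python
import re

# Cue patterns taken from the list above.
BOUNDARY_CUES = [
    re.compile(r"Captions? by", re.IGNORECASE),
    re.compile(r"Type=Story start"),
    re.compile(r"\b(Welcome to|You're watching|Welcome back)\b", re.IGNORECASE),
]

def find_cue_lines(caption_lines: list[str]) -> list[int]:
    """Return indices of caption lines containing a likely show-start cue.

    A real system would then cluster nearby hits and cross-check them
    against audio/visual evidence rather than trust any single line.
    """
    return [i for i, line in enumerate(caption_lines)
            if any(p.search(line) for p in BOUNDARY_CUES)]

captions = [
    "Type=Commercial",
    ">> Welcome to the evening news.",
    "Captions by the National Captioning Institute.",
    "The weather today was sunny.",
]
hits = find_cue_lines(captions)  # lines 1 and 2 look like show starts
```

Such lexical hits are cheap to compute over the whole archive, which makes them a good candidate for the first stage of a multimodal triangulation: text proposes candidate boundaries, audio and video confirm or reject them.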

Any further ideas are welcome!