Multimodal Television Show Segmentation

What we have

We have a large collection of videotape recordings from the ‘90s to 2006, digitized in UCLA’s NewsScape TV News Archive (13,874 .mp4 files). Each file is about 8 hours long and contains an unsegmented sequence of TV shows, mostly news and talk shows with commercials in between (see a short sample). Every .mp4 file comes with several companion files, including a .txt3 file with the video’s captions (like this) and sometimes an .ocr file with the text appearing on screen during the video (like this). You can find longer samples of the dataset here.

We are currently annotating a sample of TV shows by hand to identify and describe the start boundary of each show and to reconstruct the weekly TV schedules, in order to provide "ground truth" data for machine learning approaches. You can view an example of the annotations here and a growing collection of show boundary screenshots here.

What we need

Ideally, we need a segmentation pipeline that 1) detects the start boundary of each new TV show in the 8-hour digitized recording; 2) recognizes what show it is; 3) splits the 8-hour video into smaller segments, one for each TV show, with annotations of show title, channel, and showtime.

How to achieve this

Well, that is the challenge, isn’t it? The task isn’t trivial, given the magnitude and diversity of the material to be processed. However, we do have some starting points:

  • Feeding the video files into a machine learning algorithm that automatically recognizes show boundaries and patterns has been tried before, without success. One reason this approach gave poor results is the imprecision and irregularity of the video files used for model training. Specifically, boundary detection was framed as a binary classification task (boundary present: true/false) over 10-second segments of captioning text (a sketch of this framing appears after this list).

  • A promising approach is to use a combination of text, audio, and visual cues to detect show and episode boundaries, paying particular attention to the initial boundary (i.e. the transition between the end of the previous commercial or show and the beginning of the next show). Your project should assemble multiple cues associated with these boundaries, such as recurring phrases, theme music, and opening visual sequences, and then develop robust statistical methods to locate the most probable spot where one show ends and another begins (see the fusion sketch after this list).

  • Consider that many of the videos share similar show boundary patterns. Videos were recorded on different recorders, each following a specific schedule (a specific channel or set of shows). This means that certain groups of videos all reproduce the same daily or weekly schedule for a few months, because they come from tapes that were used consecutively in the same recorder set to the same schedule. In a given year you will find up to 15 of these sets: the videos are marked with a "V" number (V1 to V15) according to the schedule they originally followed. This number is part of each video's filename, as in 2006-01-04_0000_US_00000063_V2_MB7_VHS7_H11_MS.mp4. Although the show schedules vary over the years and sometimes over seasons, you will be able to find a repetitive pattern of start boundaries (video/audio sequences, specific frames) in several consecutive videos with the same V number (see the grouping sketch after this list). We are currently trying to identify which V numbers have the most regular patterns.
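
For reference, here is a minimal sketch of that earlier framing, assuming captions are available as (seconds, text) pairs; the caption lines and boundary time below are synthetic placeholders:

```python
# Bucket caption text into 10-second windows and label each window
# True/False for boundary presence, yielding binary classification examples.
from collections import defaultdict

def windowed_examples(captions, boundaries, window_s=10):
    """captions: list of (seconds, text); boundaries: list of seconds."""
    buckets = defaultdict(list)
    for t, text in captions:
        buckets[int(t // window_s)].append(text)
    labeled = {int(b // window_s) for b in boundaries}
    return [(" ".join(texts), window in labeled)
            for window, texts in sorted(buckets.items())]

captions = [(3.0, ">> GOOD EVENING."),
            (12.0, "WELCOME TO THE NEWSHOUR."),
            (21.0, "OUR TOP STORY TONIGHT...")]
print(windowed_examples(captions, boundaries=[11.0]))
# [('>> GOOD EVENING.', False), ('WELCOME TO THE NEWSHOUR.', True), ...]
```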
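
As a starting point for the multi-cue idea, here is a minimal late-fusion sketch: each modality contributes a per-second boundary score in [0, 1], the weighted sum is smoothed, and well-separated peaks become boundary candidates. The weights, threshold, and smoothing window are placeholder assumptions, not tuned values:

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks

def fuse_cues(text, audio, video, weights=(0.4, 0.3, 0.3),
              smooth_s=15, min_gap_s=600, threshold=0.5):
    """Each input is a per-second boundary score array; returns candidate
    boundary times (seconds) where the smoothed weighted sum peaks."""
    combined = weights[0] * text + weights[1] * audio + weights[2] * video
    combined = uniform_filter1d(combined, size=smooth_s)
    # Assume consecutive show starts are at least min_gap_s apart
    peaks, _ = find_peaks(combined, height=threshold, distance=min_gap_s)
    return peaks

# Synthetic example: an 8-hour recording at 1 Hz with one obvious boundary
T = 8 * 3600
rng = np.random.default_rng(0)
text, audio, video = (0.2 * rng.random(T) for _ in range(3))
text[3600:3630] = audio[3600:3630] = video[3600:3630] = 1.0  # cues agree
print(fuse_cues(text, audio, video))  # -> one candidate near 3600 s
```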
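
Grouping recordings by their V number is straightforward from the filenames; a minimal sketch (the videos/ directory is a hypothetical path):

```python
import re
from collections import defaultdict
from pathlib import Path

V_PATTERN = re.compile(r"_V(\d{1,2})_")  # e.g. "_V2_" in the filename

def group_by_v(directory):
    """Map each V number to the recordings that followed that schedule."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).glob("*.mp4")):
        match = V_PATTERN.search(path.name)
        if match:
            groups[int(match.group(1))].append(path.name)
    return groups

for v, files in sorted(group_by_v("videos/").items()):
    print(f"V{v}: {len(files)} recordings")
```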

Red Hen is open to developing segmentation tools in collaboration with vitrivr, which already contains functionality to search for a frame within a video, a sequence of music within a soundtrack, and clusters of words in a text. During extraction, vitrivr performs shot segmentation based on visual content. TV show intros, outros, etc. will therefore yield very similar or even identical video sequences, and the visual/audio features extracted from them will also be very similar to each other. Given one or more instances of such a sequence as examples, vitrivr is capable of retrieving the other instances from within the collection. However, there is currently no logic to remember or export such boundaries as semantically meaningful entities.
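
Independently of vitrivr's actual API, the retrieval idea can be sketched as nearest-neighbor search over shot feature vectors; the 128-dimensional descriptors below are random placeholders standing in for real extracted features:

```python
import numpy as np

def top_matches(query, shot_features, k=5):
    """Return the k shots most similar to the query, by cosine similarity."""
    q = query / np.linalg.norm(query)
    F = shot_features / np.linalg.norm(shot_features, axis=1, keepdims=True)
    similarities = F @ q
    best = np.argsort(similarities)[::-1][:k]
    return list(zip(best.tolist(), similarities[best].round(3).tolist()))

rng = np.random.default_rng(1)
shots = rng.random((10_000, 128))                   # placeholder descriptors
known_intro = shots[4242] + 0.01 * rng.random(128)  # a labeled intro shot
print(top_matches(known_intro, shots))              # shot 4242 ranks first
```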

Where to look

You could start by working on a set of 2006 videos that have been manually annotated with each show’s starting point, showtime, duration, and visual or audio cues for boundary detection. You could try triangulating information from different sources:

  • Video: what on-screen elements mark the beginning of a show? It could be a black frame marking a transition, a theme sequence, a large full-screen logo, or perhaps a small logo that appears in a corner of the screen at the beginning of the show and differentiates the current show from the previous one (see the black-frame sketch after this list). You can find a collection of useful images (logos, full-screen images) here.

  • Audio: what sounds are likely to appear at the beginning of a show (and hopefully nowhere else)? It could be a theme song or perhaps a catchphrase used consistently by the host of a specific show. Can a show be recognized by its audio? (See the theme-matching sketch after this list.)

  • Text: what words (spoken or written) might mark the beginning of the show? You can search both in the captions and in the OCR text (see the caption-scanning sketch after this list). A few examples:

    • “Caption”/“Captions”: many videos have a captions/subtitles file that includes the words “Captions by…” near the beginning of each show, sometimes coinciding with the exact beginning, sometimes a few seconds off.

    • “Type=Story start” / “Type=Commercial”: many caption files include specific marker lines that introduce a block of lines coming from a commercial and a block of lines coming from the actual show.

    • Does the captioning text include any show title, channel or host information? If yes, you could try to match the video with an existing TV schedule.

    • Are there phrases like “Welcome to”, “You’re watching”, “I’m”, “Welcome back”, that you can use to identify the show?
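
A minimal black-frame detection sketch with OpenCV, flagging frames whose mean luminance falls below a threshold (the threshold and sampling rate are placeholder assumptions):

```python
import cv2

def black_frames(path, threshold=10.0, sample_every=5):
    """Return timestamps (seconds) of sampled frames that are nearly black."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 29.97  # fall back to NTSC frame rate
    hits, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if gray.mean() < threshold:
                hits.append(index / fps)
        index += 1
    cap.release()
    return hits

print(black_frames("2006-01-04_0000_US_00000063_V2_MB7_VHS7_H11_MS.mp4"))
```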
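
A minimal theme-music matching sketch with librosa: slide a log-mel template of a known theme across the recording's audio and report the best-scoring offset (theme.wav and recording.wav are hypothetical files, e.g. demuxed with ffmpeg):

```python
import numpy as np
import librosa

SR, HOP = 22050, 512

def mel_template(path):
    """L2-normalized log-mel spectrogram; frame correlations land in [-1, 1]."""
    y, _ = librosa.load(path, sr=SR)
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=SR, hop_length=HOP))
    return S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-9)

theme = mel_template("theme.wav")          # short clip of a known theme song
recording = mel_template("recording.wav")  # audio track of an 8-hour file

n = theme.shape[1]
# Average per-frame cosine similarity at every possible offset
scores = np.array([np.mean(np.sum(theme * recording[:, i:i + n], axis=0))
                   for i in range(recording.shape[1] - n)])
best = int(scores.argmax())
print(f"Best match at {best * HOP / SR:.1f}s (score {scores[best]:.3f})")
```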
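
And a minimal caption-scanning sketch that flags the cue phrases listed above; it assumes pipe-delimited caption lines whose first field is a timestamp and whose last field is the text (check the linked .txt3 samples for the exact layout):

```python
import re

CUES = re.compile(
    r"captions? by|welcome to|you're watching|welcome back"
    r"|type=story start|type=commercial",
    re.IGNORECASE,
)

def caption_cues(path):
    """Return (timestamp, text) pairs for caption lines containing a cue."""
    hits = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split("|")
            if len(parts) < 2:
                continue
            timestamp, text = parts[0], parts[-1]
            if CUES.search(text):
                hits.append((timestamp, text.strip()))
    return hits

for ts, text in caption_cues("sample.txt3"):  # hypothetical filename
    print(ts, text)
```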

Any further ideas are welcome!