Opening the Digital Silo: Multimodal Television Show Segmentation
Mentored by Anna Bonazzi, Tim Groeling, Kai Chan, and Luca Rossetto
We have a large collection of videotape recordings from the ‘70s to 2006 that were digitized in UCLA’s NewsScape TV News Archive (22,544 .mp4 files as of February 2019). Each of these files is between 3 and 8 hours long and contains a varying number of TV shows, mostly news and talk shows with commercials in between (see a short sample). Every .mp4 file comes with several companion files, including a .txt3 file with the captions of the video (like this) and sometimes an .ocr file with the text appearing on screen during the video (like this). You can find longer samples of the dataset here.
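As a rough illustration of how the collection might be traversed, here is a small sketch in Python. It assumes the companion files share the .mp4 file's base name and sit in the same directory, and the archive path is hypothetical; adjust both to the actual layout.

```python
from pathlib import Path

# Hypothetical location of the digitized recordings; adjust to the real archive layout.
ARCHIVE_DIR = Path("/path/to/newsscape")

def companion_files(video: Path) -> dict:
    """Collect the caption (.txt3) and on-screen-text (.ocr) files that
    share the video's base name, if they exist."""
    captions = video.with_suffix(".txt3")
    ocr = video.with_suffix(".ocr")
    return {
        "video": video,
        "captions": captions if captions.exists() else None,
        "ocr": ocr if ocr.exists() else None,
    }

for mp4 in sorted(ARCHIVE_DIR.glob("*.mp4")):
    files = companion_files(mp4)
    print(files["video"].name,
          "captions:", files["captions"] is not None,
          "ocr:", files["ocr"] is not None)
```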
We are currently working on manually annotating a sample of TV shows to identify and describe each show's starting boundary and to reconstruct the weekly TV schedules, in order to provide "ground truth" data for possible machine approaches. You can view an example of the annotations here and a growing collection of show boundary screenshots here.
Ideally, we need a segmentation pipeline that 1) detects the start boundary of each new TV show in a digitized recording; 2) recognizes which show it is; 3) splits the multi-hour video into smaller segments, one for each TV show, annotated with show title, channel, and showtime.
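To make the expected output of step 3 concrete, here is a minimal sketch, assuming boundaries have already been detected and that ffmpeg is available on the system. The field names, file names, and output layout are illustrative assumptions, not a fixed format.

```python
import json
import subprocess
from dataclasses import dataclass, asdict

@dataclass
class ShowSegment:
    title: str      # recognized in step 2, e.g. "Larry King Live"
    channel: str
    start: float    # seconds from the beginning of the recording (step 1)
    end: float

def split_recording(recording: str, segments: list[ShowSegment], out_prefix: str) -> None:
    """Cut the long recording into one .mp4 per show (stream copy, no re-encoding)
    and write the segment metadata next to each clip."""
    for i, seg in enumerate(segments):
        clip = f"{out_prefix}_{i:03d}.mp4"
        subprocess.run(
            ["ffmpeg", "-i", recording,
             "-ss", str(seg.start), "-to", str(seg.end),
             "-c", "copy", clip],
            check=True,
        )
        with open(f"{out_prefix}_{i:03d}.json", "w") as f:
            json.dump(asdict(seg), f, indent=2)

# Example call with made-up boundaries:
# split_recording("2006-01-01_0000_US_CNN.mp4",
#                 [ShowSegment("Larry King Live", "CNN", 0.0, 3600.0)],
#                 "segments/2006-01-01_CNN")
```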
Well, that is the challenge, isn’t it? The task isn’t trivial, due to the magnitude and diversity of the material to be processed. However, we do have some starting points:
Red Hen is open to developing segmentation tools in collaboration with vitrivr (https://vitrivr.org), which already contains functionality to search for a frame within a video, a sequence of music within a sound track, and clusters of words in a text. Vitrivr performs shot segmentation based on visual content during video extraction. This will result in very similar or even identical video sequences for TV-show intros, outros, etc., and the visual/audio features extracted from them will therefore also be very similar to each other. Given one or more instances of such sequences as examples, vitrivr is capable of retrieving the other instances from within the collection. However, there is currently no logic to remember or export such boundaries as semantically meaningful entities.
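This is not vitrivr's actual API, but a rough sketch of the underlying idea, assuming OpenCV is available: extract a compact visual feature (here, a color histogram) from a known intro frame and scan a recording for frames that are nearly identical to it. A real system would work on shot-level features rather than single sampled frames, but the matching logic is the same.

```python
import cv2
import numpy as np

def frame_histogram(frame) -> np.ndarray:
    """Compact visual feature: a normalized HSV color histogram."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def find_similar_frames(video_path: str, query_hist: np.ndarray,
                        threshold: float = 0.95, step_s: float = 1.0) -> list[float]:
    """Scan a recording (one sampled frame every `step_s` seconds) and return
    the timestamps whose histogram correlates strongly with the query frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    hits, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % max(1, int(fps * step_s)) == 0:
            score = cv2.compareHist(query_hist.astype("float32"),
                                    frame_histogram(frame).astype("float32"),
                                    cv2.HISTCMP_CORREL)
            if score >= threshold:
                hits.append(frame_idx / fps)
        frame_idx += 1
    cap.release()
    return hits
```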
Two concrete starting points for this project are:
You are encouraged to build upon this code or use sections of it (even though this is not a strict requirement) and think about how to combine these elements. You could start working on a set of 2006 videos that have been manually annotated with information on each show’s starting point, showtime, duration, and visual or audio cues for boundary detection. However, please remember that the manual annotations we have are few, while the videos we would like to segment are many: a viable system should not rely permanently on the existence of manual annotations.
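Since the manual annotations mainly serve as ground truth, one natural use is evaluation rather than training. Below is a small sketch of scoring detected boundaries against annotated show starts with a tolerance window; the function name and the flat list-of-timestamps format are assumptions, and the real annotation spreadsheet will need its own parser.

```python
def score_boundaries(detected: list[float], annotated: list[float],
                     tolerance_s: float = 10.0) -> tuple[float, float]:
    """Match each annotated show start to the nearest unused detection within
    `tolerance_s` seconds and report (precision, recall). Timestamps are
    seconds from the start of the recording."""
    matched = set()
    true_positives = 0
    for gt in annotated:
        candidates = [(abs(d - gt), i) for i, d in enumerate(detected)
                      if abs(d - gt) <= tolerance_s and i not in matched]
        if candidates:
            _, best = min(candidates)
            matched.add(best)
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall

# e.g. score_boundaries(detected=[12.0, 1805.3, 3610.0],
#                       annotated=[0.0, 1800.0, 3600.0])  # -> (0.67, 0.67)
```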
You could try triangulating information from different sources:
Any further ideas are welcome!