How to use the Edge search engine

https://tvnews.sscnet.ucla.edu/edge/ 

Contents
  1. Introduction
  2. Front Page- Basic Search
  3. Advanced Search
    1. Display Format
    2. Search Boxes
    3. Search
    4. Export
    5. Regular Expression Mode
  4. Browse
  5. Regular Expression
  6. Search Results
    1. Video
    2. Text
    3. Montage
    4. Image Flow
    5. Metadata
    6. Permalink
    7. Bookmark
    8. Exporting
  7. Job List


A. Introduction

Various sites capture TV news from around the world. Closed captioning text in English and other languages is digitized, captured, and loaded into this search engine. The search engine can potentially also include the text of on-screen text boxes (harvested through optical character recognition), the text of transcripts, and other aspects of the broadcast.




B. Front Page- Basic Search

Front Page

A UCLA account is required to access this archive. (For researchers in the Distributed Little Red Hen Lab, one of the co-directors must approve your acquisition of a UCLA account and monitor your research.) After you log in, you are taken to the main page, which provides the basic search. Results are shown 10 programs per page.

You can also access Advanced Search and Browse from this page.



C. Advanced Search

Advanced Search Screen

The advanced search screen lets you control the display format and specify networks, series, and date ranges through menus.

  1. Display Format

    There are three different display formats available: list, table, and chart.
    • List: Gives you a list of the programs that meet your search requirements.
    • Table: Lists a month's worth of results by date, number of programs that meet the search requirement, number of hits in those programs, and total programs for that day.
      Table Results Screen

    • Chart: Displays the data of search results in charts. You can choose from area chart, bar chart, column chart, line chart, stepped area chart, or table. Each chart will display one of the following: occurrence counts (cumulative/non-cumulative) or news program counts (cumulative/non-cumulative).
      Chart Results Screen
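
The difference between the cumulative and non-cumulative chart options can be illustrated with a short Python sketch. The daily counts below are made-up sample values, and the function name is hypothetical; this only shows the arithmetic, not how Edge computes its charts.

```python
from itertools import accumulate


def cumulative(counts):
    """Turn per-day (non-cumulative) counts into a running total."""
    return list(accumulate(counts))


# Hypothetical occurrence counts for four consecutive days.
daily_counts = [3, 0, 5, 2]
print(cumulative(daily_counts))
```

A non-cumulative chart would plot `daily_counts` directly, while a cumulative chart would plot the running total.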


  2. Search Boxes
    • with all the words/phrases: the result contains all of the words/phrases.
    • with at least one word/phrase: the result contains at least one of the words/phrases.
    • without the words/phrases: the result does not contain any of the words/phrases.
    • with all the words/phrases within...: the result contains all of the words/phrases near each other (e.g., within the same segment, within 5 words, within 10 words, etc.)
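
The logic of the first three search boxes can be sketched in Python as follows. This is only an illustration of the semantics described above, assuming simple whitespace tokenization and case-insensitive comparison; the function names are hypothetical, the proximity box is not covered, and this is not the engine's actual implementation.

```python
def contains_phrase(tokens, phrase):
    """True if the words of `phrase` occur consecutively in `tokens`."""
    target = phrase.lower().split()
    n = len(target)
    return any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def matches(text, all_of=(), any_of=(), none_of=()):
    """Apply the 'all', 'at least one', and 'without' boxes to one text."""
    tokens = text.lower().split()
    # "with all the words/phrases": every entry must be present.
    if not all(contains_phrase(tokens, p) for p in all_of):
        return False
    # "with at least one word/phrase": only checked when the box is filled in.
    if any_of and not any(contains_phrase(tokens, p) for p in any_of):
        return False
    # "without the words/phrases": none of the entries may be present.
    return not any(contains_phrase(tokens, p) for p in none_of)


caption = "the president will vote today"
print(matches(caption, all_of=["the president", "vote"], none_of=["senate"]))
```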

  3. Search
    The search button will lead you to the results page.


  4. Export
    The export button will lead you to the Job List page. It exports your search to a file that you can open in another program on your computer. You can continue searching while the export job is completing.

    Searches are exported to CSV (comma-separated values) files. They can be read by spreadsheet programs such as Excel, Numbers, and OpenOffice. If prompted, select line 6 to start the rows, the UTF-8 character set, comma as the field delimiter, and double quote as the text delimiter.

    In the spreadsheet application, you can convert the URLs in the CSV file into hyperlinks. Click a hyperlink to call up that search result.
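
If you prefer to read an export in a script rather than a spreadsheet, the same import settings can be applied with Python's standard csv module. This is a minimal sketch: the function name and the sample data are hypothetical, and the column layout of a real export is not assumed here, so rows are returned as plain lists.

```python
import csv
import io


def read_edge_export(stream, first_row_line=6):
    """Return the rows of an export, skipping the lines before `first_row_line`."""
    for _ in range(first_row_line - 1):
        next(stream, None)  # discard the preamble lines
    reader = csv.reader(stream, delimiter=",", quotechar='"')
    return [row for row in reader if row]


# Made-up stand-in for an export: five preamble lines, then the rows.
sample = (
    "line 1\nline 2\nline 3\nline 4\nline 5\n"
    '"first, quoted field",http://example.org/1\n'
    "second field,http://example.org/2\n"
)
for row in read_edge_export(io.StringIO(sample)):
    print(row)
```

For a real file, open it with the UTF-8 character set, for example `open(path, encoding="utf-8", newline="")`, and pass the file object to the function.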

  5. Regular Expression Mode
    • Fast: uses a faster, indexed version of the closed caption files, with no punctuation included. Fast is case-insensitive. The Fast version additionally supports word placeholders (see Placeholders under Regular Expression).
    • Raw Text: uses the raw closed caption files, including punctuation. It is case-sensitive.
    • Both modes use mostly the same syntax.

    Test Page

    Test: opens a new tab with the test page, where you can test a regex pattern.


D. Browse

Image Flow

The browse function will take you to the image flow set to today's date.



E. Regular Expression

Regular Expressions Guide

You may test a regex pattern through the test link found on the Advanced Search page.
  1. Basic Syntax
    A regular expression pattern can contain subpatterns separated by spaces. Each subpattern matches one word, consecutive subpatterns match consecutive words, and the pattern as a whole matches a phrase.

    A regular expression takes the form of /subpattern1 subpattern2 ... subpatternN/

    Each subpattern can contain the following:

    String   Matches                                                               Example
    .        any character (within a word)
    [a-z]    a single character from a to z
    |        any of the elements                                                   (he|she|it) matches any of the words "he", "she" or "it"
    ?        the preceding element occurring at most once                          (en)?large matches the words "large" or "enlarge"
    {m,n}    the preceding element occurring at least m times and at most n times  grea{2,4}t matches the words "greaat", "greaaat" or "greaaaat"

    The regular expression search uses the BRICS automaton package; see that package's regular expression syntax reference for details.

    Currently, patterns must be entered in lower case.
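
The basic syntax can be tried out offline with a small Python sketch. Note the assumptions: it uses Python's `re` module rather than the BRICS automaton package, so only the constructs listed in the table above are expected to behave the same way, and the function name is hypothetical. Each subpattern must match one whole word, and consecutive subpatterns must match consecutive words.

```python
import re


def find_phrases(pattern, text):
    """Return the phrases in `text` matched by a /sub1 sub2 ... subN/ pattern."""
    subpatterns = [re.compile(p) for p in pattern.strip("/").split()]
    if not subpatterns:
        return []
    tokens = text.split()
    n = len(subpatterns)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Every subpattern has to match its word completely, not just a part of it.
        if all(p.fullmatch(w) for p, w in zip(subpatterns, window)):
            hits.append(" ".join(window))
    return hits


print(find_phrases("/vote for (obama|romney)/", "they will vote for obama or vote for romney"))
print(find_phrases("/(en)?large/", "a large house to enlarge"))
print(find_phrases("/grea{2,4}t/", "great greaat greaaaat greaaaaat"))
```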


  2. Placeholders
    A placeholder is used as a whole subpattern and stands in the place of a single word or multiple words.

    Placeholder   Matches
    *             0 or more words
    *+            1 or more words
    *?            0 or 1 words
    *{n}          exactly n words
    *{m,}         m or more words
    *{,n}         up to n words
    *{m,n}        at least m words and up to n words

    Please note that placeholders must be used between non-placeholders. For example, the pattern /*+/ does not match anything. Currently, leading and trailing placeholders are discarded. For example, /*+ of the *+/ behaves the same as /of the/.
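
The placeholder rules can be sketched as a translation into an ordinary Python regular expression over space-separated words. This is an illustration only, under stated assumptions: non-placeholder subpatterns are treated as literal words, the function names are hypothetical, and the real engine works on an index rather than on Python's `re`.

```python
import re


def placeholder_quantifier(token):
    """Return a regex quantifier for a placeholder token, or None for a word."""
    simple = {"*": "*", "*+": "+", "*?": "?"}
    if token in simple:
        return simple[token]
    m = re.fullmatch(r"\*\{(\d*)(,?)(\d*)\}", token)
    if m is None:
        return None
    low, comma, high = m.groups()
    if not comma:
        return "{%s}" % low if low else None  # *{n}
    return "{%s,%s}" % (low or "0", high)     # *{m,}, *{,n}, *{m,n}


def to_python_regex(pattern):
    """Translate an Edge-style pattern with placeholders; None if nothing is left."""
    tokens = pattern.strip("/").split()
    # Leading and trailing placeholders are discarded.
    while tokens and placeholder_quantifier(tokens[0]) is not None:
        tokens.pop(0)
    while tokens and placeholder_quantifier(tokens[-1]) is not None:
        tokens.pop()
    if not tokens:
        return None  # e.g. /*+/ does not match anything
    parts = [re.escape(tokens[0])]
    for tok in tokens[1:]:
        quantifier = placeholder_quantifier(tok)
        if quantifier is None:
            parts.append(" " + re.escape(tok))         # next literal word
        else:
            parts.append(r"(?: \S+)" + quantifier)     # a run of whole words
    return r"(?<!\S)" + "".join(parts) + r"(?!\S)"


regex = to_python_regex("/is the *{1} of/")
print(re.search(regex, "this is the end of it").group(0))
```

With this translation, `to_python_regex("/*+ of the *+/")` and `to_python_regex("/of the/")` produce the same expression, which mirrors the discarding rule described above.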


  3. More Examples

    Pattern                         Matches
    /[A-Za-z]{10}/                  words at least ten letters long
    /[0-9A-Za-z]{10}/               words 10 alphanumeric characters long
    /[A-Za-z]{10,12} /              10 to 12 letters followed by a space
    /[A-Za-z]{10,12}(,|.)/          10-12 followed by a space, comma, or period
    /[A-Za-z-]{10,12}S/             including hyphens, and end in S
    /[A-Za-z-]{12,}(!|?|.|,|:|;)/   12+ and then space or punctuation
    /G[A-Za-z-]{10,12} /            G followed by 10-12 letters and then a space
    / G[A-Za-z-]{10,12} /           same, but G is the first letter
    / A G[A-Za-z-]{10,12} /         same, preceded by an indefinite article
    / A G[A-Za-z-]{10,} /           same, but no maximum number of letters
    /IS NOW [A-Za-z-]{2,}ING /      "is now *ing"

    Pattern                     Matches
    /* is the * of */           "is the", followed by zero or more words, followed by "of" (the leading and trailing * are discarded)
    /has .*ed now/              "has", followed by a word that ends with "ed", followed by "now"
    /vote for (obama|romney)/   "vote for", followed by either "obama" or "romney"


F. Search Results

Broadcast Times

There are two times listed for each program. The first is the local time at which the program was broadcast. Although most of the programs are broadcast in California, some come from other countries, so the second time is given in UTC.
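
The relation between the two listed times can be sketched in Python. This example assumes a California broadcast during Pacific Standard Time and uses a fixed UTC-8 offset for simplicity, so it ignores daylight saving time; the sample date and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed Pacific Standard Time offset (UTC-8), no daylight saving.
PST = timezone(timedelta(hours=-8), "PST")


def to_utc(local_time, tz=PST):
    """Attach a time zone to a naive local broadcast time and convert it to UTC."""
    return local_time.replace(tzinfo=tz).astimezone(timezone.utc)


local = datetime(2013, 1, 15, 18, 0)  # made-up 6:00 pm local broadcast
print(local.isoformat(), "local ->", to_utc(local).isoformat(), "UTC")
```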

Search Results Page

  1. Video
    Video Player

    Clicking on the video link or on the thumbnail will play the video in the player on the right. You can skip forward or backward by using the buttons in the video player.

    Note: On a search results page, when you click a thumbnail to cue the video player, you may notice a discrepancy between the thumbnail image and the frame that appears in the video player. There is a 10-second difference between the two because the closed captioning text always lags behind the actual video in timecode. This is not an error in the DCL's timecode structure. If you click the "skip ahead 10" link in the video player immediately after loading the clip, you will see identical images.

  2. Text
    Text Page

    The text link takes you to a page with the transcript of the video.

  3. Montage
    Montage Page

    Clicking on the montage link will lead you to thumbnails taken every 10 seconds of the show. Clicking on a thumbnail will start the video at that specific time of the show.

  4. Image Flow
    Image Flow

    Image Flow is a collection of montages of programs for each day. Browse by date and then program. Selecting a program will take you to the montage where you may look through thumbnails of the video for every 10 seconds. Clicking a thumbnail will take you to a new screen in which the video will be played at that timecode.

  5. Metadata
    Metadata Page

    Metadata is where you can find the closed captioning along with the corresponding time stamp. Clicking on the time will play the video at that specific moment in the show. You may also bookmark the video.

  6. Permalink
    The permalink takes you to a linkable page for the video. Bookmarking it cues the video from the beginning of the show.

  7. Bookmark
    The bookmark brings you to a page with the video, which starts at a specific timecode of the show.

    To bookmark a video: Right-click (or Ctrl + click on a Mac) the "permalink" link and select "Bookmark This Link." Type a reference for the bookmark in the "name" field, then select "OK." The reference will now appear under the Bookmarks menu. These steps are for Firefox; in other browsers the steps may differ slightly, so refer to your browser's help section if necessary. This bookmark cues the video at the beginning of the clip. If you want a bookmark for a specific time in the video, use the second option.

    Bookmark a video at a specific timecode: After playing the video and noting the desired timecode, click the paper icon located at the end of each caption preview. This opens a page containing only that particular video clip. In the URL field at the top of your browser, change the last set of numbers to your desired timecode. Note that you must convert the timecode into seconds, and the number must be a multiple of ten. Press Enter to load the page for that specific timecode. Then select Bookmark from the browser menu bar, select Bookmark this page, and fill in a reference in the "name" field.

    Example:

    • Noted timecode is 15 minutes and 23 seconds into a given clip (923 total seconds).
    • After clicking on the paper icon, the following appears in the URI field: "http://dcl.sscnet.ucla.edu/search/video,20279,170".
    • Change the "170" to 920 since the timecode must be converted to seconds and be in ten second increments. The URI would now read: http://dcl.sscnet.ucla.edu/search/video,20279,920
    • Select bookmark from the browser menu bar.
    • Select Bookmark this page.
    • Fill in a reference in the "name" field.
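
The conversion in the example above can be sketched in Python. The function names are hypothetical, and rounding down to the previous ten-second step is an assumption based on the single example given (923 becomes 920).

```python
def timecode_to_offset(minutes, seconds, step=10):
    """Convert a timecode to seconds, rounded down to a multiple of `step`."""
    total = minutes * 60 + seconds
    return total - total % step


def bookmark_url(clip_url, minutes, seconds):
    """Replace the last comma-separated number of the clip URL with the offset."""
    base, _, _old_offset = clip_url.rpartition(",")
    return f"{base},{timecode_to_offset(minutes, seconds)}"


url = "http://dcl.sscnet.ucla.edu/search/video,20279,170"
print(bookmark_url(url, 15, 23))
```
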
  8. Exporting
    There are two options for exporting: "export this page" or "export all pages." Both will lead you to the Job List page.

    • Export this page: Exports the number of programs found on that page.
    • Export all pages: Exports all of the programs found based on that search.

    Upon completion of the export, you may download it and open the file using another program such as Excel. The export is text-only.


G. Job List

Job List

The job list shows your activity and lets you go back to previous searches and download exports. To view it, click your name at the top right corner. For each job, it displays the type, start/end, query, status, message, and action.

  1. Type:
    • Describes whether the job is a search or export.

  2. Start/End:
    • Gives the time and date of the activity.

  3. Query:
    • Clicking on the query will take you to the advanced search screen with the same entries pre-filled.

  4. Status:
    • Finished: Job has been completed.
    • Running: Job is still in the process of completion. You can cancel the job by clicking the "cancel" link to the right.
    • Cancelled: Job has been cancelled.
    • Queued: Job has not started running yet but will run later.
    • Error: Job has been aborted because of an error.

  5. Message:
    • Describes the progress or results of the activity.

  6. Action:
    • Export jobs can be downloaded or deleted. Downloading an export lets you open the file on your computer using another program such as Excel.
    • Viewing a search will take you to the results page.

H. Missing Files

Files rejected by the Edge import script are listed in tvnews:/data/tna/edge/solr/invalid_files.txt.

We should monitor this file. There are three common failure types:

  • the import script encountered an unfamiliar header tag (solution: modify the import script)
  • the duration of the video is missing (solution: run fixDUR on the file)
  • the video is missing (solution: none, disregard)