Guidelines for Red Hen Developers

Introduction

These are guidelines for people who join or create a Red Hen project team, especially one involving coding.

Related pages

Google Summer of Coders 2018

  • Name | Mentor | Topic | Github | Blog | Case HPC
  • Ahmed Ismail -- Turner -- ASR -- github -- blog 
  • Ajinkya Takawale -- Singla -- AVSR -- github -- blog 
  • Awani Mishra -- Bonazzi -- segmentation -- github? -- blog? 
  • Burhan Ul Tayyab -- Steen -- OCR -- github -- blog -- old blog
  • Devendra Yadav -- Bhatt
  • Gyanesh Malhotra -- Bhatt
  • Shuwei Xu -- Turner -- ASR -- github -- blog 
  • Sumit Vohra -- Bhatt
  • Vaibhav Gupta -- Uhrig -- annotator -- github? -- blog? 
  • Vikrant Goyal -- Singla -- translation -- github -- blog -- vxg195 
  • Vinay Chandragiri -- Bhatt
  • Zhaoqing Xu -- Turner -- ASR -- github -- blog 

Standard Operating Guidelines

Networking your data, activity, and code

  • The overarching goal is to place the code you develop into production in a Red Hen pipeline, and to place the data you use into a networked resource list. Your work should be available to other Red Hens, now and in the future. Some specifics:
    • A complete instance of the Red Hen dataset is present on the server Gallina, located inside Case Western Reserve University's High-Performance Computing Cluster (HPC); this is where Red Hen builds its main information-processing pipelines.
    • To contribute effectively to Red Hen, you will probably need to establish an account as a member of Mark Turner's mbt8 HPC team at Case. We expect your functional code to reside there and to be either in production or available for further development by your team or future members of the team.
    • Except for special cases involving confidentiality, proprietary restrictions, etc., your data must be generally available to other Red Hens, either with a specification of how to locate it or on a Red Hen server (e.g. gallina).

Red Hen dataset

The entire Red Hen dataset of nearly 500,000 video and text files from television news recordings is available inside the Case HPC cluster, mounted on /mnt/rds/redhen/gallina/tv.

Files are organized by date -- that is to say, by year, month, and day. Each individual recording of a television news program will have a series of files with the same name and different extensions, for instance:

drwxr-xr-x 2 tna tna     57344 May 22 06:49 2018-05-22_0600_US_KMEX_Noticias_34_Edición_Nocturna.img
-rw-r--r-- 1 tna tna    380158 May 22 06:49 2018-05-22_0600_US_KMEX_Noticias_34_Edición_Nocturna.jpg
-rw-r--r-- 1 tna tna 124433419 May 22 06:43 2018-05-22_0600_US_KMEX_Noticias_34_Edición_Nocturna.mp4
-rw-r--r-- 1 tna tna     13980 May 22 07:01 2018-05-22_0600_US_KMEX_Noticias_34_Edición_Nocturna.ocr
-rw-r--r-- 1 tna tna     47207 May 22 06:30 2018-05-22_0600_US_KMEX_Noticias_34_Edición_Nocturna.txt

For a detailed description of the data, see Red Hen data format.
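Because filenames begin with the recording date and the tree is organized as tv/YYYY/YYYY-MM/YYYY-MM-DD/, you can derive a recording's directory from its basename and gather all of its files with a glob. A minimal sketch (the listing itself only works on the Case HPC, where the tree is mounted):

```shell
# Derive the dated directory for a recording from its shared basename.
BASE="2018-05-22_0600_US_KMEX_Noticias_34_Edición_Nocturna"
DATE=${BASE:0:10}     # first 10 characters are YYYY-MM-DD
DIR="/mnt/rds/redhen/gallina/tv/${DATE:0:4}/${DATE:0:7}/$DATE"
echo "$DIR"
# List every file belonging to this recording (.img .jpg .mp4 .ocr .txt);
# the directory only exists on the Case HPC, so guard the listing:
if [ -d "$DIR" ] ; then ls "$DIR/$BASE".* ; fi
```

The same parameter expansions (`${BASE:0:10}`, `${DATE:0:4}`) are what the `day` helper below relies on to move around the tree.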

Accessing the Red Hen dataset

To navigate in the Gallina tree, add this function to your ~/.bashrc file:

    # Move to the main tv storage directory for a given date or N days ago
    function day () {
        if [ -z "$1" ] ; then DAY=0 ; else DAY=${1:0:10} ; fi
        if [ "$( echo "$1" | egrep '^[0-9]+$' )" ] ; then DAY="$1"
         elif [ "${#1}" -eq "7" ] ; then cd "/mnt/rds/redhen/gallina/tv/${1%-*}/$1" ; DAY=""
         elif [ "$1" = "here" ] ; then DAY="$( pwd )" ; DAY=${DAY##*/}
           DAY=$(( ( $(date +%s) - $(date -ud "$DAY" +%s) ) / 86400 ))
         elif [ "$1" = "+" ] ; then DAY="$( pwd )" ; DAY=${DAY##*/}
           DAY=$(( ( $(date +%s) - $(date -ud "$DAY" +%s) ) / 86400 - $2 ))
         elif [ "$1" = "-" ] ; then DAY="$( pwd )" ; DAY=${DAY##*/}
           DAY=$(( ( $(date +%s) - $(date -ud "$DAY" +%s) ) / 86400 + $2 ))
         elif [ "${#DAY}" -eq "10" ] ; then DAY=$(( ( $(date +%s) - $(date -ud "$DAY" +%s) ) / 86400 ))
         else echo "$1?"
        fi
        if [ -n "$DAY" ] ; then
          DIR="/mnt/rds/redhen/gallina/tv/$(date -ud "-$DAY day" +%Y)/$(date -ud "-$DAY day" +%Y-%m)/$(date -ud "-$DAY day" +%F)"
          if [ -d "$DIR" ] ; then cd "$DIR" ; else echo "No $DIR" ; fi
        fi
    }


Save the file and issue "source ~/.bashrc" to activate. To go to a particular day, issue "day" with the date or the number of days ago:

    day 2018-02-04
    day 4

To navigate between dates, use

    day + 5
    day - 30

You can also use this in a loop, for instance:

    module load ffmpeg
    for DAY in {08..31} ; do day 2018-01-$DAY ; for FIL in *_CN_*.txt ; do echo "$FIL" ; grep 'DUR|' "$FIL" ; ffprobe "${FIL%.*}.mp4" ; done ; done

Storage capacity in your home directory is limited, but there is ample space on gallina, which is to say, /mnt/rds/redhen/gallina. Please create a directory on gallina where you can store your output and possibly your code, and symlink to it from your home directory.
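As a sketch of that setup (the directory layout passed in is illustrative, not a mandated Red Hen convention; use whatever location your team has agreed on under /mnt/rds/redhen/gallina):

```shell
# Create a bulk-storage directory on gallina and symlink it from $HOME,
# so large output files land on gallina rather than in your home quota.
link_bulk () {
  local bulk="$1/${USER:-$(id -un)}/output"
  mkdir -p "$bulk"                  # bulk storage lives on gallina
  ln -sfn "$bulk" "$HOME/output"    # ~/output now points at gallina
}
# On the Case HPC you might call, for example:
# link_bulk /mnt/rds/redhen/gallina/home
```

After this, anything you write to ~/output is stored on gallina while remaining reachable from your home directory.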

Documenting your work

  • Create a blog for your project on Github Pages using Jekyll, a static site generator. Link to the URL from this page, next to your name in the section Google Summer of Coders 2018 above. As you progress, update your blog to indicate the current state and the next steps, and link to your code. Once a project is (perhaps temporarily) wrapped up, add a new post with clear user instructions that will help newcomers to the project get oriented and use whatever has been developed. See this page as a model.

Singularity containers

  • To standardize and to make our pipelines portable between laptops, servers, and HPCs, we use Singularity containers. These should be constructed using recipes kept at https://github.com/RedHenLab/singularity_containers. These recipes will be picked up by Singularity Hub and built automatically. This dramatically simplifies the task of maintaining and distributing the software we build for pipelines. For instructions, see Using Singularity to create portable appliances.