Google Summer of Code 2019 Ideas Page
- Red Hen Lab has been selected as a Google Summer of Code 2019 organization; see https://g.co/gsoc
- The deadline for students to submit proposals to Google for consideration has now passed — this year Red Hen received an unprecedented 64 proposals, all of which had undergone a rigorous pre-proposal process
- For the 19 projects selected by Red Hen Lab this year, see the Project Profiles page
- This page serves as a record of the instructions and ideas given to prospective students
Red Hen Google Summer of Code 2019
Join the Red Hen Slack Channel
How to Apply
Red Hen Lab is an international cooperative of major researchers in multimodal communication, with mentors spread around the globe. Together, the Red Hen cooperative has crafted this Ideas page, which offers some information about the Red Hen dataset of multimodal communication (see some sample data here and here) and a long list of tasks.
To succeed in your collaboration with Red Hen, the first step is to orient yourself carefully in the relevant material. The Red Hen Lab website that you are currently visiting is voluminous. Please explore it carefully. There are many extensive introductions and tutorials on aspects of Red Hen research. Make sure you have at least an overarching concept of our mission, the nature of our research, our data, and the range of the example tasks Red Hen has provided to guide your imagination. Having contemplated the Red Hen research program on multimodal communication, come up with a task that is suitable for Red Hen and that you might like to embrace or propose. Many illustrative tasks are sketched below. Orient in this landscape, and decide where you want to go.
The second step is to formulate a pre-proposal sketch of 1-3 pages that outlines your project idea. In your proposal, you should spell out in detail what kind of data you need for your input and the broad steps of your process through the summer, including the basic tools you propose to use. Give careful consideration to your input requirements; in some cases, Red Hen will be able to provide annotations for the feature you need, but in other cases successful applicants will craft their own metadata, or work with us to recruit help to generate it.
Red Hen emphasizes: although she has programs and processes—see, e.g., her Τέχνη Public Site, Red Hen Lab's Learning Environment—through which she tutors high-school and college students, Red Hen Google Summer of Code does not operate at that level. Red Hen GSoC seeks mature students who can think about the entire arc of a project: how to get data, how to make datasets, how to create code that produces an advance in the analysis of multimodal communication, how to put that code into production in a Red Hen pipeline. Red Hen is looking for the 1% of students who can think through the arc of a project that produces something that does not yet exist. Red Hen does not hand-hold through the process, but she can supply elite and superb mentoring that consists of occasional recommendations and guidance to the dedicated and innovative student.
Send your pre-proposal to firstname.lastname@example.org. You may join the Red Hen Lab Slack Channel. The ability to generate a meaningful pre-proposal is a requirement for joining the team; if you require more hand-holding to get going, Red Hen Lab is probably not the right organization for you this year. Red Hen wants to work with you at a high level, and this requires initiative on your part and the ability to orient in a complex environment.
When Red Hen receives your pre-proposal, Red Hen will assess it and attempt to locate a suitable mentor; if Red Hen succeeds, she will get back to you and provide feedback to allow you to develop a fully-fledged proposal to submit to GSoC 2019. The deadline for submitting your final proposal to Google was 9 April 2019. Your final proposal must be submitted directly to the Google Summer of Code site for Google to recognize your submission.
Red Hen is excited to be working with skilled students on advanced projects and looks forward to your pre-proposals.
Requirements for Commitment
Google requires students to be dedicated full-time to the project during Google Summer of Code and to state such a commitment. Attending courses or holding other jobs or onerous appointments during the period is a violation of Google policy. Red Hen relies on you to apply only if you can make this full commitment. If your conditions change after you have applied, Red Hen relies on you to withdraw immediately from Google Summer of Code. If you violate policy, you will not be paid. If you violate policy or if you are selected and then withdraw after selections have been announced, you will have deprived another worthy applicant of being selected. Such eliminated slots cannot be recovered or reassigned.
In all but exceptional cases, recognized as such in advance, your project must be put into production by the end of Google Summer of Code or you will not be passed or paid. Putting your project into production means scripting (typically in bash) an automated process for reading input files from Red Hen's data repository, submitting jobs to the CWRU HPC using the Slurm workload manager, running your code, and finally formatting the output to match Red Hen's Data Format. Consider these requirements as opportunities for developing all-round skills and for being proud of having written code that is not only merged but in regular production!
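As a rough illustration of the production steps described above, the following Python sketch generates the text of a Slurm batch script that runs a command inside a Singularity image on one input file. It is only a sketch under stated assumptions: the function name build_sbatch_script, the image name, the command, the paths, the resource defaults, and the `module load singularity` line are all hypothetical and are not taken from Red Hen's actual pipelines, and the redirected output would still have to be converted to Red Hen's Data Format.

```python
import shlex
from pathlib import PurePosixPath


def build_sbatch_script(input_file: str, image: str, command: str,
                        output_dir: str, hours: int = 2, mem_gb: int = 8) -> str:
    """Return the text of a Slurm batch script that processes one input file.

    All names and resource defaults are illustrative assumptions.
    """
    stem = PurePosixPath(input_file).stem
    output_file = str(PurePosixPath(output_dir) / f"{stem}.out")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={stem}",
        f"#SBATCH --time={hours:02d}:00:00",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --output={stem}.%j.log",  # %j expands to the Slurm job ID
        "module load singularity",  # assumed module name on the cluster
        # Run the (hypothetical) command inside the container on the input file.
        f"singularity exec {shlex.quote(image)} {command} "
        f"{shlex.quote(input_file)} > {shlex.quote(output_file)}",
    ]
    return "\n".join(lines) + "\n"
```

A script produced this way could be written to a file and submitted with sbatch; a wrapper that loops over the day's incoming files would complete the automation.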
Requirements for Production
Note that your project must be implemented inside a Singularity container (see instructions). This makes it portable between Red Hen's high-performance computing clusters. Red Hen has no interest in toy, proof-of-concept systems that run on your laptop or in your user account on a server. Red Hen is dedicated exclusively to pipelines and applications that run on servers anywhere and are portable. Please study Guidelines for Red Hen Developers, and master the section on building Singularity containers. You are required to maintain a github account and a blog.
In almost all cases, you will do your work on CWRU HPC, although of course you might first develop code on your device and then transfer it to CWRU HPC. On CWRU HPC, do not try to sudo; do not try to install software. Check for installed software on CWRU HPC, and load what you need, using the module command, for example:
module spider singularity
module load gcc
module load python
On CWRU HPC, do not install software into your user account; instead, if it is not already installed on CWRU HPC, install it inside a Singularity container so that it is portable. Red Hen expects that Singularity will be used in 95% of cases. Why Singularity? Here are 4 answers; note especially #2 and #4:
What is so special about Singularity?
While Singularity is a container solution (like many others), Singularity differs in its primary design goals and architecture:
- Reproducible software stacks: These must be easily verifiable via checksum or cryptographic signature in such a manner that does not change formats (e.g. splatting a tarball out to disk). By default Singularity uses a container image file which can be checksummed, signed, and thus easily verified and/or validated.
- Mobility of compute: Singularity must be able to transfer (and store) containers in a manner that works with standard data mobility tools (rsync, scp, gridftp, http, NFS, etc.) and maintain software and data controls compliancy (e.g. HIPAA, nuclear, export, classified, etc.)
- Compatibility with complicated architectures: The runtime must be immediately compatible with existing HPC, scientific, compute farm and even enterprise architectures, any of which may be running legacy kernel versions (including RHEL6 vintage systems) which do not support advanced namespace features (e.g. the user namespace)
- Security model: Unlike many other container systems designed to support trusted users running trusted containers, we must support the opposite model of untrusted users running untrusted containers. This changes the security paradigm considerably and increases the breadth of use cases we can support.
A few further tips for rare, outlier cases:
- In rare cases, if you feel that some software should be installed by CWRU HPC rather than inside your Singularity container, write to us with an argument and an explanation, and we will consider it.
- In rare cases, if you feel that Red Hen should install some software to be shared on gallina but not otherwise available to the CWRU HPC community, explain what you have in mind, and we will consider it.
Remember to study the blogs of other students for tips, and document on your own blogs anything you think would help other students.
Red Hen Lab participated in Google Summer of Code in 2015, 2016, 2017, and 2018, working with brilliant students and expert mentors from all over the world. Each year, Red Hen has mentored students in developing and deploying cutting-edge techniques of multimodal data mining, search, and visualization, with an emphasis on automatic speech recognition, tagging for natural language, co-speech gesture, paralinguistic elements, facial detection and recognition, and a great variety of behavioral forms used in human communication. With significant contributions from Google Summer of Code students from all over the world, Red Hen has constructed tagging pipelines for text, audio, and video elements. These pipelines are undergoing continuous development, improvement, and extension. Red Hens have excellent access to high-performance computing clusters at UCLA, Case Western Reserve University, and FAU Erlangen; for massive jobs Red Hen Lab has an open invitation to apply for time on NSF's XSEDE network.
Red Hen's largest dataset is the NewsScape Library of International Television News, a collection of more than 600,000 television news programs, initiated by UCLA's Department of Communication, developed in collaboration with Red Hens from around the world, and curated by the UCLA Library, with processing pipelines at Case Western Reserve University, FAU Erlangen, and UCLA. Red Hen develops and tests tools on this dataset that can be used on a great variety of data—texts, photographs, audio and audiovisual recordings. Red Hen also acquires big data of many kinds in addition to television news, such as photographs of Medieval art, and is open to the acquisition of data needed for particular projects. Red Hen creates tools that are useful for generating a semantic understanding of big data collections of multimodal data, opening them up for scientific study, search, and visualization. See Overview of Research for a description of Red Hen datasets.
In 2015, Red Hen's principal focus was audio analysis; see the Google Summer of Code 2015 Ideas page. Red Hen students created a modular series of audio signal processing tools, including forced alignment, speaker diarization, gender detection, and speaker recognition (see the 2015 reports, extended 2015 collaborations, and github repository). This audio pipeline is currently running on Case Western Reserve University's high-performance computing cluster, which gives Red Hen the computational power to process the hundreds of thousands of recordings in the Red Hen dataset. With the help of GSoC students and a host of other participants, the organization continues to enhance and extend the functionality of this pipeline. Red Hen is always open to new proposals for high-level audio analysis.
In 2016, Red Hen's principal focus was deep learning techniques in computer vision; see the Google Summer of Code 2016 Ideas page and Red Hen Lab page on the Google Summer of Code 2016 site. Talented Red Hen students, assisted by Red Hen mentors, developed an integrated workflow for locating, characterizing, and identifying elements of co-speech gestures, including facial expressions, in Red Hen's massive datasets, this time examining not only television news but also ancient statues; see the Red Hen Reports from Google Summer of Code 2016 and code repository. This computer vision pipeline is also deployed on CWRU's HPC in Cleveland, Ohio, and was demonstrated at Red Hen's 2017 International Conference on Multimodal Communication. Red Hen is planning a number of future conferences and training institutes. Red Hen GSoC students from previous years typically continue to work with Red Hen to improve the speed, accuracy, and scope of these modules, including recent advances in pose estimation.
In 2017, Red Hen invited proposals from students for components for a unified multimodal processing pipeline, whose purpose is to extract information about human communicative behavior from text, audio, and video. Students developed audio signal analysis tools, extended the Deep Speech project with Audio-Visual Speech Recognition, engineered a large-scale speaker recognition system, made progress on laughter detection, and developed Multimodal Emotion Detection in videos. Focusing on text input, students developed techniques for show segmentation, neural network models for studying news framing, and controversy and sentiment detection and analysis tools (see Google Summer of Code 2017 Reports). Rapid development in convolutional and recurrent neural networks is opening up the field of multimodal analysis to a slew of new communicative phenomena, and Red Hen is in the vanguard.
In 2018, Red Hen GSoC students created Chinese and Arabic ASR (speech-to-text) pipelines, a fabulous rapid annotator, a multi-language translation system, and multiple computer vision projects. The Chinese pipeline was implemented as a Singularity container on the Case HPC, built with a recipe on Singularity Hub, and put into production ingesting daily news recordings from our new Center for Cognitive Science at Hunan Normal University in Hunan Province in China, directed by Red Hen Lab Co-Director Mark Turner. It represents the model Red Hen expects projects in 2019 to follow.
This year, the organization is focusing on gesture research, and also extending last year's work on ASR to several new languages; see details below.
In large part thanks to Google Summer of Code, Red Hen Lab has been able to create a global open-source community devoted to computational approaches to parsing, understanding, and modeling human multimodal communication. With continued support from Google, Red Hen will continue to bring top students from around the world into the open-source community.
What kind of Red Hen are you?
More About Red Hen
The profiles of mentors not included in the portrait gallery are linked to their names below.
Guidelines for project ideas
Your project should be in the general area of multimodal communication, whether it involves tagging, parsing, analyzing, searching, or visualizing. Red Hen is particularly interested in proposals that make a contribution to integrative cross-modal feature detection tasks. These are tasks that exploit two or even three different modalities, such as text and audio or audio and video, to achieve higher-level semantic interpretations or greater accuracy. You could work on one or more of these modalities. Red Hen invites you to develop your own proposals in this broad and exciting field.
Red Hen studies all aspects of human multimodal communication, such as the relation between verbal constructions and facial expressions, gestures, and auditory expressions. Examples of concrete proposals are listed below, but Red Hen wants to hear your ideas! What do you want to do? What is possible? You might focus on a very specific type of gesture, or facial expression, or sound pattern, or linguistic construction; you might train a classifier using machine learning, and use that classifier to identify the population of this feature in a large dataset. Red Hen aims to annotate her entire dataset, so your application should include methods of locating as well as characterizing the feature or behavior you are targeting. Contact Red Hen for access to existing lists of features and sample clips. Red Hen will work with you to generate the training set you need, but note that your project proposal might need to include time for developing the training set.
Red Hen develops a multi-level set of tools as part of an integrated research workflow, and invites proposals at all levels. Red Hen is excited to be working with the Media Ecology Project to extend the Semantic Annotation Tool, making it more precise in tracking moving objects. The "Red Hen Rapid Annotator" is also ready for improvements. Red Hen is open to proposals that focus on a particular communicative behavior, examining a range of communicative strategies utilized within that particular topic. See for instance the ideas "Tools for Transformation" and "Multimodal rhetoric of climate change". Several new deep learning projects are on the menu, from "Hindi ASR" to "Gesture Detection and Recognition". On the search engine front, Red Hen also has several candidates: the "Development of a Query Interface for Parsed Data" to "Multimodal CQPweb". Red Hen welcomes visualization proposals; see for instance the "Semantic Art from Big Data" idea below.
Red Hen is now capturing television in China, Egypt, and India and is happy to provide shared datasets and joint mentoring for on-screen text detection and extraction with our partner CCExtractor, which provides the vital tools for text extraction in several television standards.
When you plan your proposal, bear in mind that your project should result in a production pipeline. For Red Hen, that means it finds its place within the integrated research workflow. The application will typically be required to be located within a Singularity module that is installed on Red Hen's high-performance computing clusters, fully tested, with clear instructions, and fully deployed to process a massive dataset. The architecture of your project should be designed so that it is clear and understandable for coders who come after you, and fully documented, so that you and others can continue to make incremental improvements. Your module should be accompanied by a Python application programming interface (API) that specifies the input and output, to facilitate the development of a unified multimodal processing pipeline for extracting information from text, audio, and video. Red Hen prefers projects that use C/C++ and Python and run on Linux. For some of the ideas listed, but by no means all, it's useful to have prior experience with deep learning tools.
Your project should be scaled to the appropriate level of ambition, so that at the end of the summer you have a working product. Be realistic and honest with yourself about what you think you will be able to accomplish in the course of the summer. Provide a detailed list of the steps you believe are needed, the tools you propose to use, and a weekly schedule of milestones. Choose a task you care about, in an area where you want to grow. The most important thing is that you are passionate about what you are going to work on with us. Red Hen looks forward to welcoming you to the team!
Ideas for Projects
Red Hen strongly emphasizes that a student should not browse the following ideas without first having read the text above them on this page. Red Hen remains interested in proposals for any of the activities listed throughout this website (http://redhenlab.org). See especially the
Red Hen is uninterested in a preproposal that merely picks out one of the following ideas and expresses an interest. Red Hen looks instead for an intellectual engagement with the project of developing open-source code that will be put into production in our working pipelines to further the data science of multimodal communication. What is your full idea? Why is it worthy? Why are you interested in it? What is the arc of its execution? What data will you acquire, and where? How will you succeed?
1. Automatic Speech Recognition for Chinese
Red Hen has a working prototype in production; see Automatic Speech Recognition for Chinese. Please study the existing code and blogs carefully, and develop a proposal for how the process can be improved. In particular, study the related pages, github repositories, and blogs listed on that page before communicating with Red Hen. The audio is currently cut at fixed intervals; developing a method to cut the audio at word boundaries would likely improve the output. The current system uses Baidu's Deep Speech 2 inside a Singularity container; Red Hen would like to maintain this arrangement, but the system should be upgraded to Deep Speech 3 and include other improvements. Red Hen's dataset in Chinese is growing daily; Red Hen now has around a thousand hours of news recordings.
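One simple way to move away from fixed-interval cuts is to shift each cut to the quietest nearby stretch of audio. The sketch below illustrates that idea on a plain list of samples. It is an assumption-laden illustration and not the method used in the existing pipeline: low signal energy is only a crude proxy for a word boundary, and the function names and parameters (frame_len, target_frames, search_frames) are hypothetical.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def quiet_cut_points(samples: Sequence[float], frame_len: int,
                     target_frames: int, search_frames: int) -> List[int]:
    """Place a cut roughly every `target_frames` frames, moved to the
    quietest frame within `search_frames` frames of the nominal position.
    Returns sample indices at which to cut."""
    if not 0 <= search_frames < target_frames:
        raise ValueError("search_frames must be smaller than target_frames")
    energies = frame_energies(samples, frame_len)
    cuts = []
    position = target_frames
    while position < len(energies):
        lo = max(0, position - search_frames)
        hi = min(len(energies), position + search_frames + 1)
        quietest = min(range(lo, hi), key=lambda i: energies[i])
        cuts.append(quietest * frame_len)  # start of the quietest frame
        position = quietest + target_frames  # always moves forward
    return cuts
```

A real system would more plausibly rely on a voice activity detector or on forced alignment against the transcript; the sketch only shows where such a component would slot in.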
2. Automatic Speech Recognition for Hindi, Urdu, and Bengali
Mentor: T. M. Thasleema
Red Hen has recently begun to record television news in Kolkata in Hindi, Urdu, and Bengali. A successful proposal will include a detailed description of the tools that will be used. The ASR pipeline for each language must be built as a Singularity container on the Case HPC and put into production processing daily incoming files.
3. Automatic speech recognition for Arabic
4. Automatic speech recognition for European languages
We are open to proposals for a large number of European languages, including Czech, Danish, Dutch, English, German, Italian, Norwegian, Polish, Portuguese, Russian, Spanish, and Swedish.
5. OCR for Chinese, Arabic, Hindi, Urdu, and Bengali
Red Hen invites proposals for optical character recognition of any language present in the NewsScape Library of International Television News, using the latest methods. Successful proposals will specify the method to be used. The task is to set up a stable pipeline that will perform OCR on television news shows. We will provide the recordings of the shows in an x264 encoding; you would detect the presence and location of any writing on screen in images at one-second intervals and then apply the OCR. Since the text often repeats in successive images, you also need to develop algorithms for deduplication, and use a dictionary to select the best of several captures of the same text. The OCR pipeline for each language must be built as a Singularity container on the Case HPC and put into production processing daily incoming files.
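The deduplication step described above can be sketched as follows: consecutive captures that are nearly identical are grouped, and the capture with the most dictionary words is kept from each group. This is a minimal sketch, not a recommended design. The similarity threshold, the function names, and the whitespace tokenization are assumptions, and whitespace tokenization in particular would need replacing for languages such as Chinese.

```python
from difflib import SequenceMatcher
from typing import Iterable, List, Set


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if two OCR captures are close enough to be the same on-screen text."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def dictionary_score(text: str, dictionary: Set[str]) -> int:
    """Count whitespace-separated tokens that appear in the dictionary."""
    return sum(1 for word in text.lower().split()
               if word.strip(".,!?:;") in dictionary)


def deduplicate(captures: Iterable[str], dictionary: Set[str],
                threshold: float = 0.8) -> List[str]:
    """Group runs of similar successive captures and keep the best of each run."""
    groups: List[List[str]] = []
    for text in captures:
        if groups and similar(groups[-1][-1], text, threshold):
            groups[-1].append(text)
        else:
            groups.append([text])
    return [max(group, key=lambda t: dictionary_score(t, dictionary))
            for group in groups]
```

For example, the captures "BREAKING NEWS", "BREAK1NG NEWS", "BREAKING NEW5" taken at successive seconds would collapse to the first, since it is the only one whose tokens are all dictionary words.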
6. Gesture detection and recognition in news videos
Mentored by Mahnaz Parian <email@example.com> and Heiko Schuldt's team
Red Hen invites proposals to build a gesture detection and recognition pipeline. For gesture detection, a good starting point is OpenPose, and a useful extension is hand keypoint detection. Our dataset is around 600,000 hours of television news recordings in multiple languages, so the challenge is to obtain good recall rates with this particular content.
For the GSoC gesture project, Red Hen has the following goals:
- Build a system inside a Singularity container for deployment on high-performance computing clusters (see instructions)
- Reliably detect the presence or absence of hand gestures
- Recognize and label a subset of the detected hand gestures
- Process and annotate Red Hen's news video dataset
A good command of python and deep learning libraries (Tensorflow/caffe/Keras) is necessary. Please see here for more information regarding proposals.
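As a starting point for the presence-or-absence goal, one could flag clips in which a wrist keypoint moves noticeably between frames. The sketch below assumes each frame is given as OpenPose's flat pose_keypoints_2d list (x, y, confidence per keypoint) with the right and left wrists at indices 4 and 7. The thresholds and function names are hypothetical, and wrist motion is only a crude proxy for a gesture, so this is a baseline to improve on rather than a proposed solution.

```python
import math
from typing import Dict, List, Sequence, Tuple

R_WRIST, L_WRIST = 4, 7  # wrist indices in OpenPose's body keypoint layout


def wrist_positions(pose_keypoints_2d: Sequence[float],
                    min_conf: float = 0.3) -> Dict[int, Tuple[float, float]]:
    """Extract confidently detected wrist positions from one frame."""
    positions = {}
    for index in (R_WRIST, L_WRIST):
        x, y, conf = pose_keypoints_2d[3 * index: 3 * index + 3]
        if conf >= min_conf:
            positions[index] = (x, y)
    return positions


def has_hand_motion(frames: List[Sequence[float]],
                    min_displacement: float = 20.0,
                    min_conf: float = 0.3) -> bool:
    """True if either wrist moves at least `min_displacement` pixels
    between two consecutive frames in which it is detected."""
    previous: Dict[int, Tuple[float, float]] = {}
    for keypoints in frames:
        current = wrist_positions(keypoints, min_conf)
        for index, (x, y) in current.items():
            if index in previous:
                px, py = previous[index]
                if math.hypot(x - px, y - py) >= min_displacement:
                    return True
        previous = current
    return False
```

A pixel threshold depends on resolution and on how large the speaker appears on screen, so a production version would likely normalize by body size, for instance shoulder width, before thresholding.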
7. Feature recognition in works of art
Mentored by Prof. Dr. Peter Bell <firstname.lastname@example.org>, Leonardo Impett, Dr. Line Engh, and Prof. Mark Turner
See Speech Gestures in Art, Dataset for Medieval Art, and Christian Iconography: The Émile Mâle pipeline. Red Hen is also open to a project on feature classification of Roman portraits; see Machine Learning Classifiers and Other Tools for Analysis of Roman Art (github). For background, see Peter Bell's article Nonverbal Communication in Medieval Illustrations Revisited by Computer Vision and Art History (2013).
You may propose other projects relating to feature recognition in works of art, such as image recognition in Greek vases, but this requires collecting and hand-annotating the data, cf. art collections.
8. Tagging for conceptual frames
Mentored by Francis Steen and Mark Turner
The International Computer Science Institute at UC Berkeley has developed a database of conceptual frames, FrameNet. Red Hen currently tags all English-language content in the NewsScape dataset with FrameNet 1.5, using CMU's Semafor. Red Hen invites proposals for updating this system to FrameNet 1.7 with Semafor and OpenSesame.
For a discussion, see Butterfly Effects in Frame Semantic Parsing: impact of data processing on model ranking (2018), which compares Semafor and OpenSesame for both FrameNet 1.5 and 1.7. The most promising approach is to improve and deploy their Python pipeline. Do this inside a Singularity container at the Case HPC, run it on the entire Red Hen English dataset of around 450,000 hours of news recordings, and develop a metric to assess the results.
9. Use AI to expand FrameNet
Mentored by Tiago Torrent and Mark Turner
The FrameNet project is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. It contains more than 200,000 manually annotated sentences linked to more than 1,200 semantic frames using 13,640 lexical units, cf. current project status.
Red Hen invites proposals for augmenting the work of the human annotator in creating FrameNet annotations. This could for instance take the form of representing all of FrameNet as a set of vectors, identifying words and expressions in the Red Hen dataset that are not yet represented in FrameNet, and using WordNet distances to assign these words to existing frames, with a likelihood measure, retaining phrase-level context.
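To make the vector idea concrete, the sketch below assigns an unseen word to existing frames by comparing its vector with the centroid of each frame's lexical units and turning the similarities into a likelihood with a softmax. Note the substitution: cosine similarity over embedding vectors stands in here for the WordNet distances mentioned above, and phrase-level context is not modeled. The function names, the temperature parameter, and the toy two-dimensional vectors in the usage note are all illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def frame_likelihoods(word_vec: Sequence[float],
                      frame_members: Dict[str, List[Sequence[float]]],
                      temperature: float = 0.1) -> Dict[str, float]:
    """Softmax over cosine similarities between a word and each frame centroid."""
    scores = {frame: cosine(word_vec, centroid(vectors))
              for frame, vectors in frame_members.items()}
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {frame: math.exp((score - peak) / temperature)
            for frame, score in scores.items()}
    total = sum(exps.values())
    return {frame: value / total for frame, value in exps.items()}
```

With toy vectors such as {"Commerce_buy": [[1.0, 0.0], [0.9, 0.1]], "Motion": [[0.0, 1.0], [0.1, 0.9]]}, a word vector [0.8, 0.2] receives nearly all of its likelihood mass on Commerce_buy. A human annotator would still confirm or reject each suggestion, in line with the goal of augmenting rather than replacing annotation.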
Please begin by familiarizing yourself with the FrameNet project.
10. Red Hen Rapid Annotator
Mentored by Peter Uhrig and Vaibhav Gupta
This task is aimed at extending the Red Hen Rapid Annotator, which was re-implemented from scratch as a Python/Flask application during last year's GSoC and is already in active use. There are still some bugs and open feature requests. Red Hen would also like to integrate it further with other pieces of software, such as CQPweb and Google Docs. A usability review is under way at the moment, so Red Hen will probably incorporate suggestions from the usability report.
Please familiarize yourself with the project and play around with it.
11. NLP pipelines for various languages
Red Hen Lab runs NLP tagging on English, as you see from this snippet from a Red Hen metadata file:
csa@cartago:/tv/2019/2019-01/2019-01-19$ more 2019-01-19_0100_US_CNN_Anderson_Cooper_360.seg
COL|Communication Studies Archive, UCLA
TTL|Anderson Cooper 360
CMT|News and commentary
LBT|2019-01-18 17:00 America/Los_Angeles
SEG_02|2019-01-21 01:45|Source_Program=RunTextStorySegmentation.jar|Source_Person=Rongda Zhu
SMT_01|2019-01-21 02:42|Source_Program=Pattern 2.6, Sentiment-02.py|Source_Person=Tom De Smedt, FFS|Codebook=polarity, subjectivity
SMT_02|2019-01-21 02:42|Source_Program=SentiWordNet 3.0, Sentiment-02.py|Source_Person=Andrea Esuli, FFS|Codebook=polarity, subjectivity
NER_03|2019-01-21 02:43|Source_Program=stanford-ner 3.4, NER-StanfordNLP-03.py|Source_Person=Jenny Rose Finkel, FFS|Codebook=Category=Entity
POS_02|2019-01-21 02:43|Source_Program=stanford-postagger 3.4, PartsOfSpeech-StanfordNLP-02.py|Source_Person=Kristina Toutanova, FFS|Codebook=Treebank II
FRM_01|2019-01-21 02:44|Source_Program=FrameNet 1.5, Semafor 3.0-alpha4, FrameNet-06.py|Source_Person=Charles Fillmore, Dipanjan Das, FFS|Codebook=Token|Position|Frame name|Semantic Role Labeling|Token|Position|Frame element
20190119010003.484|20190119010004.319|CC1|THEY NEVER MAKE STATEMENTS LIKE THIS.
The snippet above shows the header lines that identify the taggers run on the file, followed by a sentence from the broadcast; in the full file, the tagging produced for each sentence follows it.
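The layout of the snippet above lends itself to simple programmatic reading. The sketch below separates header lines from timestamped lines, under assumptions inferred only from the sample: header lines begin with a tag such as TTL or POS_02 followed by pipe-separated values, and content lines carry two timestamps, a line type such as CC1, and the text. The function name and the returned structure are hypothetical; consult Red Hen's Data Format documentation for the authoritative definition.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S.%f"  # e.g. 20190119010003.484


def parse_seg_lines(lines: Iterable[str]) -> Tuple[List[tuple], List[Dict]]:
    """Split lines of a .seg file into header entries and timestamped entries."""
    header, entries = [], []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2:
            continue  # not a pipe-delimited line (e.g. a shell prompt)
        if fields[0][:1].isdigit():
            if len(fields) >= 4:
                entries.append({
                    "start": datetime.strptime(fields[0], TIMESTAMP_FORMAT),
                    "end": datetime.strptime(fields[1], TIMESTAMP_FORMAT),
                    "type": fields[2],
                    # Keep any further pipes as part of the content.
                    "text": "|".join(fields[3:]),
                })
            continue
        header.append((fields[0], fields[1:]))
    return header, entries
```

A new tagger would typically read the caption entries, compute its annotations, and write them back as additional lines with its own tag, after registering itself in the header.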
In principle, Red Hen Lab could run such Natural Language Processing tagging on a great range of files, in a great range of languages in which Red Hen holds and acquires data: English (including US, UK, and India), German, Norwegian, Swedish, Danish, Dutch, Spanish (including from Mexico), European and Brazilian Portuguese, French, Italian, Russian, Polish, Czech, Chinese, Arabic, Hindi, Urdu, Bengali, ... (For some of these languages, namely Chinese, Arabic, Hindi, Urdu, and Bengali, Red Hen does not have closed captions, and accordingly Red Hen has a separate project to develop Automatic Speech Recognition.) See Basic Text Pipeline for an example of an NLP Pipeline that Red Hen is building. Building such a pipeline typically requires first segmenting the text into sentences (see the github entry for, e.g., Pragmatic Segmenter). Then, Natural Language Processing software like Stanford Core NLP can be applied. Stanford Core NLP offers "jars" for several of these languages: Arabic, Chinese, French, German, Spanish. There are many other interesting packages of NLP software, such as LASER. This project is to develop an NLP pipeline for one of these languages. For example, Stanford Core NLP is not one of the software packages Red Hen runs on English, but Red Hen could be interested in such a pipeline. (Red Hen already runs Stanford Core NLP for its CQPweb interface.) Would you like to create an NLP pipeline? It is crucial that these pipelines run inside a Singularity container on CWRU HPC and produce output that is available in the Red Hen metadata. A pipeline that runs as a demo on a laptop is not interesting to Red Hen.
12. Development of a Query Interface for Parsed Data
Mentored by Peter Uhrig's team
This infrastructure task is to create a new and improved version of a graphical user interface for graph-based search on dependency-annotated data. The new version should have all functionality provided by the prototype plus a set of new features. The back-end is already in place.
Develop current functionality:
- add nodes to the query graph
- offer choice of dependency relation, PoS/word class based on the configuration in the database (the database is already there)
- allow for use of a hierarchy of dependencies (if supported by the grammatical model)
- allow for word/lemma search
- allow one node to be a "collo-item" (i.e. collocate or collexeme in a collostructional analysis)
- color nodes based on a finite list of colors
- paginate results
- export xls of collo-items
- create a JSON object that represents the query to pass it on to the back-end
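The last item in the list above, a JSON object representing the query graph, might look something like the following sketch. The schema is entirely hypothetical: the keys (nodes, edges, head, dependent, relation, collo_item) and the function name are invented for illustration, and the format the existing back-end actually expects must be taken from its code.

```python
import json
from typing import Dict, List


def build_query(nodes: List[Dict], edges: List[Dict]) -> str:
    """Serialize a dependency query graph to JSON after a basic sanity check.

    Each node is a dict with an "id" plus optional constraints; each edge
    links a head node to a dependent node with a dependency relation.
    """
    node_ids = {node["id"] for node in nodes}
    for edge in edges:
        if edge["head"] not in node_ids or edge["dependent"] not in node_ids:
            raise ValueError(f"edge refers to an unknown node: {edge}")
    return json.dumps({"nodes": nodes, "edges": edges}, sort_keys=True)
```

For instance, a verb node {"id": 1, "lemma": "give", "pos": "VERB"} linked by a "dobj" edge to {"id": 2, "pos": "NOUN", "collo_item": True} would describe a search for the direct objects that collocate with "give". Representing edges as a flat list rather than nesting children inside parents keeps the same structure usable for query graphs that are not trees.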
Develop new functionality:
- allow for removal of nodes
- allow for query graphs that are not trees
- allow for specification of the order of the elements
- pagination of search results should be possible even if several browser windows or tabs are open.
- configurable export to csv for use with R
- compatibility with all major Web Browsers (IE, Firefox, Chrome, Safari) [currently, IE is not supported]
- parse of example sentence can be used as the basis of a query ("query by example")
- Visit http://www.treebank.info and play around with the interface (user: gsoc2018, password: redhen) [taz is a German corpus, the other two are English]
- Think about HTML representation. Red Hen probably prefers HTML5/CSS3, but it is unclear whether its requirements can be met without major work on <canvas>, or whether sensible widgets are possible without digging into the <canvas> tag.
Contact Peter Uhrig <email@example.com> to discuss details or to ask for clarification on any point.
13. Multimodal CQPweb
Mentored by Peter Uhrig's team
CQPweb (http://cwb.sourceforge.net/cqpweb.php) is a web-based corpus analysis system used by linguists. Red Hen is involved in extending its capabilities to handle multimodal forms of data. Red Hen is open to proposals that accomplish the following tasks:
- Phonetic search
- Integration with EMU
- Menu-based search assistance for gesture search, shot detection, speaker, etc.
- Direct integration of video player
Successful applicants for this infrastructure task will familiarize themselves with the existing codebase.
14. Opening the Digital Silo: Multimodal Show Segmentation
Libraries and research institutions in the humanities and social sciences often have large collections of legacy video tape recordings that they have digitized, but cannot usefully access -- this is known as the "digital silo" problem. Red Hen is working with several university libraries on this problem. A basic task Red Hen needs to solve is television program segmentation: in other words, the task of cutting a video obtained from a videotape into units corresponding to individual TV shows. The UCLA Library, for instance, is digitizing its back catalog of several hundred thousand hours of news recordings from the Watergate Hearings in 1973 to 2006. These digitized files have up to eight hours of programming that must be segmented at their natural boundaries; cf. sample data.
Red Hen welcomes proposals for a segmentation pipeline. An optimal approach is to use a combination of text, audio, and visual cues to detect the show and episode boundaries. Your project should assemble multiple cues associated with these boundaries, from recurring phrases, theme music, anchors' faces, and opening visual sequences, and then develop robust statistical methods to locate the most probable spot where one show ends and another begins. Red Hen is open to your suggestions for how to solve this challenge.
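To make the idea of combining cues concrete, here is a minimal sketch in Python. It assumes that each modality has already been reduced to a per-second score between 0 and 1 (the cue names, weights, and the `most_probable_boundary` function are hypothetical), and it uses simple weighted pooling as a stand-in for the more robust statistical methods a real proposal would develop.

```python
def most_probable_boundary(cue_scores, weights, window=2):
    """Return (index, score) of the position with the highest combined score.

    cue_scores: dict mapping cue name -> list of per-second scores in [0, 1].
    weights: dict mapping cue name -> non-negative weight.
    window: scores within +/- window seconds are pooled, since cues for the
            same boundary rarely fire at exactly the same instant.
    """
    length = len(next(iter(cue_scores.values())))
    if any(len(track) != length for track in cue_scores.values()):
        raise ValueError("all cue tracks must have the same length")
    total_weight = sum(weights[name] for name in cue_scores)
    combined = []
    for t in range(length):
        lo, hi = max(0, t - window), min(length, t + window + 1)
        # Take each cue's strongest response near t, then average by weight.
        score = sum(
            weights[name] * max(track[lo:hi])
            for name, track in cue_scores.items()
        )
        combined.append(score / total_weight)
    best = max(range(length), key=combined.__getitem__)
    return best, combined[best]

# Toy example: three cues that fire within a second of each other.
cues = {
    "theme_music":    [0, 0, 0, 0, 0.9, 0.2, 0, 0, 0, 0],
    "anchor_face":    [0, 0, 0, 0, 0, 0.8, 0, 0, 0, 0],
    "opening_phrase": [0, 0.3, 0, 0, 0, 0, 0.7, 0, 0, 0],
}
equal_weights = {name: 1.0 for name in cues}
best_second, best_score = most_probable_boundary(cues, equal_weights, window=1)
```

In this toy example the three cues peak at seconds 4, 5, and 6, and the pooled score is highest at second 5, where all three fall inside the window.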
Red Hen is open to developing segmentation tools in collaboration with vitrivr (https://vitrivr.org), which already contains functionality to search for a frame within a video, a sequence of music within a sound track, and clusters of words in a text. Successful proposals for this infrastructure project will use all three modalities to triangulate the optimal video boundaries.
For more details, continue reading here.
15. NLP Pipeline for English v2
Mentored by Peter Uhrig, Francis Steen, Mark Turner
Red Hen has a working NLP pipeline for all incoming English recordings, but some of the tools are outdated, and some have already been superseded by our new system. Your task is to get the latest versions of a range of programs to run and adapt existing software to our new data format. Specifically, these include:
- Commercial detection (existing software, needs to be integrated with the new file format)
- Sentiment annotation
- Time expressions
- Possibly: coreference resolution
For this task, you should
- be good at getting software to run in Linux, which will include Bash scripting, compiling with weird tools, dependency management, ...
- be good at transforming data from one textual format to another (tables, XML, JSON, proprietary formats).
- ideally be able to modify simple I/O code in various programming languages (Python, Perl, Java)
16. Semantic Art from Big Data
Mentored by Heiko Schuldt and Francis Steen
Vast collections of multimodal data are becoming raw materials for artistic expressions and visualizations that are both informative and esthetically appealing. Red Hen is collaborating with vitrivr (https://vitrivr.org) to develop an application for semantically meaningful large-scale visualizations of multimodal data. The tools will support visualizations along a range of scalar dimensions and arrays, utilizing Red Hen's deep analysis of hundreds of thousands of hours of news videos: clusters of event categories over time, the distribution of emotions across nations and networks, the emotional intensity of a single event cascading through the international news landscape. Red Hen is also interested in making these visualizations serve as browsing tools for exploring large collections of images and videos in novel and creative ways.
For examples of Red Hen big data visualizations, see the Viz2016 project, which provides visualizations of some dimensions of US Presidential elections.
Successful applicants for this task will familiarize themselves with the vitrivr stack, including Cineast, a multi-feature content-based multimedia retrieval engine. Java and web programming skills are required.
17. Chinese Pipeline
Mentored by Weixin Li, Yao Tong, and Kai Chan
Red Hen has recently begun acquiring massive audiovisual data in Chinese and wants both to extend that collection and to add other kinds of Chinese data (text, audio). This task includes developing tools for tagging, parsing, annotating, analyzing, searching, etc. the Chinese data. Red Hen now directs a new Center for Cognitive Science at Hunan Normal University dedicated to this project and to related work on multimodal communication. Before communicating with Red Hen, study Automatic Speech Recognition for Chinese, including the related pages, github repositories, and blogs listed on that page. Areas of work in this project might include:
- Extracting captions
- For OCR, Google tesseract in collaboration with CCExtractor
- NLP of various kinds: word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, etc. Resources known to Red Hen include:
- A curated list of resources for Chinese NLP
- NLPIR/ICTCLAS Chinese segmentation software: a python wrapper is available at https://github.com/tsroten/pynlpir
- FudanNLP toolkit: https://github.com/FudanNLP/fnlp
- Stanford NLP toolkit's Chinese module: https://nlp.stanford.edu/projects/chinese-nlp.shtml
- For speech to text, many speech recognition packages, such as CMUSphinx and Baidu Deep Speech, which is based on TensorFlow.
- Forced alignment of Chinese
- Multi-program transport stream splitting for Joker-TV (collaboration with Abylay Ospan)
Red Hen is collaborating with CCExtractor on text extraction and OCR; successful candidates will have a mentor from both organizations.
18. System Integration of Existing Tools Into a New Multimodal Pipeline
Red Hen is integrating multiple separate processing pipelines into a single new multimodal pipeline. Orchestrating the processing of hundreds of thousands of videos on a high-performance computing cluster along multiple dimensions is a challenging design task. The winning design for this task will be flexible, but at the same time make efficient use of CPU cycles and file accesses, so that it can scale. Pipelines to be integrated include:
- Shot detection
- Commercial detection
- Speaker recognition
- Frame annotation (for English)
- Text and Story segmentation
- Sentiment Analysis
- Emotion detection
- Gesture detection
This infrastructure task requires familiarity with Linux, bash scripting, and a range of programming languages such as Java, Python, and Perl, used in the different modules.
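One way to think about the orchestration problem is as a dependency graph of stages. The sketch below, in Python, groups stages into batches that could run in parallel on a cluster. The dependencies shown are purely hypothetical and chosen for illustration; working out which of the real modules depend on which is part of the design task. It relies on `graphlib`, which is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each stage maps to the stages it needs first.
STAGE_DEPENDENCIES = {
    "shot_detection": set(),
    "commercial_detection": {"shot_detection"},
    "story_segmentation": {"commercial_detection"},
    "sentiment_analysis": {"story_segmentation"},
    "gesture_detection": {"shot_detection"},
}

def execution_batches(dependencies):
    """Group stages into batches; stages in one batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches(STAGE_DEPENDENCIES)
```

With these assumed dependencies, shot detection runs first, commercial detection and gesture detection can then run side by side, and the text-oriented stages follow in sequence.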
19. Semantic Annotation Tool
Red Hen provides an integrated research workflow, from manual annotation to machine learning and data visualization. The Semantic Annotation Tool (SAT) is a next-generation annotation tool developed by Red Hen's collaborator The Media Ecology Project. SAT is a jQuery plugin and Ruby on Rails server that adds an annotation interface to HTML5 videos. For machine learning, it is essential that semantic annotations be spatially located within the picture frames of the video, so that the algorithms focus on the correct features. SAT supports associating tags and text bodies with time- and geometry-delimited media fragments using W3C Web Annotation ontologies. One limitation of the current tool, however, and the Web Annotation spec more generally, is that there is no support for moving a geometric annotation target within a frame over time. For example, a baseball thrown from the left side of the frame to the right would force an annotator to choose whether they want their annotation to target the ball’s location on the first frame it appears, the last frame, or even the entire path of the ball. No matter what they decide, the annotation target will necessarily be inaccurate.
Red Hen wants to add the ability to tween geometric annotation targets over time. These areas are currently defined as a single array of points. The new feature would redefine geometric targets to include multiple arrays of points for starting location, ending location, and an arbitrary number of keyframes; add interface tools to the jQuery plugin that allow all of these locations to be entered by a user; add support for graphically tweening the geometric area in sync with playback of the video; and extend the current data API (client and server) to support the new geometric data format.
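The core of such tweening can be sketched as linear interpolation between keyframes. The example below is written in Python for brevity, whereas the plugin itself would implement this in JavaScript; the keyframe representation (a time in seconds paired with a list of points) and the `tween_polygon` function are assumptions made for illustration, not the SAT data format.

```python
from bisect import bisect_right

def tween_polygon(keyframes, t):
    """Linearly interpolate a polygon between keyframes.

    keyframes: list of (time_in_seconds, [(x, y), ...]) sorted by strictly
               increasing time; every polygon has the same number of points.
    t: playback time in seconds.
    Returns the interpolated list of (x, y) points. Times outside the
    keyframe range are clamped to the first or last polygon.
    """
    times = [time for time, _ in keyframes]
    if t <= times[0]:
        return list(keyframes[0][1])
    if t >= times[-1]:
        return list(keyframes[-1][1])
    i = bisect_right(times, t)  # index of the first keyframe strictly after t
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    if len(p0) != len(p1):
        raise ValueError("keyframe polygons must have the same number of points")
    alpha = (t - t0) / (t1 - t0)
    return [
        (x0 + alpha * (x1 - x0), y0 + alpha * (y1 - y0))
        for (x0, y0), (x1, y1) in zip(p0, p1)
    ]

# A box around a ball moving from the left edge to the right edge in 2 seconds.
ball_keyframes = [
    (0.0, [(0, 40), (10, 40), (10, 50), (0, 50)]),
    (2.0, [(90, 40), (100, 40), (100, 50), (90, 50)]),
]
halfway = tween_polygon(ball_keyframes, 1.0)
```

Calling the function with the current playback time on each video time update would keep the drawn region in sync with the moving target; additional keyframes between start and end allow for paths that are not straight lines.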
20. Forced Alignment
Mentored by Francis Steen, Peter Uhrig, and Kai Chan
The Red Hen Lab dataset contains television news transcripts in multiple languages, including American English, British English, Danish, French, Italian, German, Norwegian, European and Brazilian Portuguese, Russian, European and Mexican Spanish, and Swedish. These transcripts are timestamped, but the timestamps are delayed by a variable number of seconds relative to the audio and video. To bring them into alignment, Red Hen has so far used the Gentle aligner to align English-language text. Red Hen could now either try to extend Gentle to other languages (it is based on Kaldi, so this may be possible) or use some other software such as Aeneas, or the MAUS tools offered by BAS. A relatively complete list of forced alignment tools can be found at https://github.com/pettarin/forced-alignment-tools, which is maintained by the author of Aeneas.
The task is to create an automated pipeline on a high-performance computing cluster that deploys a suitable piece of software on the languages present in the Red Hen dataset, performs quality checks, and improves the quality of the alignment for hundreds of thousands of hours of transcripts. For parallel corpora, see Europarl.
For background, see http://linguistics.berkeley.edu/plab/guestwiki/index.php?title=Forced_alignment. Improving the quality of alignment will involve developing methods for cleaning up non-speech content in the transcripts, adding words to a local dictionary, and evaluating accuracy.
Successful applicants will familiarize themselves with forced alignment and will identify software which can in fact align one-hour recordings with their transcripts. Gentle is exceptional in this regard: the Montreal Forced Aligner, for instance, cannot cope with such long recordings. The BAS has created a chunker to circumvent this problem, but the reliability of such a solution would also need to be assessed.
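A simple quality check of the kind mentioned above could compare the original caption timestamps with the times an aligner assigns to the same lines. The following Python sketch assumes a hypothetical input format (two parallel lists of times in seconds, with `None` where the aligner found no match); it does not reflect the output format of Gentle or any other specific tool.

```python
from statistics import median

def alignment_report(caption_times, aligned_times):
    """Summarize caption delay and aligner coverage for one recording.

    caption_times: original transcript timestamps in seconds, one per line.
    aligned_times: aligner-assigned times for the same lines, or None where
                   the aligner could not place the line.
    """
    if len(caption_times) != len(aligned_times):
        raise ValueError("both lists must describe the same transcript lines")
    delays = [
        caption - aligned
        for caption, aligned in zip(caption_times, aligned_times)
        if aligned is not None
    ]
    if not delays:
        return {"coverage": 0.0, "median_delay": None, "max_deviation": None}
    mid = median(delays)
    return {
        # Share of lines the aligner managed to place.
        "coverage": len(delays) / len(caption_times),
        # Typical lag of the captions behind the audio.
        "median_delay": mid,
        # Largest departure from the typical lag; large values merit review.
        "max_deviation": max(abs(d - mid) for d in delays),
    }

report = alignment_report(
    caption_times=[12.0, 18.5, 25.0, 31.0],
    aligned_times=[8.0, 14.0, None, 27.5],
)
```

Low coverage or a large deviation from the median delay would flag a recording for closer inspection, for example because of non-speech content in the transcript or words missing from the dictionary.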
Mentors will be available for each language.
21. Cockpit - Red Hen Monitoring System
Mentored by José Fonseca and Francis Steen
Currently, Red Hen has more than a dozen remote capture stations around the world that send their data to a central repository. The growing number of capture stations dramatically increases the complexity of managing them centrally. The stations should be online, able to record the media signals based on a schedule, and able to send them to a central repository or allow them to be downloaded. If any of these tasks fails, the local person responsible for that station must be warned so that the problem can be fixed.
This project proposes to automate the task of sensing the health of the stations and taking appropriate actions according to the problems detected. Some other routine operations may also be performed, such as automatic backup generation and configuration of new stations. It should provide a responsive dashboard, called Cockpit, that lets the central administrator monitor the capture stations, obtain uptime statistics, and take corrective actions from a web device or a smartphone.
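The most basic health check is to notice when a station has gone silent. The Python sketch below assumes, purely for illustration, that each station reports a heartbeat timestamp to the central server; the station names, the 15-minute threshold, and the `find_silent_stations` function are hypothetical.

```python
from datetime import datetime, timedelta

def find_silent_stations(last_seen, now, max_silence=timedelta(minutes=15)):
    """Return the names of stations whose last heartbeat is too old.

    last_seen: dict mapping station name -> datetime of its last heartbeat.
    now: the current time, passed in explicitly so the check is testable.
    max_silence: how long a station may stay quiet before it is flagged.
    """
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )

# Toy example with two hypothetical stations.
now = datetime(2019, 3, 1, 12, 0)
heartbeats = {
    "station_a": datetime(2019, 3, 1, 11, 55),
    "station_b": datetime(2019, 3, 1, 10, 0),
}
silent = find_silent_stations(heartbeats, now)
```

A dashboard could run such a check periodically and notify the local person responsible for each flagged station; checks on recording schedules and transfer failures would follow the same pattern.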
22. Accountability classifier from annotated data
Mentored by a team at UCLA Public Health
Red Hen has a dataset of newspaper articles on mass shooter events that is annotated for accountability and responsibility. The task is to train a classifier for detecting terms of accountability and responsibility in a larger dataset of newspaper articles and television news transcripts. We are particularly interested in tracking how allocation of responsibility changes over time.