Update the FrameNet tagger to Open Sesame
Red Hen annotates all of its English-language television caption data with conceptual frames from UC Berkeley's FrameNet project -- to our knowledge, by far the largest dataset thus annotated. To achieve this, we use the Semafor 3.0 frame-semantic parser from CMU, and the results are excellent. However, Semafor only works with FrameNet 1.5, and the project has now been superseded by Open Sesame, which also handles FrameNet 1.7.
For a discussion, see Butterfly Effects in Frame Semantic Parsing: impact of data processing on model ranking (2018); see also the authors' Python pipeline, which we may want to use.
The task is to implement Open Sesame's annotation of Red Hen's English textual data with FrameNet 1.7. Would you like to work on this task?
If so, write to
and we will connect you with a mentor.
How to start
Below is a set of suggestions for finding out more about how Open-SESAME works and how we can integrate its output into our existing data format to make the annotations searchable. Once we have gathered the required information from the steps below, we will come up with a plan for how to interface with Open-SESAME.
Install Open-SESAME in a Singularity container. If you have a Linux machine with root access, you can play around locally, but in the end you should make available a working image through Singularity Hub. If you do not have a Linux machine with root access, you can use SingularityHub directly (but this will be a bit more tedious).
Test it by running it on normal, untokenized text with one sentence per line (i.e. with no spaces between words and the periods/full stops, commas, and other punctuation that follow them).
What does it do?
What does the output look like?
Are words and punctuation separated in the output? If so, we know that Open-SESAME performs tokenization. Try sentences with hyphenated words (ice-cream, co-pilot, ...). Are they split up or not?
Can you identify the part-of-speech tagset used? It will very likely be the Penn Treebank tagset, but we had better verify.
Are there syntactic annotations? (nsubj, dobj, ... or NP, PP, ...)
Report back with your results so we can decide which route to take from here.
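The tokenization checks above can be partly automated. The following is a minimal sketch, not part of Open-SESAME: it writes test sentences one per line and then inspects a column-based output file. The names write_input, read_conll_tokens and tokenization_report are hypothetical, and the assumed output layout (tab-separated columns, one token per line with the word form in the second column, blank lines between sentences) must be checked against what Open-SESAME actually produces.

```python
import tempfile
from pathlib import Path

# Test sentences in ordinary, untokenized text, including hyphenated words.
SENTENCES = [
    "The co-pilot bought ice-cream, then left.",
    "Red Hen annotates television captions.",
]

PUNCTUATION = {".", ",", "!", "?", ";", ":"}


def write_input(path, sentences):
    """Write one sentence per line to the given path."""
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")


def read_conll_tokens(text, token_column=1):
    """Return one list of tokens per sentence.

    ASSUMPTION: tab-separated columns, blank lines between sentences,
    word form in column `token_column` (0-based). Adjust after looking
    at the real output.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip comment lines, if any
        current.append(line.split("\t")[token_column])
    if current:
        sentences.append(current)
    return sentences


def tokenization_report(raw_sentence, tokens):
    """Compare the parser's tokens with a plain whitespace split."""
    return {
        # True if the tool did more than split on whitespace.
        "retokenized": tokens != raw_sentence.split(),
        # True if punctuation marks appear as tokens of their own.
        "punctuation_split": any(t in PUNCTUATION for t in tokens),
        # Multi-character tokens that still contain a hyphen.
        "hyphens_kept": [t for t in tokens if "-" in t and len(t) > 1],
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        input_path = Path(tmp) / "sentences.txt"
        write_input(input_path, SENTENCES)
        print(input_path.read_text(encoding="utf-8"))

    # Hand-written stand-in for parser output, NOT real Open-SESAME output.
    sample_output = "1\tThe\n2\tco-pilot\n3\tleft\n4\t.\n"
    tokens = read_conll_tokens(sample_output)[0]
    print(tokenization_report("The co-pilot left.", tokens))
```

If "retokenized" is true and "punctuation_split" is true, the tool tokenizes its input; an empty "hyphens_kept" list for a sentence containing "ice-cream" would suggest that hyphenated words are split.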
Tagging FrameNet 1.7 with Semafor
Red Hen is currently tagging FrameNet 1.5 with Semafor 3.0-alpha4 from the ARK group at CMU. It selects frames from FrameNet 1.5, using automatic semantic role labeling (ASRL), frame identification, and argument identification. FrameNet-06.py converts the JSON output to the Red Hen format with primary tag FRM_01. An experimental fork is maintained at https://github.com/AlenUbuntu/semafor. It includes a few fixes and updates Semafor to be compatible with FrameNet 1.7. It would be very interesting for Red Hen to see whether we can use this fork to upgrade our current pipeline.
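To illustrate the kind of conversion FrameNet-06.py performs, here is a minimal sketch. It is not the actual script. It assumes that each line of parser output is a JSON object with a "frames" list whose entries carry a "target" and "annotationSets" with "frameElements", and that each span has a "text" field. The function name frames_to_lines, the timestamps, and the order of the pipe-delimited fields are all hypothetical and should be replaced by the real Red Hen conventions.

```python
import json


def frames_to_lines(json_line, start, end, tag="FRM_01"):
    """Turn one sentence of Semafor-style JSON into pipe-delimited lines.

    ASSUMPTIONS: the JSON keys used below and the output field order
    (start|end|tag|frame|Target|text|role|text|...) are illustrative
    only; verify both against real data before relying on them.
    """
    sentence = json.loads(json_line)
    lines = []
    for frame in sentence.get("frames", []):
        target = frame["target"]
        target_text = " ".join(s["text"] for s in target.get("spans", []))
        fields = [start, end, tag, target["name"], "Target", target_text]

        # Use only the first (assumed best-ranked) annotation set.
        annotation_sets = frame.get("annotationSets", [])
        if annotation_sets:
            for element in annotation_sets[0].get("frameElements", []):
                text = " ".join(s["text"] for s in element.get("spans", []))
                fields += [element["name"], text]

        lines.append("|".join(fields))
    return lines


if __name__ == "__main__":
    # Hand-written example input, not real parser output.
    sample = json.dumps({
        "tokens": ["I", "bought", "ice-cream"],
        "frames": [{
            "target": {
                "name": "Commerce_buy",
                "spans": [{"start": 1, "end": 2, "text": "bought"}],
            },
            "annotationSets": [{
                "rank": 0,
                "score": 1.0,
                "frameElements": [
                    {"name": "Buyer",
                     "spans": [{"start": 0, "end": 1, "text": "I"}]},
                    {"name": "Goods",
                     "spans": [{"start": 2, "end": 3, "text": "ice-cream"}]},
                ],
            }],
        }],
    })
    for line in frames_to_lines(sample, "20180101000000.000",
                                "20180101000002.000"):
        print(line)
```

Keeping the conversion in one small function like this would make it easy to swap in a different parser's output later, since only the JSON-reading part depends on the parser.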
FrameNet has initiated a Multi-lingual FrameNet project, funded by the NSF. Anything that comes out of this project should also be used by Red Hen.
In late October 2018, Red Hen directors Turner and Steen met on Zoom with FrameNet principal Eve Sweetser at Berkeley, Tiago, the lead developer of Brazilian FrameNet, in Rio, former ICSI PhD Nancy Chang at Google, and Anna Pleshakova at Oxford. Tiago was asked to prepare a brief description of the web-based infrastructure he is setting up to allow submissions of frames in multiple languages. Olga Lyashevskaya, who was part of Laura Janda's team at some point and speaks Norwegian, is involved in the Russian FrameBank project.
2021-06-01: New possibilities: https://www.aclweb.org/anthology/2021.eacl-demos.19.pdf "We present LOME, a system for performing multilingual information extraction. Given a text document as input, our core system identifies spans of textual entity and event mentions with a FrameNet (Baker et al., 1998) parser. It subsequently performs coreference resolution, fine-grained entity typing, and temporal relation prediction between events. By doing so, the system constructs an event and entity focused knowledge graph. We can further apply third-party modules for other types of annotation, like relation extraction. Our (multilingual) first-party modules either outperform or are competitive with the (monolingual) state-of-the-art. We achieve this through the use of multilingual encoders like XLM-R (Conneau et al., 2020) and leveraging multilingual training data. LOME is available as a Docker container on Docker Hub. In addition, a lightweight version of the system is accessible as a web demo." —Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 149–159 April 19 - 23, 2021. ©2021 Association for Computational Linguistics