Red Hen Lab - Basic Text Pipeline

Some "multimodal communications" consist of text files. A scientific article is such a piece of data. These files can be processed and tagged so as to create files of metadata that can then be searched and further processed. For example, Red Hen has a project focused on processing and tagging abstracts of scientific articles to make it easier to analyze them for image-schematic and narrative structure.

The Basic Text Pipeline is project within Red Hen Lab.

A Brazilian team from IFSP constitute the current active researchers of the pipeline: Rafael Ruggi (student), Lucas Souza Teixeira (student), Matheus Sardeli Malheiros (student), Gabriel Evaristo Santana da Silva (student), Nicholas Gomez Zilli Castro (student), Yamen Zaza (Student), Professor Gustavo Aurelio Prieto, Professor Hana Gustafsson - mentored by Professor Rosana Ferrareto.

Respectively, the active mentioned participants are:

Would you like to assist? If so, write to:

and we will try to connect you with a mentor.

Currently we have the following picture as a guideline for the development of the pipeline as a whole.

Constraints notes:

During the construction of the pipeline several problems were encountered by the developers and linguists, all the problems had during the construction period will be noted using this format, just a heads up, nothing to report yet!

Metadata

1. Elements of the Pipeline

1.1 File Acquisition

This can be accomplished in a myriad of ways. In the current state of this pipeline, these files are Journal Abstracts gathered manually from high impact journals in the Web of Science Platform (webofknowledge.com).

The current Dataset being worked at the Red Hen Basic Text Pipeline is a corpus composed of 1,000 Abstracts gathered by Cognitive Linguist and Language Professor Rosana Ferrareto, faculty member of IFSP and former PostDoctoral visiting researcher at CWRU, in the CogSci Department and in the Red Hen Lab.

The processes of abstract acquisition can be improved, and one of the ways in which this is possible is by creating a web crawler, which can do the steps required to download the abstracts and then upload those onto a storage of some kind.

Here's an example of an raw manually collected file:

FN Clarivate Analytics Web of Science

VR 1.0

PT J

AU Matsuda, PK

AF Matsuda, Paul Kei

TI Identity in Written Discourse

SO ANNUAL REVIEW OF APPLIED LINGUISTICS

AB This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.

SN 0267-1905

EI 1471-6356

PY 2015

VL 35

BP 140

EP 159

DI 10.1017/S0267190514000178

UT WOS:000351470600008

1.2 File normalization and conformity

This section specifies the rules that the names and content of files should have in order to be searchable by Edge Search Engine 4 provided by Red Hen (https://sites.google.com/case.edu/techne-public-site/red-hen-edge-search-engine). It is also worth pointing out that these rules have been established as in conformity to the Journal Abstracts to be searchable for the purpose of this specific research.

If you wish to contribute to the pipeline with different types of files please send an e-mail to the already specified address redhenlab@gmail.com.

1.2.1 Name specifications

Rules:

COMPLETE-DATE_DATATYPE_DOI_JOURNAL_FirstAuthorLastName_FirstAuthorFirstName.txt

Example filename:

2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul.txt

Notes:

1. COMPLETE-DATE is not numeric due to the type of data that is given by the source file in the case of Web of Science, since an accurate publication date is not provided. The magazine publication date is provided by its own standards, in most of the abstracts, date is encountered mostly by range, so this type of notation is found in several files: 2015-apr-may as COMPLETE-DATE, since this data is not used in the moment in the pipeline it is best to leave the date as originally found.

2. DOI is a number that specifies a file, see more at: https://www.doi.org/

Method:

A python script was made to sweep through all files, reorganizing each individual name into the aforementioned format:

Constraint Notes:

Before we arrived at this version of the file given above we had at least 3 versions of this format of name, the history can be accessed here.

1.2.2 File Headers specifications

Expected file header format:

TOP|COMPLETEDATE|FILENAME

COL|PLACE WHERE FILE IS BEING HELD

UID|UUID IDENTIFICATION NUMBER

SRC|JOURNAL COMPLETE NAME

CMT|SPECIFIC COMMENTS ABOUT THE FILE

CC1|LANGUAGE USED IN FILE

TTL|JOURNAL TITLE

CON|ABSTRACT CONTENT

END|COMPLETEDATE|FILENAME

Example file header format:

TOP|2015-oct|2015/2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul.txt

COL|Journal Abstracts, Red Hen Lab

UID|464f9d73e94e468eb7f492fa332e23d3

SRC|ANNUAL REVIEW OF APPLIED LINGUISTICS

CMT|

CC1|ENG

TTL|Identity in Written Discourse

CON|This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.

END|2015-oct|2015/2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul.txt

The following python Script was written in order to conform the file with the norms above:

The above mentioned python script was preceded by a PHP version of the same file, however as the machine that stores and runs the code does not have the capabilities to run PHP code, it was refurbished in the format described above (the original PHP file can be found on the GitHub for the project).

Constraint Notes:

This piece of code actually gave us some trouble, on the initial versions of the code, that can be found here, it is clear that the code does not account for a multiple lined Title neither a multiple lined Abstract content. Regexses had to be arranged in order for the resulting header file to be correct, since we had a bunch of files with incomplete titles and abstract contents.

1.3 Pragmatic Segmenter

Pragmatic segmenter is a third party software that is used in the pipeline in order to organize the file content (SRC, TTL and CON headers) into a file with a .seg extension. In this file, each sentence of the content f the abstracts occupies a line. Using the raw file at the first example we get the following resulting pragmatically segmented file (copy and paste the file in a plain text reader for the better understanding of the example):

Identity in Written Discourse

This article provides an overview of theoretical and research issues in the study of writer identity in written discourse.

First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized.

Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized.

The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity.

The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.

Steps to the procedure:

Pragmatic Segmenter is a software written in a programming language called ruby, initially ruby needs to be installed locally on a machine to run the application. For an installation guide follow the related link in the top of the page. After ruby is installed it is needed to download Pragmatic Segmenter. For the creation of this pipeline, the version developed by Kevin Dias was used. Here is the link to the step by step guide to installing it https://github.com/diasks2/pragmatic_segmenter.

Once pragmatic segmenter was installed, a brief ruby script was arranged to sweep through a path. Then the files were given an argument call on command prompt, this was arranged because as the whole pipeline is written in python, code conformity was preferred, rather then dealing with ruby natively to deal with the files. The next script was created based on the example given in the installation instructions page:

In this file, we can see that this ruby file sweeps through a file in the cache directory and segments its contents line by line. After that it will record its results to a file that has the same name of the argument and has the .seg extension. In order to put the files with just the contents of TTL and CON headers into the cache directory and to call the ruby script, a python script was created:

The function os.system() at the end of the script would calls the ruby application through a terminal command line. For each file that the python script reads, it sends the contents of its SRC + TTL + CON headers to a file in the cache folder with a .seg extension, then it sends the location of that recently created file to the aforementioned ruby script which segments the file as described and gives the results of the first example in this section.

This procedure to arrange the file in the specified format is essential to the process executed to run OpenSesame on Framenet 1.7 (step five of such guide), although the need for this procedure is unknown in order for the files to be found by Edge Search Engine 4 given by Red Hen.

1.4 Stanford Core NLP

Stanford Core NLP (SC NLP) is another third party software that serves the purpose of annotating text in a variety of ways in order to gather information from it (the various ways in which segments and surfaces information on the content are provided can be found on its webpage). The first thing to do in order to use SC NLP is the download of the specific software so that one can proceed to run Stanford Core NLP in any text.

There are several "wrappers" of SC NLP since its main code in written in java. For this project the wrapper developed by Lynten was used, and the link for the usage and installation of its pip library is https://github.com/Lynten/stanford-corenlp.

In this pipeline, the SC NLP software is used as the base marker for text in general, this is the first software of the pipelines that "adds" data, and not just "rearranges" its data for future use. The biggest function of the software is to use the annotators: tokenize, ssplit, pos, lemma, these respectively puts the sentence into tokens, ssplit to sentence split the sentence, just as Pragmatic Segmenter, but internally, pos for Part of Speech tags and lemma to lemmatize the sentence into the base forms of words.

In a first step, the SC NLP output is going to be used for the base insertion of the Corpus into CQPweb, which is a software described bellow.

Once everything was installed, a script was assembled to run the pipeline through the software. The python script is shown below:

The output format of the function currently is a JSON file which contains the annotations commented above. As the file is too big to be directly inserted here, it follows the link for the JSON output.

1.5 Data Modeling

After all of the previous phases, the project needed to be inserted into a corpus concordant of some kind. Initially it was thought that the corpus of abstracts was going to be inserted into Edge Search Engine 4 (as the previous text of this page infers), since it already has the capability of showing the data by frames, however the platform mainly serves the purpose of video and would become ill fitted with the data collected on this page.

In order to overcome this "corpus format" issue, it was set that we would use CQPweb (as it can be seen in the next sessions) this platform can provide a wide array of searches, such as by POS, grammatical constructions as well as its inputs can be directly modeled into it. With that in mind the project could grow a side in which data was modeled based on the tools already presented to fit CQPweb.

After a first glance it was clear that the inference of inserting the collected frames into CQPweb directly would be too hard and time consuming, so it was decided that in a first phase of the project, with the fixed deadline of july 2019 the Corpus would be searchable, at least by constructions on CQPweb.

1.5.1 VRT files

Throughout the research process it was discovered that the type of files accepted by CQPweb to be searchable as a corpus would be a file with extension .vrt, which stands for vertical text (i.e. one word per line).

The tutorial of the insertion of SaCoCo Corpus into markingCQPweb describes the process of creating such files.

The action of following the aforementioned tutorial on a private installation of CQPweb (the installation process is described in the next topic) was helpful mainly because it gave ground to the research process of modelling the data, and it was a hands on process needed to advance in this direction. The "Easy" section of the tutorial was followed, and by doing that the project received an example VRT, a blueprint that could be used to process the research's data set.

The VRT file stands for VeRTical xml file, basically it follows a very definite structure, in which, each text (in the case of this research, each abstract) is surrounded with the <text> tag, this tag defines the beggining and the end of each file in the corpus, following this tag, there is the <p> tag, that surrounds every paragraph of text, further on there is the <s> tag, which surrounds every sentence. Each line of the VRT file that contains the corpus has to be structure in the following way:

searchable_information(A TABULATION SPACE)searchable_information(A TABULATION SPACE)searchable_information...

Once the format of the file was well defined, it was time to create a script that regulates its information:

The above script also creates a metadata file for the corpus, which indexes the file together with its metadata, this means that important text information connected with each file can be found on CQPweb, as it can be seen in the next steps.

The first abstract formatted in VRT can be seen bellow:

Another thing about the VRT that is important to highlight is the information the research team inserted in it, in the case above, the following information were extracted from StanfordCoreNLP to be inserted into CQPweb:

word, part_of_speech, lemma, Original word

2. Corpus on CQPweb

CQPweb is query processor and Corpus Workbench (Corpus Query Processor hence CQP). More information on the purpose and aims of the tool could be found here.

As this is a documentation on the process regarding the construction of the pipeline, we will restrain from explaining every detail on the platform, extracting just what was important for the creation and maintenance of the pipeline.

CQPweb makes it possible to query a corpus from grammatical patterns. For the purposes of this documentation CQPweb came to fill the gap left by Edge Search Engine 4 which will not be available for the nature of the pipeline.

Once the corpus is inside the platform it is intended that it is going to be possible to query its contents for: targets, frames and arguments highlighted by Open Sesame or any other frame annotation software, as well as part of speech (POS) tags displayed by Stanford Core NLP as well as any other annotation at the word level in the corpus, besides obviously the tools and possibilities already provided just by the use of CQPweb.

The presentation of the usage steps will take the following course: CQPweb installation, verification of needs of corpora insertion on the platform as well as how to query the corpus for searches for frames (narrative patterns) and linguistic constructions (figurative patterns).

2.1 CQPweb Installation

In order for one to install CQPweb locally in a machine, at least a Debian distro of Linux is required. In this case, version 18.04 of Ubuntu was used. This installation occurred via a SSH connection on a virtual machine in the cloud, in case any of these concepts are alien to you, do not worry, any machine with the aforementioned Linux distro will be able to at least follow our installation steps.

The basic guidelines for installing the software are available here.

We focus here on which problems yielded from following the Guide.

Initially the manual is very objective in pointing out that the following sotware is needed to install CQPweb:

- Apache or some other webserver
- MySQL (v5.0 at minimum, and preferably v5.7 or higher)
- PHP (v5.3.0 at minimum, and preferably v5.6 or v7.0+)
- Perl (v5.8 at minimum, and preferably v5.20 or higher)
- Corpus Workbench (see 1.4 and 1.5 for version details)
- R
- Standard Unix-style command-line tools: awk, tar, and gzip; either GNU versions, or versions compatible with them.
- Also, in spite of not having any mentions of SVN on the referenced installation document, we recommend it.

For the installation of all these softwares combined, but the standard Unix-Style command-line tools which come with the linux distro usually, as well as Corpus Workbench, one might use the following command:

For the installation of Corpus Workbench, we recommend the instructions available here.

After the preceding software is installed, it is time to install CQPweb per se. In order for this process to begin, please follow the below instructions provided by Peter Uhrig, one of the mentors of the pipeline:

After this process, it might be possible to access the system through a web browser in the URL:

Once the software installed and running, the splashcreen of the system will look as something like this:

2.2 Corpus Insertion And configurations

In order to execute this step one must first.

- Have a proper VRT file, as formatted and shown above
- Have an instance of CQPweb up and Running normally
- An user with power to insert corpora into CQPweb

Firstly Login to the CQPweb system, and navigate to the "Admin Control Panel Page". It it the last link in the "Account options" Menu, on the left of the screen.

After that, in order to insert the corpus into CQPweb, firstly the upload of the files containing the Corpus must be uploaded to CQP, as well as its metadata counterpart.

To do that, firstly one needs to Navigate, under the menu "Corpora" and click the link "View Upload Area", at this link it will be listed all the files in any shape needed to research the corpora, in the case of this study, the VRT files and the metadata files are already uploaded, containing the Corpus.

After that, it is necessary to select every file separately and upload the selected files, with the click of the central button.

Once all the corpus files had been uploaded its time to install the corpus into CQPweb to make it searchable. The first step toward that is clicking in the "Install New Corpora" on your instance of CQPweb, as shown below:

Here there are several thing one must fill in and we will be as descriptive as possible regarding on how to properly fill the form to install the corpus.

- Fill the "Specify the corpus “name”" field, this will be the name of the Corpus to the MySQL database, and it has some rules, just follow the procedure enforced by CQP and you should be fine.
- Fill the "Enter the full descriptive name of the corpus" with a short descriptive name of the corpus
- After filling the names, it is time to choose which VRT files will be part of your corpus, choose as many VRTs as it is necessary for the entire corpus.
- After filling the chosen files it is time to choose the annotation of the corpus, the first part of this form is concerning the XML TAGS that classify the corpus, and you must fill them according to the fields specified in the XML. Below there is a picture of the Screen in which one can see the form filled with the information necessary to insert the corpus described on this page.
- Finally, there is the Word Annotation form, in which one must be descriptive about the annotations that occur word by word in the .VRT file.
- After filling all these fields we recommend selecting one of the many standard CSS files to styling the corpus.
- At last, to finish the installation process, one must click the link "Install corpus" with settings above to finish this part of the installation.

After the Click on the button the following Screen should follow up:

To proceed, the user must have clicked on the link to fill the metadata with a file, choose the previously uploaded metadata file in the screen and fill the metadata fields as it is shown in the image bellow, after that click in the "Install the matadata table using the settings above" button to finish the corpus metadata configuration.

Finally the corpus is installed! The final step is the configuration of the Lemma annotations, to frame with the Simple Query parameters automatically provided by CQPweb.

In order to do that one must navigate in the corpus view to the "Manage annotation" menu, and select the proper tag for every annotation in the corpus, as it is shown in the image bellow:

Now the corpus is installed and ready to usage, to do that, one must go in the "standart query" link in the corpus view and make the desired query.

3. OpenSesame on FrameNet 1.7

Open Sesame is an open source frame parser software, or in a more trivial way, it is a software that will highlight the frames of a given text using a certain parsing model (Frame Semantics is a huge topic in which we will not deepen in definition on this page. For more on this follow this link). Open Sesame uses FrameNet 1.7 as its dictionary, to annotate text for frames.

The process of dealing with Open Sesame to annotate frames in the Journal Abstracts dataset was divided mainly in three steps: dependency software installations and downloads of various required software, training, and running on the text files.

3.1 Dependency software installations and download of various required software

Firstly it is important clarify that all the software previously described and used in this pipeline were initially run locally. However this is not the case for Open Sesame, since the training phase of the software requires a lot of processing capacity. This fact forced the research team to search for an alternative and this would be one of the Red Hen's servers. Once the access was made to the platform, the following steps were taken in order to use Open Sesame properly. All the instructions given here to installation imply the use of a UNIX server.

As it can be seen in the read.me page of Open Sesame, the software is written in python and it solves its dependencies using PIP, so both software needed to be downloaded and installed plus the detailed instructions for this process are given here. In the specific case of this installation, arrangements were made in order for the software to be installed just for the logged in user of the machine. Below are the steps of that installation.

Python was already installed globally, so pip had to be downloaded and installed for just the current user:

Once pip and python were installed it was time for the download and installation of Open Sesame's dependencies, again, all locally:

The next step following the documentation given in the read.me file on the github page would be:

However an error was given by this command in which a specific NLTK library was not installed. In order to fix the issue, a small search indicated that such library had to be inside the python shell.

After the described installation through pyhton shell, the command could run with no bigger problems:

Once all this software is installed, it´s time to begin the processes of downloading the data the program will need to train as its dataset. Once the system is a neural network it uses training models in order to more accurately predict the frames in the target data. Firstly the data from FrameNet needs to be requested. Once the request was approved they sent the data. For evaluation and organization purposes, all data that will be used by Open Sesame has to be inside the \data folder. The FrameNet data was downloaded and uncompressed inside the \data directory.

Once the FrameNet data is uncompacted inside the \data directory, it´s time to download the glove. Glove is a pretrained word embedings set that Open Sesame uses its ouput trained by 6B tokens.

The last thing to do in preparation for training and running the software on real files is the preprocessing command, which has to be run in the root of the project:

Open Sesame Developer describes the preprocess script as "The above script writes the train, dev and test files in the required format into the data/neural/fn1.7/ directory. A large fraction of the annotations are either incomplete, or inconsistent. Such annotations are discarded, but logged under preprocess-fn1.7.log, along with the respective error messages."

3.2 Training

Once the process described above is done, the training of the data can begin. @swabhs explains the process of training as threefold, in which each individual step is used for tests later in the process. The main command given to training is:

Where $MODEL is the type of training that can be performed, the types available are "targetid", "frameid" and "argid", a more specific explanation about each of these functions in relation to its properties can be found in the readme page as well as in the paper, released by the same author.

$MODEL_NAME refers to a .conll file which will be used as a model to identify frames in each of the previously described models ("targetid", "frameid" and "argid").

This is the point where time consumption of computational resources was noticed, and measures had to be taken, moving the processing to the Red Hen's server, as the software is training a neural network. That said it is also noticeable the fact that any of the three stage models can be trained forever. Usually the time used for training the models was about a 24h period.

Here are the examples of the training made in the pipeline:

After letting each of this training commands run for approximately 24 hours, the system had prepared the models "targetid-01-17", "frameid-01-18" and "argid-01-22" for prediction, which is the next step.

3.3 Predictions

For the predictions, the models previously derived from training are used in order to predict the frames in each level, target, frame and argument, these models are required for the software to work. Also for the software to work it is needed that the sample or test file to be separated in just sentences (see example 1 in the pragmatic segmenter header in this page for the an example). In this example we used the pure output of step 3 in this page.

As for the next python inputs on the command prompt, they go as follows:

As it can be seen, the command is very similar to the training model, since it's running the same script. However only the flag --mode that will change from "training" to "predict", obviously now, can also be seen as the --raw_input flag, which designates in which file the Open Sesame will search for frames. After the program runs, it will output a .conll file in the directory "logs/$MODEL_NAME/predicted-targets.conll", which will be used as raw input for the next step. So in this case it becomes:

The runsit will output the frame predictions in the file logs/$MODEL_NAME/predicted-frames.conll which will be used as an input for the next step:

After all, the argument predictions file will be at: logs/argid-01-22/predicted-args.conll. Here is a preview of the predicted-args.conll file.

4. Semafor on FrameNet 1.5

Semafor is an open source frame parser software, or in a more trivial way, it is a software that will highlight the frames of a given text using a certain parsing model (Frame Semantics is a huge topic in which we will not deepen in definition on this page. For more on this follow this link). Semafor uses FrameNet 1.5 as its dictionary, to annotate text for frames.

Differently from Open Sesame it uses a Java Software on the core for its predictions and at least one model to guide these predictions.

Another important aspect of this version of the Semafor Software, is NOT the Latest Stable Release (LTS) of Semafor. As this version had Several bugs and it was too difficult to run. The Same author of the LTS version decided to fork that GitHub repository into another and made some experimentation that facilitated both, the process of installing the software as well as running it with no change in the output, given the usage of the same models.

The process of installing Semafor is basically the same as described in the GitHub Link.

Also like the Open Sesame Software, due to processing resource consumption from Semafor it was decided that it would run on Gallina, with that in mind the next steps presented here are referent to a Gallina download and configuration of the Software.

4.1 Downloads

The first steps were: Download the Semafor package, models package, maven package (Semafor compiler) and your files as input from Semafor.

To download Semafor, use this command inside on your Gallina home:

The same for the others, including that the models package needs to create a directory called models, an inside, download the package.

That is, inside of /mnt/rds/redhen/gallina/home/user/models, use the command:

For maven, inside of gallina home, use the command:

As last download, your files that will be used as input, that in this case, we use the prose hen files, with the command:

4.2 Configuration

Before running Semafor, it’s necessary to change some configurations to run definitely. About the models folder, after downloading the models package inside of them, it needs to unpack and enable the use of other folders. To unpack, inside of models directory, execute the command:

To enable the use on other directories, use this command:

Now for the Semafor configuration, needs to change an file called config.sh, which was on this directory:

To open and edit the file, use these code:

Between the changes, was:

1. Comment the line from the variable TURBO_MODEL_DIR, using #, because it will not be used on the execution of Semafor;
2. Comment the variable BASE_DIR, and write in a new line, using the code: export BASE_DIR=”/mnt/rds/redhen/gallina/home/USER”
3. Change the variable JAVA_HOME_BIN for the following directory: “/usr/lib/jvm/java-1.8.0-openjdk/bin”

Saving the changes, the last configuration is about maven, which after download your package, also needs to extract for your gallina home, using the command:

With the folder unpacked on your gallina home, it needs to use this command:

This command was to open a text file that configure the environment variables to your user, where what you write on this file, it will execute on all files and directories of the Gallina home of your user only. It needs to do these configurations because the server does not accept the use of sudo command, because it will invade the privacy of other users, so it’s necessary to configure it this way.

Inside of this file, needs to add these variables:

After save and exit from the file, needs to execute the file to ensure the modifications, using these command:

To verify if maven was installed with success, use this command:

As last point to configure, inside of the Semafor directory, install the maven, using this command:

4.3 Running

With all of these configurations made with success, it just remains run the Semafor, which first you need to enter on this directory:

After the system finished compiling is just run the "runSemafor.sh" file inside bin with the following parameters, absolute path to input files, absolute path to output file (it will try to create the file, so if permissions are not set properly it will not write the file correctly) and finally the amount of threads used to run Semafor for that file.

As a preference, the input file has to be the ".seg" extension, as the output file with ".sem". Is not necessary to create the output file before running Semafor, the program creates automatically. Generally, the number of threads used was 1 or 2.

Another point was that the input file cannot have a line break, or a blank line, because if this happens, it results in an execution problem on Semafor.

Here is a link of the output of Semafor for the example run.

As a next step on the front of the Semafor process, currently there is an effort to automate the running process for the entire corpus through a python script.

4.4 Running more than one file

One actual problem that Semafor has is about running a file, because for each file you need to write the same code line, but changing every time that has a new file. If there are few files, it’s not a problem, but considering one hundred or one thousand, is something that’s not viable, it will take a lot of unnecessary time. So the easy way is to automate this Semafor run using a code that will run all of the files you need.

4.4.1 Automate code in Python and how it works

So we create a Python code that makes the Semafor run automatically, using two libraries that will help navigate and open the runner though directories, and another to solve a problem that appears in running the Semafor multiple times. These two libraries are os and multiprocessing, which make the import to use on the code. Also it is necessary to import numpy because we use slurm files to split all the files of the input directory to manage the execution of Semafor, because if was to run all directly, probably it will cause an error, knowing that sometime it will have too many files for processing. And the data_gatherer was to manipulate where it will be your output file and the extension of the files.

First we need to define a function to store all the processes to run Semafor, after, navigate for where our input files are, make a list of files to store them, and navigate through the folder to catch all of the name files, which will be useful to use on the Semafor runner. After that, we create a broken list who splits the list_of_files in a specific quantity of slurm files. These steps are on the part of the code below:

After these steps, we need to create a variable to count how many slurm files will be created, walk through a list of who will create and edit those shell files, putting the commands from line 10 to 14. After that, walk on this list, and on each shell_file, put the command to run Semafor and print all the commands that exist on this shell script. And after that, we open the slurm_count as a slurm file, and write what is on the shell_file, and add a counter of the slurm_count. These code will until reach the quantity limited at the broken_list, which in this case was five.

Just these codes do not solve one problem that appears when you run many times the Semafor. When you put to run the Semafor, he open an thread and make all the analysis and generate the output file, but what happens, he doesn't close this run, and he opens another, so the problem which appears was that, assuming that he not close the executions, the threads increase, and when it comes to the limit of your machine, it bugs your computer. So for example, if your computer has 10 threads, when this code runs Semafor on the eleventh time, it bugs. To solve this, you need to close a Semafor run when he finishes, and for that, we use the multiprocessing library, with some functions that open, start and close the run of Semafor, or in this case, the function run().

To have a better view of the entire code, go to this link.

4.4.2 Executing the Python code

If you want to use these codes, remember to change the directories according to where your input files and the Semafor. To execute the python code, use those Linux commands:

The command to install, actualize the python-pip and install pandas needs to be executed on your Gallina home, to execute the python file go to the abstract_normalizer directory of prose_hen. The sbatch command needs to be executed on the prose_hen directory, and it will be in total 5 commands, one for each slurm file. Also the squeue and tail command was executed on the prose_hen directory.

5. Data Analysis

The Data analysis part of the research is that in which the linguist tries to directly derive meaning and conclusions based on the quantitative data given by one or more software included in the pipeline. For this reason, this section is divided into which information was derived from which presented platform.

5.1 Open Sesame ConLL files:

The Data Analysis Processes based on the Open Sesame software has its roots on the ConLL files delivered by Open Sesame. In these files it is shown that there is a myriad of information that Open Sesame provides to the research, those mainly being in the axis of Frame count, Target count (Lexical unit that invokes the frame, can be counted by frame, or by itself), argument count (An argument that is invoked by an specific frame, can be counted by frame, or by simply counting the arguments).

These findings with respect to the ConLL files can be found here in the excel folder of the statistics part of the GitHub project. Initially these findings were taken,by abstract, by fields of research, such as Biology, Hard Sciences or Human Sciences, by discipline, such as chemistry, or health, journal, such as, Academic Medicine, or Advanced Materials, and finally, all the files together in a single count.

After analyzing the above mentioned statistical files, which provide an overview of the frames encountered by Open Sesame the research team decided to not use the data from Open Sesame as a trustworthy software to frame annotation. One of the main points that guided this decision was that the most found frame by the Open Sesame resulting dataset is the frame Age, mainly evoked by the lexical unit "of", however, many instances of the word "of" in the dataset, when contrasted by human evaluation would not point to the frame Age.

With that in mind the team decided a different approach running another Frame Annotation Software, Semafor, using Framenet 1.5.

5.2 Semafor in FrameNet 1.5

This parsing for the data analisys part of the research starts from a point where we have Semafor’s output as a JSON file for each abstract of the corpus. The resulting JSON from the frame analisys have the structure that can be seen bellow:

The file is diveded by line, each line in the JSON corresponds to a line of text in the abstract. With that being the case, each line has a JSON object with the attribute “frames” containing the frames in the marked up sentence by this line.

As a sentence can contain more than one frame, the attribute "frames" contains an Array, each position in the Array is an identified frame, this information comes in two attributes respectivelly, “target” and “annotationSets”.

- “target” is where general information about the frame is, its designation and also the lexical unit that called the frame and its position (counting spaces),this positional information can be looked up by the attributes “start”, “end” and “text”.
- “annotationSets” is where information about the result of semafor’s analysis is found. Attributes like “rank” and “score”, that measures the accuracy of certain numbers by semafor. Also the “frameElements” attribute is included in this part and its information designate the frame elements identified by which.

With these information in mind, as it is know, the research in this part has two main purposes.

- 1. Find the occurrence of frames in the corpus, for all abstracts, per field, per discipline and per journal.
- 2. Find the occurrence of Lexical Units that evoke the above found frames in the corpus, for all abstracts, per field, per discipline, per journal and also for each frame (considering frame subdivisions).

In order to evalluate such objects there are some different tools that are described bellow for a better understanding of the coding process. These are shown bellow:

5.3 Tools

The two main tools used in the data analysis process are the python language, as it is also the rest of the pipeline, as well as Jupyter Notebook Environment this technology allow us to have a practical environment to generate the statistical info, with the concept of "live code", which means that the code can run from a point where data have already been treated, without needing to run all the code every time a new statistical visualization is needed.

Inside these Jupyter Notebooks, Python code is normally run, all files and their respective functions and features can be viewed here.

5.4 Results

The code written in this section of the research process mainly aims at two specific statistical purposes, already mentioned above, frame count and lexical unit count.

Firstly there is a division by all the aforementioned categories (for all abstracts, per field, per discipline and per journal), then inside of each directory there is two directories, one for frames, another one for lexical units. In Practice, the way the results can be found here.

In terms of generating artifacts for representing the results three main types of artifacts were generated by the analisys.

- 1. A spreadsheet (here the spreadsheet refers to all the occurrences of frames ranked in the journal "Academy of Management Annals") counting every occurrence.
- 2. Graph bars ranking occurrences, divided in thirty by thirty slices.

- 3. Pizza graphs ranking all occurrences, and here the graphs are divided in two types.
  - - a - Ranking all occurrences by the category, until it reaches one percent, and when it reaches one percent it aggregates its results.

- - - b - Ranking all occurrences above one percent, as it has make a new total.

These findings are today guiding our next steps in the research, mainly in terms of having more statistical analysis. The expansion will come in the front of ranking which lexical units evoked a single frame.

Despite all the processes that are described in this section being about the data analysis extracted from the Semafor frame annotation software the actual installation process of Semafor is yet documented, currently we have the effort document the platform instalation and running processes.

Constraints notes:

Here there is one big constraint that must be mentioned. In order for all the statistical data to be extracted from the corpus, all the data had to be manipulated through python scripts. Since the retrieval of the Semafor data, to the actual calculations to make all these graphs based only on counting occurrences of frames and lexical units in the corpus.

We intent that solely with the purpose of analyzing the data more thoroughly and less sparsely to insert this data into a database of some sort and query the database to extract the spreadsheets that would infer the above mentioned graphs.

6. Projects using the Pipeline

As the development and construction of this pipeline is part of the Red Hen Lab it has a collaborative nature. In this context it is natural to assume that several people would use the work already finished to achieve different results with different inputs into the system.

Currently there are three projects that are being developed inside this pipeline.

6.1 Narrative & Figurative Patterns in Science Communication: frame and blending constructions

This project lies between cognition and language and between the research and the communication processes. It is developed under three axes:

- Theoretical: Under the light of Cognitive Linguistics, it investigates in what extent narrativity and figurativity account for developing, describing and communicating scientific concepts and processes. In addition, as the audience are both scientists in interdisciplinary research settings and novice researchers having to engage into a new jargon, we aim to look into blending constructions for the communication of new/original complex technical concepts and processes.
- Technological: The Basic Text Pipeline provides the project with tech tools for linguistic analysis. It is being developed for semantic and syntactic automatic tagging of data. High-impact published scientific abstracts are tagged for frames in Semafor, whose statistical output makes up the narrative patterns. The same corpus is tagged for parts of speech in Standford Core NLP for CQPweb search of blending constructions (analogies and metaphors), which make up the figurative patterns.
- Pedagogical: Laletec Extension Project is the branch responsible for modelling the linguistic analysis results into pedagogical practices and products as evidences on how narrative and figurative patterns contribute to the teaching of scientific writing.

The project has been developed by Rosana Ferrareto (IFSP, Brazil) since August/2018, when she was a visiting PhD scholar at Case Western Reserve University under the supervision of Mark Turner and is now part of the Research Program aCOMTECe.

6.2 WriteFrame

This project investigates how novice researchers frame their research; specifically, if they use personal narratives that differ from the conventional narrative structure of scientific writing. The project is being developed by Hana Gustafsson (NTNU, Norway) in close collaboration with Rosana Ferrareto and Mark Turner. WriteFrame’s text pipeline - which is directly based on the Basic Text Pipeline - has been developed and implemented by Yamen Zaza in close collaboration with Rosana’s technical team.

7. Future Pipeline Steps

As it can be foreseen this page is a growing continuum of the construction of the basic text pipeline, as we as a team intent to help and serve as guidance to Rosana's post Doctoral and future research there are some tasks that are about to happen, and a myriad of another tasks that would be awesome if they could integrate the pipeline, as it follows our intentions about the future are:

Tasks about to be documented/implemented:

- OpenSesame installation and data formatting.
- Evaluation and comparing between Semafor and OpenSesame.
- Frame Index.
- MongoDB Database for the Basic Text Pipeline.
- Obviously, at the end of the process we intent to manipulate the files, the details about this procedure should be documented in this page.