Basic Text Pipeline





Some "multimodal communications" consist of text files; a scientific article is one such piece of data. These files can be processed and tagged to create metadata files that can then be searched and processed further. For example, Red Hen has a project focused on processing and tagging abstracts of scientific articles to make it easier to analyze them for image-schematic and narrative structure.
The Basic Text Pipeline is a project within Red Hen Lab.
A Brazilian team from IFSP constitutes the pipeline's current active researchers: Rafael Ruggi (student), Lucas Souza Teixeira (student), Matheus Sardeli Malheiros (student), Nicholas Gomez Zilli Castro (student), Yamen Zaza (student), Professor Gustavo Aurelio Prieto, and Professor Hana Gustafsson, mentored by Professor Rosana Ferrareto.



Would you like to assist? If so, write to: and we will try to connect you with a mentor.

Currently we have the following picture as a guideline for the development of the pipeline as a whole.

Constraint notes:
During construction of the pipeline, the developers and linguists encountered several problems; all problems found during the construction period will be noted using this format. Just a heads up: nothing to report yet!

Elements of the Pipeline


File Acquisition

This can be accomplished in a myriad of ways. In the current state of the pipeline, these files are journal abstracts gathered manually from high-impact journals on the Web of Science platform (webofknowledge.com).
The dataset currently being worked on in the Red Hen Basic Text Pipeline is a corpus of 1,000 abstracts gathered by cognitive linguist and language professor Rosana Ferrareto, faculty member at IFSP and former postdoctoral visiting researcher at CWRU, in the CogSci Department and in the Red Hen Lab.
The abstract acquisition process can be improved, for example by creating a web crawler that performs the steps required to download the abstracts and then uploads them to storage of some kind.
Here is an example of a raw, manually collected file:



FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Matsuda, PK
AF Matsuda, Paul Kei
TI Identity in Written Discourse
SO ANNUAL REVIEW OF APPLIED LINGUISTICS
AB This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.
SN 0267-1905
EI 1471-6356
PY 2015
VL 35
BP 140
EP 159
DI 10.1017/S0267190514000178
UT WOS:000351470600008
ER

EF

File normalization and conformity

This section specifies the rules that file names and contents must follow in order to be searchable by the Edge Search Engine 4 provided by Red Hen (https://sites.google.com/case.edu/techne-public-site/red-hen-edge-search-engine). It is also worth pointing out that these rules were established so that the journal abstracts would be searchable for the purpose of this specific research.
If you wish to contribute different types of files to the pipeline, please send an e-mail to redhenlab@gmail.com.

Name specifications

Rules: 
COMPLETE-DATE_DATATYPE_DOI_JOURNAL_FirstAuthorLastName_FirstAuthorFirstName.txt
Example filename:
2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul.txt
Notes:
1. COMPLETE-DATE is not strictly numeric because of the data provided by the source file: in the case of Web of Science, an accurate publication date is not given. The journal publication date follows the journal's own standards, and in most abstracts the date is given as a range, so notation such as 2015-apr-may appears as COMPLETE-DATE in several files. Since this data is not currently used by the pipeline, it is best to leave the date as originally found.
2. DOI is an identifier that uniquely specifies a document; see more at: https://www.doi.org/
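As a sanity check, the naming convention above can be expressed as a regular expression. The helper below is a hypothetical sketch for illustration only, not part of the pipeline:

```python
import re

# Hypothetical validator for the naming convention:
# COMPLETE-DATE_DATATYPE_DOI_JOURNAL_Last_First.txt
# Dots in the DOI become dashes and the slash becomes an underscore, so
# the DOI contributes two underscore-separated chunks; "00" stands in
# for a missing date, and the first-name part is optional.
FILENAME_RE = re.compile(
    r"^\d{4}(00|-[a-z-]+)?"   # COMPLETE-DATE, e.g. 2015-oct or 2015-apr-may
    r"_JA"                    # DATATYPE: Journal Abstract
    r"_[\w-]+_[\w-]+"         # DOI prefix and suffix
    r"_[a-z-]+"               # journal name, hyphen-separated
    r"_[a-z-]+(_[a-z-]+)?"    # author last name, optional first name
    r"\.txt$"
)

def is_conformant(name):
    return FILENAME_RE.match(name) is not None
```

Applied to the example file name above, `is_conformant` returns True.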


Method:
A Python script was written to sweep through all files, renaming each one to the aforementioned format:


# -*- coding: UTF-8 -*-
import os

# Collect every raw file under ../raw
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../raw"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

nameOfFiles = list()

for file in listOfFiles:
    if file.endswith(".txt"):
        with open(file) as f:
            lines = [line.rstrip('\n') for line in f]

        # Reset the metadata fields for each file
        date = ""
        ano = ""       # publication year
        fonte = ""     # journal (source)
        doi = ""
        autorLast = ""
        autorFirst = ""

        for l in lines:
            l = l.replace("\ufeff", "")  # strip the BOM if present

            if l.startswith("PD "):  # publication date, e.g. "OCT" or "APR-MAY"
                date = l.replace("PD ", "").replace(" ", "-").lower()

            if l.startswith("PY "):  # publication year
                ano = l.replace("PY ", "")

            if l.startswith("DI "):  # DOI
                doi = l.replace("DI ", "")

            if l.startswith("SO "):  # journal name
                journal = l.replace("SO ", "")
                fonte += journal.replace(" ", "-").lower()

            if l.startswith("AF "):  # full author name, "Last, First"
                autorComplete = l.replace("AF ", "")
                autorLast, sep, autorFirstComplete = autorComplete.partition(', ')
                autorLast = autorLast.replace(" ", "-").lower()

                # An abbreviated given name is the first letter of the name
                # plus a dot (e.g. "D."), i.e. two characters, so any
                # candidate of that length is skipped.
                aux = 2
                autorFirst = ""
                for autorName in autorFirstComplete.split(" "):
                    if len(autorName) > aux:
                        autorFirst = autorName
                        break
                autorFirst = autorFirst.lower()

        # Start assembling the file name:
        # COMPLETE-DATE_DATATYPE_DOI_JOURNAL_Last_First.txt
        stringFileName = str(ano)

        if date != "":
            stringFileName += "-" + str(date)
        else:
            stringFileName += "00"

        # Separating date and DOI with the annotation name
        stringFileName += "_JA_"

        if doi != "":
            stringFileName += doi
        else:
            stringFileName += "00-0000_0000000000000000"

        # Separating DOI and file source
        stringFileName += "_" + str(fonte)

        # Adding the first author's name
        stringFileName += "_" + str(autorLast)
        if autorFirst != "":
            stringFileName += "_" + str(autorFirst)

        # Slashes become underscores, dots become dashes
        stringFileName = stringFileName.replace("/", "_")
        stringFileName = stringFileName.replace(".", "-")

        # Adding the extension
        stringFileName += ".txt"

        nameOfFiles.append(stringFileName)

        with open("../abstracts/" + ano + "/" + stringFileName, "w+") as fileReady:
            fileReady.write("\n".join(lines))

print(nameOfFiles)

Constraint Notes:
Before arriving at the version of the format given above, we went through at least three versions of the naming format; the history can be accessed here.

File Headers specifications:

Expected file header format:
TOP|COMPLETEDATE|FILENAME
COL|PLACE WHERE FILE IS BEING HELD
UID|UUID IDENTIFICATION NUMBER
SRC|JOURNAL COMPLETE NAME
CMT|SPECIFIC COMMENTS ABOUT THE FILE
CC1|LANGUAGE USED IN FILE
TTL|JOURNAL TITLE
CON|ABSTRACT CONTENT
END|COMPLETEDATE|FILENAME


Example file header format:
TOP|2015-oct|2015/2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul.txt
COL|Journal Abstracts, Red Hen Lab
UID|464f9d73e94e468eb7f492fa332e23d3
SRC|ANNUAL REVIEW OF APPLIED LINGUISTICS
CMT|
CC1|ENG
TTL|Identity in Written Discourse
CON|This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.
END|2015-oct|2015/2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul.txt

The following Python script was written to conform the files to the norms above:

import os
import uuid
import re

# Collect every renamed abstract under ../abstracts
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../abstracts"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]


for file in listOfFiles:
    if file.endswith(".txt"):
        path = file.split('/')

        completeDate = path[-1].split("_")[0]

        # YEAR/FILENAME, relative to the abstracts directory
        completePath = path[-2] + "/" + path[-1]
        completePath = completePath.replace("..", ".")

        print(completePath)

        line1 = "TOP|" + completeDate + "|" + completePath + '\n'
        line2 = "COL|Journal Abstracts, Red Hen Lab" + '\n'
        line3 = "UID|" + uuid.uuid4().hex + '\n'
        line5 = "CMT|" + '\n'
        line6 = "CC1|ENG" + '\n'
        line9 = "END|" + completeDate + "|" + completePath + '\n'

        # Defaults, in case a field is missing from the raw file
        line4 = "SRC|" + '\n'
        line7 = "TTL|" + '\n'
        line8 = "CON|" + '\n'

        with open(file) as f:
            lines = [line.rstrip('\n') for line in f]
            fullText = "\n".join(lines)

            # SEARCH FOR TITLE (the TI field may span several lines, up to SO)
            r = re.search("TI ((.*?\n)+)SO", fullText)
            if r:
                line7 = "TTL|" + r.group(1).strip().replace("\n", " ").replace("    ", " ") + "\n"

            # SEARCH FOR ABSTRACT CONTENTS (up to the next two-letter field tag)
            r = re.search("AB ([\w\W]*?)(?=\n[A-Z]{2}\s)", fullText)
            if r:
                line8 = "CON|" + r.group(1).strip().replace("\n", " ").replace("    ", " ") + "\n"

            # Journal name from the SO field
            for l in lines:
                l = l.replace("\ufeff", "")
                if l.startswith("SO "):
                    line4 = "SRC|" + l[3:].strip() + '\n'

            stringFinal = line1 + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9
            with open("../headers/" + completePath, "w+") as fileReady:
                fileReady.write(stringFinal)

The above Python script was preceded by a PHP version; however, since the machine that stores and runs the code cannot run PHP, it was rewritten in the form described above (the original PHP file can be found in the project's GitHub repository).

Constraint Notes:
This piece of code gave us some trouble: in the initial versions (which can be found here), the code did not account for a title or an abstract content spanning multiple lines. The regexes had to be adjusted for the resulting header files to be correct, since many files ended up with incomplete titles and abstract contents.
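The multi-line title problem can be illustrated with a small hand-made Web of Science fragment whose TI field wraps onto an indented continuation line (the sample text is illustrative, not a real export):

```python
import re

# Hand-made sample: a TI field wrapped across two lines, the second
# line indented as in the raw Web of Science exports.
sample = "TI Identity in Written\n   Discourse\nSO ANNUAL REVIEW OF APPLIED LINGUISTICS"

# Same regex idea as in the header script: capture every line between
# the TI tag and the SO tag, then collapse the line break and the
# continuation indentation into single spaces.
r = re.search("TI ((.*?\n)+)SO", sample)
title = r.group(1).strip().replace("\n", " ").replace("    ", " ")
```

The resulting `title` is the single-line string "Identity in Written Discourse".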


Pragmatic Segmenter

Pragmatic Segmenter is third-party software used in the pipeline to organize the file content (the SRC, TTL and CON headers) into a file with a .seg extension. In this file, each sentence of the abstract's content occupies one line. Using the raw file from the first example, we get the following pragmatically segmented file (copy and paste it into a plain-text reader for a better view of the example):

Identity in Written Discourse
This article provides an overview of theoretical and research issues in the study of writer identity in written discourse.
First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized.
Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized.
The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity.
The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.


Steps to the procedure:

Pragmatic Segmenter is written in a programming language called Ruby, so Ruby first needs to be installed locally to run the application. For an installation guide, follow the related link at the top of the page. After Ruby is installed, Pragmatic Segmenter must be downloaded. For this pipeline, the version developed by Kevin Dias was used. Here is the link to its step-by-step installation guide: https://github.com/diasks2/pragmatic_segmenter.

Once Pragmatic Segmenter was installed, a brief Ruby script was written to sweep through a path, with the file name given as a command-line argument. This arrangement was chosen because the whole pipeline is written in Python, and code conformity was preferred over handling the files natively in Ruby. The following script was created based on the example given on the installation instructions page:

require 'pragmatic_segmenter'

# The file name is passed as a command-line argument
fileLocation = ARGV[0]

# Read the whole cached file
content = ""
f = File.open("../pragmatic/cache/" + fileLocation, "r")
f.each_line do |line|
    content += line
end
f.close

# Segment the text: one sentence per element
ps = PragmaticSegmenter::Segmenter.new(text: content)
segments = ps.segment

# Write the result, one sentence per line
stringFinal = ""
segments.each do |seg|
    stringFinal += seg + "\n"
end

f = File.open("../pragmatic/" + fileLocation, "w+")
f.write(stringFinal)
f.close

This Ruby script sweeps through a file in the cache directory and segments its contents line by line. After that, it records the results to a file with the same name as the argument and the .seg extension. To put files containing just the contents of the SRC, TTL and CON headers into the cache directory and to call the Ruby script, a Python script was created:

import os

# Collect every header file under ../headers
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../headers"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

for file in listOfFiles:
    if file.endswith(".txt"):
        path = file.split('/')[-1]
        path = path.replace("..", ".")

        print(path)

        sendToPragmatic = ''

        with open(file) as f:
            lines = [line.rstrip('\n') for line in f]

            for l in lines:
                l = l.replace("\ufeff", "")

                if l.startswith("SRC|"):
                    sendToPragmatic = l[4:].capitalize().strip() + '.' + '\n'

                if l.startswith("TTL|"):
                    sendToPragmatic += l[4:].strip() + '.' + '\n'

                if l.startswith("CON|"):
                    sendToPragmatic += " "
                    sendToPragmatic += l[4:]

        # Write the SRC + TTL + CON text to the cache, then call the Ruby script
        segName = path[:-3] + 'seg'
        with open("../pragmatic/cache/" + segName, "w+") as fileReady:
            fileReady.write(sendToPragmatic)

        pragmaticReturn = os.system('ruby ps.rb "' + segName + '"')

print("end")

The os.system() call at the end of the script invokes the Ruby application through a terminal command line. For each file that the Python script reads, it sends the contents of the SRC + TTL + CON headers to a file in the cache folder with a .seg extension, then passes the location of that newly created file to the aforementioned Ruby script, which segments the file as described and yields results like the first example in this section.

This procedure of arranging the file in the specified format is essential to the process used to run Open-Sesame on FrameNet 1.7 (step five of that guide), although it is not known whether the procedure is needed for the files to be found by Red Hen's Edge Search Engine 4.

Stanford Core NLP

Stanford Core NLP (SC NLP) is another piece of third-party software; it annotates text in a variety of ways in order to gather information from it (the various ways in which it segments text and surfaces information about content are listed on its webpage). The first step in using SC NLP is downloading the software itself, after which it can be run on any text.

There are several "wrappers" for SC NLP, since its main code is written in Java. For this project the wrapper developed by Lynten was used; the link for the usage and installation of its pip library is https://github.com/Lynten/stanford-corenlp.

In this pipeline, SC NLP serves as the base annotator for text in general; it is the first piece of software in the pipeline that "adds" data rather than merely "rearranging" it for future use. Its main role is to run the annotators tokenize, ssplit, pos and lemma: respectively, these split each sentence into tokens, split the text into sentences (just like Pragmatic Segmenter, but internally), add part-of-speech tags, and lemmatize each sentence into the base forms of its words.

As a first step, the SC NLP output is used for the initial insertion of the corpus into CQPweb, a piece of software described below.

Once everything was installed, a script was assembled to run the pipeline's files through the software. The Python script is shown below:

import os
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP(r'/var/python/stanford-corenlp-full-2018-10-05')

props = {'annotators': 'tokenize, ssplit, pos, lemma, parse',
         'pipelineLanguage': 'en',
         'outputFormat': 'json'}

# Collect the segmented files, skipping the cache directory
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../pragmatic"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]


for file in listOfFiles:
    caminho = file.split('/')
    if caminho[2] != "cache":
        stanfordFileName = os.path.splitext(caminho[2])[0]

        with open(file) as f:
            text = f.read()

            # Annotate the text and store the JSON output with a .stf extension
            annotatedText = nlp.annotate(text, properties=props)

            with open("../stanford/" + stanfordFileName + ".stf", "w+") as fileReady:
                fileReady.write(annotatedText)
                print(stanfordFileName + "\n")

nlp.close()
print("end")
 
The function's output format is currently a JSON file containing the annotations discussed above. As the file is too big to be inserted directly here, a link to the JSON output follows.
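The shape of that JSON can be sketched with a small hand-made sample; the field names follow the CoreNLP JSON format, but the values below are illustrative, not a real pipeline output:

```python
import json

# Hand-made sample mirroring the nesting of the Stanford Core NLP JSON
# output: tokens live inside sentences, and each token carries the
# word, POS tag and lemma that later stages of the pipeline consume.
sample = json.loads("""
{
  "sentences": [
    {"tokens": [
      {"index": 1, "word": "Identity", "originalText": "Identity",
       "lemma": "identity", "pos": "NN"},
      {"index": 2, "word": "in", "originalText": "in",
       "lemma": "in", "pos": "IN"}
    ]}
  ]
}
""")

# Flatten the structure into a list of lemmas, sentence by sentence
lemmas = [t["lemma"] for s in sample["sentences"] for t in s["tokens"]]
```

For this sample, `lemmas` is `["identity", "in"]`.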

Data Modeling

After all of the previous phases, the project needed to be inserted into a corpus concordancer of some kind. Initially, the idea was to insert the corpus of abstracts into Edge Search Engine 4 (as the earlier text on this page implies), since it already has the capability of displaying data by frames; however, that platform mainly serves video and would be a poor fit for the data collected on this page.
To overcome this "corpus format" issue, we decided to use CQPweb (as can be seen in the next sections): this platform provides a wide array of searches, such as by POS or by grammatical construction, and its inputs can be modeled directly for it. With that in mind, the project could grow a side in which data was modeled, based on the tools already presented, to fit CQPweb.
After a first look, it was clear that inserting the collected frames into CQPweb directly would be too hard and time-consuming, so it was decided that in a first phase of the project, with a fixed deadline of July 2019, the corpus would be searchable on CQPweb at least by constructions.

VRT files

Throughout the research process it was discovered that the file type accepted by CQPweb for a searchable corpus is a file with the .vrt extension, which stands for vertical text (i.e., one word per line).
The tutorial on inserting the SaCoCo corpus into CQPweb describes the process of creating such files.
Following that tutorial on a private installation of CQPweb (the installation process is described in the next topic) was helpful mainly because it laid the groundwork for modelling the data, and it was the hands-on process needed to advance in this direction. The "Easy" section of the tutorial was followed, and by doing so the project obtained an example VRT, a blueprint that could be used to process the research's dataset.
A VRT file is a VeRTical XML file. It follows a very definite structure: each text (in the case of this research, each abstract) is surrounded by the <text> tag, which marks the beginning and the end of each file in the corpus; inside it, the <p> tag surrounds every paragraph, and the <s> tag surrounds every sentence. Each line of the VRT file that contains corpus content has to be structured in the following way:
searchable_information(A TABULATION SPACE)searchable_information(A TABULATION SPACE)searchable_information...
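A single token line can be sketched as follows (a hypothetical helper for illustration, not part of the pipeline; the columns match the ones used later: word, POS tag, lemma, original text):

```python
# Build one VRT token line: the searchable columns are joined by a
# single TAB character, one token per line.
def vrt_line(word, pos, lemma, original):
    return "\t".join([word, pos, lemma, original])
```

For example, `vrt_line("dark", "JJ", "dark", "dark")` yields the tab-separated line seen in the VRT sample further down.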
Once the format of the file was well defined, it was time to create a script that produces it:
import os
import json
import hashlib
import xml.etree.ElementTree as ET


import dictionaries

# function to indent (pretty-print) the XML tree in place
def indent(elem, level=0):
    i = "\n" + level*" "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i


# Collect the Stanford Core NLP outputs
listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../stanford"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

lines = []

completeVrtString = ""
metadataString = ""


for file in listOfFiles:
    if file.endswith(".stf"):
        path = file.split('/')

        fileNameNoExt = path[-1].split(".")[0]

        publicationYear = path[-1].split("_")[0].split("-")[0]

        # read the matching header file to recover the metadata
        with open("../headers/"+publicationYear+"/"+fileNameNoExt+".txt", "r") as f:
            headerLines = f.readlines()

        # dealing with the first sets of metadata
        for line in headerLines:
            if line.startswith("TTL|"):
                abstractName = line[4:].replace("&#10", "").replace("\ufeff", "").replace("\n", "")

            if line.startswith("SRC|"):
                journalName = line[4:].rstrip().replace("-", " ").upper().replace("\n", "")
                field = dictionaries.dictionaryFields()[journalName].replace("\n", "")
                discipline = dictionaries.dictionaryDisciplines()[field].replace("\n", "")

        stringIdHash = hashlib.md5(fileNameNoExt.encode()).hexdigest()

        metadataString += stringIdHash + "\t" + abstractName + "\t" + journalName + "\t" + field + "\t" + discipline + "\n"

        print(fileNameNoExt+" "+stringIdHash)

        # inputting the metadata <text> tag for this file
        text = ET.Element('text', _id=stringIdHash, abstract_name=abstractName,
                          jounal_name=journalName, field=field, discipline=discipline)

        # dealing with the Stanford Core NLP structure and files
        with open(file) as json_file:
            data = json.load(json_file)
            # <p> tag
            p = ET.SubElement(text, 'p')
            for sentence in data['sentences']:

                for token in sentence['tokens']:
                    lines.append(token['word'] + "\t" + token['pos'] + "\t" + token['lemma'] + "\t" + token['originalText'])
                # <s> tag, one token per line
                ET.SubElement(p, 's').text = "\n"+"\n".join(lines)+"\n"
                lines = []


        indent(text)

        tree = ET.ElementTree(text)

        # ET.tostring() returns bytes in Python 3, so decode before replacing
        completeVrtString += ET.tostring(text, encoding='utf-8').decode('utf-8').replace("_id", "id")

        tree.write("../vrt/" + fileNameNoExt+".vrt", encoding="utf8", xml_declaration=True, method="xml")

vrt_file = open("../vrt/completeVrtString.vrt", "w")
vrt_file.write("\n"+completeVrtString)
vrt_file.close()

meta_file = open("../vrt/completeMetaString.meta", "w")
meta_file.write(metadataString)
meta_file.close()
The above script also creates a metadata file for the corpus, which indexes each file together with its metadata; this means that important text information connected with each file can be found on CQPweb, as seen in the next steps.
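One row of that .meta file can be sketched as follows; the title and journal are taken from the first example on this page, while the field and discipline values below are hypothetical:

```python
import hashlib

# One metadata row: MD5 hash of the file name, then title, journal,
# field and discipline, TAB-separated.
fileNameNoExt = "2015-oct_JA_10-1017_S0267190514000178_annual-review-of-applied-linguistics_matsuda_paul"
stringIdHash = hashlib.md5(fileNameNoExt.encode()).hexdigest()
row = "\t".join([
    stringIdHash,
    "Identity in Written Discourse",
    "ANNUAL REVIEW OF APPLIED LINGUISTICS",
    "LINGUISTICS",              # hypothetical field value
    "SOCIAL & HUMAN SCIENCES",  # hypothetical discipline value
])
```

The 32-character hash in the first column is what ties the row to the `id` attribute of the corresponding `<text>` element.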
The first abstract formatted as VRT can be seen below:
<?xml version='1.0' encoding='utf8'?>
<text _id="73ae83f510225d57ae31968abadf8300" abstract_name="The dark side of customer co-creation:
exploring the consequences of failed co-created services" discipline="SOCIAL & HUMAN SCIENCES" field="BUSINESS" jounal_name="JOURNAL OF THE ACADEMY OF MARKETING SCIENCE">
<p>
<s>
Journal NNP Journal Journal
of IN of of
the DT the the
academy NN academy academy
of IN of of
marketing NN marketing marketing
science NN science science
. . . .
</s><s>
The DT the The
dark JJ dark dark
side NN side side
of IN of of
customer NN customer customer
co-creation NN co-creation co-creation
: : : :
exploring VBG explore exploring
the DT the the
consequences NNS consequence consequences
of IN of of
failed VBN fail failed
co-created JJ co-created co-created
services NNS service services
. . . .
</s><s>
Whereas IN whereas Whereas
current JJ current current
literature NN literature literature
emphasizes VBZ emphasize emphasizes
the DT the the
positive JJ positive positive
consequences NNS consequence consequences
of IN of of
co-creation NN co-creation co-creation
, , , ,
this DT this this
article NN article article
sheds VBZ shed sheds
light NN light light
on IN on on
potential JJ potential potential
risks NNS risk risks
of IN of of
co-created JJ co-created co-created
services NNS service services
. . . .
</s><s>
Specifically RB specifically Specifically
, , , ,
we PRP we we
examine VBP examine examine
the DT the the
implications NNS implication implications
of IN of of
customer NN customer customer
co-creation NN co-creation co-creation
in IN in in
service NN service service
failure NN failure failure
episodes NNS episode episodes
. . . .
</s><s>
The DT the The
results NNS result results
of IN of of
four CD four four
experimental JJ experimental experimental
studies NNS study studies
show VBP show show
that IN that that
in IN in in
a DT a a
failure NN failure failure
case NN case case
, , , ,
services NNS service services
high JJ high high
on IN on on
co-creation NN co-creation co-creation
generate VBP generate generate
a DT a a
greater JJR greater greater
negative JJ negative negative
disconfirmation NN disconfirmation disconfirmation
with IN with with
the DT the the
expected JJ expected expected
service NN service service
outcome NN outcome outcome
than IN than than
services NNS service services
low JJ low low
on IN on on
co-creation NN co-creation co-creation
. . . .
</s><s>
Moreover RB moreover Moreover
, , , ,
we PRP we we
examine VBP examine examine
the DT the the
effectiveness NN effectiveness effectiveness
of IN of of
different JJ different different
service NN service service
recovery NN recovery recovery
strategies NNS strategy strategies
to TO to to
restore VB restore restore
customer NN customer customer
satisfaction NN satisfaction satisfaction
after IN after after
failed VBD fail failed
co-created JJ co-created co-created
services NNS service services
. . . .
</s><s>
According VBG accord According
to TO to to
our PRP$ we our
results NNS result results
, , , ,
companies NNS company companies
should MD should should
follow VB follow follow
a DT a a
matching NN matching matching
strategy NN strategy strategy
by IN by by
mirroring NN mirroring mirroring
the DT the the
level NN level level
of IN of of
customer NN customer customer
participation NN participation participation
in IN in in
service NN service service
recovery NN recovery recovery
based VBN base based
on IN on on
the DT the the
level NN level level
of IN of of
co-creation NN co-creation co-creation
during IN during during
service NN service service
delivery NN delivery delivery
. . . .
</s><s>
In IN in In
particular JJ particular particular
, , , ,
flawed JJ flawed flawed
co-creation NN co-creation co-creation
promotes VBZ promote promotes
internal JJ internal internal
failure NN failure failure
attribution NN attribution attribution
which WDT which which
in IN in in
turn NN turn turn
enhances VBZ enhance enhances
perceived VBN perceive perceived
guilt NN guilt guilt
. . . .
</s><s>
Our PRP$ we Our
results NNS result results
suggest VBP suggest suggest
that IN that that
in IN in in
such JJ such such
case NN case case
customer NN customer customer
satisfaction NN satisfaction satisfaction
is VBZ be is
best JJS best best
restored VBN restore restored
by IN by by
offering VBG offer offering
co-created JJ co-created co-created
service NN service service
recovery NN recovery recovery
. . . .
</s>
</p>
</text>
Another important thing to highlight about the VRT file is the information the research team inserted into it. In the case above, the following information was extracted from Stanford Core NLP to be inserted into CQPweb:
word, part_of_speech, lemma, original word

Corpus on CQPweb

CQPweb is query processor and Corpus Workbench (Corpus Query Processor hence CQP). More information on the purpose and aims of the tool could be found here.
As this is a documentation on the process regarding the construction of the pipeline, we will restrain from explaining every detail on the platform, extracting just what was important for the creation and maintenance of the pipeline.
CQPweb makes it possible to query a corpus from grammatical patterns. For the purposes of this documentation CQPweb came to fill the gap left by Edge Search Engine 4 which will not be available for the nature of the pipeline.
Once the corpus is inside the platform, it should be possible to query its contents for the targets, frames and arguments highlighted by Open Sesame (or any other frame annotation software), for the part-of-speech (POS) tags produced by Stanford CoreNLP, and for any other word-level annotation in the corpus, in addition to the tools and possibilities already provided by CQPweb itself.
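Once those annotations are installed, CQP-syntax queries can combine them. The pattern below is an illustrative example (not taken from the project's documentation) that matches any adjective immediately followed by any form of the lemma "recovery":

```
[pos="JJ"] [lemma="recovery"]
```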
The presentation of the usage steps will take the following course: CQPweb installation, the requirements for inserting corpora into the platform, and how to query the corpus for frames (narrative patterns) and linguistic constructions (figurative patterns).

CQPweb Installation

To install CQPweb locally, at least a Debian-based Linux distribution is required; in this case, Ubuntu 18.04 was used. This installation occurred over an SSH connection to a virtual machine in the cloud. If any of these concepts are alien to you, do not worry: any machine with the aforementioned Linux distribution will be able to follow our installation steps.

The basic guidelines for installing the software are available here.
We focus here on the problems that arose from following the guide.

Initially, the manual is very objective in pointing out that the following software is needed to install CQPweb:
  • Apache or some other webserver
  • MySQL (v5.0 at minimum, and preferably v5.7 or higher)
  • PHP (v5.3.0 at minimum, and preferably v5.6 or v7.0+)
  • Perl (v5.8 at minimum, and preferably v5.20 or higher)
  • Corpus Workbench (see 1.4 and 1.5 for version details)
  • R
  • Standard Unix-style command-line tools: awk, tar, and gzip; either GNU versions, or versions compatible with them.
  • Also, although the referenced installation document does not mention SVN, we recommend it.

All of this software, except for the standard Unix-style command-line tools (which usually ship with the Linux distribution) and Corpus Workbench, can be installed with the following command:

sudo apt-get install apache2 mysql-server php perl libopenblas-base r-base subversion

For the installation of Corpus Workbench, we recommend the instructions available here.

After the preceding software is installed, it is time to install CQPweb per se. To begin this process, follow the instructions below, provided by Peter Uhrig, one of the mentors of the pipeline:

#creation of basic CQPweb directory tree
cd /data
sudo mkdir corpora
cd corpora
sudo mkdir cqpweb
cd cqpweb
sudo mkdir upload
sudo mkdir tmp
sudo mkdir corpora
sudo mkdir registry

#configuring apparmor
cd /data
sudo chmod -R a+w corpora 
sudo nano /etc/apparmor.d/usr.sbin.mysqld

# ADD   /data/corpora/cqpweb/** rw, BEFORE THE LAST CLOSING BRACE

sudo service apparmor restart

#configuring apache2
cd /etc/apache2/conf-available
sudo nano security.conf

# ADD THIS IN THE END OF FILE:

<Directory /var/www/html/CQPweb/bin>
    
deny from all
</Directory>
<Directory /var/www/html/CQPweb/lib>
    
deny from all
</Directory>

sudo service apache2 reload

#creating databases
mysql -u root -p
create database cqpweb_db default charset utf8;
create user cqpweb identified by 'supersecret';
grant all on cqpweb_db.* to cqpweb;
grant file on *.* to cqpweb;
exit;

cd /var/www/html/
sudo svn co http://svn.code.sf.net/p/cwb/code/gui/cqpweb/branches/3.2-latest/ CQPweb

cd /var/www/html/CQPweb/bin
sudo php autoconfig.php
    
#data input inside autoconfig
    
user: cqpwebadmin
    
/data/corpora/cqpweb/corpora
    
/data/corpora/cqpweb/registry
    
/data/corpora/cqpweb/tmp
    
/data/corpora/cqpweb/upload
    
cqpweb
    
supersecret
    
cqpweb_db
    
localhost

sudo php autosetup.php

sudo nano config.inc.php
#ADD THIS:
$path_to_cwb = '/usr/local/cwb-3.4.14/bin';

cd /var/www/html
sudo chown -R www-data CQPweb/

After this process, it might be possible to access the system through a web browser in the URL:

http://yourserver(localhost)/CQPweb

Once the software is installed and running, the splash screen of the system will look something like this:
Splashscreen

Corpus Insertion and Configuration

In order to execute this step, one must first:
  • Have a proper VRT file, formatted as shown above
  • Have an instance of CQPweb up and running normally
  • Have a user with permission to insert corpora into CQPweb

First, log in to the CQPweb system and navigate to the "Admin Control Panel" page. It is the last link in the "Account options" menu, on the left of the screen.
Go to the Admin Control page

After that, in order to insert the corpus into CQPweb, the files containing the corpus, as well as their metadata counterparts, must first be uploaded.
To do that, navigate to the "Corpora" menu and click the link "View Upload Area". This page lists all the files needed to research the corpora; in the case of this study, the VRT files and the metadata files containing the corpus are already uploaded.
After that, select each file separately and upload it with a click of the central button.
Go to the Admin Control page

Once all the corpus files have been uploaded, it is time to install the corpus into CQPweb to make it searchable. The first step is clicking on "Install New Corpora" in your instance of CQPweb, as shown below: Go to the Admin Control page

There are several things one must fill in here, and we will be as descriptive as possible about how to properly fill in the form to install the corpus.
  • Fill the "Specify the corpus “name”" field, this will be the name of the Corpus to the MySQL database, and it has some rules, just follow the procedure enforced by CQP and you should be fine.
  • Fill the "Enter the full descriptive name of the corpus" with a short descriptive name of the corpus
  • After filling the names, it is time to choose which VRT files will be part of your corpus, choose as many VRTs as it is necessary for the entire corpus.
  • After filling the chosen files it is time to choose the annotation of the corpus, the first part of this form is concerning the XML TAGS that classify the corpus, and you must fill them according to the fields specified in the XML. Below there is a picture of the Screen in which one can see the form filled with the information necessary to insert the corpus described on this page.
  • Finally, there is the Word Annotation form, in which one must be descriptive about the annotations that occur word by word in the .VRT file.
  • After filling all these fields we recommend selecting one of the many standard CSS files to styling the corpus.
  • At last, to finish the installation process, one must click the link "Install corpus" with settings above to finish this part of the installation.

Go to the Admin Control page

After clicking the button, the following screen should appear: Go to the Admin Control page

To proceed, click the link to fill the metadata from a file, choose the previously uploaded metadata file on the screen, and fill in the metadata fields as shown in the image below. After that, click the "Install the metadata table using the settings above" button to finish the corpus metadata configuration. Go to the Admin Control page
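As a hedged sketch of how such a metadata file might be prepared: CQPweb expects a plain tab-delimited file with one line per text, the text id in the first column matching the id attribute of the corresponding <text> element in the VRT. The field names and values below are illustrative assumptions, not the project's actual categories.

```python
import csv

# Illustrative records; each "id" must match <text id="..."> in the VRT files.
abstracts = [
    {"id": "JA_0001", "field": "Human Sciences",
     "discipline": "Linguistics", "journal": "Applied Linguistics"},
    {"id": "JA_0002", "field": "Hard Sciences",
     "discipline": "Physics", "journal": "Physics Reports"},
]

# Write one tab-delimited line per text: id, then the metadata fields.
with open("abstracts_metadata.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
    for rec in abstracts:
        writer.writerow([rec["id"], rec["field"], rec["discipline"], rec["journal"]])
```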

Finally, the corpus is installed! The final step is the configuration of the lemma annotations, so that they work with the Simple Query parameters automatically provided by CQPweb.

In order to do that, navigate in the corpus view to the "Manage annotation" menu and select the proper tag for every annotation in the corpus, as shown in the image below:
Go to the Admin Control page

Now the corpus is installed and ready for use. To use it, go to the "Standard query" link in the corpus view and make the desired query.

OpenSesame on FrameNet 1.7

Open Sesame is an open-source frame parser or, put more simply, software that highlights the frames of a given text using a certain parsing model. (Frame Semantics is a huge topic that we will not define in depth on this page; for more, follow this link.) Open Sesame uses FrameNet 1.7 as its dictionary to annotate text for frames.

The process of using Open Sesame to annotate frames in the journal abstracts dataset was divided into three main steps: installation of dependencies and downloads of various required software, training, and running on the text files.

Dependency software installations and download of various required software


First, it is important to clarify that all the software previously described and used in this pipeline was initially run locally. This is not the case for Open Sesame, since the training phase requires a lot of processing capacity. This fact forced the research team to search for an alternative, which turned out to be one of Red Hen's servers. Once access to the platform was granted, the following steps were taken in order to use Open Sesame properly. All the installation instructions given here assume a UNIX server.

As can be seen in Open Sesame's README, the software is written in Python and resolves its dependencies using pip, so both needed to be downloaded and installed; detailed instructions for this process are given here. In this specific installation, arrangements were made for the software to be installed just for the logged-in user of the machine. Below are the steps of that installation.

Python was already installed globally, so pip had to be downloaded and installed for just the current user:
$ cd ~/home
$ mkdir get-pip
$ cd get-pip/
$ wget https://bootstrap.pypa.io/get-pip.py
$ python get-pip.py --user
$ pip -V

Once pip and python were installed it was time for the download and installation of Open Sesame's dependencies, again, all locally:
$ cd ~/home/path/to/server's/actual/home/directory/
$ git clone https://github.com/swabhs/open-sesame.git
$ cd open-sesame/

$ #once inside open-sesame directory its time to install its dependencies via pip
$ pip install dynet --user
$ pip install nltk --user

The next step, following the documentation given in the README file on the GitHub page, would be:
python -m nltk.downloader averaged_perceptron_tagger wordnet

However, this command produced an error indicating that a specific NLTK resource was not installed. A small search indicated that, to fix the issue, the resource had to be downloaded from inside the Python shell.
$ python
→ SHELL PYTHON

Python 3.7.2 (default, Dec 29 2018, 21:15:15)
[GCC 8.2.1 20181127] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk 
>>> nltk.download("punkt")
[nltk_data] Downloading package punkt to /home/lucas/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True
>>> Ctrl + z

→ END SHELL PYTHON

After the described installation through the Python shell, the command ran with no further problems:
python -m nltk.downloader averaged_perceptron_tagger wordnet

Once all this software is installed, it's time to download the data the program needs for training. Since the system is a neural network, it uses training models in order to predict the frames in the target data more accurately. First, the data from FrameNet needs to be requested; once the request was approved, the data was sent. For evaluation and organization purposes, all data that will be used by Open Sesame has to be inside the data/ folder. The FrameNet data was downloaded and uncompressed inside the data/ directory.

$ mkdir data
$ cd data/
$ #cookies.txt not directly created once the wget is done
$ touch cookies.txt
$ wget --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1s4SDt_yDhT8qFs1MZJbeFf-XeiNPNnx7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1s4SDt_yDhT8qFs1MZJbeFf-XeiNPNnx7" -O fndata-1.7.tar.gz && rm -rf cookies.txt
$ tar xvfz fndata-1.7.tar.gz fndata-1.7/

Once the FrameNet data is uncompressed inside the data/ directory, it's time to download GloVe. GloVe is a set of pretrained word embeddings, trained on 6B tokens, whose output Open Sesame uses.

$ wget "http://nlp.stanford.edu/data/glove.6B.zip"   
$ unzip glove.6B.zip 

The last thing to do in preparation for training and running the software on real files is the preprocessing command, which has to be run at the root of the project:
$ cd ..
$ python -m sesame.preprocess

Open Sesame's developer describes the preprocessing script as follows: "The above script writes the train, dev and test files in the required format into the data/neural/fn1.7/ directory. A large fraction of the annotations are either incomplete, or inconsistent. Such annotations are discarded, but logged under preprocess-fn1.7.log, along with the respective error messages."

Training

Once the process described above is done, the training can begin. @swabhs explains the process of training as threefold, with each individual step used for tests later in the process. The main command for training is:
$ python -m sesame.$MODEL --mode train --model_name $MODEL_NAME
 
where $MODEL is the type of training to be performed; the available types are "targetid", "frameid" and "argid". A more specific explanation of each of these functions can be found on the README page as well as in the paper released by the same author.

$MODEL_NAME is the name under which the trained model is stored (under logs/$MODEL_NAME/) and later referenced when predicting at each of the previously described levels ("targetid", "frameid" and "argid").

This is the point where the consumption of time and computational resources became noticeable, and measures had to be taken by moving the processing to Red Hen's server, since the software is training a neural network. It is also worth noting that any of the three stage models can be trained indefinitely; usually the time used for training each model was about a 24-hour period.

Here are the examples of the training made in the pipeline:
$ python -m sesame.targetid --mode train --model_name targetid-01-17
$ python -m sesame.frameid --mode train --model_name frameid-01-18
$ python -m sesame.argid --mode train --model_name argid-01-22

After letting each of these training commands run for approximately 24 hours, the system had prepared the models "targetid-01-17", "frameid-01-18" and "argid-01-22" for prediction, which is the next step.

Predictions

For the predictions, the models previously derived from training are used to predict the frames at each level: target, frame and argument. These models are required for the software to work. It is also necessary that the sample or test file be segmented into one sentence per line (see example 1 under the pragmatic segmenter header on this page). In this example we used the pure output of step 3 on this page.

As for the next python inputs on the command prompt, they go as follows:
python -m sesame.targetid --mode predict --model_name targetid-01-17 --raw_input 2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.seg 

As can be seen, the command is very similar to the training one, since it runs the same script; only the --mode flag changes from "train" to "predict". Also new is the --raw_input flag, which designates the file in which Open Sesame will search for frames. After the program runs, it will output a .conll file at "logs/$MODEL_NAME/predicted-targets.conll", which will be used as raw input for the next step. In this case the next command becomes:
python -m sesame.frameid --mode predict --model_name frameid-01-18 --raw_input logs/targetid-01-17/predicted-targets.conll

This run will output the frame predictions in the file logs/$MODEL_NAME/predicted-frames.conll, which will be used as input for the next step:
$ python -m sesame.argid --mode predict --model_name argid-01-22 --raw_input logs/frameid-01-18/predicted-frames.conll

Finally, the argument predictions file will be at logs/argid-01-22/predicted-args.conll. Here is a preview of the predicted-args.conll file.
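A minimal sketch of how frames might be counted in such a prediction file is shown below; the zero-based column index of the frame label is an assumption that must be adjusted to the actual layout of your .conll files:

```python
from collections import Counter

FRAME_COL = 13  # assumption: zero-based index of the frame-label column

def count_frames(conll_text, frame_col=FRAME_COL):
    """Count frame labels in a tab-separated .conll prediction file,
    skipping short lines and '_' placeholders."""
    counts = Counter()
    for line in conll_text.splitlines():
        cols = line.split("\t")
        if len(cols) > frame_col and cols[frame_col] not in ("", "_"):
            counts[cols[frame_col]] += 1
    return counts

# Synthetic example: two tokens annotated with the frame "Age", one unannotated.
sample = "\n".join([
    "\t".join(["_"] * 13 + ["Age", "_"]),
    "\t".join(["_"] * 13 + ["_", "_"]),
    "\t".join(["_"] * 13 + ["Age", "_"]),
])
print(count_frames(sample))  # Counter({'Age': 2})
```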

Semafor on FrameNet 1.5

Semafor is an open-source frame parser or, put more simply, software that highlights the frames of a given text using a certain parsing model. (Frame Semantics is a huge topic that we will not define in depth on this page; for more, follow this link.) Semafor uses FrameNet 1.5 as its dictionary to annotate text for frames.

Unlike Open Sesame, it uses Java software at its core for its predictions, plus at least one model to guide those predictions.

Another important aspect: the version used here is NOT the latest stable release (LTS) of Semafor, as that version had several bugs and was too difficult to run. The author of the LTS version forked that GitHub repository and made some experiments that facilitated both installing and running the software, with no change in the output given the use of the same models.

The process of installing Semafor is basically the same as described in the GitHub link.

As with Open Sesame, due to its consumption of processing resources, it was decided that Semafor would run on Gallina; with that in mind, the next steps presented here refer to downloading and configuring the software on Gallina.

Downloads

To download Semafor, use this command inside your Gallina home:

$ git clone https://github.com/Noahs-ARK/semafor.git

Semafor uses a "models" directory that can be put wherever in the organization of your OS, in the case of Gallina the following path is recommended "/mnt/rds/redhen/gallina/home/YOUR_USER/models", after that it is necessary to give the entire folder permissions to be read and manipulated by other programs, and finally make the download of the model package used by Semafor for predictions.

$ mkdir /mnt/rds/redhen/gallina/home/YOUR_USER/models
$ cd /mnt/rds/redhen/gallina/home/YOUR_USER/models
$ chmod 775 -R ./
$ wget http://www.ark.cs.cmu.edu/SEMAFOR/semafor_malt_model_20121129.tar.gz

Maven is needed to compile the Java code in this project, but it is not available globally on Gallina, so it needs to be downloaded. To download Maven, inside your Gallina home, use the command:

$ wget https://apache.claz.org/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz

The last things necessary to run Semafor are the files used as the software's input. In the case of the corpus described on this page, the following command downloads the necessary part of the software:

$ git clone https://github.com/BrazilianRedHen/prose_hen.git

Configuration

Once all the downloads are done, it is time to start configuring the software in order to run Semafor.

Initially, go into the previously mentioned models folder and decompress the downloaded tar.gz file:

$ cd /mnt/rds/redhen/gallina/home/YOUR_USER/models
$ tar -zxvf semafor_malt_model_20121129.tar.gz

The Maven software package also needs to be unpacked into the Gallina home of the user.

$ cd /mnt/rds/redhen/gallina/home/YOUR_USER/
$ tar -zxvf apache-maven-3.6.3-bin.tar.gz

In order to run Maven on Gallina, some environment variables need to be set for your user, since "sudo" is not allowed. To do that, the ".bashrc" file needs to be created with the following code as its content:

export M2_HOME=/mnt/rds/redhen/gallina/home/abc123/apache-maven-3.6.3
export M2=$M2_HOME/bin
export MAVEN_OPTS="-Xms1024m -Xmx4096m -XX:PermSize=1024m"
export PATH=$M2:$PATH

After the file is saved, it needs to be sourced for the changes to take effect:

$ source .bashrc

Just to verify that the Maven installation completed without failures, execute the following command; if the output references the Maven version, the Java version, and the installation location, everything worked properly.

$ mvn -version

The next step is to configure Semafor. This process consists of setting environment variables that will be used only by Semafor instances for your user. To set these variables, navigate to the Semafor folder and, in the "bin" directory, edit the "config.sh" file to have the following properties:

#!/bin/sh
######################## ENVIRONMENT VARIABLES ###############################
######### change the following according to your own local setup #############


# assumes this script (config.sh) lives in "${BASE_DIR}/semafor/bin/"
export BASE_DIR="/mnt/rds/redhen/gallina/home/abc123"

# absolute path to the directory
# where you decompressed SEMAFOR.
export SEMAFOR_HOME="${BASE_DIR}/semafor"

export CLASSPATH=".:${SEMAFOR_HOME}/target/Semafor-3.0-alpha-04.jar"

# Change the following to the bin directory of your $JAVA_HOME
export JAVA_HOME_BIN="/usr/lib/jvm/java-1.8.0-openjdk/bin"

# Change the following to the directory where you decompressed
# the models for SEMAFOR 2.0.
export MALT_MODEL_DIR="${BASE_DIR}/models/semafor_malt_model_20121129"




######################## END ENVIRONMENT VARIABLES #########################

echo "Environment variables:"
echo "SEMAFOR_HOME=${SEMAFOR_HOME}"
echo "CLASSPATH=${CLASSPATH}"
echo "JAVA_HOME_BIN=${JAVA_HOME_BIN}"
echo "MALT_MODEL_DIR=${MALT_MODEL_DIR}"

For Semafor to run properly, the config.sh file needs to be run:

$ ./config.sh

Running

When all the above-mentioned configurations are done, it is time to compile the software.

$ cd /mnt/rds/redhen/gallina/home/YOUR_USER/semafor/
$ mvn package

After the system finishes compiling, just run the "runSemafor.sh" file inside bin with the following parameters: the absolute path to the input file, the absolute path to the output file (it will try to create the file, so if permissions are not set properly it will not write the file correctly), and finally the number of threads used to run Semafor for that file.

$ ./runSemafor.sh /mnt/rds/redhen/gallina/home/YOUR_USER/prose_hen/pragmatic/2015-apr-30_JA_10-1016_j-physrep-2015-02-003_physics-reports-review-section-of-physics-letters_miransky_vladimir.seg /mnt/rds/redhen/gallina/home/YOUR_USER/prose_hen/semafor_output/2015-apr-30_JA_10-1016_j-physrep-2015-02-003_physics-reports-review-section-of-physics-letters_miransky_vladimir.sem 2

One important note on the input file: it must be segmented by sentences, with each sentence occupying one line. A software that does that (the pragmatic segmenter) was already presented in this documentation; that is why, in the example above, the input file is taken from the "pragmatic" folder of the prose_hen project.
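The pipeline uses the pragmatic segmenter for this step; purely as an illustration of the required one-sentence-per-line format, a naive Python splitter (which does not handle abbreviations or other edge cases the real segmenter covers) might look like:

```python
import re

def naive_segment(text):
    """Very naive sentence splitter: breaks on ., ! or ? followed by
    whitespace. For real data, use the pragmatic segmenter instead."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

for sentence in naive_segment("Frames matter. We count them here."):
    print(sentence)
```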

Here is a link of the output of Semafor for the example run.

As a next step on the Semafor front, there is currently an effort to automate the running process for the entire corpus through a Python script.
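Such automation could be sketched as below; this is an assumption-laden illustration, taking for granted that the .seg inputs live in one directory and that runSemafor.sh accepts the three parameters described above (input path, output path, thread count):

```python
import pathlib
import subprocess

def semafor_commands(in_dir, out_dir, threads=2):
    """Build one runSemafor.sh invocation per .seg file, skipping
    abstracts whose .sem output already exists."""
    commands = []
    for seg in sorted(pathlib.Path(in_dir).glob("*.seg")):
        out_file = pathlib.Path(out_dir) / (seg.stem + ".sem")
        if not out_file.exists():
            commands.append(
                ["./runSemafor.sh", str(seg), str(out_file), str(threads)])
    return commands

# Hypothetical usage from the semafor/bin directory:
# for cmd in semafor_commands("prose_hen/pragmatic", "prose_hen/semafor_output"):
#     subprocess.run(cmd, check=True)
```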

Data Analysis

The data analysis part of the research is the one in which the linguist tries to directly derive meaning and conclusions from the quantitative data produced by one or more software tools included in the pipeline. For this reason, this section is divided according to which information was derived from which platform.

Open Sesame ConLL files:


The data analysis based on Open Sesame has its roots in the ConLL files delivered by the software. These files provide a myriad of information to the research, mainly along the axes of frame count, target count (the lexical unit that evokes the frame, which can be counted by frame or by itself), and argument count (an argument evoked by a specific frame, which can be counted by frame or simply by counting the arguments).

These findings with respect to the ConLL files can be found here, in the excel folder of the statistics part of the GitHub project. Initially these findings were computed by abstract; by field of research, such as Biology, Hard Sciences or Human Sciences; by discipline, such as Chemistry or Health; by journal, such as Academic Medicine or Advanced Materials; and finally over all the files together in a single count.

After analyzing the above-mentioned statistical files, which provide an overview of the frames encountered by Open Sesame, the research team decided not to treat Open Sesame as a trustworthy software for frame annotation. One of the main points that guided this decision was that the most frequent frame in the Open Sesame output is the frame Age, mainly evoked by the lexical unit "of"; however, many instances of the word "of" in the dataset, when checked by human evaluation, would not point to the frame Age.

With that in mind, the team decided on a different approach: running another frame annotation software, Semafor, using FrameNet 1.5.

Semafor in FrameNet 1.5

This part of the data analysis starts from a point where we have Semafor's output as a JSON file for each abstract of the corpus. The resulting JSON from the frame analysis has the structure shown below:

Captura-de-tela-de-2020-04-15-21-58-36

The file is divided by line: each line in the JSON corresponds to a line of text in the abstract. Each line holds a JSON object with the attribute "frames", containing the frames in the sentence marked up by this line.

As a sentence can contain more than one frame, the attribute "frames" contains an array; each position in the array is an identified frame, whose information comes in two attributes, "target" and "annotationSets".

  • "target" holds general information about the frame: its designation and also the lexical unit that evoked the frame and its position (counting spaces); this positional information can be looked up in the attributes "start", "end" and "text".
  • "annotationSets" holds information about the result of Semafor's analysis: attributes like "rank" and "score", which measure the confidence Semafor assigns to the annotation. The "frameElements" attribute is also included here; it designates the frame elements identified for the frame.
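Following the structure described above, a hedged sketch of pulling (frame, lexical unit) pairs out of one line of the output might look like this; it assumes the frame name sits under target["name"] and the evoking text under target["spans"][0]["text"], which should be adjusted if your files nest these attributes differently:

```python
import json

def frames_in_line(json_line):
    """Return (frame name, evoking text) pairs for one line of Semafor output."""
    obj = json.loads(json_line)
    pairs = []
    for frame in obj.get("frames", []):
        target = frame["target"]
        pairs.append((target["name"], target["spans"][0]["text"]))
    return pairs

# Synthetic one-frame line, shaped like the structure described above.
sample = ('{"frames": [{"target": {"name": "Age", '
          '"spans": [{"start": 3, "end": 4, "text": "of"}]}, '
          '"annotationSets": []}]}')
print(frames_in_line(sample))  # [('Age', 'of')]
```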


With this information in mind, the research in this part has two main purposes:
  • 1. Find the occurrence of frames in the corpus, for all abstracts, per field, per discipline and per journal.
  • 2. Find the occurrence of Lexical Units that evoke the above found frames in the corpus, for all abstracts, per field, per discipline, per journal and also for each frame (considering frame subdivisions).


In order to evaluate such objects, a few different tools were used; they are described below for a better understanding of the coding process.

Tools

The two main tools used in the data analysis process are the Python language, as in the rest of the pipeline, and the Jupyter Notebook environment. This technology allows us to have a practical environment to generate the statistical information, with the concept of "live code": the code can run from a point where the data has already been treated, without needing to run all the code every time a new statistical visualization is needed.

Inside these Jupyter notebooks, Python code is run normally; all files and their respective functions and features can be viewed here.


Results

The code written in this section of the research mainly aims at the two specific statistical purposes already mentioned above: frame count and lexical unit count.

First there is a division by all the aforementioned categories (all abstracts, per field, per discipline and per journal); then inside each directory there are two directories, one for frames and another for lexical units. The results can be found here.

In terms of generating artifacts to represent the results, three main types of artifacts were generated by the analysis.

These findings are today guiding our next steps in the research, mainly in terms of producing more statistical analysis. The expansion will come on the front of ranking which lexical units evoked a single frame.

Although all the processes described in this section concern the data analysis extracted from the Semafor frame annotation software, the actual installation process of Semafor is not yet fully documented; currently there is an effort to document the platform's installation and running processes.
Constraints notes:
Here there is one big constraint that must be mentioned. In order for all the statistical data to be extracted from the corpus, all the data had to be manipulated through Python scripts, from the retrieval of the Semafor data to the actual calculations behind all these graphs, which are based only on counting occurrences of frames and lexical units in the corpus.
We intend, solely for the purpose of analyzing the data more thoroughly and less sparsely, to insert this data into a database of some sort and query the database to extract the spreadsheets that would generate the above-mentioned graphs.

Projects using the Pipeline


As the development and construction of this pipeline is part of the Red Hen Lab, it has a collaborative nature. In this context it is natural that several people would use the work already finished to achieve different results with different inputs into the system.
Currently, three projects are being developed inside this pipeline.

Narrative & Figurative Patterns in Science Communication: frame and blending constructions


This project lies between cognition and language and between the research and the communication processes. It is developed under three axes:
  • Theoretical: In the light of Cognitive Linguistics, it investigates to what extent narrativity and figurativity account for developing, describing and communicating scientific concepts and processes. In addition, as the audience comprises both scientists in interdisciplinary research settings and novice researchers having to engage with a new jargon, we aim to look into blending constructions for the communication of new/original complex technical concepts and processes.

  • Technological: The Basic Text Pipeline provides the project with tech tools for linguistic analysis. It is being developed for semantic and syntactic automatic tagging of data. High-impact published scientific abstracts are tagged for frames in Semafor, whose statistical output makes up the narrative patterns. The same corpus is tagged for parts of speech in Stanford CoreNLP for CQPweb searches of blending constructions (analogies and metaphors), which make up the figurative patterns.

  • Pedagogical: The Laletec Extension Project is the branch responsible for modelling the linguistic analysis results into pedagogical practices and products, as evidence of how narrative and figurative patterns contribute to the teaching of scientific writing.


The project has been developed by Rosana Ferrareto (IFSP, Brazil) since August 2018, when she was a visiting PhD scholar at Case Western Reserve University under the supervision of Mark Turner, and is now part of the research program aCOMTECe.

WriteFrame


This project investigates how novice researchers frame their research; specifically, if they use personal narratives that differ from the conventional narrative structure of scientific writing. The project is being developed by Hana Gustafsson (NTNU, Norway) in close collaboration with Rosana Ferrareto and Mark Turner. WriteFrame’s text pipeline - which is directly based on the Basic Text Pipeline - has been developed and implemented by Yamen Zaza in close collaboration with Rosana’s technical team.

Future Pipeline Steps


As can be foreseen, this page is a growing record of the construction of the Basic Text Pipeline. As we as a team intend to help and serve as guidance for Rosana's postdoctoral and future research, there are some tasks that are about to happen, and a myriad of other tasks that it would be awesome to integrate into the pipeline. Our intentions for the future are as follows:

Tasks about to be documented/implemented:


  • Semafor installation and data formatting
  • This point is very interesting: as commented above, Open Sesame is one of a myriad of software packages that annotate frames, and currently its main counterpart is Semafor, developed by Noah Smith's lab.
    Obviously, at the end of the process we intend to manipulate the files; the details of this procedure should be documented on this page.

  • Frame Index
  • A new educational system developed for other purposes by the same Brazilian Red Hens!