Basic Text Pipeline


Some "multimodal communications" consist of text files. A scientific article is one example. These files can be processed and tagged to create metadata files that can then be searched and further processed. For example, Red Hen has a project focused on processing and tagging abstracts of scientific articles to make them easier to analyze for image-schematic and narrative structure.
The Basic Text Pipeline is a project within Red Hen Lab.
A Brazilian team from IFSP - Rafael Ruggi (student), Lucas Spreng (student), Rubens dos Santos Junior (student), and Professor Gustavo Aurelio Prieto - mentored by Professor Rosana Ferrareto - is currently helping to develop this pipeline.


Would you like to assist? If so, write to: and we will try to connect you with a mentor.

Currently we have the following picture as a guideline for the development of the pipeline as a whole.

Constraint notes:
During the construction of the pipeline, the developers and linguists encountered several problems. All problems found during the construction period will be noted in this format. Just a heads up - nothing to report yet!

File Acquisition

This can be accomplished in a myriad of ways. In the current state of the pipeline, these files are journal abstracts gathered manually from high-impact journals on the Web of Science platform ( 
The dataset currently being worked on in the Red Hen Basic Text Pipeline is a corpus of 1,000 abstracts gathered by cognitive linguist and language professor Rosana Ferrareto, a faculty member at IFSP and postdoctoral visiting researcher at CWRU, in the CogSci Department and in the Red Hen Lab.
The process of abstract acquisition can be improved; one possibility is to create web robots that perform the steps required to download the abstracts and then upload them to Gallina.
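Such a robot could be sketched as follows. This is only a hedged outline: the function name, URL handling, and output naming are placeholders, and the real Web of Science platform requires authenticated access, so treat this as an illustration of the idea rather than a working scraper.

```python
# Hypothetical sketch of the "web robot" idea: given a list of record URLs,
# fetch each one and save it under ../raw/. The URL list and the numbered
# output names are placeholders; Web of Science itself requires
# authenticated, institution-level access, so this is only an outline.
import os
import urllib.request

def fetch_abstracts(urls, out_dir="../raw"):
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(urls):
        with urllib.request.urlopen(url) as resp:
            data = resp.read()
        # one numbered file per downloaded record
        with open(os.path.join(out_dir, "record_%04d.txt" % i), "wb") as out:
            out.write(data)
```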
Here's an example of a raw, manually collected file:

FN Clarivate Analytics Web of Science
VR 1.0
AU Matsuda, PK
AF Matsuda, Paul Kei
TI Identity in Written Discourse
AB This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.
SN 0267-1905
EI 1471-6356
PY 2015
VL 35
BP 140
EP 159
DI 10.1017/S0267190514000178
UT WOS:000351470600008


File normalization and conformity

This section specifies the rules that the names and content of files should have in order to be searchable by Edge Search Engine 4 provided by Red Hen (  It is also worth pointing out that these rules have been established as in conformity to the Journal Abstracts to be searchable for the purpose of this specific research.
If you wish to contribute to the pipeline with different types of files please send an e-mail to the already specified e-mail

Name specifications

Example filename:
1. COMPLETE-DATE is not fully numeric because of how dates are given in the source files: an accurate publication date is not provided, but the magazine date is, usually as a range. So a notation such as 2015-apr-may appears as COMPLETE-DATE in several files. Since this field is not currently used in the pipeline, it is best to leave the date as originally found.
2. DOI is a number that identifies a file; see more at:
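For illustration, a DOI can be loosely validated with a regular expression. The pattern below is an assumption - a simplified sketch, not the official DOI grammar:

```python
import re

# A loose, illustrative DOI pattern (an assumption, not the official DOI
# grammar): "10.", a 4-9 digit registrant code, "/", then a suffix.
doi_pattern = re.compile(r"^10\.\d{4,9}/\S+$")

# The DOI from the sample record above:
match = doi_pattern.match("10.1017/S0267190514000178")
```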

A Python script was written to sweep through all files, renaming each one into the aforementioned format:

# -*- coding: UTF-8 -*-
import os

listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../raw"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

for file in listOfFiles:
    if file.endswith(".txt"):
        with open(file) as f:
            lines = [line.rstrip('\n') for line in f]
            date = ""
            ano = ""
            fonte = ""
            autorLast = ""
            autorFirst = ""
            doi = ""

            for l in lines:
                l = l.replace("\ufeff", "")

                if l.startswith("PD "):
                    date = l.replace("PD ", "")
                    date = date.replace(" ", "-")
                    date = date.lower()

                if l.startswith("PY "):
                    ano = l.replace("PY ", "")

                if l.startswith("DI "):
                    doi = l.replace("DI ", "")

                if l.startswith("SO "):
                    journal = l.replace("SO ", "")
                    fonte += journal.replace(" ", "-").lower()

                if l.startswith("AF "):
                    autorComplete = l.replace("AF ", "")
                    autorLast, sep, autorFirstComplete = autorComplete.partition(', ')
                    autorLast = autorLast.replace(" ", "-").lower()

                    # an abbreviated name is the first letter of the name
                    # plus a period, e.g. "D.", hence length 2
                    aux = 2
                    for autorName in autorFirstComplete.split(" "):
                        if len(autorName) > aux:  # keep the first non-abbreviated name
                            autorFirst = autorName
                            break
                    autorFirst = autorFirst.lower()

            # start assembling the file name
            stringFileName = str(ano)

            if date != "":
                stringFileName += "-" + str(date) + "00"

            # separating date and DOI, adding the annotation name
            stringFileName += "_JA_"
            if doi != "":
                stringFileName += doi
                stringFileName += "00-0000_0000000000000000"

            # separating DOI and file source
            stringFileName += "_" + str(fonte)

            # adding first author name
            stringFileName += "_" + str(autorLast)
            if autorFirst != "":
                stringFileName += "_" + autorFirst

            # slashes become underscores, dots become dashes
            stringFileName = stringFileName.replace("/", "_")
            stringFileName = stringFileName.replace(".", "-")

            # adding the extension
            stringFileName += ".txt"

            # make sure the per-year folder exists before writing
            os.makedirs("../abstracts/" + ano, exist_ok=True)
            with open("../abstracts/" + ano + "/" + stringFileName, "w+") as fileReady:
                for lineInFile in lines:
                    fileReady.write(lineInFile + "\n")

Constraint Notes:
Before arriving at the version of the script given above, we went through at least three versions; the history can be accessed here.

File header specifications:

Expected file header format:

Example file header format:
COL|Journal Abstracts, Red Hen Lab
TTL|Identity in Written Discourse
CON|This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.

The following Python script was written to conform the files to the norms above:

import os
import uuid
import re

listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../abstracts"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

for file in listOfFiles:
    if file.endswith(".txt"):
        path = file.split('/')

        completeDate = path[-1].split("_")[0]

        completePath = path[-2] + "/" + path[-1]
        completePath = completePath.replace("..", ".")
        completePath = completePath.replace(" ", "")

        line1 = "TOP|" + completeDate + "|" + completePath + '\n'
        line2 = "COL|Journal Abstracts, Red Hen Lab" + '\n'
        line3 = "UID|" + uuid.uuid4().hex + '\n'
        line5 = "CMT|" + '\n'
        line6 = "CC1|ENG" + '\n'
        line9 = "END|" + completeDate + "|" + completePath + '\n'

        with open(file) as f:
            lines = [line.rstrip('\n') for line in f]
            content = "\n".join(lines)

            # search for the title (may span multiple lines)
            r = re.search("TI ((.*?\n)+)SO", content)
            if r:
                line7 = "TTL|" + r.group(1).replace("\n", " ").replace("    ", " ") + "\n"

            # search for the abstract (may span multiple lines)
            r = re.search("AB ([\w\W]*?)(?=\n[A-Z]{2}\s)", content)
            if r:
                line8 = "CON|" + r.group(1).replace("\n", " ").replace("    ", " ") + "\n"

            for l in lines:
                l = l.replace("\ufeff", "")

                if l.startswith("SO "):
                    line4 = "SRC|" + l[2:].strip() + '\n'

            stringFinal = line1 + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9

            outPath = "../headers/" + completePath
            # make sure the per-year folder exists before writing
            os.makedirs(os.path.dirname(outPath), exist_ok=True)
            with open(outPath, "w+") as fileReady:
                fileReady.write(stringFinal)

The above Python script was preceded by a PHP version of the same program; however, since the machine that stores and runs the code cannot run PHP, it was rewritten in Python (the PHP file can be found on the project's GitHub).

Constraint Notes:
This piece of code actually gave us some trouble. In the initial versions of the code, which can be found here, it is clear that the code accounts for neither a multi-line title nor a multi-line abstract content. The regexes had to be rearranged for the resulting header files to be correct, since we had a number of files with incomplete titles and abstract contents.

Pragmatic Segmenter

Pragmatic Segmenter is third-party software used in the pipeline to organize the file content (the SRC, TTL, and CON headers) into a file with a .seg extension, in which each sentence of the content occupies one line. Using the raw file from the first example, we get the following pragmatically segmented file (copy and paste it into a plain-text reader for a better view of the example):

Identity in Written Discourse
This article provides an overview of theoretical and research issues in the study of writer identity in written discourse.
First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized.
Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized.
The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity.
The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.

Steps to the procedure:

Pragmatic Segmenter is written in a programming language called Ruby, so Ruby first needs to be installed locally on the machine that will run the application. For an installation guide, follow the related link at the top of the page. After Ruby is installed, Pragmatic Segmenter itself needs to be downloaded. For the creation of this pipeline, the version developed by Kevin Dias was used. Here is the link to the step-by-step installation guide.

Once Pragmatic Segmenter was installed, a brief Ruby script was written to sweep through a path, with each file passed as a command-line argument. This arrangement was chosen because the whole pipeline is written in Python, and code conformity was preferred over dealing with the files natively in Ruby. The following script was created based on the example given on the installation instructions page:

require 'pragmatic_segmenter'

fileLocation = ARGV[0]
content = ""
f = File.open("../pragmatic/cache/" + fileLocation, "r")
f.each_line do |line|
  content += line
end
f.close

ps = PragmaticSegmenter::Segmenter.new(text: content)

segments = ps.segment

stringFinal = ""
segments.each do |seg|
  stringFinal += seg + "\n"
end

f = File.open("../pragmatic/" + fileLocation, "w+")
f.write(stringFinal)
f.close

This Ruby script reads a file from the cache directory and segments its contents sentence by sentence. It then records the results to a file that has the same name as the argument and the .seg extension. To put the files containing just the contents of the SRC, TTL, and CON headers into the cache directory and to call the Ruby script, the following Python script was created:

import os

listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../headers"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

for file in listOfFiles:
    if file.endswith(".txt"):
        path = file.split('/')[-1]
        path = path.replace("..", ".")

        sendToPragmatic = ''

        with open(file) as f:
            lines = [line.rstrip('\n') for line in f]

            for l in lines:
                l = l.replace("\ufeff", "")

                if l.startswith("SRC|"):
                    sendToPragmatic = l[4:].capitalize().strip() + '.' + '\n'

                if l.startswith("TTL|"):
                    sendToPragmatic += l[4:].strip() + '.' + '\n'

                if l.startswith("CON|"):
                    sendToPragmatic += " "
                    sendToPragmatic += l[4:]

        with open("../pragmatic/cache/" + path[:-3] + 'seg', "w+") as fileReady:
            fileReady.write(sendToPragmatic)

        pragmaticReturn = os.system('ruby ps.rb "' + path[:-3] + 'seg' + '"')

The os.system() call at the end of the script runs the Ruby application through a terminal command line. For each file that the Python script reads, it sends the contents of the SRC + TTL + CON headers to a file in the cache folder with a .seg extension, and then passes the location of that newly created file to the aforementioned Ruby script, which segments it as described and produces results like the first example in this section.
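A design note: os.system() builds the command by string concatenation, which can break if a filename contains spaces or quotes. A hedged alternative sketch using the standard library's subprocess module (not what the pipeline currently uses) would be:

```python
import subprocess

def run_segmenter(command, seg_name):
    # Pass arguments as a list so the shell never re-parses the filename;
    # in this pipeline, "command" would be ["ruby", "ps.rb"].
    completed = subprocess.run(command + [seg_name])
    return completed.returncode
```

Calling run_segmenter(["ruby", "ps.rb"], path[:-3] + 'seg') would then be equivalent to the os.system() line above, minus the quoting issues.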

Arranging the files in this format is essential for the process used to run OpenSesame on FrameNet 1.7 (step five of that guide), although it is not known whether this step is required for the files to be found by Edge Search Engine 4, provided by Red Hen.

Stanford Core NLP

Stanford Core NLP (SC NLP) is another third-party software package that marks up text in a variety of ways in order to extract information from it (the various ways in which it segments text and surfaces information about its content can be found on its webpage). The first step in using SC NLP is downloading the software itself so that it can be run on any text.

There are several "wrappers" for SC NLP, since its main code is written in Java. For this project the wrapper developed by Lynten was used; the link for the usage and installation of its pip library is

In this pipeline, SC NLP is used as the base annotator for the text in general. It is the first piece of software in the pipeline that "adds" data rather than just "rearranging" data for future use. Its main role is to apply the annotators tokenize, ssplit, pos, and lemma, which respectively split the sentence into tokens, split the text into sentences (just as Pragmatic Segmenter does, but internally), add part-of-speech (POS) tags, and lemmatize the sentence into the base forms of its words.

As a first step, the SC NLP output is used for the base insertion of the corpus into CQPweb, a platform described below.

Once everything was installed, a script was assembled to run the pipeline files through the software. The Python script is shown below:

import os
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP(r'/var/python/stanford-corenlp-full-2018-10-05')

props = {'annotators': 'tokenize, ssplit, pos, lemma, parse', 'pipelineLanguage': 'en', 'outputFormat': 'json'}

listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../pragmatic"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

for file in listOfFiles:
    caminho = file.split('/')
    if caminho[2] != "cache":
        stanfordFileName = os.path.splitext(caminho[2])[0]

        with open(file) as f:

            text = f.read()

            annotatedText = nlp.annotate(text, properties=props)

            with open("../stanford/" + stanfordFileName + ".stf", "w+") as fileReady:
                fileReady.write(annotatedText)
                print(stanfordFileName + "\n")

print("end")
The current output format of the function is a JSON file containing the annotations discussed above. As the file is too big to be inserted directly here, here is the link to the JSON output.
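To give a feel for that structure without the full file, here is a small hand-written stand-in. The shape is assumed from the annotators used above; the field names word, pos, lemma, and originalText are the ones consumed later by the VRT script:

```python
import json

# A minimal, hand-written stand-in for the CoreNLP JSON output
# (assumed shape; field names match those used later in the pipeline).
sample = '''{"sentences": [{"tokens": [
  {"word": "Identity", "pos": "NN", "lemma": "identity", "originalText": "Identity"},
  {"word": "matters", "pos": "VBZ", "lemma": "matter", "originalText": "matters"}]}]}'''

doc = json.loads(sample)
for sentence in doc["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"], token["lemma"])
```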

Data Modeling

After all of the previous phases, the corpus needed to be inserted into a corpus concordancer of some kind. Initially, the plan was to insert the corpus of abstracts into Edge Search Engine 4, since it already has the capability of displaying data by frames; however, that platform mainly serves video and would fit poorly with the data collected on this page.
To overcome this "corpus format" issue, it was decided to use CQPweb (as can be seen in the next sections). This platform provides a wide array of searches, such as by POS or by grammatical construction, and its inputs can be modeled directly for it. With that in mind, the project grew a branch in which data was modeled, using the tools already presented, to fit CQPweb.
At first glance it was clear that inserting the collected frames into CQPweb directly would be too hard and time-consuming, so it was decided that in the first phase of the project, with a fixed deadline of July 2019, the corpus would be searchable on CQPweb at least by constructions.

VRT files

During the research process it was discovered that, to be searchable as a corpus, CQPweb accepts files with the .vrt extension, which stands for vertical text (i.e., one word per line).
The tutorial on the insertion of the SaCoCo corpus into CQPweb describes the process of creating such files.
Following that tutorial on a private installation of CQPweb (the installation process is described in the next topic) was helpful mainly because it grounded the research process of modelling the data, and it provided the hands-on experience needed to advance in this direction. The "Easy" section of the tutorial was followed, and by doing so the project obtained an example VRT file, a blueprint that could be used to process the research's data set.
A VRT file (VeRTical XML file) follows a very definite structure: each text (in the case of this research, each abstract) is surrounded by the <text> tag, which marks the beginning and end of each file in the corpus; inside it, the <p> tag surrounds every paragraph of text, and the <s> tag surrounds every sentence. Each line of the VRT file that contains corpus text has to be structured in the following way:
searchable_information(A TABULATION SPACE)searchable_information(A TABULATION SPACE)searchable_information...
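As a hand-made illustration (a fragment written for this page, not generated by the pipeline), a minimal VRT file following these rules, with columns separated by tabs, could look like:

```xml
<text id="example01" abstract_name="An example">
<p>
<s>
Identity	NN	identity	Identity
matters	VBZ	matter	matters
.	.	.	.
</s>
</p>
</text>
```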
Once the format of the file was well defined, it was time to create a script that produces it:
import os
import json
import hashlib
import xml.etree.cElementTree as ET

import dictionaries

# function to indent the XML properly
def indent(elem, level=0):
    i = "\n" + level * " "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + " "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level + 1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

listOfFiles = list()
for (dirpath, dirnames, filenames) in os.walk("../stanford"):
    listOfFiles += [os.path.join(dirpath, file) for file in filenames]

lines = []

completeVrtString = ""
metadataString = ""

for file in listOfFiles:
    if file.endswith(".stf"):
        path = file.split('/')

        fileNameNoExt = path[-1].split(".")[0]

        publicationYear = path[-1].split("_")[0].split("-")[0]

        f = open("../headers/" + publicationYear + "/" + fileNameNoExt + ".txt", "r")

        # dealing with the first sets of metadata
        headerLines = f.readlines()
        for line in headerLines:

            if line.startswith("TTL|"):
                abstractName = line[4:].replace("&#10", "").replace("\ufeff", "").replace("\n", "")

            if line.startswith("SRC|"):
                journalName = line[4:].rstrip().replace("-", " ").upper().replace("\n", "")
                field = dictionaries.dictionaryFields()[journalName].replace("\n", "")
                discipline = dictionaries.dictionaryDisciplines()[field].replace("\n", "")

        stringIdHash = hashlib.md5(fileNameNoExt.encode()).hexdigest()

        metadataString += stringIdHash + "\t" + abstractName + "\t" + journalName + "\t" + field + "\t" + discipline + "\n"

        print(fileNameNoExt + " " + stringIdHash)

        # inputting metadata text tags on the file
        text = ET.Element('text', _id=stringIdHash, abstract_name=abstractName, jounal_name=journalName, field=field, discipline=discipline)

        # dealing with the Stanford CoreNLP structure and files
        with open(file) as json_file:
            data = json.load(json_file)
            # <p> tag
            p = ET.SubElement(text, 'p')
            for sentence in data['sentences']:

                for token in sentence['tokens']:
                    lines.append(token['word'] + "\t" + token['pos'] + "\t" + token['lemma'] + "\t" + token['originalText'])
                # <s> tag
                s = ET.SubElement(p, 's')
                s.text = "\n" + "\n".join(lines) + "\n"
                lines = []

        # pretty-print the tree with the helper defined above
        indent(text)
        tree = ET.ElementTree(text)

        completeVrtString += ET.tostring(text, encoding='unicode').replace("_id", "id")

        tree.write("../vrt/" + fileNameNoExt + ".vrt", encoding="utf8", xml_declaration=True, method="xml")

vrt_file = open("../vrt/completeVrtString.vrt", "w")
vrt_file.write(completeVrtString)
vrt_file.close()

meta_file = open("../vrt/completeMetaString.meta", "w")
meta_file.write(metadataString)
meta_file.close()
The above script also creates a metadata file for the corpus, which indexes each file together with its metadata. This means that important text information connected with each file can be found on CQPweb, as can be seen in the next steps.
The first abstract formatted in VRT can be seen below:
<?xml version='1.0' encoding='utf8'?>
<text _id="73ae83f510225d57ae31968abadf8300" abstract_name="The dark side of customer co-creation: exploring the consequences of failed co-created services" discipline="SOCIAL & HUMAN SCIENCES" field="BUSINESS" jounal_name="JOURNAL OF THE ACADEMY OF MARKETING SCIENCE">
<p>
<s>
Journal NNP Journal Journal
of IN of of
the DT the the
academy NN academy academy
of IN of of
marketing NN marketing marketing
science NN science science
. . . .
</s>
<s>
The DT the The
dark JJ dark dark
side NN side side
of IN of of
customer NN customer customer
co-creation NN co-creation co-creation
: : : :
exploring VBG explore exploring
the DT the the
consequences NNS consequence consequences
of IN of of
failed VBN fail failed
co-created JJ co-created co-created
services NNS service services
. . . .
</s>
<s>
Whereas IN whereas Whereas
current JJ current current
literature NN literature literature
emphasizes VBZ emphasize emphasizes
the DT the the
positive JJ positive positive
consequences NNS consequence consequences
of IN of of
co-creation NN co-creation co-creation
, , , ,
this DT this this
article NN article article
sheds VBZ shed sheds
light NN light light
on IN on on
potential JJ potential potential
risks NNS risk risks
of IN of of
co-created JJ co-created co-created
services NNS service services
. . . .
</s>
<s>
Specifically RB specifically Specifically
, , , ,
we PRP we we
examine VBP examine examine
the DT the the
implications NNS implication implications
of IN of of
customer NN customer customer
co-creation NN co-creation co-creation
in IN in in
service NN service service
failure NN failure failure
episodes NNS episode episodes
. . . .
</s>
<s>
The DT the The
results NNS result results
of IN of of
four CD four four
experimental JJ experimental experimental
studies NNS study studies
show VBP show show
that IN that that
in IN in in
a DT a a
failure NN failure failure
case NN case case
, , , ,
services NNS service services
high JJ high high
on IN on on
co-creation NN co-creation co-creation
generate VBP generate generate
a DT a a
greater JJR greater greater
negative JJ negative negative
disconfirmation NN disconfirmation disconfirmation
with IN with with
the DT the the
expected JJ expected expected
service NN service service
outcome NN outcome outcome
than IN than than
services NNS service services
low JJ low low
on IN on on
co-creation NN co-creation co-creation
. . . .
</s>
<s>
Moreover RB moreover Moreover
, , , ,
we PRP we we
examine VBP examine examine
the DT the the
effectiveness NN effectiveness effectiveness
of IN of of
different JJ different different
service NN service service
recovery NN recovery recovery
strategies NNS strategy strategies
to TO to to
restore VB restore restore
customer NN customer customer
satisfaction NN satisfaction satisfaction
after IN after after
failed VBD fail failed
co-created JJ co-created co-created
services NNS service services
. . . .
</s>
<s>
According VBG accord According
to TO to to
our PRP$ we our
results NNS result results
, , , ,
companies NNS company companies
should MD should should
follow VB follow follow
a DT a a
matching NN matching matching
strategy NN strategy strategy
by IN by by
mirroring NN mirroring mirroring
the DT the the
level NN level level
of IN of of
customer NN customer customer
participation NN participation participation
in IN in in
service NN service service
recovery NN recovery recovery
based VBN base based
on IN on on
the DT the the
level NN level level
of IN of of
co-creation NN co-creation co-creation
during IN during during
service NN service service
delivery NN delivery delivery
. . . .
</s>
<s>
In IN in In
particular JJ particular particular
, , , ,
flawed JJ flawed flawed
co-creation NN co-creation co-creation
promotes VBZ promote promotes
internal JJ internal internal
failure NN failure failure
attribution NN attribution attribution
which WDT which which
in IN in in
turn NN turn turn
enhances VBZ enhance enhances
perceived VBN perceive perceived
guilt NN guilt guilt
. . . .
</s>
<s>
Our PRP$ we Our
results NNS result results
suggest VBP suggest suggest
that IN that that
in IN in in
such JJ such such
case NN case case
customer NN customer customer
satisfaction NN satisfaction satisfaction
is VBZ be is
best JJS best best
restored VBN restore restored
by IN by by
offering VBG offer offering
co-created JJ co-created co-created
service NN service service
recovery NN recovery recovery
. . . .
</s>
</p>
</text>
Another important thing to highlight about the VRT file is the information the research team inserted into it. In the example above, the following information was extracted from Stanford CoreNLP to be inserted into CQPweb:
word, part_of_speech, lemma, original word

Corpus on CQPweb

CQPweb is a query processor and corpus workbench (Corpus Query Processor, hence CQP). More information on the purpose and aims of the tool can be found here.
As this is documentation of the process of constructing the pipeline, we will refrain from explaining every detail of the platform, extracting just what was important for the creation and maintenance of the pipeline.
CQPweb makes it possible to query a corpus by grammatical patterns. For the purposes of this documentation, CQPweb fills the gap left by Edge Search Engine 4, which is not available for a pipeline of this nature.
Once the corpus is inside the platform, it will be possible to query its content for targets, frames, and arguments highlighted by Open Sesame, for the part-of-speech (POS) tags produced by Stanford Core NLP, and for any other annotation at the word level in the corpus, besides the tools and possibilities already provided by CQPweb itself.
The presentation of the usage steps will take the following course: CQPweb installation, what is needed to insert corpora into the platform, and how to query the corpus for frames (narrative patterns) and linguistic constructions (figurative patterns).
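For instance, a query in standard CQP syntax (a sketch; the attribute names pos and lemma depend on how the corpus was indexed from the VRT columns above) could find an adjective followed by the lemma "identity":

```
[pos="JJ"] [lemma="identity"]
```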

CQPweb Installation

To install CQPweb locally, at least a Debian-based Linux distribution is required; in this case, Ubuntu 18.04 was used. This installation was done over an SSH connection to a virtual machine in the cloud. If any of these concepts are alien to you, do not worry: any machine with the aforementioned Linux distribution should be able to follow our installation steps.

The basic guidelines for installing the software are available here.
We focus here on the problems that arose while following the guide.

Initially, the manual is very direct in pointing out that the following software is needed to install CQPweb:
  • Apache or some other webserver
  • MySQL (v5.0 at minimum, and preferably v5.7 or higher)
  • PHP (v5.3.0 at minimum, and preferably v5.6 or v7.0+)
  • Perl (v5.8 at minimum, and preferably v5.20 or higher)
  • Corpus Workbench (see 1.4 and 1.5 for version details)
  • R
  • Standard Unix-style command-line tools: awk, tar, and gzip; either GNU versions, or versions compatible with them.
  • Also, despite there being no mention of SVN in the referenced installation document, we recommend it.

To install all of this software except Corpus Workbench and the standard Unix-style command-line tools (which usually come with the Linux distribution), one may use the following command:

sudo apt-get install apache2 mysql-server php7.0 perl libopenblas-base r-base subversion

For the installation of Corpus Workbench, we recommend the instructions available here.

After the preceding software is installed, it is time to install CQPweb itself. To begin this process, please follow the instructions below, provided by Peter Uhrig, one of the mentors of the pipeline:

#creation of basic CQPweb directory tree
cd /data
sudo mkdir corpora
cd corpora
sudo mkdir cqpweb
cd cqpweb
sudo mkdir upload
sudo mkdir tmp
sudo mkdir corpora
sudo mkdir registry
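The same tree can also be created in a single command with mkdir -p and brace expansion. In this sketch the base path is a variable (our own addition) so the command can be tried anywhere without root privileges; the guide itself uses /data/corpora:

```shell
# One-command sketch of the CQPweb directory tree from the steps above.
# BASE defaults to /data/corpora as in the guide; override it to test
# without sudo, e.g. BASE=$(mktemp -d).
BASE="${BASE:-/data/corpora}"
mkdir -p "$BASE"/cqpweb/{upload,tmp,corpora,registry}
ls "$BASE"/cqpweb   # corpora registry tmp upload
```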

#configuring apparmor
cd /data
sudo chmod -R a+w corpora 
sudo nano /etc/apparmor.d/usr.sbin.mysqld

# ADD   /data/corpora/cqpweb/** rw, BEFORE THE LAST CLOSING BRACE

sudo service apparmor restart

#configuring apache2
cd /etc/apache2/conf-available
sudo nano security.conf


<Directory /var/www/html/CQPweb/bin>
deny from all
</Directory>
<Directory /var/www/html/CQPweb/lib>
deny from all
</Directory>

sudo service apache2 reload

#creating databases
mysql -u root -p
create database cqpweb_db default charset utf8;
create user cqpweb identified by 'supersecret';
grant all on cqpweb_db.* to cqpweb;
grant file on *.* to cqpweb;

cd /var/www/html/
sudo svn co CQPweb

cd /var/www/html/CQPweb/bin
sudo php autoconfig.php
#data input inside autoconfig
user: cqpwebadmin

sudo php autosetup.php

sudo nano
$path_to_cwb = '/usr/local/cwb-3.4.14/bin';

cd /var/www/html
sudo chown -R www-data CQPweb/

After this process, it should be possible to access the system through a web browser at the URL:


OpenSesame on FrameNet 1.7

Open Sesame is an open-source frame parser: in plainer terms, software that highlights the frames of a given text using a trained parsing model. (Frame Semantics is a large topic that we will not explore in depth on this page; for more, follow this link.) Open Sesame uses FrameNet 1.7 as its dictionary to annotate text for frames.

The process of using Open Sesame to annotate frames in the Journal Abstracts dataset was divided into three main steps: installing dependencies and downloading the required software, training, and running the models on the text files.

Dependency software installations and download of various required software

First, it is important to clarify that all the software previously described in this pipeline was initially run locally. This was not possible for Open Sesame, since its training phase requires substantial processing capacity. This forced the research team to look for an alternative, which turned out to be one of Red Hen's servers. Once access to the platform was granted, the following steps were taken in order to use Open Sesame properly. All installation instructions given here assume a UNIX server.

As can be seen on the Open Sesame page, the software is written in Python and resolves its dependencies using pip, so both had to be downloaded and installed; detailed instructions for this process are given here. In this specific installation, arrangements were made so that the software was installed only for the machine's logged-in user. The steps of that installation follow.

Python was already installed globally, so pip had to be downloaded and installed for the current user only:
$ cd ~/home
$ mkdir get-pip
$ cd get-pip/
$ wget
$ python get-pip.py --user
$ pip -V

Once pip and Python were installed, it was time to download and install Open Sesame's dependencies, again all locally:
$ cd ~/home/path/to/server's/actual/home/directory/
$ git clone
$ cd open-sesame/

$ # once inside the open-sesame directory, it's time to install its dependencies via pip
$ pip install dynet --user
$ pip install nltk --user

The next step, following the documentation file on the GitHub page, would be:
python -m nltk.downloader averaged_perceptron_tagger wordnet

However, this command produced an error stating that a specific NLTK resource was not installed. A quick search indicated that the resource had to be downloaded from within the Python shell.
$ python
Python 3.7.2 (default, Dec 29 2018, 21:15:15)
[GCC 8.2.1 20181127] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/lucas/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True
>>> Ctrl + Z


After this installation through the Python shell, the command ran without further problems:
python -m nltk.downloader averaged_perceptron_tagger wordnet

Once all this software is installed, it is time to download the data the program needs for training. Since the system is a neural network, it uses trained models in order to more accurately predict the frames in the target data. First, the FrameNet data must be requested; once the request was approved, the data was sent. For evaluation and organization purposes, all data used by Open Sesame has to be inside the data/ folder, so the FrameNet data was downloaded and uncompressed inside the data/ directory.

$ mkdir data
$ cd data/
$ # cookies.txt is not created automatically by wget, so create it first
$ touch cookies.txt
$ wget --load-cookies cookies.txt "$(wget --quiet --save-cookies cookies.txt --keep-session-cookies --no-check-certificate '' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1s4SDt_yDhT8qFs1MZJbeFf-XeiNPNnx7" -O fndata-1.7.tar.gz && rm -rf cookies.txt
$ tar xvfz fndata-1.7.tar.gz fndata-1.7/
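A quick sanity check (our own addition, not part of the original instructions) that the unpacking worked: the FrameNet 1.7 release ships frame/ and fulltext/ subdirectories, so both should now exist under data/fndata-1.7.

```shell
# Sanity-check sketch: verify the FrameNet release unpacked where
# Open Sesame expects it (assumes the layout of the fndata-1.7 release).
if [ -d data/fndata-1.7/frame ] && [ -d data/fndata-1.7/fulltext ]; then
    echo "FrameNet 1.7 data in place"
else
    echo "FrameNet data missing - check the tar step" >&2
fi
```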

Once the FrameNet data is uncompressed inside the data/ directory, it is time to download GloVe. GloVe is a set of pretrained word embeddings (trained on 6B tokens) that Open Sesame uses.

$ wget ""   
$ unzip 

The last thing to do in preparation for training and running the software on real files is the preprocessing command, which has to be run in the root of the project:
$ cd ..
$ python -m sesame.preprocess

The Open Sesame developer describes the preprocessing script as follows: "The above script writes the train, dev and test files in the required format into the data/neural/fn1.7/ directory. A large fraction of the annotations are either incomplete, or inconsistent. Such annotations are discarded, but logged under preprocess-fn1.7.log, along with the respective error messages."


Once the process described above is done, training can begin. @swabhs describes training as a threefold process, in which each individual step is used for testing later in the process. The main training command is:
$ python -m sesame.$MODEL --mode train --model_name $MODEL_NAME
where $MODEL is the type of training to perform; the available types are "targetid", "frameid" and "argid". A more specific explanation of each of these functions can be found in the README page as well as in the paper released by the same author.

$MODEL_NAME is the name under which the trained model is saved; one model is trained for each of the previously described types ("targetid", "frameid" and "argid") and later used to identify frames at that level.

This is the point where the consumption of time and computational resources became apparent, and measures had to be taken: processing was moved to Red Hen's server, as the software is training a neural network. Note also that any of the three stage models can in principle be trained indefinitely; in practice, each model was trained for about 24 hours.

Here are examples of the training runs made in the pipeline:
$ python -m sesame.targetid --mode train --model_name targetid-01-17
$ python -m sesame.frameid --mode train --model_name frameid-01-18
$ python -m sesame.argid --mode train --model_name argid-01-22

After letting each of these training commands run for approximately 24 hours, the system had prepared the models "targetid-01-17", "frameid-01-18" and "argid-01-22" for prediction, which is the next step.
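Since each run takes around 24 hours over SSH, it helps to detach it from the terminal so the session can be closed safely. A minimal sketch using nohup; the log and PID file names here are our own choices, not part of Open Sesame:

```shell
# Keep a long training run alive after the SSH session closes.
# Model, log and PID file names are illustrative.
nohup python -m sesame.targetid --mode train --model_name targetid-01-17 \
    > targetid-01-17.log 2>&1 &
echo $! > targetid-01-17.pid   # save the PID so the run can be stopped later
```

The run can later be stopped with kill "$(cat targetid-01-17.pid)" and its progress followed with tail -f targetid-01-17.log; a terminal multiplexer such as screen or tmux works just as well.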


For the predictions, the models previously derived from training are used to predict frames at each level: target, frame and argument; these models are required for the software to work. The software also requires the sample or test file to be segmented into individual sentences (see example 1 under the Pragmatic Segmenter header on this page for an example). In this example we used the raw output of step 3 on this page.
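As a reminder of the shape the predictor expects, the input is plain text with exactly one sentence per line. The file name and sentences below are made up for illustration:

```shell
# Build a tiny, illustrative .seg input file: one sentence per line.
cat > sample.seg <<'EOF'
This study examines metaphor in scientific abstracts.
The corpus comprises one thousand journal abstracts.
EOF
wc -l < sample.seg   # 2
```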

The next Python commands on the command prompt go as follows:
python -m sesame.targetid --mode predict --model_name targetid-01-17 --raw_input 2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.seg 

As can be seen, the command is very similar to the training one, since it runs the same script; only the --mode flag changes from "train" to "predict". Note also the --raw_input flag, which designates the file in which Open Sesame will search for frames. After the program runs, it outputs a .conll file at "logs/$MODEL_NAME/predicted-targets.conll", which is used as raw input for the next step. In this case that becomes:
python -m sesame.frameid --mode predict --model_name frameid-01-18 --raw_input logs/targetid-01-17/predicted-targets.conll

This run will output the frame predictions in the file logs/$MODEL_NAME/predicted-frames.conll, which is used as input for the next step:
$ python -m sesame.argid --mode predict --model_name argid-01-22 --raw_input logs/frameid-01-18/predicted-frames.conll

Finally, the argument predictions file will be at logs/argid-01-22/predicted-args.conll. Here is a preview of the predicted-args.conll file.
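The three prediction stages above can be chained for a whole batch of segmented abstracts. This is only a sketch under assumptions: the data/abstracts/ path and the helper function are ours, while the logs/... paths are those produced by the individual commands above.

```shell
# Sketch: run the three-stage prediction over every .seg file in a folder.
run_stage() {   # usage: run_stage <stage> <model_name> <input_file>
    python -m "sesame.$1" --mode predict --model_name "$2" --raw_input "$3"
}

for f in data/abstracts/*.seg; do
    run_stage targetid targetid-01-17 "$f"
    run_stage frameid  frameid-01-18  logs/targetid-01-17/predicted-targets.conll
    run_stage argid    argid-01-22    logs/frameid-01-18/predicted-frames.conll
    # keep the final annotation next to its abstract
    cp logs/argid-01-22/predicted-args.conll "${f%.seg}.conll"
done
```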

Data Analysis

The data analysis part of the research is where meaning and conclusions are derived directly from the quantitative data produced by one or more of the software packages in the pipeline. For this reason, this section is divided according to which information was derived from which platform.

Open Sesame ConLL files:

The data analysis based on Open Sesame has its roots in the CoNLL files the software delivers. These files provide a wealth of information, mainly along three axes: frame counts; target counts (the lexical unit that evokes the frame, countable per frame or on its own); and argument counts (an argument evoked by a specific frame, countable per frame or simply in total).

These findings with respect to the CoNLL files can be found here, in the excel folder of the statistics part of the GitHub project. Initially these counts were taken per abstract; by field of research, such as Biology, Hard Sciences or Human Sciences; by discipline, such as chemistry or health; by journal, such as Academic Medicine or Advanced Materials; and finally over all the files together in a single count.

The file processing started by using the following script: SCRIPT
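Independently of that script, the simplest of these counts (how often each frame occurs) can be sketched with awk, one of the standard tools listed earlier. The column number is an assumption: adjust COL to whichever tab-separated column holds the frame label in the predicted-args.conll layout.

```shell
# Sketch of a per-frame frequency count over a CoNLL-style file.
# COL is assumed; change it to the column that holds the frame label.
COL="${COL:-14}"
awk -F'\t' -v col="$COL" '
    $col != "" && $col != "_" { n[$col]++ }
    END { for (f in n) print n[f], f }
' predicted-args.conll | sort -rn
```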

Future Pipeline Steps

As can be foreseen, this page is a growing record of the construction of the Basic Text Pipeline. As a team, we intend to help and serve as guidance for Rosana's postdoctoral and future research; some tasks are about to happen, and a number of other tasks would be welcome additions to the pipeline. Our intentions for the future are as follows:

Tasks about to be documented/implemented:

  • Describing corpus insertion into CQPweb
  • Currently the pipeline team is focusing its efforts on understanding and researching how to integrate its corpus and all of its data into CQPweb for searching; these steps will be described here.

  • Semafor installation and data formatting
  • This point is very interesting: as noted above, Open Sesame is one of several software packages that annotate frames, and currently its main counterpart is Semafor, developed by Noah Smith's lab.
    At the end of the process we intend to manipulate the files; the details of this procedure will be documented on this page.

  • Systems integration
  • At the end of the above-mentioned tasks, the pipeline team will focus its efforts on producing a file that unites the output of all the frame annotators, plus the Stanford CoreNLP output, into a single file.
    Once this is completed, it will be time to integrate this file into CQPweb so that frames can be searched by system, and to serve any other needs Rosana might have for her research.

  • Data Analysis
  • At the same time as all those tasks, individual tasks monitoring frame frequency across abstracts, and a number of other queries, will be posted here as well.