Basic Text Pipeline

        Some "multimodal communications" consist of text files.  A scientific article is such a piece of data. These files can be processed and tagged so as to create files of metadata that can then be searched and further processed. For example, Red Hen has a project focused on processing and tagging abstracts of scientific articles to make it easier to analyze them for image-schematic and narrative structure.  A Brazilian team from IFSP - students Rafael Ruggi and Lucas Spreng mentored by Professor Rosana Ferrareto -  is currently helping to develop this pipeline. Would you like to assist? If so, write to:
and we will try to connect you with a mentor.


Related Scrolls

Related Links

More Information

Elements of the Pipeline 

  1. File Acquisition

     This can be accomplished in a myriad of ways. In the current state of the research, these files are journal abstracts gathered manually, one by one, from the Web of Science platform (webofknowledge.com). The corpus currently used in the Red Hen Basic Text Pipeline consists of 1000 abstracts gathered by cognitive linguist and language professor Rosana Ferrareto Lourenço Rodrigues, faculty member at IFSP and post-doc visiting researcher at Red Hen Lab. The abstract acquisition process can still be improved, for example by creating web robots that perform the steps required to download the abstracts and upload them onto Gallina.

            Here is an example of a raw, manually collected file:
        
            
    FN Clarivate Analytics Web of Science
    VR 1.0
    PT J
    AU Matsuda, PK
    AF Matsuda, Paul Kei
    TI Identity in Written Discourse
    SO ANNUAL REVIEW OF APPLIED LINGUISTICS
    AB This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.
    SN 0267-1905
    EI 1471-6356
    PY 2015
    VL 35
    BP 140
    EP 159
    DI 10.1017/S0267190514000178
    UT WOS:000351470600008
    ER

    EF
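
    For reference, here is a minimal sketch (an illustration only, not part of the official pipeline) of how such a Web of Science field-tagged record could be parsed into a Python dictionary; the file name savedrecs.txt is hypothetical, and continuation-line handling is simplified:

    def parse_wos_record(path):
        """Parse a single WoS field-tagged record (two-letter tags) into a dict."""
        record = {}
        last_tag = None
        with open(path, encoding="utf-8-sig") as f:  # utf-8-sig strips a possible BOM
            for line in f:
                line = line.rstrip("\n")
                if not line or line in ("ER", "EF"):
                    continue
                tag, sep, value = line.partition(" ")
                if sep and len(tag) == 2 and tag.isupper():
                    record[tag] = value
                    last_tag = tag
                elif last_tag:
                    # Indented lines continue the previous field
                    record[last_tag] += " " + line.strip()
        return record

    record = parse_wos_record("savedrecs.txt")  # hypothetical file name
    print(record.get("TI"), record.get("PY"), record.get("DI"))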

     
  2. File normalization and conformity

     This section specifies the rules that file names and contents must follow in order to be searchable by Edge Search Engine 4, provided by Red Hen (https://sites.google.com/case.edu/techne-public-site/red-hen-edge-search-engine). 
    It is also worth pointing out that these rules were established so that journal abstracts could be searched for the purposes of a specific research project; if you wish to contribute different types of files to the pipeline, please write to the e-mail already specified above, redhenlab@gmail.com.

    Name specifications

    Rules:
    COMPLETE-DATE_DATATYPE_DOI_JOURNAL_AuthorLastName_AuthorFirstName.txt
    Example filename: 
    2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.txt

    Notes:
    1. Zeros in COMPLETE-DATE indicate that such information is not available.
    2. The DOI is an identifier that uniquely specifies a document; see more at: https://www.doi.org/
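
    As an illustration only (the pattern below is inferred from the example filename above, not an official Red Hen specification), a filename can be checked against this convention with a short regular expression:

    import re

    # Hypothetical validator for the naming convention
    # COMPLETE-DATE_DATATYPE_DOI_JOURNAL_AuthorLastName_AuthorFirstName.txt
    # Note: "∕" is U+2215 (division slash), used in place of "/" inside the DOI.
    NAME_PATTERN = re.compile(
        r"^\d{4}-\d{2}-\d{2}"   # COMPLETE-DATE, zeros when the information is unavailable
        r"_[A-Z]+"              # DATATYPE, e.g. JA for journal abstract
        r"_10\.\d{4,9}∕\S+"     # DOI, with "/" replaced by U+2215
        r"_[A-Za-z-]+"          # JOURNAL, words joined by hyphens
        r"_[A-Za-z-]+"          # Author last name
        r"_[A-Za-z-]+\.txt$"    # Author first name(s)
    )

    name = "2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.txt"
    print(bool(NAME_PATTERN.match(name)))  # True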

    Method:
    A Python script was written to sweep through all the files, renaming each one into the format described above:
    import os

    # Collect the paths of every file under the current directory
    listOfFiles = list()
    for (dirpath, dirnames, filenames) in os.walk("."):
        listOfFiles += [os.path.join(dirpath, file) for file in filenames]

    nameOfFiles = list()
    mes = ""    # month
    ano = ""    # year
    fonte = ""  # source / journal
    autor = ""  # author last name
    doi = ""

    for file in listOfFiles:
        if file.endswith(".txt"):
            with open(file) as f:
                # Read the whole file once, stripping trailing newlines
                lines = [line.rstrip('\n') for line in f]
                head, sep, tail = lines[0].partition("FN ")
                fonte = tail.replace(" ", "_").lower()

                for l in lines:
                    # Strip a possible UTF-8 BOM (str.replace returns a new string)
                    l = l.replace("\ufeff", "")
                    if l.startswith("PD "):
                        mes = l.replace("PD ", "")
                        mes = mes.replace(" ", "-")
                        mes = mes.replace("JAN", "01")
                        mes = mes.replace("FEB", "02")
                        mes = mes.replace("MAR", "03")
                        mes = mes.replace("APR", "04")
                        mes = mes.replace("MAY", "05")
                        mes = mes.replace("JUN", "06")
                        mes = mes.replace("JUL", "07")
                        mes = mes.replace("AUG", "08")
                        mes = mes.replace("SEP", "09")
                        mes = mes.replace("OCT", "10")
                        mes = mes.replace("NOV", "11")
                        mes = mes.replace("DEC", "12")

                    if l.startswith("PY "):
                        ano = l.replace("PY ", "")

                    if l.startswith("DI "):
                        doi = l.replace("DI ", "")

                    if l.startswith("SO "):
                        journal = l.replace("SO ", "")
                        fonte += "-"
                        fonte += journal.replace(" ", "_").lower()

                    if l.startswith("AU "):
                        autorL = l.replace("AU ", "")
                        autor, sep, tail = autorL.partition(', ')

                stringFileName = str(ano)

                if mes != "":
                    stringFileName += "-"
                    stringFileName += str(mes)
                else:
                    stringFileName += "00"

                stringFileName += "_"
                stringFileName += str(fonte)

                if doi != "":
                    stringFileName += "-"
                    stringFileName += doi
                else:
                    stringFileName += "00.0000∕0000000000000000"

                stringFileName += "-"
                stringFileName += str(autor)
                stringFileName += ".txt"

                stringFileName = stringFileName.replace("/", "\u2215")

                mes = ""
                ano = ""
                fonte = ""
                autor = ""
                doi = ""

                nameOfFiles.append(stringFileName)

                with open("testes/"+stringFileName, "w+") as fileReady:
                    for lineInFile in f:
                        fileReady.write(lineInFile)

    print(nameOfFiles)

    File Headers specifications:

    Expected file header format:
    TOP|COMPLETEDATE|FILENAME
    COL|PLACE WHERE FILE IS BEING HELD
    UID|UUID IDENTIFICATION NUMBER
    SRC|JOURNAL COMPLETE NAME
    CMT|SPECIFIC COMMENTS ABOUT THE FILE
    CC1|LANGUAGE USED IN FILE
    TTL|JOURNAL TITLE
    CON|ABSTRACT CONTENT
    END|COMPLETEDATE|FILENAME

    Example file header format:
    TOP|2015000012.0000|2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.txt
    COL|Journal Abstracts, Red Hen Lab
    UID|0f8405b384c649ee92de4a45cc1840d0
    SRC|ANNUAL REVIEW OF APPLIED LINGUISTICS
    CMT|
    CC1|ENG
    TTL|Identity in Written Discourse
    CON|This article provides an overview of theoretical and research issues in the study of writer identity in written discourse. First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized. Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized. The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity. The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.
    END|2015000012.0000|2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.txt

    Notes:
    1. Zeros in COMPLETEDATE indicate that such information is not available.
    2. A UUID can be generated by running the command "uuid -n1" in a Linux shell (the script below uses Python's uuid module for the same purpose).

    The following Python script was written to conform the files to the norms above:
    import os
    import uuid

    listOfFiles = list()
    for (dirpath, dirnames, filenames) in os.walk("abstracts"):
        listOfFiles += [os.path.join(dirpath, file) for file in filenames]

    for file in listOfFiles:
        if file.endswith(".txt"):
            # Keep only the file name (the last path component) and fix doubled dots
            caminho = file.split('/')[-1]
            caminho = caminho.replace("..", ".")

            # Build the COMPLETEDATE field from the leading YYYY-MM-DD of the file name
            stringData = caminho[:10]
            completeDate = ''.join(stringData.split('-')) + '12.0000'

            line1 = "TOP|"+completeDate+"|"+caminho+'\n'

            line2 = "COL|Journal Abstracts, Red Hen Lab"+'\n'

            line3 = "UID|"+uuid.uuid4().hex+'\n'

            line5 = "CMT|"+'\n'

            line6 = "CC1|ENG"+'\n'

            line9 = "END|"+completeDate+"|"+caminho+'\n'


            with open(file) as f:
                lines = [line.rstrip('\n') for line in f]

                for l in lines:
                    # Strip a possible UTF-8 BOM (str.replace returns a new string)
                    l = l.replace("\ufeff", "")

                    if l.startswith("TI "):
                        subLine = l[2:]
                        trimmedSubLine = subLine.strip()
                        line7 = "TTL|"+trimmedSubLine+'\n'

                    if l.startswith("SO "):
                        subLine = l[2:]
                        trimmedSubLine = subLine.strip()
                        line4 = "SRC|"+trimmedSubLine+'\n'

                    if l.startswith("AB "):
                        subLine = l[2:]
                        trimmedSubLine = subLine.strip()
                        line8 = "CON|"+trimmedSubLine+'\n'

                stringFinal = line1 + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9
                with open("headers/" + caminho, "w+") as fileReady:
                    fileReady.write(stringFinal)

    It is worth noting that the Python script above was preceded by a PHP version; however, since the machine that stores and runs the code cannot run PHP, the script was rewritten in Python (the PHP file can be found on the
    GitHub for the project). 
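
    Before moving on, here is a quick way to sanity-check the generated headers (a minimal sketch only, assuming the headers/ output directory used by the script above; it is not part of the official pipeline):

    # Check that a generated file contains the nine expected header fields, in order
    EXPECTED = ["TOP", "COL", "UID", "SRC", "CMT", "CC1", "TTL", "CON", "END"]

    def check_header(path):
        with open(path) as f:
            tags = [line.split("|", 1)[0] for line in f if line.strip()]
        return tags == EXPECTED

    print(check_header("headers/2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.txt"))
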
  3. Pragmatic Segmenter

    Pragmatic Segmenter is third-party software used in the pipeline to organize the file content (the TTL and CON headers) into a file with a .seg extension, in which each sentence of the content occupies its own line. Using the raw file from the first example, we get the following pragmatically segmented file (copy and paste it into a plain text reader for a better view of the example):

    Identity in Written Discourse
    This article provides an overview of theoretical and research issues in the study of writer identity in written discourse.
    First, a historical overview explores how identity has been conceived, studied, and taught, followed by a discussion of how writer identity has been conceptualized.
    Next, three major orientations toward writer identity show how the focus of analysis has shifted from the individual to the social conventions and how it has been moving toward an equilibrium, in which the negotiation of individual and social perspectives is recognized.
    The next two sections discuss two of the key developments-identity in academic writing and the assessment of writer identity.
    The article concludes with a brief discussion of the implications and future directions for teaching and researching identity in written discourse.

    Steps to the procedure:

    Pragmatic Segmenter is written in Ruby, so Ruby first needs to be installed locally in order to run the application. For an installation guide, follow the related link at the top of the page. After Ruby is installed, Pragmatic Segmenter itself needs to be downloaded. This pipeline uses the version developed by Kevin Dias; a step-by-step installation guide is available at https://github.com/diasks2/pragmatic_segmenter.

    Once Pragmatic Segmenter was installed, a brief Ruby script was written to process the file whose path is passed to it as a command-line argument. The script was based on the example given on the installation instructions page.

    require 'pragmatic_segmenter'

    # Location of the file to be segmented, given as a command-line argument
    fileLocation = ARGV[0]

    # Read the whole cached file into a single string
    content = ""
    f = File.open("pragmatic/cache/" + fileLocation, "r")
    f.each_line do |line|
      content += line
    end
    f.close

    # Segment the text into sentences
    ps = PragmaticSegmenter::Segmenter.new(text: content)
    segments = ps.segment

    # One sentence per line
    stringFinal = ""
    segments.each do |seg|
      stringFinal += seg + "\n"
    end

    # Write the segmented version, keeping the same file name
    f = File.open("pragmatic/" + fileLocation, "w+")
    f.write(stringFinal)
    f.close
     
    This Ruby script reads a file from the cache directory, segments its contents sentence by sentence, and then records the result to a file that has the same name as the one passed in the argument and the .seg extension. In order to put files containing just the contents of the TTL and CON headers into the cache directory and to call the Ruby script, the following Python script was created:

    import os
    listOfFiles = list()

    # files with headers are needed for this process
    for (dirpath, dirnames, filenames) in os.walk("headers"):
        listOfFiles += [os.path.join(dirpath, file) for file in filenames]

    #for each file gathered
    for file in listOfFiles:
        if file.endswith(".txt"):
            # Keep only the file name (the last path component) and fix doubled dots
            path = file.split('/')[-1]
            path = path.replace("..", ".")

            sendToPragmatic = ''

            with open(file) as f:
                lines = [line.rstrip('\n') for line in f]

                for l in lines:
                    # Strip a possible UTF-8 BOM (str.replace returns a new string)
                    l = l.replace("\ufeff", "")

                    if l.startswith("TTL|"):
                        sendToPragmatic = l[4:] + '\n'

                    if l.startswith("CON|"):
                        sendToPragmatic += " "
                        sendToPragmatic += l[4:]
                
                with open("pragmatic/cache/" + path[:-3]+'seg', "w+") as fileReady:
                    fileReady.write(sendToPragmatic)

                pragmaticReturn = os.system('ruby ps.rb "' + path[:-3]+'seg' + '"')

    The os.system() call at the end of the script invokes the Ruby application through the command line. For each file the Python script reads, it sends the contents of its TTL and CON headers to a file with a .seg extension in the cache folder, then passes the location of that newly created file to the aforementioned Ruby script, which segments it as described and produces results like the example at the beginning of this section. 
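
    For reference, the command issued by os.system() for the example file is equivalent to running the Ruby script (saved here as ps.rb, as in the call above) by hand:

    $ ruby ps.rb "2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.seg"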

    It is worth noting that arranging the files in this format is essential for the process of running OpenSesame on FrameNet 1.7, step five of this guide. Whether the procedure is also necessary for the files to be found by Edge Search Engine 4, provided by Red Hen, is still unknown.

  4. Stanford Core NLP

     Stanford Core NLP (SC NLP) is third-party software that marks up text in a variety of ways in order to gather information from it. The various ways in which it segments and surfaces information about the content provided to it can be found on its webpage. The first thing to do in order to use SC NLP is to download the software itself. Once this is done, there is a myriad of ways to run Stanford Core NLP on any text. Since its main code is written in Java, there are several "wrappers" for SC NLP; for this project the wrapper developed by Lynten was used, and the usage and installation instructions for its pip library are at https://github.com/Lynten/stanford-corenlp.

    Once everything is installed, a test script was assembled to test the capabilities of the software; the Python script is shown below:

    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(r'/var/python/stanford-corenlp-full-2018-10-05')


    text = 'Guangdong University of Foreign Studies is located in Guangzhou. ' \
           'GDUFS is active in a full range of international cooperation and exchanges in education. '

    props={'annotators': 'tokenize,ssplit,pos','pipelineLanguage':'en','outputFormat':'json'}
    print(nlp.annotate(text, properties=props))
    nlp.close()
     
    This is the most uncertain piece of code in this entire collection, since the output format of the function has not yet been agreed upon with the members of Red Hen Lab. Within the following days this documentation will contain better steps for making everything in the text files searchable in Edge Search Engine 4, provided by Red Hen.
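
    In the meantime, here is a provisional sketch only (the output handling is an assumption, as are the corenlp/ output folder and the reuse of the pragmatic/ directory from the previous step) of how the same wrapper could be run over the segmented files and its JSON output saved:

    import os
    from stanfordcorenlp import StanfordCoreNLP

    nlp = StanfordCoreNLP(r'/var/python/stanford-corenlp-full-2018-10-05')
    props = {'annotators': 'tokenize,ssplit,pos', 'pipelineLanguage': 'en', 'outputFormat': 'json'}

    os.makedirs("corenlp", exist_ok=True)  # hypothetical output folder

    for (dirpath, dirnames, filenames) in os.walk("pragmatic"):
        if "cache" in dirpath:
            continue  # skip the unsegmented cache copies
        for name in filenames:
            if not name.endswith(".seg"):
                continue
            with open(os.path.join(dirpath, name)) as f:
                text = f.read()
            # annotate() returns the CoreNLP output as a JSON string
            result = nlp.annotate(text, properties=props)
            with open(os.path.join("corenlp", name[:-4] + ".json"), "w+") as out:
                out.write(result)

    nlp.close()
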
  5. OpenSesame on FrameNet 1.7

     Open Sesame is an open-source frame parser or, put more simply, software that highlights the semantic frames of a given text using a certain parsing model (semantic frames are a huge topic whose definition we will not go into on this page; for more on this, follow this link). Open Sesame uses FrameNet 1.7 as its dictionary to mark the text, highlighting each part of the text according to the frames it evokes.

    The process of using Open Sesame to mark frames in the abstract data was divided into three main steps: installing dependencies and downloading the various required software, training, and running it on the text files.

    Dependency software installations and download of various required software


    Firstly, it is important to note that all the software previously described in this pipeline was initially run locally; this is not the case for Open Sesame, since the training phase of the software requires a lot of processing capacity. This forced the research team to look for an alternative, namely one of Red Hen's servers. Once access to the platform was obtained, the following steps were taken in order to use Open Sesame properly; all the installation instructions given here assume a UNIX server.

    As can be seen on Open Sesame's README page, the software is written in Python and resolves its dependencies using pip, so both needed to be downloaded and installed; detailed instructions for this process are given here. In this specific case, arrangements were made for the software to be installed only for the logged-in user of the machine; below are the steps of that installation.

    Python was already installed globally, so pip had to be downloaded and installed for just the current user:
    $ cd ~/home
    $ mkdir get-pip
    $ cd get-pip/
    $ wget https://bootstrap.pypa.io/get-pip.py
    $ python get-pip.py --user
    $ pip -V

    Once pip and Python were installed, it was time to download Open Sesame and install its dependencies, again all locally:
    $ cd ~/home/path/to/server's/actual/home/directory/
    $ git clone https://github.com/swabhs/open-sesame.git
    $ cd open-sesame/

    $ # once inside the open-sesame directory it is time to install its dependencies via pip
    $ pip install dynet --user
    $ pip install nltk --user

    The next step, following the documentation given in the README file on the GitHub page, would be:
        $ python -m nltk.downloader averaged_perceptron_tagger wordnet

    However, this command raised an error indicating that a specific NLTK package was not installed; a quick search showed that the package had to be downloaded from inside the Python shell.
    $ python
    → SHELL PYTHON

    Python 3.7.2 (default, Dec 29 2018, 21:15:15)
    [GCC 8.2.1 20181127] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk 
    >>> nltk.download("punkt")
    [nltk_data] Downloading package punkt to /home/lucas/nltk_data...
    [nltk_data]   Unzipping tokenizers/punkt.zip.
    True
    >>>Ctrl + z

    → END SHELL PYTHON

    After the described installation through the Python shell, the command ran without further problems:
        $ python -m nltk.downloader averaged_perceptron_tagger wordnet

    Once all this software is installed, it is time to download the data the program needs to train its models; since the system is a neural network, it uses trained models in order to more accurately predict the frames in the target data. First, the FrameNet data needs to be requested; once the request was approved, the data was sent. For evaluation and organization purposes, all data that will be used by Open Sesame has to be inside the data/ folder, so the FrameNet data was downloaded and uncompressed inside the data/ directory.

    $ mkdir data
    $ cd data/
    $ # cookies.txt is not created automatically by wget, so create it first
    $ touch cookies.txt
    $ wget --load-cookies cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1s4SDt_yDhT8qFs1MZJbeFf-XeiNPNnx7' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1s4SDt_yDhT8qFs1MZJbeFf-XeiNPNnx7" -O fndata-1.7.tar.gz && rm -rf cookies.txt
    $ tar xvfz fndata-1.7.tar.gz fndata-1.7/

    Once the FrameNet data is uncompressed inside the data/ directory, it is time to download GloVe. GloVe is a set of pretrained word embeddings that Open Sesame uses; the embeddings used here were trained on 6B tokens.
    $ wget "http://nlp.stanford.edu/data/glove.6B.zip"   
    $ unzip glove.6B.zip 

    The last thing to do in preparation for training and running the software on real files is the preprocessing command, which has to be run in the root of the project:
    $ cd ..
    $ python -m sesame.preprocess

    Open Sesame's developer describes the preprocessing script as follows: "The above script writes the train, dev and test files in the required format into the data/neural/fn1.7/ directory. A large fraction of the annotations are either incomplete, or inconsistent. Such annotations are discarded, but logged under preprocess-fn1.7.log, along with the respective error messages."

    Training

    Once the process described above is done, training can begin. @swabhs describes training as threefold, with each step run individually and used for tests later in the process. The main training command is:
    $ python -m sesame.$MODEL --mode train --model_name $MODEL_NAME
     
    Here $MODEL is the type of model to be trained; the available types are "targetid", "frameid" and "argid". A more specific explanation of each of these models and their properties can be found on the README page as well as in the paper released by the same author.

    $MODEL_NAME is the name under which the trained model is saved and later referenced when predicting frames with each of the previously described model types ("targetid", "frameid" and "argid").

    It is worth noting that this is where the consumption of time and computational resources became noticeable and measures had to be taken, namely moving the processing to Red Hen's server, since the software is training a neural network. It is also worth noting that any of the three models can in principle be trained indefinitely; in practice, each model was trained for about 24 hours.

    Here are the training commands used in the pipeline:
    $ python -m sesame.targetid --mode train --model_name targetid-01-17
    $ python -m sesame.frameid --mode train --model_name frameid-01-18
    $ python -m sesame.argid --mode train --model_name argid-01-22

    After letting each of these training commands run for approximately 24 hours, the system had prepared the models "targetid-01-17", "frameid-01-18" and "argid-01-22" for prediction, which is the next step. 

    Predictions

    For the predictions, the models previously derived from training are used to predict the frames at each level: target, frame and argument. These models are required for the software to work. The software also requires that the sample or test file contain only sentences, one per line (see the example under the Pragmatic Segmenter heading on this page). In this example we used the raw output of step 3 of this guide.

    The next Python commands at the prompt are as follows:
    python -m sesame.targetid --mode predict --model_name targetid-01-17 --raw_input 2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.seg 

     As can be seen, the command is very similar to the training command, since it runs the same script; only the --mode flag changes from "train" to "predict", and the --raw_input flag now designates the file in which Open Sesame will search for frames. After the program runs, it outputs a .conll file at "logs/$MODEL_NAME/predicted-targets.conll", which is used as the raw input for the next step. So in this case the next command becomes:
    python -m sesame.frameid --mode predict --model_name frameid-01-18 --raw_input logs/targetid-01-17/predicted-targets.conll

    Once this runs, it outputs the frame predictions to the file logs/$MODEL_NAME/predicted-frames.conll, which is used as the input for the next step:
    $ python -m sesame.argid --mode predict --model_name argid-01-22 --raw_input logs/frameid-01-18/predicted-frames.conll
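
    For convenience, the three prediction steps can also be chained in a short Python script (a sketch only, assuming the model names trained above and the example .seg file; it simply issues the same commands shown in this section):

    import os

    # Segmented input file produced in step 3 of this guide
    seg_file = "2015-00-00_JA_10.1017∕S0267190514000178_Annual-Review-Of-Applied-Linguistics_Matsuda_Paul-Kei.seg"

    # Target identification, then frame identification, then argument identification
    os.system('python -m sesame.targetid --mode predict --model_name targetid-01-17 --raw_input "' + seg_file + '"')
    os.system('python -m sesame.frameid --mode predict --model_name frameid-01-18 --raw_input logs/targetid-01-17/predicted-targets.conll')
    os.system('python -m sesame.argid --mode predict --model_name argid-01-22 --raw_input logs/frameid-01-18/predicted-frames.conll')

    # The final argument predictions are written to logs/argid-01-22/predicted-args.conll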

    After all of that, the argument predictions file will be at logs/argid-01-22/predicted-args.conll. Here is a preview of the predicted-args.conll file.