After all of the previous phases, the project's data needed to be loaded into a corpus concordancer of some kind. Initially, the plan was to insert the corpus of abstracts into Edge Search Engine 4 (as the earlier text of this page implies), since it can already display data by frames. However, that platform is mainly designed for video and would be a poor fit for the data collected on this page.
To overcome this "corpus format" issue, it was decided that we would use CQPweb (as can be seen in the next sections). This platform supports a wide array of searches, such as by POS or by grammatical construction, and the input data can be modeled directly to fit it. With that in mind, the project could grow a branch in which the data was modeled, using the tools already presented, to fit CQPweb.
At first glance, it was clear that inserting the collected frames directly into CQPweb would be too hard and time-consuming. It was therefore decided that in the first phase of the project, with a fixed deadline of July 2019, the corpus would be searchable on CQPweb at least by constructions.
During the research process, it was found that CQPweb accepts a searchable corpus as a file with the extension .vrt, which stands for vertical text (i.e. one word per line).
The tutorial on inserting the SaCoCo corpus into CQPweb describes the process of creating such files.
Following the aforementioned tutorial on a private installation of CQPweb (the installation process is described in the next topic) was helpful mainly because it grounded the work of modelling the data and provided the hands-on experience needed to advance in this direction. The "Easy" section of the tutorial was followed, which yielded an example VRT file, a blueprint that could be used to process the research data set.
A VRT file is a vertical text file with XML-style tags. It follows a strict structure in which each text (in this research, each abstract) is enclosed in a <text> tag, which marks the beginning and the end of each text in the corpus. Inside it, the <p> tag encloses every paragraph, and inside each paragraph the <s> tag encloses every sentence. Each line of the VRT file that contains corpus content has to be structured in the following way:
searchable_information(TAB)searchable_information(TAB)searchable_information...
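To make the layout described above concrete, here is a minimal Python sketch that writes already tokenized text into this structure. It is an illustration only, not the script used in the project: the function name to_vrt, the id attribute on <text>, and the three columns (word, POS, lemma) are assumptions chosen for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vrt(texts):
    """Serialize tokenized texts into a VRT-style string.

    `texts` is an iterable of (text_id, paragraphs) pairs. Each paragraph is
    a list of sentences, and each sentence is a list of token tuples such as
    (word, pos, lemma). The column choice is hypothetical; any number of
    searchable columns works as long as every token has the same number.
    """
    lines = []
    for text_id, paragraphs in texts:
        # One <text> element per abstract; the id attribute is an assumption.
        lines.append(f"<text id={quoteattr(text_id)}>")
        for paragraph in paragraphs:
            lines.append("<p>")
            for sentence in paragraph:
                lines.append("<s>")
                for token in sentence:
                    # One token per line, columns separated by a tab.
                    # escape() protects &, < and > inside token values.
                    lines.append("\t".join(escape(column) for column in token))
                lines.append("</s>")
            lines.append("</p>")
        lines.append("</text>")
    return "\n".join(lines) + "\n"


# Hypothetical sample: one abstract, one paragraph, one sentence.
sample = [
    ("abstract_001", [[[("This", "DT", "this"),
                        ("works", "VBZ", "work"),
                        (".", ".", ".")]]]),
]
print(to_vrt(sample), end="")
```

Each token ends up on its own line with its columns separated by tab characters, and the structural tags sit on lines of their own.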
Once the file format was well defined, it was time to create a script that arranges the data accordingly: