Tweet Pipeline


Twitter can be regarded as a broadcast network. Tweet archives could be processed and distributed in the /tv tree in the normal fashion: that tree is organized into directories by year, subdirectories by month, and, below those, subdirectories by UTC day.

There is a public streaming API that provides "a small random sample of all public statuses". Red Hen has consumed this data in its "raw" JSON format, but would like to convert it to the standard format shared by all of the TV news data.

Given the size of the files, we are currently planning to create one file of tweets per UTC hour. For example, /tv/2018/2018-10/2018-10-07/2018-10-07_1000_WW_Spritzer.twt would contain all the tweets in the public sample stream (known informally as "spritzer") for the one hour from 10:00 (inclusive) to 11:00 (exclusive) UTC on October 7, 2018.
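The naming scheme can be illustrated with a short Python sketch. It is not part of the Red Hen code base: the function name twt_path and its parameters (root, region, source) are assumptions made for illustration, and only the path layout itself comes from the example above.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def twt_path(ts, root="/tv", region="WW", source="Spritzer"):
    """Return the hourly .twt path for a timezone-aware datetime.

    Hypothetical helper: the parameter names are illustrative only.
    """
    # Convert to UTC and truncate to the start of the hour, so every
    # tweet from 10:00:00 up to (but excluding) 11:00:00 maps to one file.
    utc = ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    # Directory levels: year, then year-month, then year-month-day.
    day_dir = (
        PurePosixPath(root)
        / utc.strftime("%Y")
        / utc.strftime("%Y-%m")
        / utc.strftime("%Y-%m-%d")
    )
    name = f"{utc:%Y-%m-%d_%H%M}_{region}_{source}.twt"
    return str(day_dir / name)


print(twt_path(datetime(2018, 10, 7, 10, 42, tzinfo=timezone.utc)))
# /tv/2018/2018-10/2018-10-07/2018-10-07_1000_WW_Spritzer.twt
```

The sketch expects a timezone-aware datetime; a naive one would be interpreted as local time by astimezone, which is probably not what a UTC-based archive wants.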

Each line in the file will correspond to one tweet, and fields will be separated with the | (pipe) character. The exact fields and order are currently being decided, and input is welcome on the GitHub discussion page.
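As a rough sketch of the one-tweet-per-line layout, the Python below turns one raw JSON tweet into a pipe-separated line. The field list (created_at, id_str, screen_name, text) is purely a placeholder, since the real fields and their order are still undecided; the function names and the choice to replace embedded pipes and line breaks with spaces are likewise assumptions, not a settled rule.

```python
import json

# Placeholder field selection; the actual fields and order are undecided.
FIELDS = ("created_at", "id_str", "screen_name", "text")


def clean(value):
    """Make a value safe for a one-line, pipe-separated record."""
    text = str(value)
    # Assumption: separators and line breaks inside a value become spaces.
    for ch in ("|", "\r", "\n", "\t"):
        text = text.replace(ch, " ")
    # Collapse the resulting runs of whitespace.
    return " ".join(text.split())


def tweet_to_line(raw_json):
    """Convert one raw JSON tweet into a single pipe-separated line."""
    tweet = json.loads(raw_json)
    record = {
        "created_at": tweet.get("created_at", ""),
        "id_str": tweet.get("id_str", ""),
        "screen_name": (tweet.get("user") or {}).get("screen_name", ""),
        "text": tweet.get("text", ""),
    }
    return "|".join(clean(record[name]) for name in FIELDS)


raw = json.dumps({
    "created_at": "Sun Oct 07 10:00:01 +0000 2018",
    "id_str": "1",
    "user": {"screen_name": "redhen"},
    "text": "Breaking | news\nmore",
})
line = tweet_to_line(raw)
print(line)
# Sun Oct 07 10:00:01 +0000 2018|1|redhen|Breaking news more
```

Replacing the separator inside values is lossy; an escaping scheme would preserve the original text at the cost of a more complex reader. Which trade-off suits Red Hen is part of the open format discussion.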

Once the format is decided and the initial data ingested, tweet files could then be processed for sentence segmentation, NLP, Frames, etc., and the results placed in a separate file with its own extension, perhaps .twtmeta. This system would need to be designed to be compatible with existing Red Hen data structures and processes.

Related links


Recording the public/sample stream. The code used to record tweets will be shared shortly. This code is simple and has only one task: to write content from the stream directly to files on the hard disk. Twitter does not buffer the stream, so no processing occurs in this script; all content is written to disk for later scripts to analyse and format. The pipeline does not aim to be near-real-time, so those later scripts can run in batches.
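Until the actual recording code is shared, the write-only design can be pictured with the following Python sketch. It is not the Red Hen recorder: record_stream is a hypothetical name, and a list of byte strings stands in for the network stream so that the example runs without any Twitter credentials or client library.

```python
import os
import tempfile


def record_stream(chunks, out_path):
    """Append raw chunks to out_path without parsing them.

    Returns the number of bytes written. `chunks` can be any iterable
    of bytes objects; in a real recorder it would be the HTTP stream.
    """
    written = 0
    with open(out_path, "ab") as out:
        for chunk in chunks:
            # No JSON decoding or filtering here: the only job is to
            # get the bytes onto disk as they arrive.
            out.write(chunk)
            written += len(chunk)
    return written


# Stand-in for the network stream: two raw JSON lines.
fake_stream = [b'{"id_str": "1"}\r\n', b'{"id_str": "2"}\r\n']
out_path = os.path.join(tempfile.mkdtemp(), "spritzer_raw.json")
total = record_stream(fake_stream, out_path)
```

Opening the file in append mode means a restarted recorder adds to an existing file rather than overwriting it; whether the real script does the same is not stated here.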

Formatting and distributing .twt files within the Red Hen file structure. Scott is developing a Python script to handle this step of the process. The code is open source within the GitHub repository.

Open tasks. We want to ingest tweets from other archives in addition to the public/sample stream. The final pipeline should be sufficiently flexible to accommodate a wide variety of input formats.

Other tools.

json-to-csv