Tweet Pipeline

Twitter can be regarded as a broadcast network. Tweet archives could be processed and distributed in the /tv tree in the normal fashion: that tree is organized with subdirectories by year, sub-sub-directories by month, and sub-sub-sub-directories by UTC day.

There is a public streaming API that provides "a small random sample of all public statuses". Red Hen has consumed this data in its raw JSON format, but would like to convert it to the standard format shared by all of the TV news data.

Given the size of the files, we are currently planning to create one tweet file per UTC hour. For example, /tv/2018/2018-10/2018-10-07/2018-10-07_1000_WW_Spritzer.twt would contain all the tweets in the public sample stream (known informally as "spritzer") for the one hour from 10:00 (inclusive) to 11:00 (exclusive) UTC on October 7, 2018.
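The naming scheme above can be sketched in Python. This is only an illustration, not an agreed implementation: the function name hourly_tweet_path and its parameters are hypothetical, and the WW and Spritzer components are taken directly from the example path rather than from any fixed specification.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def hourly_tweet_path(ts, root="/tv", region="WW", stream="Spritzer", ext="twt"):
    """Return the hourly file path for the UTC hour that contains ts.

    Hypothetical helper: the layout mirrors the example path in the text
    (year / year-month / UTC day / file), nothing more.
    """
    if ts.tzinfo is None:
        # Naive timestamps are ambiguous, so refuse them rather than guess.
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    day = utc.strftime("%Y-%m-%d")
    # Minutes are always 00 because each file covers [HH:00, HH+1:00).
    name = f"{day}_{utc:%H}00_{region}_{stream}.{ext}"
    return PurePosixPath(root) / utc.strftime("%Y") / utc.strftime("%Y-%m") / day / name


if __name__ == "__main__":
    print(hourly_tweet_path(datetime(2018, 10, 7, 10, 42, tzinfo=timezone.utc)))
```

A tweet stamped 10:59:59 UTC maps to the 1000 file and one stamped 11:00:00 UTC maps to the 1100 file, which matches the inclusive/exclusive hour boundary described above.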

Each line in the file will correspond to one tweet, and fields will be separated with the | (pipe) character. The exact fields and order are currently being decided, and input is welcome on the GitHub discussion page.
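As a starting point for that discussion, here is a minimal sketch of turning one raw JSON tweet into one pipe-separated line. Everything about the field list is an assumption: the four fields and their order are placeholders, not the decided format. The backslash escaping is likewise just one possible way to keep a literal | or a line break inside the tweet text from breaking the one-tweet-per-line layout.

```python
import json


def escape_field(value):
    """Escape characters that would break the one-line, pipe-separated layout.

    Assumed convention (not part of any decided format): backslash-escape
    the backslash itself, the pipe, and line breaks.
    """
    return (
        value.replace("\\", "\\\\")
        .replace("|", "\\|")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )


def tweet_to_line(raw_json):
    """Convert one raw JSON tweet into a single pipe-separated line.

    The field selection and order below are hypothetical placeholders.
    """
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}
    values = (
        tweet.get("created_at", ""),
        tweet.get("id_str", ""),
        user.get("screen_name", ""),
        tweet.get("text", ""),
    )
    return "|".join(escape_field(str(v)) for v in values)


if __name__ == "__main__":
    raw = json.dumps({
        "created_at": "Sun Oct 07 10:00:01 +0000 2018",
        "id_str": "1",
        "user": {"screen_name": "example"},
        "text": "first line | still the same tweet\nsecond line",
    })
    print(tweet_to_line(raw))
```

Whatever fields are finally chosen, some escaping rule of this kind will be needed, since tweet text can itself contain the separator character and newlines.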

Once the format is decided and the initial data ingested, tweet files could then be processed for sentence segmentation, NLP, Frames, etc., and the results placed in a separate file, perhaps with the extension .twtmeta. This system would need to be designed to be compatible with existing Red Hen data structures and processes.


Related links