Tweet Pipeline

Task

Twitter can be regarded as a broadcast network. Tweet archives could be processed and distributed in the /tv tree in the normal fashion: that tree is organized into subdirectories by year, sub-subdirectories by month, and sub-sub-subdirectories by UTC day. Tweets posted on a given UTC day, e.g. 2018-10-07, could in principle be placed in a file /tv/2018/2018-10/2018-10-07/2018-10-07.twt. Each line in this file might correspond to one tweet, with standard fields delimited by a pipe (|). Tweets in that file could then be processed for sentence segmentation, NLP, Frames, etc., and the results placed in a separate file with its own extension, perhaps .twtmeta. This system would need to be designed to be compatible with existing Red Hen data structures and processes. Can you help develop the Tweet Processing Pipeline? If so, write to 

 
and Red Hen will try to connect you with a mentor.

Shane Karas and Ram Gullapalli are the initial organizers of this team.
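
The directory layout described under Task can be sketched in a few lines of Python. This is only an illustration: the function name twt_path and its parameters are hypothetical, and the .twtmeta extension is the tentative one mentioned above, not a settled convention.

```python
from datetime import date
from pathlib import PurePosixPath


def twt_path(day: date, root: str = "/tv", ext: str = "twt") -> PurePosixPath:
    """Return the path of the tweet file for one UTC day.

    Layout assumed from the task description:
    /tv/YYYY/YYYY-MM/YYYY-MM-DD/YYYY-MM-DD.<ext>
    """
    year = f"{day.year:04d}"
    month = f"{year}-{day.month:02d}"
    stem = day.isoformat()  # YYYY-MM-DD
    return PurePosixPath(root) / year / month / stem / f"{stem}.{ext}"


print(twt_path(date(2018, 10, 7)))
print(twt_path(date(2018, 10, 7), ext="twtmeta"))
```

The caller is responsible for passing the UTC date of the tweet, not the local date.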

json-to-csv

Red Hen imagines that some json-to-csv library will be the place to start.
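
As a rough idea of what a json-to-csv step does, the sketch below flattens a nested json record into a single level of dotted keys. It uses no particular library; the function flatten and both sample records are made up for illustration, and real tweet records carry many more keys. It also shows why a mention changes the number of columns.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys.

    Empty dicts and empty lists contribute no keys at all.
    """
    flat = {}
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Hypothetical, heavily abbreviated records.
plain = {"full_text": "hi", "user": {"screen_name": "a", "name": "A"}}
mention = {
    "full_text": "hi @b",
    "entities": {"user_mentions": [{"screen_name": "b", "name": "B"}]},
    "user": {"screen_name": "a", "name": "A"},
}

print(sorted(flatten(plain)))
print(sorted(flatten(mention)))
print(len(flatten(plain)), len(flatten(mention)))
```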

Some background for working on .json tweets in bash:

  1. A tweet can mention another person by Twitter handle, known as screen_name. In that case, the json record tags the screen_name and also supplies, under its own tag, the corresponding real name.
  2. A tweet can be a retweet, with a cascade of info.
  3. A tweet can quote another tweet. More cascades.
  4. Unfortunately, tags are reused: the same screen_name and real-name tags are used for everybody involved in the tweet. Red Hen is under the impression that the json record lists the actual top-level tweeter last.
  5. This means that lines for tweets do not always have the same number of fields, which would make sophisticated work on fields, and statistics on fields, difficult.
  6. Red Hen might want to improve the situation by moving what we think is the crucial information into the first N fields, so that each of those N positions always holds a known category, e.g. time of creation of the tweet in UTC, screen name of the tweeter, real name of the tweeter, full text.
  7. Red Hen would then be able to tell her processing scripts (sentence splitter, NLP tagger, Frame tagger, etc.) to ignore everything after the first N fields. Red Hen would also be able to tell the script which field to work on. E.g., split sentences and do NLP tagging on field 4 only.
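
Points 6 and 7 can be sketched as follows, with N = 4. This is a minimal sketch under assumptions, not a settled design: it assumes each tweet record has created_at and full_text at the top level and the tweeter's screen_name and name nested under a user key; the function names (clean, tweet_to_fields, text_field) are hypothetical; and replacing pipes inside a value with spaces is just one possible way to keep the delimiter unambiguous.

```python
from datetime import datetime, timezone

FIELD_SEP = "|"


def clean(value) -> str:
    """Keep a value on one line and free of the field delimiter.

    Pipes become spaces, and runs of whitespace (including newlines)
    collapse to a single space.
    """
    return " ".join(str(value).replace(FIELD_SEP, " ").split())


def tweet_to_fields(tweet: dict) -> list:
    """Return the four fixed leading fields for one tweet record."""
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    created_utc = created.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    user = tweet.get("user", {})  # assumed location of the tweeter's names
    return [
        created_utc,
        clean(user.get("screen_name", "")),
        clean(user.get("name", "")),
        clean(tweet.get("full_text", "")),
    ]


def text_field(line: str, index: int = 4) -> str:
    """Return field number `index` (1-based); later fields are ignored."""
    return line.rstrip("\n").split(FIELD_SEP)[index - 1]


# Abbreviated sample record, for illustration only.
tweet = {
    "created_at": "Mon Jan 01 13:37:52 +0000 2018",
    "full_text": "Much work to be done,\nbut it will be a great New Year!",
    "user": {"screen_name": "realDonaldTrump", "name": "Donald J. Trump"},
}
fields = tweet_to_fields(tweet)
# Anything after the first four fields may vary from tweet to tweet.
line = FIELD_SEP.join(fields + ['"retweeted": false'])
print(line)
print(text_field(line))
```

A sentence splitter or NLP tagger could then be pointed at text_field(line) and never needs to look at the variable tail of the line.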

Manipulating json info into one-tweet-per-line .twt files

Red Hen expects that a json-to-csv library is the place to start, but bash also has many string manipulation abilities. As an example:
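# Pretty-print the archive so that each key/value pair is on its own line.
$ cat master_2018.json | python -m json.tool > masterpretty_2018.json
# Strip leading spaces and tabs from every line.
$ cat masterpretty_2018.json | sed 's/^[ \t]*//' > t1.twt
# Drop the account-creation line (matched by its content) and in_reply_to_screen_name lines, then keep only the lines for the fields of interest.
$ cat t1.twt | grep -v "Wed\ Mar\ 18.*2009" | grep -v "in_reply_to_screen_name" | grep "time_zone\|created_at\|full_text\|retweeted\|screen_name\|^\"name" > t2.twt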

[Careful: the command above requires knowing the content of the line that records when the user account was created, in this case Wed\ Mar\ 18.*2009. It is not a general solution for filtering out that line.]

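# Prefix the full_text, created_at, retweeted, screen_name and name lines with a pipe.
$ cat t2.twt | sed 's/^.full/\|&/' | sed 's/^.created/\|&/' | sed 's/^.retweeted/\|&/' | sed 's/^.screen_name/\|&/' | sed 's/^.name/\|&/' > t3.twt
# Join everything into one line, then start a new line at each "time_zone.
$ cat t3.twt | tr -d '\n' | sed 's/\"time_zone/\n&/g' > t4.twt
# Move the created_at field in front of the time_zone field.
$ cat t4.twt | sed 's/\(^\"time_zone.*\)\(\"created_at".*[0-9][0-9][0-9][0-9]\"\,\)\(.*\)/\2\1\3/' > t5.twt
# Remove the "created_at": " label so that each line starts with the timestamp.
$ cat t5.twt | sed 's/\"created_at\":\ \"//' > t6.twt
# Replace the first "," on each line, between the timestamp and "time_zone, with a pipe.
$ cat t6.twt | sed 's/\"\,\"/\|/' > t7.twt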
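
As an aside on the created_at filter in the grep step above: if the archive follows the usual tweet layout, in which account details are nested under a user key, the two created_at values can be told apart by position rather than by content. The sketch below rests on that assumption, which has not been checked against master_2018.json; the function name and the sample record are hypothetical, and the time of day on the account line is a placeholder.

```python
import json


def created_at_values(record: dict):
    """Return (tweet creation time, account creation time) for one record.

    Assumes the account's created_at sits under the "user" key; either
    value is None if the record does not have it there.
    """
    tweet_time = record.get("created_at")
    account_time = record.get("user", {}).get("created_at")
    return tweet_time, account_time


# Hypothetical record, for illustration only.
record = json.loads("""
{
  "created_at": "Mon Jan 01 13:37:52 +0000 2018",
  "full_text": "Much work to be done!",
  "user": {
    "created_at": "Wed Mar 18 00:00:00 +0000 2009",
    "screen_name": "realDonaldTrump",
    "name": "Donald J. Trump"
  }
}
""")

tweet_time, account_time = created_at_values(record)
print(tweet_time)
print(account_time)
```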


These bash steps convert a json twitter archive into a file with lines like this:

Mon Jan 01 13:37:52 +0000 2018|time_zone": "Eastern Time (US & Canada)",|"full_text": "Will be leaving Florida for Washington (D.C.) today at 4:00 P.M. Much work to be done, but it will be a great New Year!",||"retweeted": false,|"screen_name": "realDonaldTrump",|"name": "Donald J. Trump",

Mon Jan 01 12:44:40 +0000 2018|time_zone": "Eastern Time (US & Canada)",|"full_text": "Iran is failing at every level despite the terrible deal made with them by the Obama Administration. The great Iranian people have been repressed for many years. They are hungry for food & for freedom. Along with human rights, the wealth of Iran is being looted. TIME FOR CHANGE!",||"retweeted": false,|"screen_name": "realDonaldTrump",|"name": "Donald J. Trump",

Mon Jan 01 12:12:00 +0000 2018|time_zone": "Eastern Time (US & Canada)",|"full_text": "The United States has foolishly given Pakistan more than 33 billion dollars in aid over the last 15 years, and they have given us nothing but lies & deceit, thinking of our leaders as fools. They give safe haven to the terrorists we hunt in Afghanistan, with little help. No more!",||"retweeted": false,|"screen_name": "realDonaldTrump",|"name": "Donald J. Trump",
The date command converts a date and time in that format correctly:
$ date -ud 'Mon Jan 01 13:37:52 +0000 2018' '+%Y-%m-%d %H:%M %Z %z' 
2018-01-01 13:37 UTC +0000
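
The same conversion can be sketched with Python's standard datetime module, which may be convenient if the rest of the pipeline ends up in Python. The constant and function names are hypothetical. Note that %a and %b expect English day and month abbreviations, so the sketch assumes the default C or an English locale.

```python
from datetime import datetime, timezone

# Format of created_at as it appears in the lines above.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_utc(created_at: str) -> datetime:
    """Parse a created_at string and return an aware datetime in UTC."""
    parsed = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return parsed.astimezone(timezone.utc)


stamp = to_utc("Mon Jan 01 13:37:52 +0000 2018")
print(stamp.strftime("%Y-%m-%d %H:%M %Z %z"))  # 2018-01-01 13:37 UTC +0000
print(stamp.date().isoformat())  # 2018-01-01, the UTC day used in the /tv tree
```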