Tweets to @realdonaldtrump; How many fucks are there to give?

Nick Ruest
Jun 14, 2018

I’ve been collecting tweets to @realDonaldTrump since June 2017. The last time I pulled the dataset together and deduplicated it, I asked myself, “I wonder how many occurrences of ‘fuck’ are in the dataset.” Or, how many fucks are there to give?

Well…

The data is updated by running a query on the Standard Search API every five days.

$ twarc search 'to:realdonaldtrump' --log donald_search_$DATE.log > donald_search_$DATE.jsonl

Which yields something like this every five days.

...
donald_search_2018_05_01.jsonl
donald_search_2018_05_01.log
donald_search_2018_05_06.jsonl
donald_search_2018_05_06.log
donald_search_2018_05_11.jsonl
donald_search_2018_05_11.log
donald_search_2018_05_16.jsonl
donald_search_2018_05_16.log
donald_search_2018_05_21.jsonl
donald_search_2018_05_21.log
donald_search_2018_05_26.jsonl
donald_search_2018_05_26.log
donald_search_2018_05_31.jsonl
donald_search_2018_05_31.log
donald_search_2018_06_01.jsonl
donald_search_2018_06_01.log
donald_search_2018_06_06.jsonl
donald_search_2018_06_06.log
...

Periodically, I cat all the jsonl files together, and then deduplicate them with deduplicate.py. This currently leaves us with 90,355,874 tweets to work with.

If you want to follow along, you can grab the most recent set of tweet ids from here. Then “hydrate” them like so:

$ gunzip to_realdonaldtrump_20180606_ids.txt.gz
$ twarc hydrate to_realdonaldtrump_20180606_ids.txt > 20180609.jsonl

This will probably take quite a while since there are potentially 90,355,874 tweets to hydrate. You’ll end up with a jsonl file of around 368G.

Once we have our full dataset, the first thing we’ll do is remove all of the retweets with noretweets.py, giving us just original tweets at @realDonaldTrump.
$ noretweets.py 20180609.jsonl > 20180609_no_retweets.jsonl

This brings us down to 69,013,268 unique tweets. Your number will probably be lower if you’re working with a hydrated dataset, because tweets that were deleted or that belong to suspended or protected accounts will not be hydrated.

$ wc -l 20180609_no_retweets.jsonl

Over the time of collecting, some of the Twitter APIs and fields changed slightly (extended tweets, and 280 character tweets). For us, this means the “text” of our tweets can reside in two different attributes: text or full_text. So, we need to extract the text. Let’s use tweet_text.

$ tweet_text.py 20180609_no_retweets.jsonl >| 20180609_tweet_text.txt

Now that we have just the text, we can count how many fucks there are with grep and wc!

$ grep -i "fuck" 20180609_tweet_text.txt | wc -l
1882456

There are 1,882,456 fucks to give! That’s a fuck to tweet ratio of 2.73%.

… … …

For some more fun, let’s take the last 1000 lines of our new text file, and make an animated gif out of them. First, let’s get our text:

$ grep -i "fuck" 20180609_tweet_text.txt > fucks.txt
$ tail -n 1000 fucks.txt > 1000_fucks.txt

Then let’s create a little bash script.

#!/bin/bash
# Render each line of text as its own zero-padded, numbered PNG.
index=0
while IFS= read -r line; do
  let "index++"
  pad=$(printf "%05d" "$index")
  convert -size 800x600 -background black -weight 300 -fill white -gravity Center -font Ubuntu caption:"$line" /path/to/images/$pad.png
done < /path/to/1000_fucks.txt

# Stitch the PNGs into a looping animated gif.
cd /path/to/images
convert -monitor -define registry:temporary-path=/tmp -limit memory 8GiB -limit map 10GiB -delay 90 *.png -loop 0 1000_fucks.gif

Give it a filename, then make it executable, and run it!

In the end, you’ll end up with something like this:

[image: 1000_fucks.gif]
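
If you'd rather do the counting in one pass, here is a minimal Python sketch of the same idea: skip retweets, pull the text out of either text or full_text, and match case-insensitively. It is not the code behind noretweets.py or tweet_text.py. The function names are made up for illustration, and it assumes that retweets carry a retweeted_status key and that extended text may also be nested under extended_tweet.

```python
import io
import json
import re

FUCK = re.compile(r"fuck", re.IGNORECASE)


def tweet_text(tweet):
    """Return the tweet's text, preferring the extended form when present."""
    # Assumption: extended tweets carry "full_text", either at the top level
    # or nested under "extended_tweet"; older tweets only have "text".
    if "full_text" in tweet:
        return tweet["full_text"]
    extended = tweet.get("extended_tweet", {})
    return extended.get("full_text", tweet.get("text", ""))


def count_matches(lines):
    """Return (non-retweet tweets seen, tweets whose text matches)."""
    total = matching = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Assumption: a "retweeted_status" key marks a retweet.
        if "retweeted_status" in tweet:
            continue
        total += 1
        if FUCK.search(tweet_text(tweet)):
            matching += 1
    return total, matching


# Tiny in-memory stand-in for a jsonl file, one tweet per line.
sample = io.StringIO("\n".join(json.dumps(t) for t in [
    {"id": 1, "text": "What the FUCK"},
    {"id": 2, "full_text": "Good morning"},
    {"id": 3, "text": "RT fuck", "retweeted_status": {"id": 9}},
    {"id": 4, "text": "short", "extended_tweet": {"full_text": "no fucks given"}},
]))
total, matching = count_matches(sample)
print(f"{matching} of {total} tweets ({matching / total:.2%})")
```

To run it on real data, pass an open file handle for the hydrated jsonl file instead of the sample. Note that this counts tweets, while grep piped to wc -l counts matching lines in the text file, so the two numbers can differ slightly when a tweet's text spans more than one line.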