# twut usage

## How do I...

- [Extract tweet ids](#extract-tweet-ids)
- [Extract tweet language](#extract-tweet-language)
- [Extract user information](#extract-user-information)
- [Extract tweet text](#extract-tweet-text)
- [Extract tweet times](#extract-tweet-times)
- [Extract tweet sources](#extract-tweet-sources)
- [Extract hashtags](#extract-hashtags)
- [Extract urls](#extract-urls)
- [Extract animated gif urls](#extract-animated-gif-urls)
- [Extract image urls](#extract-image-urls)
- [Extract media urls](#extract-media-urls)
- [Extract video urls](#extract-video-urls)
- [Remove sensitive tweets](#remove-sensitive-tweets)
- [Remove retweets](#remove-retweets)
- [Remove non-verified user tweets](#remove-non-verified-user-tweets)
- [Work with DataFrame Results](#work-with-dataframe-results)

## Extract Tweet IDs

Single-column DataFrame containing Tweet IDs.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

ids(tweetsDF).show(2, false)
```

Output:

```
+-------------------+
|id_str             |
+-------------------+
|1201505319257403392|
|1201505319282565121|
+-------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.ids(df).show(2, False)
```

Output:

```
+-------------------+
|id_str             |
+-------------------+
|1201505319257403392|
|1201505319282565121|
+-------------------+
```

## Extract Tweet Language

Single-column DataFrame containing the BCP 47 language identifier corresponding to the machine-detected language.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

language(tweetsDF).show(5, false)
```

Output:

```
+----+
|lang|
+----+
|tl  |
|ja  |
|ar  |
|ja  |
|ja  |
+----+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.language(df).show(5, False)
```

Output:

```
+----+
|lang|
+----+
|tl  |
|ja  |
|ar  |
|ja  |
|ja  |
+----+
```
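The result is an ordinary Spark DataFrame, so standard transformations can be chained onto it. As a minimal sketch, tallying tweets per detected language with plain Spark's `groupBy`/`count`/`orderBy` (these are standard Spark operations, not twut functions):

```scala
import io.archivesunleashed._
import org.apache.spark.sql.functions.desc

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

// Tally tweets per machine-detected language, most frequent first.
// groupBy/count/orderBy are standard Spark, not part of twut.
language(tweetsDF)
  .groupBy("lang")
  .count()
  .orderBy(desc("count"))
  .show(10, false)
```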
## Extract User Information

Multi-column DataFrame containing the following columns: `favourites_count`, `followers_count`, `friends_count`, `id_str`, `location`, `name`, `screen_name`, `statuses_count`, and `verified`.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

userInfo(tweetsDF).show(2, false)
```

Output:

```
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|favourites_count|followers_count|friends_count|id_str             |location|name               |screen_name |statuses_count|verified|
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|8302            |101            |133          |1027887558032732161|nct🌱   |车美               |M_chemei    |3720          |false   |
|2552            |73             |218          |2548066344         |null    |ひーこ☆禿げても愛せ|heeko_gr_029|15830         |false   |
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
only showing top 2 rows
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.userInfo(df).show(2, False)
```

Output:

```
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|favourites_count|followers_count|friends_count|id_str             |location|name               |screen_name |statuses_count|verified|
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
|8302            |101            |133          |1027887558032732161|nct🌱   |车美               |M_chemei    |3720          |false   |
|2552            |73             |218          |2548066344         |null    |ひーこ☆禿げても愛せ|heeko_gr_029|15830         |false   |
+----------------+---------------+-------------+-------------------+--------+-------------------+------------+--------------+--------+
```

## Extract Tweet Text

Single-column (`text`) or two-column (`text` and `full_text`) DataFrame containing tweet text.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

text(tweetsDF).show(2, false)
```

Output:

```
+---------------------------------+
|text                             |
+---------------------------------+
|Baket ang pogi mo???             |
|今日すげぇな!#安元江口と夜あそび |
+---------------------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.text(df).show(2, False)
```

Output:

```
+---------------------------------+
|text                             |
+---------------------------------+
|Baket ang pogi mo???             |
|今日すげぇな!#安元江口と夜あそび |
+---------------------------------+
```

## Extract Tweet Times

Single-column DataFrame containing the tweet time.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

times(tweetsDF).show(2, false)
```

Output:

```
+------------------------------+
|created_at                    |
+------------------------------+
|Mon Dec 02 14:16:05 +0000 2019|
|Mon Dec 02 14:16:05 +0000 2019|
+------------------------------+
only showing top 2 rows
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.times(df).show(2, False)
```

Output:

```
+------------------------------+
|created_at                    |
+------------------------------+
|Mon Dec 02 14:16:05 +0000 2019|
|Mon Dec 02 14:16:05 +0000 2019|
+------------------------------+
```
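The `created_at` values above are strings in Twitter's `EEE MMM dd HH:mm:ss Z yyyy` layout. If you need real timestamps for sorting or bucketing, here is a sketch using Spark's built-in `to_timestamp` (not a twut function; the pattern string may need adjusting for your Spark version's datetime parser):

```scala
import io.archivesunleashed._
import org.apache.spark.sql.functions.{col, to_timestamp}

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

// Parse "Mon Dec 02 14:16:05 +0000 2019" strings into timestamps;
// to_timestamp is standard Spark, not part of twut.
times(tweetsDF)
  .withColumn("timestamp", to_timestamp(col("created_at"), "EEE MMM dd HH:mm:ss Z yyyy"))
  .show(2, false)
```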
## Extract Tweet Sources

Single-column DataFrame containing the source of the tweet.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

sources(tweetsDF).show(10, false)
```

Output:

```
+-------------------+
|source             |
+-------------------+
|Twitter for Android|
|Twitter for Android|
|Twitter for iPhone |
|Twitter for iPhone |
|Twitter for iPhone |
|Twitter Web App    |
|Twitter for iPhone |
|Twitter for iPhone |
|Twitter for Android|
|Twitter for iPhone |
+-------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.sources(df).show(10, False)
```

Output:

```
+-------------------+
|source             |
+-------------------+
|Twitter for Android|
|Twitter for Android|
|Twitter for iPhone |
|Twitter for iPhone |
|Twitter for iPhone |
|Twitter Web App    |
|Twitter for iPhone |
|Twitter for iPhone |
|Twitter for Android|
|Twitter for iPhone |
+-------------------+
```

## Extract Hashtags

Single-column DataFrame containing hashtags.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

hashtags(tweetsDF).show
```

Output:

```
+------------------+
|          hashtags|
+------------------+
|安元江口と夜あそび|
+------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.hashtags(df).show()
```

Output:

```
+------------------+
|          hashtags|
+------------------+
|安元江口と夜あそび|
+------------------+
```

## Extract Urls

Single-column DataFrame containing urls.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/10-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

urls(tweetsDF).show(10, false)
```

Output:

```
+-----------------------------------------------------------+
|url                                                        |
+-----------------------------------------------------------+
|https://t.co/hONLvNozJg                                    |
|https://twitter.com/komsakaddams/status/1198868305668296705|
+-----------------------------------------------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.urls(df).show(10, False)
```

Output:

```
+-----------------------+
|url                    |
+-----------------------+
|https://t.co/hONLvNozJg|
|https://t.co/mI5HYXLXwy|
|https://t.co/OLcG6Vu6dI|
|https://t.co/6KANFBbSa2|
|https://t.co/NRKfpoyubk|
|https://t.co/K66y3z4o8U|
|https://t.co/k0k7VndNzw|
|https://t.co/Vf7qICr4v0|
|https://t.co/fqRf3Og2qw|
|https://t.co/kC3XMKWBR8|
+-----------------------+
```
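Since the extracted urls are a plain single-column DataFrame, you can aggregate them like any other column. A sketch that groups urls by host using Spark's built-in `parse_url` SQL function (not part of twut) to see which domains are shared most:

```scala
import io.archivesunleashed._
import org.apache.spark.sql.functions.{desc, expr}

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

// Group extracted urls by host; parse_url is a built-in Spark SQL
// function, not a twut extractor.
urls(tweetsDF)
  .withColumn("host", expr("parse_url(url, 'HOST')"))
  .groupBy("host")
  .count()
  .orderBy(desc("count"))
  .show(10, false)
```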
## Extract Animated Gif Urls

Single-column DataFrame containing animated gif urls.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

animatedGifUrls(tweetsDF).show(10, false)
```

Output:

```
+-----------------------------------------------------------+
|animated_gif_url                                           |
+-----------------------------------------------------------+
|https://pbs.twimg.com/tweet_video_thumb/EKyat33U4AEpVFf.jpg|
|https://pbs.twimg.com/tweet_video_thumb/EKyQ1fAU8AM7r1I.jpg|
|https://pbs.twimg.com/tweet_video_thumb/EKyau1OU8AAD_OZ.jpg|
+-----------------------------------------------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.animatedGifUrls(df).show(10, False)
```

Output:

```
+-----------------------------------------------------------+
|animated_gif_url                                           |
+-----------------------------------------------------------+
|https://pbs.twimg.com/tweet_video_thumb/EKyat33U4AEpVFf.jpg|
|https://pbs.twimg.com/tweet_video_thumb/EKyQ1fAU8AM7r1I.jpg|
|https://pbs.twimg.com/tweet_video_thumb/EKyau1OU8AAD_OZ.jpg|
+-----------------------------------------------------------+
```

## Extract Image Urls

Single-column DataFrame containing image urls.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

imageUrls(tweetsDF).show(5, false)
```

Output:

```
+-----------------------------------------------+
|image_url                                      |
+-----------------------------------------------+
|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|
|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|
|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|
|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|
|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|
+-----------------------------------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.imageUrls(df).show(5, False)
```

Output:

```
+-----------------------------------------------+
|image_url                                      |
+-----------------------------------------------+
|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|
|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|
|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|
|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|
|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|
+-----------------------------------------------+
```
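The same image can surface repeatedly in a collection (for example, via retweets), so you may want to deduplicate before counting or exporting the list. A minimal sketch using Spark's ordinary `distinct`:

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

// Deduplicate image urls before counting; the same media often
// appears many times across retweets. distinct is standard Spark.
imageUrls(tweetsDF)
  .distinct()
  .count()
```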
## Extract Media Urls

Single-column DataFrame containing animated gif urls, image urls, and video urls.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

mediaUrls(tweetsDF).show(5, false)
```

Output:

```
+-----------------------------------------------+
|image_url                                      |
+-----------------------------------------------+
|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|
|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|
|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|
|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|
|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|
+-----------------------------------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.mediaUrls(df).show(5, False)
```

Output:

```
+-----------------------------------------------+
|image_url                                      |
+-----------------------------------------------+
|https://pbs.twimg.com/media/EKjNNRFXsAANHyQ.jpg|
|https://pbs.twimg.com/media/EKvWq8LXsAE_HhV.jpg|
|https://pbs.twimg.com/media/EKx9va5XUAEKcry.jpg|
|https://pbs.twimg.com/media/EKyNK0-WoAMDou3.jpg|
|https://pbs.twimg.com/media/EKyHOyZVUAE3GX6.jpg|
+-----------------------------------------------+
```

## Extract Video Urls

Single-column DataFrame containing video urls.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

videoUrls(tweetsDF).show(5, false)
```

Output:

```
+---------------------------------------------------------------------------------------------------+
|video_url                                                                                          |
+---------------------------------------------------------------------------------------------------+
|https://video.twimg.com/ext_tw_video/1201113203125583872/pu/pl/mLQJE9rIBSE6DaQ_.m3u8?tag=10        |
|https://video.twimg.com/ext_tw_video/1201113203125583872/pu/vid/460x258/o5wbkNtC_yVBiGvM.mp4?tag=10|
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/pl/1LRDIgIbWofMDpOa.m3u8?tag=10        |
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/vid/360x638/KrMl6qgy_8ugHBW-.mp4?tag=10|
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/vid/320x568/lMOyqZH6fnCoDGzI.mp4?tag=10|
+---------------------------------------------------------------------------------------------------+
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

SelectTweet.videoUrls(df).show(5, False)
```

Output:

```
+---------------------------------------------------------------------------------------------------+
|video_url                                                                                          |
+---------------------------------------------------------------------------------------------------+
|https://video.twimg.com/ext_tw_video/1201113203125583872/pu/pl/mLQJE9rIBSE6DaQ_.m3u8?tag=10        |
|https://video.twimg.com/ext_tw_video/1201113203125583872/pu/vid/460x258/o5wbkNtC_yVBiGvM.mp4?tag=10|
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/pl/1LRDIgIbWofMDpOa.m3u8?tag=10        |
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/vid/360x638/KrMl6qgy_8ugHBW-.mp4?tag=10|
|https://video.twimg.com/ext_tw_video/1200729524045901825/pu/vid/320x568/lMOyqZH6fnCoDGzI.mp4?tag=10|
+---------------------------------------------------------------------------------------------------+
```
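As the output above shows, each video typically yields both `.m3u8` playlist urls and one or more `.mp4` renditions. If you only want the direct `.mp4` files, a sketch using an ordinary Spark filter (`contains` rather than `endsWith`, since the urls carry a `?tag=10` suffix):

```scala
import io.archivesunleashed._
import org.apache.spark.sql.functions.col

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

// Keep only the mp4 renditions of each video; filter/contains are
// standard Spark, not part of twut.
videoUrls(tweetsDF)
  .filter(col("video_url").contains(".mp4"))
  .show(5, false)
```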
## Remove Sensitive Tweets

Filters out tweets labeled as sensitive.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

removeSensitive(tweetsDF).count
```

Output:

```
res0: Long = 246
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

FilterTweet.removeSensitive(df).count()
```

Output:

```
246
```

## Remove Retweets

Filters out retweets.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

removeRetweets(tweetsDF).count
```

Output:

```
res0: Long = 230
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

FilterTweet.removeRetweets(df).count()
```

Output:

```
230
```

## Remove Non-verified User Tweets

Filters out tweets from non-verified users.

### Scala DF

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

removeNonVerified(tweetsDF).count
```

Output:

```
res0: Long = 5
```

### Python DF

```python
from twut import *

path = "src/test/resources/500-sample.jsonl"
df = spark.read.json(path)

FilterTweet.removeNonVerified(df).count()
```

Output:

```
5
```
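Each filter takes a DataFrame and returns a DataFrame, so the three can be composed. A sketch (the nesting order is arbitrary) that keeps only verified-user tweets that are neither retweets nor flagged as sensitive:

```scala
import io.archivesunleashed._

val tweets = "src/test/resources/500-sample.jsonl"
val tweetsDF = spark.read.json(tweets)

// Compose the filters: verified users only, no retweets,
// no tweets flagged as sensitive.
removeSensitive(removeRetweets(removeNonVerified(tweetsDF))).count
```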
## Work with DataFrame Results

Most script snippets in the documentation end with `.show(20, false)` (in Scala) or `.show(20, False)` (in Python). This prints a table of the first 20 results without truncating the columns (the `false` or `False`). If the columns get too wide to display comfortably, change the second parameter to `true` (Scala) or `True` (Python) to truncate them. If you want more or fewer results, change the first number.

### Scala

If you want to return a set of results, the counterpart of `.take(10)` with RDDs is `.head(10)`. So, something like (in Scala):

```scala
ids(tweetsDF)
  // more transformations here...
  .head(10)
```

In the Scala console, the results are automatically assigned to a variable, like the following:

```scala
res1: Array[org.apache.spark.sql.Row] = Array(...)
```

Scala numbers these variables automatically, starting at 0, and increments the number with each statement. You can then manipulate the variable, for example `res0(0)` to access the first element in the array.

Don't like the variable name Scala gives you? You can do something like this:

```scala
val r = ids(tweetsDF)
  // more transformations here...
  .head(10)
```

Scala assigns the results to `r` in this case, which you can then manipulate, like `r(0)` to access the first element.

If you want all results, replace `.head(10)` with `.collect()`. This will return all results to the console.

**WARNING**: Be careful with `.collect()`! If your results contain ten million records, TWUT will try to return all of them to your console (on your physical machine). Most likely, your machine won't have enough memory!

Alternatively, if you want to save the results to disk, replace `.show(20, false)` with the following:

```scala
.write.csv("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location. Note that this is a directory, not a file.

Depending on your intended use of the output, you may want to include headers in the CSV file, in which case:

```scala
.write.option("header","true").csv("/path/to/export/directory/")
```

If you want to store the results with the intention of reading them back later for further processing, use Parquet format:

```scala
.write.parquet("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location. Note that this is a directory, not a file.

Later, as in a completely separate session, you can read the results back in and continue processing, as follows:

```scala
val results = spark.read.parquet("/path/to/export/directory/")

results.show(20, false)
```

Parquet encodes metadata such as the schema and column types, so you can pick up exactly where you left off. Note that this works even across languages (e.g., export to Parquet from Scala, read back in Python) or any system that supports Parquet.

### Python

If you want to return a set of results, the counterpart of `.take(10)` with RDDs is `.head(10)`. So, something like (in Python):

```python
(SelectTweet.ids(df)
  # more transformations here...
  .head(10))
```

In the PySpark console, the results are returned as a list of rows, like the following:

```
[Row(id_str='1201505319257403392'), Row(id_str='1201505319282565121'), Row(id_str='1201505319257608197'), Row(id_str='1201505319261655041'), Row(id_str='1201505319261597696'), Row(id_str='1201505319274332165'), Row(id_str='1201505319261745152'), Row(id_str='1201505319270146049'), Row(id_str='1201505319286755328'), Row(id_str='1201505319286984705')]
```

You can assign the transformations to a variable, like this:

```python
tweet_ids = SelectTweet.ids(df)
# more transformations here...
```

If you want all results, replace `.head(10)` with `.collect()`. This will return all results to the console.

**WARNING**: Be careful with `.collect()`! If your results contain ten million records, TWUT will try to return all of them to your console (on your physical machine). Most likely, your machine won't have enough memory!

Alternatively, if you want to save the results to disk, replace `.show(20, False)` with the following:

```python
tweet_ids.write.csv("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location. Note that this is a directory, not a file.

Depending on your intended use of the output, you may want to include headers in the CSV file, in which case:

```python
tweet_ids.write.csv("/path/to/export/directory/", header='true')
```

If you want to store the results with the intention of reading them back later for further processing, use Parquet format:

```python
tweet_ids.write.parquet("/path/to/export/directory/")
```

Replace `/path/to/export/directory/` with your desired location. Note that this is a directory, not a file.

Later, as in a completely separate session, you can read the results back in and continue processing, as follows:

```python
tweet_ids = spark.read.parquet("/path/to/export/directory/")

tweet_ids.show(20, False)
```

Parquet encodes metadata such as the schema and column types, so you can pick up exactly where you left off. Note that this works even across languages (e.g., export to Parquet from Scala, read back in Python) or any system that supports Parquet.