Assignment 1: Corpus collection from Twitter
You need a Twitter developer account for this assignment. Please apply as early as possible, since the approval process (by Twitter) may take time.
Deadline: May 15, 2020 12:00
All modern NLP systems make use of linguistic data, or corpora. In fact, much of the recent success of NLP systems is probably due to the abundance of corpora that can be obtained relatively easily from the Internet. In this assignment, we will build a small collaborative corpus from the micro-blogging site Twitter.
For this purpose, we will use Twitter’s API to connect and retrieve the tweets we are interested in. Using the Twitter API requires a Twitter developer account, which you can apply for here. The application process may take time, so please apply as soon as possible, indicating clearly that your use case is academic/educational. In case of long delays or rejection of your application, please contact the instructor.
General description
For this assignment, each team will collect tweets from public figures, such as politicians, actors, athletes, scientists (particularly visible in the time of COVID-19), or Internet celebrities of any other type you can think of as a public figure.
Each team will build a small corpus of tweets, calculate some statistics on the collected data, and store the corpus for later use.
The exercise
Follow the instructions below to complete this assignment.
1.1 Choose your public figures
Choose five public figures who tweet frequently in English. The person does not need to be a native English speaker, but they should have posted at least 100 tweets (excluding re-tweets) in English. Please try to introduce as much variation as possible, for example by including people from different countries, fields, or age groups. Note that you are also required to provide some information about your public figures. You can “reserve” your public figures without entering this information at this step; however, make sure to complete all the information before the assignment deadline.
Record the information on people you chose at https://github.com/snlp2020/snlp/blob/master/a1-people.csv.
- Pick a person, not an organizational account.
- Make sure that no other team has picked one of your choices before you. If a person is already taken, please find another public figure.
- Keep this list sorted by the screen name, and make sure not to break the format of the file.
Note that this file is in our common course repository, not in your assignment repository.
1.2 Downloading “timeline” of a twitter account
Implement the function download_tweets in the provided template a1.py. Please pay attention to the following:
- You should download only the tweets posted by the indicated account. No replies or re-tweets should be included.
- You should download tweets only in English.
- Make sure to download full “extended” tweet texts, not only the first 140 characters.
- Make sure your code does not fail because of the rate limits of Twitter API.
This exercise requires authentication tokens that you need to obtain through Twitter’s developer portal. You should not record these tokens in your repository, which means you should not have them defined in your main Python file. A common practice is to place the keys in an external Python file that is not maintained in the version management system, and import this file from the main script. Alternatively, you can store the authentication tokens in a configuration file (e.g., a JSON/YAML formatted file), pass them on the command line (not a very secure idea), or supply them through operating system environment variables (not very secure either). In general, you should never save authentication tokens of any type in your source code.
For this exercise you are recommended to use the Python tweepy library.
1.3 Print out some statistics
Implement the print_statistics function in the template, printing out the following information about a list of tweets:
- The number of tweets.
- Minimum, maximum, and average number of tokens. Tokens can be obtained using NLTK’s TweetTokenizer.
- Average number of ‘mentions’
- Average number of ‘hashtags’
- Average number of retweets
- Average number of times the tweets were favorited
- The date span of the tweets (dates of the first and the last tweets in the list)
Bonus (+1):
Plot two histograms into a single PDF file called <user>-histograms.pdf, where <user> is the Twitter screen name of the person:
- A histogram of the number of tokens.
- A histogram of the times of the tweets in UTC.
For plotting the data, you can use the matplotlib library.
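One way to approach the bonus is sketched below, again assuming tweets as raw API dicts. The helper that extracts the UTC hour from Twitter’s created_at format is kept separate from the plotting code, and matplotlib is imported lazily inside the plotting function:

```python
from datetime import datetime

def tweet_hours(tweets):
    """Hour of day (UTC) for each tweet; created_at uses Twitter's
    format, e.g. 'Mon May 04 10:12:30 +0000 2020'."""
    fmt = "%a %b %d %H:%M:%S %z %Y"
    return [datetime.strptime(t["created_at"], fmt).hour for t in tweets]

def plot_histograms(tweets, token_counts, user):
    """Write both histograms to <user>-histograms.pdf."""
    import matplotlib
    matplotlib.use("Agg")  # render without a display
    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(token_counts, bins=20)
    ax1.set_xlabel("tokens per tweet")
    ax2.hist(tweet_hours(tweets), bins=24, range=(0, 24))
    ax2.set_xlabel("hour of day (UTC)")
    fig.savefig("{}-histograms.pdf".format(user))
    plt.close(fig)
```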
1.4 Save the tweets
Implement the save_tweets function as described in the template.
The function should save the given list of tweets as a JSON file.
Make sure that the resulting file is a valid JSON file.
1.5 Implement a basic command line interface
Implement the following command-line specification for your Python script.
- The “screen_name” of the person should be the only required (positional) argument. E.g.,
python3 a1.py realDonaldTrump
should download tweets by screen name realDonaldTrump, and save them to the file realDonaldTrump.json.
- The optional argument --stats should print out the statistics implemented in 1.3 to the standard output (screen, not to the data file).
You are strongly recommended to use the argparse library.
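The interface above can be expressed with argparse as in the following sketch (the function name parse_args is just one possible way to organize it):

```python
import argparse

def parse_args(argv=None):
    """Command-line interface: one positional screen_name, optional --stats."""
    parser = argparse.ArgumentParser(
        description="Download tweets from a public figure's timeline.")
    parser.add_argument("screen_name",
                        help="Twitter screen name of the person")
    parser.add_argument("--stats", action="store_true",
                        help="print corpus statistics to standard output")
    return parser.parse_args(argv)
```

The main script would then call download_tweets with args.screen_name, save the result to args.screen_name + ".json", and call print_statistics only when args.stats is set.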
1.6 Wrapping up
Make sure you have pushed all your source code and the JSON files with the tweets of your public figures to your repository.
Also double-check that all the information required for your public figures is entered in a1-people.csv.