Homework 2: Text classification (CS 2731 Spring 2024)

Due 2024-02-15, 11:59pm. Instructions last updated 2024-01-30.

Learning objectives

After completing this assignment, students will be able to:

Part 1: Learning weights in logistic regression

You are training a classifier for reviews of a new product recently released by a company. You design two features, \(x_1\) and \(x_2\), and will use logistic regression. With the weights \(w_1\) and \(w_2\) and the bias \(b\) all initialized to 0 and a learning rate \(\eta=0.2\), calculate the weights after processing each of the following 3 inputs in order:

  1. \[x_1 = 2, x_2 = 1, y = 1\]
  2. \[x_1 = 1, x_2 = 3, y = 0\]
  3. \[x_1 = 0, x_2 = 4, y = 0\]

During calculations, keep at least 3 significant digits for values. Points will not be taken off for slight differences due to rounding.
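
To check your hand calculations, the per-example update can be sketched in a few lines of Python. This assumes the standard stochastic gradient descent update for binary logistic regression with cross-entropy loss; the function name is ours:

```python
import math

def sgd_step(w, b, x, y, eta):
    """One stochastic gradient descent step for binary logistic
    regression with cross-entropy loss. w and x are equal-length lists."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    grad = y_hat - y                    # dL/dz for cross-entropy loss
    w = [wi - eta * grad * xi for wi, xi in zip(w, x)]
    b = b - eta * grad
    return w, b

# Process the three training examples in order, starting from all zeros.
w, b = [0.0, 0.0], 0.0
for x, y in [([2, 1], 1), ([1, 3], 0), ([0, 4], 0)]:
    w, b = sgd_step(w, b, x, y, eta=0.2)
```

Note that the weights after each step feed into the next step, so the three updates must be applied sequentially.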

Deliverables for Part 1

In the report:

Part 2: Implement a politeness classifier

In this portion, you will design and implement a program to classify if an online comment is polite or not. You can use any packages you want for this (scikit-learn, spaCy, NLTK, Gensim, code from Homework 1, etc). Any packages used should be specified in the README.txt file, along with version numbers for Python and all packages. If you will be using a language other than Python, please let us know before submitting. Your script should be able to take the name of a dataset as a single keyword argument.
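
For example, the dataset argument could be handled with `argparse` (the flag name and filename below are hypothetical; match whatever interface you document in your README.txt):

```python
import argparse

parser = argparse.ArgumentParser(description="Politeness classifier")
parser.add_argument("--dataset", required=True,
                    help="path to the politeness dataset file")
# In your script you would call parser.parse_args() with no arguments;
# an explicit list is passed here only for illustration.
args = parser.parse_args(["--dataset", "politeness_data.csv"])
```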

Dataset

Here is the dataset that you should download for this assignment:

This dataset consists of requests among Wikipedia editors posted on user talk pages, as well as posts on the coding help forum Stack Exchange (see the Stanford Politeness Corpus, Danescu-Niculescu-Mizil et al. 2013).

2.1 Feature-based logistic regression models

In this section, you will build a logistic regression model based on bag-of-words features and/or features of your own design. You can do whatever preprocessing you see fit. You will report performance using 5-fold cross-validation on the dataset, which you will set up. Within cross-validation, make sure to extract features (bag-of-words vocabularies, etc.) from the training folds only, never from the test folds.
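
One way to guarantee that no features leak from the test folds, assuming scikit-learn, is to wrap the vectorizer and classifier in a pipeline so the vectorizer is refit on each training fold. The toy texts below are placeholders for the real dataset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Toy stand-ins for the real comments and polite (1) / impolite (0) labels.
texts = ["could you please check this", "would you mind taking a look",
         "thanks so much for your help", "please review when you can",
         "sorry to bother you again", "fix this now", "this is wrong",
         "why did you revert my edit", "do it yourself", "stop changing my page"]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# cross_validate refits the whole pipeline, vectorizer included, on the
# training folds of each split, so bag-of-words features never see test data.
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_validate(pipe, texts, labels, cv=5,
                        scoring=["accuracy", "precision", "recall", "f1"])
```

The `scoring` list above matches the metrics requested in the report (accuracy plus precision, recall, and F1 for the positive class).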

Tasks for section 2.1

Implement and try the following feature and model combinations:

You will thus have 3 logistic regression models in total: one using bag-of-words features and two with your own selected features or preprocessing changes.

In the report, please provide:

  1. A table of 5-fold cross-validation performance scores for models trained on each set of features. Include accuracy as well as precision, recall, and f1-score for the positive (polite) class.
  2. For each feature or change in input text processing:
    1. Describe your motivation for including the feature
    2. Discussion of results: Did it improve performance or not? (Either result is fine. It is not necessary to beat logistic regression with unigram features.)
  3. For a feature-based model of your choice:
    1. List the 2 features most strongly positively and negatively associated with politeness. Discuss whether you find these surprising and any other comments you might have. You may adapt code provided by the instructor in the Naive Bayes example (notebook here), use another source online, or write your own. Give specific informative features, such as particular words (e.g. “actually”) for bag-of-words features, rather than sets of features like “tf-idf unigram features”.
    2. Do an error analysis. Provide a confusion matrix and sample multiple examples from both the false negatives and the false positives. Do you see any patterns in these errors? How might these errors be addressed with different features, or if the system could understand something it currently does not? (You don’t have to implement these fixes, just speculate.)

2.2 Neural network-based approaches

In this section, you will build and evaluate a neural network-based classifier for politeness classification. For example, you could implement a feedforward neural network that uses pre-trained static word embeddings (word2vec, GloVe, FastText, etc) as input. To represent the document, you could take the average of the word embeddings in the input sentence or choose another pooling function. You can choose which activation function to use and other hyperparameters. You are also welcome to try other methods we haven’t yet covered in class, such as LSTMs, convolutional neural networks, BERT, or other LLMs. As long as the technique uses neural networks at some point in its architecture and involves some sort of training or fine-tuning of a model, it will be accepted. Simply prompting a pre-trained LLM to classify the instances (“zero-shot” or “in-context” learning) will not be sufficient. If you have questions about what is acceptable, ask the instructor or TA.
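
A minimal sketch of the averaged-embedding idea, using a randomly initialized embedding table as a stand-in for pre-trained vectors (in practice you would load word2vec, GloVe, or FastText) and scikit-learn’s `MLPClassifier` as the feedforward network:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Toy stand-ins for the real comments and labels.
texts = ["could you please help", "thanks so much", "please take a look",
         "fix this now", "this is wrong", "do it yourself"]
labels = [1, 1, 1, 0, 0, 0]

# Stand-in embedding table: replace with pre-trained vectors in practice.
vocab = {w: i for i, w in enumerate(sorted({t for s in texts for t in s.split()}))}
dim = 50
emb = rng.normal(size=(len(vocab), dim))

def doc_vector(tokens):
    """Represent a document as the mean of its word embeddings."""
    idxs = [vocab[t] for t in tokens if t in vocab]
    return emb[idxs].mean(axis=0) if idxs else np.zeros(dim)

X = np.stack([doc_vector(s.split()) for s in texts])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X, labels)
```

Other pooling choices (max pooling, TF-IDF-weighted averaging) slot into `doc_vector` without changing the rest of the setup.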

You will again use 5-fold cross-validation on the dataset. There is no need for this model to outperform the logistic regression model you made.

Tasks for section 2.2

In the report, please provide:

2.3 (optional) Submit your classifier in the class challenge

Optionally, you can submit your classifier to run on a hidden held-out test set as part of a class competition. Bonus points will be awarded in the competition as follows, as measured by accuracy on our held-out test set.

How to submit your classifier

Please see the Kaggle competition page for instructions on how to submit for the challenge competition.

You will need to create a Kaggle account to submit. Please provide your Kaggle username used in the competition in your report so we can assign any bonus points. Note that this username will be visible in a leaderboard to other challenge competition participants.

Notes

Deliverables

Please submit all of this material on Canvas. We will grade your report and look over your code.

Grading

See rubric on Canvas.

Acknowledgments

This assignment is inspired by a homework assignment by Prof. Diane Litman. Data is from Danescu-Niculescu-Mizil et al. 2013.