Homework 2: Text classification (CS 2731 Fall 2023)

Due 2023-10-05, 11:59pm. Instructions last updated 2023-10-03.

Learning objectives

After completing this assignment, students will be able to:

Part 1: Learning weights in logistic regression

You are training a classifier for reviews of a new product recently released by a company. You design two features, \(x_1\) and \(x_2\), and will use logistic regression. With the weights \(w_1\), \(w_2\) and the bias \(b\) all initialized to 0 and a learning rate \(\eta=0.2\), calculate the weights after processing each of the following 3 inputs in order:

  1. \[x_1 = 2, x_2 = 1, y = 1\]
  2. \[x_1 = 1, x_2 = 3, y = 0\]
  3. \[x_1 = 0, x_2 = 4, y = 0\]

During calculations, keep at least 3 significant digits. Points will not be taken off for slight differences due to rounding.
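As a reminder of the update mechanics (not a solution), here is a sketch of one stochastic gradient descent step for binary logistic regression with cross-entropy loss. The input values in the example below are illustrative, not the ones from the problem:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, b, x, y, eta):
    """One SGD update for binary logistic regression with cross-entropy loss."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    err = y_hat - y                       # gradient of the loss w.r.t. the logit
    w_new = [wi - eta * err * xi for wi, xi in zip(w, x)]
    b_new = b - eta * err
    return w_new, b_new

# Illustrative step (not the assignment's data): start from zero weights.
w, b = sgd_step(w=[0.0, 0.0], b=0.0, x=[1, 1], y=1, eta=0.2)
# With zero weights, sigmoid(0) = 0.5, so each weight moves by 0.2 * 0.5 = 0.1.
```

Apply the same update once per input, carrying the weights forward between inputs.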

Deliverables for Part 1

Part 2: Implement a politeness classifier

In this portion, you will design and implement a program to classify whether an online comment is polite. You can use any packages you want for this (scikit-learn, spaCy, NLTK, Gensim, code from Homework 1, etc.), but they must be listed in the README.txt file, along with version numbers for Python and all packages. We will attempt to run your code on a held-out test set, so the exact environment must be specified. If you will be using a language other than Python, please let us know before submitting. Your script should be able to take the name of a dataset as a single keyword argument.
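One way to satisfy the command-line requirement is `argparse`; a minimal sketch, assuming the keyword argument is named `--dataset` (the flag name and script structure are illustrative, not prescribed):

```python
import argparse

def parse_args(argv=None):
    """Parse the dataset name passed as a single keyword argument."""
    parser = argparse.ArgumentParser(description="Politeness classifier")
    parser.add_argument("--dataset", required=True,
                        help="path to the dataset file to train/evaluate on")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"Loading dataset from {args.dataset}")
```

This would be invoked as, e.g., `python classifier.py --dataset politeness_data.csv`.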

Dataset

Here is the dataset that you should download for this assignment:

2.1 Feature-based logistic regression models

In this section, you will build a logistic regression model based on bag-of-words features and/or features of your own design. You can do whatever preprocessing you see fit. You will report performance using 5-fold cross-validation on the dataset, which you will set up. Make sure to extract features (bag-of-words, etc.) from the training folds only, not the test folds, within cross-validation.
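One way to guarantee that features come only from the training folds is to put the vectorizer inside a scikit-learn `Pipeline`, so it is refit on each fold automatically; a sketch (the toy texts and labels below are stand-in data, not the politeness dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in data; replace with the politeness dataset.
texts = ["could you please help", "fix this now", "would you mind checking",
         "do it again", "thanks so much for this", "this is wrong",
         "I appreciate your help", "you broke it", "kindly take a look",
         "stop doing that"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer lives inside the pipeline, so cross_val_score refits it
# on the training folds only -- no vocabulary leaks from the test fold.
pipe = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(
    pipe, texts, labels,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy")
print(scores.mean())
```

Fitting the vectorizer on the full dataset before splitting would leak test-fold vocabulary into training, inflating the reported scores.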

Tasks for section 2.1

Implement and try the following feature and model combinations:

In the report, please provide:

2.2 Static word embeddings with feedforward neural network

In this section, you will build and evaluate a feedforward neural network that uses pre-trained static word embeddings (word2vec, GloVe, FastText, etc.) as input. To represent the document, you can take the average of the word embeddings of the input sentence or choose another function. You can choose which activation function to use and other hyperparameters. You will again use 5-fold cross-validation on the dataset. There is no need for this model to outperform the logistic regression model you made.
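To make the document representation concrete, here is a sketch of averaging word vectors and feeding them to a small feedforward network. The embedding table below is random stand-in data; in practice you would load pre-trained word2vec/GloVe/FastText vectors, e.g. via Gensim:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
DIM = 50
# Stand-in for a pre-trained embedding table (word -> vector).
vocab = ["please", "thanks", "fix", "now", "help", "wrong"]
embeddings = {w: rng.normal(size=DIM) for w in vocab}

def doc_vector(text, emb, dim=DIM):
    """Average the embeddings of known words; zeros if none are known."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy documents and labels; replace with the politeness dataset.
texts = ["please help", "thanks", "fix now", "wrong", "please thanks", "fix wrong"]
labels = [1, 1, 0, 0, 1, 0]
X = np.stack([doc_vector(t, embeddings) for t in texts])

# One hidden layer with ReLU; the size and activation are yours to tune.
clf = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=500, random_state=0)
clf.fit(X, labels)
print(clf.predict(X))
```

Averaging is only one pooling choice; max-pooling or weighted averages over the word vectors are reasonable alternatives.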

Tasks for section 2.2

In the report, please provide:

Notes

Deliverables

Please submit all of this material on Canvas. We will grade your report and attempt to run your code.

Grading

See rubric on Canvas.

Acknowledgments

This assignment is inspired by a homework assignment by Prof. Diane Litman. Data is from Danescu-Niculescu-Mizil et al. (2013).