Homework 1: Vector space word similarity (CS 2731 Fall 2023)

Due 2023-09-17, 11:59pm (extended from 2023-09-14 initial deadline). Instructions last updated 2023-09-13.

In this assignment, you’ll build representations for documents and words based on the bag-of-words model. You’ll implement 2 popular weighting schemes for these vectors: tf-idf and PPMI, both discussed in Chapter 6 of the textbook. Then you’ll compare these weighting schemes on learning word similarity and apply one of them, PPMI, to examine social bias in an NLP corpus.

Learning objectives

After completing this assignment, students will be able to:

Datasets and skeleton code

Here are the materials that you should download for this assignment:

Part 1: Vector spaces

1.1 Term-document matrix

Write code to compile a term-document matrix for Shakespeare’s plays, following the description in the textbook:

In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection. The figure below shows a small selection from a term-document matrix showing the occurrence of four words in four plays by Shakespeare. Each cell in this matrix represents the number of times a particular word (defined by the row) occurs in a particular document (defined by the column). Thus clown appeared 117 times in Twelfth Night.

         As You Like It  Twelfth Night  Julius Caesar  Henry V
battle                1              1              8       15
soldier               2              2             12       36
fool                 37             58              1        5
clown                 5            117              0        0

The dimensions of your term-document matrix will be the number of unique word types \(\vert V \vert\) in the collection by the number of documents \(D\) (in this case, the number of Shakespeare’s plays that we give you in the corpus). The rows represent the words, the columns represent the documents, and each cell holds the frequency of that word in that document.
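As a rough illustration of this layout, here is a minimal sketch that counts words per document using only the Python standard library. The function name `build_term_document_matrix`, the input format (a dict from document name to token list), and the toy documents are all assumptions made for this example; they are not taken from the skeleton code, and the same logic carries over to NumPy arrays.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Count how often each word occurs in each document.

    docs: dict mapping a document name to its list of tokens
          (a hypothetical input format chosen for this sketch).
    Returns (vocab, doc_names, matrix), where matrix[i][j] is the count of
    vocab[i] in doc_names[j]: one row per word, one column per document.
    """
    doc_names = sorted(docs)
    vocab = sorted({tok for tokens in docs.values() for tok in tokens})
    word_index = {word: i for i, word in enumerate(vocab)}

    matrix = [[0] * len(doc_names) for _ in vocab]
    for j, name in enumerate(doc_names):
        for word, count in Counter(docs[name]).items():
            matrix[word_index[word]][j] = count
    return vocab, doc_names, matrix


# Toy stand-in for the play corpus (invented data, not from the assignment).
toy_docs = {
    "play_a": "the fool and the clown".split(),
    "play_b": "the soldier saw the battle and the fool".split(),
}
vocab, doc_names, td_matrix = build_term_document_matrix(toy_docs)
for word, row in zip(vocab, td_matrix):
    print(word, row)
```

Each row of `td_matrix` is the vector for one word, so its length equals the number of documents.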

Tasks for section 1.1

1.2 Term-Context Matrix

Instead of using a term-document matrix, a more common way of computing word similarity is to construct a term-context matrix (also called a term-term or word-word matrix), where columns are labeled by words rather than documents. The dimensionality of this kind of matrix is \(\vert V \vert\) by \(\vert V \vert\). Each cell represents how often the word in the row (the target word) co-occurs with the word in the column (the context) in a training corpus. You can decide whether it makes sense for a word to co-occur with itself in the term-context matrix. That is, will the cell where the same word is both target and context always stay 0?
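One way to count co-occurrences is with a symmetric window around each token. The sketch below is only one possible design: the function name `build_term_context_matrix`, the `window_size` parameter, and the toy sentence are assumptions for illustration. In this version a token is never counted as its own context, but a repeated word that falls inside the window does increment the diagonal cell, which is one possible answer to the question above.

```python
def build_term_context_matrix(sentences, vocab, window_size=2):
    """Count co-occurrences within +/- window_size tokens of each target.

    sentences: list of token lists (hypothetical input format).
    vocab: list of word types; row i and column i both refer to vocab[i].
    """
    word_index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for tokens in sentences:
        for pos, target in enumerate(tokens):
            start = max(0, pos - window_size)
            end = min(len(tokens), pos + window_size + 1)
            for ctx_pos in range(start, end):
                if ctx_pos == pos:
                    continue  # skip the target token's own position
                context = tokens[ctx_pos]
                matrix[word_index[target]][word_index[context]] += 1
    return matrix


# Toy example (invented data): "the" occurs twice, three tokens apart.
toy_vocab = ["and", "clown", "fool", "the"]
toy_sentences = [["the", "fool", "and", "the", "clown"]]
narrow = build_term_context_matrix(toy_sentences, toy_vocab, window_size=2)
wide = build_term_context_matrix(toy_sentences, toy_vocab, window_size=3)
the = toy_vocab.index("the")
print(narrow[the][the], wide[the][the])
```

With a window of 2 the two occurrences of "the" are too far apart to co-occur, so the diagonal cell stays 0; with a window of 3 each occurrence sees the other and the cell becomes 2.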

Tasks for section 1.2

1.3 Evaluating vector spaces

So far we have created 2 vector spaces for the words in Shakespeare, one of dimension \(D\) and another of dimension \(\vert V \vert\). Now we will evaluate how good these vector spaces are. One way to do this is an intrinsic evaluation: find which words in the vocab are most similar to a given word and assess whether the output is reasonable, for example whether the most similar words are synonyms. Implement the rank_words function, which takes a target word index and returns a list sorted from most similar to least similar using the cosine similarity metric. For the purposes of the assignment, just look at the top 10 words most similar to a target word in both the term-document matrix and the term-context matrix (with a window size of your choice). Are those 10 words good synonyms? The skeleton code provides an example of using rank_words to look at words similar to ‘juliet’. One example won’t be enough, so pick at least 4 more words from the vocab as you answer these questions.
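The ranking step can be sketched as follows. This is not the skeleton's rank_words: the name `rank_by_cosine` and its signature are assumptions for this example, and the real function may take different arguments or return similarity scores as well. The rows are the four word vectors from the term-document table in section 1.1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def rank_by_cosine(target_index, matrix):
    """Return row indices sorted from most to least similar to the target row.

    Hypothetical stand-in for rank_words; ties are broken by row index.
    """
    target = matrix[target_index]
    scored = [(cosine_similarity(target, row), i) for i, row in enumerate(matrix)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


# Word vectors from the table in section 1.1 (columns are the four plays).
words = ["battle", "soldier", "fool", "clown"]
rows = [
    [1, 1, 8, 15],
    [2, 2, 12, 36],
    [37, 58, 1, 5],
    [5, 117, 0, 0],
]
print([words[i] for i in rank_by_cosine(words.index("battle"), rows)])
print([words[i] for i in rank_by_cosine(words.index("fool"), rows)])
```

The target word always ranks first because its similarity to itself is 1, so a top-10 list would normally skip the first entry. On these four rows, battle is closest to soldier and fool is closest to clown.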

Tasks for section 1.3

For the report:

1.4 Weighting terms with tf-idf and PPMI

Your term-context matrix contains the raw frequency of the co-occurrence of two words in each cell and your term-document matrix contains the raw frequency of words in each of the documents. Raw frequency turns out not to be the best way of measuring the association between words. There are several methods for weighting words so that we get better results.
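The two weighting schemes can be sketched as below. The formulas used here are one common variant and are an assumption of this sketch: tf is taken as log10(count + 1), idf as log10(N / df), and PPMI as max(log2(P(w, c) / (P(w) P(c))), 0). Check them against the definitions in Chapter 6 before relying on them. The function names `tf_idf` and `ppmi` are also illustrative and may not match the skeleton code.

```python
import math


def tf_idf(term_document):
    """Weight a term-document count matrix (rows = words, columns = documents).

    Assumed variant: tf = log10(count + 1), idf = log10(N / df).
    """
    num_docs = len(term_document[0])
    weighted = []
    for row in term_document:
        doc_freq = sum(1 for count in row if count > 0)
        idf = math.log10(num_docs / doc_freq) if doc_freq else 0.0
        weighted.append([math.log10(count + 1) * idf for count in row])
    return weighted


def ppmi(term_context):
    """Weight a term-context count matrix with positive PMI.

    Assumed variant: max(log2(P(w, c) / (P(w) * P(c))), 0), with
    probabilities estimated from the counts and zero counts mapped to 0.
    """
    total = sum(sum(row) for row in term_context)
    row_sums = [sum(row) for row in term_context]
    col_sums = [sum(col) for col in zip(*term_context)]
    weighted = []
    for i, row in enumerate(term_context):
        new_row = []
        for j, count in enumerate(row):
            if count == 0:
                new_row.append(0.0)
                continue
            pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
            new_row.append(max(pmi, 0.0))
        weighted.append(new_row)
    return weighted


# Rows "battle" and "clown" from the table in section 1.1.
tfidf_rows = tf_idf([[1, 1, 8, 15], [5, 117, 0, 0]])
# Tiny invented co-occurrence matrix for two words.
ppmi_rows = ppmi([[0, 2], [2, 0]])
print(tfidf_rows)
print(ppmi_rows)
```

Under this variant, a word such as battle that appears in every document gets an idf of 0, so its whole tf-idf row is zeroed out, while clown, which appears in only two of the four plays, keeps nonzero weights.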

Tasks for section 1.4

For the report:

Part 2

In this part, you will measure associations between words in a commonly used NLP corpus, SNLI, and comment on the potential for encoding problematic social biases. There is no skeleton code for this section, but you can reuse code from Part 1.

Tasks for Part 2

For the report:

Representational harms arise when a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether.

Types of representational harms from Blodgett et al. 2020 include:

1. Stereotyping that propagates negative generalizations about particular social groups
2. Differences in system performance for different social groups, language that misrepresents the distribution of different social groups in the population, or language that is denigrating to particular social groups.

Deliverables

Please submit all of this material on Canvas. We will grade your report and attempt to run your code.

Grading

See rubric on Canvas.

Acknowledgments

This assignment is adapted from Prof. Diane Litman and Prof. Mark Yatskar, as well as from Rudinger et al. 2017.