Homework 1: Vector space word similarity (CS 2731 Spring 2024)

Due 2024-02-01, 11:59pm. Instructions last updated 2024-01-16.

In this assignment, you’ll build representations for documents and words based on the bag-of-words model. You’ll implement 2 popular weighting schemes for these vectors: tf-idf and PPMI, both discussed in Chapter 6 of the textbook. Then you’ll compare these weighting schemes on learning word similarity and apply one of them, PPMI, to examine social bias in an NLP corpus.

Learning objectives

After completing this assignment, you will be able to:

Datasets and skeleton code

Here are the materials that you should download for this assignment:

Part 1: Vector spaces

1.1 Term-document matrix

Write code to compile a term-document matrix for Shakespeare’s plays, following the description in the textbook:

In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection. The figure below shows a small selection from a term-document matrix showing the occurrence of four words in four plays by Shakespeare. Each cell in this matrix represents the number of times a particular word (defined by the row) occurs in a particular document (defined by the column). Thus clown appeared 117 times in Twelfth Night.

           As You Like It   Twelfth Night   Julius Caesar   Henry V
battle                  1               1               8        15
soldier                 2               2              12        36
fool                   37              58               1         5
clown                   5             117               0         0

The dimensions of your term-document matrix will be the number of unique word types \(\vert V \vert\) in the collection by the number of documents \(D\) (in this case, the number of Shakespeare’s plays that we give you in the corpus). The rows represent the words and the columns represent the documents, and each cell holds the frequency of that word in that document.
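The layout described above can be sketched as follows. This is a minimal illustration, not the skeleton code: the function name `build_term_document_matrix`, the pre-tokenized input format, and the toy documents are all assumptions made for the example.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Count word occurrences per document.

    docs: list of token lists, one list per document (assumed input format).
    Returns (matrix, vocab), where matrix[i, j] is the number of times
    vocab[i] occurs in docs[j].
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    # Rows are word types, columns are documents.
    matrix = np.zeros((len(vocab), len(docs)), dtype=np.int64)
    for j, doc in enumerate(docs):
        for tok in doc:
            matrix[word_to_idx[tok], j] += 1
    return matrix, vocab


# Toy documents, invented for illustration only.
toy_docs = [
    "the fool and the clown".split(),
    "the soldier went to battle".split(),
]
td_matrix, vocab = build_term_document_matrix(toy_docs)
print(td_matrix.shape)                 # (number of word types, number of documents)
print(td_matrix[vocab.index("the")])   # counts of "the" in each document
```

Each row of `td_matrix` is then the document-count vector for one word, which is what the later similarity questions operate on.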

Tasks for section 1.1

1.2 Term-Context Matrix

Instead of using a term-document matrix, a more common way of computing word similarity is by constructing a term-context matrix (also called a term-term or word-word matrix), where columns are labeled by words rather than documents. The dimensionality of this kind of matrix is \(\vert V \vert\) by \(\vert V \vert\). Each cell represents how often the word in the row (the target word) co-occurs with the word in the column (the context) in a training corpus. You can decide whether it makes sense for a word to co-occur with itself in the term-context matrix; that is, whether the cell where the same word is both target and context should always stay 0. Note that there may be vectors where all elements are 0 if a word only appears in documents where it is the only word in the document.
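One possible way to count co-occurrences is with a symmetric window around each target token. The sketch below is only an illustration under stated assumptions: the function name `build_term_context_matrix`, the `window_size` parameter, and the choice to count a second occurrence of the same word type inside the window (while never counting a token as its own context) are assumptions, not requirements of the assignment.

```python
import numpy as np


def build_term_context_matrix(docs, vocab, window_size=2):
    """Count how often each context word appears within window_size tokens
    of each target word.

    docs: list of token lists; vocab: list of word types (assumed inputs).
    Returns a |V| x |V| matrix with targets as rows and contexts as columns.
    """
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for doc in docs:
        for pos, target in enumerate(doc):
            lo = max(0, pos - window_size)
            hi = min(len(doc), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == pos:
                    continue  # a token is never its own context
                # Design choice: another occurrence of the same word type
                # inside the window IS counted, so the diagonal can be nonzero.
                matrix[word_to_idx[target], word_to_idx[doc[ctx_pos]]] += 1
    return matrix


# Toy input, invented for illustration only.
toy_docs = [["a", "b", "a", "c"]]
toy_vocab = ["a", "b", "c"]
tc_narrow = build_term_context_matrix(toy_docs, toy_vocab, window_size=1)
tc_wide = build_term_context_matrix(toy_docs, toy_vocab, window_size=2)
print(tc_narrow)
print(tc_wide)
```

With a window of 1 the two occurrences of `a` never see each other, so the diagonal stays 0; with a window of 2 they do, which is exactly the decision the paragraph above asks you to make deliberately.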

Tasks for section 1.2

1.3 Evaluating vector spaces

So far we have created 2 vector spaces for the words in Shakespeare, one of dimension \(D\) and another of dimension \(\vert V \vert\). Now we will evaluate how good these vector spaces are. We can do this with an intrinsic evaluation: look at which words in the vocab are most similar to each other (ideally, synonyms) and assess whether the output is reasonable. Implement the rank_words function, which takes a target word index and returns a list sorted from most similar to least similar using the cosine similarity metric. For the purposes of the assignment, just look at the top 10 words most similar to a target word in both the term-document matrix and the term-context matrix (with a window size of your choice). Are those 10 words good synonyms? The skeleton code provides an example of using rank_words to look at words similar to ‘juliet’. One example won’t be enough, so pick out at least 4 more words from the vocab as you answer these questions.
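As a rough illustration of cosine-based ranking, the sketch below ranks all rows of a matrix against one target row. The name `rank_by_cosine` and its return format are hypothetical; the skeleton's rank_words may use a different signature. Treating all-zero vectors as having similarity 0 is also an assumption made here to avoid division by zero.

```python
import numpy as np


def rank_by_cosine(target_index, matrix):
    """Rank every row of matrix by cosine similarity to row target_index.

    Returns (indices, similarities), both sorted from most to least similar.
    Rows that are all zeros get similarity 0 (an assumption for this sketch).
    """
    m = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(m, axis=1)
    dots = m @ m[target_index]
    denom = norms * norms[target_index]
    sims = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    order = np.argsort(-sims, kind="stable")  # descending, ties keep row order
    return order.tolist(), sims[order].tolist()


# Toy word vectors, invented for illustration only.
toy_matrix = [
    [1, 0],
    [2, 0],
    [0, 3],
    [1, 1],
    [0, 0],
]
ranked_indices, ranked_sims = rank_by_cosine(0, toy_matrix)
print(ranked_indices)
print(ranked_sims)
```

Note that the target word itself always ties for first place with similarity 1, so when reporting the "top 10" you may want to drop it from the list.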

Tasks for section 1.3

For the report:
1.3.1. In our term-document matrix, the rows are word vectors of \(D\) dimensions. Do you think that’s enough to represent the meaning of words?
1.3.2. Provide the top 10 associated words (the output from rank_words) with juliet and at least 2 other target words of your choice for both term-document and term-context vector spaces.
1.3.3. Which vector space (term-document or term-context) produces similar words that make more sense, and why do you think that is the case? Back up your conclusions by referring to the top associated term lists you provided.
1.3.4. Consider any decisions you made in the prior sections when implementing your functions, such as whether you allowed a target word to co-occur with itself as a context word, and which window size you chose for the term-context matrix. How might those decisions affect your results?

1.4 Weighting terms with tf-idf and PPMI

Your term-context matrix contains the raw frequency of the co-occurrence of two words in each cell and your term-document matrix contains the raw frequency of words in each of the documents. Raw frequency turns out not to be the best way of measuring the association between words. There are several methods for weighting words so that we get better results.
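The two weighting schemes can be sketched as below. This is a hedged illustration, not the required implementation: it assumes the variant \(\text{tf} = \log_{10}(\text{count} + 1)\) with \(\text{idf} = \log_{10}(N / \text{df})\), and PPMI as \(\max(\log_2 \frac{P(w,c)}{P(w)P(c)}, 0)\) with probabilities estimated from the whole count matrix. Textbook editions differ on the exact tf formula, so check which one your course expects. The function names `tf_idf` and `ppmi` are hypothetical.

```python
import numpy as np


def tf_idf(td_matrix):
    """tf-idf weighting of a |V| x D term-document count matrix.

    Assumed variant: tf = log10(count + 1), idf = log10(N / df).
    """
    counts = np.asarray(td_matrix, dtype=float)
    n_docs = counts.shape[1]
    tf = np.log10(counts + 1.0)
    df = np.count_nonzero(counts, axis=1)       # documents containing each word
    idf = np.log10(n_docs / np.maximum(df, 1))  # guard against df == 0
    return tf * idf[:, np.newaxis]


def ppmi(tc_matrix):
    """Positive PMI weighting of a term-context count matrix."""
    counts = np.asarray(tc_matrix, dtype=float)
    total = counts.sum()
    if total == 0:
        return np.zeros_like(counts)
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal over contexts
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal over targets
    expected = p_w * p_c
    ratio = np.divide(p_wc, expected, out=np.zeros_like(p_wc), where=expected > 0)
    # log2 only where the ratio is positive; everything else becomes -inf,
    # which the final max(., 0) clips to 0.
    pmi = np.log2(ratio, out=np.full_like(ratio, -np.inf), where=ratio > 0)
    return np.maximum(pmi, 0.0)


# Toy counts, invented for illustration only.
toy_td = [[1, 1], [3, 0]]
toy_tc = [[2, 0], [0, 2]]
weighted_td = tf_idf(toy_td)
weighted_tc = ppmi(toy_tc)
print(weighted_td)
print(weighted_tc)
```

In the toy term-document matrix, the first word appears in every document, so its idf (and therefore its whole tf-idf row) is 0; this is the intended effect of down-weighting words that do not discriminate between documents.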

Tasks for section 1.4

For the report:
1.4.1. Provide the top 10 associated words (the output from rank_words) with juliet and at least 2 other target words of your choice for tf-idf-weighted term-document matrices and PPMI-weighted term-context matrices.
1.4.2. How does weighting with tf-idf compare to using the unweighted term-document matrix?
1.4.3. How does weighting with PPMI compare with using the unweighted term-context matrix?
1.4.4. How does term-context/PPMI compare to term-document/tf-idf?
1.4.5. Overall, do some approaches appear to work better than others, i.e., produce better synonyms? Do any interesting patterns emerge? Discuss and point to specific examples.

Part 2

In this part, you will measure associations between words in a commonly used NLP corpus, SNLI, and comment on the potential for encoding problematic social biases. There is no skeleton code for this section, but you can reuse code from Part 1.

2.1 Find words associated with identity labels in SNLI

Here you will examine which words are highly associated with identity labels in SNLI with PPMI. You will look for any associations that may reflect social stereotypes or possible representational harms in machine learning, defined below from Blodgett et al. 2020:

Representational harms arise when a system (e.g., a search engine) represents some social groups in a less favorable light than others, demeans them, or fails to recognize their existence altogether.

Types of representational harms from Blodgett et al. 2020 include:

1. Stereotyping that propagates negative generalizations about particular social groups
2. Differences in system performance for different social groups, language that misrepresents the distribution of different social groups in the population, or language that is denigrating to particular social groups.
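To inspect which context words are most strongly associated with a given label, one option is to sort that label's row of a PPMI-weighted term-context matrix. The sketch below assumes such a matrix and a matching vocabulary list already exist; the function name `top_context_words` and the toy values are invented for illustration and are not results from SNLI.

```python
import numpy as np


def top_context_words(target, ppmi_matrix, vocab, k=10):
    """Return up to k (context_word, ppmi) pairs for target, highest first.

    Contexts with a PPMI of 0 are dropped, since they carry no association.
    """
    row = np.asarray(ppmi_matrix, dtype=float)[vocab.index(target)]
    order = np.argsort(-row, kind="stable")[:k]
    return [(vocab[i], float(row[i])) for i in order if row[i] > 0]


# Toy vocabulary and made-up PPMI values, for illustration only.
toy_vocab = ["dog", "runs", "barks", "sleeps"]
toy_ppmi = [
    [0.0, 0.5, 2.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
top_for_dog = top_context_words("dog", toy_ppmi, toy_vocab, k=10)
print(top_for_dog)
```

Running the same function for several related labels side by side makes it easier to compare their top associations.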

Tasks for section 2.1

For the report:
2.1.1. Provide the top 10 associated context words (by PPMI) for at least 4 identity labels of your choice.
2.1.2. Do you see any associations that may reflect social stereotypes? It is helpful to compare the top PMI words for certain identity terms with other related ones (such as men compared with women). Discuss and provide selected results. If you don’t find any social stereotypes (that’s okay), provide examples of what you examined and how you interpreted those associations.
2.1.3. Do you see any associations that could be defined as representational harms (defined above) learned by a bag-of-words model of this SNLI corpus? If so, which type do you see? Provide examples that support your conclusions. If you don’t find any potential harms, provide examples of what you examined and how you interpreted those associations.

2.2 Qualitative analysis

In this section, you will explore the contexts in the dataset that lead to high PMI association with context words, especially for any words that show social bias (if you found any). 1st-order similarity is when a target word (in this case, an identity label) occurs in the same document as a top-associated term. Looking at these shared documents might not tell you much about how the words are related in the dataset. If not, look at 2nd-order similarity, in which the two words occur with similar context words. This can also be examined by looking at the vectors for the identity term and the highly associated other term in the term-context matrix. These vectors may share high values in dimensions that correspond to certain context words.
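Both kinds of lookup can be sketched in a few lines. The helper names `cooccurrence_contexts` and `shared_context_words` are hypothetical, and using the element-wise minimum of the two vectors as the measure of a "shared high value" is an assumption made for this example; other choices (such as the element-wise product) are equally reasonable.

```python
import numpy as np


def cooccurrence_contexts(docs, word_a, word_b):
    """1st-order: return the documents (token lists) containing both words."""
    return [doc for doc in docs if word_a in doc and word_b in doc]


def shared_context_words(word_a, word_b, matrix, vocab, k=5):
    """2nd-order: return up to k context words where both vectors are high.

    The overlap score for a dimension is min(a[i], b[i]) (assumed measure);
    dimensions where either vector is 0 are dropped.
    """
    m = np.asarray(matrix, dtype=float)
    a = m[vocab.index(word_a)]
    b = m[vocab.index(word_b)]
    overlap = np.minimum(a, b)
    order = np.argsort(-overlap, kind="stable")[:k]
    return [vocab[i] for i in order if overlap[i] > 0]


# Toy documents and term-context counts, invented for illustration only.
toy_docs = [["a", "x", "b"], ["a", "y"], ["b", "y"]]
toy_vocab = ["a", "b", "x", "y"]
toy_matrix = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
together = cooccurrence_contexts(toy_docs, "a", "b")
shared = shared_context_words("a", "b", toy_matrix, toy_vocab)
print(together)
print(shared)
```

The first helper surfaces the actual sentences to read for 1st-order evidence; the second points to the context words worth searching for when the two terms rarely appear together.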

Tasks for section 2.2

For the report:
2.2.1. For at least 4 pairs of identity terms and highly associated words, provide the document contexts in the SNLI dataset in which they occur together (1st-order similarity) or occur separately with similar context words (2nd-order similarity). Provide selected results and discuss findings.

Deliverables

Please submit all of this material on Canvas. We will grade your report and look over your code.

Grading

See rubric on Canvas. This assignment is worth 56 points.

Acknowledgments

This assignment is adapted from Prof. Diane Litman and Prof. Mark Yatskar, as well as from Rudinger et al. 2017.