Homework 3: Character-based language modeling (CS 2731 Spring 2024)

Due 2024-03-10, 11:59pm. Instructions last updated 2024-03-07.

In this assignment, you will build unigram, bigram, and trigram character language models (both unsmoothed and smoothed versions) for three languages, score a test document with each, and determine the language it is written in based on perplexity. You will also use your English language models to generate texts. You will critically examine all results. The learning goals of this assignment are to:

  1. build character n-gram language models from scratch
  2. evaluate language models with perplexity and use them for language identification
  3. generate text from language models and critically analyze the results

Data

The data for this project is available here: hw3_data.zip. It consists of training text for each of the three languages and a test document for evaluation.

1. Train character n-gram language models

To complete the assignment, you will need to write a program that:

  1. trains unsmoothed and interpolated character n-gram language models on the training data for each language
  2. computes the perplexity of a test document under each model
  3. generates text from the English models given a prior context

You may make additional assumptions and design decisions, but state them in your report (see below). For example, you will need to decide how to handle uppercase versus lowercase letters and how to handle digits. These choices are up to you; we only require that you document them in your report and consider their implications when discussing your results. There is no wrong choice here, and NLP researchers routinely make such decisions when preprocessing data.
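As one illustration, a minimal preprocessing pass might lowercase all text and collapse digits into a single placeholder. This is a hypothetical sketch of one possible scheme, not a required design:

def preprocess(line):
    # Hypothetical choices: lowercase everything and map every digit to '0'
    line = line.lower()
    return ''.join('0' if ch.isdigit() else ch for ch in line)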

You may write your program in any TA-approved programming language (Python, Java, C/C++).

You may use any resources or packages to extract character n-grams from text, such as scikit-learn or NLTK. However, you may not use any package to directly train language models or compute their probabilities; that portion of the program must be written from scratch.
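For example, NLTK's ngrams utility operates on any sequence, including strings, so it yields character n-grams directly (a sketch, assuming NLTK is installed):

from nltk.util import ngrams

# Character bigrams of a string: [('t', 'h'), ('h', 'e'), ('e', ' '), ...]
char_bigrams = list(ngrams("the cat", 2))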

You will be training a total of 15 character n-gram models:

  1. unsmoothed unigram, bigram, and trigram models for each of the three languages (9 models)
  2. interpolated bigram and trigram models for each of the three languages (6 models)
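Since the probability estimation must be your own code, here is a minimal sketch of maximum-likelihood estimation by relative frequency. The function name and data layout are hypothetical; `lines` is assumed to be a list of preprocessed training strings:

from collections import Counter

def mle_probs(lines, n):
    # Count each (prefix, suffix) n-gram and each prefix, then divide.
    # For n = 1 the prefix is the empty string, so this handles unigrams too.
    ngram_counts = Counter()
    prefix_counts = Counter()
    for line in lines:
        for i in range(n - 1, len(line)):
            prefix, suffix = line[i - n + 1:i], line[i]
            ngram_counts[(prefix, suffix)] += 1
            prefix_counts[prefix] += 1
    return {(p, s): c / prefix_counts[p] for (p, s), c in ngram_counts.items()}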

Deliverables for part 1

In your report, include:

  1. a description of how you wrote your program, including all assumptions and design decisions
  2. an excerpt of both the unsmoothed and interpolated trigram language models for English, displaying all n-grams and their probabilities with the two-character history "th"
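If you implement smoothing with simple linear interpolation, one standard formulation for the trigram case mixes the maximum-likelihood estimates of all lower orders:

\[ P_{\text{interp}}(c_i \mid c_{i-2} c_{i-1}) = \lambda_3 P(c_i \mid c_{i-2} c_{i-1}) + \lambda_2 P(c_i \mid c_{i-1}) + \lambda_1 P(c_i), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]

The bigram case is analogous, with two weights.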

2. Use character n-gram language models
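The skeleton below computes the perplexity of a text under an n-gram model: two raised to the average negative log2 probability per n-gram,

\[ \mathrm{PPL}(c_1 \dots c_N) = 2^{-\frac{1}{M} \sum_{i=n}^{N} \log_2 P(c_i \mid c_{i-n+1} \dots c_{i-1})}, \quad M = N - n + 1. \]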

import math

def perplexity(text, model, n):
    """Compute the perplexity of a text under an n-gram character model.

    Args:
        text: a string of characters
        model: a matrix or dataframe of probabilities, with rows as prefixes
            and columns as suffixes. You can modify this depending on how you
            set up your model.
        n: n-gram order of the model

    Acknowledgments:
        https://towardsdatascience.com/perplexity-intuition-and-derivation-105dd481c8f3
        https://courses.cs.washington.edu/courses/csep517/18au/
        ChatGPT with GPT-3.5
    """
    # FILL IN: remove any characters from the text that have no unigram
    # probability in the language (unseen characters)

    N = len(text)
    char_probs = []
    for i in range(n - 1, N):
        prefix = text[i - n + 1:i]
        suffix = text[i]
        # FILL IN: look up the probability in the model of the suffix given the prefix
        prob = ...
        char_probs.append(math.log2(prob))
    neg_log_lik = -1 * sum(char_probs)  # negative log-likelihood of the text
    ppl = 2 ** (neg_log_lik / (N - n + 1))  # average negative log-likelihood per n-gram, exponentiated
    return ppl
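For the generation deliverable below, one simple approach is to repeatedly sample the next character from the model's conditional distribution given the most recent n-1 characters. This sketch assumes the (prefix, suffix)-keyed dictionary layout of the hypothetical mle_probs above, and that every prefix encountered has at least one candidate (smoothed models help avoid dead ends):

import random

def generate(model, context, n, length=100):
    # Start from a prior context of n-1 characters and sample forward.
    out = context
    for _ in range(length):
        prefix = out[-(n - 1):] if n > 1 else ''
        candidates = [(s, p) for (pre, s), p in model.items() if pre == prefix]
        suffixes, probs = zip(*candidates)
        out += random.choices(suffixes, weights=probs, k=1)[0]
    return out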

Deliverables for part 2

In your report, include:

  1. for the bigram and trigram interpolated models of all 3 languages (6 models total), the average perplexity score across all lines in the test document, as well as which language the test document is estimated to be written in (the language whose model has the lowest perplexity)
  2. for the unsmoothed and interpolated English bigram and trigram models (4 models total), generated text for 4 prior contexts per model: single letters for the bigram models and two-letter sequences for the trigram models. You may choose the prior contexts that start the generation.
  3. critical analysis of your language identification results: e.g., why do your perplexity scores indicate which language the test data is written in? What does a comparison of your unigram, bigram, and trigram scores tell you about which model order performs best?
  4. critical analysis of your generation results: e.g., are there any differences between the sentences generated by the bigram and trigram models, or by the unsmoothed versus smoothed models? How does a value of \(k=1\) compare with other values? Give examples to back up your conclusions.

Submission

Please submit your report and your code on Canvas.

Grading

This homework assignment is worth 56 points. The grading rubric will be posted on Canvas.

Acknowledgments

This assignment is based on a homework assignment by Prof. Diane Litman.