Project

Last revised 2026-01-10.

A major component of this course is a hands-on final project guided by students’ own interests. In this project, students will demonstrate an ability to build and evaluate an NLP system that takes in language data and automatically produces some sort of output.

Projects will be done in groups of ~5 students. Groups will be formed during an in-class project match day based on interest in the same project ideas.

Project idea form

Due 02-05.

In this form, you can submit project ideas you might be interested in working on. You can choose from the example projects listed below or propose your own ideas. For your own ideas, consider a computer system you’d like to build that processes language in some form, interesting text datasets you’d like to work with, really anything! It is best if your idea has a dataset in mind, but this is not required.

You can fill out as many ideas as you’d like with this form. Ideas do not have to be fully sketched out. Submitting an idea does not mean you will necessarily work on it. These ideas will be presented to all students anonymously. Each student must submit at least one idea for credit on this assignment, even if it’s just chosen from the example projects.

Example projects

Some of these projects are drawn from “shared tasks” where NLP researchers compete for the best performance on certain datasets. The instructor will provide data for these projects, though it may still require further preprocessing for use.

1. Text classification

  1. Classify adversarial prompts for LLMs based on attack type, using publicly available red-teaming datasets.
  2. Given a review of a restaurant, determine what type of restaurant it is, using this Yelp dataset.
  3. Given a short essay in response to a troubling news article, predict the level of empathy. See WASSA 2024 shared task Track 3.
  4. Predict emotion labels from tweets across many languages. See WASSA 2024 shared task.
  5. Given a news article and a list of “entities” (people, organizations, etc.), predict roles such as protagonist, antagonist, and innocent. See SemEval 2025 Task 10, Subtask 1 on entity framing.
  6. Predict news genre or media “frames” such as morality, economic, or crime and punishment from news articles in multiple languages. See SemEval 2023 Task 3, Subtasks 1 or 2.
  7. Predict whether text was written by humans or generated by AI. Tasks include predicting for data across languages and for academic essays. See GenAI Content Detection Workshop, Task 1 or 2.
  8. Classify tweets as sexist or not, or predict the “intent” of sexist tweets as direct, reported, or judgmental. See EXIST 2024 Task 1 or Task 2.
  9. Predict if similar words are redundant or not with the Semantic Pleonasm corpus developed right here at Pitt.

2. Machine translation

  1. Train translation models for literary text and evaluate on a dataset of Korean-English webnovels developed by a former student in this class.
  2. Translate customer service chats between languages. See the WMT 2024 Chat Shared Task.
  3. Translate code-mixed Hinglish to English. See the WMT 2022 Code-mixed Machine Translation Task.
  4. Create a system to automatically correct (post-edit) machine translations. See the WMT 2022 Automatic Post-Editing Shared Task.

3. Information retrieval and extraction

  1. Given a query, retrieve the most relevant passages from regulatory documents: https://www.codabench.org/competitions/3527/
  2. Extract important entities from scientific articles with the SCIRex dataset.

4. Summarization

  1. Automatically summarize movies based on their subtitles from this dataset developed by former students in the class.

5. Analysis and annotation of datasets

  1. Improve part-of-speech tagging and other linguistic annotation for spontaneous speech in the Archive of Pittsburgh Language and Speech (APLS) with collaborator Prof. Dan Villarreal in the Linguistics Department. An evaluation dataset for parts of speech has already been manually annotated, so this project is ready for evaluating different systems! This work would help linguistics researchers study specific linguistic phenomena in the speech of people here in Pittsburgh.
  2. Visualize similarities in US state legislature bill texts and predict bill passage using data from LegiScan (example repo here).
  3. Develop an annotation guide and start annotating a new dataset of online gaming voice chat for hate speech and abusive or offensive language.
  4. Hate speech is culturally specific, yet the majority of NLP work focuses on English in North American and European contexts. A quantitative analysis of different features of datasets annotated for hate speech in multiple languages and from multiple cultural contexts would illuminate both global similarities and culturally specific differences.
  5. Quantitative analysis of hateful, white supremacist narratives usually centers on contemporary online discourse. Yet much white supremacist language, and many of its narratives, have roots that predate online discourse. Compare the narratives, topics, and themes presented in historic and contemporary white supremacist discourse with data provided by the instructor.
  6. Explore similarities and differences between language in podcasts and Reddit communities based on those podcasts using a dataset assembled by former Pitt students.
  7. Computational analysis of Palestinian Nakba narratives. See workshop and datasets.
  8. Examine the framing of different entities in police Facebook posts from the Plain View Project.
  9. Analyze how different newspapers cover topics differently in English-language editorials from Sri Lankan newspapers. Data is provided by the instructor and a collaborator at Carnegie Mellon University.

Project group match day

In class 02-11.
Students will form groups of ~5 people around a list of submitted project ideas.

Project proposal

Due 02-26.
Please submit one per group on Canvas. The proposal is a report with answers to the questions below; there is no required length or format. It will also include a peer review, where you will rate your own performance and the performance of other group members through a separate form.

  1. Task: What is the problem or task you are focusing on?
  2. Input and output: What is the format of the input and output of this task? For example, each input could be a sentence of text and the output could be a label from a discrete set of possible labels. Provide at least one example of input and output from your data (ideally actual input and output, but it’s fine if they are made up).
  3. Data: What data are you using?
    1. How many rows (datapoints) are in the dataset and what does each datapoint correspond to?
    2. How many columns are in the dataset and what does each correspond to?
    3. Provide a very small subset of the data in a table.
    4. Please explain where the dataset came from and how it was constructed, if known.
    5. Provide links to any URLs if the data is hosted online or links to papers if the dataset is published somewhere.
    6. If the data has annotated labels or “gold” text that you are predicting or generating, where do those labels come from?
  4. Methods: What approach are you taking to building an NLP system to handle this task? What models will you use and, if appropriate, what methods for extracting features from text? What software packages are you planning to use to build this system? In most cases, the approach should draw on statistical approaches we’ve covered in class so far, such as n-gram representations of text (a minimal sketch of one such baseline appears after this list). Talk to the instructor if you are not sure about this.
  5. Evaluation: How are you evaluating your approach? What performance metrics are you going to use?
  6. Ethics: What kinds of ethical issues may be raised by your model or data?
  7. Steps: What steps are needed to complete your proposed part of the project? This should be in some detail, for example, loading and potentially cleaning the data, training models, trying different parameters, evaluating models, etc.
  8. Roles: What are the roles and tasks of each person in the group? Though group members will contribute in various capacities, it is best if each person is responsible for at least one aspect of the project.
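For illustration only, here is a minimal sketch of the kind of n-gram baseline and evaluation that the Methods and Evaluation questions have in mind, assuming a text classification task, pandas, and scikit-learn. The file name and column names ("my_dataset.csv", "text", "label") are placeholders for whatever your dataset uses; this is one possibility, not a required approach.

    # Minimal sketch of an n-gram classification baseline (placeholder names throughout).
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("my_dataset.csv")  # placeholder path
    train_df, dev_df = train_test_split(df, test_size=0.2, random_state=0)

    # Unigram + bigram counts as features
    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
    X_train = vectorizer.fit_transform(train_df["text"])
    X_dev = vectorizer.transform(dev_df["text"])

    # Simple linear classifier over the n-gram features
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_df["label"])

    # Evaluation: accuracy, precision, recall, and F1 per class on the dev set
    print(classification_report(dev_df["label"], clf.predict(X_dev)))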

Project proposal presentation

In class 03-04.
Groups will make a brief presentation to the class outlining their proposed project, with Q&A and opportunities for feedback from other students. Please plan for a presentation of at most 5 minutes, not including Q&A, which will be held immediately afterward for each group. A shared PowerPoint presentation will be provided for you to add your slides to. Presentations are not graded. Cover at least these key points:

  1. Project motivation
  2. What data you are planning to use
  3. What approach/methods you plan to take
  4. How you will evaluate your approach

Progress report

Due 03-26.

The progress report will contain a substantive update on your group’s progress using traditional (usually n-gram based) approaches on your task, as well as a description of how you will use LLMs for your task. Please provide a specification of your problem/task and input and output. You do not have to repeat information from the project proposal except for that basic description of the project. Here are the details:

Part 1: Basic data analysis

In this part, please provide the following information about your dataset. It’s fine to be working with multiple datasets; just complete this for each one, or for the final combined dataset if you are merging datasets.

  1. If it has been updated from the proposal, provide the number of rows (datapoints) and columns in the dataset and what each datapoint and column corresponds to. If you are splitting the dataset into training, test, and possibly dev sets, how many rows are in each?
  2. If applicable, the distribution of the target labels you are predicting. For example, for a binary sentiment classification task, how many rows in each set (except the test set) are labeled with negative or positive sentiment? This can be in a table or graph format (see the sketch after this list for one way to compute it).
  3. Optionally, any other distribution or data visualization that you think is helpful for understanding your dataset or task.
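As one way to produce the label distribution asked for above, a couple of lines of pandas will do; the file and column names here are placeholders for your own splits and fields.

    import pandas as pd

    # Placeholder file and column names; adapt to your dataset and splits.
    train_df = pd.read_csv("train.csv")
    print(train_df["label"].value_counts())                # counts per label
    print(train_df["label"].value_counts(normalize=True))  # proportions per label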

Part 2: A result from baseline (traditional) approach

In your proposal, you described an initial baseline approach to your task, which for most groups was using n-gram features in some way. Please provide one (hopefully quantitative) result from your work so far in this direction. Ideally this would be a performance metric from your baseline approach on a dev or test set. But if you’re not that far yet, you can also provide an example of working input and output from your system or part of a system, or some sort of plot or other output. You can be up front about challenges you are facing for which you might need help; to get a good grade, I’ll just be looking for some sort of output from a working system or part of a system. If you are unsure what this means for your project, contact the instructor.

Part 3: LLM proposal

In the project, you will be comparing your baseline system’s performance to that of an LLM. Please describe how you might use an LLM programmatically to attempt your task. The simplest way to do this would be a “zero-shot” setting where you simply ask the LLM to do the task, but even that requires setting up and passing your data to the LLM and evaluating the results (a minimal sketch follows below). Please describe what you plan to do and which LLM you plan on using. You can also propose more advanced approaches such as in-context learning (few-shot prompting), chain-of-thought prompting, prompt optimization, or fine-tuning. Not every group has to use an LLM here if you have already discussed an alternative with the instructor; in that case, please describe the rest of the approach you will take to complete the project.
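To make the zero-shot setting concrete, here is a minimal sketch of calling an LLM programmatically with the OpenAI Python client. The model name, prompt wording, label set, and example input are placeholders; adapt them to your own task, or to a different LLM entirely if your group is using one.

    # Minimal zero-shot sketch using the OpenAI Python client (placeholder prompt and model).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify_zero_shot(text: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system",
                 "content": "Classify the sentiment of the review as 'positive' or 'negative'. Reply with one word."},
                {"role": "user", "content": text},
            ],
            temperature=0,
        )
        return response.choices[0].message.content.strip().lower()

    print(classify_zero_shot("The food was cold and the service was slow."))

Whatever LLM and prompt you settle on, the predictions still need to be collected over your whole dev or test set and scored with the same metrics as your baseline so the comparison is fair.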

Part 4: Open questions and challenges

Please describe any open questions or challenges your group has at this point. Will you need any resources other than the ones provided in class (OpenAI API access, CRCD access) or have any other questions? Also describe if the roles for each of your team members have changed since the proposal and if so, what the new roles are.

Deliverable

Assemble your results and writing for each part in a document to submit as a PDF on Canvas. There is no required format for this document other than being in PDF format.

Final report

Due 04-28.
At the end of the course, groups will provide a written report of their project. This project includes a quantitative comparison between at least two NLP systems on a clearly specified task or tasks. One of these is generally a more traditional NLP approach and the other involves LLMs, though your group’s project may vary if you have discussed this with the instructor.

This report will be in the ACL format found here. Feel free to use the Word or LaTeX templates (the LaTeX template is also available on Overleaf). The report should be a maximum of 8 pages, not including the limitations, ethics, group member task breakdown, and references sections or appendices. Feel free to include content from the project proposal and progress report. There is flexibility in section names, but please provide information about the following aspects of the project:

  1. Abstract: a brief overview of your entire project, including what approaches you took on which datasets and any findings.
  2. Introduction: should include motivation for the project and more detail on the approaches you take and your final results.
  3. Data: should include the final number of datapoints and columns in the data (can be copied from the proposal).
  4. Methods: Please clearly specify which techniques are novel/your own versus methods drawn directly or indirectly from prior work (which is also fine). For LLM approaches, provide the exact prompt template used (in an appendix if needed for space), how examples were selected for few-shot prompting if used, as well as exact LLM model names.
  5. Results: Include examples of input and output (predicted output from your system as well as the correct “gold” output, if applicable).
  6. Discussion: Please discuss the significance of the results that you see and any other comments about what these results indicate to you. Provide an analysis of common errors from different systems.
  7. Future work: This is a good place to describe things you thought about but never had time to complete!
  8. Limitations (doesn’t count toward page limit)
  9. Ethical issues (doesn’t count toward page limit)
  10. Group member task breakdown (doesn’t count toward page limit). This section details the high-level tasks that each group member completed.
  11. References (doesn’t count toward page limit). If you are able to, please fill in full references instead of just URLs. The references can use any format.
  12. Appendices (optional, doesn’t count toward page limit). Additional figures or explanations in one or more appendices are allowed, but they will not necessarily be considered in grading.

A rubric used in grading will be provided.

Final presentation

In class TBD.
Groups will present their finished work to the class, with Q&A and feedback opportunities from other students. Please prepare a presentation of at most 7 minutes. A shared PowerPoint presentation will be provided to add your group’s slides to. Cover at least these key points:

  1. Project motivation (briefly)
  2. Task description, including example input and output
  3. Data
  4. Methods, including your baseline system and your contemporary LLM-based approach (or whatever approaches you took)
  5. Results or findings from your baseline system and your contemporary LLM-based approach