Project
Last revised 2025-08-20.
A major component of this course is a hands-on final project guided by students’ own interests. In this project, students will demonstrate an ability to summarize current approaches and challenges in a subfield of NLP and implement some sort of contribution (however small) to this NLP area of research or practice.
Projects will be done in groups of 2-4 students. Groups will be formed during an in-class project match day, largely based on interest in the same project ideas.
Project idea form
Due 09-11.
Fill out project ideas you might be interested in working on in this form. You can fill out ideas from the example projects listed below or one of your own ideas. For your own ideas, consider what research you’re interested in, what system you’d like to build that processes language in some form, interesting text datasets you’d like to work on, really anything! It is best if your idea has a dataset in mind, but this is not required.
You can fill out as many ideas as you’d like with this form. Ideas do not have to be fully sketched out. Submitting an idea does not mean you will necessarily work on it. These ideas will be presented to all students anonymously. Each student must submit at least one idea for credit on this assignment, even if it’s just chosen from the example projects.
Example projects
Some of these projects are drawn from “shared tasks” where NLP researchers compete for the best performance on certain datasets. Others are based on ideas and projects from prior students and from the instructor.
1. Text classification
- Classify adversarial prompts for LLMs based on attack type, using publicly available red-teaming datasets.
- Given a review of a restaurant, determine what type of restaurant it is, using this Yelp dataset.
- Given a short essay in response to a troubling news article, predict the level of empathy. See WASSA 2024 shared task Track 3.
- Predict emotion labels from tweets across many languages. See WASSA 2024 shared task.
- Given a news article and a list of “entities” (people, organizations, etc.), predict roles such as protagonist, antagonist, and innocent. See SemEval 2025 Task 10, Subtask 1 on entity framing.
- Predict news genre or media “frames” such as morality, economic, or crime and punishment from news articles in multiple languages. See SemEval 2023 Task 3, Subtasks 1 or 2.
- Predict whether text was written by humans or generated by AI. Tasks include predicting for data across languages and for academic essays. See GenAI Content Detection Workshop, Task 1 or 2.
- Classify tweets as sexist or not, or predict the “intent” of sexist tweets as direct, reported, or judgemental. See EXIST 2024 Task 1 or Task 2.
- Predict if similar words are redundant or not with the Semantic Pleonasm corpus developed right here at Pitt.
- From a set of descriptions of characters, develop a classifier to predict which ones will generate the most fanfiction. This could be a lens into online community and media norms.
- Predict “speech acts”, intentions behind utterances, based on emojis with a dataset assembled by former students in the class.
2. Machine translation
- Train translation models for literary text and evaluate on a dataset of Korean-English webnovels.
- Translate customer service chats between languages. See the WMT 2024 Chat Shared Task.
- Translate code-mixed Hinglish to English. See the WMT 2022 Code-mixed Machine Translation Task.
- Create a system to automatically correct (post-edit) machine translations. See the WMT 2022 Automatic Post-Editing Shared Task.
3. Information retrieval and extraction
- Given a query, retrieve the most relevant passages from regulatory documents: https://www.codabench.org/competitions/3527/
- Extract important entities from scientific articles with the SCIRex dataset.
4. Question answering
- Train a system to predict abstract terms related to a passage and answer multiple choice questions. See SemEval 2021 Task 4.
5. Analysis and annotation of datasets
- Visualize similarities in US state legislature bill texts and predict bill passage using data from LegiScan (example repo here).
- Develop an annotation guide and start annotating a new dataset of online gaming voice chat for hateful, abusive, and offensive language.
- Hate speech is culturally specific, yet the majority of NLP work focuses on English in North American and European contexts. A quantitative analysis of different features of datasets annotated for hate speech in multiple languages and from multiple cultural contexts would illuminate global similarities and culturally specific differences.
- Fanfiction, online writing by fans of media works, is known for celebrating queer identity but still may center the experiences of white authors and characters. Use FanfictionNLP to compare representations of characters of color to white characters in fanfiction at scale.
- Quantitative analysis of hateful, white supremacist narratives usually centers on contemporary online discourse. Yet much white supremacist language and many of its narratives have roots that predate online discourse. Compare the narratives, topics, and themes presented in historical and contemporary white supremacist discourse with data provided by the instructor.
- Explore similarities and differences between language in podcasts and Reddit communities based on those podcasts using a dataset assembled by former students in the class.
- Computational analysis of Nakba narratives. See workshop and datasets.
- Examine the framing of different entities in police Facebook posts from the Plain View Project.
- Analyze how different newspapers cover topics differently in English-language editorials from Sri Lankan newspapers. Data is provided by the instructor and a collaborator at Carnegie Mellon University.
6. Survey papers
- Survey how NLP is used and applied in other fields before and after LLMs. What have been our most useful contributions to scholars in the social sciences, physical sciences, or humanities? This survey would assemble papers across disciplines that mention NLP and summarize what is most useful, what is lacking, and what approaches from NLP could be helpful to others.
- Computational social science using NLP generally relies on data from online communities. But this misses offline interactions and the practices of those who are not active online. Survey datasets and approaches that apply quantitative and computational techniques to recordings of offline linguistic interaction.
- A growing area of research in computational social science aims to capture the framing and portrayal of entities across large text corpora (such as in news media). Survey existing approaches and challenges.
7. Other
- Evaluate LLMs for their factuality in summarization of class reflections using a dataset provided by the instructor and Prof. Diane Litman.
- Evaluate the fairness of quality scores automatically assigned to student reflections using a dataset provided by the instructor and Prof. Diane Litman.
- New identity terms are commonly developed in online communities, some of them hateful. Develop methods to find in-group hate jargon and identity terms.
- Build networks of characters and predict relations among characters in fiction using this dataset.
- Stancetaking, a concept from sociolinguistics, is when speakers take an evaluative position toward something, often a nuanced one (e.g. “No, I actually don’t like Taylor Swift’s music that much, but she’s great as a person”). Develop automated methods for identifying the “stance object”, i.e., who or what the speaker is evaluating, likely from Reddit data.
- Automatically summarize movies based on their subtitles from this dataset developed by former students in the class.
Project group match day
In class 09-17.
Students will form groups of 2-4 people around a list of potential projects submitted by the class in the project idea form.
Project peer group feedback
In class 10-15.
In class before the proposal is due, you will be matched with another group that will review your proposal and provide guided feedback.
Project proposal
Due 10-16.
Please submit one proposal per group on Canvas. There is no required length or format for this proposal, but it is recommended to use the ACL format in which the final report will be written. The proposal will contain answers to a series of questions TBD. It will also include a peer review in which you rate your own performance and the performance of other group members through the form here.
Project proposal presentation
In class 10-20.
Groups will make a brief presentation to the class outlining their proposed project, with Q&A and opportunities for feedback from other students. Please plan for a presentation of at most 5 minutes, not including Q&A, which will be held right afterward for each group. Slides will be added to a shared PowerPoint presentation. Presentations are not graded. Cover at least these key points:
- Project motivation
- Briefly, what 1-2 other related papers have done
- What data you are planning to use
- What approach/methods you plan to take
- How you will evaluate your approach
Progress report
Due 11-13.
A brief progress report describing a basic working system. This report should be in the ACL format that the final report will use.
Part 1: Data basic statistics and exploratory analysis
In this part, please provide the following information about your dataset. It’s fine to be working with multiple datasets; just complete this for each one, or for the final dataset you will be using if you are combining datasets. A minimal code sketch for computing these statistics appears after the list.
- The number of rows (datapoints) in the dataset and what each datapoint corresponds to. If you are splitting the dataset into training, test, and possibly dev sets, how many rows are in each?
- The number of columns in the dataset you will be using and what each corresponds to.
- If applicable, the distribution of the target labels you are predicting. For a binary sentiment classification task, for example, how many rows in each set (except the test set) are labeled negative or positive sentiment? This can be in a table or graph format.
- Optionally, any other distribution or data visualization that you think is helpful for understanding your dataset or task.
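If your dataset is tabular (e.g., a CSV file), a few lines of pandas can produce most of these numbers. The sketch below is only an illustration: the file paths and the "text"/"label" column names are hypothetical placeholders for whatever your dataset actually uses.

```python
# Minimal sketch (not required) for computing basic dataset statistics with pandas.
# File paths and the "label" column name are hypothetical -- substitute your own.
import pandas as pd

splits = {
    "train": "train.csv",  # hypothetical paths to your dataset splits
    "dev": "dev.csv",
    "test": "test.csv",
}

for name, path in splits.items():
    df = pd.read_csv(path)
    print(f"{name}: {len(df)} rows, {len(df.columns)} columns")
    print("columns:", list(df.columns))
    if name != "test":  # don't report the test-set label distribution
        print(df["label"].value_counts())  # distribution of target labels
```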
Part 2: Some kind of result
Please provide one (hopefully quantitative) result from your work so far. A good example would be a performance metric from your baseline approach on a dev or test set, but it could also be some other finding you have so far. If you’re not that far yet, you can instead provide an example of working input and output from your system or part of a system, or some sort of plot or other output. You can be up front about challenges you are facing for which you might need help; to get a good grade, I’ll just be looking for some sort of output from a working system or part of a system. If you are unsure what this means for your project, contact the instructor.
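If your project is a classification task and you are unsure where to start, one low-effort way to get a first quantitative result is to score a trivial baseline on your dev set. The sketch below is only an illustration, using scikit-learn and the same hypothetical train.csv/dev.csv files and "text"/"label" column names as the sketch above; adapt or replace it for your actual task and data.

```python
# Minimal sketch of a first quantitative result for a classification project:
# a majority-class baseline and a TF-IDF + logistic regression baseline,
# both evaluated on a dev split. File and column names are hypothetical.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")  # hypothetical paths, as in the sketch above
dev = pd.read_csv("dev.csv")

# Majority-class baseline: always predicts the most frequent training label.
majority = DummyClassifier(strategy="most_frequent")
majority.fit(train["text"], train["label"])

# Simple learned baseline: TF-IDF features + logistic regression.
tfidf_lr = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
tfidf_lr.fit(train["text"], train["label"])

for name, model in [("majority class", majority), ("tf-idf + logreg", tfidf_lr)]:
    preds = model.predict(dev["text"])
    print(f"{name}: accuracy={accuracy_score(dev['label'], preds):.3f}, "
          f"macro F1={f1_score(dev['label'], preds, average='macro'):.3f}")
```

Even a number from a baseline as simple as these gives you something concrete to report and compare against later.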
Part 3: Open questions and challenges
Please describe any open questions or challenges your group has at this point. Will you need any resources other than the ones provided in class (OpenAI API access, CRCD access) or have any other questions? Also describe if the roles for each of your team members have changed since the proposal and if so, what the new roles are.
Final presentation
In class 12-08.
Groups will present their finished work to the class, with Q&A and feedback opportunities from students. Please prepare a presentation of at most 8 minutes. Cover at least these key points:
- Project motivation (briefly)
- Task description, including example input and output
- Data
- Methods
- Results or findings
Final report
Due 12-09.
At the end of the course, groups will provide a written report of their project. This project includes a quantitative comparison between at least two NLP systems on a clearly specified task or tasks. One of these is generally a more traditional NLP approach and the other involves LLMs, though your group’s project may vary if you have discussed this with the instructor.
This report will be in the ACL format found here (Overleaf template here). The report should be a maximum of 8 pages, not including the limitations, ethics, group member task breakdown, and references sections or appendices. Outstanding reports would be of a quality and structure that could be submitted to an NLP workshop or conference, but other types of projects can also achieve an A. There is flexibility in section names, but please provide information about the following aspects of the project:
1. Project motivation
2. Literature review. Please provide full citations in a references section for works cited throughout the paper (not just URLs).
3. Data
4. Methods. Please clearly specify which techniques are novel/your own versus methods taken directly or indirectly from prior work (which is also fine).
5. Results
6. Discussion
7. Future work. This is a good place to describe things you thought about but never had time to complete!
8. Limitations (doesn’t count toward the page limit)
9. Ethical issues (doesn’t count toward the page limit)
10. Group member task breakdown (doesn’t count toward the page limit). This section details the high-level tasks that each group member completed.
11. References (doesn’t count toward the page limit)
12. Appendices (optional, doesn’t count toward the page limit). Additional figures or explanation in one or more appendices is allowed, but they will not necessarily be considered in grading.
How your project will be graded
To get an A, your group’s project should make progress toward an achievable, concrete contribution specified in your project proposal. The project does not necessarily need to be successful in the sense that it outperforms baselines or contributes to our knowledge of a phenomenon. Sometimes ideas don’t work, and that’s okay. But you need to provide evidence of progress toward that contribution. If you are building a dataset, for example, the dataset needs to be built in some form, even if it is not as large or as useful as you had hoped. If you are evaluating a new method for a task, you must have an implementation that tests that method against other baselines, even if it doesn’t perform as well as you had hoped or you didn’t get to evaluate it against all the baselines you wanted to. If you are doing a survey, you must distill a sufficient number of papers into themes that comprehensively describe a research area, even if you don’t end up finding groundbreaking gaps in knowledge that must be addressed. Feel free to take on riskier ideas, but only if you know you’ll have something to show for it at the end. During the planning phase, teaching staff will guide you, through the proposal, toward scoping a project that should fulfill this goal.