CS 2731 Introduction to Natural Language Processing

School of Computing and Information, University of Pittsburgh
Fall 2024

Peter Miller, Communication, 1940s
Time MW 2:30-3:45pm
Location Sennott Square 5313
Instructor Michael Miller Yoder, PhD
Please call me “Michael”
Instructor contact mmyoder@pitt.edu or through Canvas messages
Instructor office hours By appointment in person at IS 604B or on Zoom
Book an appointment
TA Dhanush Binu
TA contact dhb51@pitt.edu
TA office hours By appointment
Textbook (free online) [J+M] Jurafsky and Martin, Speech and Language Processing, 3e draft, 2024-08-20

Schedule

Subject to change. Last revised 2024-09-20. All due dates are at 11:59pm ET except when indicated.

Session Date Topic Readings Assignments
Module 1: Introduction and text processing
1 08-26 M Course, NLP intro
2 08-28 W Text normalization J+M 2-2.3, 2.5-2.7
09-02 M Labor Day. No class.
Module 2: Text classification and representation learning
3 09-04 W Bag-of-words, tf-idf, PPMI J+M 6.3-6.7 Reading quiz due;
HW1 out 09-05
4 09-09 M Logistic regression part 1 J+M 5-5.2 Reading quiz due
5 09-11 W Logistic regression part 2 J+M 5.3-5.9, 5.11 Reading quiz due;
Project idea submission form out
6 09-16 M Classifier evaluation, CRC intro J+M 4 (intro), 4.7-4.10,
Bender & Friedman 2018
Mitchell et al. 2019
Reading quiz due
7 09-18 W Vector semantics, word2vec J+M 6-6.2, 6.8-6.13,
Blodgett et al. 2020
HW1 due 09-19;
HW2 out 09-19;
Project idea submission form due 09-20
8 09-23 M Feedforward neural networks J+M 7-7.1, 7.3-7.5, 7.8 Discussion post due 1pm;
Project idea ranking form out
Module 3: Language models and conditional language models
9 09-25 W N-gram language models part 1 J+M 3-3.3 Reading quiz due;
Project idea ranking form due 09-26
10 09-30 M N-gram language models part 2, RNNs part 1 J+M 3.4-3.6, 3.9, 8-8.2 Reading quiz due
11 10-02 W RNNs part 2, encoder-decoder J+M 8.3, 8.6-8.9 Reading quiz due;
HW2 due 10-03
12 10-07 M Transformers J+M 9, 10-10.1 Reading quiz due;
HW3 out 10-08
13 10-09 W LLMs, BERT and GPT J+M 10.2-10.3, 10.5.3-10.7, 11-11.4 Reading quiz due
10-14 M Fall Break. No class.
14 10-16 W Project proposal presentations Project proposal and literature review due 10-18
15 10-21 M Probabilistic Commonsense Knowledge Evaluation (guest lecture, Lorraine Li) Optional:
Cheng et al. 2024
Zhao et al. 2024
16 10-23 W LLM discussion and lab day J+M 12
Yiu et al. 2023
Discussion post due 1pm;
Bring a laptop to class;
HW4 out 10-24
Module 4: Sequence labeling and parsing
17 10-28 M POS tagging, NER, HMMs part 1 J+M 17-17.4.4 Reading quiz due;
HW3 due
18 10-30 W HMMs part 2, Viterbi alg, neural sequence labeling J+M 17.4.5-17.4.6, 8.3.1, 11.5 Reading quiz due
19 11-04 M Dependency parsing J+M 19-19.2, 19.4-19.5 Reading quiz due
Module 5: Application areas
20 11-06 W Machine translation part 1 J+M 13-13.3,
Bender 2019
Discussion post due 1pm;
HW4 due 11-07
21 11-11 M Machine translation part 2 J+M 13.4-13.8
22 11-13 W Speech technologies, ASR, TTS J+M 16-16.3, 16.5-16.8 Project progress report due 11-14;
Project peer review due 11-14
23 11-18 M Dialogue systems J+M 15-15.3
24 11-20 W Chatbots J+M 15.4-15.6
Thanksgiving Break 11-24 to 12-01
25 12-02 M Information retrieval, RAG J+M 14-14.3.1, 14.5
26 12-04 W Project work time
27 12-11 W Final project presentations Final projects due 12-12

Assessments

Description Points Percentage of final grade
Homework assignments total 224 44.8
 Each homework of 4 total 56 11.2
Final project total 203 40.6
 Project idea submission response 5 1.0
 Project idea ranking response 5 1.0
 Proposal and literature review 40 8.0
 Peer review 2 0.4
 Progress report 30 6.0
 Final report 121 24.2
Reading quizzes total 33 6.6
 Each reading quiz of 13 total, 2 lowest scores dropped 3 0.6
Discussion posts total 15 3.0
 Each discussion post of 3 total required 5 1.0
Participation total 25 5.0
 Attendance 15 3.0
 Engagement 10 2.0
Grand total 500 100
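
The point totals above can be checked with a short calculation. The following is a minimal sketch, not an official grade calculator: the function names are hypothetical, and it assumes that the two lowest of the 13 reading quiz scores are dropped and that a missed quiz is entered as 0.

```python
def reading_quiz_points(scores, n_dropped=2):
    """Sum reading quiz scores after dropping the n_dropped lowest ones."""
    kept = sorted(scores)[n_dropped:]
    return sum(kept)


def course_percentage(homework, project, quiz_scores, discussion,
                      participation, total_points=500):
    """Convert earned points in each category into a percentage of 500."""
    earned = (homework + project + reading_quiz_points(quiz_scores)
              + discussion + participation)
    return 100 * earned / total_points


# Example (assumed scores): full marks everywhere except two missed quizzes.
quizzes = [3] * 11 + [0, 0]
print(course_percentage(224, 203, quizzes, 15, 25))
```

In this example the two missed quizzes are the ones dropped, so the quiz total is still 33 points.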

Participation grade

In-class, collaborative activities are better learning experiences when students come to class and participate. To encourage participation, there is a participation grade (5% of the total course grade). The majority of that grade comes from attendance, which will be taken via TopHat on randomly selected class sessions. The rest of the grade will be assigned based on whether a student asked questions in class or otherwise (such as during office hours), or participated in in-class activities. If you did any of this basic engagement, full credit will be awarded.

Course description

Computer programs that automatically process human language, such as chatbots, translation systems, and speech recognition systems, have become a part of everyday life. This course provides an introduction to the artificial intelligence research field that brought about these systems: natural language processing (NLP). Students will become familiar with foundational tasks in NLP such as language modeling, text classification, and sequence modeling. The course will cover both classic and contemporary approaches to these tasks, as well as how they are applied in language technologies. Topics of ethics, fairness, and bias in AI are incorporated throughout the course.

Learning objectives

The overarching learning objective of this course is for students to be able to structure an NLP system to get a desired outcome from language data that may be required in a future job or research problem. This ability requires the development of many constituent skills. At the end of the course, students will be able to:

  • Relate a new problem to the most relevant existing NLP tasks, such as text classification, text generation, sequence modeling, language modeling, information retrieval, machine translation, dialogue systems, etc.
  • Choose relevant baseline machine learning approaches to try on a new task
  • Explain the basics of language structure that are relevant to NLP. These include syntax and semantics from linguistics
  • Preprocess text data into a machine-readable format
  • Define and scope an objective in terms of a machine learning or NLP system. This includes determining if human annotation is needed and if machine learning is needed.
  • Extract features from text that are required for running machine learning models
  • Choose suitable ML algorithms for a new NLP task
  • Evaluate machine learning algorithms, choices of training data and other NLP system decisions
  • Identify potential ethical pitfalls (such as imbalanced training data, model amplification of biases) in an NLP system and ways to address them
  • Communicate motivation, key components, and implications of an approach to NLP tasks in writing

Prerequisites

  • CS 1501: Algorithms, or the consent of the instructor
  • Basic Python knowledge

Learning resources

Textbook: Dan Jurafsky and James H. Martin, Speech and Language Processing, 3rd edition draft, 2024-02-03. Available completely free online: https://web.stanford.edu/~jurafsky/slp3/

Software and programming languages: Python and associated data science libraries (pandas, numpy, scipy) are the preferred software for completing coding portions of homework assignments. Basic knowledge of Python is a prerequisite of the course, as some of the homework assignments require Python. Students wishing to use non-Python tools for homeworks should ask the instructor first. Final projects may be completed with any programming language or tools.

Tutorials on Python and data science:

Course infrastructure and communication

The most recent syllabus, including a schedule, is posted here on the course website. This syllabus will contain links to homework and final project descriptions. Homeworks and the final project should be submitted through Canvas. Quizzes and discussion boards (including prompts) will be on Canvas. Course announcements will be given on Canvas, and questions should be submitted through Canvas (or over email to the instructor or TA).

Feel free to email or send a Canvas message to the instructor or TA about any concerns or questions at any time. Teaching staff will respond during hours that work best for them; please feel no obligation to respond to them outside of your regular working hours.

Policies

Grading scale

Range Letter grade
93.0 – 100% A
90.0 – <93.0% A-
86.7 – <90.0% B+
83.3 – <86.7% B
80.0 – <83.3% B-
76.7 – <80.0% C+
73.3 – <76.7% C
70.0 – <73.3% C-
66.7 – <70.0% D+
63.3 – <66.7% D
60.0 – <63.3% D-
< 60% F

The instructor reserves the right to change the grading scale depending on class performance, but only in the direction of raising grades for students. Feel free to stop by the instructor’s office hours or make an additional appointment anytime to talk about any issues you might have with your grade.
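
The grading scale above can be read as a simple lookup from percentage to letter grade. Here is a minimal sketch under stated assumptions: the names `GRADE_CUTOFFS` and `letter_grade` are hypothetical, percentages are compared without rounding (the syllabus does not say whether rounding is applied), and any later adjustment of the scale by the instructor is not modeled.

```python
# Lower bounds taken from the grading scale table, highest first.
GRADE_CUTOFFS = [
    (93.0, "A"), (90.0, "A-"),
    (86.7, "B+"), (83.3, "B"), (80.0, "B-"),
    (76.7, "C+"), (73.3, "C"), (70.0, "C-"),
    (66.7, "D+"), (63.3, "D"), (60.0, "D-"),
]


def letter_grade(percentage):
    """Return the letter grade whose range contains the given percentage."""
    for cutoff, letter in GRADE_CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"  # anything below 60%


print(letter_grade(88.2))
```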

Late work policy

Students are granted 5 total late days across all homework assignments without penalty. After those five late days, you will be penalized 20% for each day that your submission is late except in extreme unforeseen circumstances. Group project work will be penalized 20% for each day late. No late work will be accepted for the final project report. Late days cannot be used for reading quizzes, as no late work is accepted for reading quizzes.
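
To make the late-day arithmetic concrete, here is a minimal sketch for homework assignments only. It rests on assumptions the policy does not spell out: free late days are used up in the order homeworks are submitted, the 20% penalty is applied per penalized day as a fraction of the assignment grade, and the penalty is capped at 100%. The function name `late_penalties` is hypothetical.

```python
def late_penalties(days_late_per_hw, free_days=5, penalty_per_day=0.20):
    """Return the penalty fraction for each homework, in submission order.

    days_late_per_hw: number of days late for each homework.
    free_days: total penalty-free late days shared across all homeworks.
    """
    remaining = free_days
    penalties = []
    for days_late in days_late_per_hw:
        used = min(days_late, remaining)   # free late days spent on this homework
        remaining -= used
        penalized_days = days_late - used  # days beyond the free allowance
        penalties.append(min(1.0, penalty_per_day * penalized_days))
    return penalties


# Example (assumed lateness): HW1 two days late, HW3 four days late, HW4 one day late.
print(late_penalties([2, 0, 4, 1]))
```

In this example HW1 and part of HW3 use up the five free days, so HW3 and HW4 each have one penalized day.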

Assignment resubmission policy

If you are unsatisfied with your grade on an assignment and wish to resubmit work, talk with the instructor. Resubmissions are handled case by case, but are generally accepted in cases where parts of the assignment are missing (sections of the rubric are 0). Updated or added text in resubmitted reports must be highlighted in yellow. Resubmissions are subject to an automatic 10% deduction. Only 1 resubmission per homework assignment will be accepted.

Academic integrity policy

Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Academic Integrity Modules.

Generative AI policy

You are allowed to use generative AI programs (ChatGPT, DALL-E, etc.) as a student in this course in limited circumstances. Since much of this course is about developing such tools in NLP, using currently available tools can expose you to the current capabilities and limitations of such systems.

However, your ethical responsibilities as a student remain the same. You must follow the University of Pittsburgh’s Policy on Academic Integrity. Here are some principles to keep in mind that can help you determine whether or not a specific use of generative AI is acceptable in this course (for all forms of generation: writing, code, images or other forms). Please ask the instructor if you are not sure about a specific use. You will not be blamed or retaliated against for asking.

  • Use as an aid, not for a finished product. LLMs could be used in this course to generate ideas, draft bibliographies, study guides, etc. Use for drafting entire homework or project reports is not acceptable, even if students revise this draft, since being able to communicate NLP procedures and research is a learning objective. Also keep in mind that language models have no notion of reality and will hallucinate facts and citations.

  • Cite its use. The University of Pittsburgh’s academic integrity policy applies to all uncited or improperly cited use of content, whether that work is created by human beings alone or in collaboration with a generative AI. If you use a generative AI tool to develop content for an assignment, you are required to cite the tool’s contribution to your work. In practice, cutting and pasting content from any source without citation is plagiarism. Likewise, paraphrasing content from a generative AI without citation is plagiarism. Similarly, using any generative AI tool without appropriate acknowledgement will be treated as plagiarism. See the APA guidelines on how to cite ChatGPT. Publicly available LLMs are very new, and so best practices in education are still being worked out. Citing your use of LLMs will also inform the instructor on how such tools are being used in education for developing better future policies.

  • You are responsible for the work you turn in. As we will discuss in this course, LLMs and other generative AI systems can and do generate biased, socially problematic language and assert unfounded claims. Ultimately the text you submit will be treated as reflecting your own work, and you are responsible for it.

Adapted from faculty in the Carnegie Mellon University Heinz College of Information Systems and Public Policy, with guidance from the Carnegie Mellon University Eberly Center for Teaching Excellence.

Disability rights

The teaching staff of this course view disabilities as deficits not in disabled people but in the institutions and societies that are structured to disadvantage disabled people. If you have a disability (visible or invisible), please let us know as soon as possible (you don’t need to tell us the nature of the disability). You are encouraged to work with Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will work with you to determine reasonable accommodations for this course. This might include lecture materials that are usable by people with visual disabilities, sign language interpretation, captioning, flexible due dates, etc.

Adapted from policies by David Mortensen and Lori Levin at Carnegie Mellon University.

Religious Observances

The observance of religious holidays (activities observed by a religious group of which a student is a member) and cultural practices are an important reflection of diversity. As your instructor, I am committed to providing equivalent educational opportunities to students of all belief systems. At the beginning of the semester, you should review the course requirements to identify foreseeable conflicts with assignments, exams, or other required attendance. Please contact me as early as possible to allow time for us to discuss and make fair and reasonable adjustments to the schedule and/or tasks.

Statement on scholarly discourse

In this course we will be discussing some complex issues on which all of us have strong feelings and, in many cases, unfounded attitudes. It is essential that we approach this endeavor with our minds open to evidence that may conflict with our presuppositions. Moreover, it is vital that we treat each other’s opinions and comments with courtesy even when they diverge and conflict with our own. We must avoid personal attacks and the use of ad hominem arguments to invalidate each other’s positions. Instead, we must develop a culture of civil argumentation, wherein all positions have the right to be defended and argued against in intellectually reasoned ways. It is this standard that everyone must accept in order to stay in this class; a standard that applies to all inquiry in the university, but whose observance is especially important in a course whose subject matter is so emotionally charged.

Adapted from a California State University course: Race, Racism and Critical Thinking.

Student wellness

College/Graduate school can be an exciting and challenging time for students. Taking time to maintain your well-being and seek appropriate support can help you achieve your goals and lead a fulfilling life. It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. You are encouraged to visit Thrive@Pitt to learn more about well-being and the many campus resources available to help you thrive.

If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources.

The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician. If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226.

Equity and inclusion

The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices, visit the Civil Rights & Title IX Compliance web page.

I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator by calling 412-648-7860 or emailing titleixcoordinator@pitt.edu. Reports can also be filed online. You may also choose to report this to a faculty/staff member; they are required to communicate this to the University’s Office of Diversity and Inclusion. If you wish to maintain complete confidentiality, you may also contact the University Counseling Center (412-648-7930).