CS 2731 / ISSP 2230 Introduction to Natural Language Processing
University of Pittsburgh, Spring 2024
Time | MW 3:00-4:15pm |
Location | Sennott Square 6110 |
Instructor | Michael Miller Yoder, PhD. Please call me "Michael" |
Instructor contact | mmy29@pitt.edu |
Instructor office hours | W 1-2pm and by appointment, Sennott Square 6505 |
TA | Bhiman Kumar Baghel |
TA office hours | M 9-11am online (see Zoom link on Canvas under 'Syllabus'), and by appointment |
Textbook (free online) | [J+M] Jurafsky and Martin, Speech and Language Processing, 3e draft, 2023-01-07; [J+M2024] Jurafsky and Martin, Speech and Language Processing, 3e draft, 2024-02-03 |
Schedule
Subject to change. Last revised 2024-03-28. All due dates are at 11:59pm ET except when indicated.
Session | Date | Topic | Readings | Assignments |
---|---|---|---|---|
Module 1: Introduction and text processing | ||||
1 | 01-08 M | Course, NLP intro | Project survey out | |
2 | 01-10 W | Text normalization | J+M 2-2.4, 2.6 | |
01-15 M | MLK Day. No class. | |||
Module 2: Text classification and representation learning | ||||
3 | 01-17 W | Bag-of-words, tf-idf, PPMI | J+M 6.3-6.7 | Reading quiz due 1pm; HW1 out; Project survey due 01-18 |
4 | 01-22 M | Naive Bayes | J+M 4-4.5 | Reading quiz due 1pm; Project teams matched |
5 | 01-24 W | Classifier evaluation | J+M 4.7-4.10, Bender & Friedman 2018 (data statements), Mitchell et al. 2019 (model cards) | Discussion post due 1pm; Project pre-proposal form out |
6 | 01-29 M | Logistic regression part 1 | J+M 5-5.3 | Reading quiz due 1pm; HW2 out 01-30 |
7 | 01-31 W | Logistic regression part 2 | J+M 5.4-5.9, 5.11 | Reading quiz due 1pm; HW1 due 02-01 |
8 | 02-05 M | Vector semantics, static word embeddings | J+M 6-6.2, 6.8-6.13, Blodgett et al. 2020 | Discussion post due 1pm; Project pre-proposal form due |
9 | 02-07 W | Feedforward neural networks | J+M 7-7.1, 7.3-7.4, 7.6, 7.8 | Reading quiz due 1pm |
Module 3: Language models and conditional language models | ||||
10 | 02-12 M | N-gram language models part 1 | J+M 3-3.2 | Reading quiz due 1pm |
11 | 02-14 W | N-gram language models part 2, RNNs part 1 | J+M 3.3-3.6, 3.9 | Reading quiz due 1pm; HW2 due 02-15 |
12 | 02-19 M | RNNs part 2, encoder-decoder | J+M 9-9.2, 9.6-9.9 | Reading quiz due 1pm; HW3 out 02-20 |
13 | 02-21 W | Transformers part 1, beam search | J+M 10-10.2, 10.4 | Reading quiz due 1pm; Project proposal and literature review due 02-22 |
14 | 02-26 M | Transformers part 2, pretraining, BERT and GPT | J+M 10.7, 11-11.3.2, Yiu et al. 2023 | Discussion post due 1pm |
15 | 02-28 W | BERT/LLMs discussion and lab day | Bring a laptop to class | |
16 | 03-04 M | Project proposal presentations | HW4 out 03-05 | |
17 | 03-06 W | Project work time | Bring a laptop to class; HW3 due 03-10 | |
Spring Break 03-10 to 03-17 | ||||
Module 4: Sequence labeling | ||||
18 | 03-18 M | POS tagging, NER, HMMs part 1 | J+M 8-8.4.4 | Reading quiz due 11:59pm |
19 | 03-20 W | HMMs part 2, Viterbi alg, neural sequence labeling | J+M 8.4.5-8.4.6, 9.3.1, 11.3.3-11.3.4 | Reading quiz due 11:59pm |
Module 5: Parsing | ||||
20 | 03-25 M | Constituency parsing, CFGs | J+M 17-17.3, 17.8.1 | Reading quiz due 11:59pm; HW4 due |
21 | 03-27 W | Dependency parsing | J+M 18-18.2, 18.4-18.5 | Reading quiz due 11:59pm; Project peer review due |
Module 6: Application areas | ||||
22 | 04-01 M | Machine translation part 1 | J+M 13-13.2, Bender 2019 | |
23 | 04-03 W | Machine translation part 2 | J+M 13.3-13.7; J+M2024 13.3, 13.5-13.8 | Project basic working systems due 04-04 |
24 | 04-08 M | Speech technologies, ASR, TTS | J+M 16-16.3, 16.5-16.8; J+M2024 16-16.3, 16.5-16.8 | |
25 | 04-10 W | Dialogue, chatbots part 1 | J+M 15-15.2; J+M2024 15-15.1, 15.4 | |
26 | 04-15 M | Dialogue, chatbots part 2 | J+M 15.3-15.7; J+M2024 15.2-15.3, 15.5-15.6 | |
27 | 04-17 W | Computational social science, digital humanities | ||
28 | 04-24 W | Final project presentations | Final projects due 04-25 |
Assessments
Description | Points | Percentage of final grade |
---|---|---|
Final project total | 222 | 44.4 |
Survey response | 5 | 1 |
Project pre-proposal form | 10 | 2 |
Proposal and literature review | 35 | 7 |
Peer review | 2 | 0.4 |
Basic working system report | 30 | 6 |
Final report | 140 | 28 |
Homework assignments total | 224 | 44.8 |
Each homework of 4 total | 56 | 11.2 |
Reading quizzes total | 33 | 6.6 |
Each reading quiz of 13 total, 2 lowest scores dropped | 3 | 0.6 |
Discussion posts total | 21 | 4.2 |
Each discussion post of 3 total required | 7 | 1.4 |
Grand total | 500 | 100 |
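As a rough illustration of how the point totals above combine, the following Python sketch sums the four components and drops the two lowest reading quiz scores. The function names and example scores are hypothetical, and this is not the official grade calculation.

```python
# Illustrative only: function names and example scores are hypothetical.
TOTAL_POINTS = 500  # grand total from the assessments table


def quiz_points(quiz_scores, n_dropped=2):
    """Sum reading quiz scores after dropping the n_dropped lowest."""
    return sum(sorted(quiz_scores)[n_dropped:])


def final_percentage(project, homeworks, quiz_scores, discussion_posts):
    """Percentage of the 500 total points earned across all components."""
    earned = (
        project
        + sum(homeworks)
        + quiz_points(quiz_scores)
        + sum(discussion_posts)
    )
    return 100 * earned / TOTAL_POINTS


# A student with full marks everywhere except two missed quizzes,
# which are the two scores that get dropped.
quizzes = [3] * 11 + [0, 0]
print(final_percentage(222, [56] * 4, quizzes, [7] * 3))  # 100.0
```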
Course description
Computer programs that automatically process human language, such as chatbots, translation systems, and speech recognition systems, have become a part of everyday life. This course provides an introduction to the artificial intelligence research field that brought about these systems: natural language processing (NLP). Students will become familiar with foundational tasks in NLP such as language modeling, text classification, and sequence modeling. The course will cover both classic and contemporary approaches to these tasks, as well as how they are applied in language technologies. Topics of ethics, fairness, and bias in AI are incorporated throughout the course.
Learning objectives
The overarching learning objective of this course is for students to be able to structure an NLP system that gets a desired outcome from language data, as may be required in a future job or research problem. This ability requires the development of many constituent skills. At the end of the course, students will be able to:
- Relate a new problem to the most relevant existing NLP tasks, such as text classification, text generation, sequence modeling, language modeling, information retrieval, machine translation, dialogue systems, etc.
- Choose relevant baseline machine learning approaches to try on a new task
- Explain the basics of language structure that are relevant to NLP. These include syntax and semantics from linguistics
- Preprocess text data into a machine-readable format
- Define and scope an objective in terms of a machine learning or NLP system. This includes determining if human annotation is needed and if machine learning is needed.
- Extract features from text that are required for running machine learning models
- Choose suitable ML algorithms for a new NLP task
- Evaluate machine learning algorithms, choices of training data and other NLP system decisions
- Identify potential ethical pitfalls (such as imbalanced training data, model amplification of biases) in an NLP system and ways to address them
- Communicate key components of an approach to NLP tasks in writing
Learning resources
Textbook: Dan Jurafsky and James H. Martin, Speech and Language Processing, 3rd edition draft, 2023-01-07 or 2024-02-03. Available completely free online: https://web.stanford.edu/~jurafsky/slp3/
Software and programming languages: Python and associated data science libraries (pandas, numpy, scipy) are the preferred software for completing coding portions of homework assignments. Students wishing to use non-Python tools for homeworks should ask the instructor first. Final projects may be completed with any programming language or tools.
Tutorials on Python and data science:
- Official Python tutorial
- Sebastian Raschka’s notebook on intro to scientific computing
- Python Data Science Handbook
- David Bamman’s computational social science training program materials
Course infrastructure
The most recent syllabus, including a schedule, will be posted on the course website. This syllabus will contain links to homework and final project descriptions. Homeworks and the final project should be submitted through Canvas. Quizzes and discussion boards (including prompts) will be on Canvas. Course announcements will be given on Canvas, and questions should be submitted through Canvas (or over email to the instructor or TA).
Policies
Grading scale
Range | Letter grade |
---|---|
93.0 – 100% | A |
90.0 – <93.0% | A- |
86.7 – <90.0% | B+ |
83.3 – <86.7% | B |
80.0 – <83.3% | B- |
76.7 – <80.0% | C+ |
73.3 – <76.7% | C |
70.0 – <73.3% | C- |
66.7 – <70.0% | D+ |
63.3 – <66.7% | D |
60.0 – <63.3% | D- |
< 60% | F |
The instructor reserves the right to change the grading scale depending on class performance, but only in the direction of raising grades for students. Feel free to stop by the instructor’s office hours or make an additional appointment anytime to talk about any issues you might have with your grade.
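The grading scale amounts to a threshold lookup, sketched below in Python. The sketch assumes percentages are compared without rounding, which the syllabus does not specify, and it ignores any upward adjustment the instructor may make. `letter_grade` is a hypothetical name.

```python
# Lower bounds from the grading scale table, highest first.
GRADE_SCALE = [
    (93.0, "A"), (90.0, "A-"),
    (86.7, "B+"), (83.3, "B"), (80.0, "B-"),
    (76.7, "C+"), (73.3, "C"), (70.0, "C-"),
    (66.7, "D+"), (63.3, "D"), (60.0, "D-"),
]


def letter_grade(percentage):
    """Map a final percentage to a letter grade (assumption: no rounding)."""
    for lower_bound, letter in GRADE_SCALE:
        if percentage >= lower_bound:
            return letter
    return "F"


print(letter_grade(91.2))  # A-
print(letter_grade(59.9))  # F
```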
Late work and assignment resubmission policy
Please contact the instructor and TA before the deadline if you need an extension due to unforeseen circumstances. We are happy to extend deadlines for deaths and funerals, illnesses, mental health crises or episodes, weddings, important religious and national holidays, job interviews, and other circumstances. There is no shame in asking; we care about your well-being more than we care about deadlines.
Unless you let us know beforehand (or an adverse event occurred very close to the deadline), the late penalty is 2.5% per day, including weekend days and holidays, for all assignments. The latest you may turn assignments in is 2 weeks after the deadline, excluding the final project report, which must be turned in by the deadline.
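A minimal Python sketch of the late penalty arithmetic. It assumes that a partial day counts as a full day and that the penalty is expressed as percentage points off the assignment. Neither detail is stated above, so check with the instructor. `late_penalty_percent` is a hypothetical helper.

```python
import math

PENALTY_PER_DAY = 2.5  # percent per day, weekends and holidays included
MAX_DAYS_LATE = 14     # latest submission is 2 weeks after the deadline


def late_penalty_percent(days_late, accepts_late=True):
    """Return the late penalty in percent, or None if the work is not accepted.

    Use accepts_late=False for the final project report, which must be
    turned in by the deadline.
    Assumption: a partial day late is rounded up to a full day.
    """
    if days_late <= 0:
        return 0.0
    days = math.ceil(days_late)
    if not accepts_late or days > MAX_DAYS_LATE:
        return None
    return PENALTY_PER_DAY * days


print(late_penalty_percent(3))   # 7.5
print(late_penalty_percent(14))  # 35.0
print(late_penalty_percent(15))  # None
```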
If you are unsatisfied with your grade on an assignment and wish to resubmit work, talk with the instructor. Resubmissions are handled case by case, but are generally accepted in cases where parts of the assignment are missing.
Academic integrity policy
Students in this course will be expected to comply with the University of Pittsburgh’s Policy on Academic Integrity. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity. To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Academic Integrity Modules.
Generative AI policy
You are welcome to use generative AI programs (ChatGPT, DALL-E, etc.) as a student in this course. Since much of this course is about developing such tools in NLP, using currently available tools could not only aid you in the coursework but also expose you to the current capabilities and limitations of such systems.
However, your ethical responsibilities as a student remain the same. You must follow the University of Pittsburgh’s Policy on Academic Integrity. Here are some principles to keep in mind that can help you determine whether or not a specific use of generative AI is acceptable in this course (for all forms of generation: writing, code, images or other forms). Please ask the instructor if you are not sure about a specific use. You will not be blamed or retaliated against for asking.
- Use as an aid, not for a finished product. LLMs could be used in this course to generate ideas, draft bibliographies, study guides, etc. Use for drafting entire homework or project reports is not acceptable, even if students revise this draft, since being able to communicate NLP procedures and research is a learning objective. Also keep in mind that language models have no notion of reality and will hallucinate facts and citations.
- Cite its use. The University of Pittsburgh’s academic integrity policy applies to all uncited or improperly cited use of content, whether that work is created by human beings alone or in collaboration with a generative AI. If you use a generative AI tool to develop content for an assignment, you are required to cite the tool’s contribution to your work. In practice, cutting and pasting content from any source without citation is plagiarism. Likewise, paraphrasing content from a generative AI without citation is plagiarism. Similarly, using any generative AI tool without appropriate acknowledgement will be treated as plagiarism. See the APA guidelines on how to cite ChatGPT. Publicly available LLMs are very new, and so best practices in education are still being worked out. Citing your use of LLMs will also inform the instructor on how such tools are being used in education for developing better future policies.
- You are responsible for the work you turn in. As we will discuss in this course, LLMs and other generative AI systems can and do generate biased, socially problematic language and assert unfounded claims. Ultimately the text you submit will be treated as reflecting your own work, and you are responsible for it.
Adapted from faculty in the Carnegie Mellon University Heinz College of Information Systems and Public Policy, with guidance from the Carnegie Mellon University Eberly Center for Teaching Excellence.
Disability rights
The teaching staff of this course view disabilities as deficits not in disabled people but in the institutions and societies that are structured to disadvantage disabled people. If you have a disability (visible or invisible), please let us know as soon as possible (you don’t need to tell us the nature of the disability). You are encouraged to work with Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will work with you to determine reasonable accommodations for this course. This might include lecture materials that are usable by people with visual disabilities, sign language interpretation, captioning, flexible due dates, etc.
Adapted from policies by David Mortensen and Lori Levin at Carnegie Mellon University.
Religious Observances
The observance of religious holidays (activities observed by a religious group of which a student is a member) and cultural practices is an important reflection of diversity. As your instructor, I am committed to providing equivalent educational opportunities to students of all belief systems. At the beginning of the semester, you should review the course requirements to identify foreseeable conflicts with assignments, exams, or other required attendance. Please contact me as early as possible to allow time for us to discuss and make fair and reasonable adjustments to the schedule and/or tasks.
Statement on scholarly discourse
In this course we will be discussing some complex issues on which all of us have strong feelings and, in many cases, unfounded attitudes. It is essential that we approach this endeavor with our minds open to evidence that may conflict with our presuppositions. Moreover, it is vital that we treat each other’s opinions and comments with courtesy even when they diverge and conflict with our own. We must avoid personal attacks and the use of ad hominem arguments to invalidate each other’s positions. Instead, we must develop a culture of civil argumentation, wherein all positions have the right to be defended and argued against in intellectually reasoned ways. It is this standard that everyone must accept in order to stay in this class; a standard that applies to all inquiry in the university, but whose observance is especially important in a course whose subject matter is so emotionally charged.
Adapted from a California State University course: Race, Racism and Critical Thinking.
Student wellness
College/Graduate school can be an exciting and challenging time for students. Taking time to maintain your well-being and seek appropriate support can help you achieve your goals and lead a fulfilling life. It can be helpful to remember that we all benefit from assistance and guidance at times, and there are many resources available to support your well-being while you are at Pitt. You are encouraged to visit Thrive@Pitt to learn more about well-being and the many campus resources available to help you thrive.
If you or anyone you know experiences overwhelming academic stress, persistent difficult feelings and/or challenging life events, you are strongly encouraged to seek support. In addition to reaching out to friends and loved ones, consider connecting with a faculty member you trust for assistance connecting to helpful resources.
The University Counseling Center is also here for you. You can call 412-648-7930 at any time to connect with a clinician. If you or someone you know is feeling suicidal, please call the University Counseling Center at any time at 412-648-7930. You can also contact Resolve Crisis Network at 888-796-8226.
Equity and inclusion
The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices, visit the Civil Rights & Title IX Compliance web page.
I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator, by calling 412-648-7860, or emailing titleixcoordinator@pitt.edu. Reports can also be filed online. You may also choose to report this to a faculty/staff member; they are required to communicate this to the University’s Office of Diversity and Inclusion. If you wish to maintain complete confidentiality, you may also contact the University Counseling Center (412-648-7930).