Trustworthy AI

Fall 2025: MW 2:55-4:10p


Instructors

Teaching Assistants

NOTE: Please message us via Slack with course-related questions.

Course Schedule

Date Topic Readings Resources
Aug 25 Course logistics.
AI pipelines and threats.
- Slides
Aug 27 Survey: AI in 2025. - Slides
Puzzle 🧩
Sep 3 Definitions: AI security, safety, privacy, and trustworthiness. Trustworthy AI (Wing, 2021)
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (Dalrymple et al., 2024)
Slides
Sep 8 Adversarial examples and adversarial robustness. Intriguing properties of neural networks (Szegedy et al., 2013) Slides
Puzzle 🧩
Sep 10 Data poisoning Required:
Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning (Jagielski et al., 2018)
Poisoning the Unlabeled Dataset of Semi-Supervised Learning (Carlini, 2021)
Recommended:
Poisoning Attacks against Support Vector Machines (Biggio et al., 2012)
Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (Shafahi et al., 2018)
Poisoning Web-Scale Training Datasets is Practical (Carlini et al., 2023)
Slides
Sep 15 Backdoor attacks Required:
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain (Gu et al., 2017)
You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion (Schuster et al., 2021)
Poisoning Language Models During Instruction Tuning (Wan et al., 2023)
Recommended:
Black-Box Adversarial Attacks on LLM-Based Code Completion (Jenko et al., 2024)
Sep 17 Membership inference Required:
Membership Inference Attacks against Machine Learning Models (Shokri et al., 2017)
Membership Inference Attacks From First Principles (Carlini et al., 2021)
Recommended:
Do Membership Inference Attacks Work on Large Language Models? (Duan et al., 2024)
Slides
Sep 22 Model stealing Required:
Stealing Machine Learning Models via Prediction APIs (Tramer et al., 2016)
Imitation Attacks and Defenses for Black-box Machine Translation Systems (Wallace et al., 2020)
Stealing Part of a Production Language Model (Carlini et al., 2024)
Recommended:
Adversarial Learning (Lowd and Meek, 2005)
High Accuracy and High Fidelity Extraction of Neural Networks (Jagielski et al., 2019)
Slides
Puzzle 🧩
Sep 24 Model inversion Required:
Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures (Fredrikson et al., 2015)
Text Embeddings Reveal (Almost) As Much As Text (Morris et al., 2023)
Recommended:
Deep Leakage from Gradients (Zhu et al., 2019)
Slides
Sep 29 Memorization. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks (Carlini et al., 2018)
Quantifying Memorization Across Neural Language Models (Carlini et al., 2022)
Slides
Oct 1 Explainability and interpretability
guest: Chandan Singh
Oct 6 Watermarking data and models Required:
A Watermark for Large Language Models (Kirchenbauer et al., 2023)
Radioactive data: tracing through training (Sablayrolles et al., 2020)
Recommended:
Scalable watermarking for identifying large language model outputs (Dathathri et al., 2024)
Oct 8 Fairness and bias in AI
guest: Angelina Wang
Required:
Gender Shades (Buolamwini and Gebru, 2018)
Data Feminism for AI (Klein and D'Ignazio, 2024)
Oct 15 Indirect prompt injection and defenses
guest: Sizhe Chen
Required:
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (Greshake et al., 2023)
SecAlign: Defending Against Prompt Injection with Preference Optimization (Chen et al., 2024)
Slides
Oct 20 Hallucinations and uncertainty in LLMs
guest: Polina Kirichenko
Required:
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (Kirichenko et al., 2025)
Slides
Oct 22 Midterm
Oct 27 AI and copyright
guest: James Grimmelmann
Required:
Talkin’ ‘Bout AI Generation: Copyright and the Generative-AI Supply Chain (Lee et al., 2024)
Oct 29 Alignment Required:
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
Slides
Nov 3 LLM safety alignment and jailbreaking Required:
Jailbroken: How Does LLM Safety Training Fail? (Wei et al., 2023)
Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023)
Recommended:
Are aligned neural networks adversarially aligned? (Carlini et al., 2023)
Nov 5 Contextual integrity
guest: Helen Nissenbaum
Required:
A Contextual Approach to Privacy Online (Nissenbaum, 2011)
Recommended:
Contextual Integrity Up and Down the Data Food Chain (Nissenbaum, 2019)
No Cookies For You!: Evaluating The Promises Of Big Tech’s ‘Privacy-Enhancing’ Techniques (Martin et al., 2025)
Nov 10 Hacking AI agents
guest: Rishi Jha
Nov 12 Unlearning (and why it's hard) Required:
Machine Unlearning in 2024 (Liu)
Recommended:
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice (Cooper et al., 2024)
Nov 17 Training data extraction
Nov 19 Differentially private machine learning Required:
Deep Learning with Differential Privacy (Abadi et al., 2016)
VaultGemma (Google Research, 2025)
Recommended:
Evaluating Differentially Private Machine Learning in Practice (Jayaraman and Evans, 2019)
Nov 24 Fine-tuning risks
Dec 1 Reasoning models.
Reward hacking and deception.
Dec 3 Deepfakes and other abuses of AI
guest: Alexios Mantzarlis
Dec 8 AGI and ASI.
The existential risks debate.
Governance of AI.

Course Overview and Learning Outcomes

This course covers the safety, security, privacy, alignment, and adversarial robustness of modern AI and ML technologies. Topics include threats and risks specific to these technologies, vulnerabilities and state-of-the-art defenses, and how to build and use trustworthy AI/ML systems.

Learning Outcomes

Prerequisites

Course Materials

Lecture notes and occasional course readings will be available through links on the course schedule. Lectures will cover some material that is not in the notes or readings. Attendance is mandatory, and exams will include this material.

Assignments and Grading Criteria

Assignment Weight Due Date
Assignment 1 15% 9/22
Assignment 2 15% 10/20
In-class midterm exam 15% 10/22
Assignment 3 15% 11/17
Assignment 4 15% 12/8
In-class final exam 15% 12/12
Attendance and participation 10% -

Note: Due dates are subject to change; please check back frequently.
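
For illustration only, here is a minimal sketch of how the weights above combine into a final course score, assuming each component is graded on a 0-100 scale (the component names in the code are placeholders, not official identifiers):

```python
# Minimal sketch (not official): combining the grading weights into a final score.
# Assumes each component is scored on a 0-100 scale; names are illustrative.

WEIGHTS = {
    "assignment_1": 0.15,
    "assignment_2": 0.15,
    "midterm_exam": 0.15,
    "assignment_3": 0.15,
    "assignment_4": 0.15,
    "final_exam": 0.15,
    "attendance_participation": 0.10,
}

def final_score(scores: dict[str, float]) -> float:
    """Return the weighted average of component scores (each on a 0-100 scale)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

# Example: full marks everywhere except 80/100 for attendance and participation.
example = {name: 100.0 for name in WEIGHTS}
example["attendance_participation"] = 80.0
print(final_score(example))  # 98.0
```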

Policies

Collaboration Policy

Assignments can be done in teams of 2. All exams are in-class and strictly individual.

Policy on Late Submissions

You have 3 late days for the entire semester, to use however you want (e.g., submit one assignment 3 days late, or three assignments 1 day late each). Partial days are rounded up to the next full day.

After you use up your late days, you get 0 points for each late assignment.
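
To make the accounting concrete, here is a minimal sketch of the late-day bookkeeping described above (the function name and hours-based interface are assumptions for illustration, not part of any course tooling):

```python
import math

# Sketch of the stated policy: 3 late days per semester, partial days round up,
# and once the budget is exhausted a late assignment receives 0 points.

TOTAL_LATE_DAYS = 3

def receives_credit(hours_late_per_assignment: list[float]) -> list[bool]:
    """For each assignment (hours late, 0 if on time), return whether it
    still receives credit under the late-day budget."""
    remaining = TOTAL_LATE_DAYS
    credited = []
    for hours_late in hours_late_per_assignment:
        days_late = math.ceil(hours_late / 24)  # partial days round up
        if days_late <= remaining:
            remaining -= days_late
            credited.append(True)
        else:
            credited.append(False)  # out of late days: this assignment gets 0
    return credited

# Example: 30 hours late counts as 2 days; a later 26-hour-late submission
# needs 2 more days but only 1 remains, so it receives no credit.
print(receives_credit([30, 0, 26]))  # [True, True, False]
```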

Policy on LLMs and Other Generative AI Tools and Technologies

We discourage the use of LLMs and similar AI tools. If you nevertheless choose to use AI on an assignment, you must disclose what you used, how you used it, and your specific prompts in a dedicated document called AI.txt. Failure to disclose AI use is a serious violation of academic integrity and will be treated as such.

You are responsible for fully understanding all code you submit. We will perform random checks to test your understanding. TAs will not help debug LLM-generated code. When asking TAs for help, you must disclose all uses of LLMs and be able to explain how every part of your code is intended to work.

The use of LLMs is strictly prohibited for the in-class exams.

Academic Integrity

We expect you to abide by Cornell's Code of Academic Integrity at all times. Please note that the Code specifically states that a "Cornell student's submission of work for academic credit indicates that the work is the student's own. All outside assistance should be acknowledged, and the student's academic position truthfully reported at all times." Please contact us if you have any questions or concerns about appropriately acknowledging others' work in your submitted assignments. You should expect that we will rigorously enforce the Code.