Trustworthy AI

Fall 2025: MW 2:55-4:10p


Instructors

Teaching Assistants

NOTE: Please message us via Slack with course-related questions.

Course Schedule

Date Topic Readings Resources
Aug 25 Course logistics.
AI pipelines and threats.
- Slides
Aug 27 Survey: AI in 2025. - Slides
Puzzle 🧩
Sep 3 Definitions: AI security, safety, privacy, and trustworthiness. Trustworthy AI (Wing, 2021)
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (Dalrymple et al., 2024)
Slides
Sep 8 Adversarial examples and adversarial robustness. Intriguing properties of neural networks (Szegedy et al., 2013) Slides
Puzzle 🧩
Sep 10 Data poisoning Required:
Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning (Jagielski et al., 2018)
Poisoning the Unlabeled Dataset of Semi-Supervised Learning (Carlini, 2021)
Recommended:
Poisoning Attacks against Support Vector Machines (Biggio et al., 2012)
Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (Shafahi et al., 2018)
Poisoning Web-Scale Training Datasets is Practical (Carlini et al., 2023)
Slides
Sep 15 Backdoor attacks Required:
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain (Gu et al., 2017)
You Autocomplete Me: Poisoning Vulnerabilities in Neural Code Completion (Schuster et al., 2021)
Poisoning Language Models During Instruction Tuning (Wan et al., 2023)
Recommended:
Black-Box Adversarial Attacks on LLM-Based Code Completion (Jenko et al., 2024)
Sep 17 Membership inference Required:
Membership Inference Attacks against Machine Learning Models (Shokri et al., 2017)
Membership Inference Attacks From First Principles (Carlini et al., 2021)
Recommended:
Do Membership Inference Attacks Work on Large Language Models? (Duan et al., 2024)
Slides
Sep 22 Model stealing Required:
Stealing Machine Learning Models via Prediction APIs (Tramer et al., 2016)
Imitation Attacks and Defenses for Black-box Machine Translation Systems (Wallace et al., 2020)
Stealing Part of a Production Language Model (Carlini et al., 2024)
Recommended:
Adversarial Learning (Lowd and Meek, 2005)
High Accuracy and High Fidelity Extraction of Neural Networks (Jagielski et al., 2019)
Slides
Puzzle 🧩
Sep 24 Model inversion Required:
Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures (Fredrikson et al., 2015)
Text Embeddings Reveal (Almost) As Much As Text (Morris et al., 2023)
Recommended:
Deep Leakage from Gradients (Zhu et al., 2019)
Slides
Sep 29 Memorization. The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks (Carlini et al., 2018)
Quantifying Memorization Across Neural Language Models (Carlini et al., 2022)
Slides
Oct 1 Explainability and interpretability
guest: Chandan Singh
Oct 6 Watermarking data and models Required:
A Watermark for Large Language Models (Kirchenbauer et al., 2023)
Radioactive data: tracing through training (Sablayrolles et al., 2020)
Recommended:
Scalable watermarking for identifying large language model outputs (Dathathri et al., 2024)
Oct 8 Fairness and bias in AI
guest: Angelina Wang
Required:
Gender Shades (Buolamwini and Gebru, 2018)
Data Feminism for AI (Klein and D'Ignazio, 2024)
Oct 15 Indirect prompt injection and defenses
guest: Sizhe Chen
Required:
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (Greshake et al., 2023)
SecAlign: Defending Against Prompt Injection with Preference Optimization (Chen et al., 2024)
Slides
Oct 20 Hallucinations and uncertainty in LLMs
guest: Polina Kirichenko
Required:
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (Kirichenko et al., 2025)
Slides
Oct 22 Midterm
Oct 27 AI and copyright
guest: James Grimmelmann
Required:
Talkin’ ‘Bout AI Generation: Copyright and the Generative-AI Supply Chain (Lee et al., 2024)
Oct 29 Alignment Required:
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
Slides
Nov 3 LLM safety alignment and jailbreaking Required:
Jailbroken: How Does LLM Safety Training Fail? (Wei et al., 2023)
Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023)
Recommended:
Are aligned neural networks adversarially aligned? (Carlini et al., 2023)
Nov 5 Contextual integrity
guest: Helen Nissenbaum
Required:
A Contextual Approach to Privacy Online (Nissenbaum, 2011)
Recommended:
Contextual Integrity Up and Down the Data Food Chain (Nissenbaum, 2019)
No Cookies For You!: Evaluating The Promises Of Big Tech’s ‘Privacy-Enhancing’ Techniques (Martin et al., 2025)
Nov 10 Hacking AI agents
guest: Rishi Jha
Nov 12 Unlearning (and why it's hard) Required:
Machine Unlearning in 2024 (Liu)
Recommended:
Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice (Cooper et al., 2024)
Nov 17 Training data extraction
Nov 19 Differentially private machine learning Required:
Deep Learning with Differential Privacy (Abadi et al., 2016)
VaultGemma (Google Research, 2025)
Recommended:
Evaluating Differentially Private Machine Learning in Practice (Jayaraman and Evans, 2019)
Nov 24 Fine-tuning risks
Dec 1 Reasoning models.
Reward hacking and deception.
Dec 3 Deepfakes and other abuses of AI
guest: Alexios Mantzarlis
Dec 8 AGI and ASI.
The existential risks debate.
Governance of AI.

Course Overview and Learning Outcomes

This course covers the safety, security, privacy, alignment, and adversarial robustness of modern AI and ML technologies. Topics include threats and risks specific to these technologies, vulnerabilities and state-of-the-art defenses, and how to build and use trustworthy AI/ML systems.

Learning Outcomes

Prerequisites

Course Materials

Lecture notes and occasional course readings will be available through links on the course schedule. Lectures will cover some material that is not in the notes or readings. Attendance is mandatory, and exams will include this material.

Assignments and Grading Criteria

Assignment Weight Due Date
Assignment 1 15% 9/22
Assignment 2 15% 10/20
In-class midterm exam 15% 10/22
Assignment 3 15% 11/17
Assignment 4 15% 12/8
In-class final exam 15% 12/12
Attendance and participation 10% -

Note: Due dates are subject to change; please check back frequently.
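
For illustration only, here is a minimal sketch of how the weights above combine into a final course score, assuming each component is graded on a 0-100 scale (the component names in the code are placeholders, not official identifiers):

```python
# Minimal sketch (not official): combining the grading weights into a final score.
# Assumes each component is scored on a 0-100 scale; names are illustrative.

WEIGHTS = {
    "assignment_1": 0.15,
    "assignment_2": 0.15,
    "midterm_exam": 0.15,
    "assignment_3": 0.15,
    "assignment_4": 0.15,
    "final_exam": 0.15,
    "attendance_participation": 0.10,
}

def final_score(scores: dict[str, float]) -> float:
    """Return the weighted average of component scores (each on a 0-100 scale)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

# Example: full marks everywhere except 80/100 for attendance and participation.
example = {name: 100.0 for name in WEIGHTS}
example["attendance_participation"] = 80.0
print(final_score(example))  # 98.0
```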

Policies

Collaboration Policy

Assignments can be done in teams of 2. All exams are in-class and strictly individual.

Policy on Late Submissions

You have 3 late days for the entire semester, to use however you want (e.g., submit one assignment 3 days late, or three assignments 1 day late each). Partial days are rounded up to the next full day.

After you use up your late days, you get 0 points for each late assignment.
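
To make the accounting concrete, here is a minimal sketch of the late-day bookkeeping described above (the function name and hours-based interface are assumptions for illustration, not part of any course tooling):

```python
import math

# Sketch of the stated policy: 3 late days per semester, partial days round up,
# and once the budget is exhausted a late assignment receives 0 points.

TOTAL_LATE_DAYS = 3

def receives_credit(hours_late_per_assignment: list[float]) -> list[bool]:
    """For each assignment (hours late, 0 if on time), return whether it
    still receives credit under the late-day budget."""
    remaining = TOTAL_LATE_DAYS
    credited = []
    for hours_late in hours_late_per_assignment:
        days_late = math.ceil(hours_late / 24)  # partial days round up
        if days_late <= remaining:
            remaining -= days_late
            credited.append(True)
        else:
            credited.append(False)  # out of late days: this assignment gets 0
    return credited

# Example: 30 hours late counts as 2 days; a later 26-hour-late submission
# needs 2 more days but only 1 remains, so it receives no credit.
print(receives_credit([30, 0, 26]))  # [True, True, False]
```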

Policy on LLMs and Other Generative AI Tools and Technologies

We discourage the use of LLMs and similar AI tools. If you nevertheless choose to use AI on an assignment, you must disclose what you used, how you used it, and your specific prompts in a dedicated document called AI.txt. Failure to disclose AI use is a serious violation of academic integrity and will be treated as such.

You are responsible for fully understanding all code you submit. We will perform random checks to test your understanding. TAs will not help debug LLM-generated code. When asking TAs for help, you must disclose all uses of LLMs and be able to explain how every part of your code is intended to work.

The use of LLMs is strictly prohibited for the in-class exams.

Academic Integrity

We expect you to abide by Cornell's Code of Academic Integrity at all times. Please note that the Code specifically states that a "Cornell student's submission of work for academic credit indicates that the work is the student's own. All outside assistance should be acknowledged, and the student's academic position truthfully reported at all times." Please contact us if you have any questions or concerns about appropriately acknowledging others' work in your submitted assignments. You should expect that we will rigorously enforce the Code.