COS 597R: Deep Dive into Large Language Models

Instructors	Danqi Chen (danqic AT cs.princeton.edu) and Sanjeev Arora (arora AT cs.princeton.edu)
Teaching assistants	Adithya Bhaskar (adithyab AT princeton.edu) and Tyler Zhu (tylerzhu AT princeton.edu)
Lectures	Monday/Wednesday 10:30-11:50am
Location	CS Building 105
Office hours	Danqi's office hour: Tuesday 10-11, COS 412 (by appointment); Sanjeev's office hour: Wednesday 4-5pm, COS 407; Adithya's office hour: Thursday 3-4pm, Friend 010B; Tyler's office hour: Monday 4-5pm, Friend 010C
Feedback form	https://forms.gle/vUD1RieC1YcBSugw7

We will use a Slack team for most communications this semester. You will be added to the Slack team after the first week. If you join the class late, just email us, and we'll add you. Once you're on Slack, we prefer Slack messages over emails for all logistical questions. We also encourage students to use Slack for discussions related to lecture content and projects.

Large language models (LLMs) have revolutionized natural language processing by enabling machines to generate, understand, and interact with human language in more sophisticated ways than ever before. Beyond technical advancements, LLMs are shaping societal interactions with technology, from enhancing accessibility for underserved communities to transforming education, healthcare, and creative industries. This course aims to provide a rigorous survey of current LLM research, including model architecture, data preparation, pre-training, post-training, alignment, and model deployment. The course focuses on conceptual understanding and research rather than engineering, and it is expected to be highly interactive. Students are expected to read cutting-edge research papers regularly, participate in class discussion, and also complete a major project (in groups of 2-3) at the end, for which computational resources will be arranged.

Prerequisites: COS484 or equivalent background (i.e., familiarity with fundamentals of deep learning/machine learning, Transformers, PyTorch). Open to all graduate students. Undergraduates need instructors' permission.

Course structure

Class participation (30%): In each class, we will cover 1-2 papers (see "required reading" in the schedule). You are required to read these papers in depth beforehand and complete a pre-lecture question form (a Google form linked in the schedule) by 11:59pm on the day before the lecture. Some questions test your understanding of the reading materials; others are open-ended and prompt you to read the paper critically and write down your thoughts. These responses count towards class participation: we will not grade them for correctness, but we expect you to do the work and submit reasonable answers.
Debate (15%): We will schedule 12 debate panels in class from Week 4 to Week 9, with each panel consisting of 5 students and lasting 30 minutes (the lectures will be shortened to 50 minutes). Each panel will focus on one or two research papers related to the topics taught so far and will have the following structure:
- Each panel will be composed of 1 presenter, 2 critics, and 2 proponents.
- The presenter will start with a short presentation (8 minutes) of the paper.
- The 2 critics will then critique the paper, similar to how reviewers assess conference papers—highlighting limitations, weaknesses, and any claims that are not well supported by the experiments.
- The 2 proponents will explain why they believe each problem raised by the critics does not exist or is not serious.
- There will be multiple rounds of interaction. Critics are asked to send their major criticisms to the proponents at least 2 days before the lecture day, so the proponents have time to research and prepare their responses.
- After the debate, the group will write and submit a 2-page summary.
Lecture scribing (10%): For each lecture, we will ask 3 students to scribe the lecture content, covering the technical content and research questions.
- You can find the Overleaf scribe template here. Make one copy shared among all the scribes for a given lecture. It is up to you how to divide the work, as long as the split is equal. Send your completed Overleaf link and PDF to Adithya and Tyler on Slack by 11:59pm three days after the lecture: Thursday for Monday lectures, Saturday for Wednesday lectures.
- Please do not add the four course staff members to the Overleaf project; instead, share the editable link with Adithya and Tyler.
- The template now includes a contributions section. Please fill it out when you submit, with an overview of each scribe's share of the work.
Final project (35% + 10% for presentations): Everyone is required to complete a class project related to modern LLMs and submit a final paper at the end of the semester. You should work in a team of 2 or 3. Everyone is required to submit a proposal to Gradescope by Oct 13th (Sunday) 11:59pm, and the final paper by Dean's Date (Dec 13th, 11:59pm). In-class project presentations will be scheduled in the last three lectures. The template for the final report is here. Feel free to use it for the proposal as well, but you can also use any template you like.

Schedule

Date	Instructor	Topic/required reading	Recommended reading	Reading response	Panel discussion	Scribes
Sep 4 (Wed)	Sanjeev	Introduction [slides]			N/A
Sep 9 (Mon)	Danqi	Pretraining 1 [slides] Language Models are Few-Shot Learners (GPT-3)	Transformers The Annotated Transformer GPT-2 BERT What happened to BERT & T5? (Yi Tay)	[link]	N/A	Yinghui He Haichen Dong Brendan Y. Wang
Sep 11 (Wed)	Danqi	Pretraining 2 [slides] Language Models are Few-Shot Learners (cont'd) The Llama 3 Herd of Models, Sections 1-2, 3.1-3.2, 3.4, and 5.1	Mistral 7B OLMo Qwen2 Data annealing (Databricks)	[link]	N/A	Jiaxin Xiao Dillon Lue Ziyu Xiong
Sep 16 (Mon)	Sanjeev	Scaling laws [slides] Training Compute-Optimal Large Language Models (Chinchilla) Scaling Data-Constrained Language Models	Scaling Laws for Neural Language Models Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws	[link]	N/A	Wuwei Zhang Simran Kaur Keerthana Nallamotu
Sep 18 (Wed)	Sanjeev	Emergent behavior [slides] Emergent Abilities of Large Language Models A Theory for Emergence of Complex Skills in Language Models, Sections 1-3 and 6-8. No need to understand the math.	Wikipedia entry on Emergence Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models	[link]	N/A	Erich Liang Heyu Guo Benedikt P. Stroebl
Sep 23 (Mon)	Danqi	Data curation [slides] Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research	FineWeb RefinedWeb DataComp-LM QuRating	[link]	Paper: Phi-1.5 "More data or better data?" Presenter: Victor Chu Critics: Erich Liang Tanvi Namjoshi Proponents: Simran Kaur Tedi Zadouri	Sijia Liu Iain D. Campbell Elizabeth A. Mieczkowski
Sep 25 (Wed)	Danqi	Post-training: Instruction tuning [slides] Scaling Instruction-Finetuned Language Models	FLAN The Flan Collection Tülu Tulu 2 LESS Sebastian Ruder's blog posts: [1][2]	[link]	Paper: Schaeffer et al 2023 "Are emergent abilities a mirage?" Presenter: Mingqian Xue Critics: Lekang Yuan Heyu Guo Proponents: Qishuo Yin Lihan Zha	Jane E. Castleman Kylie Zhang Yingqing Guo
Sep 30 (Mon)	Danqi	Post-training: learning from preferences [slides] Training language models to follow instructions with human feedback Direct Preference Optimization: Your Language Model is Secretly a Reward Model	Unpacking DPO and PPO Llama 3, Section 4 SimPO	[link]	Paper: Scaling Laws for Data Filtering Presenter: Tamjeed Azad Critics: Elizabeth A. Mieczkowski Nimra Nadeem Proponents: Iain D. Campbell Zhicheng Zheng	Kincaid MacDonald Amey P. Pasarkar Nobline Yoo
Oct 2 (Wed)	Sanjeev	Alignment [slides] A General Language Assistant as a Laboratory for Alignment	The RL probabilist blog on forward and reverse KL	[link]	Paper: LIMA: Less Is More for Alignment Presenter: ~~Mahsa Bastankhah~~ Critics: Niusha Moshrefi Zeyu Shen Proponents: Jiaxin Xiao Wuwei Zhang	Nimra Nadeem Stanley Wei Cyrus Vachha
Oct 7 (Mon)	Sanjeev	Constitutional AI [slides] Constitutional AI: Harmlessness from AI Feedback	HHH Dataset (just look at some examples)	[link]	Paper: Is DPO Superior to PPO for LLM Alignment? Presenter: Boyi Wei Critics: Xingyu Zhu Cyrus Vachha Proponents: Benedikt P. Stroebl Kincaid MacDonald	Juhyun Park Wentao Guo ~~Mahsa Bastankhah~~
Oct 9 (Wed)	Sanjeev	LLM Metacognition [slides] Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving	AI-Assisted Generation of Difficult Math Questions Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning	[link]	Paper: Inverse Constitutional AI: Compressing Preferences into Principles Presenter: Zixuan Wang Critics: ~~Rafael Pastrana Jimenez~~ Dillon Lue Proponents: Sreemanti Dey Jane E. Castleman	Zixuan Wang Mingqian Xue
Oct 21 (Mon)	Tianyu Gao	Long-context models [slides] How to Train Long-Context Language Models (Effectively) RoFormer: Enhanced Transformer with Rotary Position Embedding	A Controlled Study on Long Context Extension and Generalization in LLMs RULER Effective Long-Context Scaling of Foundation Models Data Engineering for Scaling Language Models to 128K Context StreamingLLM	[link]	Paper: Language Models (Mostly) Know What They Know Presenter: Arin J. Mukherjee Critics: Seth Karten Veniamin Veselovskyy Proponents: Yuka Shu Keerthana Nallamotu	Victor Chu Yijun Yin Lihan Zha
Oct 23 (Wed)	Sanjeev	Advanced topics in alignment [slides] OpenAI o1 System Card (skim this and note anything interesting) Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (Read through section 4.2 + skim the rest)	The AI through debate blog post and interview.	[link]	Paper: The Impact of Positional Encoding on Length Generalization in Transformers Presenter: Ambri Ma Critics: Colin Wang Jiahao Qiu Proponents: Brendan Y. Wang David B. Braun	Zeyu Shen Tedi Zadouri Lekang Yuan
Oct 28 (Mon)	~~Danqi~~ Sanjeev	LLM Reasoning 1 [slides] Let's Verify Step by Step Improve Mathematical Reasoning in Language Models by Automated Process Supervision	Common 7B Language Models Already Possess Strong Math Capabilities Math-Shepherd	[link]	Paper: Transcendence: Generative Models Can Outperform The Experts That Train Them Presenter: Jiayi Zhang Critics: Catherine Cheng Juhyun Park Proponents: Wentao Guo Sijia Liu	Niusha Moshrefi Zhicheng Zheng Wenzhe Li
Oct 30 (Wed)	Danqi	LLM Reasoning 2 [slides] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters	Large Language Monkeys Inference Scaling Laws STaR DeepSeekMath	[link]	Paper: Stream of Search (SoS): Learning to Search in Language Presenter: Constantin Schesch Critics: Yinghui He Yijun Yin Proponents: Haichen Dong Amey P. Pasarkar	Creston A. Brooks Jiayi Zhang Qishuo Yin
Nov 4 (Mon)	Mengzhou Xia	Small models [slides] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning Gemma 2: Improving Open Language Models at a Practical Size	MiniCPM Llama 3.2 blog post OpenELM Mojan Javaheripi: The Surprising Power of Small Language Models LLM Pruning and Distillation in Practice: The Minitron Approach	[link]	Paper: Information-Theoretic Distillation for Reference-less Summarization Presenter: Ziyu Xiong Critics: Nobline Yoo Creston A. Brooks Proponents: Stanley Wei Lucy He	David B. Braun Boyi Wei Arin J. Mukherjee
Nov 6 (Wed)	~~Danqi~~	~~Retrieval-augmented LMs~~ Improving language models by retrieving from trillions of tokens		[link]	Paper: To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning Presenter: Alexandre Kirchmeyer Critics: Wenzhe Li Kylie Zhang Proponents: Yingqing Guo Joie Y. Zhang	Sreemanti Dey Xingyu Zhu Colin Wang
Nov 11 (Mon)	Yu Su (OSU)	A Holistic and Critical Look at Language Agents [slides]	Language agents: a critical evolutionary step of artificial intelligence HippoRAG LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error		N/A	Alexandre Kirchmeyer Lucy He Jiahao Qiu
Nov 13 (Wed)	Danqi	Retrieval-augmented language models [slides] Improving language models by retrieving from trillions of tokens	ACL 2023 tutorial REALM kNN-LM TRIME REPLUG FLARE Self-RAG		N/A	Veniamin Veselovskyy Tanvi Namjoshi Ambri Ma
Nov 18 (Mon)	Tri Dao	Hardware-aware Algorithms for Language Modeling	FlashAttention Mamba		N/A	Tamjeed Azad Seth Karten Catherine Cheng
Nov 20 (Wed)	Saining Xie (NYU)	Language Models Need Better Visual Grounding for Meaning and Understanding [slides]	LLaVA Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs Cambrian-1 Molmo and PixMo (AI2) MM1 (Apple)		N/A	Constantin Schesch Yuka Shu Joie Y. Zhang
Nov 25 (Mon)	Students	Project presentations			N/A
Dec 2 (Mon)	Students	Project presentations			N/A
Dec 4 (Wed)	Students	Project presentations			N/A

COS 597R (Fall 2024): Deep Dive into Large Language Models
