天涯小站 2.0

Views: 95 | Replies: 3

648. Andrej Karpathy about Autoresearch

Posted 5 days ago
Last edited by Reader86 on 2026-3-10, 06:42 PM

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly well manually tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check whether they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, and so on. This has been my daily bread and butter for two decades. Seeing the agent do this entire workflow end-to-end, all by itself, as it worked through approximately 700 changes autonomously is wild. It really looked at the sequence of experimental results and used them to plan the next experiments. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real": I hadn't found them manually before, and they stack up and actually improve nanochat. Among the bigger ones:

- It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

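The QK-norm oversight is easy to see numerically: normalizing queries and keys bounds every attention logit to a cosine similarity in [-1, 1], and a softmax over such a narrow range stays nearly uniform; a scale multiplier restores sharpness. A minimal pure-Python sketch (toy vectors, illustrative only, not nanochat's actual code):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def norm(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def attention_weights(q, keys, scale=1.0):
    # With QK-norm, each logit is a cosine similarity bounded in [-1, 1];
    # without a scale multiplier the softmax over them stays nearly flat.
    logits = [scale * sum(a * b for a, b in zip(norm(q), norm(k)))
              for k in keys]
    return softmax(logits)

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
diffuse = attention_weights(q, keys, scale=1.0)   # max weight ~0.67
sharp = attention_weights(q, keys, scale=10.0)    # max weight ~0.9999
```

With scale 1 the best-matching key gets only about two thirds of the attention mass; with a larger scale it gets nearly all of it, which is the sharpening effect the agent's multipliers provide.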
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism.

OP | Posted 5 days ago
github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale, of course: you don't have just a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges.
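
The tune-small-then-promote loop described above can be made concrete in a few lines. This is a toy sketch under stated assumptions: `evaluate` is a stand-in for a real training run, and the names (`promote`, `keep`) are hypothetical, not from any lab's actual tooling:

```python
import random

def evaluate(change, depth):
    # Stand-in for a real training run: in practice this would train a
    # depth-`depth` model with `change` applied and return the change in
    # validation loss (negative = improvement).
    random.seed(hash((change, depth)) % (2 ** 32))
    return random.uniform(-0.02, 0.01)

def promote(candidates, depths=(12, 24), keep=5):
    # Score every surviving idea at the cheap scale, keep the best few,
    # then re-score only those at the next, more expensive scale.
    survivors = list(candidates)
    for depth in depths:
        survivors = sorted(survivors, key=lambda c: evaluate(c, depth))[:keep]
        keep = max(1, keep // 2)
    return survivors

ideas = [f"change-{i}" for i in range(20)]
best = promote(ideas)  # the handful of ideas worth testing at full scale
```

The key design choice is that expensive evaluations are only spent on ideas that already survived the cheap scale.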

And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

OP | Posted 5 days ago
Summary of Karpathy’s Breakthrough
The core of his update centers on the transition from manual hyperparameter tuning to "agentic" research swarms:

The Experiment: Karpathy let an AI agent run autonomously for about two days. It performed approximately 700 experiments, each lasting 5 minutes, on a smaller "depth=12" model.

The Findings: The agent discovered ~20 specific code and hyperparameter changes that he had missed during years of manual tuning. These included fixing "messed up" AdamW betas, tuning network initialization, and adding scalers to parameterless QKnorm.

The Results: These changes were additive and transferred to the larger "depth=24" model. "Time to GPT-2" (the time required to reach GPT-2 level performance on a single 8xH100 node) dropped from 2.02 hours to 1.80 hours, an ~11% improvement.

The "Final Boss": Karpathy noted that this type of automated research is the "final boss battle" for AI labs. He believes the future of AI development will involve swarms of agents collaborating to tune models, while humans shift toward "prompt engineering" the research goals rather than writing the training code themselves.

How it relates to my previous answer
While my previous answer focused on Mira Murati's new venture, Thinking Machines Lab (which also focuses on making AI more customizable and understandable), Karpathy's work is the open-source equivalent of that philosophy. Both are moving toward a world where AI systems are not just "built" by humans, but are iteratively refined and researched by other AI agents.

Would you like me to find the specific GitHub commit Karpathy mentioned that contains these 20 improvements, or more details on the Muon optimizer used in these runs?

OP | Posted 5 days ago
Translation: Harness Engineering Is Cybernetics

Reading OpenAI's article on harness engineering[1], I kept having a feeling I couldn't quite name. Then it clicked: I had seen this pattern before, and not once but three times.

The first time was Watt's centrifugal governor[2] in the 1780s. Before it, a worker had to stand beside the steam engine and adjust the valve by hand. After it, a mechanical device with weighted flyballs sensed the rotation speed and adjusted the valve automatically. The worker did not disappear, but the job changed: from turning the valve to designing the governor.

The second time was Kubernetes[3]. You declare a desired state: three replicas, this image, these resource limits. A controller continuously observes the actual state. Whenever the two diverge, the controller reconciles them: restarting crashed Pods, scaling replicas up or down, rolling back a bad deployment. The engineer's job shifted from restarting services to writing the specification the system reconciles against.
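
The declare/observe/reconcile pattern is small enough to sketch directly (a toy illustration of the control-loop shape, not Kubernetes' actual controller machinery):

```python
def reconcile(desired, actual):
    # One pass of a Kubernetes-style control loop: compare desired and
    # actual replica counts and emit the corrective actions.
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("scale_up", name, want - have))
        elif have > want:
            actions.append(("scale_down", name, have - want))
    return actions

desired = {"web": 3, "worker": 2}
actual = {"web": 1, "worker": 4}
actions = reconcile(desired, actual)  # scale web up by 2, worker down by 2
```

Running the loop again once the actions have taken effect produces no actions, which is what makes a controller like this safe to run continuously.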

The third time is now. OpenAI describes engineers who no longer write code. Instead, they design environments, build feedback loops, and encode architectural constraints as rules, and AI agents write the code. One million lines of code in five months[1], none of them written by hand. They call it "harness engineering."

Three times, the same pattern. Norbert Wiener[4] named it in 1948: cybernetics, from the Greek κυβερνήτης, "steersman." You no longer turn the valve yourself; you steer.

Each time this pattern appears, it is because someone built sensors and actuators powerful enough to close the feedback loop at that level.

Why the codebase was the last holdout

The codebase was never without feedback loops; they just lived at lower levels. The compiler closes the loop at the syntax level. The test suite closes it at the behavior level. Linters close it at the style level. These are genuine cybernetic controls, but they can only check properties that are mechanically verifiable. Does it compile? Do the tests pass? Does it follow the rules?

Everything above that had neither sensors nor actuators: does this change fit the system's architecture? Is this approach the right idea? Will this abstraction cause trouble as the codebase grows? Only humans could operate at that level, and they worked both sides at once: judging quality and writing the fix.

LLMs change both ends. They can sense at a level that used to be humans-only, and they can act at that same level: refactoring a module, redesigning an inconsistent interface, rewriting an entire test suite around the contracts that actually matter. For the first time, the feedback loop can close at the level where the key decisions are made.

But closing the loop is necessary, not sufficient. Watt's governor needed calibration. Kubernetes controllers need correct specifications. And putting an LLM to work on your codebase requires something harder still.

Calibrating the sensors and actuators

Getting the basic feedback loop running (tests the agent can run, CI that emits parseable results, error messages that point toward the fix) is table stakes. Carlini has already demonstrated this[5]: he had 16 parallel agents build a C compiler from a surprisingly simple prompt[6], with carefully engineered test infrastructure. "Most of my effort went into designing the environment around Claude: the tests, the environment, the feedback mechanisms."

The harder problem is calibrating those sensors and actuators with knowledge specific to your system. Most people get stuck here, and blame the agent.

"It keeps getting it wrong. It doesn't understand our codebase." That diagnosis is almost always mistaken. The agent fails not for lack of capability, but because the knowledge it needs (what "good" means, which patterns your architecture encourages and which it avoids) is locked in your head and was never externalized. Agents do not learn by osmosis. If you do not write it down, the agent makes the same mistake on run one hundred as on run one.

The essence of this work is making your judgment machine-readable: architecture docs that describe the actual layering and dependency directions, custom lint rules with fix guidance built in, golden rules that encode your team's aesthetic standards. OpenAI discovered this too[1]: they spent 20% of every Friday cleaning up "AI slop," until they encoded the standards into the harness itself.
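
A concrete example of machine-readable judgment: a custom lint rule whose report includes the fix direction, so an agent can act on it. The layering below ("core" must not import "api" or "web") is hypothetical, for illustration only:

```python
import ast

# Hypothetical layering rule for illustration: code in the "core" layer
# must not import from the higher "api" or "web" layers.
FORBIDDEN = {"core": {"api", "web"}}

def check_layering(source, layer):
    """Flag imports that point against the declared dependency direction.
    Each finding includes fix guidance, so an agent can act on the report."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for name in modules:
            top = name.split(".")[0]
            if top in FORBIDDEN.get(layer, set()):
                violations.append(
                    f"line {node.lineno}: '{layer}' may not import '{top}'; "
                    f"move the shared code down into '{layer}' or invert the "
                    f"dependency via an interface"
                )
    return violations

report = check_layering("import api.handlers\nimport os\n", "core")
```

The point is the message format: a finding that says only "bad import" forces a human to interpret it, while one that names the allowed fix closes the loop for the agent.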

The only way out

Everything these practices demand (documentation, automated tests, codified architecture decisions, fast feedback loops) has always been the right thing to do. Every software engineering book published in the last thirty years recommends them. Most people skip them because the cost of skipping is slow and diffuse: quality erodes gradually, onboarding is painful, technical debt accumulates quietly.

Agentic engineering makes that cost extreme. Skip the documentation and agents ignore your conventions, not on one PR but on every PR, at machine speed, around the clock. Skip the tests and the feedback loop simply cannot close. Skip the architectural constraints and drift outruns your ability to repair it. And here is the trap: if the agent does not know what "clean" looks like, you cannot use agents to clean up the mess either. Without calibration, the machine that creates the problem cannot solve it.

The practices have not changed. The cost of ignoring them has become unaffordable.

The generation-verification asymmetry (the intuition behind P vs NP[7], demonstrated empirically with LLMs by Cobbe et al.[8]) points the way forward. Generating a correct solution is harder than verifying one. You do not need to out-implement the machine; you need to out-judge it: define what "correct" looks like, spot where the output goes wrong, decide whether the direction is right.
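
Subset-sum makes the asymmetry concrete: finding a subset that hits a target means searching up to 2^n candidates, while checking a proposed subset is a single pass. A toy sketch:

```python
from itertools import combinations

def verify(nums, subset, target):
    # Verification: a membership check plus one sum, linear time.
    # (Assumes distinct numbers, for simplicity.)
    return set(subset) <= set(nums) and sum(subset) == target

def generate(nums, target):
    # Generation: brute-force search over all 2^n subsets.
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
solution = generate(nums, 9)      # exponential search finds [4, 5]
assert verify(nums, solution, 9)  # the check is instant
```

The same shape applies to judging agent output: specifying and checking "correct" scales far better than producing it yourself.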

The workers who designed Watt's governor never went back to turning valves. Not because they could not, but because it no longer made sense.

References
[1] OpenAI's harness engineering article: openai.com/index/harness-…
[2] Watt's centrifugal governor: en.wikipedia.org/wiki/Centrifug…
[3] Kubernetes: kubernetes.io/docs/concepts/…
[4] Norbert Wiener: en.wikipedia.org/wiki/Cyberneti…
[5] Carlini's demonstration: anthropic.com/engineering/bu…
[6] The surprisingly simple prompt: github.com/anthropics/cla…
[7] P vs NP: en.wikipedia.org/wiki/P_versus_…
[8] Empirical demonstration with LLMs: arxiv.org/abs/2110.14168

https://x.com/dotey/status/2031231856071372830?s=46
