Karpathy Scripts
The hybrid programming pattern that combines natural language instructions with deterministic code for reliable, repeatable AI operations.
What Is a Karpathy Script?
A karpathy script (kscript) is a hybrid software artifact that combines natural language instructions (interpreted by an AI agent) with traditional deterministic code (executed by a machine) to create a repeatable, reliable operation. The practice centers on identifying, for any given task, the optimal boundary between what should be expressed as prompt and what should be expressed as code, then formalizing both into a cohesive artifact that can be invoked repeatedly with consistent results.
The term draws from Andrej Karpathy’s Software 3.0 framework.¹ In his articulation of software paradigms, Karpathy identified three eras: Software 1.0 (traditional code that programs a CPU), Software 2.0 (neural network weights that program a neural net), and Software 3.0 (natural language prompts that program an LLM). “Prompts are programs that program the LLM,” he observed; remarkably, we are now programming computers in English.
A karpathy script is, at its core, a Software 3.0 program, but one that has been refined into a reliable, repeatable operation through deliberate methodology. The name also carries a second etymological layer: the cowpath principle from urban planning,² the practice of observing where people naturally walk before deciding where to pave roads.
The phonetic echo is whimsically noted: built this way, these might as well be called “cowpathy” scripts. A kscript is built by observing where the agent naturally reaches for deterministic code, then paving those paths.
The key principle is simple: don’t roll the dice on what elements can be fixed. When an agent encounters a subtask that requires deterministic processing — data transformation, file parsing, structured computation — it will typically generate a helper script on the fly. This works well enough in isolation. But when the same operation is repeated across multiple invocations, each invocation may generate a slightly different helper script. Most will be correct. Some will not. The agent is rolling the dice on something that could be fixed, introducing unnecessary variance into an operation that should be reliable.
A karpathy script solves this by identifying which parts of a task belong in the stochastic domain (agent interpretation) and which belong in the deterministic domain (traditional code), then formalizing both into a single artifact.
The Anatomy
A karpathy script consists of two types of components working in concert.
The prompt component contains natural language instructions that the agent interprets and executes. This is where judgment, semantic processing, contextual reasoning, and natural language tasks live. The prompt component may include conditional logic, checkpoint directives, input/output specifications, and references to the code components. It is Software 3.0: English as a programming language.
The code component encompasses anything “fixed” (deterministic) that the agent references rather than reinventing. This most commonly takes the form of traditional scripts — Python, shell, JavaScript — but it also includes schemas, configuration files, templates, data fixtures, or even well-defined directory structures and naming conventions. What unifies these elements is that they are pre-verified, permanent, and not subject to the agent’s stochastic interpretation. A JSON schema that defines the expected output format is as much a code component as a Python script that performs data extraction — both remove a degree of freedom from the agent and replace it with something fixed.
Neither component is sufficient alone. The prompt component without the code component forces the agent to regenerate deterministic logic on every invocation, introducing variance where none is needed. The code component without the prompt component lacks the semantic reasoning and NLP capabilities that motivated the approach in the first place. The karpathy script is their union: prompt guidance directing the agent’s judgment, fixed elements anchoring the agent’s execution. Together they create a “leash” that keeps the operation reliable while preserving the agent’s ability to exercise judgment where judgment is actually needed.
A karpathy script may be simple (a single prompt file referencing a single helper script) or complex, with a multi-step prompt orchestrating several code artifacts with intermediate judgment steps. The complexity should match the task, not an abstract ideal.
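A minimal sketch of a code component may help fix the idea. The prompt component would say something like “run the record validator on your output and fix any reported problems” rather than “write a validation script”; the validator itself is pre-verified and permanent. The script, its field names, and its schema are hypothetical illustrations, not part of any real kscript:

```python
# Hypothetical code component of a kscript: a pre-verified validator
# the prompt references instead of letting the agent regenerate it.
import json
import sys

REQUIRED_FIELDS = ("id", "title", "status")   # illustrative schema
VALID_STATUSES = ("open", "closed")

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    if "status" in record and record["status"] not in VALID_STATUSES:
        problems.append(f"unknown status: {record['status']!r}")
    return problems

if __name__ == "__main__":
    # Read a JSON array of records on stdin, emit a problem report.
    records = json.load(sys.stdin)
    report = {str(r.get("id", "?")): validate_record(r) for r in records}
    json.dump(report, sys.stdout, indent=2)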
The Inversion
The industry has abundant examples of traditional code that calls LLMs. Agent frameworks like LangChain orchestrate LLM calls programmatically. CI pipelines include AI-generated summaries. Scripts hit API endpoints for classification or content generation. In all of these patterns, Software 1.0 is the host and the LLM is a guest: a service invoked by deterministic orchestration.
A karpathy script inverts this relationship. The natural language instruction is the host (the primary program) and traditional code is the guest, invoked by the agent as a tool when it needs deterministic support. The orchestration lives in Software 3.0, not Software 1.0.
This represents a different mental model of what is in charge. When you write a Python script that calls an LLM API, you are building traditional software with an AI component bolted on. The control flow is deterministic. The AI is one step in a fixed sequence. When you write a karpathy script, you are writing a program in English that happens to invoke code when it needs reliability. The agent interprets your instructions, makes judgment calls about how to proceed, and reaches for deterministic tools when the situation calls for them. The orchestration is stochastic. The code components are anchors within that stochastic flow.
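The familiar half of this contrast, Software 1.0 as host, looks like the sketch below. `call_llm` is a hypothetical stand-in for any provider’s completion API, returning a canned string so the sketch runs without a network call:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a provider API call; canned response
    # so the sketch runs offline.
    return "SUMMARY(" + prompt.splitlines()[-1] + ")"

def summarize_reports(reports: list[str]) -> list[str]:
    # Software 1.0 as host: the code decides the sequence, and the
    # LLM is one fixed step inside a deterministic loop.
    return [call_llm("Summarize this report:\n" + text) for text in reports]
```

A karpathy script flips this relationship: the prompt is the program, the loop and the judgment calls live in English, and the agent decides when to shell out to a fixed script.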
The industry has good language for “code that calls an LLM.” It does not yet have good language for “LLM instructions that call code.” That is the gap the term fills. A karpathy script names the specific artifact that sits at this intersection: natural language as the orchestrator, code as the tool, and a framework for building this relationship well.
When to Reach for a Karpathy Script
Not every task warrants a karpathy script. The concept occupies a specific niche in the solution space, and understanding that niche (what belongs inside it and what does not) helps you choose the right tool for each situation.
Use a pure prompt when the task is one-off and will not be repeated, when the task is exploratory and the approach is not yet understood, when the task is entirely NLP-native with no deterministic components, or when you are still in the early stages of understanding the task and haven’t yet identified where the boundaries lie. A one-off prompt that produces an excellent result is good prompt engineering. It becomes a karpathy script only when it is designed for repeated execution and incorporates fixed elements to stabilize that repetition. The distinction is between a conversation and a program.
Use a karpathy script when the task involves a blend of NLP-native work (judgment, semantic analysis, natural language processing) and deterministic work (data transformation, file operations, structured computations), when development speed matters and a formal feature workflow would be disproportionately expensive, when the task is repeatable (i.e., it will be executed multiple times across similar inputs), and when the acceptable error tolerance is non-zero but bounded. You need reliability, not perfection.
Use a formal feature (traditional code) when the task is entirely deterministic with no NLP component, when the task is critical infrastructure where any stochastic variance is unacceptable, when the task requires comprehensive automated testing and regression prevention, or when the solution must operate without an LLM agent available.
These categories are not rigid. They represent a continuum, and tasks naturally move along it as they mature. A task that starts as a pure prompt during exploration may evolve into a kscript as it becomes repeated and its boundaries crystallize. A kscript that grows sufficiently complex and critical may eventually warrant promotion to a formal feature with full test coverage. The kscript practice is a point on this continuum, not a permanent classification.
Development speed alone can justify the choice. A purely computational task might be expressible in Python with a complicated regex pipeline, but if explaining the desired behavior to an agent in English produces reliable results in a fraction of the development time, that is a valid reason to choose Software 3.0. The paradigm is defined by the tools used to create the solution, not by the domain of the task.
Discovering the Boundary
The most distinctive aspect of karpathy scripting is how the boundary between prompt and code is determined. The core question is always the same: for each aspect of the task, which paradigm — stochastic interpretation or deterministic execution — will produce more reliable results?
The cowpath approach is the most organic method and the one that gives the practice its name. Begin with a pure prompt that describes the entire task. Execute it repeatedly across different inputs. Observe what the agent does consistently: which helper scripts it generates, which procedures it follows, which tools it reaches for. These repeated behaviors are the cowpaths. When the agent repeatedly generates nearly identical code for a deterministic operation, that code is a candidate for extraction: take a verified instance, make it permanent, and rewire the prompt to reference it instead of regenerating it. The boundary reveals itself through accumulated evidence.
Agent-assisted analysis offers a faster alternative. Rather than observing behavior across many executions, consult the agent directly. Present a task description and ask: “Which parts of this would you handle with traditional code? What helper scripts would you create? Where would you reach for deterministic support?” The agent’s self-assessment of where it needs fixed scaffolding can accelerate boundary discovery, particularly for tasks where repeated execution is expensive or impractical. The results should be validated against actual execution, but the approach trades empirical rigor for speed.
Architectural reasoning works when the boundary is apparent from the nature of the work. File format parsing, mathematical computation, data validation against a fixed schema, and cryptographic operations are almost always better served by code. Semantic evaluation, summarization, contextual interpretation, and natural language generation are almost always better served by prompt. When the task decomposes cleanly along these lines, the boundary can be drawn before any execution occurs.
Iterative refinement is how the boundary evolves over time. A kscript deployed with an initial boundary may reveal — through subsequent use — that some prompt-handled tasks would be more reliable as code, or that some code components are unnecessarily rigid and could benefit from agent judgment. The boundary is a living decision, revisited as the task evolves and as agent capabilities change.
The key principle across all approaches is the same: when the agent repeatedly generates nearly-identical code to perform a deterministic operation, each regeneration is a fresh opportunity for subtle errors. Whether that pattern is discovered through observation, self-assessment, or reasoning, the response is the same — extract it, verify it, cement it. Don’t roll the dice on what can be fixed.
Karpathy Scripts in WSD
If you have been using WSD, you have already been using karpathy scripts — you just may not have had a name for them.
WSD’s slash commands are themselves karpathy scripts. Consider /wsd:init. It contains conditional branching (check if the --custom flag was provided), variable generation (create a timestamp-based Workscope ID), subroutine calls to other commands (/wsd:boot), file I/O operations (Work Journal creation via a shell script), defined completion behavior, and — critically — evaluation of the Task-Master agent’s output as a double-check against potential errors. That last capability is the kind of semantic reasoning that makes a karpathy script valuable: a traditional script cannot evaluate whether an agent’s workscope assignment is logically sound, but a prompt component can.
These commands were developed organically, built to solve practical workflow problems, before the term “karpathy script” existed. They illustrate the principle that kscripts often emerge from practice before they are recognized as such. Someone writes a prompt that works. They use it repeatedly. They notice that certain parts are always the same — a helper script written to dev/diagnostics, the timestamp format, the file naming convention — and those parts get extracted into fixed code. What remains is a prompt that orchestrates the fixed elements while providing the judgment and flexibility that only an agent can offer.
Claude Code skills provide a well-suited container format for karpathy scripts. A skill is a directory with a SKILL.md entry point (the prompt component) alongside supporting files such as scripts, templates, schemas, and data fixtures (the code components), all invocable as a single slash command. Skills support argument passing, dynamic context injection (shell commands whose output is injected into the prompt before the agent sees it), and per-skill tool permissions. This makes skills an excellent delivery mechanism for kscripts: both components live in one self-contained directory.
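The dynamic-context-injection mechanism mentioned above can be sketched in a few lines: run shell commands before the agent sees the prompt, and splice their output into the prompt text. The `!{...}` placeholder syntax below is illustrative only, not Claude Code’s actual syntax:

```python
import re
import subprocess

def inject_context(template: str) -> str:
    """Replace each !{command} placeholder with the command's stdout."""
    def run(match: re.Match) -> str:
        result = subprocess.run(
            match.group(1), shell=True, capture_output=True, text=True
        )
        return result.stdout.strip()
    return re.sub(r"!\{([^}]+)\}", run, template)
```

The agent then receives a prompt already containing fresh, deterministic context (a branch name, a file listing) rather than being asked to go fetch it.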
Not every skill is a karpathy script, and not every karpathy script must be packaged as a skill. A simple skill that provides insight into frontend design practices is just a saved prompt with no deterministic component. Conversely, a kscript could be delivered as a custom command with external script references, or even as verbal instructions paired with helper scripts on disk. Skills are an effective packaging mechanism; karpathy scripts are the design pattern. The pattern exists at a different level of abstraction than the mechanism, and it survives independently of any particular tool.
Building your own karpathy script in WSD follows a natural progression:
- Start with a pure prompt or custom command. Write the instructions in English. Execute them manually. See what works.
- Observe the cowpaths. After several executions, notice what the agent does consistently — the helper scripts it generates, the validation steps it performs, the output formats it produces.
- Extract the fixed elements. Take a verified instance of each repeated behavior and make it permanent — a Python script, a JSON schema, a template file.
- Rewire the prompt. Reference the extracted elements instead of letting the agent regenerate them. The prompt now says “run scripts/validate.py” instead of “write a validation script.”
- Package as a skill. If the kscript has matured to the point of regular use, bundle the prompt and code components into a skill directory for clean invocation.
This progression mirrors the cowpath principle: observe, then pave. You don’t start by designing the karpathy script. You start by using the AI, watching what it does, and then formalizing the patterns that emerge.
Limitations
Intellectual honesty about the limitations of karpathy scripts is essential. They are powerful, but they are not without trade-offs.
Stochastic variance is real. Even with deterministic code components extracted, the prompt component introduces irreducible variance. The agent may phrase outputs differently, take slightly different approaches to judgment calls, or occasionally misinterpret instructions. Downstream consumers must be designed to tolerate this variance, or the kscript is the wrong tool. The salient aspects — the parts that downstream processes depend on — should be reliable, but perfect reproducibility is not achievable and should not be expected.
Testing is informal at the moment. Unlike traditional code, which can be covered by automated test suites with deterministic assertions, the prompt component of a kscript is tested primarily through repeated execution and evaluation of results. There is no pytest for natural language instructions. Regression detection depends on noticing degraded output quality, whether through human review or automated consistency checks on the deterministic portions of the output. This is an area where the practice has room to mature.
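One practical form of the automated consistency check mentioned above: run the kscript several times, ignore the stochastic prose fields, and assert that the deterministic fields agree across runs. The field names are hypothetical:

```python
import json

# Fields that should be identical on every run; prose fields
# like "summary" are allowed to vary.
DETERMINISTIC_FIELDS = ("schema_version", "record_count", "status")

def runs_consistent(outputs: list[str]) -> bool:
    """True if every JSON output agrees on all deterministic fields."""
    projections = {
        tuple(json.loads(o).get(f) for f in DETERMINISTIC_FIELDS)
        for o in outputs
    }
    return len(projections) <= 1
```

This catches regressions in the fixed portion of the output without demanding the impossible, byte-identical prose from a stochastic component.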
Model dependency is a factor. A kscript is tuned to the behavior of a particular class of LLM. Model updates, capability changes, or switching between providers may alter how the prompt component is interpreted. The code components are immune to this — a Python script runs the same regardless of which model is orchestrating it — but the prompt components may require re-tuning when the underlying model changes significantly.
Cowpath risk exists. The cowpath methodology can cement suboptimal patterns if the initial observation period is too short or the input diversity is too narrow. Just as the classic warning says “don’t pave the cowpaths” when the paths themselves are wrong, a kscript developer must exercise judgment about whether the observed agent behavior represents a good solution or merely a habitual one. The boundary between prompt and code should be drawn based on what produces reliable results, not just what the agent happened to do the first few times.
These limitations are manageable in practice. They are the trade-offs accepted in exchange for the speed, capability, and NLP-native power that karpathy scripts provide. The key is understanding the trade-offs clearly and choosing karpathy scripts for tasks where the benefits outweigh the costs — which, for many of the tasks that arise in AI-assisted development, they do.
Footnotes
1. A. Karpathy, “Software is changing (again),” presented at AI Startup School, San Francisco, CA, USA, Jun. 2025. [Online]. Available: https://youtu.be/LCEmiRjPEtQ
2. W. Lidwell, K. Holden, and J. Butler, Universal Principles of Design, rev. ed. Beverly, MA: Rockport, 2010. The “Desire Line” entry defines the principle as “a trace of use or wear that indicates preferred methods of interaction with an object or environment.”