Andrej Karpathy recently noted that "LLM agent capabilities have crossed some kind of threshold of coherence around December 2025 and caused a phase shift in software engineering." For many practitioners, the question is no longer whether AI coding agents can capably generate code, but how to configure and prompt them effectively. Wholesail's engineering team ran an experiment in autonomous end-to-end implementation to explore that question. Here's what we learned.
This post presents the first of three parts documenting an experimental project exploring the use of large language models (LLMs) for autonomous end-to-end software development at Wholesail, a B2B payment and credit management platform serving over 400 vendors and 100,000 buyers. We describe the project approach, LLM selection criteria, and novel prompt engineering techniques developed during the research. Central to our methodology was the insight that prompt quality could be significantly improved through a multi-stage generation process, including the integration of professional qualification standards as "pep talk" preambles to influence token generation toward higher-quality code. This work provides practical insights for engineering teams considering "agentic" LLM workflows for software development automation.
- Let the model help write the prompt. Having the LLM expand human-written outlines into detailed prompts consistently produced higher-quality outputs than direct prompting alone.
- Prime the model like a professional. Including job expectations, role definitions, and code-review standards in system-level prompts improved code generation quality.
- Maintaining shared context improves coherence. Generating multi-part prompts within a single session produced more coherent results than isolated generation.
- Break the workflow into phases. Creating specialized prompts for discrete development stages consistently outperformed monolithic, end-to-end prompts.
Wholesail operates a unified platform for credit risk management, accounts receivable, and payment processing in the wholesale distribution and manufacturing industries. Our engineering team regularly implements features spanning backend Java services and React-based frontend applications. In August 2025, we initiated an exploratory project to evaluate whether LLM technology (specifically Anthropic's Claude) could autonomously implement software features end-to-end.
The project involved the implementation of a deliberately simple feature: a customer list export consisting of a new API endpoint and a corresponding user interface button. The simplicity of the chosen feature was intentional: it represented the minimal viable scope that would still exercise the full software development lifecycle. A more complex feature would have made it difficult to determine whether implementation difficulties stemmed from the feature's inherent complexity or from limitations in the LLM technology generating it.
Our fundamental design decision was to pursue autonomous, end-to-end software generation rather than interactive assistance. This approach reflected our belief that understanding the technology's true functional ceiling required pushing it to operate independently across all phases of development.
The rationale was straightforward: If the goal is to understand how LLMs might truly accelerate software development, then the most ambitious use case of fully autonomous implementation would reveal both capabilities and limitations more clearly than partial automation scenarios where human intervention might mask underlying weaknesses.
This decision had significant implications for the project's structure. We needed to develop mechanisms for the LLM to:
- Consume and interpret feature specifications from existing project management systems
- Plan implementation approaches based on codebase context
- Generate both application and test code
- Interact with code review processes
- Execute operational procedures such as deployment to testing environments
This workflow is essentially the same read-think-act cycle that powers tools like OpenAI's Codex, but applied at feature scope rather than at the scale of individual tool calls.
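The read-think-act cycle above can be sketched in a few lines. This is a minimal illustration, not Wholesail's actual implementation: `llm`, the fetchers, and `act` are placeholders for real model calls and Jira/GitLab integrations.

```python
# Minimal sketch of a read-think-act loop operating at feature scope.
# All callables are injected stand-ins for real integrations.

def read_context(ticket_id, fetch_ticket, fetch_comments):
    """Gather the feature spec and prior-session notes before planning."""
    return {
        "ticket": fetch_ticket(ticket_id),       # feature specification
        "history": fetch_comments(ticket_id),    # summaries of earlier work
    }

def run_feature_cycle(ticket_id, llm, fetch_ticket, fetch_comments, act):
    """One read-think-act iteration over an entire feature."""
    context = read_context(ticket_id, fetch_ticket, fetch_comments)  # read
    plan = llm(
        f"Plan an implementation for: {context['ticket']}\n"
        f"Prior work: {context['history']}"
    )                                                                # think
    return act(plan)                                                 # act
```

In practice each of the bullet points above (specification consumption, planning, code generation, review, deployment) becomes its own `act` step driven by the same loop.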
We selected Claude Code as our primary LLM interface, using the Opus model variant for complex reasoning tasks and falling back to the Sonnet variant when Anthropic's usage limits required it. This selection was based on several factors:
- Agentic capabilities: Claude Code provides built-in support for file system operations, command execution, and multi-step task orchestration essential for autonomous development workflows.
- Context window: The available context window was sufficient to consume meaningful portions of the existing codebase during implementation planning.
- Extensibility: The Model Context Protocol (MCP) support enabled integration with systems including Jira, GitLab, and CI/CD infrastructure.
The most significant technical contribution of this project lies in the prompt engineering methodology we developed through iterative experimentation. We discovered that naive prompting produced substantially inferior results compared to carefully structured, multi-stage prompt generation.
Our first key insight came from observing techniques used with other LLM systems (e.g., ChatGPT Deep Research) for investigative tasks. Rather than directly prompting an LLM with a request, practitioners had found success in first writing an outline of their request, then asking the LLM to generate a more detailed prompt based on that outline.
This approach aligns with theoretical understanding of how LLMs operate: More detailed, specific prompts constrain the token generation space more effectively, leading to outputs that better match the requester's intent. The generated prompt, being more verbose and detailed than a human would typically write, provides this additional constraint.
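The two-stage flow can be sketched as follows. The template wording is illustrative (not the prompt we actually used), and `llm` stands in for any chat-completion call, such as the Anthropic SDK:

```python
# Two-stage prompting: expand a terse human outline into a detailed prompt,
# then feed the expanded prompt back in as the real request.

EXPANSION_TEMPLATE = (
    "You are a prompt engineer. Expand the following outline into a "
    "detailed, unambiguous prompt for a coding agent. Spell out inputs, "
    "outputs, constraints, and acceptance criteria.\n\nOutline:\n{outline}"
)

def expand_outline(outline, llm):
    """Stage 1: have the model write the detailed prompt."""
    return llm(EXPANSION_TEMPLATE.format(outline=outline))

def run_task(outline, llm):
    """Stage 2: use the expanded prompt for the actual task."""
    detailed_prompt = expand_outline(outline, llm)
    return llm(detailed_prompt)
```

The human supplies only the short outline; the verbose intermediate prompt is what actually constrains the final generation.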
Anthropic's documentation for Claude Code contained a reference to a "Prompt Improver," which captures exactly this idea in a dedicated workbench tool.
Our second major prompt engineering discovery was the significant impact of including professional qualification standards in the system prompt. After observing poor-quality code generation in early experiments, we hypothesized that priming the model with tokens associated with high-quality software development might constrain generation toward better outcomes.
The Token Space Theory
The theoretical basis for this approach lies in how transformer-based language models generate text. Each generated token is influenced by the probability distributions shaped by preceding tokens in the context. By including tokens associated with experienced, high-quality software development, we hypothesized that subsequent code generation would draw from similar regions of the model's learned token space.
To implement this, we constructed a hybrid system prompt by combining:
- Job requirements: We merged the company's most senior backend and frontend engineering job requirements into a single composite specification describing an exceptionally qualified developer.
- Code review standards: We incorporated the company's internal code review guidelines, establishing explicit quality criteria the generated code should satisfy.
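The combination can be sketched as simple string assembly. Section headers and wording here are illustrative stand-ins for the company's actual documents:

```python
# Assembling the hybrid "pep talk" system prompt from two source documents.

def build_system_prompt(job_requirements: str, review_standards: str) -> str:
    """Combine senior-role expectations and review criteria into one
    system prompt that primes generation toward those standards."""
    return "\n\n".join([
        "You are a senior engineer who meets all of the following "
        "expectations and holds all code to the review standards below.",
        "## Role expectations\n" + job_requirements,
        "## Code review standards\n" + review_standards,
    ])
```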
The resulting system prompt effectively "cast" the LLM in the role of a highly experienced senior engineer familiar with high-quality development practices. This approach produced measurable improvements in code quality, including:
- More appropriate error handling patterns
- Better adherence to existing codebase conventions
- More thoughtful consideration of edge cases
- Reduced incidence of naive or "tutorial-quality" implementations
These assessments were anecdotal, based on human reviewers examining the generated code in merge requests rather than on formal metrics.
A subtle but important refinement to our prompt generation methodology involved maintaining context continuity across multiple prompt generation stages. We discovered that generating prompts in sequence within the same context window produced superior results compared to generating each prompt in isolation.
Consider the process of generating a multi-part system prompt:
- First, generate the "pep talk" component based on job requirements
- Then, generate the code review standards component
- Finally, generate the task-specific implementation prompt
When each component was generated in a fresh session, the resulting prompts lacked coherence. For example, the review standards component would not be influenced by the engineering qualification context established in the first component.
By generating all components within a single context window, each subsequent generation was influenced by all preceding generations. The review standards prompt, having "consumed" the engineering qualifications prompt, would generate text that complemented rather than duplicated or contradicted the earlier content.
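The difference comes down to whether each generation sees the conversation so far. A sketch of the single-session approach, with `llm` as a placeholder that takes the full message history and returns the next assistant turn:

```python
# Generate multi-part prompts in one running conversation so each
# component is conditioned on the components generated before it.

def generate_components(requests, llm):
    """Generate each prompt component with all prior components in context."""
    messages = []
    components = []
    for request in requests:
        messages.append({"role": "user", "content": request})
        reply = llm(messages)            # sees the entire history so far
        messages.append({"role": "assistant", "content": reply})
        components.append(reply)
    return components
```

The isolated-session variant would instead call `llm` with a single-message history for each request, which is exactly what produced the incoherent results described above.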
Beyond the system prompt, we developed a set of task-specific "slash commands"—specialized prompts invoked to execute particular phases of the software development workflow:
```
/plan-implementation    # Analyze ticket and propose implementation approach
/implement-jira-ticket  # Generate code based on ticket specification
/review-code            # Perform code review on generated changes
/address-feedback       # Respond to and incorporate review comments
/implement-tests        # Generate unit and integration tests
/deploy-qa              # Execute deployment to QA environment
```
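In Claude Code, custom slash commands like these are defined as Markdown files under `.claude/commands/`, where the file name becomes the command name. A minimal `/review-code` definition might look like this (the instructions shown are illustrative, not our actual prompt):

```markdown
<!-- .claude/commands/review-code.md -->
Review the current merge request diff against our internal code review
standards. For each issue, cite the file and line, explain the problem,
and suggest a concrete fix. Do not approve changes that lack tests.
```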
This decomposition reflected the reality that different phases of software development require different approaches and expertise. A single monolithic prompt attempting to cover the entire workflow proved far less effective than targeted prompts optimized for specific tasks.
The review prompt, in particular, emerged as potentially the most immediately valuable artifact of the research. Its quality exceeded that of existing off-the-shelf automated review tools while offering the advantage of full codebase context and company-specific standards integration.
Recognizing that LLMs are fundamentally stateless, we developed an approach to state management that leveraged existing systems rather than introducing new infrastructure.
The insight was that software development already involves substantial written communication: ticket updates, merge request comments, and code review threads. These artifacts could serve as persistent state storage that the LLM could read at the beginning of each session to reconstruct context about work completed so far.
This approach offered several advantages:
- No new systems or databases required
- State storage occurred in natural locations where human developers would expect to find it
- The written summaries could be reviewed by human team members
- Integration with existing notification and tracking workflows was automatic
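Reconstructing state at session start then amounts to assembling a preamble from those artifacts. The fetchers below are placeholders for Jira/GitLab MCP calls; only the assembly logic is shown:

```python
# Rebuild session context from existing written artifacts (ticket
# comments, merge request review notes) instead of a dedicated store.

def reconstruct_state(ticket_id, fetch_ticket_comments, fetch_mr_notes):
    """Build a context preamble summarizing prior work on a ticket."""
    sections = []
    comments = fetch_ticket_comments(ticket_id)
    if comments:
        sections.append("## Prior ticket updates\n" + "\n".join(comments))
    notes = fetch_mr_notes(ticket_id)
    if notes:
        sections.append("## Open review threads\n" + "\n".join(notes))
    if not sections:
        return "No prior work recorded for this ticket."
    return "\n\n".join(sections)
```

The resulting preamble is prepended to the session's first prompt, giving the stateless model a reconstructed memory of work completed so far.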
However, this approach also revealed an unexpected challenge that we document in Part II of this series: the LLM's tendency toward excessive enthusiasm in its summaries created a feedback loop that degraded subsequent session performance.
This post describes practical prompt engineering techniques for autonomous LLM-driven software development, finding that having LLMs expand human-written outlines into detailed prompts, priming with professional qualifications and code review standards, and maintaining context continuity across prompt-generation stages all improve output quality. The methodology also emphasizes that decomposing development workflows into specialized, phase-specific prompts outperforms monolithic prompting approaches.
These techniques are applicable to any team exploring LLM-assisted development workflows, regardless of whether they pursue the fully autonomous approach documented in this project. Part II of this series examines the technical architecture of the implementation and the specific challenges encountered during development.
The experiments documented in this series represent how we approach engineering at Wholesail: with intellectual curiosity, rigorous analysis, and a willingness to push the boundaries of what's possible. We're a small team of experienced engineers building the payment and credit infrastructure that connects over 400 wholesalers with 100,000+ buyers in the food and beverage industry. If exploring the frontier of emerging technologies while crafting reliable and secure financial systems sounds compelling, we'd love to hear from you. We're hiring engineers who take pride in their craft, think deeply about the products they build, and want to shape both the technology and culture of an early-stage company. Learn more at our jobs page.
This is Part I of a three-part blog series on autonomous software generation with large language models. Part II covers implementation architecture and technical challenges. Part III presents evaluation results and future directions.