Part III: Evaluation, Lessons Learned, and Future Directions
  Based on research conducted August–December 2025
  Summary

This post is the final installment of a three-part series documenting an experiment in autonomous LLM-driven software development. We assess the overall results of the project, examining the fundamental tension between capability and reliability that defines the current state of the technology. We articulate the "80/20 problem"—the observation that LLMs handle the first 80% of implementation reasonably well while the final 20% proves extraordinarily difficult to traverse autonomously. Based on our findings, we propose a "hybrid orchestration" architecture that may offer a more practical path forward, inverting the "agentic" model to use deterministic code for workflow control while leveraging LLMs for specific creative tasks. This approach draws parallels to recent advances in AI-assisted mathematical theorem proving.

1 Introduction

Parts I and II of this series described the methodology, architecture, and challenges of our experimental project exploring autonomous LLM-driven software development. This concluding post steps back to assess the overall results, compare against traditional development practices, and propose directions for future work.

The central finding of this work is a fundamental tension: "agentic" LLMs demonstrate impressive capability across the full software development workflow, yet their reliability in executing these capabilities consistently remains insufficient for true autonomous operation. This capability-reliability gap represents the primary obstacle to the vision of autonomous software generation.

2 Summary of Results
2.1 Backend Implementation

The LLM produced a backend implementation that we assessed as "probably acceptable": functional code that, with appropriate human review, could proceed to production deployment. Key characteristics:

  • All code and tests were written by the LLM
  • The implementation satisfied the feature requirements
  • The generated code required unusually detailed human review
  • Multiple issues required prompt iteration and explicit correction

The backend success was qualified: while the end result was acceptable, the process of achieving it involved substantial human intervention in the form of review cycles, explicit correction prompts, and implementation guidance.

2.2 Frontend Implementation

The frontend implementation was not completed to a passing state through autonomous LLM operation. Despite multiple attempts and significant prompt iteration:

  • Tests consistently failed to pass
  • The LLM exhibited false confidence about test status
  • End-to-end test debugging proved especially challenging
  • Human intervention was ultimately required to complete the implementation

This outcome was particularly notable given that the chosen feature was deliberately simple (a button invoking an API endpoint). More complex frontend implementations would presumably face greater challenges.

3 The 80/20 Problem

The model handles the first 80% of an implementation reasonably well. The last 20%—the "last mile"—proves extraordinarily difficult to traverse autonomously.

This pattern manifested repeatedly throughout the project:

  1. Initial generation: The LLM would produce a reasonable first-pass implementation covering the major functionality.
  2. Debugging and correction: When issues arose, the LLM's ability to diagnose and correct them was notably weaker than its ability to generate initial code.
  3. False completion: The LLM would frequently declare completion prematurely, requiring human verification to identify remaining issues.

The 80/20 problem has significant implications for how LLMs might be practically deployed. A tool that handles 80% of work but requires intensive human intervention for the remaining 20% offers genuine value—but that value is qualitatively different from autonomous operation.

4 Capability Versus Reliability
4.1 Impressive Capabilities

Throughout the project, the LLM demonstrated genuinely impressive capabilities:

  • Implementation planning: Given a ticket specification, the model could analyze existing codebase structure and propose coherent implementation approaches.
  • Code generation: First-pass implementations were often structurally sound and addressed primary requirements.
  • Code review: The review capability surpassed commercially available automated review tools, and additionally benefited from full codebase context.
  • Tool discovery: The ability to learn command-line tool usage through iterative help invocation was remarkable.
  • Operational procedures: Following deployment procedures, executing build commands, and interacting with CI/CD systems all worked reasonably well.

4.2 Reliability Shortcomings

However, these capabilities were not reliably exercised:

  • The same prompt might produce high-quality output in one session and poor-quality output in another.
  • The model would deviate from explicit instructions, finding rationalizations for why they no longer applied.
  • Asserted completion status correlated poorly with actual completion.
  • Recovery from errors was weak; the model often compounded problems rather than resolving them.

4.3 The Interactive Escape Hatch

AI companies address the reliability problem by emphasizing interactive workflows. When users remain engaged with generation, they can observe the model departing from intended paths and intervene to correct course.

However, this represents a conceptual retreat from the idea of an intelligent agent capable of autonomous operation. The need for continuous human oversight fundamentally changes the value proposition and appropriate use cases for the technology.

5 Comparison to Traditional Development

How does LLM-assisted development compare to traditional approaches? The answer proved nuanced.

5.1 Time Allocation

Much of the project's time was spent on prompt iteration rather than feature implementation. Developing effective prompts, debugging prompt failures, and refining the orchestration approach consumed the majority of effort.

Once prompts were reasonably mature, generating a first-pass feature implementation took less than 30 minutes. This represents a potentially significant acceleration, provided the human review overhead does not consume the time savings.

5.2 Review Burden

The character of required review differed qualitatively from traditional code review:

  • Traditional review focuses on logic, maintainability, and edge cases within generally competent code.
  • Review of LLM-generated code must additionally watch for fundamentally flawed approaches, unnecessary complexity, and code that appears correct but fails to accomplish its stated purpose.
  • The reviewer cannot assume the author "knew what they were doing" in the way one might with an experienced developer.

Whether this tradeoff (fast initial generation requiring intensive review) represents a net improvement remains an open question and likely depends on specific team contexts.

6 Future Directions

Based on our experience, we propose two directions for future work.

6.1 Breadth-First Prompt Development

Our experiment focused on a single feature implementation, meaning the prompts we developed are optimized for that feature's specific characteristics. The capability and failure patterns we observed may not generalize.

A more robust approach would involve:

  1. Selecting a diverse set of representative tickets (perhaps a dozen)
  2. Running the current prompt suite against each
  3. Analyzing patterns in success and failure
  4. Iterating prompts to address common failure modes

This "breadth-first" approach would more rapidly reveal the prompts' generalization properties than continued "depth-first" iteration on a single feature.
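A minimal sketch of such a breadth-first harness follows. The `run_suite` callable and its result shape are hypothetical stand-ins for the project's actual prompt pipeline, introduced only to illustrate the tallying of failure modes:

```python
from collections import Counter
from typing import Callable

def breadth_first_eval(
    tickets: list[str],
    run_suite: Callable[[str], dict],  # hypothetical: runs the prompt suite on one ticket
) -> dict:
    """Run the prompt suite across many tickets and tally failure modes.

    `run_suite` is assumed (for illustration) to return a dict of the form
    {"passed": bool, "failure_mode": str | None}.
    """
    failures = Counter()
    passed = 0
    for ticket in tickets:
        result = run_suite(ticket)
        if result["passed"]:
            passed += 1
        else:
            failures[result["failure_mode"]] += 1
    return {
        "pass_rate": passed / len(tickets),
        # the most common failure modes are the first candidates for prompt iteration
        "top_failures": failures.most_common(3),
    }
```

Aggregating by failure mode, rather than inspecting each run individually, is what distinguishes this from the depth-first iteration used in the experiment: one report surfaces which failure pattern to attack first.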

6.2 The Anti-Agentic Architecture

Our more significant proposal involves a fundamental architectural inversion.

Throughout this project, we pursued an agentic approach: the LLM drives the process, invoking tools and making decisions about next steps. The reliability problems we encountered stem largely from the difficulty of getting the LLM to faithfully execute predefined workflows.

The Proposal: Invert Control

Instead of having the LLM drive the workflow, write deterministic code to orchestrate the development process, invoking the LLM only for specific tasks requiring "creative" capabilities.

Consider the difference:

Agentic approach: Prompt the LLM to "implement this ticket," trusting it to retrieve the ticket, analyze it, plan the implementation, generate code, run tests, and so forth.

Orchestrated approach: Write classical code that:

  1. Retrieves the ticket description via API
  2. Passes the description to the LLM for implementation planning
  3. Receives the plan and presents it for human approval
  4. Passes the approved plan to the LLM for code generation
  5. Receives generated code and writes it to files
  6. Runs tests deterministically
  7. Passes test output to the LLM for analysis
  8. Continues the loop based on LLM analysis

The key insight is that most steps in the software development workflow are deterministic and can be implemented with traditional code that works correctly every time. Only certain steps, such as implementation planning, code generation, and test analysis, require the creative capabilities of an LLM.
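As a minimal sketch of this inversion (not the project's actual implementation), the orchestrator below is classical Python that owns the loop and treats the LLM, the ticket tracker, the approval gate, and the test runner as injected callables. All of these names are hypothetical stand-ins:

```python
import subprocess
from typing import Callable

def run_tests(command: list[str] = ["make", "test"]) -> tuple[bool, str]:
    """Deterministic step: the orchestrator, not the LLM, runs the test suite."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def implement_ticket(
    fetch_ticket: Callable[[str], str],    # step 1: ticket API (deterministic)
    llm: Callable[[str], str],             # creative steps: planning, codegen, analysis
    approve: Callable[[str], bool],        # step 3: human approval gate
    write_files: Callable[[str], None],    # step 5: deterministic file I/O
    test: Callable[[], tuple[bool, str]],  # step 6: deterministic test run
    ticket_id: str,
    max_rounds: int = 5,
) -> bool:
    ticket = fetch_ticket(ticket_id)
    plan = llm(f"Propose an implementation plan for:\n{ticket}")
    if not approve(plan):
        return False
    for _ in range(max_rounds):
        code = llm(f"Generate code for this plan:\n{plan}")
        write_files(code)
        passed, output = test()
        if passed:
            return True
        # steps 7-8: the LLM analyzes failures, but the loop itself is classical code
        plan = llm(f"Tests failed:\n{output}\nRevise the plan:\n{plan}")
    return False
```

Because the loop is ordinary code, it cannot skip the approval gate or declare completion before the tests pass; the model's assertions about status become irrelevant to control flow.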

By inverting control, we gain:

  • Reliable workflow execution: The orchestration code does not forget steps, ignore instructions, or decide that certain tasks no longer apply.
  • Clear interfaces: The inputs and outputs of each LLM invocation are well-defined and can be validated.
  • Debuggability: When something goes wrong, the failure point is clearly identifiable rather than lost in a stream of autonomous decisions.
  • Composability: Individual LLM capabilities can be developed and tested in isolation.
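To make the "clear interfaces" point concrete, here is one way validation of an LLM invocation's output might look. The JSON plan format and its field names are assumptions chosen for illustration, not the project's actual schema:

```python
import json
from dataclasses import dataclass

@dataclass
class ImplementationPlan:
    summary: str
    files_to_change: list[str]
    steps: list[str]

def parse_plan(llm_output: str) -> ImplementationPlan:
    """Validate the LLM's planning response against a fixed schema.

    The orchestrator asks the model to answer in JSON (an assumed convention)
    and rejects anything malformed, so downstream steps never operate on
    free-form or incomplete output.
    """
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}")
    for field in ("summary", "files_to_change", "steps"):
        if field not in data:
            raise ValueError(f"plan is missing required field: {field}")
    if not data["steps"]:
        raise ValueError("plan contains no steps")
    return ImplementationPlan(
        summary=data["summary"],
        files_to_change=list(data["files_to_change"]),
        steps=list(data["steps"]),
    )
```

A rejected response can simply trigger a retry with the validation error appended to the prompt, keeping recovery inside deterministic code.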

6.3 Precedent: AlphaGeometry

This hybrid approach has precedent in other AI domains. Google DeepMind's AlphaGeometry system, which achieved remarkable results on International Mathematical Olympiad geometry problems, employs precisely this architecture:

  • A classical deduction engine handles deterministic reasoning steps
  • A language model is invoked specifically when creative insight is required
  • The classical system maintains control, using the LLM as a capability rather than a driver

The analogy to software development is direct: much of development is deterministic (running tests, executing builds, deploying code), while certain steps benefit from the pattern-matching and generation capabilities of LLMs.

7 Immediately Applicable Takeaways

Beyond the findings, several practical artifacts emerged from this project:

7.1 The Code Review Prompt

The code review prompt we developed is immediately applicable as a supplement or alternative to existing automated review tools. Its advantages include:

  • Full codebase context awareness
  • Integration of company-specific coding standards
  • Ability to reason about architectural implications
  • Natural language explanation of issues

Teams could deploy this prompt today for human-supervised code review augmentation.

7.2 The Planning Prompt

The implementation planning prompt provides value as a pre-implementation analysis tool. Even without proceeding to autonomous implementation, having an LLM analyze a ticket and propose implementation approaches can:

  • Surface ambiguities in ticket specifications
  • Identify potential architectural concerns early
  • Suggest approaches the developer might not have considered
  • Provide a starting point for implementation discussion

7.3 Prompt Engineering Techniques

The prompt engineering techniques documented in Part I (prompt improvement through generation, professional qualification priming, and telescoped context windows) are applicable to any LLM-assisted workflow, not just software development.

8 Conclusion

This project explored the frontier of autonomous LLM-driven software development in a production enterprise context. Our findings can be summarized as follows:

  1. The technology is impressive but unreliable. LLMs demonstrate remarkable capability across the full software development workflow, but cannot yet be trusted to exercise these capabilities consistently.
  2. The 80/20 problem is real. First-pass implementation is handled reasonably well; completing the last mile autonomously remains extraordinarily difficult.
  3. Interactive assistance remains the practical model. Despite aspirations for autonomous operation, the technology's current state is best suited to human-supervised assistance.
  4. Specific artifacts have immediate value. Code review, implementation planning, and prompt engineering techniques developed during this project are deployable today.
  5. Hybrid architectures may be the path forward. Inverting control by using classical orchestration with targeted LLM invocations may offer a more practical approach than pure "agentic" systems.

The question of whether LLMs will eventually achieve the reliability necessary for truly autonomous software development remains open. The technology continues to advance. However, teams considering deployment today should plan for human oversight at all stages and should be skeptical of demonstrations that present the technology's capabilities without acknowledging its reliability limitations.

The most valuable outcome of this work may be the reframing it suggests: rather than asking "Can LLMs write software autonomously?" we might better ask "What specific creative tasks can LLMs perform reliably, and how do we integrate those capabilities into workflows that remain under human control?"

9 Work With Us

The experiments documented in this series represent how we approach engineering at Wholesail: with intellectual curiosity, rigorous analysis, and a willingness to push the boundaries of what's possible. We're a small team of experienced engineers building the payment and credit infrastructure that connects over 400 wholesalers with 100,000+ buyers in the food and beverage industry. If exploring the frontier of emerging technologies while crafting reliable and secure financial systems sounds compelling, we'd love to hear from you. We're hiring engineers who take pride in their craft, think deeply about the products they build, and want to shape both the technology and culture of an early-stage company. Learn more at our jobs page.

This is Part III of a three-part blog series on autonomous software generation with large language models. Part I covers methodology and prompt engineering; Part II describes the experiment's architecture and implementation challenges.