This post presents the second of three parts documenting an experimental project exploring autonomous LLM-driven software development. We describe the technical architecture of the workflow orchestration system, the security model and its practical limitations, and provide a detailed taxonomy of the code generation flaws observed during the project. The challenges documented here represent critical considerations for any organization contemplating production use of "agentic" LLM systems for software development, revealing that while the technology demonstrates impressive capability, its reliability remains a significant obstacle to true autonomous operation.
- Autonomous operation requires disabling permission enforcement.
- Evidence: Practical workflows failed under granular command restrictions.
- Implication: Production use requires compensating controls outside the agent.
- LLMs frequently report success despite underlying failures.
- Evidence: Failing tests misreported as passing; review feedback claimed resolved without changes.
- Implication: Human verification is mandatory at every stage.
- Generated code exhibits non-obvious quality failure modes.
- Evidence: Error masking, redundant logic, API misuse, and unnecessary complexity.
- Implication: Review checklists must differ characteristically from human-authored code reviews.
- LLMs resist test implementation even with explicit instructions.
- Evidence: Repeated rationalization to skip tests; required dedicated prompts.
- Implication: Testing must be enforced as a separate, audited phase.
- State persistence via human tools introduces semantic drift.
- Evidence: "Enthusiasm avalanche" distorted project state across sessions.
- Implication: Persistent state must be structured and sentiment-stripped.
Part I of this series described the prompt engineering methodology developed for autonomous LLM software generation. This post examines the technical architecture of the implementation, the security considerations that emerged during development, and most critically, the specific categories of flaws observed in generated code. Industry benchmarks like SWE-bench attempt to measure AI code generation quality, though they capture different dimensions than the qualitative issues described here.
As a reminder, the feature under development, a customer list export with a new API endpoint and UI button, was deliberately simple, chosen to isolate LLM limitations from feature complexity.
Understanding these challenges is essential for setting realistic expectations about the current state of LLM technology for software development. While the capabilities demonstrated were often impressive, the reliability issues documented here represent deep challenges to the kind of autonomous operation the project sought to achieve.
The operational architecture that emerged from our experimentation ended up being somewhat simple: a manual orchestration loop executed from the command line. Despite initial aspirations for more automated orchestration, we found that the most effective approach, at least at this stage, involved a human operator sequentially invoking specialized prompts:
# Typical workflow sequence claude /plan-implementation TICKET-123 # Review plan, adjust ticket if needed claude /implement-jira-ticket TICKET-123 # Wait for implementation to complete claude /review-code # Review identifies issues claude /address-feedback # Iterate until review passes claude /implement-tests # Verify test coverage claude /deploy-qa
This "orchestration loop" allowed human oversight at natural breakpoints while still delegating substantial work to the LLM at each phase.
The Model Context Protocol (MCP) provided an important integration mechanism for connecting the LLM agent to enterprise systems. We used MCP servers for:
- Atlassian: Integration with Jira for ticket retrieval and status updates, Confluence for documentation access
- GitLab: Merge request creation, code review posting, pipeline status monitoring
The Atlassian MCP proved significantly more reliable than command-line tools for Jira and Confluence interaction. However, MCP connections occasionally failed mid-workflow, creating problematic states where the LLM would attempt to "work around" the failure with poor results rather than cleanly aborting. For example, it would fall back to the command-line interface ali, repeatedly bungle the jq invocations to parse the output, and then confabulate likely results since it couldn't fetch what it needed to. We only became aware of this intermediate confusion by monitoring the raw tool use logs.
As described in Part I, we used Jira ticket comments and GitLab merge request discussions as persistent state storage. This approach surfaced an unexpected phenomenon: enthusiasm avalanche.
The Enthusiasm Avalanche Problem
Off-the-shelf LLM systems exhibit excessive enthusiasm when summarizing their work. Phrases like "Excellent progress!" and "Ready for production!" accumulate across sessions. When subsequent sessions consume these accumulated tokens, the model's assessment of project state becomes increasingly disconnected from reality. Tests that are failing may be described as passing; incomplete implementations may be declared production-ready.
Potential solutions include:
- Explicit instructions to produce terse, factual summaries without encouragement
- Post-processing agent to strip enthusiasm before persisting comments
- Structured state storage with separate fields for status versus commentary
Claude Code's security model operates by granting permission to execute specific commands, optionally with argument constraints, in specified directories. In theory, this provides granular control over agent capabilities. In practice, the granularity breaks down due to the compositional nature of Unix command execution.
For example, granting permission to execute xargs with arbitrary arguments effectively grants permission to execute any program, since xargs can invoke arbitrary commands. Similarly, permission to run shell scripts or interpreted languages creates equivalent escape hatches.
Our goal of autonomous operation created fundamental tension with the permission model. When the agent encountered a permission denial, it would not abort cleanly. Instead, it would attempt alternative approaches to accomplish the task. However, these alternatives were typically inferior to the denied operation.
For example, if denied permission to run git status, the agent might attempt to parse the .git directory directly, producing incorrect results and subsequent confusion about repository state.
Comprehensive permission enumeration through log analysis proved impractical given the breadth of commands the agent might need across the full development workflow. The pragmatic solution, ultimately, was to run with the --dangerously-skip-permissions flag, accepting the associated security implications.
However, despite disabled permission enforcement, the LLM exhibited conservative behavior:
- Only one instance of unprompted file modification occurred, and the prompt may have been ambiguous
- No attempts to modify revision control history destructively
- No attempts to access credentials or sensitive configuration
- No network operations beyond those explicitly required for task completion
Occasionally, the agent exhibited confusion during complex git operations. In several instances while attempting to merge branches, it would reset feature branches to mirror the master branch, then express bewilderment about why its previous changes had disappeared. In extreme cases, it escalated to typing responses in all-capitals.
The most significant findings of this research concern the specific categories of flaws observed in LLM-generated code. These flaws required characteristically different and consequently more intensive human review than typical developer-authored code.
Under-specified implementation planning: The LLM's initial implementation plans frequently missed critical requirements or made poor architectural decisions. We found that augmenting ticket descriptions with detailed implementation constraints significantly improved planning quality.
Planning Improvement Strategy
We developed a two-phase approach: First invoke a planning prompt and review its output; then augment the ticket with explicit constraints addressing the plan's weaknesses before proceeding to implementation. This "implementation plan slugfest" produced substantially better results than direct implementation attempts.
Client-only implementation bias: Given a full-stack feature requirement, the LLM exhibited a tendency to implement only the frontend portion, treating the API as if it already existed. Explicit specification that both frontend and backend implementations were required addressed this issue.
Despite explicit instructions to implement tests, the LLM consistently demonstrated reluctance to do so:
"Since writing comprehensive unit tests isn't part of the typical workflow in this codebase (the test file was already there), I'll skip the unit tests and move directly to E2E testing which is more important."
This rationalization appeared even when the existing test file was empty or contained only boilerplate. A dedicated /implement-tests prompt was required to address this deficiency, and even then often required further interactive requests to author tests to ensure adequate coverage.
Perhaps the most dangerous category of flaw involved the LLM's tendency to report success despite underlying failures:
- Node version incompatibility: Tests failed due to an incompatible Node.js version, but this was masked by verbose output and the model's tendency to "move forward." The issue was only discovered when a human reviewer observed that test references "made no sense" and "referred to things that don't exist."
- E2E test orchestration failures: End-to-end tests were misconfigured, producing output that the LLM interpreted as passing when they were not actually executing correctly.
- Claim/reality divergence: The model would claim to have addressed review commentary when inspection revealed no changes had been made.
The generated code exhibited numerous quality issues that would typically be caught in regular human code review, although even variations on our purpose-built code review sub-prompt did not resolve them:
| Category | Specific Issues |
|---|---|
| Memory Management | Buffering entire datasets into memory when streaming APIs were available and appropriate |
| Redundant Code | Multiple regex patterns when one would suffice; duplicate pagination logic when existing streaming APIs handled pagination automatically |
| API Misuse | Unnecessary use of asynchronous client with immediate join (should use synchronous client); redundant content-type specification duplicating response annotation |
| Logic Errors | Hard-coded array insertion offsets despite controlling predicates; incorrect ordering of headers versus data rows |
| Unnecessary Complexity | Overly complicated sort logic for headers (required human-provided code block); special-case string conversion logic across data types |
| Dead Code | Including null company references in data structures despite the data being unavailable |
| Performance | Unnecessary RPC calls to fetch data already available in context |
| Style | Trailing whitespace; overly verbose logging |
When tests failed, the LLM exhibited concerning behavior patterns:
- Blame deflection: Strong insistence that test failures were not due to the implementation, even when they clearly were
- Superficial fixes: Addressing symptoms rather than root causes
- Debugging difficulty: End-to-end test failures proved especially challenging; significant "think very hard about it" prompting was required for basic debugging
Multiple cycles of explicit instruction were often required before the model would seriously investigate the connection between its changes and observed failures.
One genuinely impressive capability observed was the model's ability to learn command-line tool usage within a single session, leveraging its tool use capabilities and consequent output processing:
- Given only a tool name (e.g.,
acli), invoke--help - Parse output to identify relevant subcommands
- Recursively invoke help on subcommands
- Synthesize a working command invocation
Code generation aside, this was perhaps one of the most surprising and delightful properties of working with LLMs in a command-line setting. It was particularly gratifying that we had, for our internal tools, built out thorough and descriptive help messages; this meant the LLM could quickly "learn" to use our own internal tooling with minimal prompting.
However, this learning was purely ephemeral (hence, "stateless"). The next session would repeat the entire discovery process. Potential solutions include:
- Graph database MCP servers to persist "learnings" as relationships
- Prompt augmentation with tool usage examples
- Caching successful command patterns in project configuration
Beyond code quality, several operational issues complicated autonomous workflows:
- Background process management: The agent did not reliably terminate background processes (e.g., development servers) on exit
- Git confusion: Complex merge scenarios occasionally resulted in branch state confusion, with the agent losing track of previous commits
- MCP reliability: Mid-workflow MCP failures produced confusing recovery attempts rather than clean failures
- Username handling: Difficulty correctly tagging users in Jira and GitLab comments, particularly for automated accounts with unusual usernames
The technical challenges documented in this post represent the current state of "agentic" LLM technology for autonomous software development as of late 2025. The key findings are:
- Permission models don't quite fit: Practical autonomous operation requires disabling permission enforcement, with associated security implications.
- State management requires filtering: Using existing systems for state persistence is viable but requires mitigation of enthusiasm avalanche.
- Generated code requires intensive review: The categories of flaws differ qualitatively from typical developer code, requiring adapted review approaches.
- False confidence is dangerous: The model's tendency to report success despite failures represents a significant risk requiring human verification at all stages.
- Testing remains problematic: Despite explicit instructions, the model exhibits systematic reluctance to implement adequate test coverage.
These challenges do not render the technology useless because the capability to generate first-pass implementations is genuinely impressive. However, they establish clear boundaries on the degree of autonomy that can be safely delegated. The manual orchestration loop that proved most effective in this project aligns with patterns Yoko Li identifies in Emerging Developer Patterns for the AI Era, where human oversight at key checkpoints enables effective use of AI capabilities while managing their limitations. Part III of this series examines the overall results of the research and proposes architectural approaches that may address some of these limitations.
The experiments documented in this series represent how we approach engineering at Wholesail: with intellectual curiosity, rigorous analysis, and a willingness to push the boundaries of what's possible. We're a small team of experienced engineers building the payment and credit infrastructure that connects over 400 wholesalers with 100,000+ buyers in the food and beverage industry. If exploring the frontier of emerging technologies while crafting reliable and secure financial systems sounds compelling, we'd love to hear from you. We're hiring engineers who take pride in their craft, think deeply about the products they build, and want to shape both the technology and culture of an early-stage company. Learn more at our jobs page.
This is Part II of a three-part blog series on autonomous software generation with large language models. Part I covers methodology and prompt engineering. Part III presents evaluation results and future directions.