I have often seen a business try to vibe-code on its own, or hire recent graduates with no real programming or design experience, to get applications written cheaply and quickly. What could go wrong? You no longer need to know programming: you speak in your own language, AI writes everything for you, and you can even ask it to cover the code with tests and follow architecture best practices... Right? Or not? =) Anyone who has written something more complex than a mini-parser or a simple integration (which can usually be found for free and does not need to be written) knows that the free ride ends quickly. As the codebase grows, AI starts hallucinating, duplicating code, inventing absurd business processes, and getting confused, and the more you write "no, do it properly, do it well", the worse it does. Sound familiar? Let's discuss how to write code faster and cheaper while minimizing gray hair, dead neurons, and dopamine crashes.
Table of contents
- How to properly write tests for Codex, Claude, and other modern agents
- Why testing today matters not only to the developer, but also to the director, owner, and investor
- The main idea: you need to test not code as text, but risk as an event
- Why Codex and Claude often write tests that look smart but make no sense
- What testing methodology modern applications need
- 1. Deterministic domain tests
- 2. Component and integration tests
- 3. Contract tests
- 4. Scenario and end-to-end tests
- 5. Runtime checks
- 6. AI evals and agent evals
- How to choose which tests to run always, and which only sometimes
- On every commit or locally before push
- On pull request
- At night or on schedule
- Before release
- After release
- How to write tests for ordinary algorithmic applications
- What to test here first
- Which invariants are useful for business
- How to test applications with integrations
- A good strategy for external integration
- What is especially important to check
- How to test scraping and parsing
- Layer 1. Fetcher
- Layer 2. Parser
- Layer 3. Normalizer
- Layer 4. Dedupe
- Layer 5. Writer
- Layer 6. Scheduler
- What tests are needed here
- How to test Telegram integrations and similar external channels
- Outbound
- Inbound
- What to test for outbound
- What to test for inbound
- How to test queues, delayed actions, and regular tasks
- What is important to check here
- What helps technically
- How to test real-time data streams, for example voice through a microphone in the browser
- How to reasonably divide such tests
- What is important to check
- How to test external webhooks
- What must be checked for a webhook endpoint
- What not to do
- How to test AI integrations if the tests are paid
- The right division of AI checks by cost
- What must be separated
- How to test AI integrations if the result cannot be checked algorithmically
- 1. Reference-based evals
- 2. Rule-based graders
- 3. LLM-as-a-judge
- 4. Pairwise comparison
- 5. Human review for high-risk slices
- Do you need to run another AI to test AI
- Good practice
- How to test agentic systems if it is not clear at all what normal is
- Level 1. Outcome
- Level 2. Process
- Level 3. Cost and stability
- What is especially useful in agent evals
- How to test complex infrastructure tasks and a cold start of the system
- What needs to be checked here
- How to test this in practice
- What a mature testing system must have
- How to give a task to Codex, Claude, and other agents so they write good tests
- Incorrect task statement
- Correct task statement
- What is worth specifying explicitly to the agent
- A very useful rule
- How we would break this down by typical Ingello Systems business scenarios
- Logistics and booking
- Corporate platforms
- Medicine and the regulatory contour
- Production, accounting, and integrations
- AI and development automation
- Common mistakes that kill the real benefit of tests
- Practical conclusion for business
- What to do if you already have a project, but its tests are chaotic
- Conclusion
How to properly write tests for Codex, Claude, and other modern agents
An article by the Ingello Systems team
Modern development has changed radically. Just yesterday, business lived in a world where people wrote code, people wrote tests, and bugs were also human, sometimes lazy, sometimes talented. Today AI agents have joined the team: Codex, Claude, and other systems that can generate code, edit architecture, run tests, write pull requests, and announce with complete confidence that everything is ready =)
The problem is that speed has increased, and the cost of error has also increased. If previously one developer could accidentally break one function, now an agent can confidently multiply an error across five modules in a short time, wrap it with mocks, cover it with snapshots, and even write tests for it that look solid but guarantee nothing.
For business, this is no longer a technical trifle, but a matter of money, deadlines, and process stability. According to materials from DORA 2024, AI is already a widespread part of day-to-day development. Playwright, Pact, Temporal, OpenTelemetry, Prometheus, OpenAI Developers, and Anthropic Engineering all make the same point in different forms: once a system becomes more complex and more automated, testing has to stop being a formality and become an engineering discipline of its own.
That is exactly why the question of how to write tests for modern AI agents is critical for business today. Especially if we are not talking about a landing page with three buttons, but about real systems with integrations, queues, scraping, voice streams, Telegram bots, webhooks, AI modules, and complex infrastructure.
At Ingello Systems we design and develop both corporate systems and startup products. So further on, I will break down the topic not in the style of abstract theory, but using real classes of applications that businesses face every day.
Why testing today matters not only to the developer, but also to the director, owner, and investor
To be completely honest, business is not interested in the number of tests at all. Business is interested in something else:
- whether money is being lost
- whether processes break after release
- whether the team is stuck on pause because of regression
- whether every change turns into a small civil war
- whether changes can be released quickly and without fear
- whether part of the work can be entrusted to AI agents without the feeling that you have put an intern in charge of a nuclear power plant
When a company does not have a normal testing strategy, development starts to resemble an old warehouse where boxes are stacked to the ceiling, the light bulb is flickering, and someone insists that the accounting system almost works. Formally, everything is standing in place. In reality, any movement can trigger an avalanche.
For a startup, the absence of tests means that the MVP cannot be properly grown further. For a corporate system, this is even more dangerous: a failure starts hitting not the pretty interface, but real operations, logistics, warehouse, finance, document flow, doctors, managers, support, and executives.
We have seen this in different projects. For example, in Prime EVA and GadFul the cost of an error in accounting and integrations is very far from academic. In Svit BUS or LEX an error in routes, booking, payment, or refunds hits money and user trust. In Evrika, Lita and L-Doc the quality of testing already comes down not simply to convenience, but to the reliability of processes and data.
The main idea: you need to test not code as text, but risk as an event
This is the part worth remembering.
A good test checks not that a function was called, but that a business risk is controlled.
For example:
- not just checking that an HTTP request was sent
- but checking that an order is genuinely not duplicated during repeated webhook delivery
- not just checking that the parser returned an array
- but checking that a new record on the page is detected once and gets into the database without duplicates
- not just checking that the Telegram API responded 200
- but checking that the user received the correct notification and the system did not send it three times
- not just checking that AI returned text
- but checking that the text meets the task requirements, does not break the workflow, and does not burn the budget for nothing
This is a very important shift in thinking. Especially when tests are written by an agent. Because AI really likes doing what is easy to automate, not what is actually important.
Why Codex and Claude often write tests that look smart but make no sense
Because by default, an agent is optimized for a local task: generate something that looks like a good test. Not for your specific business risk.
Therefore, without normal boundaries, it often does the following:
- tests internal implementation details instead of system behavior
- mocks absolutely everything and gets a green but false picture of the world
- writes overly expensive end-to-end tests where one fast unit test was needed
- creates snapshots of huge structures that nobody reads afterward
- duplicates the same check across several layers
- creates unstable tests that cannot be run on every PR
- does not mark paid AI evals separately from regular regression tests
As a result, the team gets not a quality system, but testing theater. Everything looks serious, everything blinks, CI makes noise, but there is no confidence.
What testing methodology modern applications need
The ordinary testing pyramid is no longer enough, although the idea itself is still useful. For modern systems, at Ingello Systems we usually use a more mature model: a portfolio of checks.
That is, we understand in advance which layers of risk exist in the product, and for each layer we build its own type of tests.
1. Deterministic domain tests
These are rules, formulas, statuses, state transitions, deduplication, constraints, access rights, mappings, calculations. Everything that must be predictable.
2. Component and integration tests
They check work with the database, files, queues, message brokers, cache, HTTP clients, external APIs, local containers.
3. Contract tests
They fix exactly how your service expects to communicate with an external system. What a request must contain, what response shape is acceptable, what to do when the schema changes.
4. Scenario and end-to-end tests
They check several key user or business scenarios from beginning to end. There should be few of them. Otherwise, you are not testing the system, but trying to cement the entire world.
5. Runtime checks
These are not quite classic tests, but a modern system cannot do without them. Synthetic checks, canary, health endpoints, monitoring, tracing, alerting. That is, checking that the system not only builds, but also lives.
6. AI evals and agent evals
If the product has an LLM, text generation, assistants, decision-making pipelines, or autonomous agents, they cannot be tested like ordinary functions. They need a separate evaluation system.
How to choose which tests to run always, and which only sometimes
Running all tests every time is pointless. Trying to do so is one of the most common mistakes of immature processes. Not every test should run at the same stage.
A normal strategy looks approximately like this.
On every commit or locally before push
- linters
- type checking
- fast unit tests
- some component tests without network
- tests for critical domain invariants
This must be fast. If everything here takes forever, the team will start bypassing the rules. And if the rules are constantly bypassed, then this is no longer a process, but a decorative sign.
On pull request
- all fast tests
- changed component/integration tests
- contract tests
- a small set of critical e2e scenarios
At night or on schedule
- slow end-to-end tests
- sandbox tests of external integrations
- mini load runs
- paid AI evals
- drift checks for scrapers and parsers
Before release
- smoke on staging
- migrations
- regression set for the critical business contour
- checking queues, cron, webhook endpoints, external callbacks
After release
- production smoke
- canary
- error metrics
- alerts for anomalies
- synthetic journey for key actions
The main idea: tests must have a launch context. Some protect speed. Others protect integrations. Still others protect the release, and the rest protect production.
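The split by launch context can be written down as data rather than kept as a team habit. Below is a minimal sketch under stated assumptions: the stage names, the label names, and the `Check` and `select_checks` helpers are all hypothetical and only illustrate the idea that a check runs at a stage when every one of its labels is allowed there.

```python
from dataclasses import dataclass

# Hypothetical stages and labels, loosely following the lists above.
STAGE_ALLOWED_LABELS = {
    "commit": frozenset({"unit", "invariant"}),
    "pull_request": frozenset(
        {"unit", "invariant", "component", "contract", "e2e_critical"}
    ),
    "nightly": frozenset(
        {"unit", "invariant", "component", "contract", "e2e_critical",
         "e2e_slow", "sandbox", "paid_ai", "scraper_drift"}
    ),
}


@dataclass(frozen=True)
class Check:
    name: str
    labels: frozenset


def select_checks(checks, stage):
    """Return checks whose labels are ALL allowed at the given stage.

    Requiring every label to be allowed means a check tagged as paid
    can never leak into a fast stage through a second, cheaper label.
    """
    allowed = STAGE_ALLOWED_LABELS[stage]
    return [check for check in checks if check.labels <= allowed]


# Illustrative check names, not taken from any real project.
CHECKS = [
    Check("discount_never_raises_price", frozenset({"unit", "invariant"})),
    Check("payment_api_contract", frozenset({"contract"})),
    Check("summary_quality_eval", frozenset({"paid_ai"})),
    Check("booking_happy_path", frozenset({"e2e_critical"})),
]
```

In a pytest-based project the same effect is usually achieved with custom markers and the `-m` selection option; the sketch only shows the rule itself.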
How to write tests for ordinary algorithmic applications
This is the clearest class of tasks. Here the world is still more or less honest: inputs, outputs, rules, statuses, calculations.
Such systems are found in CRM, ERP, WMS, logistics, booking, accounting, sales, document flow, financial operations. In our cases these are, for example, platFORMA, FORMA CRM, FORMA WMS, FORMA BPM, NorthWest, Taxer.
What to test here first
- golden cases — reference scenarios with a known correct result in advance
- boundary cases — empty, extreme, zero, negative values, time zones, date transitions, currencies, rounding
- invariants — properties that must always be preserved
- property-based tests — generation of a large amount of input data and checking general properties
- state transitions — what can and cannot be done from a specific status
Which invariants are useful for business
- the order amount cannot become negative
- the same webhook applied twice must not double the operation
- canceling an order must not create a new payment
- warehouse balances must reconcile with the movement log
- a discount must not increase the price
- a user without the required role must not see the closed data contour
For agents like Codex, this is a good layer. But you cannot simply tell the agent to write tests for the service. It needs an engineering specification:
- what business rules exist
- which invariants are critical
- which edge cases are mandatory
- which tests must be fast
- which dependencies are forbidden to mock
Otherwise, the agent will write something formally meaningful, but practically decorative.
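One of the invariants above, "a discount must not increase the price", can be checked on many generated inputs instead of a few hand-picked ones. The sketch below assumes a hypothetical `apply_discount` function and uses only the standard library with a fixed seed; a dedicated property-based testing library such as Hypothesis would normally do this job better.

```python
import random
from decimal import Decimal


def apply_discount(price: Decimal, percent: int) -> Decimal:
    """Hypothetical domain function: price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return (price * (100 - percent) / 100).quantize(Decimal("0.01"))


def check_discount_invariants(trials: int = 1000, seed: int = 12345) -> None:
    """Generate random inputs and assert the invariants on each of them.

    The fixed seed keeps a failure reproducible between runs.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        price = Decimal(rng.randint(0, 1_000_000)) / 100  # 0.00 .. 10000.00
        percent = rng.randint(0, 100)
        discounted = apply_discount(price, percent)
        # Invariant 1: the amount never becomes negative.
        assert discounted >= 0, (price, percent, discounted)
        # Invariant 2: a discount never increases the price.
        assert discounted <= price, (price, percent, discounted)


check_discount_invariants()
```

The failing input is included in the assertion message, so a red run gives the agent or the developer a concrete counterexample, not just "something is wrong".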
How to test applications with integrations
A modern system almost never lives alone. There are payments, CRM, ERP, an email provider, SMS, maps, analytics, S3-compatible storage, external catalogs, the client's ERP, partner APIs, external accounts, aggregators, and another dozen friends who cannot be fully trusted.
In such systems, the main idea is very simple: you test not someone else's service, but your contract with it.
A good strategy for external integration
- fast tests for serialization and mapping
- contract tests for request and response formats
- handling errors, timeouts, retries, partial failures
- sandbox runs on schedule
- production monitoring and alerts
What is especially important to check
- what happens if the external service returns a field of another type
- what happens if the response lacks the required field
- what happens if new fields appear
- what happens if the response arrives with a delay
- what happens if the event arrives again
- what happens if events arrive in the wrong order
- how rate limits are handled
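Several of the questions above (wrong field type, missing field, new fields) can be answered by a narrow check at the boundary, without calling the external service. The following is a minimal sketch assuming a hypothetical payment status payload; the field names and the `ContractError` type are invented for illustration and do not describe any real provider.

```python
from dataclasses import dataclass


class ContractError(ValueError):
    """Raised when an external payload violates the expected contract."""


@dataclass(frozen=True)
class PaymentStatus:
    payment_id: str
    amount_cents: int
    status: str


# Hypothetical contract: required fields and their expected types.
REQUIRED_FIELDS = {"payment_id": str, "amount_cents": int, "status": str}


def parse_payment_status(payload: dict) -> PaymentStatus:
    """Validate required fields and types; ignore unknown extra fields."""
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in payload:
            raise ContractError(f"missing field: {name}")
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ContractError(f"wrong type for field: {name}")
    return PaymentStatus(**{name: payload[name] for name in REQUIRED_FIELDS})
```

Ignoring unknown fields while failing loudly on missing or retyped ones is a design choice: it keeps the integration tolerant of additive changes on the other side and strict about breaking ones.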
In projects like Prime EVA, Vorfahr, NaturalTTS, GadFul and City Ingello integrations are no longer an accessory, but part of the system's skeleton. There, an error at the boundary between services quickly turns into a chain of side effects.
How to test scraping and parsing
Scraping and parsing are a wonderful example of how a naive approach breaks the engineering psyche. If you test such a system with one big e2e through the live internet, it will work and then not work, then depend on the time of day, then on the mood of someone else's website. Wild beauty, rural reliability.
It is more correct to divide the task into layers.
Layer 1. Fetcher
How exactly we get the page. What we do with 403, 404, 429, redirect, broken SSL, timeout, anti-bot protection.
Layer 2. Parser
How we extract entities from HTML, XML, JSON, feed, sitemap, or another format.
Layer 3. Normalizer
How we bring data to a unified form: dates, currencies, names, links, identifiers.
Layer 4. Dedupe
How we understand that this is a new record, not an old record in a different dress.
Layer 5. Writer
How we save the result to the database and what side effects we launch.
Layer 6. Scheduler
When and how often the system performs a repeat check at all.
What tests are needed here
- fixtures with real HTML pages and their versions
- narrow parser tests on local files
- tests for detecting a new record
- deduplication tests
- tests for correct normalization of dates, prices, links
- scheduler tests with fake clock
- one or two canary runs on the real source on schedule
A very useful practice is to store a set of HTML page versions and run the parser on these snapshots. Then a DOM change breaks a narrow test, not the whole release. This is much better than learning about the problem from production two days later with the phrase "for some reason new records are not coming in".
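Here is roughly what a snapshot-based parser test can look like. It is a sketch under assumptions: the two inline strings stand in for stored HTML files, and the markup convention (links with the class `item`) is hypothetical. Only the standard library `html.parser` module is used.

```python
from html.parser import HTMLParser

# Stand-ins for stored page versions; in a real project these would be files.
SNAPSHOT_V1 = '<ul><li><a class="item" href="/a/1">First</a></li></ul>'
SNAPSHOT_V2 = (
    '<div class="list">'
    '<article><a class="item featured" href="/a/1">First</a></article>'
    '<article><a class="item" href="/a/2"> Second </a></article>'
    "</div>"
)


class ItemParser(HTMLParser):
    """Collects {url, title} for every <a> tag whose class list has 'item'."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "a" and "item" in classes:
            self._href = attr.get("href")  # links without href are skipped
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.items.append({"url": self._href, "title": title})
            self._href = None


def parse_items(html: str) -> list:
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


def new_records(items: list, seen_urls: set) -> list:
    """Dedupe step: keep only items whose URL has not been seen before."""
    return [item for item in items if item["url"] not in seen_urls]
```

Running the same parser over both snapshots shows whether a layout change (here, `ul/li` replaced by `div/article`) breaks extraction, and the dedupe step can be asserted separately from fetching.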
For corporate solutions, this approach is especially important when the system builds monitoring of the market, price lists, partner catalogs, or updates to external documents.
How to test Telegram integrations and similar external channels
Here it is useful to immediately divide the system into two contours.
Outbound
We send a message to a user, manager, operator, doctor, client, or administrator.
Inbound
A user or external service sends us a command, callback, status, button, confirmation, attachment.
What to test for outbound
- message format
- escaping
- localization
- deduplication
- retries
- connection to the domain event
- correlation id for tracing
What to test for inbound
- payload validity
- command routing
- authorization and signatures, if applicable
- repeated delivery
- idempotency
- side effect in the database and queues
The Telegram Bot API documentation and materials on webhook mechanics are useful not only for setup, but also as a source of realistic payloads. The best test here is not an abstract fantasy, but a real saved update from a production-like environment.
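For the outbound side, escaping and deduplication can be tested without touching the network. The sketch below assumes MarkdownV2 formatting and escapes the characters that the Bot API documentation lists as reserved for that mode; the `Notifier` class, the injected `send` callable, and the in-memory key set are hypothetical simplifications (a real system would persist the keys).

```python
# Characters the Bot API documentation lists as requiring a backslash
# in MarkdownV2 text. Simplification: entity-specific rules are ignored.
MARKDOWN_V2_SPECIAL = "_*[]()~`>#+-=|{}.!"


def escape_markdown_v2(text: str) -> str:
    return "".join(
        "\\" + char if char in MARKDOWN_V2_SPECIAL else char for char in text
    )


class Notifier:
    """Sends each (chat_id, event_id) notification at most once."""

    def __init__(self, send):
        # `send(chat_id, text)` is an injected transport, so tests can
        # pass a fake instead of a real Telegram client.
        self._send = send
        self._sent_keys = set()

    def notify(self, chat_id: int, event_id: str, text: str) -> bool:
        key = (chat_id, event_id)
        if key in self._sent_keys:
            return False  # duplicate domain event: do not send again
        self._send(chat_id, escape_markdown_v2(text))
        self._sent_keys.add(key)
        return True
```

Tying the deduplication key to the domain event rather than to the message text means a retry of the same event stays silent, while a genuinely new event with identical wording is still delivered.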
Such integrations are often important in support systems, notifications, action confirmations, internal BPM contours, logistics services, and CRM modules.
How to test queues, delayed actions, and regular tasks
As soon as a queue, cron, or delayed workflow appears in the system, you get one more hidden character — time. And time in software systems likes to make mischief silently, elegantly, and without witnesses.
Therefore, tests for queues and background tasks need to be built not around sleep and hope, but around controlled time.
What is important to check here
- the task runs once where this is critical
- repeated delivery does not lead to duplicates
- an error leads to retry according to the correct policy
- after retries are exhausted, the task goes to DLQ or a special status
- order-sensitive events are processed correctly
- periodic jobs do not overlap and do not consume each other
- restarting the worker does not break the state
What helps technically
- fake clock
- time travel
- local broker in a container
- test queues
- explicit idempotency keys
- tracing message chains
If you use a workflow engine model, it is useful to look toward approaches similar to Temporal testing, where the test environment can work with time as a managed resource, not as mysticism.
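To make "controlled time" concrete, here is a small sketch of a retry queue driven by a fake clock. Everything in it is hypothetical (the `FakeClock`, `RetryQueue`, the exponential backoff policy, the in-memory dead letter list); it does not reproduce the API of Temporal or any real broker, it only shows that retries and the DLQ transition can be tested without a single `sleep`.

```python
from dataclasses import dataclass


class FakeClock:
    """Time source that only moves when the test says so."""

    def __init__(self, now: float = 0.0):
        self._now = now

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds


@dataclass
class Job:
    job_id: str
    attempts: int = 0
    run_at: float = 0.0


class RetryQueue:
    def __init__(self, clock, handler, max_attempts=3, base_delay=10.0):
        self.clock = clock
        self.handler = handler
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.pending = []
        self.dead_letter = []

    def enqueue(self, job_id: str) -> None:
        self.pending.append(Job(job_id, run_at=self.clock.now()))

    def run_due(self) -> None:
        """Run every job whose scheduled time has arrived."""
        due = [job for job in self.pending if job.run_at <= self.clock.now()]
        for job in due:
            self.pending.remove(job)
            job.attempts += 1
            try:
                self.handler(job.job_id)
            except Exception:
                if job.attempts >= self.max_attempts:
                    self.dead_letter.append(job)  # retries exhausted
                else:
                    # Exponential backoff: 10s, 20s, 40s, ...
                    delay = self.base_delay * 2 ** (job.attempts - 1)
                    job.run_at = self.clock.now() + delay
                    self.pending.append(job)
```

A test then enqueues a job with an always-failing handler, calls `run_due()`, advances the clock by the expected delay, and asserts the attempt count and the final move to `dead_letter`.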
In systems at the level of platFORMA, FORMA BPM, FORMA WMS and production solutions like Prime EVA this is especially critical, because there are many background processes, dependencies, and business consequences there.
How to test real-time data streams, for example voice through a microphone in the browser
This is where the part of modern development begins where the classic unit test is no longer king, but just one official in a large department.
If the user speaks by voice in the browser, you get:
- microphone and browser permissions
- a stream of audio chunks
- encoding and decoding
- network and latency
- connection interruptions
- server-side buffering
- UI reaction to partial results
- final transcription or action
How to reasonably divide such tests
- unit — chunk format, buffering, aggregators, state machine on the client and server
- component — transferring chunks through mock transport
- integration — a real websocket or streaming endpoint in a test environment
- scenario — short end-to-end tests with pre-prepared audio
- observability — latency, dropped chunks, reconnect, timeout metrics
What is important to check
- that parts of the stream are not lost
- that messages are assembled in the correct order
- that a short network outage does not destroy the entire state
- that the server correctly closes the session
- that the UI adequately shows the intermediate status
- that the user does not get a stuck session after reload
For such systems, synthetic and replay checks are especially important. That is, you store a set of reference streams and periodically replay them on the test contour. Otherwise, you risk finding out about a problem only when a real user starts getting angrier than usual into the microphone =)
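The "not lost" and "correct order" checks are easiest at the unit level, on the piece that reassembles chunks. The sketch below assumes each chunk carries a sequence number starting at 0; the `ChunkAssembler` class is hypothetical and stands in for whatever buffering component the real server uses.

```python
class ChunkAssembler:
    """Reassembles a byte stream from chunks that may arrive out of order
    or more than once (for example after a reconnect)."""

    def __init__(self):
        self._next_seq = 0
        self._buffer = {}
        self.output = bytearray()

    def push(self, seq: int, data: bytes) -> None:
        if seq < self._next_seq or seq in self._buffer:
            return  # duplicate chunk: already emitted or already buffered
        self._buffer[seq] = data
        # Emit every chunk that is now contiguous with the output.
        while self._next_seq in self._buffer:
            self.output += self._buffer.pop(self._next_seq)
            self._next_seq += 1

    def missing(self) -> list:
        """Sequence numbers that block buffered chunks from being emitted."""
        if not self._buffer:
            return []
        return [
            seq
            for seq in range(self._next_seq, max(self._buffer))
            if seq not in self._buffer
        ]
```

A replay check can feed a recorded stream through the same class in shuffled order and compare `output` with the reference bytes, while `missing()` is a natural source for a dropped-chunks metric.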
We encounter these principles in projects that involve real-time interaction, voice, communication contours, or complex client-server behavior, for example in NaturalTTS and a number of product AI modules.
How to test external webhooks
A webhook is simple only in a presentation. In practice, a webhook is a small door into the house through which the outside world can come in with papers, dirty boots, and sometimes with an axe.
What must be checked for a webhook endpoint
- payload schema validity
- signature and source verification
- idempotency
- retry handling
- late delivery handling
- correct reaction to unknown fields
- backward compatibility when the external schema changes
- audit and tracing
What not to do
- do not tie everything to one fragile giant e2e
- do not assume that 200 OK means business success
- do not keep only a minimal log with no ability to investigate
- do not rely only on the external system as the source of truth
A very useful practice is to store raw webhook payloads in a separate storage or journal so that real incidents can be replayed. Otherwise, investigating an outage turns into the genre of "I think something came in there, but that is not certain".
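Signature verification, idempotency, and the raw payload journal fit into one small testable unit. The following sketch assumes an HMAC-SHA256 signature over the raw body and an `id` field in the event; both are assumptions for illustration, since every provider defines its own signing scheme and payload shape. The `WebhookEndpoint` class and its in-memory storage are hypothetical.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking information through comparison time.
    return hmac.compare_digest(sign(secret, body).encode(), signature.encode())


class WebhookEndpoint:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.journal = []          # raw bodies, kept for incident replay
        self.processed_ids = set()
        self.applied = []          # events that changed business state

    def receive(self, body: bytes, signature: str) -> int:
        """Return an HTTP-like status code for one delivery."""
        if not verify_signature(self.secret, body, signature):
            return 401
        self.journal.append(body)  # journal every authentic delivery
        try:
            event = json.loads(body)
        except ValueError:
            return 400
        if not isinstance(event, dict) or not isinstance(event.get("id"), str):
            return 400
        if event["id"] in self.processed_ids:
            return 200  # repeated delivery: acknowledge, apply nothing
        self.processed_ids.add(event["id"])
        self.applied.append(event)
        return 200
```

Note that the second delivery also returns 200: the status code alone says nothing about whether business state changed, which is why the test has to assert on `applied`, not on the response.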
How to test AI integrations if the tests are paid
This is one of the most interesting and most painful areas.
When the system has text generation, classification, entity extraction, summarization, ranking, copilots, chats, voice modules, or agent-based decision loops, tests really do become paid. Because every model call is a cost in budget, time, and sometimes also instability.
Therefore AI tests must not be mixed with regular fast regression tests.
The right division of AI checks by cost
- free layer — tests of prompt wrapping, schemas, parsing, guardrails, post-processing, fallback logic, and routing without a real model call
- cheap layer — a limited set of smoke evals on a small reference sample
- expensive layer — full evals on a schedule, before release, or at a dedicated stage
What must be separated
- paid tests
- nightly AI evals
- manual review cases
- benchmark runs
That is, you should have separate groups in CI. Otherwise, the team will accidentally start running paid evals on every commit and quickly feel like the owner of a casino where the roulette wheel is spinning, but for some reason the profit is not yours.
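The free layer is the easiest one to hand to an agent, because the model call can be replaced with a plain function. Here is a minimal sketch, assuming a hypothetical ticket classifier whose model is expected to reply with JSON; `classify_ticket`, the label set, and the prompt wording are all invented for illustration.

```python
import json
from typing import Callable

ALLOWED_LABELS = {"billing", "technical", "other"}


def classify_ticket(text: str, call_model: Callable[[str], str]) -> str:
    """Wrap the prompt, call the model, parse the reply, apply a fallback.

    `call_model` is injected, so tests can pass a fake that costs nothing.
    """
    prompt = (
        "Classify the ticket as one of: "
        + ", ".join(sorted(ALLOWED_LABELS))
        + '. Reply with JSON like {"label": "billing"}.\nTicket: '
        + text
    )
    raw = call_model(prompt)
    try:
        label = json.loads(raw)["label"]
    except (ValueError, KeyError, TypeError):
        return "other"  # fallback: malformed or unexpected reply shape
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return "other"  # guardrail: label outside the allowed set
```

Tests for this layer pass canned replies (valid JSON, broken JSON, an unknown label) and never reach a paid endpoint, so they can run on every commit; the real model is exercised only in the cheap and expensive layers.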
Materials from OpenAI Developers and engineering notes from Anthropic emphasize the general idea well: for AI systems, quality must be evaluated systematically, reproducibly, and on representative task sets, not by one beautiful example.
How to test AI integrations if the result cannot be checked algorithmically
Now this is a real adult conversation.
If the system generates free-form text, recommendations, emails, summaries, instructions, hypotheses, content, or assistant responses, then a simple assert equals is almost always useless. The answer may be good, but different. Or bad, but formally similar.
Therefore, other methods are used here.
1. Reference-based evals
There are reference examples and expected characteristics of the result. It does not have to be a letter-for-letter match, but there are quality criteria.
2. Rule-based graders
Part of the result can be checked algorithmically:
- whether required entities are present
- whether the structure has been broken
- whether the limit has been exceeded
- whether there are prohibited phrasings
- whether key facts have been preserved
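The checks in this list translate almost directly into a grader that returns a list of failures instead of a single boolean. The sketch below is an assumption-laden illustration: the `Rules` structure, the substring matching for required terms, and the regular expressions for prohibited phrasings are hypothetical choices, not a standard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rules:
    required_terms: tuple      # entities or facts that must appear
    forbidden_patterns: tuple  # regular expressions that must not match
    max_chars: int             # length limit for the answer


def grade(text: str, rules: Rules) -> list:
    """Return human-readable failures; an empty list means the text passed."""
    failures = []
    lowered = text.lower()
    for term in rules.required_terms:
        if term.lower() not in lowered:
            failures.append(f"missing required term: {term}")
    for pattern in rules.forbidden_patterns:
        if re.search(pattern, text, re.IGNORECASE):
            failures.append(f"forbidden phrasing matched: {pattern}")
    if len(text) > rules.max_chars:
        failures.append(f"too long: {len(text)} > {rules.max_chars}")
    return failures
```

Returning every failure at once, rather than stopping at the first, makes the grader useful for reports: a nightly run can count which rule fails most often across the whole sample.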
3. LLM-as-a-judge
Another AI evaluates the result according to criteria. But caution is needed here. Such a grader also has to be calibrated, tested, and kept under control.
4. Pairwise comparison
Useful when you compare two versions of a prompt, two models, or two pipelines and want to understand which version is objectively better on a set of tasks.
5. Human review for high-risk slices
Where the cost of error is high, you cannot remove the human completely. Especially in domains with complex expertise, medicine, regulation, sensitive communications, financial recommendations, and complex B2B processes.
In practice, for business this means the following: do not try to pretend that all AI behavior can be reduced to an exact algorithmic assert. In most serious systems, that is a lie. You need to honestly separate:
- what is checked strictly
- what is checked heuristically
- what is checked by another AI
- what is checked by a human
Do you need to run another AI to test AI
Sometimes yes. But as a tool, not as a religion.
LLM-as-a-judge is useful when:
- you need to evaluate the semantic completeness of the answer
- you need to check compliance with style
- you need to compare several answer options
- you need to quickly evaluate a large number of generations
But it must not be turned into the only source of truth. Because then you are building a nesting doll of probabilities: one nondeterministic object evaluates another nondeterministic object, while you are trying to explain to the director why quality seems to have improved, but in some places has gotten worse.
Good practice
- first algorithmic checks where possible
- then rule-based grading
- then LLM-judge for part of the metrics
- then selective human validation
Such a cascade usually works better than betting on one magical grader.
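The ordering of the cascade matters mostly for cost: cheap checks run first, and the judge is called only for answers that survive them. A minimal sketch, assuming the judge is an injected callable that returns a score between 0 and 1; the function name, the thresholds, and the "send borderline scores to a human" rule are hypothetical.

```python
def evaluate_answer(answer, hard_checks, judge, pass_score=0.7, review_margin=0.1):
    """Run cheap checks first; call the (possibly paid) judge only if they pass.

    hard_checks: list of (name, predicate) pairs, evaluated in order.
    judge: callable returning a float score in [0, 1]; in tests it is a stub.
    """
    for name, check in hard_checks:
        if not check(answer):
            # Failed a deterministic check: no judge call, no extra cost.
            return {"verdict": "fail", "stage": name, "needs_human": False}
    score = judge(answer)
    verdict = "pass" if score >= pass_score else "fail"
    # Scores close to the threshold are the least trustworthy ones,
    # so they are flagged for selective human validation.
    needs_human = abs(score - pass_score) <= review_margin
    return {"verdict": verdict, "stage": "judge", "needs_human": needs_human}
```

In tests the judge is replaced with a stub that records its calls, which makes it possible to assert that an answer rejected by a hard check never reached the paid stage.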
How to test agentic systems if it is not clear at all what normal is
This is probably the most important question in the entire article.
An agent is no longer just a function that receives input and returns output. An agent can:
- plan steps
- use tools
- go to the internet or through internal APIs
- read documents
- make intermediate decisions
- change state
- launch child processes
- make a mistake not in the answer, but in the route
Therefore, an agentic system needs to be tested at least on three levels.
Level 1. Outcome
Whether the correct business result has been achieved. Not just whether the agent said something plausible, but whether the correct final result was obtained.
Level 2. Process
Whether prohibited actions occurred along the way. For example, extra tool calls, dangerous operations, an incorrect order of steps, data leakage, unnecessary costs.
Level 3. Cost and stability
How many steps the route took, how much it cost, how much time it took, and how reproducible the result is.
What is especially useful in agent evals
- fixed task set
- several trials for one task
- log of all tool calls
- trace grading
- assert on the final state of the system
- limits on time, cost, and number of steps
Agentic tests often need to check not only a text answer, but a real state change:
- whether an object was created in the DB
- whether someone else’s data was affected
- whether an extra message was sent
- whether a prohibited integration was called
- whether the route exceeded the budget
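Trace grading and the final-state assert can live in one function that takes the recorded tool calls and a snapshot of the resulting state. The sketch below is hypothetical throughout: the `ToolCall` and `Limits` structures, the tool names used in examples, and the dictionary-based state are illustrative stand-ins for a real trace format and a real database check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    cost_usd: float


@dataclass(frozen=True)
class Limits:
    max_steps: int
    max_cost_usd: float
    forbidden_tools: frozenset


def grade_trace(trace, final_state, expected_state, limits) -> list:
    """Return a list of problems; an empty list means the run passed."""
    problems = []
    if len(trace) > limits.max_steps:
        problems.append(f"too many steps: {len(trace)}")
    total_cost = sum(call.cost_usd for call in trace)
    if total_cost > limits.max_cost_usd:
        problems.append(f"budget exceeded: {total_cost:.2f}")
    for call in trace:
        if call.tool in limits.forbidden_tools:
            problems.append(f"forbidden tool called: {call.tool}")
    # Outcome level: compare the real resulting state, not the agent's text.
    for key, expected in expected_state.items():
        if final_state.get(key) != expected:
            problems.append(f"wrong final state for: {key}")
    return problems


def pass_rate(problem_lists) -> float:
    """Share of trials with no problems, for several trials of one task."""
    return sum(1 for problems in problem_lists if not problems) / len(problem_lists)
```

Because agent runs are not deterministic, a single green trial proves little; running the same task several times and tracking `pass_rate` gives a number that can be compared between prompt or model versions.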
If you have agentic development automation, an internal technical copilot, an AI process operator, or a hybrid workflow with an LLM, this becomes mandatory. The FRACTAL case is especially relevant here: automation of development and engineering pipelines requires a completely different level of testing discipline than a regular CRUD service.
How to test complex infrastructure tasks and a cold start of the system
This is exactly the part many people remember only after an outage. Although infrastructure often determines whether your beautiful product will work after a machine restart, a new server rollout, a container crash, a database migration, or an orchestration update.
What needs to be checked here
- the system starts from zero in a clean environment
- migrations are applied correctly
- services start in the correct order
- readiness and liveness truly reflect the state
- queues and workers recover after restart
- the environment configuration is complete and consistent
- temporary absence of a dependency does not kill the entire contour
- cache, file storage, database, and broker are rebuilt predictably
How to test this in practice
- ephemeral environment with the entire stack brought up
- smoke after deployment
- recreating test environments from scratch
- chaos-lite exercises for restarting individual services
- boot sequence check
- automatic health audit after startup
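A boot sequence check can be reduced to comparing the observed startup order with a declared dependency graph. The sketch below assumes a hypothetical set of services and uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+); how the observed order is collected (container events, health audit logs) is left outside the example.

```python
from graphlib import TopologicalSorter

# Hypothetical services; each maps to the services it depends on.
DEPENDS_ON = {
    "database": set(),
    "broker": set(),
    "api": {"database", "broker"},
    "worker": {"database", "broker"},
    "webhook_consumer": {"api"},
}


def boot_order(depends_on: dict) -> list:
    """One valid startup order: dependencies always come first."""
    return list(TopologicalSorter(depends_on).static_order())


def boot_violations(observed_order: list, depends_on: dict) -> list:
    """Compare an observed startup order with the declared dependencies."""
    position = {name: index for index, name in enumerate(observed_order)}
    problems = [
        f"{service} did not start"
        for service in depends_on
        if service not in position
    ]
    for service, dependencies in depends_on.items():
        if service not in position:
            continue
        for dependency in sorted(dependencies):
            if dependency in position and position[dependency] > position[service]:
                problems.append(f"{service} started before {dependency}")
    return problems
```

The same function can run in the smoke stage after every deployment to an ephemeral environment: an empty list means the cold start respected the declared order, and a cyclic declaration fails early with `graphlib.CycleError`.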
In infrastructure-heavy systems, this is a mandatory part of quality. Especially if you have several services, a broker, a database, cron, AI workers, storage, a webhook-consumer, and streaming components.
Such contours are characteristic of production, accounting, logistics, and AI systems. In projects like Prime EVA, Vorfahr, NaturalTTS, platFORMA infrastructure testing is no longer optional.
What a mature testing system must have
- risk map — exactly where the system can cause damage
- launch contour map — what runs locally, on PR, at night, before release, after release
- test labeling — fast, slow, e2e, paid, ai, flaky-review, integration
- stable test data
- observability — logs, metrics, tracing, correlation id
- reproducibility — fixed fixtures, payload versions, seed, controlled environment
- cost model — which tests are expensive and how often it is reasonable to run them
- rules for AI agents — exactly what they are allowed to write and change
How to give a task to Codex, Claude, and other agents so they write good tests
This is a separate important topic. A bad prompt creates bad tests. And this is not because the agent is stupid. It is because you gave a task in the style of "write tests", but expected architectural maturity.
Incorrect task statement
Write tests for this service.
Correct task statement
Here is the business context. Here are the critical invariants. Here is what must not be mocked. Here is what counts as a successful result. Here are which tests must be fast. Here are which scenarios belong to PR and which to nightly. Here are the risks that have already occurred in production. Here is where to test behavior, not implementation details.
What is worth specifying explicitly to the agent
- testing layer: unit, component, integration, contract, e2e, eval
- what is the object of verification
- which invariants are mandatory
- which dependencies are allowed to be mocked
- which real fixtures to use
- what launch cost is acceptable
- which tests must not be added to the fast pipeline
- which failure modes are already known
A very useful rule
Ask the agent to first produce a testing plan, not test code immediately. Let it list:
- risks
- layers
- launch contours
- which checks are needed
- which checks are not needed
And only after that — code generation.
This sharply reduces the number of meaningless tests.
How we would break this down by typical Ingello Systems business scenarios
Logistics and booking
For projects like Svit BUS, LEX, BusTicket, UNO Taxi tests should cover:
- search and routing algorithms
- pricing and discounts
- refunds and cancellations
- maps and geo-data
- roles, accounts, partner access
- reprocessing external events
- payments and reconciliation
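Reprocessing external events and payments share one invariant that is cheap to test: the same event delivered twice must not change the result twice. The sketch below uses a hypothetical in-memory `PaymentLedger` in place of a real payments table; it is not taken from any of the projects named above.

```python
from decimal import Decimal


class PaymentLedger:
    """In-memory stand-in for a payments table keyed by external event id."""

    def __init__(self):
        self._seen_event_ids = set()
        self.balance = Decimal("0")

    def apply(self, event_id: str, amount: Decimal) -> bool:
        """Apply a payment event once; return False if it was a duplicate."""
        if event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event_id)
        self.balance += amount
        return True


def test_redelivered_event_does_not_double_charge():
    ledger = PaymentLedger()
    assert ledger.apply("evt-1", Decimal("120.50")) is True
    # The provider retries the webhook with the same event id.
    assert ledger.apply("evt-1", Decimal("120.50")) is False
    assert ledger.balance == Decimal("120.50")
```

The test checks behavior (the balance after a retry), not how deduplication is implemented, so it survives a move from a set to a unique database constraint.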
Corporate platforms
For platFORMA, FORMA CRM, FORMA WMS, FORMA BPM, and FORMA HRM, the key tests center on:
- access rights
- workflow states
- document flow
- movement of goods
- synchronization between modules
- scheduled tasks and notifications
- idempotency and audit
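Workflow states lend themselves to an exhaustive test: list the allowed transitions once, then check that every other pair of states is rejected. The states and the `ALLOWED` table below are hypothetical and stand in for a real document or order workflow.

```python
# Hypothetical workflow: each state maps to the states it may move to.
ALLOWED = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected"},
    "approved": {"archived"},
    "rejected": {"draft"},
    "archived": set(),
}


class IllegalTransition(Exception):
    pass


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise IllegalTransition(f"{current} -> {target}")
    return target


def illegal_pairs():
    """Every (from, to) pair that the workflow must reject."""
    states = list(ALLOWED)
    return [(a, b) for a in states for b in states if b not in ALLOWED[a]]


def test_every_illegal_transition_is_rejected():
    for current, target in illegal_pairs():
        try:
            transition(current, target)
        except IllegalTransition:
            continue
        raise AssertionError(f"{current} -> {target} should have been rejected")
```

Because the illegal pairs are derived from the table, adding a new state automatically extends the test, and an agent cannot quietly "fix" a failing case by loosening a single assert.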
Medicine and regulated domains
For Evrika, Lita, L-Doc, Rapport, and Dent Ingello, testing must be especially disciplined around:
- data structure
- access rights and privacy
- record integrity
- integration compatibility
- stability of forms, accounts, and data routes
- controlled AI behavior, if it is used
Production, accounting, and integrations
For Prime EVA, GadFul, Carveli, and SKLO, the following are especially important:
- accounting convergence
- production and status chains
- external integrations
- scheduled synchronizations
- mass background operations
- recovery after failures
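For the first item in this list, a useful invariant is that balances recomputed from raw movements must match the balances the system reports. The sketch below assumes a simple stock ledger with signed quantities per SKU; the function names and data shapes are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def closing_balances(opening, movements):
    """Recompute balances from opening stock and signed movements.

    opening:   {sku: Decimal quantity}
    movements: iterable of (sku, signed Decimal quantity)
    """
    totals = defaultdict(Decimal, opening)  # unknown SKUs start at Decimal("0")
    for sku, qty in movements:
        totals[sku] += qty
    return dict(totals)


def discrepancies(expected, reported):
    """Map each SKU whose balances disagree to (expected, reported)."""
    zero = Decimal("0")
    result = {}
    for sku in set(expected) | set(reported):
        exp, rep = expected.get(sku, zero), reported.get(sku, zero)
        if exp != rep:
            result[sku] = (exp, rep)
    return result


opening = {"A": Decimal("10")}
movements = [("A", Decimal("-3")), ("B", Decimal("5")), ("A", Decimal("2"))]
reported = {"A": Decimal("9"), "B": Decimal("4")}  # B is off by one unit
```

Run nightly against a production-like snapshot, an empty result from `discrepancies` is the pass condition; any entry names the SKU to investigate.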
AI and development automation
For FRACTAL, Vorfahr, and NaturalTTS, you already need separate AI evals, cost controls, tool-use tracing, grading, and scenario checks of agentic routes.
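A minimal sketch of what an AI eval with grading and a cost control can look like is shown below. It is an illustration under stated assumptions: the model is any callable returning `(answer, cost_in_cents)`, the grader is a deliberately simple keyword check standing in for a real rubric or judge, and all names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    must_include: tuple          # phrases a correct answer has to contain
    must_not_include: tuple = () # phrases that mark the answer as wrong


def grade(answer: str, case: EvalCase) -> bool:
    """Keyword grader: a simple stand-in for a real rubric or judge model."""
    text = answer.lower()
    has_required = all(k.lower() in text for k in case.must_include)
    has_forbidden = any(k.lower() in text for k in case.must_not_include)
    return has_required and not has_forbidden


def run_eval(model, cases, runs_per_case=3, min_pass_rate=0.8, max_cost_cents=100):
    """Run each case several times and judge the pass rate, not a single answer."""
    passed = total = cost_cents = 0
    for case in cases:
        for _ in range(runs_per_case):
            answer, call_cost = model(case.prompt)
            cost_cents += call_cost
            total += 1
            passed += grade(answer, case)
    pass_rate = passed / total if total else 0.0
    return {
        "pass_rate": pass_rate,
        "cost_cents": cost_cents,
        "ok": pass_rate >= min_pass_rate and cost_cents <= max_cost_cents,
    }


def stub_model(prompt):
    # Stub used only to show the harness; a real run would call the model API.
    return "Refunds are allowed within 14 days of purchase.", 1


cases = [EvalCase("What is the refund window?", ("14 days",), ("30 days",))]
report = run_eval(stub_model, cases)
```

The pass condition is a threshold over repeated runs plus a budget, which is why a harness like this belongs in a scheduled or pre-release stage rather than in fast CI.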
Common mistakes that kill the real benefit of tests
- chasing coverage as an end in itself
- identical checks at all layers
- complete dependence on mocks
- too much e2e
- no labeling of slow and paid tests
- no production canary or observability
- mixing AI evals with fast CI
- trying to check a nondeterministic system with deterministic asserts where it does not work
- no real production-like payloads and fixtures
- delegating test writing to an agent without describing the risks
Practical conclusion for business
Reduced to one down-to-earth conclusion, everything above comes to this:
Testing in 2026 is no longer an add-on to development, but part of business architecture.
Because modern products are not just code. They are a bundle of domain logic, integrations, queues, AI, real-time pipelines, external events, and infrastructure. And if you check only one layer, the others begin to rot quietly, politely, and with a corporate smile.
Therefore, a good process looks like this:
- we know what risks the system has
- we understand which tests are needed at which layer
- we do not run everything indiscriminately on every commit
- we separate fast checks from expensive and rare ones
- we design AI-evals separately
- we test not only code, but also routes, events, state, and cost
- we build observability as part of quality control
What to do if you already have a project, but its tests are chaotic
In such a situation, you usually do not need to start with the heroic idea of covering everything. That is a path to beautiful exhaustion without a real result.
You need to do it like adults:
- identify critical business scenarios
- determine the most expensive risks
- divide the system into deterministic and nondeterministic parts
- choose a basic set of fast tests
- allocate integration and e2e scenarios only for key chains
- design AI-evals separately
- set up observability and post-release smoke
And only after that connect Codex, Claude, and other agents as accelerators, not as shamans.
Conclusion
Modern AI agents are not a magic button and not enemies. They are amplifiers. They amplify both good engineering culture and bad engineering culture. If the testing architecture is mature, agents sharply accelerate development. If there is no architecture, they accelerate chaos.
Therefore, the right question today is not "can AI write tests?" Of course it can. The right question is "can your company define a testing task at the level of the system, risk, and business?"
This is the difference between just code and real engineering.
If you need a project where testing, architecture, automation, integrations, and AI are designed from the start as a single system, take a look at our approach at Ingello Systems. We work with both startups and established companies: from MVPs and AI modules to heavy enterprise systems, CRM, WMS, logistics, medicine, production, and accounting solutions. And we also love breaking chaos down into details and turning it into a manageable system. That, to be honest, is the whole thrill of engineering =)
