What Business Needs to Know About Modern AI Development and Testing

Reading: 24 min Published: 17.03.2026

I have often seen a business try to vibe-code on its own, or hire recent graduates with no experience in real programming and design, to get applications written cheaply and quickly. What could go wrong? You no longer need to know programming: you speak in your own language, AI writes everything for you, and you can even ask it to cover the code with tests and follow architecture best practices... Right? Or not? =) Anyone who has written something more complex than a mini-parser or a simple integration (which can usually be found for free and does not need to be written at all) knows that the free ride ends quickly. As soon as the codebase grows, AI starts hallucinating, duplicating code, inventing absurd business processes, and getting confused, and the more you write "no, do it properly, do it well", the worse it gets. Sound familiar? Let's discuss how to write code faster and cheaper while minimizing gray hair, dead neurons, and dopamine crashes.

How to properly write tests for Codex, Claude, and other modern agents

An article by the Ingello Systems team

Modern development has changed radically. Just yesterday, business lived in a world where people wrote code, people wrote tests, and bugs were also human, sometimes lazy, sometimes talented. Today AI agents have joined the team: Codex, Claude, and other systems that can generate code, edit architecture, run tests, write pull requests, and say with a very confident face: "everything is ready" =)

The problem is that speed has increased, and the cost of error has also increased. If previously one developer could accidentally break one function, now an agent can confidently multiply an error across five modules in a short time, wrap it with mocks, cover it with snapshots, and even write tests for it that look solid but guarantee nothing.

For business, this is no longer a technical trifle, but a matter of money, deadlines, and process stability. According to materials from DORA 2024, AI has already widely entered the daily practice of development. Playwright, Pact, Temporal, OpenTelemetry, Prometheus, OpenAI Developers and Anthropic Engineering in different forms say the same thing: if a system has become more complex and more automated, then testing must become not a formality, but a separate engineering discipline.

That is exactly why the question of how to write tests for modern AI agents is critical for business today. Especially if we are not talking about a landing page with three buttons, but about real systems with integrations, queues, scraping, voice streams, Telegram bots, webhooks, AI modules, and complex infrastructure.

At Ingello Systems we design and develop both corporate systems and startup products. So further on, I will break down the topic not in the style of abstract theory, but using real classes of applications that businesses face every day.

Why testing today matters not only to the developer, but also to the director, owner, and investor

To be completely honest, business is not interested in the number of tests at all. Business is interested in something else:

whether money is being lost
whether processes break after release
whether the team is stuck on pause because of regression
whether every change turns into a small civil war
whether changes can be released quickly and without fear
whether part of the work can be entrusted to AI agents without the feeling that you have put an intern in charge of a nuclear power plant

When a company does not have a normal testing strategy, development starts to resemble an old warehouse where boxes are stacked to the ceiling, the light bulb is flickering, and someone insists that the accounting system almost works. Formally, everything is standing in place. In reality, any movement can trigger an avalanche.

For a startup, the absence of tests means that the MVP cannot be properly grown further. For a corporate system, this is even more dangerous: a failure starts hitting not the pretty interface, but real operations, logistics, warehouse, finance, document flow, doctors, managers, support, and executives.

We have seen this in different projects. For example, in Prime EVA and GadFul, the cost of an error in accounting and integrations is very far from academic. In Svit BUS or LEX, an error in routes, booking, payment, or refunds hits money and user trust. In Evrika, Lita, and L-Doc, the quality of testing is no longer a matter of convenience, but of the reliability of processes and data.

The main idea: you need to test not code as text, but risk as an event

This is the part worth remembering.

A good test checks not that a function was called, but that a business risk is controlled.

For example:

not just checking that an HTTP request was sent
but checking that an order is genuinely not duplicated during repeated webhook delivery
not just checking that the parser returned an array
but checking that a new record on the page is detected once and gets into the database without duplicates
not just checking that the Telegram API responded 200
but checking that the user received the correct notification and the system did not send it three times
not just checking that AI returned text
but checking that the text meets the task requirements, does not break the workflow, and does not burn the budget for nothing

This is a very important shift in thinking. Especially when tests are written by an agent. Because AI really likes doing what is easy to automate, not what is actually important.
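To make the difference concrete, here is a minimal sketch of a risk-oriented test for the repeated webhook case. The `OrderService` class, the `event_id` field, and the in-memory storage are assumptions made for illustration, not part of any real project.

```python
from dataclasses import dataclass, field


@dataclass
class OrderService:
    """Toy in-memory service; the list and set stand in for real tables."""
    orders: list = field(default_factory=list)
    seen_event_ids: set = field(default_factory=set)

    def handle_payment_webhook(self, event: dict) -> str:
        # The event id acts as an idempotency key: a repeat is acknowledged, not applied.
        if event["event_id"] in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event["event_id"])
        self.orders.append({"order_id": event["order_id"], "amount": event["amount"]})
        return "created"


def test_repeated_delivery_does_not_duplicate_order():
    service = OrderService()
    event = {"event_id": "evt-1", "order_id": "ord-42", "amount": 1500}
    assert service.handle_payment_webhook(event) == "created"
    assert service.handle_payment_webhook(event) == "duplicate"
    # The assertion targets the business outcome, not the fact that a handler was called.
    assert len(service.orders) == 1


test_repeated_delivery_does_not_duplicate_order()
```

The test says nothing about which internal methods were called. It only states the event that must never happen: two orders for one payment.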

Why Codex and Claude often write tests that look smart but make no sense

Because by default, an agent is optimized for a local task: generate something that looks like a good test. Not for your specific business risk.

Therefore, without normal boundaries, it often does the following:

tests internal implementation details instead of system behavior
mocks absolutely everything and gets a green but false picture of the world
writes overly expensive end-to-end tests where one fast unit test was needed
creates snapshots of huge structures that nobody reads afterward
duplicates the same check across several layers
creates unstable tests that cannot be run on every PR
does not mark paid AI evals separately from regular regression tests

As a result, the team gets not a quality system, but a theatrical testing set. Everything is serious, everything blinks, CI makes noise, but there is no confidence.

What testing methodology modern applications need

The ordinary testing pyramid is no longer enough, although the idea itself is still useful. For modern systems, at Ingello Systems we usually use a more mature model: a portfolio of checks.

That is, we understand in advance which layers of risk exist in the product, and for each layer we build its own type of tests.

1. Deterministic domain tests

These are rules, formulas, statuses, state transitions, deduplication, constraints, access rights, mappings, calculations. Everything that must be predictable.

2. Component and integration tests

They check work with the database, files, queues, message brokers, cache, HTTP clients, external APIs, local containers.

3. Contract tests

They fix exactly how your service expects to communicate with an external system. What a request must contain, what response shape is acceptable, what to do when the schema changes.

4. Scenario and end-to-end tests

They check several key user or business scenarios from beginning to end. There should be few of them. Otherwise, you are not testing the system, but trying to cement the entire world.

5. Runtime checks

These are not quite classic tests, but a modern system cannot do without them. Synthetic checks, canary, health endpoints, monitoring, tracing, alerting. That is, checking that the system not only builds, but also lives.

6. AI evals and agent evals

If the product has an LLM, text generation, assistants, decision-making pipelines, or autonomous agents, they cannot be tested like ordinary functions. They need a separate evaluation system.

How to choose which tests to run always, and which only sometimes

Running all tests every time is pointless. This is one of the most common mistakes of young processes. Not every test should live in the same contour.

A normal strategy looks approximately like this.

On every commit or locally before push

linters
type checking
fast unit tests
some component tests without network
tests for critical domain invariants

This must be fast. If everything here takes forever, the team will start bypassing the rules. And if the rules are constantly bypassed, then this is no longer a process, but a decorative sign.

On pull request

all fast tests
changed component/integration tests
contract tests
a small set of critical e2e scenarios

At night or on schedule

slow end-to-end tests
sandbox tests of external integrations
mini load runs
paid AI evals
drift checks for scrapers and parsers

Before release

smoke on staging
migrations
regression set for the critical business contour
checking queues, cron, webhook endpoints, external callbacks

After release

production smoke
canary
error metrics
alerts for anomalies
synthetic journey for key actions

The main idea: tests must have a launch context. Some protect development speed, others protect integrations, others protect the release, and still others protect production.
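One way to keep launch contexts explicit is a small table that maps each stage to the test labels allowed there. This is only a sketch: the stage names, the labels, and the `select_tests` helper are hypothetical and should be adapted to your own CI.

```python
# Hypothetical stages and labels; each test carries exactly one label in this sketch.
STAGE_LABELS = {
    "commit": {"unit", "invariant"},
    "pull_request": {"unit", "invariant", "component", "contract", "e2e_critical"},
    "nightly": {"e2e_slow", "sandbox", "load", "ai_paid", "drift"},
    "pre_release": {"smoke_staging", "migration", "regression"},
    "post_release": {"smoke_prod", "canary", "synthetic"},
}


def select_tests(tests: dict[str, str], stage: str) -> list[str]:
    """Return the names of tests whose label is allowed at the given stage."""
    allowed = STAGE_LABELS[stage]
    return sorted(name for name, label in tests.items() if label in allowed)


tests = {
    "test_discount_rules": "unit",
    "test_partner_contract": "contract",
    "test_full_llm_eval": "ai_paid",
}
print(select_tests(tests, "commit"))
print(select_tests(tests, "nightly"))
```

With such a table, a paid eval cannot end up in the commit stage by accident: it only runs where its label is listed.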

How to write tests for ordinary algorithmic applications

This is the clearest class of tasks. Here the world is still more or less honest: inputs, outputs, rules, statuses, calculations.

Such systems are found in CRM, ERP, WMS, logistics, booking, accounting, sales, document flow, financial operations. In our cases these are, for example, platFORMA, FORMA CRM, FORMA WMS, FORMA BPM, NorthWest, Taxer.

What to test here first

golden cases — reference scenarios with a known correct result in advance
boundary cases — empty, extreme, zero, negative values, time zones, date transitions, currencies, rounding
invariants — properties that must always be preserved
property-based tests — generation of a large amount of input data and checking general properties
state transitions — what can and cannot be done from a specific status

Which invariants are useful for business

the order amount cannot become negative
the same webhook applied twice must not double the operation
canceling an order must not create a new payment
warehouse balances must reconcile with the movement log
a discount must not increase the price
a user without the required role must not see the closed data contour

For agents like Codex, this is a good layer. But you cannot simply tell the agent to write tests for the service. It needs to be given an engineering technical specification:

what business rules exist
which invariants are critical
which edge cases are mandatory
which tests must be fast
which dependencies are forbidden to mock

Otherwise, the agent will write something formally meaningful, but practically decorative.

How to test applications with integrations

A modern system almost never lives alone. There are payments, CRM, ERP, an email provider, SMS, maps, analytics, S3-compatible storage, external catalogs, the client's ERP, partner APIs, external accounts, aggregators, and another dozen friends who cannot be fully trusted.

In such systems, the main idea is very simple: you test not someone else's service, but your contract with it.

A good strategy for external integration

fast tests for serialization and mapping
contract tests for request and response formats
handling errors, timeouts, retries, partial failures
sandbox runs on schedule
production monitoring and alerts

What is especially important to check

what happens if the external service returns a field of another type
what happens if the response lacks the required field
what happens if new fields appear
what happens if the response arrives with a delay
what happens if the event arrives again
what happens if events arrive in the wrong order
how rate limits are handled

In projects like Prime EVA, Vorfahr, NaturalTTS, GadFul, and City Ingello, integrations are no longer an accessory, but part of the system's skeleton. There, an error at the boundary between services quickly turns into a chain of side effects.

How to test scraping and parsing

Scraping and parsing are a wonderful example of how a naive approach breaks the engineering psyche. If you test such a system with one big e2e run through the live internet, it will pass one moment and fail the next, depending on the time of day or on the mood of someone else's website. Wild beauty, rural reliability.

It is more correct to divide the task into layers.

Layer 1. Fetcher

How exactly we get the page. What we do with 403, 404, 429, redirect, broken SSL, timeout, anti-bot protection.

Layer 2. Parser

How we extract entities from HTML, XML, JSON, feed, sitemap, or another format.

Layer 3. Normalizer

How we bring data to a unified form: dates, currencies, names, links, identifiers.

Layer 4. Dedupe

How we understand that this is a new record, not an old record in a different dress.

Layer 5. Writer

How we save the result to the database and what side effects we launch.

Layer 6. Scheduler

When and how often the system performs a repeat check at all.

What tests are needed here

fixtures with real HTML pages and their versions
narrow parser tests on local files
tests for detecting a new record
deduplication tests
tests for correct normalization of dates, prices, links
scheduler tests with fake clock
one or two canary runs on the real source on schedule

A very useful practice is to store a set of HTML page versions and run the parser on these snapshots. Then a DOM change breaks a narrow test, not the whole release. This is much better than learning about the problem from production two days later with the phrase "for some reason, new records are not coming in."
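A narrow parser test on a stored snapshot might look like the sketch below. The markup (`<a class="item" data-id="...">`), the `ListingParser` class, and the fixture content are invented for illustration. In a real project the fixture would live in a versioned file, not in a string.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects (id, title) pairs from <a class="item" data-id="..."> elements (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current_id = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "item" and "data-id" in attrs:
            self._current_id = attrs["data-id"]

    def handle_data(self, data):
        if self._current_id is not None and data.strip():
            self.items.append((self._current_id, data.strip()))
            self._current_id = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_id = None


def parse_listing(html: str) -> list:
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.items


def detect_new(items: list, known_ids: set) -> list:
    """Return items not seen before, each id at most once, and remember them."""
    fresh = []
    for item_id, title in items:
        if item_id not in known_ids:
            known_ids.add(item_id)
            fresh.append((item_id, title))
    return fresh


FIXTURE_V1 = """
<ul>
  <li><a class="item" data-id="101" href="/p/101">Price list, March</a></li>
  <li><a class="item" data-id="102" href="/p/102">Price list, April</a></li>
  <li><a class="item" data-id="101" href="/p/101">Price list, March</a></li>
</ul>
"""

known = {"101"}
print(detect_new(parse_listing(FIXTURE_V1), known))
```

The parser and the deduplication step are tested separately from the network, so a failure points at one layer, not at "the scraper" in general.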

For corporate solutions, this approach is especially important when the system builds monitoring of the market, price lists, partner catalogs, or updates to external documents.

How to test Telegram integrations and similar external channels

Here it is useful to immediately divide the system into two contours.

Outbound

We send a message to a user, manager, operator, doctor, client, or administrator.

Inbound

A user or external service sends us a command, callback, status, button, confirmation, attachment.

What to test for outbound

message format
escaping
localization
deduplication
retries
connection to the domain event
correlation id for tracing
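Escaping is a good example of a cheap outbound test. Telegram's MarkdownV2 mode requires a fixed set of special characters to be escaped with a backslash in plain text. The sketch below assumes messages are sent in that mode; the `render_order_notification` template is hypothetical.

```python
import re

# Characters the Bot API's MarkdownV2 mode requires escaping in plain text, plus the backslash itself.
_MDV2_SPECIAL = re.compile(r"([_*\[\]()~`>#+\-=|{}.!\\])")


def escape_markdown_v2(text: str) -> str:
    """Prefix every MarkdownV2 special character with a backslash."""
    return _MDV2_SPECIAL.sub(r"\\\1", text)


def render_order_notification(order_id: str, total: str) -> str:
    """Hypothetical template: the header is bold, user-supplied data is escaped."""
    return f"*Order {escape_markdown_v2(order_id)}*\nTotal: {escape_markdown_v2(total)}"


print(render_order_notification("A-17", "1500.50"))
```

Without such a test, the first order number containing a dot or a dash can make the API reject the message, and the user simply receives nothing.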

What to test for inbound

payload validity
command routing
authorization and signatures, if applicable
repeated delivery
idempotency
side effect in the database and queues

The Telegram Bot API documentation and materials on webhook mechanics are useful not only for setup, but also as a source of realistic payloads. The best test here is not an abstract fantasy, but a real saved update from a production-like environment.

Such integrations are often important in support systems, notifications, action confirmations, internal BPM contours, logistics services, and CRM modules.

How to test queues, delayed actions, and regular tasks

As soon as a queue, cron, or delayed workflow appears in the system, you get one more hidden character — time. And time in software systems likes to make mischief silently, elegantly, and without witnesses.

Therefore, tests for queues and background tasks need to be built not around sleep and hope, but around controlled time.

What is important to check here

the task runs once where this is critical
repeated delivery does not lead to duplicates
an error leads to retry according to the correct policy
after retries are exhausted, the task goes to DLQ or a special status
order-sensitive events are processed correctly
periodic jobs do not overlap and do not consume each other
restarting the worker does not break the state

What helps technically

fake clock
time travel
local broker in a container
test queues
explicit idempotency keys
tracing message chains

If you use a workflow engine model, it is useful to look toward approaches similar to Temporal testing, where the test environment can work with time as a managed resource, not as mysticism.
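The idea of controlled time does not require a workflow engine. A minimal sketch, assuming an in-memory worker with an injectable clock: `FakeClock`, `RetryingWorker`, and the doubling delay policy are all illustrative, not a description of any specific queue library.

```python
from dataclasses import dataclass, field


class FakeClock:
    """Test clock: time moves only when the test says so."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


@dataclass
class RetryingWorker:
    clock: FakeClock
    max_attempts: int = 3
    base_delay: float = 10.0
    pending: list = field(default_factory=list)       # (due_time, attempt, task)
    dead_letters: list = field(default_factory=list)
    done: list = field(default_factory=list)

    def submit(self, task) -> None:
        self.pending.append((self.clock.now, 1, task))

    def run_due(self, handler) -> None:
        due = [p for p in self.pending if p[0] <= self.clock.now]
        self.pending = [p for p in self.pending if p[0] > self.clock.now]
        for _, attempt, task in due:
            try:
                handler(task)
                self.done.append(task)
            except Exception:
                if attempt >= self.max_attempts:
                    self.dead_letters.append(task)  # retries exhausted: park the task
                else:
                    delay = self.base_delay * 2 ** (attempt - 1)  # 10 s, then 20 s
                    self.pending.append((self.clock.now + delay, attempt + 1, task))
```

A test can then submit a task, call `clock.advance(10)`, run the worker again, and assert exactly when the retry happens and when the task lands in the dead-letter list, all in milliseconds of real time.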

In systems at the level of platFORMA, FORMA BPM, FORMA WMS, and production solutions like Prime EVA, this is especially critical, because there are many background processes, dependencies, and business consequences there.

How to test real-time data streams, for example voice through a microphone in the browser

This is where the part of modern development begins where the classic unit test is no longer king, but just one official in a large department.

If the user speaks by voice in the browser, you get:

microphone and browser permissions
a stream of audio chunks
encoding and decoding
network and latency
connection interruptions
server-side buffering
UI reaction to partial results
final transcription or action

How to reasonably divide such tests

unit — chunk format, buffering, aggregators, state machine on the client and server
component — transferring chunks through mock transport
integration — a real websocket or streaming endpoint in a test environment
scenario — short end-to-end tests with pre-prepared audio
observability — latency, dropped chunks, reconnect, timeout metrics

What is important to check

that parts of the stream are not lost
that messages are assembled in the correct order
that a short network outage does not destroy the entire state
that the server correctly closes the session
that the UI adequately shows the intermediate status
that the user does not get a stuck session after reload

For such systems, synthetic and replay checks are especially important. That is, you store a set of reference streams and periodically replay them on the test contour. Otherwise, you risk finding out about a problem only when a real user starts getting angrier than usual into the microphone =)
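The unit layer for a stream can be very small. Here is a sketch of a reassembly buffer that tolerates out-of-order and duplicated chunks. It assumes every chunk carries a sequence number starting at 0; the `ChunkAssembler` class is hypothetical.

```python
class ChunkAssembler:
    """Reorders chunks by sequence number and releases only contiguous data."""

    def __init__(self):
        self.next_seq = 0
        self.buffer = {}
        self.output = bytearray()

    def receive(self, seq: int, data: bytes) -> None:
        if seq < self.next_seq or seq in self.buffer:
            return  # duplicate, for example resent after a reconnect: ignore it
        self.buffer[seq] = data
        while self.next_seq in self.buffer:
            self.output += self.buffer.pop(self.next_seq)
            self.next_seq += 1

    def missing(self) -> list:
        """Sequence numbers that block already buffered chunks from being released."""
        if not self.buffer:
            return []
        return [s for s in range(self.next_seq, max(self.buffer)) if s not in self.buffer]


assembler = ChunkAssembler()
assembler.receive(0, b"aa")
assembler.receive(2, b"cc")   # arrives early
print(assembler.missing())    # chunk 1 is still on its way
assembler.receive(1, b"bb")
print(bytes(assembler.output))
```

The same recorded sequence of chunks, with a gap or a duplicate injected, makes a convenient replay fixture for the component and integration layers.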

We encounter these principles in projects that involve real-time interaction, voice, communication contours, or complex client-server behavior, for example in NaturalTTS and a number of product AI modules.

How to test external webhooks

A webhook is simple only in a presentation. In practice, a webhook is a small door into the house through which the outside world can come in with papers, dirty boots, and sometimes with an axe.

What must be checked for a webhook endpoint

payload schema validity
signature and source verification
idempotency
retry handling
late delivery handling
correct reaction to unknown fields
backward compatibility when the external schema changes
audit and tracing

What not to do

do not tie everything to one fragile giant e2e
do not assume that 200 OK means business success
do not keep only a minimal log with no ability to investigate
do not rely only on the external system as the source of truth

A very useful practice is to store raw webhook payloads in a separate storage or journal so that real incidents can be replayed. Otherwise, investigating an outage turns into the genre of "I think something came in there, but that is not certain."
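Signature verification and the raw journal fit in a few lines. The sketch below assumes the sender signs the raw body with HMAC-SHA256 and a shared secret, which is a common scheme but not universal. The function names and the list-based journal are illustrative.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (assumed signing scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_webhook(secret: bytes, body: bytes, signature: str, journal: list) -> int:
    journal.append(body)  # keep the raw payload first, so any incident can be replayed
    expected = sign(secret, body)
    # compare_digest takes the same time regardless of where the strings differ
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401
    return 200


secret = b"test-secret"
body = b'{"event_id": "evt-1"}'
journal = []
print(handle_webhook(secret, body, sign(secret, body), journal))
print(handle_webhook(secret, body + b" ", sign(secret, body), journal))
```

Note that the signature is computed over the raw bytes, before any JSON parsing. Re-serializing the payload first is a classic way to get valid requests rejected.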

How to test AI integrations if the tests are paid

This is one of the most interesting and most painful areas.

When the system has text generation, classification, entity extraction, summarization, ranking, copilots, chats, voice modules, or agent-based decision loops, tests really do become paid. Because every model call is a cost in budget, time, and sometimes also instability.

Therefore AI tests must not be mixed with regular fast regression tests.

The right division of AI checks by cost

free layer — tests of prompt wrapping, schemas, parsing, guardrails, post-processing, fallback logic, and routing without a real model call
cheap layer — a limited set of smoke evals on a small reference sample
expensive layer — full evals on a schedule, before release, or at a dedicated stage

What must be separated

paid tests
nightly AI evals
manual review cases
benchmark runs

That is, you should have separate groups in CI. Otherwise, the team will accidentally start running paid evals on every commit and quickly feel like the owner of a casino where the roulette wheel is spinning, but for some reason the profit is not yours.
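A simple guard makes the separation hard to violate. In this sketch the `RUN_PAID_EVALS` environment variable and the `EvalBudget` class are assumptions; amounts are kept in whole cents to avoid floating-point surprises.

```python
import os


class EvalBudget:
    """Tracks spend for one paid eval run; amounts are in cents and purely illustrative."""

    def __init__(self, limit_cents: int):
        self.limit_cents = limit_cents
        self.spent_cents = 0

    def charge(self, cost_cents: int) -> None:
        if self.spent_cents + cost_cents > self.limit_cents:
            raise RuntimeError("eval budget exceeded, stopping the run")
        self.spent_cents += cost_cents


def paid_evals_enabled(env=None) -> bool:
    """Paid evals run only when explicitly switched on, never by default."""
    env = os.environ if env is None else env
    return env.get("RUN_PAID_EVALS") == "1"


budget = EvalBudget(limit_cents=100)
if paid_evals_enabled({"RUN_PAID_EVALS": "1"}):
    budget.charge(60)   # first model call
    budget.charge(40)   # second call, exactly at the limit
print(budget.spent_cents)
```

The default is "off": a developer who runs the whole suite locally spends nothing unless they opt in.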

Materials from OpenAI Developers and engineering notes from Anthropic emphasize the general idea well: for AI systems, quality must be evaluated systematically, reproducibly, and on representative task sets, not by one beautiful example.

How to test AI integrations if the result cannot be checked algorithmically

Now this is a real adult conversation.

If the system generates free-form text, recommendations, emails, summaries, instructions, hypotheses, content, or assistant responses, then a simple assert equals is almost always useless. The answer may be good, but different. Or bad, but formally similar.

Therefore, other methods are used here.

1. Reference-based evals

There are reference examples and expected characteristics of the result. It does not have to be a letter-for-letter match, but there are quality criteria.

2. Rule-based graders

Part of the result can be checked algorithmically:

whether required entities are present
whether the structure has been broken
whether the limit has been exceeded
whether there are prohibited phrasings
whether key facts have been preserved
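These checks need no model call at all. A minimal sketch of a rule-based grader, assuming the model is asked to return JSON with a single `summary` field; the field name, the check names, and the example phrases are hypothetical.

```python
import json


def grade_summary(raw: str, required_entities: list, banned: list, max_chars: int) -> dict:
    """Run cheap deterministic checks on a model answer and report each one separately."""
    checks = {}
    try:
        parsed = json.loads(raw)
        checks["valid_structure"] = isinstance(parsed, dict) and isinstance(parsed.get("summary"), str)
    except json.JSONDecodeError:
        parsed = {}
        checks["valid_structure"] = False
    text = parsed["summary"] if checks["valid_structure"] else ""
    lowered = text.lower()
    checks["entities_present"] = all(e.lower() in lowered for e in required_entities)
    checks["within_limit"] = len(text) <= max_chars
    checks["no_banned_phrases"] = not any(b.lower() in lowered for b in banned)
    checks["passed"] = all(checks.values())
    return checks


answer = json.dumps({"summary": "Invoice 42 for ACME is due on Friday."})
print(grade_summary(answer, ["ACME", "Invoice 42"], ["guaranteed"], max_chars=200))
```

Reporting each check separately matters: "the structure broke" and "a key entity disappeared" are different regressions with different fixes.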

3. LLM-as-a-judge

Another AI evaluates the result according to criteria. But caution is needed here. Such a grader also has to be calibrated, tested, and kept under control.

4. Pairwise comparison

Useful when you compare two versions of a prompt, two models, or two pipelines and want to understand which version is objectively better on a set of tasks.

5. Human review for high-risk slices

Where the cost of error is high, you cannot remove the human completely. Especially in domains with complex expertise, medicine, regulation, sensitive communications, financial recommendations, and complex B2B processes.

In practice, for business this means the following: do not try to pretend that all AI behavior can be reduced to an exact algorithmic assert. In most serious systems, that is a lie. You need to honestly separate:

what is checked strictly
what is checked heuristically
what is checked by another AI
what is checked by a human

Do you need to run another AI to test AI

Sometimes yes. But as a tool, not as a religion.

LLM-as-a-judge is useful when:

you need to evaluate the semantic completeness of the answer
you need to check compliance with style
you need to compare several answer options
you need to quickly evaluate a large number of generations

But it must not be turned into the only source of truth. Because then you are building a nesting doll of probabilities: one nondeterministic object evaluates another nondeterministic object, while you are trying to explain to the director why quality seems to have improved, but in some places has gotten worse.

Good practice

first algorithmic checks where possible
then rule-based grading
then LLM-judge for part of the metrics
then selective human validation

Such a cascade usually works better than betting on one magical grader.
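The cascade can be sketched as a function that calls the expensive grader only after the cheap ones pass. Everything here is an assumption for illustration: the `judge` callable stands in for a real LLM-as-a-judge call, and the 0.7 threshold is arbitrary.

```python
from typing import Callable


def cascade_grade(
    answer: str,
    cheap_checks: list,
    judge: Callable[[str], float],
    judge_threshold: float = 0.7,
    needs_human: Callable[[str], bool] = lambda _: False,
) -> str:
    """Cheap deterministic checks first, then the judge, then optional human review."""
    for check in cheap_checks:
        if not check(answer):
            return "fail:rule"      # the paid judge is never called for this answer
    if judge(answer) < judge_threshold:
        return "fail:judge"
    if needs_human(answer):
        return "review:human"
    return "pass"


judge_calls = []


def stub_judge(answer: str) -> float:
    judge_calls.append(answer)      # a real judge would call a model here
    return 0.9


not_empty = lambda a: bool(a.strip())
print(cascade_grade("", [not_empty], stub_judge))
print(cascade_grade("A clear answer.", [not_empty], stub_judge))
print(len(judge_calls))
```

Besides saving money, the order gives cleaner signals: if a rule fails, you know the problem is structural and do not need a model's opinion about it.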

How to test agentic systems if it is not clear at all what normal is

This is probably the most important question in the entire article.

An agent is no longer just a function that receives input and returns output. An agent can:

plan steps
use tools
go to the internet or through internal APIs
read documents
make intermediate decisions
change state
launch child processes
make a mistake not in the answer, but in the route

Therefore, an agentic system needs to be tested at least on three levels.

Level 1. Outcome

Whether the correct business result has been achieved. Not just whether the agent said something plausible, but whether the correct final result was obtained.

Level 2. Process

Whether prohibited actions occurred along the way. For example, extra tool calls, dangerous operations, an incorrect order of steps, data leakage, unnecessary costs.

Level 3. Cost and stability

How many steps the route took, how much it cost, how much time it took, and how reproducible the result is.

What is especially useful in agent evals

fixed task set
several trials for one task
log of all tool calls
trace grading
assert on the final state of the system
limits on time, cost, and number of steps

Agentic tests often need to check not only a text answer, but a real state change:

whether an object was created in the DB
whether someone else’s data was affected
whether an extra message was sent
whether a prohibited integration was called
whether the route exceeded the budget
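Process, cost, and outcome can be graded from a recorded trace with ordinary code. A minimal sketch, assuming each tool call is logged with its name and cost; the `ToolCall` record, the tool names, and the limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    cost_cents: int


def grade_trace(trace, final_state, expected_state, allowed_tools, max_steps, max_cost_cents):
    """Return a list of violations covering process, cost, and outcome."""
    violations = []
    for call in trace:
        if call.tool not in allowed_tools:
            violations.append(f"forbidden tool: {call.tool}")
    if len(trace) > max_steps:
        violations.append("too many steps")
    if sum(call.cost_cents for call in trace) > max_cost_cents:
        violations.append("over budget")
    for key, value in expected_state.items():
        if final_state.get(key) != value:
            violations.append(f"wrong final state: {key}")
    return violations


trace = [ToolCall("search_docs", 2), ToolCall("send_email", 1), ToolCall("create_ticket", 1)]
print(grade_trace(
    trace,
    final_state={"ticket_created": True},
    expected_state={"ticket_created": True},
    allowed_tools={"search_docs", "create_ticket"},
    max_steps=5,
    max_cost_cents=10,
))
```

In this example the agent reaches the right outcome and still fails the eval, because it sent an email it was never allowed to send. An outcome-only test would have passed it.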

If you have agentic development automation, an internal technical copilot, an AI process operator, or a hybrid workflow with an LLM, this becomes mandatory. The FRACTAL case is especially relevant here: automation of development and engineering pipelines requires a completely different level of testing discipline than a regular CRUD service.

How to test complex infrastructure tasks and a cold start of the system

This is exactly the part many people remember only after an outage. Although infrastructure often determines whether your beautiful product will work after a machine restart, a new server rollout, a container crash, a database migration, or an orchestration update.

What needs to be checked here

the system starts from zero in a clean environment
migrations are applied correctly
services start in the correct order
readiness and liveness truly reflect the state
queues and workers recover after restart
the environment configuration is complete and consistent
temporary absence of a dependency does not kill the entire contour
cache, file storage, database, and broker are rebuilt predictably

How to test this in practice

ephemeral environment with the entire stack brought up
smoke after deployment
recreating test environments from scratch
chaos-lite exercises for restarting individual services
boot sequence check
automatic health audit after startup
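A boot sequence check can compare the observed start order with a declared dependency map. The sketch below uses the standard library `graphlib` module (Python 3.9+); the service names and the `DEPENDS_ON` map are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> services that must be ready first.
DEPENDS_ON = {
    "database": set(),
    "broker": set(),
    "migrations": {"database"},
    "api": {"migrations", "broker"},
    "worker": {"migrations", "broker"},
}


def valid_boot_order() -> list:
    """One start order that respects every declared dependency."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())


def check_boot_sequence(observed: list) -> list:
    """Compare an observed start order (for example, parsed from logs) with the map."""
    problems = []
    started = set()
    for service in observed:
        missing = DEPENDS_ON.get(service, set()) - started
        if missing:
            problems.append(f"{service} started before {sorted(missing)}")
        started.add(service)
    for service in sorted(set(DEPENDS_ON) - started):
        problems.append(f"{service} never started")
    return problems


print(check_boot_sequence(["database", "api", "broker", "migrations", "worker"]))
```

Run against the logs of an ephemeral environment, such a check catches the case where a service "usually" starts in time but is not guaranteed to.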

In infrastructure-heavy systems, this is a mandatory part of quality. Especially if you have several services, a broker, a database, cron, AI workers, storage, a webhook-consumer, and streaming components.

Such contours are characteristic of production, accounting, logistics, and AI systems. In projects like Prime EVA, Vorfahr, NaturalTTS, and platFORMA, infrastructure testing is no longer optional.

What a mature testing system must have

risk map — exactly where the system can cause damage
launch contour map — what runs locally, on PR, at night, before release, after release
test labeling — fast, slow, e2e, paid, ai, flaky-review, integration
stable test data
observability — logs, metrics, tracing, correlation id
reproducibility — fixed fixtures, payload versions, seed, controlled environment
cost model — which tests are expensive and how often it is reasonable to run them
rules for AI agents — exactly what they are allowed to write and change

How to give a task to Codex, Claude, and other agents so they write good tests

This is a separate important topic. A bad prompt creates bad tests. And this is not because the agent is stupid. It is because you gave a task in the style of "write tests", but expected architectural maturity.

Incorrect task statement

Write tests for this service.

Correct task statement

Here is the business context. Here are the critical invariants. Here is what must not be mocked. Here is what counts as a successful result. Here are which tests must be fast. Here are which scenarios belong to PR and which to nightly. Here are the risks that have already occurred in production. Here is where to test behavior, not implementation details.

What is worth specifying explicitly to the agent

testing layer: unit, component, integration, contract, e2e, eval
what is the object of verification
which invariants are mandatory
which dependencies are allowed to be mocked
which real fixtures to use
what launch cost is acceptable
which tests must not be added to the fast pipeline
which failure modes are already known

A very useful rule

Ask the agent to first produce a testing plan, not test code immediately. Let it list:

risks
layers
launch contours
which checks are needed
which checks are not needed

And only after that — code generation.

This sharply reduces the number of meaningless tests.
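To keep such task statements consistent across the team, the brief can be a small structure, not free text. This is only a sketch: the `AgentTestBrief` class, its fields, and the rendered wording are assumptions, not a format that Codex or Claude requires.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTestBrief:
    layer: str                      # unit, component, integration, contract, e2e, eval
    target: str                     # the object of verification
    invariants: list
    never_mock: list = field(default_factory=list)
    pipeline: str = "pull_request"  # where the resulting tests are allowed to run

    def render(self) -> str:
        if not self.invariants:
            raise ValueError("a brief without invariants invites decorative tests")
        lines = [
            "First produce a testing plan (risks, layers, launch contours). Write code only after that.",
            f"Layer: {self.layer}",
            f"Target: {self.target}",
            f"Pipeline: {self.pipeline}",
            "Mandatory invariants:",
            *[f"- {item}" for item in self.invariants],
        ]
        if self.never_mock:
            lines += ["Do not mock:", *[f"- {item}" for item in self.never_mock]]
        return "\n".join(lines)


brief = AgentTestBrief(
    layer="unit",
    target="refund calculation",
    invariants=["a refund never exceeds the original payment"],
    never_mock=["rounding rules"],
)
print(brief.render())
```

The structure refuses to render a brief with no invariants, which is exactly the "write tests for this service" situation described above.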

How we would break this down by typical Ingello Systems business scenarios

Logistics and booking

For projects like Svit BUS, LEX, BusTicket, and UNO Taxi, tests should cover:

search and routing algorithms
pricing and discounts
refunds and cancellations
maps and geo-data
roles, accounts, partner access
reprocessing external events
payments and reconciliation

Corporate platforms

For platFORMA, FORMA CRM, FORMA WMS, FORMA BPM, and FORMA HRM, key tests live around:

access rights
workflow states
document flow
movement of goods
synchronization between modules
scheduled tasks and notifications
idempotency and audit
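
Workflow states, idempotency, and audit can often be checked together in one small deterministic test. The sketch below is an assumption-heavy illustration: the `Document` class, the state names, and the event ids are invented for the example and do not describe any FORMA module.

```python
# Allowed transitions of a hypothetical approval workflow.
ALLOWED = {
    "draft": {"submitted"},
    "submitted": {"approved", "rejected"},
    "approved": set(),
    "rejected": set(),
}


class Document:
    def __init__(self) -> None:
        self.state = "draft"
        self.seen_event_ids: set[str] = set()
        self.audit: list[tuple[str, str, str]] = []  # (event_id, from_state, to_state)

    def apply(self, event_id: str, new_state: str) -> bool:
        """Apply a transition once; a repeated delivery of the same event is a no-op."""
        if event_id in self.seen_event_ids:
            return False
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.audit.append((event_id, self.state, new_state))
        self.state = new_state
        self.seen_event_ids.add(event_id)
        return True


def test_duplicate_delivery_is_ignored() -> None:
    doc = Document()
    assert doc.apply("evt-1", "submitted") is True
    assert doc.apply("evt-1", "submitted") is False  # same event delivered again
    assert doc.state == "submitted"
    assert len(doc.audit) == 1  # one real change, one audit record


def test_illegal_transition_is_rejected() -> None:
    doc = Document()
    try:
        doc.apply("evt-2", "approved")  # skipping "submitted" is not allowed
    except ValueError:
        assert doc.state == "draft" and doc.audit == []
    else:
        raise AssertionError("expected ValueError")
```

Both tests describe behavior a business owner can read: the same event twice changes nothing, and a forbidden jump leaves no trace in state or audit.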

Medicine and regulated domains

For Evrika, Lita, L-Doc, Rapport, and Dent Ingello, testing must be especially disciplined around:

data structure
access rights and privacy
record integrity
integration compatibility
stability of forms, accounts, and data routes
controlled AI behavior, if it is used

Production, accounting, and integrations

For Prime EVA, GadFul, Carveli, and SKLO, the most important areas are:

accounting reconciliation (the books must balance)
production and status chains
external integrations
scheduled synchronizations
mass background operations
recovery after failures

AI and development automation

For FRACTAL, Vorfahr, and NaturalTTS, a separate layer is already needed: AI evals, cost controls, tool-use tracing, grading, and scenario checks of agentic routes.

Common mistakes that kill the real benefit of tests

chasing coverage as an end in itself
identical checks at all layers
complete dependence on mocks
too much e2e
no labeling of slow and paid tests
no production canary or observability
mixing AI evals with fast CI
using deterministic asserts on a nondeterministic system, where they do not work
no real production-like payloads and fixtures
delegating test writing to an agent without describing the risks
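
The mistake with nondeterministic systems is easiest to see in code. An exact-match assert on model output breaks as soon as the wording changes, while a graded pass rate over several runs stays meaningful. The sketch below uses a fake model so it runs without any external service. The names `pass_rate`, `fake_model`, and `mentions_refund_amount`, and the 0.85 threshold, are all assumptions made for the example.

```python
from itertools import cycle
from typing import Callable


def pass_rate(generate: Callable[[], str], grade: Callable[[str], bool], runs: int) -> float:
    """Call the system `runs` times and return the share of answers the grader accepts."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if grade(generate()))
    return passed / runs


# Stand-in for a nondeterministic model: the wording differs between calls,
# and one answer in ten omits the refund amount.
_answers = cycle(
    [f"Variant {i}: your refund of 120.00 USD is on its way." for i in range(9)]
    + ["Your refund is on its way."]
)


def fake_model() -> str:
    return next(_answers)


def mentions_refund_amount(answer: str) -> bool:
    """Grade a property of the answer, not its exact wording."""
    return "120.00" in answer


rate = pass_rate(fake_model, mentions_refund_amount, runs=20)
assert rate >= 0.85, f"pass rate too low: {rate:.2f}"
```

With a real model every call costs money, which is one more reason to keep evals like this out of the fast CI run.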

Practical conclusion for business

If we reduce everything above to one down-to-earth conclusion, it is this:

Testing in 2026 is no longer an add-on to development, but part of business architecture.

Modern products are not just code. They are a bundle of domain logic, integrations, queues, AI, real-time flows, external events, and infrastructure. If you check only one layer, the others begin to rot quietly, politely, and with a corporate smile.

Therefore, a good process looks like this:

we know what risks the system has
we understand which tests are needed at which layer
we do not run everything indiscriminately on every commit
we separate fast checks from expensive and rare ones
we design AI-evals separately
we test not only code, but also routes, events, state, and cost
we build observability as part of quality control
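
Separating fast checks from expensive ones works best when the split is written down as data, not kept in someone's head. Here is a minimal sketch of such a mapping. The suite names and their assignment to stages are assumptions made for the example, and a real project would express the same idea through its test runner's tags or its CI configuration.

```python
from dataclasses import dataclass

STAGES = ("commit", "pull_request", "nightly", "pre_release", "post_release")


@dataclass(frozen=True)
class Suite:
    name: str
    stages: frozenset[str]
    paid: bool = False  # costs money per run (external APIs, model calls)


# Hypothetical assignment of suites to stages.
SUITES = [
    Suite("domain_unit", frozenset({"commit", "pull_request"})),
    Suite("integration", frozenset({"pull_request", "nightly"})),
    Suite("e2e_key_chains", frozenset({"nightly", "pre_release"})),
    Suite("ai_evals", frozenset({"nightly", "pre_release"}), paid=True),
    Suite("smoke", frozenset({"post_release"})),
]


def suites_for(stage: str, allow_paid: bool = True) -> list[str]:
    """Return the names of the suites that should run at the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return [
        s.name for s in SUITES
        if stage in s.stages and (allow_paid or not s.paid)
    ]


def paid_suites_on_commit() -> list[str]:
    """Guard: paid suites should never end up in the per-commit run."""
    return [s.name for s in SUITES if s.paid and "commit" in s.stages]


print(suites_for("commit"))
print(suites_for("nightly", allow_paid=False))
```

The guard function is the useful part: when an agent or a developer adds a paid suite to the commit stage, the mistake shows up as a failed check and not as a surprise invoice.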

What to do if you already have a project, but its tests are chaotic

In such a situation, you usually do not need to start with the heroic idea of covering everything. That is a path to beautiful exhaustion without a real result.

You need to do it like adults:

identify critical business scenarios
determine the most expensive risks
divide the system into deterministic and nondeterministic parts
choose a basic set of fast tests
reserve integration and e2e scenarios for key chains only
design AI-evals separately
set up observability and post-release smoke
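
One simple way to decide which risks are "the most expensive" is to rank scenarios by expected loss: probability multiplied by cost. This is a common heuristic and not the only possible scoring. The scenarios and numbers below are made up purely to show the mechanics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    scenario: str
    yearly_probability: float  # rough estimate, 0.0 to 1.0
    cost_usd: float            # rough estimate of the damage if it happens

    @property
    def expected_loss(self) -> float:
        return self.yearly_probability * self.cost_usd


def rank(risks: list[Risk]) -> list[Risk]:
    """Highest expected loss first: these scenarios get tests first."""
    return sorted(risks, key=lambda r: r.expected_loss, reverse=True)


# Illustrative numbers only.
risks = [
    Risk("typo in an email footer", 0.9, 100),
    Risk("customer charged twice", 0.2, 50_000),
    Risk("booking lost after payment", 0.5, 8_000),
]

for risk in rank(risks):
    print(f"{risk.scenario}: {risk.expected_loss:.0f} USD per year")
```

Even rough estimates are enough here: the goal is an ordering for the first wave of tests, not an actuarial model.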

And only after that connect Codex, Claude, and other agents as accelerators, not as shamans.

Conclusion

Modern AI agents are not a magic button and not enemies. They are amplifiers. They amplify both good engineering culture and bad engineering culture. If the testing architecture is mature, agents sharply accelerate development. If there is no architecture, they accelerate chaos.

Therefore, the right question today is not "Can AI write tests?" Of course it can. The right question is "Can your company define a testing task at the level of the system, the risk, and the business?"

This is the difference between just code and real engineering.

If you need a project where testing, architecture, automation, integrations, and AI are designed from the start as a single system, take a look at our approach at Ingello Systems. We work with both startups and established companies: from MVPs and AI modules to heavy corporate systems, CRM, WMS, logistics, medicine, production, and accounting solutions. We also love breaking chaos down into parts and turning it into a manageable system. That, to be honest, is the whole thrill of engineering =)


