What confidence threshold should we set for auto-resolution?

Start at 0.85 for any new auto-resolution policy. After two weeks of supervised operation, examine the closed tickets and adjust based on customer follow-up rate. If fewer than 5 percent of auto-closed tickets generate a follow-up, you can lower to 0.80. If more than 10 percent generate a follow-up, raise to 0.90.

How is auto-resolution different from a chatbot?

A chatbot is a synchronous conversation; the customer is talking to it in real time and expects back-and-forth. Auto-resolution is asynchronous; the customer sends a message, the system decides whether to reply directly or escalate to a human. Auto-resolution is invisible when it works; the customer just gets a fast helpful answer.

Will auto-resolution replace support agents?

No. It removes the most repetitive 30 to 60 percent of inbound, which lets the team spend more time on the harder cases that actually benefit from human judgement. Most teams that adopt auto-resolution well end up hiring slightly fewer agents than projected, but they also stop burning out the agents they have.

How AI Auto-Resolution Works (and Where It Should Never Be Used)

Why this post is technical, not promotional

AI auto-resolution is the single highest-impact capability in a modern helpdesk and the single most dangerous one if implemented carelessly. Get it right and your team handles 30 to 60 percent more volume without hiring. Get it wrong and you replace a slow human reply with a fast wrong reply, which is worse than no reply at all.

I want to walk through how it actually works under the hood, what good configuration looks like, and the five ticket categories where it adds clear value plus the five where it must not auto-act. This is the post I wish I had read in 2024 when I was first wiring this up.

How it works under the hood

The mechanics are simpler than the marketing copy suggests. There are five steps.

1. The ticket arrives. From email, chat, WhatsApp, or any other channel, the message lands as a normalised ticket with: customer identifier, channel, message body, conversation history, and any structured metadata your system already attaches (account tier, recent order ID, etc.).

2. Intent classification. A small language model classifies the ticket into a category: billing, technical, shipping, refund, account, escalation, abuse, other. Classification is fast (sub-second) and cheap. Each category has its own auto-resolution policy.

3. Knowledge-base retrieval. For categories where auto-resolution is enabled, the system searches your knowledge base for the most semantically similar articles. The retrieval uses embeddings, not keyword matching, so a customer asking "how do I get my money back" matches a KB article titled "Refund process" even though no words overlap.

4. Confidence scoring. The model produces a draft reply citing the retrieved KB articles, then scores its own confidence in the answer. The score is between 0 and 1. The score reflects two things: how well the retrieved articles match the question, and how directly the draft answers the question.

5. The threshold gate. If confidence crosses your configured threshold (commonly 0.85), the system sends the reply directly to the customer and closes the ticket. If confidence is below the threshold, the ticket falls through to the human queue with the AI's draft attached as a starting point for the agent.

The whole pipeline runs in 2 to 6 seconds per ticket. The customer sees a helpful answer arrive within a minute of their message. The agent never sees the resolved tickets, which is the point.

What "good configuration" looks like

The default settings vendors ship with are usually too aggressive. The defensible configuration looks like this:

Per-category enablement. Auto-resolution is on for refund-status, password-reset, shipping-tracking, account-info and FAQ-style questions. It is off for everything else.

Confidence threshold of 0.85 to start. After two weeks of operation, review the closed tickets and adjust. If your follow-up rate (customers replying to an auto-resolved ticket) is under 5 percent, you can lower to 0.80. Over 10 percent, raise to 0.90.

Human escalation always one click away. Every auto-resolved reply ends with "If this didn't answer your question, reply to this email and a human will get back to you within X hours." The escalation rate is your truest measure of accuracy.

Categories with hard blocks. Some categories should never auto-resolve regardless of confidence: abuse reports, legal escalations, payment disputes above your refund threshold, anything from a customer flagged in your CRM as VIP or at-risk.

Audit log on every auto-action. Every auto-resolved ticket is logged with: model used, confidence score, retrieved KB articles, draft generated, threshold at the time. The log lets you investigate and learn from misses.

Weekly review for the first quarter. A team lead spends an hour a week reading 30 to 50 auto-resolved tickets at random. The review catches degradation early and tunes the threshold and KB.

The five categories where auto-resolution works well

Based on data from KimonDesk customers and from my prior helpdesk work, these are the five categories where auto-resolution consistently delivers value.

1. Refund status checks

"Where's my refund? You said it would arrive in 5 days and it's been a week." The answer is in your payment processor's API and your refund policy. The system retrieves the refund's current status, formats it into a customer-friendly reply ("your refund of £45 was processed on 20 April; it usually takes 5-10 business days to appear on the original payment method"), and sends. Confidence is consistently high because the source data is structured.

2. Password resets

"How do I reset my password?" Your KB article has the answer. The customer needs the link, the steps, and a fallback. Auto-resolution sends all three in a clear reply. The fallback is "if you don't receive the email within 5 minutes, check spam, then reply to this and we'll trigger it manually."

3. Shipping tracking and delivery date queries

"Where's my order?" Tracking number plus carrier API plus delivery estimate. The system pulls all three from your fulfilment system and formats them. If the tracking shows a delivery exception (failed delivery attempt, address issue), the system escalates to human instead of attempting the explanation.

4. Account information lookups

"What email is on my account?" or "When does my subscription renew?" The answer is in your own database. Authenticate the requester, retrieve the data, format it into a reply. Skip if the requester cannot be authenticated against your customer database.

5. Documentation-style FAQs

"How do I add a teammate?" or "What payment methods do you accept?" Static answers from your KB. The system retrieves the most relevant article and replies with the answer plus a link. These are the questions a good knowledge base would answer anyway; auto-resolution removes the friction of the customer having to find the article.

The five categories where auto-resolution must not act

These are the categories where a wrong answer is materially worse than a slow human reply. In every case, escalate to human regardless of confidence.

1. Emotional escalations

"This is unacceptable." "I'm furious." "I will be telling everyone about this." A customer in this state needs a human reply, even if the literal question (a refund, a tracking update) is in the auto-resolvable category. The sentiment classifier should override the auto-resolution policy. A correct factual reply to an angry customer reads as dismissive and tends to escalate further.

2. Abuse and harassment reports

A customer reporting that another customer has harassed them, or reporting abusive content on the platform, requires human judgement and proper documentation. Even if the system has a confident answer ("here's how to block the user"), the right action is to escalate to a trust-and-safety team member.

3. Financial decisions above your refund threshold

For any organisation, there is a refund amount above which the decision needs human approval. £20 for a small e-commerce store; £200 for a B2B SaaS. Auto-resolution can confirm refunds within the threshold but must escalate above it. The threshold is configurable per organisation.

4. Legal or compliance queries

"My data was leaked." "I want to exercise my GDPR right to deletion." "I'm being investigated and need my purchase history." These require human handling for both legal-defence and regulatory reasons. The auto-classifier should route them straight to escalation.

5. Anything from a flagged customer

If your CRM marks a customer as VIP, at-risk, in-dispute, or high-value, every ticket from that customer should reach a human regardless of the question. This is the single highest-impact rule because the ticket from your largest customer is the one where speed matters less than care.

Setup cost vs ongoing maintenance

Setting up auto-resolution well is a project, not a checkbox. Realistic time investment for an 8-agent team:

KB audit and improvement: 1 to 2 weeks. Auto-resolution is only as good as the KB it retrieves from. Most teams discover the KB is 30 percent out of date. Fixing this is the largest single time investment.
Category policy configuration: 2 days. Decide which categories auto-resolve, set thresholds, configure hard blocks.
Two-week supervised pilot: 2 weeks of half a team-lead's time reviewing every auto-resolution before send.
Threshold tuning: ongoing, lighter after the first month.

After the first quarter, ongoing maintenance is roughly 2 hours a week for one team-lead. Most of that is KB improvement triggered by tickets the AI got wrong.

What this looks like in numbers

For a typical 8-agent team handling 200 tickets a day across email and chat, well-configured auto-resolution reaches steady-state numbers like:

35 to 45 percent of total inbound auto-resolved (refunds, passwords, shipping, account, FAQs).
Customer follow-up rate on auto-resolved tickets: 4 to 8 percent (i.e. 92 to 96 percent stick).
Customer satisfaction on auto-resolved tickets: typically equal to or slightly higher than agent-handled, mostly because the reply is faster.

Translated into team capacity, that is roughly 70 to 90 tickets a day removed from the human queue. For an 8-agent team that previously had 200 tickets and was at capacity, you now have 110 to 130 tickets across 8 agents, which is a comfortable workload that lets the team spend real time on the harder cases.

Where this fits in the broader AI conversation

Auto-resolution is one of five AI capabilities that matter in a modern helpdesk. The others (drafts, summaries, sentiment, routing) all matter too but auto-resolution is the one that has the largest direct effect on team capacity. We covered the broader landscape in What AI-Native Helpdesk Actually Means in 2026.

For the KimonDesk-specific implementation, the AI features page lists the per-category controls, the threshold ranges, and the audit-log structure. Pricing for the AI features is included at every tier; see the pricing page.

If you want a glossary entry for the underlying concepts, AI auto-resolution and intent detection both have short definitions designed for a non-technical buyer.

Closing thought

The goal of auto-resolution is not to replace your support team. The goal is to give them the time to handle the cases where their judgement actually matters. The best support agents I have worked with are the ones who could spend 90 percent of their day on the 10 percent of tickets that needed real thought, instead of the other way round. That is what well-configured auto-resolution gives them back.

Set the threshold conservatively. Block the categories where wrong answers cause real harm. Review the audit log every week for the first quarter. Then leave it alone and let the team focus on the work that needs them.

References

KimonDesk auto-resolution audit log structure and confidence scoring methodology, internal V3 (March 2026).
Intercom Fin AI deflection benchmarks, public report Q1 2026.
Zendesk Advanced AI auto-resolver documentation, accessed Q2 2026.
Customer satisfaction comparison of AI-resolved vs agent-resolved tickets, KimonDesk pilot data Q4 2025 (n=4,127 tickets across 12 customers).

How AI Auto-Resolution Works (and Where It Should Never Be Used)

Why this post is technical, not promotional

How it works under the hood

What "good configuration" looks like

The five categories where auto-resolution works well

1. Refund status checks

2. Password resets

3. Shipping tracking and delivery date queries

4. Account information lookups

5. Documentation-style FAQs

The five categories where auto-resolution must not act

1. Emotional escalations

2. Abuse and harassment reports

3. Financial decisions above your refund threshold

4. Legal or compliance queries

5. Anything from a flagged customer

Setup cost vs ongoing maintenance

What this looks like in numbers

Where this fits in the broader AI conversation

Closing thought

References

The True Cost of Zendesk in 2026: What an 8-Agent Team Actually Pays

What 'AI-Native Helpdesk' Actually Means in 2026

Why Your Support Shouldn't Run on Gmail (And What to Switch To)

Try every feature for 14 days. No card.

How AI Auto-Resolution Works (and Where It Should Never Be Used)

Why this post is technical, not promotional

How it works under the hood

What "good configuration" looks like

The five categories where auto-resolution works well

1. Refund status checks

2. Password resets

3. Shipping tracking and delivery date queries

4. Account information lookups

5. Documentation-style FAQs

The five categories where auto-resolution must not act

1. Emotional escalations

2. Abuse and harassment reports

3. Financial decisions above your refund threshold

4. Legal or compliance queries

5. Anything from a flagged customer

Setup cost vs ongoing maintenance

What this looks like in numbers

Where this fits in the broader AI conversation

Closing thought

References

Read more

The True Cost of Zendesk in 2026: What an 8-Agent Team Actually Pays

What 'AI-Native Helpdesk' Actually Means in 2026

Why Your Support Shouldn't Run on Gmail (And What to Switch To)

Try every feature for 14 days. No card.