What “good” SIP trunk monitoring looks like in production (beyond up/down)
In production, SIP trunk monitoring can’t stop at “the trunk responds to OPTIONS.” Trunk health is about more than up or down, and operational monitoring needs to reflect that reality with user-experience indicators and call-completion outcomes, not just reachability.
Why? Because a trunk (or SBC edge) can respond to lightweight checks while real calls degrade: call setup gets slow, specific destinations start failing, or one route begins returning a different mix of SIP failures.
Why OPTIONS success can coexist with failed/slow call setups
SIP OPTIONS is a lightweight probe: it can validate that an endpoint is reachable and responding, and you can measure response latency. But OPTIONS does not exercise full call setup, early media, media path, codec negotiation, or destination routing logic.
Practically: your monitoring can show “green” while users complain about silence before ringback, sudden spikes in call failures to a region, or intermittent 5xx errors during peak hours. That’s a degraded state, and it needs separate detection and response.
The minimum monitoring layers: synthetic checks + real-traffic KPIs + SIP response distributions
- Synthetic checks: OPTIONS for basic availability/latency; optionally REGISTER (when relevant); and at least one INVITE-based call scenario to test end-to-end call setup.
- Real traffic KPIs: ASR, ACD (often referred to as ALOC), and PDD—tracked over time and compared against baselines.
- SIP response distributions: 4xx/5xx rate plus the error mix (how the distribution changes). A shift from 486 Busy toward routing-type failures (e.g., 404 Not Found, 484 Address Incomplete) signals a very different operational problem.
Segment first: route, destination, traffic profile (so KPIs mean something)
Before you alert on anything, decide how you’ll segment metrics. A recurring failure mode in voice observability is KPI pollution: mixing traffic types and destinations until the numbers stop being diagnostic.
- By route/trunk/POP: so you can compare “route vs route.”
- By destination: country, prefix, or carrier group—so localized degradations aren’t hidden in global averages.
- By traffic profile: human-to-human conversational traffic vs dialer/contact-center patterns. Operationally, these profiles behave differently; don’t mix them in the same ASR/ACD baselines unless you’re deliberately modeling that combined behavior.
- Normalize numbering (e.g., E.164) before routing and reporting so your destination grouping is stable and comparable.
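To make the segmentation concrete, here is a minimal sketch of number normalization and bucket keys. It assumes the Python `phonenumbers` library and a simple prefix-based destination grouping; both are illustrative choices, not requirements.

```python
# Minimal sketch: normalize numbers to E.164 and build stable segmentation keys.
# Assumes the `phonenumbers` library (pip install phonenumbers) and a default
# region for nationally formatted numbers; adjust both to your dial-plan reality.
from typing import Optional
import phonenumbers

def to_e164(raw_number: str, default_region: str = "US") -> Optional[str]:
    """Return the E.164 form of a dialed number, or None if it can't be parsed."""
    try:
        parsed = phonenumbers.parse(raw_number, default_region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)

def segment_key(route: str, called_number: str, traffic_profile: str) -> tuple:
    """Key used to bucket KPIs: route/trunk, destination group, traffic profile."""
    e164 = to_e164(called_number)
    # Hypothetical grouping: country code plus the first few digits as a prefix group.
    destination_group = e164[:5] if e164 else "unparsed"
    return (route, destination_group, traffic_profile)

# The same dialed number lands in the same KPI bucket regardless of input format.
print(segment_key("trunk-a", "+1 415 555 0100", "conversational"))
print(segment_key("trunk-a", "(415) 555-0100", "dialer"))
```

The point is stability: once grouping keys are deterministic, “today vs baseline for the same destination and hour” comparisons become meaningful.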
The KPI core: ASR, ACD, PDD — definitions, formulas, and what they really signal
ASR, ACD, and PDD are widely used because they act as early-warning signals for route health—especially when you learn to read them together.
ASR (Answer Seizure Ratio): definition, formula, and interpretation
Definition: ASR is the percentage of attempted calls that are answered.
Formula: ASR = (answered calls ÷ total attempts) × 100%
What it tends to signal (operationally): call setup success and completion—often impacted by blocking, number quality, or filtering. However, ASR is also affected by user behavior; if the called party doesn’t answer, ASR falls even if the network is fine. That’s why segmentation and baseline comparisons matter.
ACD / ALOC (Average Call Duration): what it can and can’t tell you
Definition: ACD is the average duration of answered calls. (Many teams also call it ALOC: Average Length of Call.)
Formula: ACD = (sum of durations for answered calls) ÷ (number of answered calls)
Operational reading: A falling ACD can indicate callers are hanging up early (poor experience), calls are dropping, or a traffic mix shift is happening (more voicemail, more short-burst dialing, more “quick-answer then hang up”). Treat ACD as a corroborating signal, not a single source of truth.
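For reference, here is a minimal sketch of both formulas over simplified CDR-style records; the field names are placeholders for whatever your CDR schema actually exposes.

```python
# Minimal sketch: compute ASR and ACD/ALOC from simplified CDR-like records.
# Field names (answered, duration_s) are placeholders for your own CDR schema.
from dataclasses import dataclass

@dataclass
class CallRecord:
    answered: bool       # did the call reach an answered state?
    duration_s: float    # connected duration in seconds (0 if unanswered)

def asr_percent(calls: list[CallRecord]) -> float:
    """ASR = answered calls / total attempts * 100."""
    if not calls:
        return 0.0
    answered = sum(1 for c in calls if c.answered)
    return answered / len(calls) * 100.0

def acd_seconds(calls: list[CallRecord]) -> float:
    """ACD (ALOC) = total duration of answered calls / number of answered calls."""
    answered = [c for c in calls if c.answered]
    if not answered:
        return 0.0
    return sum(c.duration_s for c in answered) / len(answered)

sample = [CallRecord(True, 180), CallRecord(True, 45), CallRecord(False, 0), CallRecord(False, 0)]
print(asr_percent(sample))  # 50.0
print(acd_seconds(sample))  # 112.5
```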
PDD (Post-Dial Delay): why spikes feel like outages
PDD is the time between initiating a call (INVITE) and receiving meaningful progress from the caller’s point of view (ringback/early media) or an answer. Even if calls eventually connect, a PDD spike can look like an outage because users abandon when they hear silence.
One concrete place to check in SIP environments is early media. In practice, 183 Session Progress with SDP is often required for proper ringback/early media handling, and some carriers require PRACK support. Whether this applies depends on your carrier and your SBC/PBX behavior; confirm with traces rather than assuming.
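If you derive PDD yourself from SBC logs or pcaps, the core logic is small. The sketch below assumes you already have timestamped SIP events per call; whether a 183 without SDP counts as “meaningful progress” is a policy decision for your environment.

```python
# Minimal sketch: derive PDD per call from timestamped SIP events.
# Assumes you already extract (timestamp, method/response) pairs per Call-ID from
# SBC logs or pcaps; the event format here is a placeholder.
from datetime import datetime
from typing import Optional

MEANINGFUL_PROGRESS = {"180", "183", "200"}  # ringback, session progress, answer

def pdd_seconds(events: list[tuple[datetime, str]]) -> Optional[float]:
    """Seconds from the first INVITE to the first meaningful progress response."""
    invite_ts = next((ts for ts, msg in events if msg == "INVITE"), None)
    if invite_ts is None:
        return None
    for ts, msg in events:
        if msg in MEANINGFUL_PROGRESS and ts >= invite_ts:
            return (ts - invite_ts).total_seconds()
    return None  # no progress observed: that is a failure, not a PDD sample

events = [
    (datetime(2024, 1, 1, 12, 0, 0, 0), "INVITE"),
    (datetime(2024, 1, 1, 12, 0, 0, 200000), "100"),   # 100 Trying is not "progress"
    (datetime(2024, 1, 1, 12, 0, 4, 500000), "183"),   # ringback/early media
]
print(pdd_seconds(events))  # 4.5
```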
Decision table: interpreting KPI patterns together
| What you see | Likely meaning (hypothesis) | First checks | Fast action options |
|---|---|---|---|
| ASR drops, 4xx/5xx rises, PDD normal | Call completion/routing failures by destination or route | Review error mix; slice by destination and route | Compare alternate route with controlled sample; sideline failing route if confirmed |
| PDD spikes, ASR flat or slightly down | Call setup slowness / progress signaling issues (often early media handling) | Inspect 183 w/ SDP behavior; confirm ringback behavior; check if PRACK is required | Route comparison; prioritize user-impact mitigation (fail over) while triaging |
| ASR flat, ACD drops sharply | Calls connect but users hang up early, or traffic mix changed | Segment traffic profile; listen to samples if available; check recent dialer/campaign changes | Suppress paging until corroborated; open investigation ticket rather than incident page |
| OPTIONS OK, but INVITE scenario fails or slows | Degraded state not caught by availability probes | Review INVITE responses and timing; check PDD and 4xx/5xx mix | Trigger “degraded trunk” alert; shift traffic if you have redundancy |
Note: The “likely meaning” column is intentionally framed as hypotheses. SIP failures are context-dependent; confirm with segmentation, controlled tests, and evidence collection.
Alerting on SIP response codes: 4xx/5xx rate, error mix, and the “top offenders” view
SIP response codes are noisy in isolation. The operational win comes from alerting on (1) the rate of failures and (2) the mix of failures changing unexpectedly.
What to alert on: failure rate and error mix changes
- 4xx% and 5xx% of attempts, per route and destination group.
- Error mix drift: e.g., routing-type 4xx errors suddenly rising against a steady background of 486 Busy.
- Top offenders: the top N destinations/prefixes causing errors, so on-call can quickly scope the blast radius.
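Here is a minimal sketch of both views, assuming you can export (destination group, final response code) per attempt from your CDRs or SBC; the input shape is illustrative.

```python
# Minimal sketch: summarize the SIP error mix and surface "top offender" destinations.
# The (destination_group, response_code) input shape is a placeholder for whatever
# your CDR/SBC export provides.
from collections import Counter

def error_mix(attempts: list[tuple[str, int]]) -> dict[str, float]:
    """Share of failed attempts per response code (4xx/5xx)."""
    failures = [code for _, code in attempts if 400 <= code < 600]
    total = len(failures) or 1
    return {str(code): count / total for code, count in Counter(failures).items()}

def top_offenders(attempts: list[tuple[str, int]], n: int = 5) -> list[tuple[str, int]]:
    """Destination groups with the most 4xx/5xx responses."""
    counts = Counter(dest for dest, code in attempts if 400 <= code < 600)
    return counts.most_common(n)

attempts = [("DE-mobile", 200), ("DE-mobile", 486), ("US-48", 503), ("US-48", 503), ("US-48", 404)]
print(error_mix(attempts))      # {'486': 0.25, '503': 0.5, '404': 0.25}
print(top_offenders(attempts))  # [('US-48', 3), ('DE-mobile', 1)]
```

Comparing today’s mix against yesterday’s (or the same hour last week) is what turns this from a report into an alertable signal.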
503 specifically: treat it as a symptom, not a diagnosis
You’ll see “SIP 503 Service Unavailable” show up in production during real incidents. The practical approach is to avoid assuming a universal root cause: 503 only says that some element on the path could not or would not process the request right now (overload, maintenance, or a failure further along the path surfaced as 503). Treat it as a symptom and collect enough evidence to narrow it:
- Which route/trunk/POP returned 503?
- Is it destination-specific or global?
- Did the error mix change (e.g., 503 replacing 4xx failures)?
- What do SIP ladders or pcaps show for failed vs successful calls?
Correlate response codes with ASR/PDD by destination and hour
Voice traffic is seasonal and destination-dependent. Instead of alerting on a single global threshold, correlate:
- ASR change + 4xx/5xx change for the same destination group
- PDD spikes at the same hour across multiple destinations (systemic) vs one destination (routing)
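One way to encode the “systemic vs one destination” question, assuming you keep per-destination PDD baselines for the same hour; the spike factor and the systemic share are illustrative thresholds, not standards.

```python
# Minimal sketch: classify a PDD spike hour as "systemic" vs "destination-specific".
# pdd_by_destination maps destination group -> (current PDD, baseline PDD) for the
# same hour; the 1.5x spike factor and 50% systemic share are illustrative.
def classify_pdd_spike(pdd_by_destination: dict[str, tuple[float, float]],
                       spike_factor: float = 1.5,
                       systemic_share: float = 0.5) -> str:
    spiking = [dest for dest, (current, baseline) in pdd_by_destination.items()
               if baseline > 0 and current >= spike_factor * baseline]
    if not spiking:
        return "no spike"
    share = len(spiking) / len(pdd_by_destination)
    if share >= systemic_share:
        return f"systemic: {len(spiking)} of {len(pdd_by_destination)} destinations spiking"
    return f"destination-specific: {', '.join(sorted(spiking))}"

print(classify_pdd_spike({"US-48": (9.0, 3.0), "DE-mobile": (3.1, 3.0), "UK-geo": (2.8, 2.9)}))
# destination-specific: US-48
```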
Synthetic monitoring: OPTIONS, REGISTER, and INVITE scenarios (and what each catches)
Synthetic probes reduce MTTR because they answer a simple question quickly: “Can we complete the call flow we care about right now?” Common monitoring patterns include OPTIONS, REGISTER checks, and INVITE-based call scenarios.
OPTIONS health checks: availability + latency
OPTIONS is ideal for basic reachability and response time. It’s also easy to export into time-series systems. If your environment already exports OPTIONS latency and failure counters into a metrics system (e.g., Prometheus-style), that’s a solid starting point.
- Alert on sustained OPTIONS failure counts (hard down).
- Alert on OPTIONS latency deviations from baseline (early signal that something is saturating).
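If you don’t already have an OPTIONS probe, a deliberately simplified UDP version looks like the sketch below. A real monitor (or your SBC’s built-in OPTIONS ping) also handles retransmissions, TCP/TLS, DNS/SRV, and proper source-address handling; the target and identities here are placeholders.

```python
# Minimal sketch: a bare-bones SIP OPTIONS probe over UDP that measures round-trip time.
# Placeholder target and identities; production probes should also handle retransmits,
# TCP/TLS, DNS/SRV, and use the probe's real address in the Via header.
import socket, time, uuid

def options_probe(host: str, port: int = 5060, timeout: float = 2.0):
    request = (
        f"OPTIONS sip:{host} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP 0.0.0.0:5060;branch=z9hG4bK{uuid.uuid4().hex[:10]}\r\n"  # replace 0.0.0.0 with the probe's reachable IP
        f"Max-Forwards: 70\r\n"
        f"From: <sip:monitor@example.invalid>;tag={uuid.uuid4().hex[:8]}\r\n"
        f"To: <sip:{host}>\r\n"
        f"Call-ID: {uuid.uuid4().hex}\r\n"
        f"CSeq: 1 OPTIONS\r\n"
        f"Content-Length: 0\r\n\r\n"
    )
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        start = time.monotonic()
        sock.sendto(request.encode(), (host, port))
        data, _ = sock.recvfrom(4096)
        latency_ms = (time.monotonic() - start) * 1000.0
        status_line = data.decode(errors="replace").splitlines()[0]
        return status_line, latency_ms
    except socket.timeout:
        return None, None  # count as a failed probe
    finally:
        sock.close()

# Example (placeholder target): status, ms = options_probe("sbc.example.invalid")
```

Export the status line and latency into your time-series system and alert on the two conditions above.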
REGISTER checks (when applicable): auth/registration path validation
When your design involves registrations, REGISTER probes validate authentication and registration flow. Monitoring tools commonly support this as a dedicated probe type.
- Detects auth misconfigurations, credential expirations, and registration-path issues.
- Does not prove that real calls (INVITEs) will set up correctly.
INVITE call scenarios: catch what OPTIONS can’t
INVITE-based call scenarios execute full call setup and teardown, which makes them better at detecting degraded states: unexpected SIP responses during setup, progress signaling problems, or timing regressions that show up as PDD spikes.
Implementation detail varies by environment, but the operational principle is consistent: keep at least one controlled, known-good call path you can test repeatedly and compare across routes.
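Whatever drives the test call (SIPp, a softphone harness, or a vendor feature), the result you care about is compact: the final response and how long setup took. Here is a minimal sketch of turning recent runs into a health verdict, with illustrative thresholds and a placeholder result shape.

```python
# Minimal sketch: judge an INVITE test scenario as healthy/degraded from its recent runs.
# The runner itself is out of scope; the thresholds and the (final_code, setup_seconds)
# result shape are illustrative.
from dataclasses import dataclass

@dataclass
class ScenarioRun:
    final_code: int       # final SIP response for the test call (e.g., 200, 487, 503)
    setup_seconds: float  # INVITE -> first ringback/answer, i.e. the scenario's PDD

def scenario_status(runs: list[ScenarioRun],
                    max_bad_share: float = 0.2,
                    max_setup_seconds: float = 6.0) -> str:
    if not runs:
        return "unknown: no recent runs"
    failures = sum(1 for r in runs if r.final_code >= 400)
    slow = sum(1 for r in runs if r.setup_seconds > max_setup_seconds)
    if failures / len(runs) > max_bad_share:
        return f"degraded: {failures}/{len(runs)} test calls failed"
    if slow / len(runs) > max_bad_share:
        return f"degraded: {slow}/{len(runs)} test calls exceeded {max_setup_seconds}s setup"
    return "healthy"

runs = [ScenarioRun(200, 2.1), ScenarioRun(200, 2.4), ScenarioRun(503, 0.8), ScenarioRun(200, 7.5)]
print(scenario_status(runs))  # degraded: 1/4 test calls failed
```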
Practical alert rules (templates): thresholds, baselines, and suppression logic
In production, dashboards don’t page anyone. Alert rules do—and they need to be engineered to avoid false positives.
Checklist #1: build your baseline before you page anyone
- Choose destination groupings (country/prefix/carrier group) and normalize numbering (e.g., E.164) so grouping is stable.
- Separate traffic profiles (e.g., human traffic vs contact-center/dialer traffic) so ASR/ACD aren’t blended into noise.
- Pick comparison windows: “today vs your own baseline for the same destination and hour” and “route vs route.”
- Define minimum sample sizes per bucket (attempts per 5–15 minutes) to avoid noisy paging.
Baseline-driven alerts (recommended) vs static thresholds (use cautiously)
For voice alerting, baseline comparisons (today vs your own normal for the same destination and hour) and route-to-route comparisons are often more reliable than global static thresholds because call behavior is not stationary.
- ASR anomaly (per route + destination): alert when ASR deviates materially from baseline for the same hour/day pattern and attempts exceed a minimum threshold.
- PDD spike (per route): alert when PDD exceeds baseline; optionally corroborate with synthetic INVITE scenario slowness/failure.
- Error-mix shift: alert when a specific response code class (4xx/5xx) increases and replaces the normal mix.
If you must use static thresholds, treat them as guardrails for catastrophic conditions (e.g., “ASR collapsed”) rather than routine degradation detection. Static thresholds can create alert fatigue when traffic profiles change.
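As a sketch of the baseline-driven ASR rule above: baselines keyed by route, destination group, and hour of day are one reasonable way to store “your own normal”; the deviation and sample thresholds here are illustrative, not recommendations.

```python
# Minimal sketch: baseline-driven ASR alert with a minimum-sample gate.
# Baselines keyed by (route, destination_group, hour_of_day) are an assumption about
# how "your own normal" is stored; thresholds are illustrative.
def asr_alert(route: str, destination: str, hour: int,
              attempts: int, answered: int,
              baseline_asr: dict[tuple[str, str, int], float],
              min_attempts: int = 50,
              max_relative_drop: float = 0.30):
    if attempts < min_attempts:
        return None  # low-sample suppression: don't page on tiny buckets
    baseline = baseline_asr.get((route, destination, hour))
    if baseline is None or baseline <= 0:
        return None  # no baseline yet: collect data instead of paging
    current = answered / attempts * 100.0
    drop = (baseline - current) / baseline
    if drop >= max_relative_drop:
        return (f"ASR anomaly on {route}/{destination} @ {hour:02d}h: "
                f"{current:.1f}% vs baseline {baseline:.1f}% ({attempts} attempts)")
    return None

baselines = {("trunk-a", "DE-mobile", 14): 62.0}
print(asr_alert("trunk-a", "DE-mobile", 14, attempts=120, answered=40, baseline_asr=baselines))
# ASR anomaly on trunk-a/DE-mobile @ 14h: 33.3% vs baseline 62.0% (120 attempts)
```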
Suppression and dedup: reduce alert fatigue without hiding real incidents
- Low sample suppression: don’t page on tiny buckets; route issues need enough attempts to be confident.
- Corroboration gates: page only when two signals align (example: ASR drop + 5xx rise; or PDD spike + INVITE scenario slow/failing).
- Dedup by root dimension: one incident per route + destination group, not 200 alerts per DID.
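Here is a minimal sketch of the corroboration gate and dedup key working together; the 30-minute dedup window and the specific signal pairs mirror the examples above and are illustrative.

```python
# Minimal sketch: corroboration gate plus dedup key so one route/destination issue
# pages once. Signal names follow the examples above; window and key shape are illustrative.
from datetime import datetime, timedelta

_open_incidents: dict[tuple[str, str], datetime] = {}

def should_page(route: str, destination: str, signals: set[str],
                now: datetime, dedup_window: timedelta = timedelta(minutes=30)) -> bool:
    # Corroboration gate: require at least two independent signals to align.
    corroborated = (
        {"asr_drop", "5xx_rise"} <= signals
        or {"pdd_spike", "invite_scenario_slow"} <= signals
    )
    if not corroborated:
        return False
    key = (route, destination)           # dedup by root dimension, not per DID
    last = _open_incidents.get(key)
    if last and now - last < dedup_window:
        return False                     # already paged for this route/destination
    _open_incidents[key] = now
    return True

now = datetime(2024, 1, 1, 14, 5)
print(should_page("trunk-a", "DE-mobile", {"asr_drop"}, now))              # False
print(should_page("trunk-a", "DE-mobile", {"asr_drop", "5xx_rise"}, now))  # True
print(should_page("trunk-a", "DE-mobile", {"asr_drop", "5xx_rise"}, now))  # False (deduped)
```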
Incident response playbooks: what to do when alerts fire
The fastest teams treat voice incidents like infrastructure incidents: they have a scoped triage path, controlled tests, and a standard escalation bundle. Below are practical playbooks based on common operational patterns (controlled samples, alternate-route comparison, early-media checks like 183 w/ SDP, and packaging SIP ladders/pcaps).
Checklist #2: first-response triage (use in the first 10–15 minutes)
- Confirm scope: which route/trunk/POP and which destinations are affected?
- Check ASR/PDD trend vs baseline for the same destination and hour.
- Inspect SIP error mix (what changed?) rather than just total failures.
- Run a controlled sample (10–20 calls) at the same hour to the affected destination(s).
- Compare against an alternate route if available; if KPIs recover on the alternate route, consider sidelining the bad one.
ASR drop playbook
- Step 1: Slice ASR by destination group and route. Identify whether the drop is localized or global.
- Step 2: Review the error mix (e.g., increased 4xx routing errors vs busy signals).
- Step 3: Send a controlled sample (10–20 calls) at the same hour to validate reproducibility.
- Step 4: Compare an alternate route. If ASR recovers on the alternate route, sideline the bad one (per your change policy).
- Step 5: If call recordings are available in your tooling, listen to a few samples to classify the failure mode (silence, early media, IVR behavior, one-way audio). Availability depends on your stack and policies.
PDD spike playbook (user-perceived “dead air”)
- Step 1: Confirm PDD spike by destination and route; check whether it aligns with INVITE scenario timing changes.
- Step 2: Inspect early media handling. A practical focal point is whether your PBX/SBC treats 183 Session Progress with SDP correctly and plays ringback promptly.
- Step 3: Validate whether PRACK is required by the carrier and supported end-to-end (environment-specific; confirm with traces).
- Step 4: Compare across routes; if only one route has elevated PDD, treat it as route degradation and mitigate accordingly.
4xx/5xx surge playbook (including 503)
- Step 1: Identify “top offenders” by destination group and route. Don’t assume a single root cause.
- Step 2: Capture SIP ladders or pcaps for a small set of failing and succeeding calls.
- Step 3: Verify recent changes (routing changes, SBC policy updates, codec/SDP changes) and whether timing aligns with the onset.
- Step 4: Run the controlled sample and compare alternate route behavior to decide whether to sideline the failing route.
Packaging an escalation that carriers can act on (reduce back-and-forth)
Carrier tickets stall when they’re vague (“calls failing”). They move when you provide a concise, reproducible evidence bundle. A practical escalation bundle usually includes timestamps, called/calling numbers (appropriately redacted), SIP ladders or pcaps, and KPI snapshots—plus a few samples for listening when possible.
Escalation bundle template (copy/paste)
- Incident window: start/end timestamps (with timezone) + whether ongoing
- Scope: trunk/route/POP identifiers; affected destination group(s)
- Examples: 5–10 failing call examples and 2–3 successful examples (same destination), each with:
- timestamp
- calling number (ANI/CLI) and called number (DNIS) (redact/anonymize per your policy)
- SIP response code(s) observed
- PDD (if you measure it) and any other relevant KPIs
- Artifacts: SIP ladder diagrams or pcaps for the examples
- What you already tried: controlled sample size (e.g., 10–20 calls), alternate-route comparison outcome, and whether sidelining/failover mitigated impact
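If you want every ticket to look the same, you can render the bundle from structured incident data. A minimal sketch with placeholder field names; redact numbers per your own policy before sending.

```python
# Minimal sketch: render the escalation bundle above from structured incident data.
# All field names are placeholders; adapt to your incident tooling and redaction policy.
def render_bundle(incident: dict) -> str:
    lines = [
        f"Incident window: {incident['start']} - {incident.get('end', 'ongoing')} ({incident['timezone']})",
        f"Scope: route={incident['route']}, destinations={', '.join(incident['destinations'])}",
        "Examples:",
    ]
    for ex in incident["examples"]:
        lines.append(
            f"  - {ex['timestamp']} | CLI {ex['cli']} -> DNIS {ex['dnis']} | "
            f"SIP {ex['sip_code']} | PDD {ex.get('pdd_s', 'n/a')}s"
        )
    lines.append(f"Artifacts: {incident.get('artifacts', 'pcaps/SIP ladders attached')}")
    lines.append(f"Already tried: {incident.get('already_tried', '')}")
    return "\n".join(lines)

print(render_bundle({
    "start": "2024-01-01 13:40", "timezone": "UTC",
    "route": "trunk-a", "destinations": ["DE-mobile"],
    "examples": [{"timestamp": "2024-01-01 13:42:10", "cli": "+49xxxxxx01",
                  "dnis": "+49xxxxxx99", "sip_code": 503, "pdd_s": 0.4}],
    "already_tried": "20-call controlled sample, alternate route recovered ASR",
}))
```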
Compliance note: This article is operational guidance, not legal advice. If you collect pcaps or recordings, apply your organization’s privacy, retention, and access controls.
FAQ
What is ASR and ACD in VoIP (and why can they be misleading)?
ASR (Answer Seizure Ratio) is the percentage of attempted calls that are answered (answered ÷ attempts × 100%). ACD (Average Call Duration) is the average duration of answered calls (total duration of answered calls ÷ number of answered calls). Both can be misleading if you mix traffic profiles or ignore user behavior effects (no-answer/busy lowers ASR even when the network is fine), so segmenting and baseline comparisons are essential.
What is the formula for ASR and for ACD?
ASR = (answered calls ÷ total attempts) × 100%. ACD = (sum of durations for answered calls) ÷ (number of answered calls).
What is PDD (post-dial delay) and what typically causes spikes?
PDD measures how long it takes a caller to get progress (ringback/early media) or an answer after dialing. Spikes typically indicate call setup slowness and can cause user abandonment. A common operational check is early media handling: whether 183 Session Progress with SDP (and PRACK if required) is handled correctly end-to-end.
What should I alert on for SIP trunks: ASR/ACD/PDD or SIP 4xx/5xx rates?
Use both. ASR/ACD/PDD represent user experience and route health; SIP 4xx/5xx rates and error mix explain failure patterns and speed triage. Relying on only one category increases false positives and blind spots.
What does SIP 503 Service Unavailable mean, and what should I check first?
SIP 503 means an element on the call path was temporarily unable to process the request (overload, maintenance, or a failure further along the path surfaced as 503); it’s a symptom, not a single root cause. First, scope it (route/destination/time), inspect the error mix, capture SIP ladders/pcaps, run controlled test calls, and compare against an alternate route before escalating with a clear evidence bundle.
IllyVoIP’s engineering viewpoint is simple: treat voice like production infrastructure. Monitor beyond up/down, alert against your own baselines, and keep a runbook-ready escalation bundle so you can restore service fast and reduce back-and-forth with carriers.
CTA: If you’re building a production monitoring and routing approach for SIP trunks, use this guide as your starting runbook—and align your alert dimensions (route, destination, traffic profile) before you tune thresholds.
