Three Ways Systems Degrade — and the ML Architecture That Catches All of Them

If your monitoring only catches spikes, you’re missing at least half of production failures. The other half arrive differently — as quiet level shifts after deployments, or as slow drift that accumulates over weeks without a single measurement ever looking wrong.

This post covers three topics that share a common thread: what to do when spikes aren’t the problem. They are three-pillar ML detection for the full failure spectrum, pull-model probe architecture for the monitoring blind spot that the push model creates, and calibrated alerting for the operational cost of getting sensitivity wrong.

Start Here: What the Videos Cover

▶  Video 31 — “Three-Pillar ML Detection in Practice: Anomaly + Change-Point + Drift”

▶  Video 32 — “Edge Probe Resilience: The Pull Model and Surviving WAN Failures”

▶  Video 33 — “Calibrated Alerting: Auto-Tuning, Flapping, and the Cost of Noise”

The Three Failure Modes

Production systems degrade in three distinct ways. Each one has a characteristic signature, and each one requires a different detection approach. A platform that implements only one of the three will miss the other two.

Failure Mode Signatures

| Failure Mode | Signature | Example | Detection Method |
| --- | --- | --- | --- |
| Spike / Anomaly | Sudden deviation from expected range; resolves quickly or escalates | Latency jumps from 45ms to 800ms at 2:14 AM | Anomaly detection (Isolation Forest) |
| Level Shift / Step Change | Permanent move to a new operating level after a discrete event | Deploy at 3 PM; latency moves from 45ms to 72ms and stays there | Change-point detection (PELT algorithm) |
| Gradual Drift | Slow, monotonic degradation; no single measurement looks wrong | DNS resolution time increases 1ms per day over six weeks | Drift detection (rolling baseline comparison) |

Pillar 1: Anomaly Detection

Anomaly detection is the most familiar approach. A statistical model learns the expected distribution of a metric — accounting for time-of-day and day-of-week patterns — and fires when a measurement falls outside that learned range. It’s highly effective for sudden, obvious failures.

Its limitation is equally important to understand: anomaly detection is calibrated to the current baseline. When a deployment causes a permanent baseline shift — latency moves from 45ms to 72ms and stays there — the anomaly model adapts to the new baseline. Within hours or days, the new level is normal. The step change was never flagged.
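To make the idea of a learned, time-aware range concrete, here is a minimal sketch. It is not the Isolation Forest named in the table above; it swaps in a simpler per-hour z-score so the example stays dependency-free. The function names, the `z_threshold` of 4.0, the `min_std` floor, and the synthetic latency history are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history):
    """history: iterable of (hour_of_day, value). Returns {hour: (mean, std)}."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    # An hour needs at least two samples before a spread can be estimated.
    return {h: (mean(v), stdev(v)) for h, v in buckets.items() if len(v) >= 2}


def is_anomaly(baseline, hour, value, z_threshold=4.0, min_std=1.0):
    """True when value sits more than z_threshold deviations from that hour's mean."""
    if hour not in baseline:
        return False  # no learned range for this hour yet
    mu, sigma = baseline[hour]
    # min_std is an assumed floor so a very quiet hour does not alert on tiny wobbles.
    return abs(value - mu) / max(sigma, min_std) > z_threshold


# Synthetic history (assumed): 14 days, about 45 ms overnight and 60 ms in business hours.
history = [
    (hour, (60.0 if 9 <= hour < 18 else 45.0) + (day % 3 - 1))
    for day in range(14)
    for hour in range(24)
]
baseline = build_baseline(history)

print(is_anomaly(baseline, 2, 800.0))   # the 2 AM spike from the table
print(is_anomaly(baseline, 2, 46.0))    # ordinary overnight reading
print(is_anomaly(baseline, 14, 60.0))   # ordinary business-hours reading
```

Because the baseline is keyed by hour, a 60 ms reading is treated as normal at 2 PM but would stand out at 2 AM. If `build_baseline` is rerun on recent data after a deployment, the new level becomes the learned range, which is the adaptation described above.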

Pillar 2: Change-Point Detection

Change-point detection identifies the moment a metric’s behavior permanently shifts. The PELT (Pruned Exact Linear Time) algorithm searches for the set of cut points that best splits the series into segments with significantly different statistical properties, and it prunes candidate cut points that can no longer be part of the best split, so it does not have to score every possible segmentation. The output is a timestamp: the baseline changed at 15:04:22 UTC.

This is what anomaly detection misses and what makes change-point detection operationally distinct. When a deployment at 3 PM shifts latency from 45 to 72 ms permanently, anomaly detection sees 72 ms as the new normal. Change-point detection records the shift and attributes it to the deployment. Over time, a history of change-point events becomes a record of every configuration change that affected metric behavior — including the ones that weren’t supposed to.
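The following is a compact sketch of PELT for the simplest case, a shift in the mean, using a squared-error segment cost. The function name, the penalty value of 50, and the synthetic latency series are assumptions for illustration; a production system would choose the cost function and penalty to suit its own data.

```python
def pelt_mean_shift(values, penalty):
    """Return the indices where a new segment starts (PELT, squared-error cost)."""
    n = len(values)
    # Prefix sums let any segment's cost be computed in constant time.
    s1, s2 = [0.0], [0.0]
    for x in values:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        """Sum of squared deviations from the mean over values[a:b]."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)  # best[t]: minimal penalized cost of values[:t]
    best[0] = -penalty
    prev = [0] * (n + 1)    # prev[t]: start index of the last segment in that optimum
    candidates = [0]
    for t in range(1, n + 1):
        scores = [best[tau] + cost(tau, t) + penalty for tau in candidates]
        i = min(range(len(scores)), key=scores.__getitem__)
        best[t], prev[t] = scores[i], candidates[i]
        # Pruning: a candidate whose unpenalized score already exceeds best[t]
        # can never start the last segment of a later optimum.
        candidates = [tau for tau, s in zip(candidates, scores) if s - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts.
    change_points = []
    t = n
    while prev[t] > 0:
        change_points.append(prev[t])
        t = prev[t]
    return change_points[::-1]


# Synthetic series (assumed): 50 samples near 45 ms, then 50 samples near 72 ms.
latency = [(45.0 if i < 50 else 72.0) + ((i * 7) % 5 - 2) * 0.5 for i in range(100)]
print(pelt_mean_shift(latency, penalty=50.0))  # → [50]
```

The returned index is the first sample of the new regime. Mapping it to the sample's timestamp, and then to the nearest deployment record, is what turns the result into an attributable event.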

Pillar 3: Drift Detection

Drift detection catches what neither anomaly nor change-point detection surfaces: the problem that arrives so slowly that no single day looks unusual.

Your baseline DNS resolution time was 30ms six months ago. Today it’s 45ms. No single measurement triggered an anomaly. No step change is visible. But the cumulative drift is 50% — and it’s quietly consuming latency budget that other parts of your stack depend on. Drift detection measures the divergence of the current rolling average from the historical baseline and fires when the accumulated deviation crosses a configured threshold.
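A rolling-baseline comparison of this kind can be sketched in a few lines. The window sizes (30 days of baseline, 7 days of recent data), the 25% threshold, and the synthetic daily averages are assumptions for illustration, not values from any particular platform.

```python
from statistics import mean


def drift_ratio(daily_means, baseline_window, recent_window):
    """Relative change of the recent rolling mean versus the historical baseline."""
    baseline = mean(daily_means[:baseline_window])
    recent = mean(daily_means[-recent_window:])
    return (recent - baseline) / baseline


def drift_alert(daily_means, baseline_window=30, recent_window=7, threshold=0.25):
    """True when accumulated deviation from the baseline crosses the threshold."""
    if len(daily_means) < baseline_window + recent_window:
        return False  # not enough history to compare yet
    return abs(drift_ratio(daily_means, baseline_window, recent_window)) > threshold


# Synthetic history (assumed): 30 days flat at 30 ms, then +0.1 ms per day for 150 days.
daily_means = [30.0] * 30 + [30.0 + 0.1 * d for d in range(1, 151)]

print(round(drift_ratio(daily_means, 30, 7), 2))  # → 0.49
print(drift_alert(daily_means))                   # → True
print(drift_alert(daily_means[:60]))              # → False
```

Each daily step in this series is 0.1 ms, far too small to look unusual on its own, yet the comparison against the fixed historical window shows roughly the 50% cumulative change described above.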

How They Work Together

In practice, the three pillars fire at different stages of the same incident. A deployment on day zero triggers a change-point event — flagged, but the new level is within acceptable range. Over the following two weeks, drift detection shows the shifted baseline continuing to climb. On day 15, a drift alert fires. On day 17, as degradation accelerates, an anomaly fires. A platform running only anomaly detection catches the problem on day 17. Three-pillar detection flagged it on day zero and tracked it every day since.

The Architecture Blind Spot: Push vs. Pull Probes

A less-discussed but consequential architectural choice in probe design is whether results flow from probe to server (push) or are retrieved by the server from the probe (pull). The difference matters most during the exact events you most need monitoring for: WAN failures at remote sites.

In a push model, the probe sends results to the central server after each check. When the WAN link fails, the probe has no outbound path. It goes silent. The monitoring platform shows a gap in data starting at the moment of the failure — the same moment you need the data most. By the time the WAN recovers and normal reporting resumes, the incident window is gone.

In a pull model, the probe stores results locally and waits for the server to retrieve them. When the WAN fails, the probe continues running checks and continues writing to local storage. When connectivity recovers, the server fetches everything buffered during the outage. If the outage lasts 75 minutes, for example, the full incident window is captured at per-minute resolution and backfilled automatically.
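The buffering and backfill behavior can be sketched as a probe-side store that the server reads by sequence number. `ProbeBuffer`, `Collector`, the buffer capacity, and the simulated outage are hypothetical names and values for illustration only; a real probe would persist to disk and expose the fetch over a network API.

```python
from collections import deque


class ProbeBuffer:
    """Local result store on the probe; the server pulls from it by sequence number."""

    def __init__(self, max_results=10_000):
        # Bounded buffer (assumed size): the oldest results are dropped if it fills.
        self._results = deque(maxlen=max_results)
        self._next_seq = 0

    def record(self, timestamp, check, value):
        self._results.append((self._next_seq, timestamp, check, value))
        self._next_seq += 1

    def fetch_since(self, last_seq):
        """Return every buffered result with a sequence number above last_seq."""
        return [r for r in self._results if r[0] > last_seq]


class Collector:
    """Server side: remembers the last sequence number it has stored."""

    def __init__(self):
        self.last_seq = -1
        self.stored = []

    def pull(self, probe, wan_up=True):
        if not wan_up:
            return 0  # server cannot reach the probe; the probe keeps recording
        batch = probe.fetch_since(self.last_seq)
        if batch:
            self.stored.extend(batch)
            self.last_seq = batch[-1][0]
        return len(batch)


probe, collector = ProbeBuffer(), Collector()
pulled = []
for minute in range(90):
    probe.record(minute * 60, "dns", 30.0)   # the probe runs a check every minute
    wan_up = not (10 <= minute < 85)         # simulated 75-minute WAN outage
    pulled.append(collector.pull(probe, wan_up))

print(max(pulled))            # → 76
print(len(collector.stored))  # → 90
```

The first pull after recovery returns the 75 results recorded during the outage plus the current one, so the stored series has no gap. The bounded buffer is a design trade-off: an outage longer than the buffer's capacity would lose the oldest results.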

The question to ask any probe-based platform: what does a probe do when it can’t reach your server? If the answer is “it stops reporting,” you have a push model — and a data gap during every WAN failure. If the answer is “it buffers locally and backfills on recovery,” you have a resilient architecture that captures outages instead of hiding them.

Calibrated Alerting: The Economic Case

Alert noise has a quantifiable cost. One false-positive alert per day, times 30 engineers who each spend 5 minutes triaging it, equals 2.5 engineer-hours wasted daily. Over a year, that’s more than 900 hours spent responding to alerts that required no action. Teams that experience this consistently learn to discount alerts — which means the real ones get slower responses.
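The arithmetic above is easy to rerun with your own team's numbers. The function name and the 365-day year are assumptions; the inputs are the figures from the paragraph.

```python
def alert_triage_hours(false_alerts_per_day, engineers, minutes_each, days=365):
    """Return (engineer-hours per day, engineer-hours per year) spent on false alerts."""
    daily = false_alerts_per_day * engineers * minutes_each / 60
    return daily, daily * days


daily, yearly = alert_triage_hours(false_alerts_per_day=1, engineers=30, minutes_each=5)
print(daily, yearly)  # → 2.5 912.5
```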

Auto-Tuning: Following the Pattern

Static thresholds require constant manual maintenance. Traffic patterns change with seasons, product launches, and user growth. A threshold tuned in January is wrong by July. Auto-tuning addresses this by observing the metric’s natural variation over time — its daily peaks, weekly cycles, expected ranges — and adjusting thresholds dynamically. The result: fewer false positives during predictable high-load periods, maintained sensitivity during off-peak hours when genuine failures are more likely and less expected.
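One way to sketch this kind of auto-tuning is a per-hour threshold that is nudged each day toward a high percentile of that hour's observations plus some headroom. The function names, the 95th percentile, the 1.3 headroom factor, the 0.2 smoothing weight, and the synthetic traffic are all assumptions for illustration.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def retune(thresholds, samples_by_hour, headroom=1.3, alpha=0.2):
    """Move each hour's threshold a fraction alpha toward p95 * headroom."""
    updated = dict(thresholds)
    for hour, values in samples_by_hour.items():
        if not values:
            continue
        target = p95(values) * headroom
        previous = updated.get(hour, target)  # first observation seeds the threshold
        updated[hour] = (1 - alpha) * previous + alpha * target
    return updated


def one_day(peak_ms, offpeak_ms=45.0):
    """Synthetic day (assumed): three samples per hour, higher in business hours."""
    day = {}
    for hour in range(24):
        base = peak_ms if 9 <= hour < 18 else offpeak_ms
        day[hour] = [base - 1, base, base + 1]
    return day


thresholds = {}
for _ in range(30):
    thresholds = retune(thresholds, one_day(peak_ms=60.0))
print(thresholds[2] < 70 < thresholds[14])  # 70 ms alerts at 2 AM but not at 2 PM

for _ in range(30):  # traffic growth raises the business-hours level
    thresholds = retune(thresholds, one_day(peak_ms=80.0))
print(thresholds[14] > 100)
```

The smoothing weight controls how quickly thresholds follow the metric. A caveat of this simple approach is that a day containing a real incident also pulls the threshold upward, so a production version would likely exclude samples taken during known incidents.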

Flapping Detection: One Page, Not Eight

Flapping is distinct from alert volume. A check that oscillates rapidly between passing and failing — perhaps because the threshold is positioned at the edge of the metric’s normal operating range — generates multiple alert notifications in minutes for what is effectively a single unstable condition. Flapping detection counts state transitions within a rolling time window and identifies checks that change state too frequently. Instead of eight separate pages in four minutes, the on-call engineer receives a single “this check is flapping” notification, with the raw state change history attached.
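Counting state transitions in a rolling window can be sketched as follows. The class name, the four-minute window, and the limit of four transitions are assumptions for illustration.

```python
from collections import deque


class FlapDetector:
    """Flags a check that changes state too often within a rolling time window."""

    def __init__(self, window_seconds=240, max_transitions=4):
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self._transitions = deque()  # timestamps of recent state changes
        self._last_state = None

    def observe(self, timestamp, state):
        """Record a check result; return True if the check is currently flapping."""
        if self._last_state is not None and state != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = state
        # Forget transitions that have aged out of the window.
        while self._transitions and timestamp - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions

    def history(self):
        """Raw transition timestamps, suitable for attaching to the notification."""
        return list(self._transitions)


# A check that alternates between passing and failing every 30 seconds.
detector = FlapDetector()
flapping = [
    detector.observe(t, "fail" if (t // 30) % 2 else "ok")
    for t in range(0, 241, 30)
]
print(flapping.index(True))  # → 4
```

An alerting pipeline built on this would suppress individual pass/fail pages once `observe` returns True and send one flapping notification with `history()` attached.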

About Parlon
Parlon is an infrastructure observability platform built for enterprise teams operating complex, hybrid environments. Parlon combines active synthetic validation, real-time telemetry normalization, and learning-based alerting into a single platform — shifting operations from firefighting to foresight. Learn more at parlon.io.
