Putting It All Together: A Synthetic Monitoring Strategy

Twenty-four videos. Nine posts. From the first ping command written in 1983 to ML-enhanced anomaly detection running across six continents. This final post assembles everything into a single, production-ready monitoring architecture.

We’re going to design monitoring for a real SaaS company from scratch — making every decision explicit, showing the math, and building a system that would hold up against a 99.95% SLA in the real world.

Start Here: What the Video Covers

▶ Video 24 — “Putting It All Together — A Synthetic Monitoring Strategy”

The Scenario: CloudApp Inc.

50,000 users across North America, Europe, and Asia-Pacific. Microservices on Kubernetes. The SLA is 99.95%, which allows no more than 4.38 hours of downtime per year: roughly 21.9 minutes per month, or 43.2 seconds per day (0.05% of 86,400 seconds).

That last number is the constraint everything else is designed around. If your maximum allowed daily downtime is 43 seconds, you need to detect problems in under 30 seconds. That number drives your check interval, which drives your probe count, which drives everything else.
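
The budget arithmetic is easy to reproduce. The sketch below assumes a 365-day year and twelve equal months; `downtime_budget` is a hypothetical helper name, not part of any tool mentioned in this series.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: non-leap year


def downtime_budget(sla_percent: float) -> dict[str, float]:
    """Return the allowed downtime in seconds per day, month, and year."""
    unavailable = 1 - sla_percent / 100
    per_day = unavailable * SECONDS_PER_DAY
    per_year = per_day * DAYS_PER_YEAR
    # Assumption: a "month" is one twelfth of a year.
    return {"day": per_day, "month": per_year / 12, "year": per_year}


budget = downtime_budget(99.95)
print(f"per year:  {budget['year'] / 3600:.2f} hours")
print(f"per month: {budget['month'] / 60:.1f} minutes")
print(f"per day:   {budget['day']:.1f} seconds")
```

Under these assumptions the script prints 4.38 hours, 21.9 minutes, and 43.2 seconds. Dividing the monthly figure by 30 days instead gives 43.8 seconds; either way the daily budget is roughly 43 seconds.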

Probe Placement

Six probe locations — two per geographic region where users are concentrated — with ISP diversity across each pair.

Region	Location 1	Location 2
North America	US-East (New York)	US-West (Los Angeles)
Europe	London	Frankfurt
Asia-Pacific	Tokyo	Sydney

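ISP diversity rule: use at least two different providers across the six locations, and put the two probes in each region on different providers. When a routing problem affects a single ISP's backbone, the probe on that ISP shows an outage while its regional partner on the other ISP stays healthy. That contrast separates a provider path problem from a genuine service outage.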
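
One way to act on paired probes is to compare the two results for a region before deciding what kind of incident it is. The following is a minimal sketch, not a description of how any particular platform does it; `ProbeResult`, `classify_region`, the ISP names, and the classification labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    region: str
    location: str
    isp: str
    ok: bool


def classify_region(results: list[ProbeResult]) -> str:
    """Classify one region's probe results as healthy, path issue, or outage."""
    failed = [r for r in results if not r.ok]
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "regional-outage"
    failed_isps = {r.isp for r in failed}
    healthy_isps = {r.isp for r in results if r.ok}
    # Failures confined to ISPs that no healthy probe uses point at the path.
    if failed_isps.isdisjoint(healthy_isps):
        return "suspected-isp-path-issue"
    return "inconclusive"


europe = [
    ProbeResult("Europe", "London", "isp-a", ok=False),
    ProbeResult("Europe", "Frankfurt", "isp-b", ok=True),
]
print(classify_region(europe))  # suspected-isp-path-issue
```

If both probes in a region sat on the same provider, the same mixed result would come back as "inconclusive", which is the practical argument for diversity within each pair.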

Monitor Types

Monitor Type	What It Covers
HTTPS checks	Web application, API endpoints, CDN edge nodes — with full phase decomposition
TCP checks	Database and cache ports (5432 for PostgreSQL, 3306 for MySQL, 6379 for Redis) and internal services that don't speak HTTP
DNS checks	Resolver health from each probe location — catching geo-steering failures and regional DNS issues
SSL checks	Certificate expiry tracking across all public-facing domains; alerts at 30 and 7 days before expiry
P2P tests	Between all six probe pairs — measuring WAN and cloud interconnect performance directly
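
As an illustration of the SSL row, the two expiry thresholds can be expressed as a small pure function. This is a sketch under stated assumptions: the function name `expiry_alert` and the level names "warning", "critical", and "expired" are invented here, and fetching the certificate's expiry date is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARNING_DAYS = 30   # first alert threshold from the table
CRITICAL_DAYS = 7   # second alert threshold from the table


def expiry_alert(not_after: datetime, now: datetime) -> Optional[str]:
    """Return an alert level for a certificate, or None if no alert is due."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=CRITICAL_DAYS):
        return "critical"
    if remaining <= timedelta(days=WARNING_DAYS):
        return "warning"
    return None


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
for days_left in (45, 30, 7, 0):
    print(days_left, expiry_alert(now + timedelta(days=days_left), now))
```

Keeping the thresholds as named constants makes it straightforward to add an earlier notice for domains whose renewal takes longer.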

Check Intervals for SLA Math

99.95% Uptime Allowance
Annual downtime allowance	4.38 hours
Monthly downtime allowance	21.9 minutes
Daily downtime allowance	43.2 seconds
Required detection time (Nyquist)	≤ 30 seconds
Required check interval	30 seconds for all SLA-bound endpoints

For endpoints not directly tied to the SLA — internal services, staging, non-critical paths — 60-second intervals are appropriate. Certificate checks can run hourly. P2P throughput tests can run every 5 minutes without meaningful load impact.
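
To get a feel for what these intervals cost, the sketch below counts executions per day for each class of check. The dictionary keys and the `runs_per_day` helper are hypothetical names; a "runner" here means whatever executes the check, such as one probe or one probe pair.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Intervals in seconds, taken from the text above.
INTERVALS_S = {
    "sla_bound": 30,
    "standard": 60,
    "p2p_throughput": 5 * 60,
    "certificate": 60 * 60,
}


def runs_per_day(interval_s: int, runners: int = 1) -> int:
    """Executions per day of one check at the given interval."""
    return SECONDS_PER_DAY // interval_s * runners


for name, interval_s in INTERVALS_S.items():
    print(f"{name:<15} every {interval_s:>4}s -> {runs_per_day(interval_s):>5} runs/day per runner")

# Assumption: every one of the six probes runs each SLA-bound check.
print("SLA-bound endpoint, six probes:", runs_per_day(30, runners=6), "runs/day")
```

At a 30-second interval from six probes, each SLA-bound endpoint receives 17,280 synthetic requests per day, which is worth keeping in mind when sizing rate limits and filtering analytics.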

ML Detection Configuration

CloudApp has predictable daily traffic patterns — peak around noon EST, quiet at 3 AM. Static thresholds would need constant adjustment as user growth changes the noon peak month over month. ML anomaly detection learns the pattern and adapts automatically.

Detection Type	What It Catches	Configuration
ML Anomaly Detection	Sudden spikes and unexpected values at any hour	Medium sensitivity; high confidence required to page
Change-Point Detection	Permanent routing shifts after ISP changes or deployments	Applied to latency and packet loss metrics across all probes
Drift Tracking	Gradual degradation of DNS, latency, or TLS performance over weeks	Alert threshold at 15% drift from learned baseline
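
The 15% drift threshold in the last row can be illustrated with a deliberately simple stand-in for a learned baseline. This sketch assumes the baseline is the median of an earlier window of measurements, which is a simplification and not the ML model the series describes; the function names and sample latencies are made up.

```python
from statistics import median

DRIFT_THRESHOLD = 0.15  # 15% drift from baseline, per the table


def relative_drift(baseline: list[float], recent: list[float]) -> float:
    """Relative change of the recent median against the baseline median."""
    base = median(baseline)
    if base == 0:
        raise ValueError("baseline median is zero; relative drift is undefined")
    return (median(recent) - base) / base


def drift_alert(baseline: list[float], recent: list[float],
                threshold: float = DRIFT_THRESHOLD) -> bool:
    """True when the recent window has moved more than `threshold` either way."""
    return abs(relative_drift(baseline, recent)) > threshold


baseline_ms = [40.0, 42.0, 41.0, 39.0, 40.0]  # made-up DNS latencies
recent_ms = [47.0, 48.0, 46.0, 49.0, 47.0]
print(f"drift: {relative_drift(baseline_ms, recent_ms):+.1%}")
print("alert:", drift_alert(baseline_ms, recent_ms))
```

With these sample values the median moves from 40 ms to 47 ms, a drift of +17.5%, so the alert fires. Medians are used instead of means so that a single outlier in either window does not trigger or mask the alert.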

Alert Escalation

Alerts that go unacknowledged must escalate automatically. Waiting for someone to notice is not a process.

T+0:00

Slack — Team Channel

ML detection fires. Notification to the #infrastructure-alerts channel with context: which probe, which check, which phase, anomaly confidence score.

T+5:00

PagerDuty — On-Call Page

No acknowledgment in 5 minutes. On-call engineer receives a page with the same context. Silence is treated as unavailability.

T+15:00

Phone Call — Engineering Manager

No response to page in 10 minutes. Phone call to the engineering manager. At this point, the incident is escalated regardless of business hours.
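
The escalation ladder above can be written down as data plus one small function, which makes the timings easy to review and test. This is a sketch only: the channel identifiers, `EscalationStep`, and `channels_notified` are hypothetical, and no real Slack or PagerDuty API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int
    channel: str


# Timings and channels follow the three steps described above.
POLICY = (
    EscalationStep(0, "slack:#infrastructure-alerts"),
    EscalationStep(5, "pagerduty:on-call"),
    EscalationStep(15, "phone:engineering-manager"),
)


def channels_notified(elapsed_minutes: float,
                      acked_at: Optional[float] = None) -> list[str]:
    """Channels that have fired by `elapsed_minutes` after the alert.

    A step fires only if the alert was still unacknowledged when it came due.
    """
    fired = []
    for step in POLICY:
        if step.after_minutes > elapsed_minutes:
            break  # this step is not due yet
        if acked_at is not None and acked_at < step.after_minutes:
            break  # acknowledged before this step came due
        fired.append(step.channel)
    return fired


print(channels_notified(3))
print(channels_notified(20))
print(channels_notified(20, acked_at=7))
```

The first call returns only the Slack channel, the second returns all three steps, and the third stops after the page because the acknowledgment at minute 7 arrives before the phone call is due.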

The Complete Architecture

CloudApp Inc. Production Monitoring Architecture

Component	Summary
Probes	6 locations across 3 regions, 2+ ISPs
Monitor Types	HTTPS, TCP, DNS, SSL, P2P
Check Intervals	30s (SLA-bound), 60s (standard), 1h (certs)
ML Detection	Anomaly + Change-Point + Drift
Escalation	Slack → PagerDuty → Phone (5 & 15 min)
SLA Compliance	Detection in ≤30s for 99.95% uptime

Synthetic Monitoring Blog Series

1 What Is Synthetic Monitoring — The core concept and 40 years of history
2 ICMP, DNS, and TCP — The three foundational protocols
3 TLS and HTTP — Certificate monitoring and the five-phase request waterfall
4 Beyond HTTP — WebSockets, databases, and AI service monitoring
5 The Standards — RFC 2544, TWAMP, and traceroute
6 The Philosophy — Active measurement, check intervals, and probe placement
7 Probe Architecture — Edge-first design, P2P testing, and path bandwidth discovery
8 The Intelligence Layer — ML anomaly detection, drift, and alert auto-tuning
9 The Complete Strategy — A production-ready architecture from first principles

We built this series because we believe infrastructure and operations teams deserve technically rigorous education — not vendor brochures. The concepts here apply regardless of what tools you use. The goal was always to make you a sharper engineer, not to sell you software.

If you found this series useful, the next step is exploring how Parlon applies these principles in practice — active synthetic validation, normalized telemetry, and learning-based alerting in a single platform designed for the complexity of modern infrastructure.

About Parlon
Parlon is an infrastructure observability platform built for enterprise teams operating complex, hybrid environments. Parlon combines active synthetic validation, real-time telemetry normalization, and learning-based alerting into a single platform — shifting operations from firefighting to foresight. Learn more at parlon.io.