Putting It All Together: A Synthetic Monitoring Strategy

Twenty-four videos. Nine posts. From the first ping command written in 1983 to ML-enhanced anomaly detection running across six continents. This final post assembles everything into a single, production-ready monitoring architecture.

We’re going to design monitoring for a realistic SaaS company from scratch: making every decision explicit, showing the math, and building a system that would hold up against a 99.95% SLA in production.

Start Here: What the Video Covers

▶  Video 24 — “Putting It All Together — A Synthetic Monitoring Strategy”

The Scenario: CloudApp Inc.

50,000 users across North America, Europe, and Asia-Pacific. Microservices on Kubernetes. A 99.95% SLA, which allows no more than 4.38 hours of downtime per year, or roughly 21.9 minutes per month, or about 43 seconds per day.

That last number is the constraint everything else is designed around. If your maximum allowed daily downtime is 43 seconds, you need to detect problems in under 30 seconds. That number drives your check interval, which drives your probe count, which drives everything else.
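The downtime figures above follow directly from the SLA percentage. The sketch below reproduces the arithmetic; the function name and the 365-day year are assumptions made for illustration, and a contract that defines its measurement window differently will give slightly different numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: non-leap year, month = 1/12 of a year


def downtime_budget_seconds(sla_percent: float, period_days: float) -> float:
    """Allowed downtime, in seconds, over a period of the given length."""
    unavailable_fraction = 1 - sla_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


yearly = downtime_budget_seconds(99.95, DAYS_PER_YEAR)
monthly = downtime_budget_seconds(99.95, DAYS_PER_YEAR / 12)
daily = downtime_budget_seconds(99.95, 1)

print(f"per year:  {yearly / 3600:.2f} hours")
print(f"per month: {monthly / 60:.1f} minutes")
print(f"per day:   {daily:.1f} seconds")
```

With these assumptions the daily budget works out to roughly 43 seconds, which is why a detection target well under a minute is the starting point for the rest of the design.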

1. Probe Placement

Six probe locations — two per geographic region where users are concentrated — with ISP diversity across each pair.

| Region | Location 1 | Location 2 |
| --- | --- | --- |
| North America | US-East (New York) | US-West (Los Angeles) |
| Europe | London | Frankfurt |
| Asia-Pacific | Tokyo | Sydney |

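ISP diversity rule: at least two different providers across the six locations, with the two probes in each region on different providers. When a routing problem hits one ISP's backbone, the probe on that ISP shows an outage while its regional partner on the other ISP stays healthy, which separates a provider-level fault from a failure of the service itself.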
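One way to keep the diversity rule from eroding over time is to check it against the probe inventory. The sketch below is a minimal illustration: the `isp-a`, `isp-b`, and `isp-c` labels and the `regions_lacking_isp_diversity` helper are hypothetical, and it assumes the stricter reading of the rule, where each regional pair is split across two providers.

```python
from collections import defaultdict

# Provider labels are placeholders, not recommendations.
PROBES = [
    {"region": "North America", "location": "US-East (New York)", "isp": "isp-a"},
    {"region": "North America", "location": "US-West (Los Angeles)", "isp": "isp-b"},
    {"region": "Europe", "location": "London", "isp": "isp-a"},
    {"region": "Europe", "location": "Frankfurt", "isp": "isp-c"},
    {"region": "Asia-Pacific", "location": "Tokyo", "isp": "isp-b"},
    {"region": "Asia-Pacific", "location": "Sydney", "isp": "isp-c"},
]


def regions_lacking_isp_diversity(probes):
    """Return the regions whose probes all sit on a single provider."""
    isps_by_region = defaultdict(set)
    for probe in probes:
        isps_by_region[probe["region"]].add(probe["isp"])
    return sorted(
        region for region, isps in isps_by_region.items() if len(isps) < 2
    )


# An empty list means every regional pair spans two providers.
print(regions_lacking_isp_diversity(PROBES))
```

Running a check like this whenever a probe is added or moved catches the case where both probes in a region quietly end up behind the same upstream.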

2. Monitor Types

| Monitor Type | What It Covers |
| --- | --- |
| HTTPS checks | Web application, API endpoints, CDN edge nodes, with full phase decomposition |
| TCP checks | Database and cache ports (PostgreSQL 5432, MySQL 3306, Redis 6379) and internal services that don't speak HTTP |
| DNS checks | Resolver health from each probe location, catching geo-steering failures and regional DNS issues |
| SSL checks | Certificate expiry tracking across all public-facing domains; alerts at 30 and 7 days |
| P2P tests | Between pairs of the six probes, measuring WAN and cloud interconnect performance directly |
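The 30-day and 7-day certificate alerts reduce to a comparison between the certificate's expiry timestamp and the current time. The sketch below shows only that comparison. The level names (`warning`, `critical`, `expired`) and the function name are assumptions for illustration, and fetching the certificate from a live endpoint is left out.

```python
from datetime import datetime, timedelta, timezone

WARNING_DAYS = 30   # first alert threshold from the table above
CRITICAL_DAYS = 7   # second alert threshold from the table above


def certificate_alert_level(not_after: datetime, now: datetime) -> str:
    """Classify a certificate by the time left before it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= timedelta(days=CRITICAL_DAYS):
        return "critical"
    if remaining <= timedelta(days=WARNING_DAYS):
        return "warning"
    return "ok"


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
for days_left in (90, 20, 3, -1):
    expiry = now + timedelta(days=days_left)
    print(days_left, certificate_alert_level(expiry, now))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the thresholds easy to test.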

3. Check Intervals for SLA Math

99.95% Uptime Allowance

| Metric | Value |
| --- | --- |
| Annual downtime allowance | 4.38 hours |
| Monthly downtime allowance | 21.9 minutes |
| Daily downtime allowance | about 43 seconds |
| Required detection time (Nyquist) | ≤ 30 seconds |
| Required check interval | 30 seconds for all SLA-bound endpoints |

For endpoints not directly tied to the SLA — internal services, staging, non-critical paths — 60-second intervals are appropriate. Certificate checks can run hourly. P2P throughput tests can run every 5 minutes without meaningful load impact.
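To see what these intervals mean in check volume, the sketch below turns each tier into runs per day. The tier names and the `runs_per_day` helper are made up for this example; the intervals and the six-probe count come from the design above.

```python
SECONDS_PER_DAY = 86_400
PROBE_COUNT = 6

# Intervals from the text, in seconds; tier names are illustrative.
CHECK_INTERVAL_SECONDS = {
    "sla_bound": 30,
    "standard": 60,
    "p2p_throughput": 5 * 60,
    "certificate": 60 * 60,
}


def runs_per_day(interval_seconds: int) -> int:
    """How many times one probe runs one check in 24 hours."""
    return SECONDS_PER_DAY // interval_seconds


for tier, interval in CHECK_INTERVAL_SECONDS.items():
    print(f"{tier}: every {interval}s, {runs_per_day(interval)} runs/day per probe")

# Total daily requests one SLA-bound endpoint receives from all six probes.
sla_requests_per_day = runs_per_day(CHECK_INTERVAL_SECONDS["sla_bound"]) * PROBE_COUNT
print(sla_requests_per_day)
```

Under these numbers, an SLA-bound endpoint sees 17,280 synthetic requests per day across the six probes, which is worth comparing against real traffic before tightening the interval further.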

4. ML Detection Configuration

CloudApp has predictable daily traffic patterns — peak around noon EST, quiet at 3 AM. Static thresholds would need constant adjustment as user growth changes the noon peak month over month. ML anomaly detection learns the pattern and adapts automatically.

| Detection Type | What It Catches | Configuration |
| --- | --- | --- |
| ML Anomaly Detection | Sudden spikes and unexpected values at any hour | Medium sensitivity; high confidence required to page |
| Change-Point Detection | Permanent routing shifts after ISP changes or deployments | Applied to latency and packet loss metrics across all probes |
| Drift Tracking | Gradual degradation of DNS, latency, or TLS performance over weeks | Alert threshold at 15% drift from learned baseline |
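The 15% drift threshold can be illustrated with a deliberately simple calculation. The sketch below stands in for the learned baseline with a plain mean over a reference window, which is an assumption: a production model would account for time-of-day patterns, and nothing here reflects how any particular product implements drift tracking. It also assumes a strictly positive metric such as latency in milliseconds.

```python
from statistics import mean

DRIFT_THRESHOLD = 0.15  # 15% drift from baseline, per the table above


def relative_drift(baseline_samples, recent_samples):
    """Fractional change of the recent mean relative to the baseline mean."""
    baseline = mean(baseline_samples)
    return (mean(recent_samples) - baseline) / baseline


def drift_alert(baseline_samples, recent_samples, threshold=DRIFT_THRESHOLD):
    """True when the recent window has moved at least `threshold` from baseline."""
    return abs(relative_drift(baseline_samples, recent_samples)) >= threshold


# Hypothetical DNS resolution times in milliseconds.
baseline_ms = [20, 22, 21, 19, 18]   # mean 20
steady_ms = [21, 20, 22, 21, 21]     # mean 21, 5% above baseline
degraded_ms = [24, 23, 25, 24, 24]   # mean 24, 20% above baseline

print(drift_alert(baseline_ms, steady_ms))
print(drift_alert(baseline_ms, degraded_ms))
```

The point of comparing window means rather than single samples is that drift is slow: no individual check looks alarming, but the average has moved.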

5. Alert Escalation

Alerts that go unacknowledged must escalate automatically. Waiting for someone to notice is not a process.

T+0:00
Slack — Team Channel

ML detection fires. Notification to the #infrastructure-alerts channel with context: which probe, which check, which phase, anomaly confidence score.

T+5:00
PagerDuty — On-Call Page

No acknowledgment in 5 minutes. On-call engineer receives a page with the same context. Silence is treated as unavailability.

T+15:00
Phone Call — Engineering Manager

No response to page in 10 minutes. Phone call to the engineering manager. At this point, the incident is escalated regardless of business hours.
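The timeline above can be expressed as a small table of delays and targets. The sketch below is a hypothetical model of that policy, not an integration with Slack or PagerDuty: the channel strings are labels only, and `notified_channels` is a name invented for this example.

```python
# (delay in seconds after the alert fires, notification target)
ESCALATION_STEPS = [
    (0, "slack:#infrastructure-alerts"),
    (5 * 60, "pagerduty:on-call"),
    (15 * 60, "phone:engineering-manager"),
]


def notified_channels(elapsed_seconds, acknowledged_at=None):
    """Channels that have fired `elapsed_seconds` after the alert.

    A step fires once its delay has passed, unless the alert was
    acknowledged before that step was due.
    """
    fired = []
    for delay, channel in ESCALATION_STEPS:
        due = delay <= elapsed_seconds
        stopped = acknowledged_at is not None and acknowledged_at <= delay
        if due and not stopped:
            fired.append(channel)
    return fired


print(notified_channels(10 * 60))                       # unacknowledged at T+10:00
print(notified_channels(20 * 60))                       # unacknowledged at T+20:00
print(notified_channels(20 * 60, acknowledged_at=120))  # acknowledged at T+2:00
```

Keeping the policy as data makes the timings easy to review and change without touching the logic that walks the steps.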


6. The Complete Architecture

CloudApp Inc. Production Monitoring Architecture

| Component | Configuration |
| --- | --- |
| Probes | 6 locations across 3 continents, 2+ ISPs |
| Monitor Types | HTTPS, TCP, DNS, SSL, P2P |
| Check Intervals | 30s (SLA-bound), 60s (standard), 1h (certs) |
| ML Detection | Anomaly + Change-Point + Drift |
| Escalation | Slack → PagerDuty → Phone (5 & 15 min) |
| SLA Compliance | Detection in ≤30s for 99.95% uptime |

Series
Synthetic Monitoring Blog Series
  • 1 What Is Synthetic Monitoring — The core concept and 40 years of history
  • 2 ICMP, DNS, and TCP — The three foundational protocols
  • 3 TLS and HTTP — Certificate monitoring and the five-phase request waterfall
  • 4 Beyond HTTP — WebSockets, databases, and AI service monitoring
  • 5 The Standards — RFC 2544, TWAMP, and traceroute
  • 6 The Philosophy — Active measurement, check intervals, and probe placement
  • 7 Probe Architecture — Edge-first design, P2P testing, and path bandwidth discovery
  • 8 The Intelligence Layer — ML anomaly detection, drift, and alert auto-tuning
  • 9 The Complete Strategy — A production-ready architecture from first principles

We built this series because we believe infrastructure and operations teams deserve technically rigorous education — not vendor brochures. The concepts here apply regardless of what tools you use. The goal was always to make you a sharper engineer, not to sell you software.

If you found this series useful, the next step is exploring how Parlon applies these principles in practice — active synthetic validation, normalized telemetry, and learning-based alerting in a single platform designed for the complexity of modern infrastructure.


About Parlon
Parlon is an infrastructure observability platform built for enterprise teams operating complex, hybrid environments. Parlon combines active synthetic validation, real-time telemetry normalization, and learning-based alerting into a single platform — shifting operations from firefighting to foresight. Learn more at parlon.io.
