Twenty-four videos. Nine posts. From the first ping command written in 1983 to ML-enhanced anomaly detection running across six continents. This final post assembles everything into a single, production-ready monitoring architecture.
We’re going to design monitoring for a real SaaS company from scratch — making every decision explicit, showing the math, and building a system that would hold up against a 99.95% SLA in the real world.
Start Here: What the Video Covers
▶ Video 24 — “Putting It All Together — A Synthetic Monitoring Strategy”
The Scenario: CloudApp Inc.
50,000 users across North America, Europe, and Asia-Pacific. Microservices on Kubernetes. A 99.95% SLA, which allows no more than 4.38 hours of downtime per year, roughly 21.9 minutes per month, or 43.2 seconds per day.
That last number is the constraint everything else is designed around. If your maximum allowed daily downtime is 43 seconds, you need to detect problems in under 30 seconds. That number drives your check interval, which drives your probe count, which drives everything else.
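The downtime figures above follow directly from the SLA percentage. As a minimal sketch of that arithmetic (the function name `downtime_budget` is illustrative, and a 365-day year with twelve equal months is assumed):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget(sla_percent: float, days_per_year: int = 365) -> dict:
    """Return the downtime an availability target allows per year, month, and day."""
    unavailable_fraction = 1 - sla_percent / 100
    per_day_seconds = unavailable_fraction * SECONDS_PER_DAY
    per_year_seconds = per_day_seconds * days_per_year
    return {
        "per_year_hours": per_year_seconds / 3600,
        # Assumption: a "month" is one twelfth of a year.
        "per_month_minutes": per_year_seconds / 12 / 60,
        "per_day_seconds": per_day_seconds,
    }


budget = downtime_budget(99.95)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

For 99.95% this gives about 4.38 hours per year, 21.9 minutes per month, and roughly 43 seconds per day, which is the figure the detection target is measured against.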
Six probe locations — two per geographic region where users are concentrated — with ISP diversity across each pair.
| Region | Location 1 | Location 2 |
|---|---|---|
| North America | US-East (New York) | US-West (Los Angeles) |
| Europe | London | Frankfurt |
| Asia-Pacific | Tokyo | Sydney |
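ISP diversity rule: use at least two different providers across the six locations, arranged so that the two probes in each region never share one. When a routing problem hits a single ISP's backbone, the probe on that ISP reports an outage while its regional partner stays healthy, which separates a path problem from a genuine service failure.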
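One way to act on paired probes is to compare the two results per region before deciding what kind of incident it is. The sketch below is illustrative only: the probe names, the ISP labels (`isp-a`, `isp-b`, and so on), and the verdict strings are assumptions, not part of the CloudApp design.

```python
from collections import defaultdict

# Hypothetical probe inventory: probe name -> (region, ISP label).
PROBES = {
    "us-east": ("north-america", "isp-a"),
    "us-west": ("north-america", "isp-b"),
    "london": ("europe", "isp-c"),
    "frankfurt": ("europe", "isp-d"),
    "tokyo": ("asia-pacific", "isp-e"),
    "sydney": ("asia-pacific", "isp-f"),
}


def classify(results: dict[str, bool]) -> dict[str, str]:
    """Map each region to a verdict based on which of its probes succeeded."""
    by_region = defaultdict(list)
    for probe, ok in results.items():
        region, isp = PROBES[probe]
        by_region[region].append((isp, ok))

    verdicts = {}
    for region, entries in by_region.items():
        failed_isps = sorted(isp for isp, ok in entries if not ok)
        if not failed_isps:
            verdicts[region] = "healthy"
        elif len(failed_isps) == len(entries):
            # Every probe in the region failed: not explained by one ISP.
            verdicts[region] = "regional or service outage"
        else:
            # Only some probes failed: suspect the path behind those ISPs.
            verdicts[region] = "suspected path issue via " + ", ".join(failed_isps)
    return verdicts


sample = {"us-east": True, "us-west": True, "london": False,
          "frankfurt": True, "tokyo": False, "sydney": False}
verdicts = classify(sample)
```

With this sample, Europe is flagged as a suspected path issue on one provider while Asia-Pacific is treated as a regional or service outage, since both of its probes failed.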
| Monitor Type | What It Covers |
|---|---|
| HTTPS checks | Web application, API endpoints, CDN edge nodes — with full phase decomposition |
| TCP checks | Data store ports (PostgreSQL 5432, MySQL 3306, Redis 6379) and internal services that don't speak HTTP |
| DNS checks | Resolver health from each probe location — catching geo-steering failures and regional DNS issues |
| SSL checks | Certificate expiry tracking across all public-facing domains; alerts at 30 and 7 days |
| P2P tests | Between all six probe pairs — measuring WAN and cloud interconnect performance directly |
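The 30-day and 7-day certificate alerts in the table reduce to a comparison between a certificate's expiry time and the current time. A minimal sketch, assuming the expiry timestamp has already been read from the certificate; the level names and the extra `expired` level are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WARN_DAYS = 30      # first alert, from the table above
CRITICAL_DAYS = 7   # second alert, from the table above


def cert_alert_level(not_after: datetime, now: datetime) -> Optional[str]:
    """Return an alert level for a certificate, or None if no alert is due."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"  # assumption: treated as its own level
    if remaining <= timedelta(days=CRITICAL_DAYS):
        return "critical"
    if remaining <= timedelta(days=WARN_DAYS):
        return "warning"
    return None


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
for days in (45, 30, 7, -1):
    print(days, cert_alert_level(now + timedelta(days=days), now))
```

Passing `now` in explicitly keeps the function deterministic and easy to test; a scheduler would supply the current UTC time.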
| 99.95% SLA Budget | Value |
|---|---|
| Annual downtime allowance | 4.38 hours |
| Monthly downtime allowance | 21.9 minutes |
| Daily downtime allowance | 43.2 seconds |
| Required detection time (Nyquist) | ≤ 30 seconds |
| Required check interval | 30 seconds for all SLA-bound endpoints |
For endpoints not directly tied to the SLA — internal services, staging, non-critical paths — 60-second intervals are appropriate. Certificate checks can run hourly. P2P throughput tests can run every 5 minutes without meaningful load impact.
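The intervals above can be captured as a small schedule, which also makes the check volume easy to estimate. This is a sketch under stated assumptions: the category names are made up for illustration, and evenly staggering the six probes is one possible scheduling choice rather than something the design above prescribes.

```python
SECONDS_PER_DAY = 86_400

# Intervals in seconds, taken from the text; the key names are illustrative.
CHECK_INTERVALS_S = {
    "sla_endpoint": 30,
    "non_sla_endpoint": 60,
    "certificate": 3600,
    "p2p_throughput": 300,
}


def checks_per_day(interval_s: int, probes: int = 6) -> int:
    """Total checks per day against one target when every probe runs it."""
    return probes * (SECONDS_PER_DAY // interval_s)


def stagger_offsets(interval_s: int, probes: int = 6) -> list[float]:
    """Evenly spaced start offsets so the probes do not all fire at once."""
    return [i * interval_s / probes for i in range(probes)]


sla_volume = checks_per_day(CHECK_INTERVALS_S["sla_endpoint"])
offsets = stagger_offsets(CHECK_INTERVALS_S["sla_endpoint"])
```

With six probes on a 30-second interval, each SLA-bound endpoint receives 17,280 checks per day, and staggered offsets place a check from some probe every 5 seconds.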
CloudApp has predictable daily traffic patterns — peak around noon EST, quiet at 3 AM. Static thresholds would need constant adjustment as user growth changes the noon peak month over month. ML anomaly detection learns the pattern and adapts automatically.
| Detection Type | What It Catches | Configuration |
|---|---|---|
| ML Anomaly Detection | Sudden spikes and unexpected values at any hour | Medium sensitivity; high confidence required to page |
| Change-Point Detection | Permanent routing shifts after ISP changes or deployments | Applied to latency and packet loss metrics across all probes |
| Drift Tracking | Gradual degradation of DNS, latency, or TLS performance over weeks | Alert threshold at 15% drift from learned baseline |
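The 15% drift threshold in the last row can be illustrated with a simple comparison of recent measurements against a baseline. This is a deliberately simplified stand-in, not the learned baseline the table describes: it assumes the baseline is just the mean of an earlier window, and the function names are illustrative.

```python
from statistics import mean

DRIFT_THRESHOLD = 0.15  # 15% drift from baseline, as in the table above


def drift_ratio(baseline: list[float], recent: list[float]) -> float:
    """Relative change of the recent mean against the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    return (mean(recent) - base) / base


def drift_alert(baseline: list[float], recent: list[float],
                threshold: float = DRIFT_THRESHOLD) -> bool:
    """True when the recent window has drifted past the threshold either way."""
    return abs(drift_ratio(baseline, recent)) >= threshold


# Example: DNS resolution time in milliseconds, weeks apart.
baseline_ms = [100.0, 100.0, 100.0]
slow_drift_ms = [120.0, 118.0, 116.0]   # mean 118, +18%
minor_shift_ms = [105.0, 110.0, 108.0]  # mean ~107.7, under 15%

alert_slow = drift_alert(baseline_ms, slow_drift_ms)
alert_minor = drift_alert(baseline_ms, minor_shift_ms)
```

A production system would typically compare against a time-of-day-aware baseline so that the normal noon peak is not mistaken for drift.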
Alerts that go unacknowledged must escalate automatically. Waiting for someone to notice is not a process.
ML detection fires. Notification to the #infrastructure-alerts channel with context: which probe, which check, which phase, anomaly confidence score.
No acknowledgment in 5 minutes. On-call engineer receives a page with the same context. Silence is treated as unavailability.
No response to page in 10 minutes. Phone call to the engineering manager. At this point, the incident is escalated regardless of business hours.
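The escalation ladder above can be expressed as a function of elapsed time and acknowledgment state. A minimal sketch, with one labeled assumption: the 10-minute window is taken to start when the page is sent, so the phone call happens 15 minutes after the original alert. The step names are illustrative.

```python
CHANNEL_ACK_WINDOW_MIN = 5   # time allowed for acknowledgment in the channel
PAGE_ACK_WINDOW_MIN = 10     # assumption: counted from the moment the page is sent


def escalation_step(minutes_since_alert: float, acknowledged: bool = False) -> str:
    """Return which escalation step applies at a given time after the alert."""
    if acknowledged:
        return "acknowledged"
    if minutes_since_alert < CHANNEL_ACK_WINDOW_MIN:
        return "notify-channel"
    if minutes_since_alert < CHANNEL_ACK_WINDOW_MIN + PAGE_ACK_WINDOW_MIN:
        return "page-on-call"
    return "call-engineering-manager"


for minute in (0, 4, 5, 14, 15):
    print(minute, escalation_step(minute))
```

Keeping the windows as named constants makes the policy easy to review and change without touching the logic.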
The Complete Architecture
- 1 What Is Synthetic Monitoring — The core concept and 40 years of history
- 2 ICMP, DNS, and TCP — The three foundational protocols
- 3 TLS and HTTP — Certificate monitoring and the five-phase request waterfall
- 4 Beyond HTTP — WebSockets, databases, and AI service monitoring
- 5 The Standards — RFC 2544, TWAMP, and traceroute
- 6 The Philosophy — Active measurement, check intervals, and probe placement
- 7 Probe Architecture — Edge-first design, P2P testing, and path bandwidth discovery
- 8 The Intelligence Layer — ML anomaly detection, drift, and alert auto-tuning
- 9 The Complete Strategy — A production-ready architecture from first principles
We built this series because we believe infrastructure and operations teams deserve technically rigorous education — not vendor brochures. The concepts here apply regardless of what tools you use. The goal was always to make you a sharper engineer, not to sell you software.
If you found this series useful, the next step is exploring how Parlon applies these principles in practice — active synthetic validation, normalized telemetry, and learning-based alerting in a single platform designed for the complexity of modern infrastructure.
About Parlon
Parlon is an infrastructure observability platform built for enterprise teams operating complex, hybrid environments. Parlon combines active synthetic validation, real-time telemetry normalization, and learning-based alerting into a single platform — shifting operations from firefighting to foresight. Learn more at parlon.io.