Monitoring, Logging, and Runtime Security
Required knowledge for the CKS certification.
Last reviewed: — verified against Kubernetes 1.36.
Monitoring, logging, and runtime security close the loop on every other layer in this site. By continuously collecting and analysing data from the cluster, operators can detect anomalies, unauthorized access, and active attacks before they escalate. This page is the head reference for Domain 6 of the CKS exam (Monitoring, Logging & Runtime Security, 20%) and covers the four practical pillars: audit logging, log aggregation, metrics, and runtime threat detection.
Why Runtime Security Matters
Static admission catches what is known to be bad at deploy time. Runtime security catches what was unknown then but is observable now: a process that should not exist in this container, a network connection to a country you do not operate in, a syscall pattern that matches a known exploit. The articles in this section cover the controls that produce that visibility:
- API server audit logs that record every authenticated call
- Centralised log aggregation across pods, nodes, and the control plane
- Runtime detection via eBPF or kernel modules
- Metric pipelines that turn anomalies into alerts before users notice
Pair this layer with the attack vector section to validate that the techniques you care about are actually visible in your tooling.
Runtime Domains at a Glance
| Domain | Primary Risk | Key Control | Reference |
|---|---|---|---|
| Audit logging | Cluster activity not traceable to an identity | API server audit policy with structured backend | Kubernetes Audit Logging |
| Runtime detection | Compromised pod operates undetected | Falco, Tetragon, or Tracee at the node level | Falco · Tetragon · Tracee |
| Log aggregation | Logs lost or unsearchable across the fleet | Centralised, immutable log store (Loki, ELK) | See "Log Aggregation" below |
| Metrics and alerting | Resource exhaustion or DoS goes unnoticed | Prometheus + Alertmanager | See "Metrics" below |
| Vulnerability scanning | New CVEs reach deployed images | Trivy Operator for continuous scans | Trivy |
Topics Covered in This Section
Audit Logging
Configure a structured audit policy on the API server, route logs to an immutable backend, and tune verbosity so the signal-to-noise ratio is acceptable. Audit logs are the only authoritative record of who did what to the cluster.
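A minimal policy sketch, assuming the standard `audit.k8s.io/v1` API; the specific rules and thresholds are illustrative and should be tuned to your own noise budget:

```yaml
# Illustrative audit policy: full bodies for Secret access, nothing for
# high-volume health probes, metadata for everything else.
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - RequestReceived        # log only the completed call, not its arrival
rules:
  - level: RequestResponse # who read/wrote which Secret, with payloads
    resources:
      - group: ""
        resources: ["secrets"]
  - level: None            # drop super-chatty read-only probes
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*"]
  - level: Metadata        # identity + verb + object for everything else
```

Rule order matters: the first matching rule wins, so the catch-all `Metadata` rule goes last. The policy is referenced on the API server via `--audit-policy-file`, with a backend configured through flags such as `--audit-log-path`.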
Runtime Threat Detection
Run an eBPF or kernel-module agent on every node that observes process executions, network connections, and file activity inside containers. Use it to detect post-exploitation behaviour that admission policy cannot see.
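As a concrete sketch of what such an agent detects, here is an illustrative Falco-style rule (not taken from the default ruleset) that fires when an interactive shell starts inside a container; the field names follow Falco's standard syscall event fields:

```yaml
# Hypothetical rule: alert on interactive shells spawned in containers,
# a common post-exploitation signal that admission policy cannot see.
- rule: Shell spawned in container
  desc: An interactive shell was started inside a running container
  condition: >
    spawned_process and container and proc.name in (bash, sh, zsh, ash)
  output: >
    Shell in container (user=%user.name container=%container.name
    image=%container.image.repository command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Expect to tune rules like this per workload: build pipelines and debug sessions legitimately spawn shells, so exceptions belong in the rule, not in the alert-handling habit of ignoring it.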
Log Aggregation
Ship pod, node, and control-plane logs to a centralised, immutable store. Apply retention and rotation policies that match your compliance posture.
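A typical implementation is a node-level DaemonSet shipper. The sketch below uses Fluent Bit's classic config format with a Loki output; the service name `loki.logging.svc` and the label values are assumptions for illustration:

```ini
# Sketch of a node-level Fluent Bit pipeline: tail container logs,
# enrich with Kubernetes metadata, ship to a (hypothetical) Loki service.
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*
    Parser  cri

[FILTER]
    Name    kubernetes
    Match   kube.*

[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.logging.svc
    Port    3100
    Labels  job=fluent-bit, cluster=prod
```

Whatever shipper you choose, run it with the same scrutiny as any other privileged DaemonSet: it reads every log on the node.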
Metrics and Alerting
Use Prometheus for cluster and application metrics, Grafana for visualisation, and Alertmanager to page on the security-relevant signals (failed authn, audit-log gaps, runtime alerts, certificate expiry).
Key Articles
Runtime Detection: Falco vs Tetragon vs Tracee
Three actively maintained projects cover runtime detection in Kubernetes. The head-term page compares Falco vs Tetragon at a higher level; the table below extends that with Tracee for completeness.
| Aspect | Falco | Tetragon | Tracee |
|---|---|---|---|
| CNCF status | Graduated | Incubating | Sandbox |
| Maintainer | Falco / Sysdig | Isovalent (Cilium) | Aqua Security |
| Sensor | eBPF or kernel module | eBPF (in-kernel hooks) | eBPF |
| Primary purpose | Runtime threat alerting | Observability + in-kernel enforcement | Runtime detection + forensics |
| Rule format | YAML rules with expression DSL | TracingPolicy CRDs | Signatures (Rego / Go) |
| Enforcement | Alerts only | Can kill processes / send signals in-kernel | Alerts only |
| Best fit | SOC integration, mature ruleset | Cilium environments, in-kernel response | Forensic captures, signature library |
Read more: Falco · Tetragon · Tracee
Log Aggregation: Loki vs Elastic vs OpenSearch
Centralised log storage is non-negotiable for incident response. The three practical choices in the Kubernetes ecosystem are Grafana Loki, the Elastic Stack, and OpenSearch.
| Aspect | Grafana Loki | Elastic Stack | OpenSearch |
|---|---|---|---|
| Indexing model | Labels only; payload is grep-style | Full-text index of every field | Full-text index of every field |
| Storage cost | Lowest (label index + object storage) | Highest (full inverted index) | High (full inverted index) |
| Query language | LogQL (PromQL-style) | KQL / Elasticsearch DSL | KQL / OpenSearch DSL |
| Ecosystem fit | Native Grafana / Prometheus integration | Kibana, deep APM tooling | Kibana fork, AWS-native |
| License | AGPLv3 | Elastic License v2 (source-available) | Apache 2.0 |
| Best fit | Cost-sensitive, label-driven queries | Rich full-text search, large query workloads | AWS-managed deployments wanting Apache licensing |
For audit logs specifically, prefer write-once / immutable storage on top of any of the three (e.g., S3 object lock for the underlying bucket).
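Whichever store you choose, rehearse the incident-response queries before you need them. A hedged LogQL sketch against Loki, assuming audit events are shipped under a `job="apiserver-audit"` label (the label name and field paths are assumptions that depend on your shipping pipeline):

```logql
# Who deleted Secrets? LogQL's json parser flattens nested audit-event
# fields, so objectRef.resource becomes objectRef_resource.
{job="apiserver-audit"} | json | verb="delete", objectRef_resource="secrets"
```

Because Loki indexes only labels, the `json` parsing happens at query time; keep the label selector tight so the scan stays cheap.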
Metrics and Alerting: Prometheus + Alertmanager
The default open-source stack on Kubernetes:
- Prometheus scrapes cluster and application metrics at a configurable interval.
- Alertmanager routes alerts to PagerDuty / Slack / email and handles deduplication and silencing.
- Grafana visualises Prometheus data; pair with Loki for unified logs + metrics.
Security-relevant alerts to wire up on day one:
- API server audit-log gap (no events for N seconds)
- Spike in `apiserver_request_total{code=~"4.."}` — failed authn or authz
- Falco / Tetragon `WARN` and above
- Certificate expiry within 14 days (cluster CA, etcd, kubelet)
- Pod restart loops in `kube-system` or other security-critical namespaces
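The second alert in that list can be sketched as a prometheus-operator `PrometheusRule`; the threshold, window, and namespace are assumptions to tune against your baseline traffic:

```yaml
# Illustrative PrometheusRule for a 4xx spike on the API server.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-alerts
  namespace: monitoring          # assumed monitoring namespace
spec:
  groups:
    - name: security
      rules:
        - alert: ApiServerAuthFailureSpike
          expr: sum(rate(apiserver_request_total{code=~"4.."}[5m])) > 10
          for: 5m                # sustained, not a single burst
          labels:
            severity: warning
          annotations:
            summary: Elevated 4xx rate on the API server (failed authn/authz)
```

The other alerts follow the same shape; the audit-log-gap alert is usually an `absent()` or `rate(...) == 0` expression over whatever metric your log pipeline exports.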
Version-Specific Notes (Kubernetes 1.36)
The runtime and observability surface has tightened in recent Kubernetes versions:
- Structured authentication and authorization configuration — GA in 1.30+. The new `AuthenticationConfiguration` and `AuthorizationConfiguration` files produce stable, easily audited identity configuration for the audit log to reference.
- KMS v2 encryption providers — GA since 1.29. Audit logs that record Secret access reflect KMS v2 events (key versioning, rotation) as first-class fields.
- Validating Admission Policy — GA since 1.30. CEL policies emit consistent admission decisions to the audit log without requiring a webhook to be reachable.
- Sidecar containers — GA since 1.33. Init containers with `restartPolicy: Always` are the supported pattern for log shippers and runtime agents that must outlive their target containers.
- Pod sandboxing via `RuntimeClass` — Stable. A namespace can require a sandboxed runtime; runtime detection tooling should be aware of which workloads are sandboxed and which share the host kernel.
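The sidecar pattern for log shippers can be sketched as follows; the image names and mount paths are placeholders:

```yaml
# Sketch: a log shipper as a restartable init container (sidecar), so it
# starts before and outlives the main application container.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-shipper
spec:
  initContainers:
    - name: log-shipper
      image: example.com/log-shipper:latest   # placeholder image
      restartPolicy: Always                   # marks this init container as a sidecar
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  containers:
    - name: app
      image: example.com/app:latest           # placeholder image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs
      emptyDir: {}
```

Unlike a plain init container, the `restartPolicy: Always` sidecar keeps running for the pod's lifetime and is restarted independently if it crashes.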
Always check the Kubernetes deprecation guide before upgrading.
Hardening Principles for Runtime Operations
Secure by Default
Turn on audit logging on the first day of a cluster. Default-deny rules for runtime detection are easier to relax later than to introduce after an incident.
Least Privilege
The runtime agent itself is a privileged workload. Scope its RBAC tightly, run it under a dedicated ServiceAccount, and forward its findings to a backend outside the cluster it is monitoring, so a compromised cluster cannot erase the evidence of its own compromise.
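A minimal RBAC sketch for such an agent; the resource list is an assumption (most agents only need read access to pod and namespace metadata for event enrichment) and should be trimmed to what your agent actually queries:

```yaml
# Dedicated identity for the runtime agent with read-only metadata access.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runtime-agent
  namespace: runtime-security     # assumed dedicated namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: runtime-agent-read
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]   # read-only: no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: runtime-agent-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: runtime-agent-read
subjects:
  - kind: ServiceAccount
    name: runtime-agent
    namespace: runtime-security
```

The agent's privilege lives mostly at the node layer (eBPF, host PID); keeping its cluster-layer permissions read-only limits what an attacker gains by compromising it.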
Defense in Depth
Pair audit logging (cluster-layer) with runtime detection (node-layer) and metric alerting (cluster + workload). A bypass at one layer should still produce a signal in another.
Continuous Verification
Treat runtime alerts as actionable, not informational. Re-tune detection rules after every incident; treat sustained "WARN" noise as a bug to fix, not a baseline to live with.
Conclusion
Runtime security is the layer where every other control in this site is verified or invalidated. Stack audit logging, runtime detection, centralised logs, and metrics so that nothing happens in the cluster without an authoritative record. Combine the practices linked here with the attack vectors and cluster hardening sections to design end-to-end coverage.