Monitoring, Logging, and Runtime Security

Required knowledge for the CKS certification.

Last reviewed: — verified against Kubernetes 1.36.

Monitoring, logging, and runtime security close the loop on every other layer in this site. By continuously collecting and analysing data from the cluster, operators can detect anomalies, unauthorized access, and active attacks before they escalate. This page is the head reference for Domain 6 of the CKS exam (Monitoring, Logging & Runtime Security, 20%) and covers the four practical pillars: audit logging, log aggregation, metrics, and runtime threat detection.


Why Runtime Security Matters

Static admission catches what is known to be bad at deploy time. Runtime security catches what was unknown then but is observable now: a process that should not exist in this container, a network connection to a country you do not operate in, a syscall pattern that matches a known exploit. The articles in this section cover the controls that produce that visibility:

  • API server audit logs that record every authenticated call
  • Centralised log aggregation across pods, nodes, and the control plane
  • Runtime detection via eBPF or kernel modules
  • Metric pipelines that turn anomalies into alerts before users notice

Pair this layer with the attack vector section to validate that the techniques you care about are actually visible in your tooling.


Runtime Domains at a Glance

| Domain | Primary Risk | Key Control | Reference |
|---|---|---|---|
| Audit logging | Cluster activity not traceable to an identity | API server audit policy with structured backend | Kubernetes Audit Logging |
| Runtime detection | Compromised pod operates undetected | Falco, Tetragon, or Tracee at the node level | Falco · Tetragon · Tracee |
| Log aggregation | Logs lost or unsearchable across the fleet | Centralised, immutable log store (Loki, ELK) | See "Log Aggregation" below |
| Metrics and alerting | Resource exhaustion or DoS goes unnoticed | Prometheus + Alertmanager | See "Metrics" below |
| Vulnerability scanning | New CVEs reach deployed images | Trivy Operator for continuous scans | Trivy |

Topics Covered in This Section

Audit Logging

Configure a structured audit policy on the API server, route logs to an immutable backend, and tune verbosity so the signal-to-noise ratio is acceptable. Audit logs are the only authoritative record of who did what to the cluster.
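
As a rough illustration of what "tuning verbosity" can look like, the sketch below shows one possible audit policy. The choice of which resources get which level is an assumption made for this example, not a recommended baseline; adjust it to your own threat model.

```yaml
# Illustrative audit policy. Rules are evaluated top to bottom; the first match wins.
apiVersion: audit.k8s.io/v1
kind: Policy
# Skip the RequestReceived stage so each request is not logged twice.
omitStages:
  - "RequestReceived"
rules:
  # Health and version endpoints are high-volume and low-value: do not log them.
  - level: None
    nonResourceURLs:
      - "/healthz*"
      - "/readyz*"
      - "/livez*"
      - "/version"
  # Secrets and ConfigMaps: metadata only, so payloads never land in the log.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # RBAC changes: keep full request and response bodies for forensics.
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
  # Interactive access to pods is worth recording explicitly.
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]
  # Catch-all: record who did what, without bodies.
  - level: Metadata
```

Placing the `None` rule first keeps health-check traffic out of the log entirely, while the final catch-all ensures no request falls through unrecorded.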

Runtime Threat Detection

Run an eBPF or kernel-module agent on every node that observes process executions, network connections, and file activity inside containers. Use it to detect post-exploitation behaviour that admission policy cannot see.
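
To make "post-exploitation behaviour" concrete, here is a minimal sketch of a custom Falco rule that flags a shell starting inside a container. It assumes the rule file is loaded after Falco's default ruleset, which defines the `spawned_process` and `container` macros; the rule name and the list of shell binaries are chosen for this example only.

```yaml
# Hypothetical custom rule; relies on macros from the default Falco ruleset.
- rule: Example Shell Spawned In Container
  desc: A shell process started inside a container, which is unusual for most production workloads.
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh, dash)
  output: >
    Shell started in container
    (user=%user.name container=%container.name
    image=%container.image.repository cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Admission policy cannot see this event because the pod spec is unchanged; only a node-level sensor observes the new process.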

Log Aggregation

Ship pod, node, and control-plane logs to a centralised, immutable store. Apply retention and rotation policies that match your compliance posture.

Metrics and Alerting

Use Prometheus for cluster and application metrics, Grafana for visualisation, and Alertmanager to page on the security-relevant signals (failed authn, audit-log gaps, runtime alerts, certificate expiry).


Runtime Detection: Falco vs Tetragon vs Tracee

Three actively maintained projects cover runtime detection in Kubernetes. The head-term page compares Falco vs Tetragon at a higher level; the table below extends that with Tracee for completeness.

| Aspect | Falco | Tetragon | Tracee |
|---|---|---|---|
| CNCF status | Graduated | Part of Cilium (CNCF Graduated) | Not a CNCF project |
| Maintainer | Falco / Sysdig | Isovalent (Cilium) | Aqua Security |
| Sensor | eBPF or kernel module | eBPF (in-kernel hooks) | eBPF |
| Primary purpose | Runtime threat alerting | Observability + in-kernel enforcement | Runtime detection + forensics |
| Rule format | YAML rules with expression DSL | TracingPolicy CRDs | Signatures (Rego / Go) |
| Enforcement | Alerts only | Can kill processes / send signals in-kernel | Alerts only |
| Best fit | SOC integration, mature ruleset | Cilium environments, in-kernel response | Forensic captures, signature library |

Read more: Falco · Tetragon · Tracee
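
For comparison with Falco's rule DSL, the following is a minimal sketch of a Tetragon TracingPolicy that reports reads and writes of `/etc/shadow`. The policy name and the monitored path are assumptions for illustration, and the policy only emits events.

```yaml
# Illustrative TracingPolicy: observe access to /etc/shadow via a kernel hook.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example-shadow-file-access"
spec:
  kprobes:
    - call: "security_file_permission"
      syscall: false
      args:
        - index: 0
          type: "file"   # struct file *, used to resolve the path
        - index: 1
          type: "int"    # access mask (read or write)
      selectors:
        - matchArgs:
            - index: 0
              operator: "Prefix"
              values:
                - "/etc/shadow"
          matchActions:
            # Post only emits an event. Tetragon also supports Sigkill here,
            # which terminates the offending process in-kernel.
            - action: Post
```

Starting with `Post` and moving to an enforcing action only after the event stream looks clean is a cautious way to adopt in-kernel response, since a broad selector with `Sigkill` can terminate legitimate host processes.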


Log Aggregation: Loki vs Elastic vs OpenSearch

Centralised log storage is non-negotiable for incident response. The three practical choices in the Kubernetes ecosystem are Grafana Loki, the Elastic Stack, and OpenSearch.

| Aspect | Grafana Loki | Elastic Stack | OpenSearch |
|---|---|---|---|
| Indexing model | Labels only; payload is grep-style | Full-text index of every field | Full-text index of every field |
| Storage cost | Lowest (label index + object storage) | Highest (full inverted index) | High (full inverted index) |
| Query language | LogQL (PromQL-style) | KQL / Elasticsearch DSL | DQL / OpenSearch DSL |
| Ecosystem fit | Native Grafana / Prometheus integration | Kibana, deep APM tooling | OpenSearch Dashboards (Kibana fork), AWS-native |
| License | AGPLv3 | Elastic License v2 (source-available) | Apache 2.0 |
| Best fit | Cost-sensitive, label-driven queries | Rich full-text search, large query workloads | AWS-managed deployments wanting Apache licensing |

For audit logs specifically, prefer write-once / immutable storage on top of any of the three (e.g., S3 object lock for the underlying bucket).
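
One way to express the S3 Object Lock idea is an AWS CloudFormation template. The sketch below is only an assumption about how such a bucket might be declared; the resource name, retention mode, and retention period are placeholders to adapt to your compliance requirements.

```yaml
# Illustrative CloudFormation fragment: a write-once bucket for audit logs.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  AuditLogBucket:            # hypothetical logical name
    Type: AWS::S3::Bucket
    Properties:
      # Object Lock requires versioning to be enabled on the bucket.
      ObjectLockEnabled: true
      VersioningConfiguration:
        Status: Enabled
      ObjectLockConfiguration:
        ObjectLockEnabled: Enabled
        Rule:
          DefaultRetention:
            Mode: COMPLIANCE   # objects cannot be deleted or overwritten until retention expires
            Days: 365          # placeholder retention period
```

In `COMPLIANCE` mode nobody can shorten the retention period, so test with `GOVERNANCE` mode or a short period first.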


Metrics and Alerting: Prometheus + Alertmanager

The default open-source stack on Kubernetes:

  • Prometheus scrapes cluster and application metrics at a configurable interval.
  • Alertmanager routes alerts to PagerDuty / Slack / email and handles deduplication and silencing.
  • Grafana visualises Prometheus data; pair with Loki for unified logs + metrics.
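
The routing and deduplication behaviour described above could be configured roughly as follows. This is a sketch of an Alertmanager configuration in which the receiver names, channel, secret file paths, and the `severity` label convention are all assumptions.

```yaml
# Illustrative Alertmanager configuration.
route:
  receiver: default-slack
  # Alerts sharing these labels are grouped into one notification.
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts page a human instead of posting to chat.
    - matchers:
        - 'severity="critical"'
      receiver: security-pager
receivers:
  - name: default-slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack-webhook-url      # hypothetical path
        channel: "#cluster-alerts"                                     # hypothetical channel
        send_resolved: true
  - name: security-pager
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty-routing-key   # hypothetical path
```

Reading credentials from files rather than inlining them keeps webhook URLs and routing keys out of the configuration stored in version control.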

Security-relevant alerts to wire up on day one:

  • API server audit-log gap (no events for N seconds)
  • Spike in apiserver_request_total{code=~"401|403"} (failed authn or authz; a broader 4.. match also counts routine 404 and 409 responses)
  • Falco / Tetragon WARN and above
  • Certificate expiry within 14 days (cluster CA, etcd, kubelet)
  • Pod restart loops in kube-system or other security-critical namespaces
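
Several of the alerts above can be sketched as a Prometheus rule file. The thresholds, durations, and alert names below are assumptions to tune per cluster. The example narrows the status-code match to 401 and 403, relies on `apiserver_audit_event_total` being scraped from the API server, and needs kube-state-metrics for the restart counter. The certificate rule only covers client certificates presented to the API server, so CA and etcd certificates need a separate exporter.

```yaml
# Illustrative Prometheus alerting rules; thresholds are placeholders.
groups:
  - name: security-day-one
    rules:
      - alert: ApiServerAuthFailuresHigh
        # Sustained rate of unauthenticated (401) or forbidden (403) responses.
        expr: sum(rate(apiserver_request_total{code=~"401|403"}[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Sustained 401/403 responses from the API server
      - alert: AuditLogGap
        # Fires when no audit events are generated, or the metric disappears.
        expr: (sum(rate(apiserver_audit_event_total[5m])) == 0) or absent(apiserver_audit_event_total)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: API server has stopped producing audit events
      - alert: ClientCertificateExpiringSoon
        # 1209600 seconds = 14 days.
        expr: |
          apiserver_client_certificate_expiration_seconds_count > 0
          and on (job)
          histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 1209600
        labels:
          severity: warning
        annotations:
          summary: A client certificate used against the API server expires within 14 days
      - alert: KubeSystemPodRestartLoop
        expr: increase(kube_pod_container_status_restarts_total{namespace="kube-system"}[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: Container in kube-system is restarting repeatedly
```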


Version-Specific Notes (Kubernetes 1.36)

The runtime and observability surface has tightened in recent Kubernetes versions:

  • Structured authentication and authorization configuration: both reached beta in 1.30; authorization went GA in 1.32 and authentication in 1.34. The AuthenticationConfiguration and AuthorizationConfiguration files produce stable, easily audited identity configuration for the audit log to reference.
  • KMS v2 encryption providers: GA since 1.29. Key ID and rotation state surface through API server health checks and metrics rather than the audit log, so monitor those alongside audit records of Secret access.
  • Validating Admission Policy: GA since 1.30. CEL policies emit consistent admission decisions to the audit log without requiring a webhook to be reachable.
  • Sidecar containers: GA since 1.33. Init containers with restartPolicy: Always are the supported pattern for log shippers and runtime agents that must outlive their target containers.
  • Pod sandboxing via RuntimeClass: stable since 1.20. An admission policy can require a sandboxed runtime for a namespace; runtime detection tooling should be aware of which workloads are sandboxed and which share the host kernel.
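
The sidecar pattern mentioned in the list can be sketched as the Pod below. Image names, the Pod name, and the shared log path are hypothetical.

```yaml
# Illustrative Pod: a log shipper declared as a native sidecar.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-shipper
spec:
  initContainers:
    - name: log-shipper
      image: registry.example.com/log-shipper:1.0   # hypothetical image
      # restartPolicy: Always turns this init container into a sidecar:
      # it starts before the app and keeps running until the app has exited.
      restartPolicy: Always
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
  containers:
    - name: app
      image: registry.example.com/app:1.0           # hypothetical image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
  volumes:
    - name: app-logs
      emptyDir: {}
```

Because the sidecar is stopped only after the main container exits, the final log lines written during shutdown still get shipped.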

Always check the Kubernetes deprecation guide before upgrading.


Hardening Principles for Runtime Operations

Secure by Default

Turn on audit logging on the first day of a cluster. Default-deny rules for runtime detection are easier to relax later than to introduce after an incident.
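
On a kubeadm-style control plane, turning on audit logging typically means adding flags to the API server's static Pod manifest. The fragment below is a sketch: the file paths and rotation values are assumptions, and the existing flags, mounts, and volumes in the manifest stay in place.

```yaml
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml (kubeadm layout assumed).
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # ... existing flags remain unchanged ...
        - --audit-policy-file=/etc/kubernetes/audit/policy.yaml
        - --audit-log-path=/var/log/kubernetes/audit/audit.log
        - --audit-log-maxage=30       # days to keep rotated files
        - --audit-log-maxbackup=10    # number of rotated files to keep
        - --audit-log-maxsize=100     # megabytes per file before rotation
      volumeMounts:
        - name: audit-policy
          mountPath: /etc/kubernetes/audit
          readOnly: true
        - name: audit-log
          mountPath: /var/log/kubernetes/audit
  volumes:
    - name: audit-policy
      hostPath:
        path: /etc/kubernetes/audit
        type: Directory
    - name: audit-log
      hostPath:
        path: /var/log/kubernetes/audit
        type: DirectoryOrCreate
```

Without the two volume mounts the API server cannot read the policy file and fails to start, so validate the manifest on a non-production control plane first.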

Least Privilege

The runtime agent itself is a privileged workload. Scope its RBAC tightly, run it on a dedicated ServiceAccount, and forward its findings to a tenant outside the cluster it is monitoring.
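
A tightly scoped identity for the agent might look like the sketch below. The namespace, object names, and the exact resource list are assumptions; most agents only need read access to workload metadata for enriching events, so check your agent's documentation before trimming or extending it.

```yaml
# Illustrative RBAC for a runtime agent: read-only access to workload metadata.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: runtime-agent            # hypothetical name
  namespace: runtime-security    # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: runtime-agent-metadata-reader
rules:
  # No write verbs and no access to Secrets.
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: runtime-agent-metadata-reader
subjects:
  - kind: ServiceAccount
    name: runtime-agent
    namespace: runtime-security
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: runtime-agent-metadata-reader
```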

Defense in Depth

Pair audit logging (cluster-layer) with runtime detection (node-layer) and metric alerting (cluster + workload). A bypass at one layer should still produce a signal in another.

Continuous Verification

Treat runtime alerts as actionable, not informational. Re-tune detection rules after every incident; treat sustained "WARN" noise as a bug to fix, not a baseline to live with.


Conclusion

Runtime security is the layer where every other control in this site is verified or invalidated. Stack audit logging, runtime detection, centralised logs, and metrics so that nothing happens in the cluster without an authoritative record. Combine the practices linked here with the attack vectors and cluster hardening sections to design end-to-end coverage.