🚀🧊 KAFKA “DISKLESS” TOPICS: WHAT IT IS, WHEN TO USE IT, AND WHAT TO WATCH

· kafka,devops

TL;DR

Running Kafka with little-to-no dependence on local disks means decoupling compute from storage: brokers act as largely stateless, replaceable compute while the log lives in remote object storage, optionally fronted by a small local cache. You get faster recovery, elastic scaling, and simpler operations, but you must budget for network capacity, tail latency, and object-storage and egress costs.

🔸 WHAT “DISKLESS” MEANS

▪️ Brokers don’t rely on large local SSDs for the full retention window

▪️ Data durability shifts to remote storage (e.g., object storage); brokers serve reads from a local cache and fetch from remote storage on a miss

▪️ Brokers become replaceable (pets → cattle), improving operability and autoscaling

🔸 WHY TEAMS CONSIDER IT

▪️ Faster broker recovery: less time rebuilding large local logs

▪️ Elasticity: scale compute up/down without shuffling terabytes across nodes

▪️ Infra flexibility: run comfortably on Kubernetes/spot instances

▪️ Ops simplicity: fewer disk-related incidents (disk failures, painful partition rebalancing)
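
To put a rough number on the recovery claim, the sketch below compares re-copying a full local log with warming a small cache. Every figure in it (4 TB of local log, a 50 GB cache, a sustained 200 MB/s copy rate) is an illustrative assumption, not a measurement, and `transfer_hours` is a hypothetical helper name.

```python
def transfer_hours(data_gb: float, rate_mb_s: float) -> float:
    """Hours needed to copy data_gb gigabytes at rate_mb_s megabytes per second."""
    if rate_mb_s <= 0:
        raise ValueError("rate_mb_s must be positive")
    seconds = data_gb * 1000 / rate_mb_s  # decimal units: 1 GB = 1000 MB
    return seconds / 3600


# Assumed figures, for illustration only.
full_log_hours = transfer_hours(data_gb=4000, rate_mb_s=200)  # broker holding 4 TB locally
cache_warm_hours = transfer_hours(data_gb=50, rate_mb_s=200)  # broker with a 50 GB cache

print(f"rebuild full local log: {full_log_hours:.1f} h")
print(f"warm local cache:       {cache_warm_hours:.2f} h")
```

Under these assumptions the full rebuild takes about 5.6 hours while the cache warms in roughly four minutes. Real numbers depend on your replication throttles and remote-store throughput, so treat this as a template for your own measurements.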

🔸 HOW IT CHANGES YOUR ARCHITECTURE

▪️ Compute–storage decoupling: Kafka brokers focus on serving traffic; storage layer handles durability

▪️ Caching strategy: hot partitions benefit from local cache; cold data comes from remote storage

▪️ Network-first thinking: throughput, latency, and SLOs depend more on your network and remote store
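
The caching bullet above can be sketched as a read-through LRU cache: hot segments are served locally, and a miss triggers a fetch from remote storage. This is a minimal illustration, not how any particular Kafka implementation works; `ReadThroughCache`, `SegmentKey`, and the `fetch_remote` callback are hypothetical names, and a real broker would add concurrency control, prefetching, and error handling.

```python
from collections import OrderedDict
from typing import Callable, Tuple

# Hypothetical cache key: (topic, partition, base_offset) of a log segment.
SegmentKey = Tuple[str, int, int]


class ReadThroughCache:
    """Byte-bounded LRU cache that falls back to a remote fetch on a miss."""

    def __init__(self, capacity_bytes: int,
                 fetch_remote: Callable[[SegmentKey], bytes]) -> None:
        self.capacity_bytes = capacity_bytes
        self.fetch_remote = fetch_remote  # assumed to read from object storage
        self._entries: "OrderedDict[SegmentKey, bytes]" = OrderedDict()
        self._size = 0
        self.hits = 0
        self.misses = 0

    def get(self, key: SegmentKey) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        data = self.fetch_remote(key)  # cold read: pays remote latency
        self._put(key, data)
        return data

    def _put(self, key: SegmentKey, data: bytes) -> None:
        if len(data) > self.capacity_bytes:
            return  # too large to cache; serve it without storing
        self._entries[key] = data
        self._size += len(data)
        while self._size > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)  # drop least recently used
            self._size -= len(evicted)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hits` and `misses` alongside the data makes the cache hit ratio directly observable, which matters because every miss turns into remote-store latency and network traffic.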

🔸 TRADE-OFFS & GOTCHAS

▪️ Latency: cold reads can be slower, and produce acknowledgements that wait on a remote write can be slower too; watch p99/p99.9 tail latency

▪️ Network ceiling: broker NICs and egress limits become your new bottleneck

▪️ Costs: object storage capacity and request charges, egress, and extra network traffic can offset SSD savings

▪️ Operational guardrails: set clear retention tiers, cache sizes, and backpressure limits
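
A back-of-envelope calculation helps check the network ceiling before committing to a design. The sketch below is a simplified model under stated assumptions: every produced byte is uploaded to remote storage once, there is no broker-to-broker replication, traffic is spread evenly across brokers, and compression and protocol overhead are ignored. The function name and the example workload (300 MB/s produce, 3 consumer groups, 75% cache hit ratio, 6 brokers) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NetworkBudget:
    ingress_mb_s: float  # per-broker inbound: producer traffic + remote reads
    egress_mb_s: float   # per-broker outbound: remote uploads + consumer traffic


def per_broker_budget(produce_mb_s: float, consumer_fanout: float,
                      cache_hit_ratio: float, brokers: int) -> NetworkBudget:
    """Estimate per-broker network load for a diskless-style cluster."""
    if brokers <= 0:
        raise ValueError("brokers must be positive")
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    consume = produce_mb_s * consumer_fanout           # bytes served to consumers
    remote_reads = consume * (1.0 - cache_hit_ratio)   # cache misses pulled from remote
    ingress = produce_mb_s + remote_reads
    egress = produce_mb_s + consume                    # upload once + serve consumers
    return NetworkBudget(ingress / brokers, egress / brokers)


budget = per_broker_budget(produce_mb_s=300, consumer_fanout=3,
                           cache_hit_ratio=0.75, brokers=6)
print(budget)
```

With these assumed inputs each broker carries 87.5 MB/s inbound and 200 MB/s outbound, which is 1.6 Gbit/s of egress. Compare that against the NIC and any cloud bandwidth caps, leaving headroom for bursts.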

🔸 WHEN IT SHINES

▪️ Bursty & spiky workloads needing rapid scale-out/in

▪️ Multi-AZ / Multi-region designs where storage durability is centralized

▪️ Data lakes & analytics where long retention lives in object storage anyway

▪️ Kubernetes-first platforms seeking stateless brokers

🔸 WHEN TO BE CAUTIOUS

▪️ Ultra-low latency pipelines with strict p99 SLOs

▪️ Heavy cross-AZ or cross-region traffic (egress bills + latency)

▪️ Clusters with limited network headroom or noisy neighbors

🔸 CHECKLIST TO GET STARTED

▪️ Define RPO/RTO objectives and latency SLOs (p95/p99 targets)

▪️ Right-size broker cache and socket buffers; validate read-ahead behavior

▪️ Load-test hot vs. cold reads and observe cache hit ratio

▪️ Instrument remote fetch latency, throughput, egress, and costs

▪️ Simulate broker kills to verify recovery and autoscaling workflows
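
For the load-test step, one way to compare hot and cold reads against an SLO is to compute tail percentiles from the raw latency samples. The sketch below uses the nearest-rank percentile method; the function names, the SLO value, and the sample data are all assumptions for illustration, and in practice the samples would come from your load-test tool or metrics system.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of samples, with pct in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms: Sequence[float], slo_p99_ms: float) -> Dict[str, object]:
    """Report p95/p99 and whether the p99 meets the given SLO."""
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    return {"p95_ms": p95, "p99_ms": p99, "meets_slo": p99 <= slo_p99_ms}


# Made-up samples: hot reads served from cache, cold reads served from remote storage.
hot_ms = [2.0] * 95 + [8.0] * 5
cold_ms = [40.0] * 90 + [250.0] * 10

print("hot :", summarize(hot_ms, slo_p99_ms=50.0))
print("cold:", summarize(cold_ms, slo_p99_ms=50.0))
```

Reporting hot and cold paths separately keeps a healthy cache hit ratio from hiding a cold-read tail that breaks the SLO.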

TAKEAWAYS

▪️ Kafka “diskless” = stateless brokers + remote/object storage for durability.

▪️ You trade disk complexity for network and storage complexity, so measure, don’t guess.

▪️ Best for elastic, cloud-native platforms; be mindful of tail latency and egress costs.

▪️ Success = right caching strategy, strong observability, and SLO-driven tuning.

#kafka #diskless #streaming #CloudNative

See: https://bit.ly/d1skless