# Cold Storage (Delta Lake)
Korvet includes a built-in archival service that continuously archives messages from Redis to Delta Lake for long-term retention.
## Overview
The cold storage archival service:

- Reads messages from Redis (hot or warm tier) using consumer groups
- Writes them to Delta Lake tables in Parquet format with ZSTD compression
- Supports S3, HDFS, or local filesystem storage
- Runs as part of the Korvet server, so no additional infrastructure is required
When a warm tier is configured, the cold archiver reads from the warm tier instead of the hot tier. This enables a 3-tier pipeline: hot (RAM) → warm (disk-backed Redis) → cold (Delta Lake).
## Configuration
Enable cold storage by setting the `path` property in your `application.yml`:

```yaml
korvet:
  server:
    storage:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
      archiver:
        consumer-group: korvet-archiver
```

Cold storage is automatically enabled when `path` is set. There is no separate `enabled` property.
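For local development, `path` can also point at a local directory instead of S3. A minimal sketch, assuming the Hadoop-style `file://` scheme and a hypothetical directory:

```yaml
korvet:
  server:
    storage:
      # Hypothetical local path for development/testing; no s3 block is needed
      path: file:///var/lib/korvet/delta
```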
### Configuration Properties
| Property | Default | Description |
|---|---|---|
| `path` | required | Delta Lake storage path. Supports S3, HDFS, and local filesystem paths |
| `s3.region` | required for S3 | AWS region (e.g., `us-west-1`) |
| `s3.endpoint` | none | Custom S3 endpoint, for MinIO or LocalStack (example below) |
| `archiver.consumer-group` | `korvet-archiver` | Consumer group name for tracking archival progress |
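As an example, pointing the archiver at MinIO or LocalStack might look like the following; a minimal sketch, assuming the `s3.endpoint` key above and a hypothetical local endpoint URL:

```yaml
korvet:
  server:
    storage:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
        # Hypothetical local endpoint for a MinIO or LocalStack instance
        endpoint: http://localhost:9000
```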
### Tuning Properties
| Property | Default | Description |
|---|---|---|
| `read-worker-count` | | Number of parallel readers (set to the number of streams/partitions) |
| `commit-worker-count` | | Number of parallel writers to S3 |
| | | Messages per Redis `XREADGROUP` call |
| | | Batches to accumulate before writing a Parquet file |
| | | Parquet files per Delta Lake transaction |
| | | Block duration for `XREADGROUP`, in milliseconds |
| | | Interval to refresh the stream list, in milliseconds |
### Per-Topic Retention Configuration
Control when data moves from hot/warm tiers to cold storage using topic-level configuration:
| Configuration | Default | Description |
|---|---|---|
| `remote.storage.enable` | | Enable tiered storage for this topic (Kafka KIP-405 standard) |
| `local.retention.ms` | | Time to keep data in the hot tier before archiving |
| `local.retention.bytes` | | Size to keep in the hot tier before archiving |
| `near.retention.ms` | | Time to keep data in the warm tier before archiving to cold |
| `near.retention.bytes` | | Size to keep in the warm tier before archiving |
| `retention.ms` | | Total retention across all tiers; data is deleted after this time (Kafka standard) |
Cold tier retention is implicit: `cold_retention = retention.ms - local.retention.ms - near.retention.ms`
Example: Keep 1 day hot, 7 days warm, ~357 days cold (1 year total):

```bash
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --alter \
  --add-config remote.storage.enable=true,retention.ms=31536000000,local.retention.ms=86400000,near.retention.ms=604800000
```
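To confirm what was applied, the same tool can read the topic configuration back using the standard `--describe` action:

```bash
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --describe
```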
## Storage Format
### Delta Lake Table Structure
Messages are stored in Delta Lake tables organized by stream:
```
s3a://my-bucket/korvet/delta/
└── korvet/stream/orders/0/          # Topic "orders", partition 0
    ├── _delta_log/                  # Delta transaction log
    │   ├── 00000000000000000000.json
    │   └── ...
    ├── part-xxxxx.zstd.parquet      # Data files
    └── ...
```
### Parquet Schema
Each Parquet file contains the following columns:
| Column | Type | Description |
|---|---|---|
| | STRING | Redis stream key (e.g., `korvet/stream/orders/0`) |
| `message_id` | STRING | Redis stream message ID (in Redis `<milliseconds>-<sequence>` form) |
| `timestamp` | LONG | Message timestamp in milliseconds |
| `fields` | MAP<STRING, BINARY> | Message fields as key-value pairs |
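For reference, the table above corresponds roughly to the following Spark schema. A minimal sketch, where `stream_key` is an assumed name for the stream-key column (its exact name is not shown above) and the other names match the Spark example later on this page:

```scala
import org.apache.spark.sql.types._

// Sketch of the archived-message schema as a Spark StructType.
// "stream_key" is an assumed column name; the rest follow the table above.
val messageSchema = StructType(Seq(
  StructField("stream_key", StringType),
  StructField("message_id", StringType),
  StructField("timestamp", LongType),
  StructField("fields", MapType(StringType, BinaryType))
))
```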
## AWS Authentication
The archival service uses AWS SDK v2 (via Hadoop 3.4+) and supports multiple authentication methods:
### IAM Role (Recommended)
On EC2, ECS, or EKS, the service automatically uses instance/task/pod IAM roles via the default credential chain. No additional configuration is required.
Korvet uses IAMInstanceCredentialsProvider from Hadoop’s S3A connector, which leverages AWS SDK v2 for optimal performance and security.
## Performance
The archival service achieves high throughput when archiving to same-region S3:
| Configuration | Throughput | Notes |
|---|---|---|
| Single stream | ~32,000 msg/s | Baseline performance |
| 4 streams (parallel) | ~115,000 msg/s | Near-linear scaling |
See Cold Storage Benchmarks for detailed results.
### Performance Tips
- Same-region S3: Deploy Korvet in the same AWS region as your S3 bucket
- Multiple streams: Use multiple topic partitions for parallel archival
- Batch size: Larger batches (10,000+ messages) improve throughput
- Worker count: Match `read-worker-count` and `commit-worker-count` to your stream count (see the sketch after this list)
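A minimal tuning sketch for a topic with four partitions, assuming these worker-count properties sit under the `archiver` block next to `consumer-group` (the exact placement is not documented above):

```yaml
korvet:
  server:
    storage:
      archiver:
        consumer-group: korvet-archiver
        # Assumed placement: match both worker counts to the number of streams/partitions
        read-worker-count: 4
        commit-worker-count: 4
```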
## Reading Archived Data
Archived data can be read using:
- Apache Spark: Direct Delta Lake access
- Databricks: Native Delta Lake support
- Presto/Trino: Query Delta Lake tables
- AWS Athena: Serverless SQL queries
### Example: Reading with Spark
```scala
val df = spark.read.format("delta")
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")

df.select("message_id", "timestamp", "fields")
  .filter($"timestamp" > 1708956789000L)
  .show()
```
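Since `fields` is stored as `MAP<STRING, BINARY>`, values typically need a cast before they are readable. A minimal sketch continuing the example above, assuming a hypothetical field named `payload` containing UTF-8 text:

```scala
import org.apache.spark.sql.functions.col

// Look up the hypothetical "payload" entry in the fields map and
// cast its binary value to a UTF-8 string for display.
df.select(
    col("message_id"),
    col("fields").getItem("payload").cast("string").as("payload")
  )
  .show(truncate = false)
```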
See Databricks Integration for more examples.