Cold Storage (Delta Lake)
Korvet includes a built-in archival service that continuously archives sealed Redis buckets to Delta Lake for long-term retention.
Overview
The cold storage archival service:
- Reads sealed buckets from Redis using consumer groups
- Writes them to Delta Lake tables in Parquet format with ZSTD compression
- Supports S3, HDFS, or local filesystem storage
- Runs as part of the Korvet server with no separate archiver process
When Redis Flex is in use, Redis itself handles hot/warm placement in RAM and Flash. Korvet’s archival service moves sealed buckets from Redis to the Delta Lake remote tier.
Configuration
Enable cold storage by setting korvet.storage.remote.path in your application.yml:
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
      index:
        bootstrap-enabled: true
        bootstrap-timeout: 3s
      archiver:
        interval: 5m
        batch-size: 10000
        workers: 1
      retention:
        enabled: true
        interval: 1h
Remote storage is enabled automatically when korvet.storage.remote.path is set. Topics are archived only when they also have remote.storage.enable=true.
Configuration Properties
| Property | Default | Description |
|---|---|---|
| korvet.storage.remote.path | required | Delta Lake storage path. Supports S3 (s3a://), HDFS, and local filesystem paths |
| korvet.storage.remote.s3.region | required for S3 | AWS region (for example, us-west-1) |
| | none | Custom S3 endpoint for MinIO or LocalStack |
| korvet.storage.remote.s3.credentials.type | | Credential strategy; see Credential Types under AWS Authentication below |
| korvet.storage.remote.index.bootstrap-enabled | | Bootstrap cold indexes during startup |
| korvet.storage.remote.index.bootstrap-timeout | | Maximum time a foreground request waits for cold index bootstrap |
| | | Scanned files between cold index rebuild progress log messages |
Tuning Properties
| Property | Default | Description |
|---|---|---|
| korvet.storage.remote.archiver.interval | | How often the archiver checks for sealed buckets to archive when idle |
| korvet.storage.remote.archiver.batch-size | | Maximum records written per archive write |
| | | Parquet files written per Delta commit |
| korvet.storage.remote.archiver.workers | | Number of concurrent archiver workers |
| korvet.storage.remote.retention.enabled | | Enable cold-storage retention cleanup |
| korvet.storage.remote.retention.interval | | How often cold-storage retention cleanup runs |
Per-Topic Retention Configuration
Control when data moves from local Redis storage to Delta Lake using topic-level configuration:
| Configuration | Default | Description |
|---|---|---|
| remote.storage.enable | false | Enable tiered storage for this topic (Kafka KIP-405 standard) |
| local.retention.ms | | Time to keep in the local tier before Redis data expires |
| local.retention.bytes | | Size to keep in the local tier before Redis trimming falls back to retention.bytes |
| retention.ms | | Total retention across all tiers. Data is deleted after this time (Kafka standard) |
Remote-tier retention is implicit: retention.ms - local.retention.ms.
Example: Keep 1 day local and the rest remote (1 year total):
kafka-configs --bootstrap-server localhost:9092 \
--entity-type topics --entity-name my-topic --alter \
--add-config remote.storage.enable=true,retention.ms=31536000000,local.retention.ms=86400000
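To see the split those values imply, here is a small worked example (plain Scala arithmetic; the constants are simply the numbers from the command above):

val retentionMs      = 31536000000L   // total retention: 365 days
val localRetentionMs = 86400000L      // local (Redis) tier: 1 day
val remoteMs = retentionMs - localRetentionMs
println(remoteMs / 86400000L)         // 364 -> days served only from Delta Lake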
Storage Format
Delta Lake Table Structure
Messages are stored in Delta Lake tables organized by stream:
s3a://my-bucket/korvet/delta/
└── korvet/stream/orders/0/ # Topic "orders", partition 0
├── _delta_log/ # Delta transaction log
│ ├── 00000000000000000000.json
│ └── ...
├── part-xxxxx.zstd.parquet # Data files
└── ...
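Each archiver commit adds a new entry to the _delta_log directory shown above. One way to inspect those commits is the Delta Lake table history API from Spark (a sketch, assuming the delta-spark dependency is on the classpath and spark is an active SparkSession):

import io.delta.tables.DeltaTable

// One row per commit in the _delta_log transaction log
val table = DeltaTable.forPath(spark, "s3a://my-bucket/korvet/delta/korvet/stream/orders/0")
table.history()
  .select("version", "timestamp", "operation", "operationMetrics")
  .show(truncate = false)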
Parquet Schema
Each Parquet file contains the following columns:
| Column | Type | Description |
|---|---|---|
| stream | STRING | Redis stream key the messages were read from |
| ts | LONG | Message timestamp extracted from the Redis message ID |
| seq | LONG | Message sequence extracted from the Redis message ID |
| body | MAP<STRING, BINARY> | Message fields as key-value pairs |
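Loading one of these tables with Spark should show a schema that lines up with the table above (an illustrative sketch; the exact nullability flags in the printed output are an assumption):

val df = spark.read.format("delta")
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")

df.printSchema()
// root
//  |-- stream: string (nullable = true)
//  |-- ts: long (nullable = true)
//  |-- seq: long (nullable = true)
//  |-- body: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: binary (valueContainsNull = true)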
AWS Authentication
The archival service uses Hadoop S3A configuration and supports multiple authentication methods.
Credential Types
| Type | Description |
|---|---|
| | Auto-detect credentials. Tries configured static credentials first, then environment or platform-provided credentials. |
| iam | Use instance metadata credentials, typically for EC2, ECS, or node-level EKS IAM roles. |
| irsa | Use EKS IAM Roles for Service Accounts via a web identity token. |
| | Use explicit access key and secret. Useful for development, MinIO, or LocalStack. |
IRSA (EKS IAM Roles for Service Accounts)
For EKS deployments, IRSA provides pod-level IAM permissions without sharing node-level credentials.
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-east-1
        credentials:
          type: irsa
When IRSA is configured on your EKS cluster, the following environment variables are injected into pods:
- AWS_ROLE_ARN - The IAM role ARN to assume
- AWS_WEB_IDENTITY_TOKEN_FILE - Path to the OIDC token file
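A quick way to confirm the variables are actually visible to the process is to print them from any JVM running in the same pod (a minimal Scala sketch, not a Korvet API):

// Both variables must be present for IRSA to work
for (name <- Seq("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE"))
  println(s"$name=${sys.env.getOrElse(name, "<missing>")}")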
IAM Role (EC2/ECS/EKS Nodes)
On EC2, ECS, or EKS with node-level IAM roles, the service can use instance or task credentials via IMDS:
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
        credentials:
          type: iam
Troubleshooting S3 Access
Korvet provides a diagnostic endpoint to troubleshoot S3 configuration and connectivity issues.
S3 Diagnostics Endpoint
Access the endpoint at /actuator/s3diagnostics:
curl http://localhost:8080/actuator/s3diagnostics | jq .
Common Issues
- Region not configured
-
Ensure
korvet.storage.remote.s3.regionis set, or setAWS_REGIONin the environment. - IRSA not detected
-
Verify that
AWS_ROLE_ARNandAWS_WEB_IDENTITY_TOKEN_FILEare present and the token file exists. - Permission denied (403)
-
The IAM role or user lacks required S3 permissions. Typical permissions are:
-
s3:ListBucketonarn:aws:s3:::bucket-name -
s3:GetObjectonarn:aws:s3:::bucket-name/* -
s3:PutObjectonarn:aws:s3:::bucket-name/* -
s3:DeleteObjectonarn:aws:s3:::bucket-name/*
-
- Identity shows wrong role
-
On EKS, ensure the ServiceAccount is annotated with
eks.amazonaws.com/role-arnand the pod is using that ServiceAccount.
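For the wrong-role case, it can also help to print the identity that the default AWS credential chain actually resolves. A minimal sketch using the AWS SDK for Java v2 STS client (assumes the software.amazon.awssdk:sts dependency is available; this is not a Korvet API):

import software.amazon.awssdk.services.sts.StsClient
import software.amazon.awssdk.services.sts.model.GetCallerIdentityRequest

// Resolves credentials via the default provider chain and prints the caller's ARN
val sts = StsClient.create()
try {
  val id = sts.getCallerIdentity(GetCallerIdentityRequest.builder().build())
  println(s"account=${id.account()} arn=${id.arn()}")
} finally sts.close()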
Performance
The archival service achieves high throughput when archiving to same-region S3:
| Configuration | Throughput | Notes |
|---|---|---|
| Single stream | ~32,000 msg/s | Baseline performance |
| 4 streams (parallel) | ~115,000 msg/s | Near-linear scaling |
See Cold Storage Benchmarks for detailed results.
Performance Tips
-
Same-region S3: Deploy Korvet in the same AWS region as your S3 bucket
-
Multiple streams: Use multiple topic partitions for parallel archival
-
Batch size: Larger archive batches can improve throughput
-
Worker count: Increase
korvet.storage.remote.archiver.workersto match your stream count
Reading Archived Data
Archived data can be read using:
- Apache Spark: Direct Delta Lake access
- Databricks: Native Delta Lake support
- Presto/Trino: Query Delta Lake tables
- AWS Athena: Serverless SQL queries
Example: Reading with Spark
import spark.implicits._   // needed for the $"column" syntax outside spark-shell

val df = spark.read.format("delta")
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")

df.select("stream", "ts", "seq", "body")
  .filter($"ts" > 1708956789000L)
  .show()
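Since ts is epoch milliseconds (it comes from the Redis message ID) and body values are raw bytes, a common follow-up is to convert both. Continuing from the df above (a sketch; the field name order_id is only a placeholder for whatever fields your producers write):

import org.apache.spark.sql.functions._

df.select(
    col("stream"),
    (col("ts") / 1000).cast("timestamp").as("event_time"),          // epoch millis -> timestamp
    col("body").getItem("order_id").cast("string").as("order_id")   // placeholder field name
  )
  .orderBy(col("ts").desc)
  .show(truncate = false)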
See Databricks Integration for more examples.