# Cold Storage (Delta Lake)
Korvet includes a built-in archival service that continuously archives Redis stream data to Delta Lake for long-term retention.
## Overview
The cold storage archival service:

- Reads from Redis streams using consumer groups (see the sketch after this list)
- Writes messages to Delta Lake tables in Parquet format with ZSTD compression
- Supports S3, HDFS, or local filesystem storage
- Runs as part of the Korvet server, with no separate archiver process
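Conceptually, the consumer-group read path resembles the following `redis-cli` session (a sketch only; the group and consumer names are illustrative rather than Korvet's actual internals, and the stream key follows the layout shown under Storage Format):

```bash
# Create a consumer group on an archived stream (group name "archiver" is hypothetical)
redis-cli XGROUP CREATE korvet/stream/orders/0 archiver $ MKSTREAM

# Read a batch of undelivered entries as consumer "worker-1" (also hypothetical)
redis-cli XREADGROUP GROUP archiver worker-1 COUNT 10000 STREAMS korvet/stream/orders/0 '>'
```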
When Redis Flex is in use, Redis itself handles hot/warm placement in RAM and Flash. Korvet's archival service moves stream data from Redis to the Delta Lake remote tier.
## Configuration

Enable cold storage by setting `korvet.storage.remote.path` in your `application.yml`:
```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
      index:
        bootstrap-enabled: true
        bootstrap-timeout: 3s
      archiver:
        interval: 5m
        batch-size: 10000
        workers: 1
      retention:
        enabled: true
        interval: 1h
```
Remote storage is enabled automatically when `korvet.storage.remote.path` is set. Topics are archived only when they also have `remote.storage.enable=true`.
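For local development, the same setting can point at a filesystem path instead of S3 (a minimal sketch; the directory is illustrative, and the `file://` scheme assumes Hadoop-style filesystem URIs):

```yaml
korvet:
  storage:
    remote:
      # Local filesystem instead of s3a:// (path is illustrative)
      path: file:///var/lib/korvet/delta
```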
### Configuration Properties

| Property | Default | Description |
|---|---|---|
| `korvet.storage.remote.path` | required | Delta Lake storage path. Supports S3 (`s3a://`), HDFS, and local filesystem paths |
| `korvet.storage.remote.s3.region` | required for S3 | AWS region (for example, `us-west-1`) |
| `korvet.storage.remote.s3.endpoint` | none | Custom S3 endpoint for MinIO or LocalStack |
| `korvet.storage.remote.s3.credentials.type` | | Credential strategy; see Credential Types below |
| `korvet.storage.remote.index.bootstrap-enabled` | | Bootstrap cold indexes during startup |
| `korvet.storage.remote.index.bootstrap-timeout` | | Maximum time a foreground request waits for cold index bootstrap |
| | | Scanned files between cold index rebuild progress log messages |
| | | Maximum number of cold-index segment entries examined per Redis Search lookup |
### Tuning Properties

| Property | Default | Description |
|---|---|---|
| `korvet.storage.remote.archiver.interval` | | How often idle archiver workers re-check Redis streams for eligible data |
| `korvet.storage.remote.archiver.batch-size` | | Maximum records written per archive write |
| | | Minimum undelivered archival-group backlog before workers start buffering full batches |
| | | Archive partial batches when the archival cursor age reaches this threshold |
| | | Parquet files written per Delta commit |
| `korvet.storage.remote.archiver.workers` | | Number of concurrent archiver workers |
| `korvet.storage.remote.retention.interval` | not set | How often cold-storage retention cleanup runs |
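For example, a higher-throughput archiver profile might raise the batch size and worker count (a sketch; the values are illustrative, not tuned recommendations):

```yaml
korvet:
  storage:
    remote:
      archiver:
        interval: 1m       # re-check streams more often
        batch-size: 10000  # records per archive write
        workers: 4         # match the number of streams being archived
```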
### Per-Topic Retention Configuration

Control when data moves from local Redis storage to Delta Lake using topic-level configuration:

| Configuration | Default | Description |
|---|---|---|
| `remote.storage.enable` | `false` | Enable tiered storage for this topic (Kafka KIP-405 standard) |
| `local.retention.ms` | | Time to keep in the local tier before Redis data expires |
| `local.retention.bytes` | | Size to keep in the local tier before Redis trimming falls back to `retention.bytes` |
| `retention.ms` | | Total retention across all tiers. Data is deleted after this time (Kafka standard) |

Remote-tier retention is implicit: `retention.ms - local.retention.ms`.
Example: keep 1 day local and the rest remote (1 year total):

```bash
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --alter \
  --add-config remote.storage.enable=true,retention.ms=31536000000,local.retention.ms=86400000
```
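To confirm the overrides took effect, the standard `kafka-configs --describe` form can be used (shown for completeness; this is plain Kafka tooling, not Korvet-specific):

```bash
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --describe
```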
## Storage Format

### Delta Lake Table Structure

Messages are stored in Delta Lake tables organized by stream:

```
s3a://my-bucket/korvet/delta/
└── korvet/stream/orders/0/          # Topic "orders", partition 0
    ├── _delta_log/                  # Delta transaction log
    │   ├── 00000000000000000000.json
    │   └── ...
    ├── part-xxxxx.zstd.parquet      # Data files
    └── ...
```
### Parquet Schema

Each Parquet file contains the following columns:

| Column | Type | Description |
|---|---|---|
| `stream` | STRING | Redis stream key (for example, `korvet/stream/orders/0`) |
| `ts` | LONG | Message timestamp extracted from the Redis message ID |
| `seq` | LONG | Message sequence extracted from the Redis message ID |
| `body` | MAP<STRING, BINARY> | Message fields as key-value pairs |
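Since Redis stream IDs have the form `<ms>-<seq>`, the original message ID can be reconstructed from `ts` and `seq` when reading the table (a Spark sketch under that assumption):

```scala
import org.apache.spark.sql.functions.{col, concat_ws}

// Rebuild the Redis message ID ("<ms>-<seq>") from the archived columns
val withId = spark.read.format("delta")
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")
  .withColumn("message_id", concat_ws("-", col("ts"), col("seq")))
```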
## AWS Authentication

The archival service uses Hadoop S3A configuration and supports multiple authentication methods.

### Credential Types

| Type | Description |
|---|---|
| `auto` | Auto-detect credentials. Tries configured static credentials first, then environment or platform-provided credentials. |
| `iam` | Use instance metadata credentials, typically for EC2, ECS, or node-level EKS IAM roles. |
| `irsa` | Use EKS IAM Roles for Service Accounts via a web identity token. |
| `static` | Use explicit access key and secret. Useful for development, MinIO, or LocalStack. |
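A development setup against MinIO might combine `static` credentials with a custom endpoint (a sketch; the `access-key` and `secret-key` property names are assumptions not confirmed by this page, and the endpoint uses MinIO's default API port):

```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-east-1
        endpoint: http://localhost:9000   # MinIO's default API port
        credentials:
          type: static
          access-key: minioadmin          # hypothetical property name
          secret-key: minioadmin          # hypothetical property name
```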
### IRSA (EKS IAM Roles for Service Accounts)

For EKS deployments, IRSA provides pod-level IAM permissions without sharing node-level credentials.

```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-east-1
        credentials:
          type: irsa
```

When IRSA is configured on your EKS cluster, the following environment variables are injected into pods:

- `AWS_ROLE_ARN`: The IAM role ARN to assume
- `AWS_WEB_IDENTITY_TOKEN_FILE`: Path to the OIDC token file
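On the Kubernetes side, IRSA is wired up through a ServiceAccount annotation (a standard EKS sketch; the names and role ARN are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: korvet                      # placeholder name
  namespace: default
  annotations:
    # placeholder ARN; use the IAM role created for your pods
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/korvet-s3-access
```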
### IAM Role (EC2/ECS/EKS Nodes)

On EC2, ECS, or EKS with node-level IAM roles, the service can use instance or task credentials via IMDS:

```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
        credentials:
          type: iam
```
## Troubleshooting S3 Access

Korvet provides a diagnostic endpoint to troubleshoot S3 configuration and connectivity issues.

### S3 Diagnostics Endpoint

Access the endpoint at `/actuator/s3diagnostics`:

```bash
curl http://localhost:8080/actuator/s3diagnostics | jq .
```
Common Issues
- Region not configured
-
Ensure
korvet.storage.remote.s3.regionis set, or setAWS_REGIONin the environment. - IRSA not detected
-
Verify that
AWS_ROLE_ARNandAWS_WEB_IDENTITY_TOKEN_FILEare present and the token file exists. - Permission denied (403)
-
The IAM role or user lacks required S3 permissions. Typical permissions are:
-
s3:ListBucketonarn:aws:s3:::bucket-name -
s3:GetObjectonarn:aws:s3:::bucket-name/* -
s3:PutObjectonarn:aws:s3:::bucket-name/* -
s3:DeleteObjectonarn:aws:s3:::bucket-name/*
-
- Identity shows wrong role
-
On EKS, ensure the ServiceAccount is annotated with
eks.amazonaws.com/role-arnand the pod is using that ServiceAccount.
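Those four permissions translate into an IAM policy along these lines (a sketch; `bucket-name` is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::bucket-name"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::bucket-name/*"
    }
  ]
}
```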
## Performance

The archival service achieves high throughput when archiving to same-region S3:

| Configuration | Throughput | Notes |
|---|---|---|
| Single stream | ~32,000 msg/s | Baseline performance |
| 4 streams (parallel) | ~115,000 msg/s | Near-linear scaling |

See Cold Storage Benchmarks for detailed results.
### Performance Tips

- **Same-region S3**: Deploy Korvet in the same AWS region as your S3 bucket
- **Multiple streams**: Use multiple topic partitions for parallel archival
- **Batch size**: Larger archive batches can improve throughput
- **Worker count**: Increase `korvet.storage.remote.archiver.workers` to match your stream count
## Reading Archived Data

Archived data can be read using:

- **Apache Spark**: Direct Delta Lake access
- **Databricks**: Native Delta Lake support
- **Presto/Trino**: Query Delta Lake tables
- **AWS Athena**: Serverless SQL queries
### Example: Reading with Spark

```scala
val df = spark.read.format("delta")
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")

df.select("stream", "ts", "seq", "body")
  .filter($"ts" > 1708956789000L)
  .show()
```
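Delta Lake also supports time travel, which can be useful for auditing archived batches (a sketch using the standard `versionAsOf` read option; not Korvet-specific):

```scala
// Read the table as of an earlier Delta commit version
val snapshot = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")
```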
See Databricks Integration for more examples.