> This version is still in development and is not considered stable yet. For the latest stable version, please use Korvet 0.12.5!
Cold Storage (Delta Lake)
Korvet includes a built-in archival service that continuously archives Redis stream data to Delta Lake for long-term retention.
Overview
The cold storage archival service:
- Reads from Redis streams using consumer groups
- Writes them to Delta Lake tables in Parquet format with ZSTD compression
- Supports S3, HDFS, or local filesystem storage
- Runs as part of the Korvet server with no separate archiver process
When Redis Flex is in use, Redis itself handles hot/warm placement across RAM and Flash. Korvet's archival service moves aged stream data from Redis into the Delta Lake remote tier.
Configuration
Enable cold storage by setting `korvet.storage.remote.path` in your `application.yml`:

```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
  archiver:
    scan-interval: 5m
    batch-size: 10000
    max-unarchived-age: 5m
    max-concurrent-streams: 1
```
Remote storage is enabled automatically when `korvet.storage.remote.path` is set. Topics are archived only when they also have `remote.storage.enable=true`.
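For local development against MinIO or LocalStack, the custom endpoint and static-credential options described below can be combined. The `endpoint` property name here is an assumption following the `korvet.storage.remote.s3` prefix used above; verify the exact key against your Korvet version:

```yaml
korvet:
  storage:
    remote:
      path: s3a://dev-bucket/korvet/delta
      s3:
        region: us-east-1
        # "endpoint" is an assumed property name for the custom S3
        # endpoint option; check the configuration reference.
        endpoint: http://localhost:9000   # MinIO's default API port
        credentials:
          type: static
```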
Configuration Properties
| Property | Default | Description |
|---|---|---|
| `korvet.storage.remote.path` | required | Cold storage path. Supports S3 (`s3a://`), HDFS, or local filesystem paths |
| `korvet.storage.remote.s3.region` | required for S3 | AWS region (for example, `us-west-1`) |
| `korvet.storage.remote.s3.endpoint` | none | Custom S3 endpoint for MinIO or LocalStack |
| `korvet.storage.remote.s3.credentials.type` | `auto` | Credential strategy: `auto`, `iam`, `irsa`, or `static` |
Tuning Properties
| Property | Default | Description |
|---|---|---|
| `korvet.archiver.scan-interval` | | How often the archiver planner re-checks Redis streams for eligible data |
| `korvet.archiver.batch-size` | | Maximum records written per archive write |
| `korvet.archiver.max-unarchived-age` | | Archive a stream once its oldest unarchived message reaches this threshold |
| | | Maximum records written into a single Parquet file before rolling over |
| | | Target payload size for each Parquet file, so archival writes favor healthier file sizes |
| `korvet.archiver.max-concurrent-streams` | | Number of concurrent stream archival attempts |
Per-Topic Retention Configuration
Control when data moves from local Redis storage to Delta Lake using topic-level configuration:
| Configuration | Default | Description |
|---|---|---|
| `remote.storage.enable` | `false` | Enable tiered storage for this topic (Kafka KIP-405 standard) |
| `local.retention.ms` | | Time to keep data in the local tier before Redis data expires |
| `local.retention.bytes` | | Size to keep in the local tier before Redis trimming falls back to |
| `retention.ms` | | Total retention across all tiers. Data is deleted after this time (Kafka standard) |

Remote-tier retention is implicit: `retention.ms - local.retention.ms`.
Example: Keep 1 day local and the rest remote (1 year total):

```shell
kafka-configs --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name my-topic --alter \
  --add-config remote.storage.enable=true,retention.ms=31536000000,local.retention.ms=86400000
```
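As a sanity check on the numbers above (plain arithmetic, not a Korvet API):

```python
# Values from the kafka-configs example above, in milliseconds.
retention_ms = 31_536_000_000       # total retention: 365 days
local_retention_ms = 86_400_000     # local tier: 1 day

# Remote-tier retention is implicit: retention.ms - local.retention.ms
remote_ms = retention_ms - local_retention_ms
days = remote_ms / 86_400_000

print(remote_ms, days)  # 31449600000 364.0
```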
Storage Format
Delta Lake Table Structure
Messages are stored in Delta Lake tables organized by stream:

```text
s3a://my-bucket/korvet/delta/
└── korvet/stream/orders/0/          # Topic "orders", partition 0
    ├── _delta_log/                  # Delta transaction log
    │   ├── 00000000000000000000.json
    │   └── ...
    ├── part-xxxxx.zstd.parquet      # Data files
    └── ...
```
Parquet Schema
Each Parquet file contains the following columns:

| Column | Type | Description |
|---|---|---|
| `stream` | STRING | Redis stream key |
| `ts` | LONG | Message timestamp extracted from the Redis message ID |
| `seq` | LONG | Message sequence extracted from the Redis message ID |
| `body` | MAP<STRING, BINARY> | Message fields as key-value pairs |
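The `ts` and `seq` columns are the two halves of a Redis stream entry ID, which has the form `<milliseconds>-<sequence>`. A minimal illustrative parser (plain Python, not part of Korvet):

```python
def split_redis_id(entry_id: str) -> tuple[int, int]:
    """Split a Redis stream entry ID such as '1708956789000-0' into
    (timestamp_ms, sequence), mirroring the ts and seq Parquet columns."""
    ms, seq = entry_id.split("-")
    return int(ms), int(seq)

ts, seq = split_redis_id("1708956789000-0")
print(ts, seq)  # 1708956789000 0
```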
AWS Authentication
The archival service uses Hadoop S3A configuration and supports multiple authentication methods.
Credential Types
| Type | Description |
|---|---|
| `auto` | Auto-detect credentials. Tries configured static credentials first, then environment or platform-provided credentials. |
| `iam` | Use instance metadata credentials, typically for EC2, ECS, or node-level EKS IAM roles. |
| `irsa` | Use EKS IAM Roles for Service Accounts via the OIDC web identity token file. |
| `static` | Use an explicit access key and secret. Useful for development, MinIO, or LocalStack. |
IRSA (EKS IAM Roles for Service Accounts)
For EKS deployments, IRSA provides pod-level IAM permissions without sharing node-level credentials.
```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-east-1
        credentials:
          type: irsa
```
When IRSA is configured on your EKS cluster, the following environment variables are injected into pods:

- `AWS_ROLE_ARN`: the IAM role ARN to assume
- `AWS_WEB_IDENTITY_TOKEN_FILE`: path to the OIDC token file
IAM Role (EC2/ECS/EKS Nodes)
On EC2, ECS, or EKS with node-level IAM roles, the service can use instance or task credentials via IMDS:
```yaml
korvet:
  storage:
    remote:
      path: s3a://my-bucket/korvet/delta
      s3:
        region: us-west-1
        credentials:
          type: iam
```
Troubleshooting S3 Access
Korvet provides a diagnostic endpoint to troubleshoot S3 configuration and connectivity issues.
S3 Diagnostics Endpoint
Access the endpoint at `/actuator/s3diagnostics`:

```shell
curl http://localhost:8080/actuator/s3diagnostics | jq .
```
Common Issues
- **Region not configured**: Ensure `korvet.storage.remote.s3.region` is set, or set `AWS_REGION` in the environment.
- **IRSA not detected**: Verify that `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` are present and the token file exists.
- **Permission denied (403)**: The IAM role or user lacks required S3 permissions. Typical permissions are:
  - `s3:ListBucket` on `arn:aws:s3:::bucket-name`
  - `s3:GetObject` on `arn:aws:s3:::bucket-name/*`
  - `s3:PutObject` on `arn:aws:s3:::bucket-name/*`
  - `s3:DeleteObject` on `arn:aws:s3:::bucket-name/*`
- **Identity shows wrong role**: On EKS, ensure the ServiceAccount is annotated with `eks.amazonaws.com/role-arn` and the pod is using that ServiceAccount.
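For reference, the ServiceAccount annotation mentioned above looks like this; the name, namespace, and role ARN are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: korvet                  # placeholder: the ServiceAccount your Korvet pods use
  namespace: korvet             # placeholder namespace
  annotations:
    # Placeholder ARN: the role granting the S3 permissions listed above
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/korvet-s3-access
```

Pods that reference this ServiceAccount receive `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` via the EKS pod identity webhook.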
Performance
The archival service achieves high throughput when archiving to same-region S3:
| Configuration | Throughput | Notes |
|---|---|---|
| Single stream | ~32,000 msg/s | Baseline performance |
| 4 streams (parallel) | ~115,000 msg/s | Near-linear scaling |
See Cold Storage Benchmarks for detailed results.
Performance Tips
- **Same-region S3**: Deploy Korvet in the same AWS region as your S3 bucket
- **Multiple streams**: Use multiple topic partitions for parallel archival
- **Batch size**: Larger archive batches can improve throughput
- **Archiver concurrency**: Increase `korvet.archiver.max-concurrent-streams` to match your stream count
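The tips above can be sketched as an `application.yml` tuning block for a topic with four partitions; the values are illustrative starting points, not recommendations:

```yaml
korvet:
  archiver:
    scan-interval: 5m            # re-check streams every 5 minutes
    batch-size: 50000            # larger batches can improve throughput
    max-concurrent-streams: 4    # match the number of stream partitions
```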
Reading Archived Data
Archived data can be read using:
-
Apache Spark: Direct Delta Lake access
-
Databricks: Native Delta Lake support
-
Presto/Trino: Query Delta Lake tables
-
AWS Athena: Serverless SQL queries
Example: Reading with Spark
```scala
val df = spark.read.format("delta")
  .load("s3a://my-bucket/korvet/delta/korvet/stream/orders/0")

df.select("stream", "ts", "seq", "body")
  .filter($"ts" > 1708956789000L)
  .show()
```
See Databricks Integration for more examples.