# Troubleshooting

Common issues and solutions for Korvet.
## Connection Issues

### Cannot connect to Korvet

**Symptoms:** Kafka clients time out or fail to connect.

**Solutions:**

1. Check that Korvet is running:

   ```shell
   curl http://localhost:8080/actuator/health
   ```

2. Verify the port is accessible:

   ```shell
   telnet localhost 9092
   ```

3. Check that firewall rules allow traffic on port 9092.
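The first two checks can be combined into a single probe script. This is an illustrative sketch assuming the default ports from the steps above (8080 for the health endpoint, 9092 for Kafka); it only tests raw TCP reachability, not protocol-level health.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe both Korvet ports; replace "localhost" with your Korvet host.
for port in (8080, 9092):
    status = "reachable" if can_connect("localhost", port) else "UNREACHABLE"
    print(f"port {port}: {status}")
```

If port 8080 is reachable but 9092 is not, the process is up and the problem is specific to the Kafka listener (firewall or listener configuration).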
### Cannot connect to Redis

**Symptoms:** Korvet fails to start or logs Redis connection errors.

**Solutions:**

1. Verify Redis is running:

   ```shell
   redis-cli ping
   ```

2. Check the Redis connection settings:

   ```shell
   export KORVET_REDIS_URI=redis://localhost:6379
   export KORVET_REDIS_USERNAME=default
   export KORVET_REDIS_PASSWORD=secret
   ```

3. Test Redis connectivity:

   ```shell
   redis-cli -h localhost -p 6379 ping
   ```
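A common cause of Redis connection errors is a typo in `KORVET_REDIS_URI`. You can sanity-check the URI before starting Korvet with the standard library alone; this sketch is illustrative (credentials are taken from the separate `KORVET_REDIS_USERNAME`/`KORVET_REDIS_PASSWORD` variables shown above, so only host, port, and TLS are parsed here).

```python
import os
from urllib.parse import urlparse

def redis_settings(uri: str) -> dict:
    """Break a redis:// URI into the pieces used for the connection."""
    parts = urlparse(uri)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"not a Redis URI: {uri!r}")
    return {
        "host": parts.hostname or "localhost",
        "port": parts.port or 6379,          # Redis default port
        "tls": parts.scheme == "rediss",     # rediss:// means TLS
    }

uri = os.environ.get("KORVET_REDIS_URI", "redis://localhost:6379")
print(redis_settings(uri))
```

If the printed host or port is not what you expect, fix the URI before looking at network or firewall causes.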
## Performance Issues

## Data Issues
### Messages not appearing

**Symptoms:** Produced messages don't show up when consuming.

**Solutions:**

1. Check that the topic exists:

   ```shell
   kafka-topics --bootstrap-server localhost:9092 --list
   ```

2. Verify the partition assignment.

3. Check that the consumer is reading from the correct offset.

4. Enable debug logging:

   ```shell
   export LOGGING_LEVEL_COM_REDIS_KORVET=DEBUG
   ```
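For the offset check in step 3, the useful number is consumer lag: the log-end offset minus the committed offset, per partition. A hypothetical helper to make the arithmetic explicit (the function and its inputs are illustrative, not part of Korvet):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: messages written but not yet consumed.

    A missing committed offset means the consumer has never committed,
    so the entire partition counts as lag.
    """
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}

# Example: partition 0 is fully caught up, partition 1 is 5 messages behind.
print(consumer_lag({0: 100, 1: 42}, {0: 100, 1: 37}))  # {0: 0, 1: 5}
```

If lag is zero everywhere but you expect messages, the consumer's committed offset is already past the messages you produced, and resetting the offset (or producing new messages) is the fix.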
## Resource Issues

## Integration Issues
### Databricks Spark Structured Streaming Timeout

**Symptoms:** Databricks Spark Structured Streaming fails with a timeout error:

```
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: describeTopics
```

**Root Cause:** A network connectivity issue between Databricks and Korvet, typically when Korvet is behind an AWS load balancer.
**Diagnosis:**

1. Test TCP connectivity from a Databricks notebook:

   ```python
   import socket
   socket.create_connection(("<korvet-host>", 9092), timeout=10)
   ```

   If this times out, Databricks cannot reach Korvet.

2. Check whether producers work from the same Databricks workspace. If producers work but Spark Structured Streaming doesn't, the issue is likely network-related.

3. Determine where the working producers are running:

   - The same Databricks workspace?
   - EC2 instances in Korvet's VPC?
   - A different network?
**Solutions:**

If Korvet is behind an AWS ELB/NLB:

1. Check the ELB scheme (internal vs. internet-facing):

   ```shell
   aws elbv2 describe-load-balancers \
     --region <region> \
     --query 'LoadBalancers[?DNSName==`<elb-hostname>`].[Scheme,Type,VpcId]'
   ```

2. If the ELB is internet-facing, update the security group to allow Databricks:

   ```shell
   # Add an inbound rule for port 9092 from the Databricks IP ranges
   aws ec2 authorize-security-group-ingress \
     --group-id <elb-security-group-id> \
     --protocol tcp \
     --port 9092 \
     --cidr <databricks-ip-range>
   ```

   See the Databricks IP ranges for your region.

3. If the ELB is internal, set up VPC peering:

   ```shell
   # Create VPC peering between the Databricks VPC and the Korvet VPC
   aws ec2 create-vpc-peering-connection \
     --vpc-id <databricks-vpc-id> \
     --peer-vpc-id <korvet-vpc-id>
   ```

   Then update the route tables and security groups to allow traffic.

4. Configure Korvet's advertised address to use the ELB hostname:

   ```shell
   export KORVET_SERVER_ADVERTISED_HOST=<elb-hostname>
   export KORVET_SERVER_ADVERTISED_PORT=9092
   ```
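The advertised-address setting matters because a Kafka client connects twice: first to the bootstrap address, then to whatever address the broker advertises in its metadata response. If the broker advertises a private hostname, the second connection fails even though the first succeeded. This toy simulation of that two-step flow uses hypothetical hostnames, not a real network:

```python
def metadata_flow(bootstrap: str, advertised: str, reachable: set) -> str:
    """Simulate the two-step Kafka connection: bootstrap first, then the
    advertised address returned in the metadata response.

    `reachable` is the set of hostnames the client can actually reach.
    """
    if bootstrap not in reachable:
        return "bootstrap connection failed"
    if advertised not in reachable:
        return "metadata ok, but advertised address unreachable"
    return "ok"

# Databricks can reach the ELB, but not the broker's private address:
reachable = {"elb.example.com"}
print(metadata_flow("elb.example.com", "ip-10-0-0-5.internal", reachable))
# After setting KORVET_SERVER_ADVERTISED_HOST to the ELB hostname:
print(metadata_flow("elb.example.com", "elb.example.com", reachable))
```

This is why fixing the security group or VPC peering alone may not be enough: the advertised host must also be a name the client can resolve and reach.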
**Verification:**

After fixing network connectivity, test from Databricks:

```python
# Test connectivity
import socket
socket.create_connection(("<korvet-host>", 9092), timeout=10)

# Test Spark Structured Streaming
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<korvet-host>:9092") \
    .option("subscribe", "<topic-name>") \
    .option("startingOffsets", "latest") \
    .load()
df.printSchema()  # Should not time out
```
**Additional Notes:**

- Producers may work while consumers fail because producers don't need to call `describeTopics()` during initialization.
- The error occurs when Spark's `KafkaAdminClient` tries to describe topics but cannot connect to the broker.
- This is a network connectivity issue, not a Korvet bug or configuration issue.
### Databricks Consumer Not Receiving Messages

**Symptoms:** A Databricks Spark Structured Streaming consumer connects successfully but receives no messages, even with `startingOffsets = "earliest"`. Korvet logs show only Metadata and ListOffsets requests, but no Fetch requests.

**Root Cause:** Databricks has saved checkpoint data from a previous run that contains committed offsets. The consumer resumes from these saved offsets instead of starting from "earliest".
**Diagnosis:**

1. Check the Korvet logs for Fetch requests:

   ```shell
   # Look for FetchHandler log entries
   grep "Handling fetch" korvet.log
   ```

   If you only see `MetadataHandler` and `ListOffsetsHandler` entries but no `FetchHandler` entries, the consumer is not fetching.

2. Verify the checkpoint location in Databricks:

   ```python
   # Check whether the checkpoint directory exists
   dbutils.fs.ls("/path/to/checkpoint/location")
   ```
**Solutions:**

**Option 1: Clear checkpoint data (recommended for testing)**

Delete the checkpoint directory to force the consumer to start fresh:

```python
# Remove the checkpoint directory
dbutils.fs.rm("/path/to/checkpoint/location", recurse=True)

# Restart the streaming query
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "<korvet-host>:9092") \
    .option("subscribe", "<topic-name>") \
    .option("startingOffsets", "earliest") \
    .load()
```
**Option 2: Use a different checkpoint location**

Specify a new checkpoint location for the streaming query:

```python
query = df.writeStream \
    .format("console") \
    .option("checkpointLocation", "/new/checkpoint/location") \
    .start()
```
**Option 3: Reset offsets in Korvet (if using consumer groups)**

If using consumer groups, reset the committed offsets:

```shell
# Use the kafka-consumer-groups tool to reset offsets
kafka-consumer-groups --bootstrap-server <korvet-host>:9092 \
  --group <consumer-group-id> \
  --topic <topic-name> \
  --reset-offsets --to-earliest \
  --execute
```
**Verification:**

After clearing checkpoints, verify that messages are being consumed:

```python
# Start the streaming query with console output
query = df.writeStream \
    .format("console") \
    .option("truncate", False) \
    .start()
```

Then check the Korvet logs for Fetch requests. You should now see:

```
DEBUG c.r.korvet.core.kafka.FetchHandler - Handling fetch: topics=1, partitions=...
```
**Additional Notes:**

- Databricks Structured Streaming automatically saves checkpoint data to ensure exactly-once processing.
- The `startingOffsets` option only applies when there is no checkpoint data.
- Once checkpoint data exists, the consumer always resumes from the saved offsets.
- This is expected Spark Structured Streaming behavior, not a Korvet issue.
- For production use, manage checkpoint locations carefully to avoid data loss.
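The offset-resolution rule described in these notes can be sketched as a tiny function: saved checkpoint offsets, when present, always win over `startingOffsets`. This is an illustrative model of the behavior, not Spark's actual implementation:

```python
def resolve_start(checkpoint_offsets, starting_offsets="latest"):
    """Model of how Structured Streaming picks where to read from:
    checkpoint offsets take precedence over the startingOffsets option.
    """
    if checkpoint_offsets is not None:
        return checkpoint_offsets  # startingOffsets is silently ignored
    return starting_offsets

# Fresh query, no checkpoint: startingOffsets applies.
print(resolve_start(None, "earliest"))       # earliest
# Restarted query with a checkpoint: resumes from the saved offsets.
print(resolve_start({0: 1500}, "earliest"))  # {0: 1500}
```

This is why changing `startingOffsets` on a restarted query has no visible effect: the checkpoint must be cleared or relocated first.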
## Getting Help

If you can't resolve the issue:

1. Check the logs with debug-level logging enabled.
2. Collect metrics from `/actuator/prometheus`.
3. File an issue at https://github.com/redis-field-engineering/korvet-dist/issues.