Monitoring

Korvet exposes health checks and metrics through Spring Boot Actuator and Micrometer.

Health Checks

Korvet exposes health check endpoints:

# Overall health
curl http://localhost:8080/actuator/health

# Liveness probe (for Kubernetes)
curl http://localhost:8080/actuator/health/liveness

# Readiness probe (for Kubernetes)
curl http://localhost:8080/actuator/health/readiness
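
The same checks can be scripted. The following is a minimal sketch, assuming the endpoints return the standard Spring Boot Actuator JSON body with a top-level "status" field and that Korvet is reachable at the address used in the curl examples. The names parse_status and check are illustrative, not part of Korvet.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same host and port as the curl examples


def parse_status(body):
    """Return the top-level "status" field of an Actuator health response."""
    return json.loads(body).get("status", "UNKNOWN")


def check(path, base_url=BASE_URL, timeout=2.0):
    """Fetch a health endpoint and return its status, or "UNREACHABLE"."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return parse_status(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # By default Actuator answers 503 for DOWN and OUT_OF_SERVICE;
        # the response body still carries the status.
        try:
            return parse_status(err.read().decode("utf-8"))
        except ValueError:
            return "UNKNOWN"
    except (urllib.error.URLError, OSError):
        return "UNREACHABLE"


if __name__ == "__main__":
    for path in ("/actuator/health",
                 "/actuator/health/liveness",
                 "/actuator/health/readiness"):
        print(path, check(path))
```

A status other than UP on the readiness endpoint usually means the instance should not receive traffic yet, while a failing liveness endpoint suggests a restart is needed.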

Metrics

Metrics are exposed in Prometheus format:

curl http://localhost:8080/actuator/prometheus
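
To inspect the output outside Prometheus, the text format can be parsed with a few lines of Python. This is a simplified sketch, not a full parser: it ignores timestamps and does not handle escaped quotes inside label values. The function name parse_metrics is illustrative.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified pattern does not cover
        name, labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(labels or "")), float(value)))
    return samples
```

Note that the Micrometer Prometheus registry rewrites metric names: dots become underscores and a base unit is typically appended, so jvm.memory.used appears as jvm_memory_used_bytes.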

Available Metrics

Korvet Custom Metrics

korvet.produce.messages: Number of messages produced (tags: topic, partition)
korvet.produce.latency: Produce request latency histogram (tags: topic, partition)
korvet.fetch.requests: Number of fetch requests (tags: topic, partition)
korvet.fetch.latency: Fetch request latency histogram (tags: topic, partition)
korvet.fetch.messages: Number of messages fetched (tags: topic, partition, tier)
korvet.errors: Number of errors (tags: operation, error_type)
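
The tags make it possible to aggregate a metric along one dimension. The sketch below sums fetched messages per tier from already-parsed samples. It assumes korvet.fetch.messages is a counter exported as korvet_fetch_messages_total, which is the usual Micrometer naming but is not confirmed here. The topic name and all values are made up for illustration.

```python
from collections import defaultdict


def sum_by_label(samples, metric, label):
    """Sum sample values of one metric, grouped by the value of one label."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


# Hypothetical samples as (name, labels, value); not real Korvet output.
samples = [
    ("korvet_fetch_messages_total", {"topic": "orders", "partition": "0", "tier": "hot"}, 120.0),
    ("korvet_fetch_messages_total", {"topic": "orders", "partition": "1", "tier": "hot"}, 80.0),
    ("korvet_fetch_messages_total", {"topic": "orders", "partition": "0", "tier": "cold"}, 5.0),
]

print(sum_by_label(samples, "korvet_fetch_messages_total", "tier"))
# {'hot': 200.0, 'cold': 5.0}
```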

JVM and System Metrics

Standard JVM and system metrics from Micrometer:

jvm.memory.used: JVM memory used
jvm.memory.max: JVM maximum memory
jvm.gc.pause: Garbage collection pause time
jvm.threads.live: Live threads
process.cpu.usage: Process CPU usage
system.cpu.usage: System CPU usage
system.load.average.1m: System load average over the last minute

Prometheus Configuration

Add Korvet to your Prometheus scrape config:

scrape_configs:
  - job_name: 'korvet'
    static_configs:
      - targets: ['korvet:8080']
    metrics_path: '/actuator/prometheus'

Grafana Dashboards

A complete monitoring stack with Prometheus and Grafana is available in the grafana/ directory. See grafana/README.adoc for quick start instructions.

The pre-built dashboard visualizes:

Kafka Operations: Produce/fetch rates, latency percentiles (p50, p95, p99)
Storage Tiers: Messages fetched by tier (hot/warm/cold)
Errors: Error rates by operation and type
Resources: CPU usage, memory usage
JVM: Heap memory, GC pauses, thread count
System: Load average, disk space

Alerting

Consider setting up alerts for the following conditions:

Korvet-Specific Alerts

High produce latency: p99 above 100 ms
High fetch latency: p99 above 50 ms
Error rate spike: Errors exceed 1% of requests
No produce activity: No messages produced in 5 minutes (only where continuous traffic is expected)
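
In practice these conditions are usually written as Prometheus alert rules, where histogram_quantile estimates the percentile. The sketch below only illustrates the arithmetic behind the latency and error-rate thresholds. It assumes the latency metrics are exported as cumulative histogram buckets with upper bounds in seconds. The bucket counts are made up, and which requests form the denominator of the error rate is left to the reader.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    bound, with math.inf as the last bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there were no requests."""
    return errors / requests if requests else 0.0


# Hypothetical produce latency buckets: (upper bound in seconds, cumulative count).
produce_buckets = [(0.025, 900), (0.05, 960), (0.1, 985), (0.25, 998), (math.inf, 1000)]

p99 = histogram_quantile(0.99, produce_buckets)
print("produce p99 alert:", p99 > 0.100)             # threshold: 100 ms
print("error alert:", error_rate(12, 1000) > 0.01)   # threshold: 1%
```

With these sample numbers the p99 estimate falls in the 0.1 to 0.25 second bucket, so the produce latency condition fires.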

System Alerts

Memory pressure: JVM heap > 80%
High GC activity: Frequent or long GC pauses
High CPU usage: Process CPU > 80%
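
The memory and CPU thresholds can be checked against scraped samples as sketched below. This assumes the Prometheus names that Micrometer produces for the metrics listed earlier (jvm_memory_used_bytes, jvm_memory_max_bytes, process_cpu_usage) and samples in (name, labels, value) form. Memory pools that report no maximum (a value of -1) are skipped. All values are made up.

```python
def heap_usage_ratio(samples):
    """Return used heap divided by maximum heap, or None if no maximum is known."""
    used = sum(value for name, labels, value in samples
               if name == "jvm_memory_used_bytes" and labels.get("area") == "heap")
    limit = sum(value for name, labels, value in samples
                if name == "jvm_memory_max_bytes" and labels.get("area") == "heap" and value > 0)
    return used / limit if limit > 0 else None


def cpu_usage(samples):
    """Return the process CPU usage as a fraction between 0 and 1, or None if absent."""
    for name, _labels, value in samples:
        if name == "process_cpu_usage":
            return value
    return None


# Hypothetical samples; not real Korvet output.
system_samples = [
    ("jvm_memory_used_bytes", {"area": "heap", "id": "G1 Eden Space"}, 600e6),
    ("jvm_memory_used_bytes", {"area": "heap", "id": "G1 Old Gen"}, 300e6),
    ("jvm_memory_max_bytes", {"area": "heap", "id": "G1 Eden Space"}, -1.0),
    ("jvm_memory_max_bytes", {"area": "heap", "id": "G1 Old Gen"}, 1000e6),
    ("process_cpu_usage", {}, 0.35),
]

print("memory pressure:", heap_usage_ratio(system_samples) > 0.80)
print("high CPU:", cpu_usage(system_samples) > 0.80)
```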

Logging

See Logging for log-based monitoring.