Introduction
The Redis Connector for Spark integrates Redis with Apache Spark, supporting both reading data from and writing data to Redis. Within a Spark 3 environment the connector lets users read data from Redis, manipulate it using Spark operations, and write the results back to Redis or to another system. Data can also be imported into Redis by reading it from any data source supported by Spark and then writing it to Redis.
The connector has the following system requirements:
- Spark 3.5.4 is recommended, but versions 3.1 to 3.5 should work as well.
- The Spark distribution must use Scala 2.12.
- Redis version 5 or higher.
- Java 17 or higher (if using Java to run Spark).
Getting Started
Java
Dependency Management
Provide the Spark SQL and Redis Spark Connector dependencies to your dependency management tool.
Gradle:

dependencies {
    implementation 'com.redis:redis-spark-connector:0.5.1'
    implementation 'org.apache.spark:spark-sql_2.12:3.5.4'
}

Maven:

<dependencies>
    <dependency>
        <groupId>com.redis</groupId>
        <artifactId>redis-spark-connector</artifactId>
        <version>0.5.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.4</version>
    </dependency>
</dependencies>
Spark Session Configuration
package com.redis.examples;

import org.apache.spark.sql.SparkSession;

public class RedisSparkExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local")
                .appName("RedisSparkExample")
                .config("spark.redis.read.connection.uri", "redis://localhost:6379")
                .config("spark.redis.write.connection.uri", "redis://localhost:6379")
                .getOrCreate();
    }
}
For Redis Spark Connector configuration details, see the Configuring Spark section.
Python
PySpark
This guide describes how to use the Redis Spark Connector with PySpark, but the connector also works with self-contained Python applications.

When starting pyspark you must use one of the following options to add the package to the classpath:

- --packages com.redis:redis-spark-connector:0.5.1 downloads the Redis Spark Connector package using the given Maven coordinates, or
- --jars path/to/redis-spark-connector-0.5.1.jar adds the downloaded Redis Spark Connector jar to the classpath.

You can specify --conf options to configure the connector:
pyspark --conf "spark.redis.read.connection.uri=redis://localhost:6379" \
--conf "spark.redis.write.connection.uri=redis://localhost:6379" \
--packages com.redis:redis-spark-connector:0.5.1
Python Application
Create a SparkSession
object using the same configuration options as before:
from pyspark.sql import SparkSession

spark_session = (
    SparkSession.builder
    .appName("myApp")
    .config("spark.redis.read.connection.uri", "redis://localhost:6379")
    .config("spark.redis.write.connection.uri", "redis://localhost:6379")
    .getOrCreate()
)
Scala
Spark Shell
When starting the Spark shell you must use one of the following options to add the package to the classpath:

- --packages com.redis:redis-spark-connector:0.5.1 downloads the Redis Spark Connector package using the given Maven coordinates, or
- --jars path/to/redis-spark-connector-0.5.1.jar adds the downloaded Redis Spark Connector jar to the classpath.

You can specify --conf options to configure the connector:
spark-shell --conf "spark.redis.read.connection.uri=redis://localhost:6379" \
--conf "spark.redis.write.connection.uri=redis://localhost:6379" \
--packages com.redis:redis-spark-connector:0.5.1
Scala Application
Dependency Management
Provide the Spark SQL and Redis Spark Connector dependencies to your dependency management tool.
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  "com.redis" % "redis-spark-connector" % "0.5.1",
  "org.apache.spark" %% "spark-sql" % "3.5.4"
)
Spark Session Configuration
package com.redis

import org.apache.spark.sql.SparkSession

object RedisSparkExample {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local")
      .appName("RedisSparkExample")
      .config("spark.redis.read.connection.uri", "redis://localhost:6379")
      .config("spark.redis.write.connection.uri", "redis://localhost:6379")
      .getOrCreate()
  }
}
Databricks
Spark and Redis together unlock powerful capabilities for data professionals. This section describes how to integrate these technologies for enhanced analytics, real-time processing, and machine learning applications.
Connecting to Redis
In Databricks, open your cluster settings and locate Advanced Options. Under Spark, in the Spark config text area, add your Redis connection strings:

spark.redis.read.connection.uri redis://…
spark.redis.write.connection.uri redis://…
Using Databricks Secret Management
If the Redis URI contains sensitive credentials, it is recommended to store it using Databricks secrets.
To configure secrets, refer to the Databricks documentation.
You can reference those secrets in your Spark cluster using the same Spark config options:
spark.redis.read.connection.uri {{secrets/redis/uri}}
spark.redis.write.connection.uri {{secrets/redis/uri}}
Use SSL/TLS to connect Databricks to Redis
To enable SSL connections to Redis, follow the instructions in the TLS section of the documentation.
You can provide the configurations described there as options. For example, you can specify the trusted certificates with the property redis.read.connection.ssl.cacert.
It is recommended that you:
- Store your certificates in cloud object storage. You can restrict access to the certificates to only those clusters that can access Redis. See Data governance with Unity Catalog.
- Store your passwords as secrets in a secret scope.
The following example uses object storage locations and Databricks secrets to enable an SSL connection:

df = spark.readStream \
    .format("redis") \
    .option("redis.read.connection.ssl.cacert", <trusted-certs>) \
    .option("redis.read.connection.ssl.cert", <public-key>) \
    .option("redis.read.connection.ssl.key", <private-key>) \
    .option("redis.read.connection.ssl.key.password", dbutils.secrets.get(scope=<scope-name>, key=<key-name>)) \
    .load()
Redis Spark Notebook
In this hands-on tutorial you’ll learn how to make efficient use of Redis data structures alongside Spark’s distributed computing framework. You’ll see firsthand how to extract data from Redis, process it in Spark, and write results back to Redis for application use.
Key topics include:
-
Setting up the Spark-Redis connector in Databricks
-
Reading data from Redis for application access
-
Writing data to Redis from Spark
You can edit and run this notebook by importing it into your Databricks account.
Select Import from any folder’s menu and paste this URL: https://github.com/redis-field-engineering/redis-spark-dist/raw/refs/heads/main/redis-spark-notebook.ipynb
Configuration
Connection Options
The following options apply to both read and write operations.
redis.<read|write>.connection.uri - Redis URI in the form redis://username:password@host:port. For secure (TLS) connections use the rediss:// scheme.
redis.<read|write>.connection.cluster - Set to true when connecting to a Redis Cluster.
TLS Connection Options
For secure (TLS) connections use rediss:// as the Redis URI scheme.

redis.<read|write>.connection.ssl.cacert - Certificate file from which to load trusted certificates. The file must provide X.509 certificates in PEM format.
redis.<read|write>.connection.ssl.cert - X.509 certificate chain in PEM format to use for client authentication.
redis.<read|write>.connection.ssl.key - PKCS#8 private key in PEM format to use for client authentication.
redis.<read|write>.connection.ssl.key.password - Password for the private key if it is password-protected.
Read Options
redis.read.type - Type of reader to use for reading data from Redis: KEYS or STREAM.
redis.read.schema - Specifies known fields to use when inferring the schema from Redis, in the form <field1> <type>, <field2> <type> where type is one of STRING, TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, DATE, TIMESTAMP.
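As an illustration of this schema-string format (a plain-Python sketch, not the connector's own parser), the field/type pairs can be split out like this:

```python
# Illustrative helper: split a redis.read.schema string such as
# "name STRING, age INT" into (field, type) pairs.
# Not the connector's implementation; for format illustration only.
def parse_schema(schema: str) -> list[tuple[str, str]]:
    pairs = []
    for entry in schema.split(","):
        field, type_name = entry.strip().rsplit(" ", 1)
        pairs.append((field, type_name))
    return pairs

print(parse_schema("name STRING, age INT, active BOOLEAN"))
```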
Redis Keys
With read type KEYS the connector iterates over keys using the Redis SCAN command and then fetches the corresponding values.
redis.read.keyPattern - Read keys matching the given glob-style pattern (default: *).
redis.read.keyType - Read keys matching the given type, e.g. string, hash, json (default: all types).
redis.read.threads - Number of reader threads to use in parallel (default: 1).
redis.read.batch - Number of keys for which each thread fetches values at a time in a pipelined call.
redis.read.pool - Maximum number of Redis connections to use across threads (default: 8).
redis.read.scanCount - Number of keys to read at once on each SCAN call.
redis.read.queueCapacity - Maximum number of values that the reader threads can queue up (default: 10000).
redis.read.readFrom - Which Redis cluster nodes to read from. See Lettuce ReadFrom.
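Putting these options together, a batch read of hashes might look like the following PySpark sketch. It assumes a SparkSession configured as in Getting Started and a Redis instance holding hash keys; the key pattern user:* and the thread count are illustrative values, not defaults.

```python
# Illustrative sketch: batch-read Redis hashes into a DataFrame.
# Assumes `spark` is a SparkSession configured with
# spark.redis.read.connection.uri; "user:*" is an example pattern.
df = (
    spark.read.format("redis")
    .option("redis.read.type", "KEYS")
    .option("redis.read.keyPattern", "user:*")
    .option("redis.read.keyType", "hash")
    .option("redis.read.threads", 4)
    .load()
)
df.show()
```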
Streaming
When using Spark streaming, the Redis Spark Connector supports both micro-batch processing and continuous processing.
In this mode the connector reads a change stream from Redis using keyspace notifications, in addition to the scan described previously.
redis.read.eventQueueCapacity - Capacity of the keyspace-notification queue (default: 10000).
redis.read.idleTimeout - Minimum idle duration in milliseconds after which the reader is considered complete.
redis.read.flushInterval - Maximum duration in milliseconds between flushes.
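A minimal streaming-read sketch for this mode follows. It assumes a configured SparkSession and a Redis server on which keyspace notifications can be used; writing to the console sink is purely for demonstration.

```python
# Illustrative sketch: stream key changes from Redis to the console.
# Assumes `spark` is a SparkSession configured with
# spark.redis.read.connection.uri.
stream_df = (
    spark.readStream.format("redis")
    .option("redis.read.type", "KEYS")
    .load()
)
query = stream_df.writeStream.format("console").start()
query.awaitTermination()
```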
Redis Stream
Use read type STREAM to read messages from a Redis stream.

redis.read.streamKey - Key of the Redis stream to read from.
Batch Mode
In batch mode the connector uses the Redis XRANGE command to read messages from the given stream.
redis.read.streamStart - ID to start reading from (default: -).
redis.read.streamEnd - Maximum ID to read (default: +).
redis.read.streamCount - Maximum number of messages to read.
Streaming Mode
In streaming mode the connector uses the Redis XREAD command to read messages from the given stream.
redis.read.offset - Initial message ID from which to read the stream. Defaults to reading from the beginning of the stream.
redis.read.streamBlock - Maximum duration in milliseconds that XREAD will wait for messages to become available.
redis.read.streamCount - Maximum number of messages to fetch in each XREAD call.
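For example, reading a Redis stream in streaming mode might look like this sketch (assumes a configured SparkSession; the stream key events is an illustrative name):

```python
# Illustrative sketch: stream messages from the Redis stream "events".
# Assumes `spark` is a SparkSession configured with
# spark.redis.read.connection.uri; "events" is an example key.
events = (
    spark.readStream.format("redis")
    .option("redis.read.type", "STREAM")
    .option("redis.read.streamKey", "events")
    .load()
)
```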
Write Options
redis.write.type - Redis data-structure type to write to: hash, json, string, or stream.
redis.write.keyspace - Prefix for keys written to Redis (default: spark).
redis.write.key - Field or list of fields used to compose keys written to Redis (default: no keys). Separate with a comma to specify more than one field, e.g. field1,field2.

For types other than stream you must specify a key; otherwise all writes go to the same single key, equal to the keyspace.
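For example, writing a DataFrame as Redis hashes keyed by an id column might look like the following sketch. It assumes a DataFrame df with an id field; the keyspace user and key field id are illustrative values.

```python
# Illustrative sketch: write df rows as Redis hashes under keys composed
# from the "user" keyspace and the "id" field. Assumes df has an "id"
# column; "user" and "id" are example values.
(
    df.write.format("redis")
    .option("redis.write.type", "hash")
    .option("redis.write.keyspace", "user")
    .option("redis.write.key", "id")
    .save()
)
```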
Support
Redis Spark Connector is supported by Redis, Inc. for enterprise-tier customers as a 'Developer Tool' under the Redis Software Support Policy. For non-enterprise-tier customers we provide support for Redis Spark Connector on a good-faith basis. To report bugs, request features, or get assistance, please file an issue.
License
Redis Spark Connector is licensed under the Business Source License 1.1.
Copyright © 2024 Redis, Inc.
See LICENSE for details.