Sometimes it’s handy to be able to test Apache Spark developments locally. This might include testing cloud storage such as WASB (Windows Azure Storage Blob).
The steps below describe how to test WASB locally, without the need for an Azure account, using the Azurite Storage Emulator.
Steps
- Prerequisites
- Download and extract Apache Spark (spark-3.1.2-bin-hadoop3.2.tgz)
- Download and install Docker
- Start the Docker service – e.g. on Linux:
sudo service docker start
- (Optional) Download and install Azure Storage Explorer
- Create a new directory and start the Azurite Storage Emulator Docker container – e.g.:
mkdir ~/blob
docker run -p 10000:10000 -p 10001:10001 -v /home/david/blob/:/data mcr.microsoft.com/azure-storage/azurite
NB – in the above example, data will be persisted to the local Linux directory /home/david/blob.
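Before moving on, it can be worth confirming the emulator is actually listening. A minimal sanity check (an addition to the original steps; it assumes the default Azurite blob port of 10000):
# Confirm the Azurite blob endpoint is reachable before uploading or reading data
import socket

with socket.create_connection(("127.0.0.1", 10000), timeout=5):
    print("Azurite blob endpoint is reachable on port 10000")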
- Upload files with Storage Explorer:
Connect Storage Explorer to the Local Storage emulator (keep the defaults when adding the connection):
Upload a sample file – e.g. iris.csv to the “data” container:
- Start Spark using the packages option to include the libraries needed to access Blob storage. The Maven coordinates shown here are for the latest hadoop-azure package:
cd ~/spark/spark-3.1.2-bin-hadoop3.2/bin
./pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1
The PySpark shell should start as per normal after downloading hadoop-azure and its dependencies.
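If you would rather create the session from a script or notebook than use the pyspark shell, the same package can be supplied via spark.jars.packages. A minimal sketch (the app name and local[*] master are assumptions, not part of the original walkthrough):
# Equivalent session start-up for a standalone script or notebook;
# spark.jars.packages downloads hadoop-azure and its dependencies at start-up
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("wasb-emulator-test")  # hypothetical app name
    .master("local[*]")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.1")
    .getOrCreate()
)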
Troubleshooting:
The following stack trace indicates the hadoop-azure driver or dependencies were not loaded successfully:
...
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
...
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
... 25 more
...
Ensure the “packages” option is correctly set when invoking pyspark above.
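One quick (if unofficial) way to check whether the class is visible to the driver from the PySpark shell is to ask the JVM for it directly; this check is an assumption of mine rather than part of the original steps:
# Raises a Py4JJavaError wrapping ClassNotFoundException if hadoop-azure is missing from the classpath
spark.sparkContext._jvm.java.lang.Class.forName("org.apache.hadoop.fs.azure.NativeAzureFileSystem")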
- Query the data using the emulated Blob storage location from the PySpark shell:
df = spark.read.format("csv").option("header", True).load("wasb://data@storageemulator/iris.csv")
df.show()
Notes:
data – the container to which the sample file was uploaded earlier
@storageemulator – a fixed string that tells the WASB connector to point to the local emulator
Example output: the contents of iris.csv are displayed as a Spark DataFrame.
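The emulated location can be written to as well as read. For example, a sketch writing the DataFrame back to the same container as Parquet (the output path iris_parquet is hypothetical):
# Write the DataFrame back to the emulated "data" container in Parquet format
df.write.mode("overwrite").parquet("wasb://data@storageemulator/iris_parquet")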
Conclusion
Local storage emulation allows testing of WASB locations without the need to connect to a remote Azure subscription or storage account.
Hi
I was looking for this kind of local setup for Spark + Azure Blob Storage Emulator. Thanks for the post.
When trying it out, I’m getting an error when I use the storage account “storageemulator”:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: An unknown failure occurred : Connection refused (Connection refused)
When I try to use the storage account “devstorageaccount1”, I get the error below:
org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: Unable to access container test-container in account devstorageaccount1 using anonymous credentials, and no credentials found for them in the configuration.
Below is my code, run in Jupyter:
import findspark
findspark.init()
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = pyspark.SparkConf().setAppName('appName').setMaster('local')
conf.set("http://127.0.0.1:10000", "?sv=2018-03-28&st=2022-03-21T13%3A01%3A33Z&se=2022-04-22T13%3A01%3A00Z&sr=c&sp=rl&sig=DbknL5NtzqZtw6K9q1YIKombqVae7avBBDRJu9GbsK0%3D")
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)
spark
session = spark.builder.getOrCreate()
df = spark.read.json("wasb://live-test-container@devstorageaccount1/*.*")
Running the line just above throws the error.
Can you please guide me to successfully access the local storage emulator?
Assuming the storage emulator is running, you can try changing the read call to:
df = spark.read.json("wasb://live-test-container@storageemulator/*.*")
This should point to the same objects that the emulator reports as stored in devstorageaccount1. I found it is not necessary to provide credentials to Spark when pointing to the storage emulator.
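For reference, a minimal version of the notebook code without the credential configuration might look like the following (it assumes the hadoop-azure package is supplied via spark.jars.packages rather than already being on the classpath):
# Minimal sketch: no storage credentials are needed when targeting the local emulator
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("appName")
    .master("local")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.1")
    .getOrCreate()
)

df = spark.read.json("wasb://live-test-container@storageemulator/*.*")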