Using Azurite blob storage emulator with Spark

Sometimes it’s handy to be able to test Apache Spark development locally. This might include code that uses cloud storage such as WASB (Windows Azure Storage Blob).

The steps below describe how to test WASB access locally, without needing an Azure account, using the Azurite Storage Emulator.

Steps

  1. Prerequisites
    • Download and extract Apache Spark (spark-3.1.2-bin-hadoop3.2.tgz)
    • Download and install Docker
    • Start the Docker service – e.g. on Linux:
      sudo service docker start
    • (Optionally) Download and install Azure Storage Explorer
  2. Create a new directory and start the Azurite Storage Emulator Docker container – e.g.:

    mkdir ~/blob

    docker run -p 10000:10000 -p 10001:10001 -v /home/david/blob/:/data mcr.microsoft.com/azure-storage/azurite

    NB – in the above example, data will be persisted to the local Linux directory /home/david/blob.
  3. Upload files with Storage Explorer:

    Connect Storage Explorer to the Local Storage emulator (keep defaults when adding the connection):





    Upload a sample file – e.g. iris.csv to the “data” container (a scripted alternative using the Python SDK is sketched after these steps):

  4. Start Spark using the --packages option to include the libraries needed to access Blob storage. The Maven coordinates shown here are for the latest hadoop-azure release at the time of writing:

    cd ~/spark/spark-3.1.2-bin-hadoop3.2/bin

    ./pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1

    The PySpark shell should start as normal after downloading hadoop-azure and its dependencies.

    Troubleshooting:
    The following stack trace indicates the hadoop-azure driver or dependencies were not loaded successfully:
    py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2595)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3269)
        ...
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2593)
        ... 25 more

    Ensure the --packages option is set correctly when invoking pyspark as shown above.
  5. Query the data using the emulated Blob storage location from the PySpark shell:

    df=spark.read.format("csv").option("header",True).load("wasb://data@storageemulator/iris.csv")

    df.show()


    Notes:
    data – the container where the sample file was uploaded earlier
    @storageemulator – a fixed string that tells the WASB connector to point at the local emulator rather than a real storage account

    Example output: df.show() should print the first rows of iris.csv, confirming the read from the emulated Blob storage worked. (A standalone-script version of steps 4 and 5 is also sketched below.)
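
As an alternative to uploading the sample file with Storage Explorer in step 3, the upload can be scripted with the azure-storage-blob Python package. The sketch below is illustrative only: it assumes azure-storage-blob is installed (pip install azure-storage-blob), that Azurite is running on its default blob port 10000 with the well-known development account devstoreaccount1, and that iris.csv is in the current directory.

    # Illustrative sketch – upload iris.csv to the "data" container in Azurite
    from azure.storage.blob import BlobServiceClient

    # Well-known Azurite / storage emulator development connection string
    conn_str = (
        "DefaultEndpointsProtocol=http;"
        "AccountName=devstoreaccount1;"
        "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;"
        "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
    )

    service = BlobServiceClient.from_connection_string(conn_str)

    # Create the "data" container if it does not already exist
    try:
        service.create_container("data")
    except Exception:
        pass  # container already exists

    # Upload the local iris.csv into the "data" container
    with open("iris.csv", "rb") as f:
        service.get_blob_client(container="data", blob="iris.csv").upload_blob(f, overwrite=True)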

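The shell session in steps 4 and 5 can also be run as a standalone PySpark script, pulling in hadoop-azure via the spark.jars.packages setting instead of the --packages option (it must be set before the session is created). This is a minimal sketch under the same assumptions as above: Azurite running locally and iris.csv uploaded to the “data” container.

    # Minimal sketch – standalone equivalent of steps 4 and 5
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("azurite-wasb-test")
        # Same effect as ./pyspark --packages org.apache.hadoop:hadoop-azure:3.3.1
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.1")
        .getOrCreate()
    )

    # "data" is the container; @storageemulator routes the WASB connector to Azurite
    df = (
        spark.read.format("csv")
        .option("header", True)
        .load("wasb://data@storageemulator/iris.csv")
    )
    df.show()

    spark.stop()
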
Conclusion

Local storage emulation allows WASB locations to be tested without the need to connect to a remote Azure subscription or storage account.


2 thoughts on “Using Azurite blob storage emulator with Spark”

  1. Hi

    I was looking for this kind of local setup for Spark + Azure Blob Storage Emulator. Thanks for the post.

    When trying it out, I’m getting an error when I use the storage account “storageemulator”:

    org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: An unknown failure occurred : Connection refused (Connection refused)

    When I try to use the storage account “devstorageaccount1”, I get the error below:

    org.apache.hadoop.fs.azure.AzureException: org.apache.hadoop.fs.azure.AzureException: Unable to access container test-container in account devstorageaccount1 using anonymous credentials, and no credentials found for them in the configuration.

    Below is my code, run in Jupyter:

    import findspark
    findspark.init()
    import pyspark
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    conf = pyspark.SparkConf().setAppName('appName').setMaster('local')
    conf.set("http://127.0.0.1:10000", "?sv=2018-03-28&st=2022-03-21T13%3A01%3A33Z&se=2022-04-22T13%3A01%3A00Z&sr=c&sp=rl&sig=DbknL5NtzqZtw6K9q1YIKombqVae7avBBDRJu9GbsK0%3D")
    sc = pyspark.SparkContext(conf=conf)
    spark = SparkSession(sc)
    spark

    session = spark.builder.getOrCreate()

    df = spark.read.json("wasb://live-test-container@devstorageaccount1/*.*")

    Running the line just above throws the error.

    Can you please guide me to successfully access the local storage emulator?


    1. Assuming the storage emulator is running, you can try changing the read call to:
      df = spark.read.json("wasb://live-test-container@storageemulator/*.*")

      This should point to the same objects that the emulator reports as stored in devstorageaccount1. I found it is not necessary to provide credentials to Spark when pointing at the storage emulator. A fuller sketch of the corrected notebook setup is below.
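
      For reference, this is a minimal, illustrative sketch of the corrected notebook cell. It assumes Azurite is running locally, the container live-test-container exists and holds some JSON files, and hadoop-azure is pulled in via spark.jars.packages; no SAS token or account key is set.

      # Illustrative sketch – corrected notebook setup against the local emulator
      import findspark
      findspark.init()

      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .master("local")
          .appName("appName")
          # Pull in the WASB driver, equivalent to pyspark --packages
          .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.1")
          .getOrCreate()
      )

      # @storageemulator (not the account name) routes the WASB connector to Azurite
      df = spark.read.json("wasb://live-test-container@storageemulator/*.*")
      df.show()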

