Read & Write Avro files using Spark SQL

READ AND WRITE — Avro, Parquet, ORC, CSV, JSON, Hive tables…
Here I cover the Spark SQL APIs for reading and writing data to and from HDFS and the local file system.
Sample data is available here. [Avro, Parquet, ORC, CSV, JSON]
The Avro file format is integrated with Spark SQL out of the box in Spark 2.4.x and later, but for earlier versions (< 2.4.0) the configuration is slightly different. [Reference: https://github.com/databricks/spark-avro]
Command:
Spark version: 2.3.0
Python version: 3.6.8
Scala version: 2.11
$ pyspark2

$ spark-shell

Configuration to make READ/WRITE APIs available for the Avro data source
To read Avro files from a data source, the spark-avro jar (com.databricks:spark-avro_2.11:4.0.0) must be available in the Spark configuration.
Spark and Avro compatibility matrix

Here are two different methods to make the Avro format available through the Spark SQL APIs.

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# METHOD 1 — set the package through a SparkConf object
conf = SparkConf()
conf.set("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
spark = SparkSession.builder.appName("AVRO-Exercises").master("yarn"). \
    config(conf=conf). \
    getOrCreate()

# METHOD 2 — pass the package directly to the builder
spark = SparkSession.builder.appName("AVRO-Exercises").master("yarn"). \
    config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0"). \
    getOrCreate()

AVRO — READ AND WRITE DATA

PARQUET — READ AND WRITE DATA

ORC — READ AND WRITE DATA

CSV — READ AND WRITE DATA

JSON — READ AND WRITE DATA

HIVE — READ AND WRITE DATA

Jupyter Notebook file: Source code is available here.
Please Clap!! 👏 See you all in my next blog. Follow me to get more updates about data engineering.