Read & Write Avro files using Spark SQL

READ AND WRITE — Avro, Parquet, ORC, CSV, JSON, Hive tables…
Here I cover the Spark SQL APIs for reading and writing data to and from HDFS and the local file system.
Sample data is available here. [Avro, Parquet, ORC, CSV, JSON]
The Avro file format is integrated with Spark SQL out of the box in Spark 2.4.x and later, but for earlier versions (< 2.4.0) the configuration is slightly different. [Reference: https://github.com/databricks/spark-avro]
Command:
Spark version: 2.3.0
Python version: 3.6.8
Scala version: 2.11
$ pyspark2

$ spark-shell

Configuration to make READ/WRITE APIs available for the Avro data source
To read Avro files from a data source, the spark-avro jar (com.databricks:spark-avro_2.11:4.0.0) must be available in the Spark configuration.
Spark and Avro compatibility matrix

Here are two different methods to make the Avro format available through the Spark SQL APIs.

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

# METHOD 1 — set the package through a SparkConf object
conf = SparkConf()
conf.set("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
spark = SparkSession.builder.appName("AVRO-Exercises").master("yarn"). \
    config(conf=conf). \
    getOrCreate()

# METHOD 2 — pass the package directly to the builder
spark = SparkSession.builder.appName("AVRO-Exercises").master("yarn"). \
    config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0"). \
    getOrCreate()

AVRO — READ AND WRITE DATA

PARQUET — READ AND WRITE DATA

ORC — READ AND WRITE DATA

CSV — READ AND WRITE DATA

JSON — READ AND WRITE DATA

HIVE — READ AND WRITE DATA

Jupyter Notebook file: Source code is available here.
Please Clap!! 👏 See you all in my next blog. Follow me to get more updates about data engineering.