CCA Spark and Hadoop Developer Mock Exam Practice (CCA175)
Exam questions similar to those on the CCA175 exam.

As per the CCA175 syllabus, the following topics are covered in Parts 1 and 2:
Part 1, Read and Write Data APIs:
- Load data from HDFS for use in Spark applications
- Write the results back into HDFS using Spark
- Read and write files in a variety of file formats
- Perform standard extract, transform, load (ETL) processes on data using the Spark API
Part 2, Data Analysis:
- Understand the fundamentals of querying datasets in Spark
- Filter data using Spark
- Write queries that calculate aggregate statistics
- Join disparate datasets using Spark
- Produce ranked or sorted data
- partitionBy and bucketBy
These concepts are covered in this blog; here are the sample questions and their solutions.
Here we use the Water Heater dataset, which has information such as the application submission date, processing status, location/address, etc.
1. How many Water Heaters were issued in the year 2015?
2. Find the first 3 applicants of the Solar Water Heater plant for each year.
3. Convert column data types and use Spark functions like datediff, to_date, etc.
4. Save the results partitioned by year, with gzip compression, in Parquet format.
5. Save the data partitioned by year into a Spark metastore table. Also bucket the data into 5 buckets, since the status column has five distinct values.
To start with the exam practice, the very first requirement is an environment with Spark and the Cloudera Hadoop platform.
Environment Setup steps for the practice:
Please visit these blogs; with them, we will set up the Cloudera + Spark lab environment on our local system.
Cloudera installation using Docker — click here
Upgrade Cloudera Manager + CDH if required — click here
Spark 2.x installation — click here
Now, once you have the lab up and running, let's start exploring Spark applications and operations.
Create a SparkSession and SparkContext with application-specific configuration,
e.g. app name, master, Spark UI port, etc.
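A minimal sketch; the app name, master, and UI port below are illustrative values, not fixed exam settings:

```python
from pyspark.sql import SparkSession

# Build a SparkSession with application-specific configuration.
# The app name, master, and UI port are illustrative values.
spark = SparkSession.builder \
    .appName("CCA175-Practice") \
    .master("local[*]") \
    .config("spark.ui.port", "4041") \
    .getOrCreate()

# The SparkContext is available from the session.
sc = spark.sparkContext
```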


Read JSON-format data into a Spark DataFrame.
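A sketch, assuming the dataset has been copied to a placeholder HDFS path:

```python
# Read the JSON-format Water Heater data into a DataFrame.
# The path is a placeholder for wherever the dataset lives in HDFS.
df = spark.read.json("/user/cloudera/water_heater/")
```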

Print the schema of the DataFrame.
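For example:

```python
# Print the column names and inferred data types of the DataFrame.
df.printSchema()
```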

Print the first 5 records, and understand the use of the truncate parameter.
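For example:

```python
# Show the first 5 records; truncate=False prints full column values
# instead of cutting them off at 20 characters.
df.show(5, truncate=False)
```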

Print the column names.
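For example:

```python
# Print the list of column names.
print(df.columns)
```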

Drop a column from the DataFrame.
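A sketch; the column name here is illustrative:

```python
# Drop a column; drop() returns a new DataFrame without it.
# "LOCATION" is an illustrative column name.
df = df.drop("LOCATION")
```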

Print the number of records.
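For example:

```python
# Print the total number of records.
print(df.count())
```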

Print the number of unique values per column. This helps to understand the given dataset.
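One way to do this, counting the distinct values of every column in a single pass:

```python
from pyspark.sql import functions as F

# Number of distinct values in each column of the DataFrame.
df.select([F.countDistinct(c).alias(c) for c in df.columns]).show()
```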

How many types of STATUS values are there?
GroupBy + aggregation + rename aggregated column
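A sketch of the groupBy/aggregate/rename pattern on the STATUS column:

```python
from pyspark.sql import functions as F

# Count records per STATUS value and rename the aggregated column.
df.groupBy("STATUS") \
  .agg(F.count("*").alias("status_count")) \
  .show()
```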

Filter the DataFrame based on column values.
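A sketch; the STATUS value used in the filter is illustrative:

```python
# Keep only the rows whose STATUS matches a given value.
df.filter(df.STATUS == "ISSUED").show(5)
```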

Calculate the processing time for each application.
Spark Functions + OrderBy
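A sketch, assuming APPLICATION_DATE and ISSUED_DATE can be parsed by to_date (adjust the date format if the raw strings differ):

```python
from pyspark.sql import functions as F

# Processing time = days between the application date and the issue date.
df_days = df.withColumn(
    "processing_days",
    F.datediff(F.to_date("ISSUED_DATE"), F.to_date("APPLICATION_DATE"))
).orderBy(F.desc("processing_days"))

df_days.show(5)
```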

How to access the PySpark API docs in the Jupyter notebook.
Useful for converting column data types and for many other functions …
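For example, the built-in help works inside the notebook:

```python
from pyspark.sql import functions as F

# Show the docstring for a function, e.g. to_date, directly in the notebook.
help(F.to_date)
# Jupyter's "?" suffix (e.g. F.datediff?) shows the same information.
```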

How many Water Heaters were issued per year?
GroupBy + Spark functions + aggregation + rename aggregated column + orderBy
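A sketch, assuming ISSUED_DATE marks when a water heater was issued:

```python
from pyspark.sql import functions as F

# Count of water heaters issued per year.
df.filter(F.col("ISSUED_DATE").isNotNull()) \
  .groupBy(F.year(F.to_date("ISSUED_DATE")).alias("issue_year")) \
  .agg(F.count("*").alias("issued_count")) \
  .orderBy("issue_year") \
  .show()
```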

Join operation (a sketch follows these steps):
-> Filter the applications for which processing has at least started
-> Generate a processing_days column from APPLICATION_DATE and ISSUED_DATE, and store the result in a df_procession_days DataFrame
-> Join the df_procession_days DataFrame with the original DataFrame using an inner join, so that the processing_days column is available alongside the original columns
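A sketch of these steps; APPLICATION_ID is an assumed unique key column used to join back to the original DataFrame:

```python
from pyspark.sql import functions as F

# Applications whose processing has at least started (ISSUED_DATE present),
# with a derived processing_days column.
df_procession_days = df.filter(F.col("ISSUED_DATE").isNotNull()) \
    .withColumn("processing_days",
                F.datediff(F.to_date("ISSUED_DATE"), F.to_date("APPLICATION_DATE"))) \
    .select("APPLICATION_ID", "processing_days")  # APPLICATION_ID is an assumed key

# Inner join so processing_days is available alongside the original columns.
df_joined = df.join(df_procession_days, on="APPLICATION_ID", how="inner")
df_joined.show(5)
```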

Find the first 3 applicants of the Solar Water Heater plant for each year.
Window functions + basic Spark functions
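A sketch using row_number over a window partitioned by application year; a filter on the plant/heater-type column (not shown, since its name is dataset-specific) would narrow this to Solar Water Heaters:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# First 3 applicants per year, ordered by application date within each year.
w = Window.partitionBy("app_year").orderBy("APPLICATION_DATE")

df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .withColumn("rn", F.row_number().over(w)) \
  .filter(F.col("rn") <= 3) \
  .show()
```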

Distinct function: to find the unique values in a column.
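For example:

```python
# Unique values of a single column, e.g. STATUS.
df.select("STATUS").distinct().show()
```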

Save the DataFrame with a specific storage format, compression, and data-warehouse principles like partitioning.
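A sketch; the output path is a placeholder and app_year is derived from APPLICATION_DATE:

```python
from pyspark.sql import functions as F

# Write as Parquet with gzip compression, partitioned by year.
df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .write \
  .partitionBy("app_year") \
  .option("compression", "gzip") \
  .parquet("/user/cloudera/water_heater_parquet/")
```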

Read the saved, partitioned data from HDFS, local, or any other storage system using the PySpark read APIs.
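For example, reading back the placeholder path used above:

```python
# Read the partitioned Parquet data back into a DataFrame.
df_read = spark.read.parquet("/user/cloudera/water_heater_parquet/")
df_read.printSchema()
```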

When you want to read a specific partition from the entire dataset.
Note: as the data is partitioned by year and we are reading only the year-2009 partition, printSchema doesn't show the column by which the partitions were created. (For more clarification, compare the result with the output above.)
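A sketch of pointing the reader at a single partition directory:

```python
# Read only the year-2009 partition. Because the partition value is encoded
# in the directory name, app_year does not appear in the resulting schema.
df_2009 = spark.read.parquet("/user/cloudera/water_heater_parquet/app_year=2009/")
df_2009.printSchema()
```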

Save the DataFrame with a specific storage format, mode, compression, and data-warehouse principles like partitioning.
format + mode + partitionBy + compression
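The same write as above, sketched with an explicit format and save mode:

```python
from pyspark.sql import functions as F

# Explicit format, overwrite mode, year partitioning, and gzip compression.
df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .write \
  .format("parquet") \
  .mode("overwrite") \
  .partitionBy("app_year") \
  .option("compression", "gzip") \
  .save("/user/cloudera/water_heater_parquet/")
```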

bucketBy | number of buckets
At this time, this feature is only supported when we write the data as a table in the Spark metastore or Hive metastore.
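A sketch; the table name is illustrative, and the 5 buckets on STATUS follow question 5 above:

```python
from pyspark.sql import functions as F

# bucketBy is only supported together with saveAsTable (Spark/Hive metastore).
df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .write \
  .partitionBy("app_year") \
  .bucketBy(5, "STATUS") \
  .saveAsTable("water_heater_bucketed")
```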

Good luck with your exam, my dear friend! Just believe in yourself and give your best effort in the exam arena.
See you all in my next blog. Follow Clairvoyant to get more updates about data engineering.