CCA Spark and Hadoop Developer Mock Exam Practice (CCA175)
Exam questions similar to those on the CCA175 exam.

As per the CCA175 syllabus, the following topics are covered in Parts 1 and 2:
Part 1, Read and Write Data APIs:
- Load data from HDFS for use in Spark applications
- Write the results back into HDFS using Spark
- Read and write files in a variety of file formats
- Perform standard extract, transform, load (ETL) processes on data using the Spark API
Part 2, Data Analysis:
- Understand the fundamentals of querying datasets in Spark
- Filter data using Spark
- Write queries that calculate aggregate statistics
- Join disparate datasets using Spark
- Produce ranked or sorted data
- partitionBy and bucketBy
These concepts are covered in this blog; here are the sample questions and their solutions.
Here we use the Water Heater dataset, which has information such as the application submission date, processing status, location/address, etc.
1. How many Water Heaters were issued in the year 2015?
2. Find the first 3 applicants of the Solar Water Heater plant for each year.
3. Convert column data types and use Spark functions like datediff, to_date, etc.
4. Save the results partitioned by year, with gzip compression, in Parquet format.
5. Save the data partitioned by year into a Spark metastore table. Also bucket the data into 5 buckets, since the status column has five distinct values.
To start with the exam practice, the very first requirement is an environment with Spark and the Cloudera Hadoop platform.
Environment Setup steps for the practice:
Please visit these blogs; with them, we will set up the Cloudera + Spark lab environment on our local system.
Cloudera installation using Docker — click here
Upgrade Cloudera Manager + CDH if required — click here
Spark 2.x installation — click here
Now, once you have the lab up and running, let's start exploring Spark applications and operations.
Create a SparkSession and SparkContext with application-specific configuration,
e.g. app name, master, Spark UI port, etc.
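A minimal sketch; the app name, master, and UI port below are illustrative values, not fixed exam settings:

```python
from pyspark.sql import SparkSession

# Build a SparkSession with application-specific configuration.
# The app name, master, and UI port are illustrative values.
spark = SparkSession.builder \
    .appName("CCA175-Practice") \
    .master("local[*]") \
    .config("spark.ui.port", "4041") \
    .getOrCreate()

# The SparkContext is available from the session.
sc = spark.sparkContext
```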


Read JSON-format data into a Spark DataFrame.
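A sketch, assuming the dataset has been copied to a placeholder HDFS path:

```python
# Read the JSON-format Water Heater data into a DataFrame.
# The path is a placeholder for wherever the dataset lives in HDFS.
df = spark.read.json("/user/cloudera/water_heater/")
```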

Print the schema of the DataFrame.
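For example:

```python
# Print the column names and inferred data types of the DataFrame.
df.printSchema()
```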

Print the first 5 records, and understand the use of the truncate parameter.
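For example:

```python
# Show the first 5 records; truncate=False prints full column values
# instead of cutting them off at 20 characters.
df.show(5, truncate=False)
```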

Print the column names.
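For example:

```python
# Print the list of column names.
print(df.columns)
```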

Drop a column from the DataFrame.
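A sketch; the column name here is illustrative:

```python
# Drop a column; drop() returns a new DataFrame without it.
# "LOCATION" is an illustrative column name.
df = df.drop("LOCATION")
```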

Print the number of records.
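For example:

```python
# Print the total number of records.
print(df.count())
```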

Print the number of unique values per column. This helps to understand the given dataset.
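One way to do this, counting the distinct values of every column in a single pass:

```python
from pyspark.sql import functions as F

# Number of distinct values in each column of the DataFrame.
df.select([F.countDistinct(c).alias(c) for c in df.columns]).show()
```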

How many types of STATUS values are there?
GroupBy + aggregation + rename aggregated column
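A sketch of the groupBy/aggregate/rename pattern on the STATUS column:

```python
from pyspark.sql import functions as F

# Count records per STATUS value and rename the aggregated column.
df.groupBy("STATUS") \
  .agg(F.count("*").alias("status_count")) \
  .show()
```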

Filter the DataFrame based on column values.
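A sketch; the STATUS value used in the filter is illustrative:

```python
# Keep only the rows whose STATUS matches a given value.
df.filter(df.STATUS == "ISSUED").show(5)
```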

Calculate the processing time for each application.
Spark Functions + OrderBy
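A sketch, assuming APPLICATION_DATE and ISSUED_DATE can be parsed by to_date (adjust the date format if the raw strings differ):

```python
from pyspark.sql import functions as F

# Processing time = days between the application date and the issue date.
df_days = df.withColumn(
    "processing_days",
    F.datediff(F.to_date("ISSUED_DATE"), F.to_date("APPLICATION_DATE"))
).orderBy(F.desc("processing_days"))

df_days.show(5)
```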

How to access the PySpark API docs in the Jupyter notebook.
Useful for converting column data types and for many other functions …
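For example, the built-in help works inside the notebook:

```python
from pyspark.sql import functions as F

# Show the docstring for a function, e.g. to_date, directly in the notebook.
help(F.to_date)
# Jupyter's "?" suffix (e.g. F.datediff?) shows the same information.
```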

How many Water Heaters were issued per year?
GroupBy + Spark functions + aggregation + rename aggregated column + orderBy
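A sketch, assuming ISSUED_DATE marks when a water heater was issued:

```python
from pyspark.sql import functions as F

# Count of water heaters issued per year.
df.filter(F.col("ISSUED_DATE").isNotNull()) \
  .groupBy(F.year(F.to_date("ISSUED_DATE")).alias("issue_year")) \
  .agg(F.count("*").alias("issued_count")) \
  .orderBy("issue_year") \
  .show()
```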

Join operation (a sketch follows these steps):
-> Filter the applications for which processing has at least started
-> Generate a processing_days column from APPLICATION_DATE and ISSUED_DATE, and store the result in a df_procession_days DataFrame
-> Join the df_procession_days DataFrame with the original DataFrame using an inner join, so that the processing_days column is available alongside the original columns
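A sketch of these steps; APPLICATION_ID is an assumed unique key column used to join back to the original DataFrame:

```python
from pyspark.sql import functions as F

# Applications whose processing has at least started (ISSUED_DATE present),
# with a derived processing_days column.
df_procession_days = df.filter(F.col("ISSUED_DATE").isNotNull()) \
    .withColumn("processing_days",
                F.datediff(F.to_date("ISSUED_DATE"), F.to_date("APPLICATION_DATE"))) \
    .select("APPLICATION_ID", "processing_days")  # APPLICATION_ID is an assumed key

# Inner join so processing_days is available alongside the original columns.
df_joined = df.join(df_procession_days, on="APPLICATION_ID", how="inner")
df_joined.show(5)
```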

Find the first 3 applicants of the Solar Water Heater plant for each year.
Window functions + basic Spark functions
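A sketch using row_number over a window partitioned by application year; a filter on the plant/heater-type column (not shown, since its name is dataset-specific) would narrow this to Solar Water Heaters:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# First 3 applicants per year, ordered by application date within each year.
w = Window.partitionBy("app_year").orderBy("APPLICATION_DATE")

df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .withColumn("rn", F.row_number().over(w)) \
  .filter(F.col("rn") <= 3) \
  .show()
```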

Distinct function: to find the unique values in a column.
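For example:

```python
# Unique values of a single column, e.g. STATUS.
df.select("STATUS").distinct().show()
```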

Save the DataFrame with a specific storage format, compression, and data-warehouse principles like partitioning.
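A sketch; the output path is a placeholder and app_year is derived from APPLICATION_DATE:

```python
from pyspark.sql import functions as F

# Write as Parquet with gzip compression, partitioned by year.
df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .write \
  .partitionBy("app_year") \
  .option("compression", "gzip") \
  .parquet("/user/cloudera/water_heater_parquet/")
```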

Read the saved, partitioned data from HDFS, local, or any other storage system using the PySpark read APIs.
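For example, reading back the placeholder path used above:

```python
# Read the partitioned Parquet data back into a DataFrame.
df_read = spark.read.parquet("/user/cloudera/water_heater_parquet/")
df_read.printSchema()
```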

When you want to read a specific partition from the entire dataset.
Note: as the data is partitioned by year and we are reading only the year-2009 partition, printSchema doesn't show the column by which the partitions were created. (For more clarification, compare the result with the output above.)
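A sketch of pointing the reader at a single partition directory:

```python
# Read only the year-2009 partition. Because the partition value is encoded
# in the directory name, app_year does not appear in the resulting schema.
df_2009 = spark.read.parquet("/user/cloudera/water_heater_parquet/app_year=2009/")
df_2009.printSchema()
```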

Save the DataFrame with a specific storage format, mode, compression, and data-warehouse principles like partitioning.
format + mode + partitionBy + compression
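The same write as above, sketched with an explicit format and save mode:

```python
from pyspark.sql import functions as F

# Explicit format, overwrite mode, year partitioning, and gzip compression.
df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .write \
  .format("parquet") \
  .mode("overwrite") \
  .partitionBy("app_year") \
  .option("compression", "gzip") \
  .save("/user/cloudera/water_heater_parquet/")
```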

bucketBy | number of buckets
At this time, this feature is only supported when we write the data as a table in the Spark metastore or Hive metastore.
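A sketch; the table name is illustrative, and the 5 buckets on STATUS follow question 5 above:

```python
from pyspark.sql import functions as F

# bucketBy is only supported together with saveAsTable (Spark/Hive metastore).
df.withColumn("app_year", F.year(F.to_date("APPLICATION_DATE"))) \
  .write \
  .partitionBy("app_year") \
  .bucketBy(5, "STATUS") \
  .saveAsTable("water_heater_bucketed")
```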

Good luck with your exam, my dear friend! Just believe in yourself and give your best effort in the exam arena.
See you all in my next blog. Follow Clairvoyant to get more updates about data engineering.