Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Databricks Certified Associate Developer for Apache Spark 3.0 Exam Questions and Answers

Questions 4

The code block displayed below contains an error. The code block should configure Spark to split data in 20 parts when exchanging data between executors for joins or aggregations. Find the error.

Code block:

spark.conf.set(spark.sql.shuffle.partitions, 20)

Options:

The code block uses the wrong command for setting an option.

The code block sets the wrong option.

The code block expresses the option incorrectly.

The code block sets the incorrect number of parts.

The code block is missing a parameter.


Questions 5

Which of the following describes the conversion of a computational query into an execution plan in Spark?

Options:

Spark uses the catalog to resolve the optimized logical plan.

The catalog assigns specific resources to the optimized memory plan.

The executed physical plan depends on a cost optimization from a previous stage.

Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

The catalog assigns specific resources to the physical plan.


Questions 6

Which of the following statements about executors is correct?

Options:

Executors are launched by the driver.

Executors stop upon application completion by default.

Each node hosts a single executor.

Executors store data in memory only.

An executor can serve multiple applications.


Questions 7

The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows for rows in DataFrame itemsDf in which the column attributes contains the element cozy.

A sample of DataFrame itemsDf is below.

Code block:

itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))

Options:

1. filter
2. array_contains("cozy")
3. select
4. "itemId"
5. explode
6. "attributes"

1. where
2. "array_contains(attributes, 'cozy')"
3. select
4. itemId
5. explode
6. attributes

1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. map
6. "attributes"

1. filter
2. "array_contains(attributes, cozy)"
3. select
4. "itemId"
5. explode
6. "attributes"

1. filter
2. "array_contains(attributes, 'cozy')"
3. select
4. "itemId"
5. explode
6. "attributes"


Questions 8

Which of the following statements about the differences between actions and transformations is correct?

Options:

Actions are evaluated lazily, while transformations are not evaluated lazily.

Actions generate RDDs, while transformations do not.

Actions do not send results to the driver, while transformations do.

Actions can be queued for delayed execution, while transformations can only be processed immediately.

Actions can trigger Adaptive Query Execution, while transformation cannot.


Questions 9

Which of the following code blocks reads JSON file imports.json into a DataFrame?

Options:

spark.read().mode("json").path("/FileStore/imports.json")

spark.read.format("json").path("/FileStore/imports.json")

spark.read("json", "/FileStore/imports.json")

spark.read.json("/FileStore/imports.json")

spark.read().json("/FileStore/imports.json")


Questions 10

Which of the following describes the difference between client and cluster execution modes?

Options:

In cluster mode, the driver runs on the worker nodes, while the client mode runs the driver on the client machine.

In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.

In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.

In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.

In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.


Questions 11

Which of the following statements about broadcast variables is correct?

Options:

Broadcast variables are serialized with every single task.

Broadcast variables are commonly used for tables that do not fit into memory.

Broadcast variables are immutable.

Broadcast variables are occasionally dynamically updated on a per-task basis.

Broadcast variables are local to the worker node and not shared across the cluster.


Questions 12

Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.withColumnRemoved("predError", "productId")

transactionsDf.drop(["predError", "productId", "associateId"])

transactionsDf.drop("predError", "productId", "associateId")

transactionsDf.dropColumns("predError", "productId", "associateId")

transactionsDf.drop(col("predError", "productId"))


Questions 13

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

Options:

1. select
2. "storeId"
3. print_schema()

1. limit
2. 1
3. columns

1. select
2. "storeId"
3. printSchema()

1. limit
2. "storeId"
3. printSchema()

1. select
2. storeId
3. dtypes


Questions 14

The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf. Find the error.

Code block:

itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

Options:

The join statement is incomplete.

The union method should be used instead of join.

The join method is inappropriate.

The merge method should be used instead of join.

The join expression is malformed.


Questions 15

The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(itemsDf, __2__).__3__(__4__)

Options:

1. join
2. transactionsDf.productId==itemsDf.itemId, how="inner"
3. select
4. "transactionId", "supplier"

1. select
2. "transactionId", "supplier"
3. join
4. [transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId]

1. join
2. [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]
3. select
4. "transactionId", "supplier"

1. filter
2. "transactionId", "supplier"
3. join
4. "transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId"

1. join
2. transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId
3. filter
4. "transactionId", "supplier"


Answer:

Explanation:


This question is pretty complex and, in its complexity, is probably above what you would encounter in the exam. However, reading the question carefully, you can use your logic skills to weed out the wrong answers here.

First, you should examine the join statement which is common to all answers. The first argument of the join() operator (documentation linked below) is the DataFrame to be joined with. Where join is in gap 3, the first argument of gap 4 should therefore be another DataFrame. For none of the answers where join is in the third gap is this the case. So you can immediately discard two answers.

For all other answers, join is in gap 1, followed by (itemsDf, according to the code block. Given how the join() operator is called, there are now three remaining candidates.

Looking further at the join() statement, the second argument (on=) expects "a string for the join column name, a list of column names, a join expression (Column), or a list of Columns", according to the documentation. One answer option passes two join expressions as separate arguments rather than as a list (transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId). join() would then take the second expression as its how argument, which is unsupported according to the documentation, so we can discard that answer, leaving us with two remaining candidates.

Both candidates have valid syntax, but only one of them fulfills the condition in the question: "only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf". So, this one remaining answer option has to be the correct one!

As you can see, although sometimes overwhelming at first, even more complex questions can be figured out by rigorously applying the knowledge you can gain from the documentation during the exam.
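To make the two join conditions concrete, here is a minimal plain-Python sketch of what an inner join on a list of conditions computes: a pair of rows is kept only if every condition in the list holds. The sample rows, the supplier names, and the helper inner_join are invented for illustration and are not part of the exam material. In PySpark, the corresponding call would be transactionsDf.join(itemsDf, [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]).select("transactionId", "supplier").

```python
# Plain-Python model of an inner join on a list of conditions.
# All rows below are invented for illustration only.
transactions = [
    {"transactionId": 1, "storeId": 25, "productId": 1},
    {"transactionId": 2, "storeId": 2, "productId": 2},
    {"transactionId": 3, "storeId": 25, "productId": 3},
]
items = [
    {"itemId": 1, "supplier": "Supplier A"},
    {"itemId": 2, "supplier": "Supplier B"},
    {"itemId": 3, "supplier": "Supplier C"},
]

# Each condition looks at one row from each side; the list is ANDed together.
conditions = [
    lambda t, i: t["productId"] == i["itemId"],
    lambda t, i: t["storeId"] != i["itemId"],
]


def inner_join(left, right, conds):
    """Return merged rows for every (left, right) pair satisfying all conds."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(cond(l, r) for cond in conds)
    ]


joined = inner_join(transactions, items, conditions)
# Keep only the two requested columns, as select() would.
result = [(row["transactionId"], row["supplier"]) for row in joined]
print(result)
```

Transaction 2 drops out because its storeId equals the matching itemId, which is exactly the row the second condition is meant to exclude.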

More info: pyspark.sql.DataFrame.join — PySpark 3.1.2 documentation


Questions 16

Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

Options:

array_remove(transactionsDf, "*")

transactionsDf.unpersist() (Correct)

del transactionsDf

transactionsDf.clearCache()

transactionsDf.persist()


Questions 17

Which of the following code blocks returns a new DataFrame with only columns predError and values of every second row of DataFrame transactionsDf?

Entire DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])

transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")

transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")

transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value") (Correct)

transactionsDf.createOrReplaceTempView("transactionsDf")
spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")

transactionsDf.filter(col(transactionId).isin([3,4,6]))


Questions 18

Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.drop(col("value"), col("predError"))

transactionsDf.drop("predError", "value")

transactionsDf.drop(value, predError)

transactionsDf.drop(["predError", "value"])

transactionsDf.drop([col("predError"), col("value")])


Questions 19

The code block shown below should return the number of columns in the CSV file stored at location filePath. From the CSV file, only lines should be read that do not start with a # character. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

__1__(__2__.__3__.csv(filePath, __4__).__5__)

Options:

1. size
2. spark
3. read()
4. escape='#'
5. columns

1. DataFrame
2. spark
3. read()
4. escape='#'
5. shape[0]

1. len
2. pyspark
3. DataFrameReader
4. comment='#'
5. columns

1. size
2. pyspark
3. DataFrameReader
4. comment='#'
5. columns

1. len
2. spark
3. read
4. comment='#'
5. columns


Answer:

Explanation:


Correct code block:

len(spark.read.csv(filePath, comment='#').columns)

This is a challenging question with difficulties in an unusual context: the boundary between DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.

Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.

Other answer options include size in gap 1. size() is not a built-in Python function, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() method, but this method only returns the length of an array or map stored within a column (documentation linked below). So, using a size() method is not an option here. This leaves us with two potentially valid answers.

We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), DataFrameReader is actually a class in the pyspark.sql module, which means that we cannot access it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession) and spark.read therefore returns a DataFrameReader (also see documentation below). Finally, there is only one correct answer option remaining.
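As a rough illustration of what the comment='#' option and the outer len(...) achieve together, the sketch below models the same idea with Python's standard csv module: skip lines that start with the comment character, then count the fields of the first remaining row. This is only a stand-in for the behavior described above, not how Spark reads files; the sample text and the helper count_columns are invented for this example.

```python
import csv
import io

# Invented sample file content; lines starting with '#' are comments.
raw = """# exported on an unknown date
1,3,4
# another comment line
2,6,7
"""


def count_columns(text, comment="#"):
    """Count the columns of the first non-comment line of CSV text."""
    # Drop comment lines lazily before handing the rest to the CSV parser.
    data_lines = (
        line for line in io.StringIO(text) if not line.startswith(comment)
    )
    first_row = next(csv.reader(data_lines))
    # Like DataFrame.columns, first_row is a list, so len() gives the count.
    return len(first_row)


print(count_columns(raw))
```

The point carried over from the explanation is that the column names come back as an ordinary Python list, which is why the built-in len() is the right tool for gap 1.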

More info:

- pyspark.sql.functions.size — PySpark 3.1.2 documentation

- pyspark.sql.DataFrameReader.csv — PySpark 3.1.2 documentation

- pyspark.sql.SparkSession.read — PySpark 3.1.2 documentation


Questions 20

Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?

Options:

transactionsDf.withColumn("storeId", convert("storeId", "string"))

transactionsDf.withColumn("storeId", col("storeId", "string"))

transactionsDf.withColumn("storeId", col("storeId").convert("string"))

transactionsDf.withColumn("storeId", col("storeId").cast("string"))

transactionsDf.withColumn("storeId", convert("storeId").as("string"))


Questions 21

Which of the following statements about Spark's DataFrames is incorrect?

Options:

Spark's DataFrames are immutable.

Spark's DataFrames are equal to Python's DataFrames.

Data in DataFrames is organized into named columns.

RDDs are at the core of DataFrames.

The data in DataFrames may be split into multiple chunks.


Questions 22

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

Options:

Since itemId is the index, it does not need to be an argument to the select() method.

The alias() method needs to be called after the select() method.

The explode() method expects a Column object rather than a string.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

The split() method should be used inside the select() method instead of the explode() method.


Questions 23

Which of the following describes characteristics of the Spark driver?

Options:

The Spark driver requests the transformation of operations into DAG computations from the worker nodes.

If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.

The Spark driver processes partitions in an optimized, distributed fashion.

In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.

The Spark driver's responsibility includes scheduling queries for execution on worker nodes.


Questions 24

Which of the following is a problem with using accumulators?

Options:

Only unnamed accumulators can be inspected in the Spark UI.

Only numeric values can be used in accumulators.

Accumulator values can only be read by the driver, but not by executors.

Accumulators do not obey lazy evaluation.

Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.


Answer:

Explanation:


Accumulator values can only be read by the driver, but not by executors.

Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good way to do that.

Only numeric values can be used in accumulators.

No. While PySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).

Accumulators do not obey lazy evaluation.

Incorrect: accumulators do obey lazy evaluation. This has implications in practice: when an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.

Accumulators are difficult to use for debugging because they will only be updated once, independent if a task has to be re-run due to hardware failure.

Wrong. A concern with accumulators is in fact that under certain conditions their updates can be applied more than once per task. For example, if a hardware failure occurs during a task after an accumulator variable has been increased but before the task has finished, and Spark launches the task on a different worker in response to the failure, the accumulator increases that were already executed will be repeated.

Only unnamed accumulators can be inspected in the Spark UI.

No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
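The lazy-evaluation and re-run points above can be sketched with a toy model in plain Python. CountingAccumulator and run_task are hypothetical stand-ins invented for this example; they are not PySpark classes, and the failure is simulated with an exception. The model follows the scenario described in the explanation. Note that Spark's own documentation only guarantees exactly-once accumulator updates for updates performed inside actions, so real behavior depends on where the update happens.

```python
class CountingAccumulator:
    """Toy stand-in for an accumulator: tasks add to it, the driver reads it."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        self.value += amount


def run_task(partition, acc, fail_after=None):
    """Process a partition, counting even numbers; optionally fail midway."""
    for index, record in enumerate(partition):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("simulated hardware failure")
        if record % 2 == 0:
            acc.add(1)


partition = [2, 4, 6, 8]
acc = CountingAccumulator()

# Laziness: building the generator does no work, so nothing is counted yet.
lazy_counts = (acc.add(1) for record in partition if record % 2 == 0)
value_before_action = acc.value

# First attempt fails after two records were already counted...
try:
    run_task(partition, acc, fail_after=2)
except RuntimeError:
    pass
# ...and the retry processes the whole partition again.
run_task(partition, acc)

print(value_before_action, acc.value)
```

The partition holds four even numbers, yet the accumulator ends up higher than four because the two increments from the failed attempt are counted again by the retry.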

More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide - Spark 3.1.2 Documentation, pyspark.Accumulator — PySpark 3.1.2 documentation, and

pyspark.AccumulatorParam — PySpark 3.1.2 documentation

Questions 25

The code block displayed below contains an error. The code block should arrange the rows of DataFrame transactionsDf using information from two columns in an ordered fashion, arranging first by column value, showing smaller numbers at the top and greater numbers at the bottom, and then by column predError, for which all values should be arranged in the inverse way of the order of items in column value. Find the error.

Code block:

transactionsDf.orderBy('value', asc_nulls_first(col('predError')))

Options:

Two orderBy statements with calls to the individual columns should be chained, instead of having both columns in one orderBy statement.

Column value should be wrapped by the col() operator.

Column predError should be sorted in a descending way, putting nulls last.

Column predError should be sorted by desc_nulls_first() instead.

Instead of orderBy, sort should be used.


Questions 26

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column predError in DataFrame transactionsDf?

Options:

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

transactionsDf.withColumn("predErrorSquared", "predError"**2)


Questions 27

Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?

Options:

transactionsDf.resample(0.15, False, 3142)

transactionsDf.sample(0.15, False, 3142)

transactionsDf.sample(0.15)

transactionsDf.sample(0.85, 8429)

transactionsDf.sample(True, 0.15, 8261)


Answer:

Explanation:


Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). The arguments are as follows: DataFrame.sample(withReplacement=None, fraction=None, seed=None).

The first argument withReplacement specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark. But we have to enable it here, since the question asks for a row being able to appear more than once. So, we need to pass True for this argument.

About replacement: "Replacement" is most easily explained with the example of removing random items from a box. When you remove those "with replacement" it means that after you have taken an item out of the box, you put it back inside. So, essentially, if you randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more. "Without replacement" means that you do not put the item back into the box after removing it. So, every time you remove an item from the box, there is one less item in the box and you can never take the same item twice.

The second argument to the sample() method is fraction. This refers to the fraction of items that should be returned. In the question we are asked for 150 out of 1000 items, a fraction of 0.15.

The last argument is a random seed. A random seed makes a randomized process repeatable. This means that if you re-run the same sample() operation with the same random seed, you would get the same rows returned from the sample() command. There is no behavior around the random seed specified in the question. The varying random seeds are only there to confuse you!
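The three arguments can be illustrated with a toy sampler in plain Python. It assumes a per-row model: without replacement each row is kept with probability fraction, and with replacement each row is copied a Poisson-distributed number of times with mean fraction. That is a simplified sketch under those assumptions, not Spark's implementation, and the helper names poisson and sample are invented here. It also shows why the question says "about" 150 rows: the result size varies from run to run unless the seed is fixed.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample(rows, with_replacement, fraction, seed):
    """Toy per-row sampler: returns roughly fraction * len(rows) rows."""
    rng = random.Random(seed)  # same seed, same sequence of random draws
    picked = []
    for row in rows:
        if with_replacement:
            # A row may be picked 0, 1, 2, ... times.
            copies = poisson(rng, fraction)
        else:
            # A row is picked at most once.
            copies = 1 if rng.random() < fraction else 0
        picked.extend([row] * copies)
    return picked


rows = list(range(1000))
with_repl = sample(rows, True, 0.15, seed=8261)
without_repl = sample(rows, False, 0.15, seed=3142)
print(len(with_repl), len(without_repl))
```

Calling sample() again with the same arguments returns the same rows, which is all the seed controls; the particular seed value has no bearing on which answer is correct.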

More info: pyspark.sql.DataFrame.sample — PySpark 3.1.1 documentation


Exam Code: Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Last Update: Feb 22, 2025

Questions: 180
