
Databricks-Machine-Learning-Associate Databricks Certified Machine Learning Associate Exam Questions and Answers

Questions 4

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

They have written the following incomplete code block to use predict to score each record of the Spark DataFrame spark_df:

Which of the following lines of code can be used to complete the code block to successfully complete the task?

Options:

A.

predict(*spark_df.columns)

B.

mapInPandas(predict)

C.

predict(Iterator(spark_df))

D.

mapInPandas(predict(spark_df.columns))

E.

predict(spark_df.columns)
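
For reference, a minimal Python sketch of the mapInPandas pattern this question is built around, assuming a toy predict function (the model call is a hypothetical stand-in):

from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(pd.DataFrame({"feature": [1.0, 2.0, 3.0]}))

def predict(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # A mapInPandas-compatible function consumes an iterator of pandas
    # DataFrames and yields pandas DataFrames; a real implementation would
    # call the single-node model's predict() on each batch.
    for batch in batches:
        yield pd.DataFrame({"prediction": batch["feature"] * 2.0})  # stand-in

# The function object itself is passed to mapInPandas, together with the
# schema of the DataFrame it yields.
preds_df = spark_df.mapInPandas(predict, schema="prediction double")
preds_df.show()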

Questions 5

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.

Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Options:

A.

Change the number of compute nodes to be half or less than half of the number of evaluations.

B.

Change the number of compute nodes and the number of evaluations to be much larger but equal.

C.

Change the iterative optimization algorithm used to facilitate the tuning process.

D.

Change the number of compute nodes to be double or more than double the number of evaluations.
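
For context, a minimal Hyperopt sketch of the idea behind this question (Hyperopt and the helper train_and_score are assumptions, not part of the question): keeping parallelism well below the number of evaluations lets an adaptive algorithm use completed trials to inform later ones.

from hyperopt import SparkTrials, fmin, hp, tpe

search_space = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}

def objective(params):
    # train_and_score is a hypothetical helper that fits a model with the
    # given hyperparameters and returns the validation loss to minimize.
    return train_and_score(params)

# With parallelism=4 and max_evals=8, only half of the trials run at once,
# so the TPE algorithm can adapt its proposals based on finished trials.
trials = SparkTrials(parallelism=4)
best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=8, trials=trials)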

Questions 6

A data scientist wants to explore the Spark DataFrame spark_df. The exploration should include visual histograms displaying the distribution of the numeric features.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

A.

spark_df.describe()

B.

dbutils.data(spark_df).summarize()

C.

This task cannot be accomplished in a single line of code.

D.

spark_df.summary()

E.

dbutils.data.summarize(spark_df)

Questions 7

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Options:

A.

Run each notebook interactively

B.

Review the matrix view in the Job's runs

C.

Migrate the Job to a Delta Live Tables pipeline

D.

Change each Task’s setting to use a dedicated cluster

Questions 8

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

Options:

A.

Impute the missing values using each respective feature variable's mean value instead of the median value

B.

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C.

Remove all feature variables that originally contained missing values from the feature set

D.

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

E.

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
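
As an illustration of the indicator-column idea mentioned in the options, a minimal scikit-learn sketch (the column names are hypothetical):

import pandas as pd
from sklearn.impute import SimpleImputer

features = pd.DataFrame({"age": [34.0, None, 29.0],
                         "income": [55.0, 61.0, None]})

# add_indicator=True appends a binary column for each feature that contained
# missing values, so the model keeps the information that a value was imputed.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(features)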

Questions 9

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Spark ML cannot distribute linear regression training

C.

Iterative optimization

D.

Least-squares method

E.

Singular value decomposition
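
For reference, Spark ML's LinearRegression exposes this choice through its solver parameter; a minimal sketch (column names and the training DataFrame are assumed):

from pyspark.ml.regression import LinearRegression

# solver="l-bfgs" requests the iterative optimizer used for large data,
# solver="normal" uses the matrix-decomposition (normal equation) approach,
# and the default "auto" lets Spark choose between them.
lr = LinearRegression(featuresCol="features", labelCol="label", solver="l-bfgs")
# model = lr.fit(training_df)  # training_df assumed to exist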

Questions 10

A machine learning engineer has grown tired of needing to install the MLflow Python library on each of their clusters. They ask a senior machine learning engineer how their notebooks can load the MLflow library without installing it each time. The senior machine learning engineer suggests that they use Databricks Runtime for Machine Learning.

Which of the following approaches describes how the machine learning engineer can begin using Databricks Runtime for Machine Learning?

Options:

A.

They can add a line enabling Databricks Runtime ML in their init script when creating their clusters.

B.

They can check the Databricks Runtime ML box when creating their clusters.

C.

They can select a Databricks Runtime ML version from the Databricks Runtime Version dropdown when creating their clusters.

D.

They can set the runtime-version variable in their Spark session to “ml”.

Questions 11

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A)

B)

C)

D)

E)

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E
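
The option screenshots are not reproduced above. For reference, one standard way to compute RMSE from a DataFrame with this schema uses Spark ML's RegressionEvaluator (a sketch, not necessarily the exact text of any option):

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction",
                                labelCol="actual",
                                metricName="rmse")
rmse = evaluator.evaluate(preds_df)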

Questions 12

A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df.

batch_df has the following schema:

customer_id STRING

The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:

In which situation will the machine learning engineer’s code block perform the desired inference?

Options:

A.

When the Feature Store feature set was logged with the model at model_uri

B.

When all of the features used by the model at model_uri are in a Spark DataFrame in the PySpark

C.

When the model at model_uri only uses customer_id as a feature

D.

This code block will not perform the desired inference in any situation.

E.

When all of the features used by the model at model_uri are in a single Feature Store table
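
For context, the batch-scoring call this question describes typically looks like the following Feature Store sketch (assuming a workspace Feature Store client; the question's own code block is not reproduced above):

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# score_batch looks up the remaining feature values using the keys present in
# batch_df (here customer_id); this only works when the feature lookups were
# logged together with the model stored at model_uri.
preds_df = fs.score_batch(model_uri, batch_df)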

Questions 13

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.

Which of the following approaches will guarantee a reproducible training and test set for each model?

Options:

A.

Manually configure the cluster

B.

Write out the split data sets to persistent storage

C.

Set a seed in the data splitting operation

D.

Manually partition the input data
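
As a sketch of the trade-off behind these options: a seeded randomSplit can still produce different splits when the cluster configuration, and therefore the data partitioning, changes, whereas writing the split data out pins the rows down. The paths below are hypothetical and Delta is assumed as the storage format.

train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)

# Persisting the exact split guarantees that every later run trains and tests
# on identical rows, regardless of cluster size or partitioning.
train_df.write.format("delta").mode("overwrite").save("/tmp/example/train")
test_df.write.format("delta").mode("overwrite").save("/tmp/example/test")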

Questions 14

A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:

● Hyperparameter 1: [2, 5, 10]

● Hyperparameter 2: [50, 100]

Which of the following represents the number of machine learning models that can be trained in parallel during this process?

Options:

A.

3

B.

5

C.

6

D.

18
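
For reference, the grid yields 3 × 2 = 6 hyperparameter combinations, and with 3-fold cross-validation each combination is fit 3 times, i.e. 6 × 3 = 18 independent model fits. A minimal Spark ML sketch of such a setup (the estimator and column names are assumptions):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [2, 5, 10])   # hyperparameter 1
        .addGrid(rf.numTrees, [50, 100])    # hyperparameter 2
        .build())                           # 3 x 2 = 6 combinations

# 6 combinations x 3 folds = 18 independent model fits; the parallelism
# argument controls how many of them Spark trains at the same time.
cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=3,
                    parallelism=4)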

Questions 15

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

A.

One-hot encoding is not supported by most machine learning libraries.

B.

One-hot encoding is dependent on the target variable's values which differ for each application.

C.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

D.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

E.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Questions 16

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically.

Which of the following lines of code will return the metadata description?

Options:

A.

There is no way to return the metadata description programmatically.

B.

fs.create_training_set("new_table")

C.

fs.get_table("new_table").description

D.

fs.get_table("new_table").load_df()

E.

fs.get_table("new_table")

Questions 17

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Options:

A.

Spark ML decision trees test every feature variable in the splitting algorithm

B.

Spark ML decision trees automatically prune overfit trees

C.

Spark ML decision trees test more split candidates in the splitting algorithm

D.

Spark ML decision trees test a random sample of feature variables in the splitting algorithm

E.

Spark ML decision trees test binned features values as representative split candidates
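
For context, a minimal sketch of the binning behaviour referenced in the options (the parameter value shown is Spark's default):

from pyspark.ml.classification import DecisionTreeClassifier

# Spark ML discretizes continuous features into at most maxBins buckets and
# evaluates the bin boundaries as split candidates, rather than every distinct
# value as single-node sklearn does, which can produce different trees even
# with identical data and hyperparameters.
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxBins=32)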

Questions 18

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_pandas()

D.

import pandas as pd

df = pd.DataFrame(spark_df)
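
For context, once a Spark DataFrame is wrapped as a pandas-on-Spark DataFrame, the familiar pandas-style API becomes available while execution stays on Spark; a minimal sketch:

import pyspark.pandas as ps

psdf = ps.DataFrame(spark_df)  # wrap the existing Spark DataFrame
psdf.head()                    # pandas-style calls, executed by Spark
psdf.describe()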

Questions 19

Which of the following hyperparameter optimization methods automatically makes informed selections of hyperparameter values based on previous trials for each iterative model evaluation?

Options:

A.

Random Search

B.

Halving Random Search

C.

Tree of Parzen Estimators

D.

Grid Search

Questions 20

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

Options:

A.

13.0

B.

17.0

C.

12.0

D.

39.0

E.

10.0
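
For reference, cross-validation tooling such as Spark ML's CrossValidator reports the overall metric as the arithmetic mean of the per-fold values; a one-line sketch with the values above:

fold_rmses = [10.0, 12.0, 17.0]
overall_rmse = sum(fold_rmses) / len(fold_rmses)  # (10.0 + 12.0 + 17.0) / 3 = 13.0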

Questions 21

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A.

A holdout set is not necessary when using a train-validation split

B.

Reproducibility is achievable when using a train-validation split

C.

Fewer hyperparameter values need to be tested when using a train-validation split

D.

Bias is avoidable when using a train-validation split

E.

Fewer models need to be trained when using a train-validation split

Questions 22

Which of the following statements describes a Spark ML estimator?

Options:

A.

An estimator is a hyperparameter grid that can be used to train a model

B.

An estimator chains multiple algorithms together to specify an ML workflow

C.

An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions

D.

An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

E.

An estimator is an evaluation tool used to assess the quality of a model
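
For context, a minimal sketch of the estimator/transformer distinction in Spark ML (column names and the training DataFrame are assumed):

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label")  # an Estimator
# model = lr.fit(training_df)           # fit() returns a fitted model, a Transformer
# preds = model.transform(training_df)  # the Transformer adds a prediction column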

Exam Name: Databricks Certified Machine Learning Associate Exam
Last Update: Dec 4, 2024
Questions: 74