Databricks Databricks-Machine-Learning-Associate Exam Questions Free Practice Test

Viewing page 2 out of 3 pages

Viewing questions 11-20 out of questions

Questions # 11:

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.

They use the following code block to create the objective_function:

[Image: code block for Question 11]

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

Options:

Add test set validation process

Add a random_state argument to the RandomForestRegressor operation

Remove the mean operation that is wrapping the cross_val_score operation

Replace the r2 return value with -r2

Replace the fmin operation with the fmax operation
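
The original code block is not reproduced here, so the following is only a minimal sketch of the idea behind the sign-flip option. Hyperopt's fmin minimizes the value returned by the objective, so a score where higher is better (such as R^2) has to be negated. The toy_minimize helper and the R^2 numbers are hypothetical stand-ins, not Hyperopt's API or real results.

```python
# Hypothetical stand-in for a minimizer such as Hyperopt's fmin:
# it simply returns the candidate with the smallest objective value.
def toy_minimize(objective, candidates):
    return min(candidates, key=objective)

# Made-up R^2 scores per candidate max_depth value (illustrative only).
R2_BY_DEPTH = {2: 0.55, 5: 0.81, 10: 0.74}

def objective_returning_r2(depth):
    # Minimizing R^2 directly selects the WORST-scoring candidate.
    return R2_BY_DEPTH[depth]

def objective_returning_neg_r2(depth):
    # Minimizing -R^2 selects the BEST-scoring candidate.
    return -R2_BY_DEPTH[depth]

worst_depth = toy_minimize(objective_returning_r2, R2_BY_DEPTH)
best_depth = toy_minimize(objective_returning_neg_r2, R2_BY_DEPTH)
print(worst_depth, best_depth)
```

With these made-up scores, the un-negated objective steers the search toward depth 2, while the negated one steers it toward depth 5.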

Questions # 12:

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

[Image: code block for Question 12]

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

Options:

The data will be limited to a single executor preventing the model from being loaded multiple times

The model will be limited to a single executor preventing the data from being distributed

The model only needs to be loaded once per executor rather than once per batch during the inference process

The data will be distributed across multiple executors during the inference process
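
The code block for this question is not reproduced, so here is a plain-Python sketch of the iterator pattern it refers to, with no Spark involved. It assumes a hypothetical load_model function with an expensive load step; the point is only that a function receiving an iterator of batches can load the model once and reuse it for every batch.

```python
from typing import Callable, Iterator, List

load_count = 0  # counts how many times the "model" is loaded

def load_model() -> Callable[[float], float]:
    """Hypothetical expensive model load (a stand-in, not a real API)."""
    global load_count
    load_count += 1
    return lambda x: 2 * x  # trivial stand-in "model"

def predict_batches(batches: Iterator[List[float]]) -> Iterator[List[float]]:
    model = load_model()  # runs once for the whole iterator of batches
    for batch in batches:
        yield [model(x) for x in batch]

batches = iter([[1.0, 2.0], [3.0], [4.0, 5.0]])
results = list(predict_batches(batches))
print(results, load_count)
```

A function that received one batch per call would instead have to load the model inside every call.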

Questions # 13:

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

Options:

When the new solution requires if-else logic determining which model to use to compute each prediction

When the new solution's models have an average latency that is larger than the size of the original model

When the new solution requires the use of fewer feature variables than the original model

When the new solution requires that each model computes a prediction for every record

When the new solution's models have an average size that is larger than the size of the original model

Questions # 14:

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

pandas API on Spark DataFrames are more performant than Spark DataFrames

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

pandas API on Spark DataFrames are unrelated to Spark DataFrames

Questions # 15:

A machine learning engineer wants to parallelize the inference of group-specific models using the Pandas Function API. They have developed the apply_model function that will look up and load the correct model for each group, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

[Image: code block for Question 15]

Which piece of code can be used to fill in the above blank to complete the task?

Options:

applyInPandas

groupedApplyInPandas

mapInPandas

predict

Questions # 16:

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

[Image: schema of train_df for Question 16]

The machine learning engineer shares the following code block:

[Image: code block for Question 16]

Which of the following changes does the machine learning engineer need to make to complete the task?

Options:

They need to call the transform method on train_df

They need to convert the features column to be a vector

They do not need to make any changes

They need to utilize a Pipeline to fit the model

They need to split the features column out into one column for each feature

Questions # 17:

A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.

Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?

Options:

They can refactor their notebook to process the data in parallel.

They can refactor their notebook to use the PySpark DataFrame API.

They can refactor their notebook to use the Scala Dataset API.

They can refactor their notebook to use Spark SQL.

They can refactor their notebook to utilize the pandas API on Spark.

Questions # 18:

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

Options:

When the features are of the categorical type

When the features are of the boolean type

When the features contain a lot of extreme outliers

When the features contain no outliers

When the features contain no missing values
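
A small sketch of why the choice between mean and median matters, using made-up feature values with one extreme outlier and one missing entry. It uses only the standard library and is an illustration, not a recommended imputation pipeline.

```python
from statistics import mean, median

# Made-up feature column: None marks the missing value, 1000.0 is an outlier.
values = [10.0, 12.0, None, 11.0, 13.0, 1000.0]
observed = [v for v in values if v is not None]

mean_fill = mean(observed)      # pulled far upward by the single outlier
median_fill = median(observed)  # stays near the typical values

# Impute the missing entry with the median.
imputed = [median_fill if v is None else v for v in values]
print(mean_fill, median_fill)
```

Here the mean lands far above every typical value, while the median stays inside the range of the ordinary observations.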

Questions # 19:

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

Options:

13.0

17.0

12.0

39.0

10.0
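
As a worked check of the arithmetic, assuming the common convention that the overall cross-validation metric is the average of the per-fold values, the three fold scores can be combined as follows.

```python
# Per-fold root-mean-squared-error values from the question.
fold_rmse = [10.0, 12.0, 17.0]

# Average across folds: (10.0 + 12.0 + 17.0) / 3
overall_rmse = sum(fold_rmse) / len(fold_rmse)
print(overall_rmse)
```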

Questions # 20:

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

Logistic regression

Singular value decomposition

Iterative optimization

Least-squares method
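
To illustrate what solving a linear regression by iterative optimization means, here is a minimal sketch using plain batch gradient descent on toy data. This is a swapped-in technique chosen for brevity: it is not Spark ML's actual solver, and the data, learning rate, and iteration count are all assumptions made for the example.

```python
# Toy data generated from y = 2x + 1 (illustrative values, no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0        # initial slope and intercept
learning_rate = 0.05   # assumed step size, small enough to converge here
n = len(xs)

for _ in range(5000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step against the gradient instead of solving a matrix system.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 4), round(b, 4))
```

Each iteration only needs sums over the records, which is the kind of computation that can be split across partitions of a large dataset and then combined.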

Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with CertsForce