Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with CertsForce

Viewing page 2 out of 3 pages
Viewing questions 11-20 out of questions
Questions # 11:

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.

They use the following code block to create the objective_function:

[Image: code block for Question # 11]

Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?

Options:

A.

Add test set validation process


B.

Add a random_state argument to the RandomForestRegressor operation


C.

Remove the mean operation that is wrapping the cross_val_score operation


D.

Replace the r2 return value with -r2


E.

Replace the fmin operation with the fmax operation
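
Hyperopt's fmin searches for the parameters that minimize whatever the objective function returns, so a score where higher is better (such as R^2) has to be negated before it is returned. The sketch below illustrates only that principle. It does not use Hyperopt: toy_fmin is a hypothetical stand-in that exhaustively tries every candidate, and fake_r2 is a made-up score surface, both introduced here so the example runs without dependencies.

```python
def fake_r2(max_depth):
    # Hypothetical score surface (not a real model): peaks at max_depth == 6.
    return 0.9 - 0.01 * (max_depth - 6) ** 2

def toy_fmin(objective, candidates):
    # Stand-in for a minimizer: returns the candidate with the LOWEST objective value.
    return min(candidates, key=objective)

depths = list(range(1, 11))

# Returning the raw score makes the minimizer hunt for the worst model.
worst = toy_fmin(lambda d: fake_r2(d), depths)

# Negating the score turns "minimize the objective" into "maximize R^2".
best = toy_fmin(lambda d: -fake_r2(d), depths)
```

With these made-up numbers, the un-negated objective selects the depth with the lowest score, while the negated one selects the depth with the highest.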


Questions # 12:

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

[Image: code block for Question # 12]

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

Options:

A.

The data will be limited to a single executor preventing the model from being loaded multiple times


B.

The model will be limited to a single executor preventing the data from being distributed


C.

The model only needs to be loaded once per executor rather than once per batch during the inference process


D.

The data will be distributed across multiple executors during the inference process
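
An Iterator-style function receives an iterator over many batches in a single call, so expensive setup such as loading a model can happen once before the loop rather than once per batch. The following is a plain-Python simulation of that difference, with no Spark involved; load_model, the doubling "model", and the batch contents are all hypothetical.

```python
from typing import Iterator, List

load_count = {"per_batch": 0, "iterator": 0}

def load_model(kind: str):
    # Hypothetical expensive model load; here it only counts how often it runs.
    load_count[kind] += 1
    return lambda batch: [2 * x for x in batch]

def predict_per_batch(batch: List[int]) -> List[int]:
    # Called once per batch, so the model is reloaded for every batch.
    model = load_model("per_batch")
    return model(batch)

def predict_iterator(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    # Called once with an iterator over batches: load once, reuse for all of them.
    model = load_model("iterator")
    for batch in batches:
        yield model(batch)

batches = [[1, 2], [3, 4], [5, 6]]
out_per_batch = [predict_per_batch(b) for b in batches]
out_iterator = list(predict_iterator(iter(batches)))
```

Both variants produce the same predictions; only the number of model loads differs (three versus one for the three batches here).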


Questions # 13:

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

Options:

A.

When the new solution requires if-else logic determining which model to use to compute each prediction


B.

When the new solution's models have an average latency that is larger than the size of the original model


C.

When the new solution requires the use of fewer feature variables than the original model


D.

When the new solution requires that each model computes a prediction for every record


E.

When the new solution's models have an average size that is larger than the size of the original model
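
One way to reason about the scenarios above is to count how many model invocations each design needs per record. The sketch below assumes models are evaluated sequentially with equal latency; the model names, records, and routing rule are hypothetical.

```python
def make_model(name, calls):
    def model(record):
        calls.append(name)  # log that this model did work for a record
        return record * 2
    return model

calls_routed, calls_ensemble = [], []
routed = {k: make_model(k, calls_routed) for k in ("a", "b", "c")}
ensemble = [make_model(k, calls_ensemble) for k in ("a", "b", "c")]

records = [1, 2, 3, 4]

# Routing: if-else style logic picks exactly one model per record.
for r in records:
    key = "a" if r % 3 == 0 else "b" if r % 3 == 1 else "c"
    routed[key](r)

# Ensembling: every model scores every record, then the results are combined.
for r in records:
    preds = [m(r) for m in ensemble]
    combined = sum(preds) / len(preds)
```

Under these assumptions, routing performs one invocation per record, while ensembling performs one invocation per model per record.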


Questions # 14:

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata


B.

pandas API on Spark DataFrames are more performant than Spark DataFrames


C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata


D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames


E.

pandas API on Spark DataFrames are unrelated to Spark DataFrames


Questions # 15:

A machine learning engineer wants to parallelize the inference of group-specific models using the Pandas Function API. They have developed the apply_model function that will look up and load the correct model for each group, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

[Image: code block for Question # 15]

Which piece of code can be used to fill in the above blank to complete the task?

Options:

A.

applyInPandas


B.

groupedApplyInPandas


C.

mapInPandas


D.

predict
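
In PySpark, per-group pandas functions are applied by grouping the DataFrame and then calling applyInPandas on the grouped result, whereas mapInPandas works on batches of rows without any grouping. The original code block is not reproduced here, so the following is only a sketch of that call pattern; the score_groups wrapper and its group_col and schema parameters are hypothetical names.

```python
def score_groups(df, apply_model, group_col, schema):
    # df is assumed to be a PySpark DataFrame.
    # Each group is passed to apply_model as a pandas DataFrame;
    # schema describes the columns of the DataFrame that apply_model returns.
    return df.groupBy(group_col).applyInPandas(apply_model, schema=schema)
```

A call might look like score_groups(df, apply_model, "device_id", "device_id long, prediction double"), where the column names are again placeholders.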


Questions # 16:

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

[Image: schema for Question # 16]

The machine learning engineer shares the following code block:

[Image: code block for Question # 16]

Which of the following changes does the machine learning engineer need to make to complete the task?

Options:

A.

They need to call the transform method on train_df


B.

They need to convert the features column to be a vector


C.

They do not need to make any changes


D.

They need to utilize a Pipeline to fit the model


E.

They need to split the features column out into one column for each feature


Questions # 17:

A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.

Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?

Options:

A.

They can refactor their notebook to process the data in parallel.


B.

They can refactor their notebook to use the PySpark DataFrame API.


C.

They can refactor their notebook to use the Scala Dataset API.


D.

They can refactor their notebook to use Spark SQL.


E.

They can refactor their notebook to utilize the pandas API on Spark.


Questions # 18:

In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?

Options:

A.

When the features are of the categorical type


B.

When the features are of the boolean type


C.

When the features contain a lot of extreme outliers


D.

When the features contain no outliers


E.

When the features contain no missing values
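
The effect of extreme values on the two statistics can be checked with a small example. The numbers below are made up for illustration; a single outlier drags the mean far from the typical values while the median stays among them.

```python
from statistics import mean, median

# Hypothetical observed (non-missing) feature values; 1000.0 is an extreme outlier.
observed = [21.0, 22.0, 23.0, 24.0, 1000.0]

fill_mean = mean(observed)      # pulled toward the outlier
fill_median = median(observed)  # stays with the bulk of the data
```

Here the mean-based fill value is 218.0 and the median-based fill value is 23.0.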


Questions # 19:

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

Options:

A.

13.0


B.

17.0


C.

12.0


D.

39.0


E.

10.0
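
Cross-validation metrics are conventionally summarized by averaging the per-fold values (Spark ML's CrossValidatorModel, for example, reports them as avgMetrics). A minimal sketch of that arithmetic for the three folds listed in the question:

```python
from statistics import mean

fold_rmse = [10.0, 12.0, 17.0]  # per-fold validation RMSE from the question
cv_rmse = mean(fold_rmse)       # (10.0 + 12.0 + 17.0) / 3
```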


Questions # 20:

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression


B.

Singular value decomposition


C.

Iterative optimization


D.

Least-squares method
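
Iterative optimization fits a linear model by repeatedly improving the coefficients from gradient information instead of solving for them in one matrix operation. Spark ML's LinearRegression exposes this through its solver parameter (with "l-bfgs" and "normal" options). The sketch below is not Spark's implementation: it uses plain gradient descent, a simpler iterative method than L-BFGS, on made-up single-node data, and the learning rate and step count are arbitrary choices.

```python
def fit_linear_gd(xs, ys, lr=0.05, steps=5000):
    # Fit y ~ w * x + b by gradient descent on mean squared error.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Each gradient is a sum over rows, the kind of quantity that could be
        # computed per partition and then aggregated in a distributed setting.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_linear_gd(xs, ys)
```

On this toy data the loop converges to a slope close to 2 and an intercept close to 1.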

