Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with CertsForce

Question # 1:

Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?

Options:

A.

F1


B.

R-squared


C.

MAE


D.

MSE


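F1 is a classification metric (it needs predicted class labels), while MAE, MSE, and R-squared score continuous predictions, which is what a regression AutoML run produces. As a reminder of what those three compute, here is a minimal stdlib sketch with made-up values:

```python
# Minimal stdlib sketch of the three regression metrics named in the
# options (MAE, MSE, R-squared); the sample values are illustrative.

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]

print(mae(y_true, y_pred))        # 0.5
print(mse(y_true, y_pred))        # 0.25
print(r_squared(y_true, y_pred))  # 0.95
```

F1, by contrast, is built from precision and recall over discrete class labels, so it has no meaning for continuous targets.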
Question # 2:

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.

Which of the following code blocks will accomplish this task?

Options:

A.

spark_df.loc[:,spark_df["discount"] <= 0]


B.

spark_df[spark_df["discount"] <= 0]


C.

spark_df.filter(col("discount") <= 0)


D.

spark_df.loc[spark_df["discount"] <= 0, :]


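`.loc` is a pandas indexer and does not exist on Spark DataFrames; Spark expresses row filtering as a Column predicate, e.g. `spark_df.filter(col("discount") <= 0)`. A stdlib stand-in for that same predicate logic, using a hypothetical list of row dicts in place of a DataFrame:

```python
# Stdlib sketch of keeping only rows where discount <= 0; in PySpark the
# equivalent predicate would be spark_df.filter(col("discount") <= 0).
rows = [
    {"item": "a", "discount": -0.1},
    {"item": "b", "discount": 0.0},
    {"item": "c", "discount": 0.2},
]

kept = [r for r in rows if r["discount"] <= 0]
print([r["item"] for r in kept])  # ['a', 'b']
```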
Question # 3:

A machine learning engineer is trying to scale a machine learning pipeline that contains multiple feature engineering stages and a modeling stage. As part of the cross-validation process, they are using the following code block:

[code block shown as an image in the original]

A colleague suggests that the code block can be changed to speed up the tuning process by passing the model object to the estimator parameter and then placing the updated cv object as the final stage of the pipeline in place of the original model.

Which of the following is a negative consequence of the approach suggested by the colleague?

Options:

A.

The model will take longer to train for each unique combination of hyperparameter values


B.

The feature engineering stages will be computed using validation data


C.

The cross-validation process will no longer be reproducible


D.

The model will be refit one more time per cross-validation fold


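The danger in the colleague's rearrangement is data leakage: the feature engineering stages are fit once on the full dataset before cross-validation splits it, so any statistics those stages compute have already seen the rows that later serve as validation folds. A stdlib sketch of that leakage, using a hypothetical scaler mean:

```python
# Stdlib sketch of why fitting a preprocessing step outside
# cross-validation leaks information: the scaler's mean is computed on
# rows that later serve as the validation fold.
data = [1.0, 2.0, 3.0, 100.0]  # last row will be the validation fold

# Leaky: mean computed over ALL rows, validation fold included.
leaky_mean = sum(data) / len(data)

# Correct: mean computed over the training fold only.
train_fold = data[:3]
clean_mean = sum(train_fold) / len(train_fold)

print(leaky_mean)  # 26.5 -> the validation row influenced the statistic
print(clean_mean)  # 2.0
```

Keeping the cross-validator around the whole pipeline refits the feature stages inside each fold, which is slower but leak-free.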
Question # 4:

A data scientist is using the following code block to tune hyperparameters for a machine learning model:

[code block shown as an image in the original]

Which change can they make to the above code block to improve the likelihood of a more accurate model?

Options:

A.

Increase num_evals to 100


B.

Change fmin() to fmax()


C.

Change sparkTrials() to Trials()


D.

Change tpe.suggest to random.suggest


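Assuming the code block uses Hyperopt-style minimization (where `fmin()` is the only entry point; there is no `fmax()`), raising the evaluation budget is the lever that directly improves the best value found. A stdlib random-search sketch of that effect, on a toy loss rather than the exam's model:

```python
import random

# Stdlib sketch of why more evaluations help: random search over a toy
# 1-D loss; with a fixed seed, the first 10 draws are a prefix of the
# first 100, so the best loss found can only improve as num_evals grows.
def loss(x):
    return (x - 3.0) ** 2

def random_search(num_evals, seed=0):
    rng = random.Random(seed)
    return min(loss(rng.uniform(-10, 10)) for _ in range(num_evals))

best_10 = random_search(num_evals=10)
best_100 = random_search(num_evals=100)
print(best_10, best_100)
assert best_100 <= best_10  # more trials never yield a worse best
```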
Question # 5:

A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space.

As a result, they have the following code block:

[code block shown as an image in the original]

Which of the following changes do they need to make to the above code block in order to accomplish the task?

Options:

A.

Change SparkTrials() to Trials()


B.

Reduce num_evals to be less than 10


C.

Change fmin() to fmax()


D.

Remove the trials=trials argument


E.

Remove the algo=tpe.suggest argument


Question # 6:

A new data scientist has started working on an existing machine learning project. The project is a scheduled Job that retrains every day. The project currently exists in a Repo in Databricks. The data scientist has been tasked with improving the feature engineering of the pipeline’s preprocessing stage. The data scientist wants to make necessary updates to the code that can be easily adopted into the project without changing what is being run each day.

Which approach should the data scientist take to complete this task?

Options:

A.

They can create a new branch in Databricks, commit their changes, and push those changes to the Git provider.


B.

They can clone the notebooks in the repository into a Databricks Workspace folder and make the necessary changes.


C.

They can create a new Git repository, import it into Databricks, and copy and paste the existing code from the original repository before making changes.


D.

They can clone the notebooks in the repository into a new Databricks Repo and make the necessary changes.


Question # 7:

A data scientist has produced two models for a single machine learning problem. One of the models performs well when one of the features has a value of less than 5, and the other model performs well when the value of that feature is greater than or equal to 5. The data scientist decides to combine the two models into a single machine learning solution.

Which of the following terms is used to describe this combination of models?

Options:

A.

Bootstrap aggregation


B.

Support vector machines


C.

Bucketing


D.

Ensemble learning


E.

Stacking


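Routing inputs to different specialist models and combining their outputs into one solution is a form of ensemble learning. A minimal stdlib sketch with two hypothetical models and the feature threshold from the question:

```python
# Stdlib sketch of combining two specialist models with a routing rule
# on one feature value (a simple form of ensemble learning). Both
# models here are made up for illustration.
def model_low(x):   # hypothetical model that performs well when x < 5
    return 2 * x

def model_high(x):  # hypothetical model that performs well when x >= 5
    return x + 5

def combined(x):
    # the single "machine learning solution" from the question
    return model_low(x) if x < 5 else model_high(x)

print(combined(2))  # 4  (routed to model_low)
print(combined(8))  # 13 (routed to model_high)
```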
Question # 8:

A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical.

Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?

Options:

A.

Spark ML decision trees test every feature variable in the splitting algorithm


B.

Spark ML decision trees automatically prune overfit trees


C.

Spark ML decision trees test more split candidates in the splitting algorithm


D.

Spark ML decision trees test a random sample of feature variables in the splitting algorithm


E.

Spark ML decision trees test binned features values as representative split candidates


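Spark ML discretizes continuous features into a bounded number of bins (the maxBins parameter) and only tests bin boundaries as split candidates, whereas single-node sklearn can test a threshold between every pair of adjacent sorted values; the chosen splits, and therefore the trees, can differ even on identical data. A stdlib sketch of the difference in candidate counts, on toy values:

```python
# Stdlib sketch: exhaustive split candidates (a midpoint between every
# pair of adjacent sorted values, sklearn-style) vs. a fixed number of
# binned boundary candidates (Spark-style). Values are illustrative.
values = sorted([0.1, 0.4, 1.2, 3.3, 3.4, 7.8, 9.0, 9.1])

# Exhaustive: one candidate between each adjacent pair -> n - 1 candidates.
exhaustive = [(a + b) / 2 for a, b in zip(values, values[1:])]

# Binned: only max_bins - 1 quantile-style boundaries are candidates.
max_bins = 4
step = len(values) // max_bins
binned = [values[i * step] for i in range(1, max_bins)]

print(len(exhaustive))  # 7 candidate thresholds
print(len(binned))      # 3 candidate thresholds
```

With fewer, coarser candidates, Spark may pick a slightly different threshold than sklearn at some node, and the two trees diverge from there.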
Question # 9:

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression


B.

Spark ML cannot distribute linear regression training


C.

Iterative optimization


D.

Least-squares method


E.

Singular value decomposition


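Iterative optimization here means repeatedly updating the coefficients from gradients computed over the (distributed) data, rather than solving a matrix equation once. A stdlib sketch of the idea, using plain gradient descent on a one-variable least-squares problem:

```python
# Stdlib sketch of iterative optimization: gradient descent fitting
# y = w * x by repeatedly stepping against the gradient of the mean
# squared error, instead of a closed-form matrix solve.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # true relationship: y = 2x

w = 0.0
lr = 0.01
for _ in range(500):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

The per-iteration gradient is a sum over rows, which is exactly the kind of computation Spark can distribute across partitions.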
Question # 10:

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.

Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Options:

A.

Change the number of compute nodes to be half or less than half of the number of evaluations.


B.

Change the number of compute nodes and the number of evaluations to be much larger but equal.


C.

Change the iterative optimization algorithm used to facilitate the tuning process.


D.

Change the number of compute nodes to be double or more than double the number of evaluations.


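With as many nodes as evaluations, every hyperparameter setting must be chosen before any result comes back, so an adaptive algorithm degenerates into random search; keeping the node count well below the evaluation count leaves later trials free to learn from earlier ones. A stdlib caricature of the two regimes (toy loss; the adaptive rule is a made-up hill-climb, not a real tuning algorithm):

```python
import random

# Stdlib sketch: fully parallel tuning draws every candidate up front,
# while a sequential (adaptive) search can propose each new candidate
# near the best point found so far.
def loss(x):
    return (x - 3.0) ** 2

rng = random.Random(42)

# Fully parallel: 8 points drawn before any result is known.
parallel_points = [rng.uniform(-10, 10) for _ in range(8)]
best_parallel = min(loss(x) for x in parallel_points)

# Sequential-adaptive: each new point perturbs the best point so far.
best_x = rng.uniform(-10, 10)
for _ in range(7):
    candidate = best_x + rng.gauss(0, 1.0)
    if loss(candidate) < loss(best_x):
        best_x = candidate
best_adaptive = loss(best_x)

print(best_parallel, best_adaptive)
```

Only the sequential loop uses feedback between evaluations, which is the improvement the question is probing for.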