Databricks Databricks-Certified-Professional-Data-Scientist Exam Questions Free Practice Test

Viewing page 2 out of 5 pages

Viewing questions 11-20 out of questions

Questions # 11:

What type of output is generated in the case of linear regression?

Options:

Continuous variable

Discrete variable

Either a continuous or a discrete variable

Values between 0 and 1

Expert Solution

Questions # 12:

You are building a classifier off of a very high-dimensional data set, similar to the one shown in the image, with 5000 variables (lots of columns, not that many rows). The classifier can handle both dense and sparse input. Which technique is most suitable, and why?

[Image: Question # 12]

Options:

Logistic regression with L1 regularization, to prevent overfitting

Naive Bayes, because Bayesian methods act as regularizers

k-nearest neighbors, because it uses local neighborhoods to classify examples

Random forest, because it is an ensemble method

Expert Solution

Answer: Logistic regression with L1 regularization, to prevent overfitting

Explanation

Logistic regression is widely used in machine learning for classification problems. It is well known that regularization is required to avoid overfitting, especially when there is only a small number of training examples or when there are a large number of parameters to be learned. In particular, L1-regularized logistic regression is often used for feature selection and has been shown to have good generalization performance in the presence of many irrelevant features (Ng 2004; Goodman 2004). Unregularized logistic regression is an unconstrained convex optimization problem with a continuously differentiable objective function. As a consequence, it can be solved fairly efficiently with standard convex optimization methods, such as Newton's method or conjugate gradient. However, adding the L1 regularization makes the optimization problem computationally more expensive to solve. If the L1 regularization is enforced by an L1 norm constraint on the parameters ...

Logistic regression is a classifier, and L1 regularization tends to produce models that ignore dimensions of the input that are not predictive. This is particularly useful when the input contains many dimensions. k-nearest neighbors is also a classification technique, but it relies on notions of distance. In a high-dimensional space, nearly every data point is "far" from the others (the curse of dimensionality), so distance-based techniques break down. Naive Bayes is not inherently regularizing. Random forests are an ensemble method, but an ensemble method is not necessarily more suitable for high-dimensional data.

Practically, I think the biggest reasons for regularization are 1) to avoid overfitting by not generating high coefficients for predictors that are sparse, and 2) to stabilize the estimates, especially when there is collinearity in the data.

1) is inherent in the regularization framework. Since there are two forces pulling against each other in the objective function, a coefficient that brings no meaningful loss reduction only increases the penalty from the regularization term and so does not improve the overall objective. This is a great property, since a lot of noise is automatically filtered out of the model.

To give an example for 2): if you have two predictors with identical values and run a plain regression on them, the data matrix is singular, so a straight matrix inversion fails and the beta coefficients are not uniquely determined. But if you add a very small (L2) regularization lambda, you get stable beta coefficients, with the coefficient value divided evenly between the two equivalent variables.

As for the difference between L1 and L2: L2 has an elegant analytical solution and is computationally straightforward, so why do people bother with L1? Regularized regression can also be represented as a constrained regression problem (the two forms are Lagrangian equivalents). The implication is that L1 regularization gives you sparse estimates: in a high-dimensional space, you get mostly zeros and a small number of non-zero coefficients. This is huge, since it incorporates variable selection into the modeling problem. In addition, if you have to score a large sample with your model, you can save a lot of computation, since you don't have to compute features (predictors) whose coefficient is 0. I personally think L1 regularization is one of the most beautiful things in machine learning and convex optimization. It is widely used in bioinformatics and in large-scale machine learning at companies like Facebook, Yahoo, Google and Microsoft.
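The two effects described in this explanation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the method of any particular library: `ridge_two_features` is a hypothetical helper that solves the two-predictor ridge normal equations without an intercept, `soft_threshold` and `l2_shrink` are the closed-form single-coefficient solutions that hold when the design is assumed orthonormal, and all data values are made up.

```python
def ridge_two_features(x1, x2, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for two predictors, no intercept."""
    a = sum(v * v for v in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    r1 = sum(u * t for u, t in zip(x1, y))
    r2 = sum(u * t for u, t in zip(x2, y))
    det = a * d - b * b
    if det == 0:
        raise ZeroDivisionError("singular system: coefficients are not unique")
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det


def soft_threshold(z, lam):
    """L1 case: argmin over b of 0.5 * (z - b)**2 + lam * abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def l2_shrink(z, lam):
    """L2 case: argmin over b of 0.5 * (z - b)**2 + 0.5 * lam * b**2."""
    return z / (1.0 + lam)


# Toy data (made up): two identical predictors, y = 2 * x.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]

try:
    ridge_two_features(x, x, y, lam=0)  # no penalty: X^T X is singular
    unregularized_ok = True
except ZeroDivisionError:
    unregularized_ok = False

beta = ridge_two_features(x, x, y, lam=1e-3)  # tiny penalty: stable, even split

# Made-up unpenalized coefficients: one strong signal, two weak ones.
z = [3.0, 0.2, -0.1]
l1_coefs = [soft_threshold(v, 0.5) for v in z]  # weak ones become exactly 0.0
l2_coefs = [l2_shrink(v, 0.5) for v in z]       # all shrink, none reach 0

print(unregularized_ok, beta, l1_coefs, l2_coefs)
```

In this sketch the unregularized solve is rejected as singular, the tiny L2 penalty returns two equal coefficients that sum to roughly 2, and only the L1 rule sets the weak coefficients to exactly zero, which is the sparsity the explanation refers to.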

Questions # 13:

In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters. The normalizing constant is usually ignored in MLE because:

Options:

The normalizing constant is always very close to 1

The normalizing constant only has a small impact on the maximum likelihood

The normalizing constant is often zero and can cause division by zero

The normalizing constant doesn't impact the maximizing value
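The role of the normalizing constant can be checked numerically. The sketch below is an illustration under assumptions that are not in the question: a Gaussian model with known standard deviation (so the constant does not depend on the parameter being estimated), made-up data, and a simple grid search standing in for a real optimizer. The name `log_likelihood` is hypothetical.

```python
import math


def log_likelihood(mu, data, sigma=1.0, include_constant=True):
    """Gaussian log-likelihood of `data` as a function of the mean `mu`."""
    ll = sum(-((x - mu) ** 2) / (2 * sigma ** 2) for x in data)
    if include_constant:
        # Normalizing term: the same value for every candidate mu.
        ll += -0.5 * len(data) * math.log(2 * math.pi * sigma ** 2)
    return ll


data = [2.1, 1.9, 2.4, 2.0, 1.6]          # made-up sample, mean 2.0
grid = [i / 100 for i in range(0, 401)]   # candidate means 0.00 .. 4.00

mu_full = max(grid, key=lambda m: log_likelihood(m, data, include_constant=True))
mu_kernel = max(grid, key=lambda m: log_likelihood(m, data, include_constant=False))

print(mu_full, mu_kernel)
```

Adding the same constant to every candidate shifts the whole curve up or down without moving its peak, so both searches pick the same mean.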

Expert Solution

Questions # 14:

You are creating a classification process where the input is the income, education, and current debt of a customer. What could be a possible output of this process?

Options:

Probability of the customer defaulting on loan repayment

Percentage of the customer's loan repayment capability

Percentage indicating whether the customer should be given a loan or not

The output might be a risk class, such as "good", "acceptable", "average", or "unacceptable".

Expert Solution

Questions # 15:

Which of the following techniques can be used in the design of recommender systems?

Options:

Naive Bayes classifier

Power iteration

Collaborative filtering

1 and 3

2 and 3

Expert Solution

Questions # 16:

Select the correct statement which applies to logistic regression

Options:

Computationally inexpensive, easy to implement, knowledge representation easy to interpret

May have low accuracy

Works with numeric values

Only 1 and 3 are correct

All 1, 2 and 3 are correct

Expert Solution

Questions # 17:

You are working on a problem where you have to predict whether a claim is valid or not. You find that fraudulent claims tend to have more spelling errors and corrections in the manually filled claim forms compared to the honest claims. Which of the following techniques is suitable for finding out whether a claim is valid or not?

Options:

Naive Bayes

Logistic Regression

Random Decision Forests

Any one of the above

Expert Solution

Questions # 18:

Reducing the data from many features to a small number, so that we can properly visualize it in two or three dimensions, is done in _______

Options:

supervised learning

unsupervised learning

k-Nearest Neighbors

Support vector machines

Expert Solution

Questions # 19:

Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:

In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%.

Select the correct statement

[Image: Question # 19]

Options:

Precision is low, which means the classifier is predicting positives best

Precision is low, which means the classifier is predicting positives poorly

The problem domain has a major impact on the measures that should be used to evaluate a classifier within it

1 and 3

2 and 3
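The percentages quoted in the question can be checked with a little arithmetic. The confusion matrix itself does not appear in the text, so the four counts below are an assumption: they are the values implied by the stated totals (600 positives out of 11,100 instances) together with the stated precision and recall.

```python
# Assumed counts, reconstructed from the stated totals and percentages.
tp, fn = 500, 100      # 600 actual positives
fp, tn = 500, 10_000   # 10,500 actual negatives

precision = tp / (tp + fp)                  # share of predicted positives that are right
recall = tp / (tp + fn)                     # share of actual positives that are found
specificity = tn / (tn + fp)                # share of actual negatives that are found
accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all predictions that are right

for name, value in [("precision", precision), ("recall", recall),
                    ("specificity", specificity), ("accuracy", accuracy)]:
    print(f"{name}: {value:.1%}")
```

Under these assumed counts the four values round to the 50%, 83%, 95% and 95% given in the question. Accuracy stays high mainly because negatives dominate the data set, while half of the predicted positives are false alarms.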

Expert Solution

Questions # 20:

Select the correct problems which can be solved using SVMs

Options:

SVMs are helpful in text and hypertext categorization

Classification of images can also be performed using SVMs

SVMs are also useful in medical science to classify proteins with up to 90% of the compounds classified correctly

Hand-written characters can be recognized using SVM

Expert Solution
