Pass the Databricks Certification Databricks-Certified-Data-Engineer-Associate questions and answers with CertsForce

Viewing page 2 of 4 (questions 11-20)
Question #11:

A data engineer wants to schedule their Databricks SQL dashboard to refresh every hour, but they only want the associated SQL endpoint to be running when it is necessary. The dashboard has multiple queries on multiple datasets associated with it. The data that feeds the dashboard is automatically processed using a Databricks Job.

Which of the following approaches can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

Options:

A. They can turn on the Auto Stop feature for the SQL endpoint.

B. They can ensure the dashboard's SQL endpoint is not one of the included queries' SQL endpoints.

C. They can reduce the cluster size of the SQL endpoint.

D. They can ensure the dashboard's SQL endpoint matches each of the queries' SQL endpoints.

E. They can set up the dashboard's SQL endpoint to be serverless.


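For context on option A: a SQL endpoint's Auto Stop setting shuts the endpoint down after a period of inactivity, so it only runs around each scheduled refresh. A minimal sketch using the Databricks SQL Warehouses REST API, with hypothetical workspace URL, token, and warehouse name:

import requests

HOST = "https://<workspace>.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "<personal-access-token>"                  # hypothetical token

# Create a SQL warehouse (endpoint) that shuts itself down after
# 10 idle minutes, so it only runs around the hourly refresh.
resp = requests.post(
    f"{HOST}/api/2.0/sql/warehouses",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "dashboard-refresh-wh",  # hypothetical name
        "cluster_size": "Small",
        "auto_stop_mins": 10,            # the Auto Stop setting
    },
)
resp.raise_for_status()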
Question #12:

A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name.

They have the following incomplete code block:

____(f"SELECT customer_id, spend FROM {table_name}")

Which of the following can be used to fill in the blank to successfully complete the task?

Options:

A. spark.delta.sql

B. spark.delta.table

C. spark.table

D. dbutils.sql

E. spark.sql


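For reference, spark.sql is the PySpark method that runs a SQL string and returns a DataFrame, so it pairs naturally with an f-string. A minimal sketch, assuming a table registered in the metastore (the name here is hypothetical):

# spark is predefined in Databricks notebooks; the table name is hypothetical.
table_name = "customers"

# spark.sql executes the SQL string and returns a DataFrame.
df = spark.sql(f"SELECT customer_id, spend FROM {table_name}")
df.show()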
Question #13:

Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?

Options:

A. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.

B. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.

C. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.

D. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.

E. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.


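For context, the same distinction exists in the DLT Python API: a streaming table reads its source incrementally, while a live (materialized) table is fully recomputed from a batch read. A sketch that runs only inside a DLT pipeline, with hypothetical table names:

import dlt

# Streaming table: new source rows are processed incrementally on each update
# (the SQL counterpart is CREATE STREAMING LIVE TABLE).
@dlt.table(name="events_raw")
def events_raw():
    return spark.readStream.table("source_events")  # hypothetical source table

# Live table: fully recomputed from a batch read on each update
# (the SQL counterpart is CREATE LIVE TABLE).
@dlt.table(name="events_summary")
def events_summary():
    return dlt.read("events_raw").groupBy("event_type").count()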
Question #14:

A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values.

Which of the following describes why Auto Loader inferred all of the columns to be of the string type?

Options:

A. There was a type mismatch between the specified schema and the inferred schema

B. JSON data is a text-based format

C. Auto Loader only works with string data

D. All of the fields had at least one null value

E. Auto Loader cannot infer the schema of ingested data


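For context: because JSON is a text-based format, Auto Loader's default inference lands every column as a string unless type inference is switched on or a schema is supplied. A sketch with hypothetical paths, using the cloudFiles.inferColumnTypes option:

# Hypothetical source and schema-tracking locations.
source_path = "/mnt/raw/events/"
schema_path = "/mnt/schemas/events/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    # Without this option, JSON fields are inferred as strings by default.
    .option("cloudFiles.inferColumnTypes", "true")
    .load(source_path)
)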
Question #15:

Which SQL keyword can be used to convert a table from a long format to a wide format?

Options:

A. TRANSFORM

B. PIVOT

C. SUM

D. CONVERT


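For reference, PIVOT turns the distinct values of one column into new columns, which is exactly the long-to-wide reshaping the question describes. A self-contained sketch with hypothetical data:

# Long format: one row per (month, region) pair.
long_df = spark.createDataFrame(
    [("2024-01", "north", 10), ("2024-01", "south", 20), ("2024-02", "north", 15)],
    ["month", "region", "sales"],
)
long_df.createOrReplaceTempView("sales_long")

# Wide format: one row per month, one column per region.
spark.sql("""
    SELECT * FROM sales_long
    PIVOT (SUM(sales) FOR region IN ('north', 'south'))
""").show()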
Question #16:

Which of the following must be specified when creating a new Delta Live Tables pipeline?

Options:

A. A key-value pair configuration

B. The preferred DBU/hour cost

C. A path to a cloud storage location for the written data

D. A location of a target database for the written data

E. At least one notebook library to be executed


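For context on what is mandatory: when creating a pipeline through the Pipelines REST API, the libraries list points at the notebook (or file) the pipeline executes, while most other settings have defaults. A sketch with the same hypothetical host and token as the warehouse example above:

import requests

HOST = "https://<workspace>.cloud.databricks.com"  # hypothetical
TOKEN = "<personal-access-token>"                  # hypothetical

resp = requests.post(
    f"{HOST}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "retail-dlt",  # hypothetical pipeline name
        # At least one notebook library to execute; the path is hypothetical.
        "libraries": [{"notebook": {"path": "/Repos/etl/dlt_pipeline"}}],
    },
)
resp.raise_for_status()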
Question #17:

Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

Options:

A. Parquet files can be partitioned

B. CREATE TABLE AS SELECT statements cannot be used on files

C. Parquet files have a well-defined schema

D. Parquet files have the ability to be optimized

E. Parquet files will become Delta tables


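For reference: Parquet files embed their schema (column names and types) in the file footer, so a CREATE TABLE AS SELECT over Parquet produces typed columns without extra options, while CSV carries no embedded schema. A sketch with hypothetical paths:

# Parquet embeds column names and types, so CTAS inherits a typed schema.
spark.sql("""
    CREATE TABLE sales_from_parquet AS
    SELECT * FROM parquet.`/mnt/landing/sales_parquet/`
""")

# CSV has no embedded schema; read this way, columns default to
# generated names (_c0, _c1, ...) with string types.
spark.sql("""
    CREATE TABLE sales_from_csv AS
    SELECT * FROM csv.`/mnt/landing/sales_csv/`
""")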
Question #18:

A data analyst has a series of queries in a SQL program. The data analyst wants this program to run every day. They only want the final query in the program to run on Sundays. They ask for help from the data engineering team to complete this task.

Which of the following approaches could be used by the data engineering team to complete this task?

Options:

A. They could submit a feature request with Databricks to add this functionality.

B. They could wrap the queries using PySpark and use Python's control flow system to determine when to run the final query.

C. They could only run the entire program on Sundays.

D. They could automatically restrict access to the source table in the final query so that it is only accessible on Sundays.

E. They could redesign the data model to separate the data used in the final query into a new table.


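For context on option B: once the queries are wrapped in PySpark, plain Python control flow can gate the final query on the day of the week. A minimal sketch with placeholder queries:

from datetime import date

# Queries that run every day (placeholder SQL).
spark.sql("SELECT 1 AS daily_placeholder")

# weekday() numbers Monday as 0, so Sunday is 6.
if date.today().weekday() == 6:
    # The final query, run only on Sundays (placeholder SQL).
    spark.sql("SELECT 1 AS sunday_placeholder")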
Question #19:

In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?

Options:

A. When another task needs to be replaced by the new task

B. When another task needs to fail before the new task begins

C. When another task has the same dependency libraries as the new task

D. When another task needs to use as little compute resources as possible

E. When another task needs to successfully complete before the new task begins


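For context: in a Databricks Job definition, a task's depends_on list names the task keys that must complete successfully before the task starts. A sketch of the relevant fragment of a Jobs API payload, with hypothetical task names and notebook paths:

# transform waits for ingest to complete successfully before it begins.
job_payload = {
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # the Depends On field
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
        },
    ]
}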
Question #20:

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables.

Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

Options:

A. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INNER JOIN SELECT * FROM april_transactions;

B. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION SELECT * FROM april_transactions;

C. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
OUTER JOIN SELECT * FROM april_transactions;

D. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
INTERSECT SELECT * FROM april_transactions;

E. CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
MERGE SELECT * FROM april_transactions;


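For reference, UNION combines two result sets and removes duplicate rows, while UNION ALL would keep them. A self-contained sketch with hypothetical stand-ins for the monthly tables:

# Hypothetical stand-ins for the two monthly tables.
spark.createDataFrame([(1, 10.0)], ["txn_id", "amount"]) \
    .createOrReplaceTempView("march_transactions")
spark.createDataFrame([(2, 20.0)], ["txn_id", "amount"]) \
    .createOrReplaceTempView("april_transactions")

# UNION deduplicates the combined rows (UNION ALL would not).
spark.sql("""
    CREATE TABLE all_transactions AS
    SELECT * FROM march_transactions
    UNION
    SELECT * FROM april_transactions
""")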