Google Professional-Data-Engineer Exam Questions Free Practice Test

Viewing page 3 out of 6 pages

Viewing questions 21-30 out of questions

Questions # 21:

You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?

Options:

Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.

Log debug information in each ParDo function, and analyze the logs at execution time.

Insert output sinks after each key processing step, and observe the writing throughput of each block.

Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks.

Expert Solution

Answer

Explanation

When Dataflow fuses multiple transformations into a single stage (step), it can make it harder to pinpoint which specific part of that fused stage is causing a bottleneck because internal metrics for individual ParDos within the fused stage might not be as distinct.

Reshuffle operation (Option A): Inserting a Reshuffle (or a GroupByKey followed by ungrouping, which forces a shuffle) between logical processing steps in your Beam pipeline prevents Dataflow from fusing those steps. A shuffle acts as a barrier to fusion: it materializes the intermediate PCollection and forces data to be redistributed across workers.

Benefit for debugging: With fusion broken, the Dataflow monitoring UI displays distinct steps for the operations before and after each Reshuffle. You can then observe metrics such as processing time, throughput, and watermarks for each separated step, which makes it much easier to identify which part of the original fused logic is the bottleneck.

Let's analyze why the other options are less effective for this specific problem of a fused step:

B (Log debug information): Logging can be helpful, but if the entire fused step is slow, logs might not easily distinguish which internal operation is the culprit without very careful and verbose logging. Analyzing potentially massive volumes of logs for performance bottlenecks is less direct than observing stage metrics in the Dataflow UI once fusion is broken.

C (Insert output sinks): Adding actual output sinks (such as writing to Pub/Sub or Cloud Storage) after each key step would also break fusion and let you measure throughput. However, it is a more heavyweight approach than Reshuffle: it introduces I/O overhead and requires setting up and managing these temporary sinks. Reshuffle is a lighter-weight way to achieve the same goal of breaking fusion for diagnostic purposes within the pipeline itself.

D (Verify service account permissions): While important for overall pipeline health, permission issues usually result in outright failures or errors in logs, not in a slowdown within a successfully running (albeit slow) fused step.

Using Reshuffle is a standard technique recommended by Google Cloud for debugging performance issues in fused Dataflow stages.
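The effect of a shuffle barrier on fusion can be pictured with a small toy model. The sketch below is not Dataflow's optimizer and uses no Beam APIs; the `fused_stages` helper and the step names are hypothetical, chosen only to illustrate how a Reshuffle between steps splits one fused stage into several observable ones. In the Beam Python SDK, the corresponding transform is `beam.Reshuffle()`.

```python
from typing import List

# Hypothetical marker for a fusion-breaking step. In a real Beam Python
# pipeline this role is played by the beam.Reshuffle() transform.
RESHUFFLE = "Reshuffle"


def fused_stages(steps: List[str]) -> List[List[str]]:
    """Group consecutive steps into stages, starting a new stage at each Reshuffle.

    Toy model only: the real Dataflow optimizer weighs many more factors.
    """
    stages: List[List[str]] = [[]]
    for step in steps:
        if step == RESHUFFLE:
            # A shuffle materializes the intermediate data, so fusion
            # cannot continue across it.
            if stages[-1]:
                stages.append([])
        else:
            stages[-1].append(step)
    # Drop any empty stage left by a leading or trailing Reshuffle.
    return [stage for stage in stages if stage]


# Hypothetical step names for illustration.
original = ["Parse", "Enrich", "Format"]
diagnostic = ["Parse", RESHUFFLE, "Enrich", RESHUFFLE, "Format"]

print(fused_stages(original))    # a single stage holding all three steps
print(fused_stages(diagnostic))  # three stages, one per step
```

In the diagnostic variant each step would surface with its own metrics, which is what makes the slow one visible. Because each Reshuffle adds shuffle cost, it is reasonable to remove the extra barriers once the bottleneck has been found.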

[Reference: Google Cloud Documentation: Dataflow > Troubleshooting Dataflow pipelines > Common Dataflow errors and troubleshooting steps > Pipeline is slow or stuck. "Break transform fusion: Certain transforms in your pipeline might be fused together into a single stage for optimization. If a particular fused stage is causing a bottleneck, you can temporarily add Reshuffle transforms between the fused transforms to break them into smaller, separate stages. This allows you to get more visibility into the performance of each individual transform and isolate the bottleneck." Apache Beam Documentation: Programming Guide > Pipeline I/O > Reshuffle. "Reshuffle can be used to prevent fusion, and ensure that data is materialized and redistributed." (While the primary purpose of Reshuffle is often related to data distribution and freshness, a side effect and common use case is to break fusion for monitoring and debugging.)]

Questions # 22:

You migrated your on-premises Apache Hadoop Distributed File System (HDFS) data lake to Cloud Storage. The data scientist team needs to process the data by using Apache Spark and SQL. Security policies need to be enforced at the column level. You need a cost-effective solution that can scale into a data mesh. What should you do?

Options:

1. Load the data to BigQuery tables. 2. Create a taxonomy of policy tags in Data Catalog. 3. Add policy tags to columns. 4. Process with the Spark-BigQuery connector or BigQuery SQL.

1. Deploy a long-living Dataproc cluster with Apache Hive and Ranger enabled. 2. Configure Ranger for column-level security. 3. Process with Dataproc Spark or Hive SQL.

1. Apply an Identity and Access Management (IAM) policy at the file level in Cloud Storage. 2. Define a BigQuery external table for SQL processing. 3. Use Dataproc Spark to process the Cloud Storage files.

1. Define a BigLake table. 2. Create a taxonomy of policy tags in Data Catalog. 3. Add policy tags to columns. 4. Process with the Spark-BigQuery connector or BigQuery SQL.

Expert Solution

Answer

Explanation

The key requirements are:

Data on Cloud Storage (migrated from HDFS).

Processing with Spark and SQL.

Column-level security.

Cost-effective and scalable for a data mesh.

Let's analyze the options:

Option A (Load to BigQuery tables, policy tags, Spark-BQ connector/BQ SQL):

Pros: BigQuery native tables offer excellent performance. Policy tags provide robust column-level security managed centrally in Data Catalog. The Spark-BigQuery connector allows Spark to read from/write to BigQuery. BigQuery SQL is powerful. Scales well.

Cons: "Loading" the data into BigQuery means moving it from Cloud Storage into BigQuery's managed storage. This incurs storage costs in BigQuery and an ETL step. While effective, it might not be the most "cost-effective" if the goal is to query data in place on Cloud Storage, especially for very large datasets.

Option B (Long-living Dataproc, Hive, Ranger):

Pros: Provides a Hadoop-like environment with Spark, Hive, and Ranger for column-level security.

Cons: "Long-living Dataproc cluster" is generally not the most cost-effective, as you pay for the cluster even when idle. Managing Hive and Ranger adds operational overhead. While scalable, it requires more infrastructure management than serverless options.

Option C (IAM at file level, BQ external table, Dataproc Spark):

Pros: Using Cloud Storage is cost-effective for storage. BigQuery external tables allow SQL access.

Cons: IAM at the file level in Cloud Storage does not provide column-level security. This option fails to meet a critical requirement.

Option D (Define a BigLake table, policy tags, Spark-BQ connector/BQ SQL):

Pros: BigLake tables: These tables allow you to query data in open formats (like Parquet, ORC) on Cloud Storage as if it were a native BigQuery table, but without ingesting the data into BigQuery's managed storage. This is highly cost-effective for storage.

Column-Level Security with Policy Tags: BigLake tables integrate with Data Catalog policy tags to enforce fine-grained column-level security on the data residing in Cloud Storage. This is a centralized and robust security model.

Spark and SQL Access: Data scientists can use BigQuery SQL directly on BigLake tables. The Spark-BigQuery connector can also be used to access BigLake tables, enabling Spark processing.

Cost-Effective & Scalable Data Mesh: This approach leverages the cost-effectiveness of Cloud Storage, the serverless querying power and security features of BigQuery/Data Catalog, and provides a clear path to building a data mesh by allowing different domains to manage their data in Cloud Storage while exposing it securely through BigLake.

Cons: Performance for BigLake tables might be slightly different than BigQuery native storage for some workloads, but it's designed for high performance on open formats.

Why D is superior for this scenario:

BigLake tables (Option D) directly address the need to keep data in Cloud Storage (cost-effective for a data lake) while providing strong, centrally managed column-level security via policy tags and enabling both SQL (BigQuery) and Spark (via Spark-BigQuery connector) access. This is more aligned with modern data lakehouse and data mesh architectures than loading everything into native BigQuery storage (Option A) if the data is already in open formats on Cloud Storage, or managing a full Hadoop stack on Dataproc (Option B).
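The first and third steps of Option D can be sketched as follows. This is a minimal illustration, not a deployment script: the project, dataset, connection, bucket, and policy tag names are all hypothetical placeholders, and the helper functions are invented for this example. The DDL string follows BigQuery's `CREATE EXTERNAL TABLE ... WITH CONNECTION` form used for BigLake tables, and the column dictionary follows the `policyTags` shape of a BigQuery table schema field.

```python
import json


def biglake_table_ddl(table: str, connection: str, uri: str, fmt: str = "PARQUET") -> str:
    """Build a CREATE EXTERNAL TABLE statement that uses a connection (a BigLake table)."""
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = ['{uri}']);"
    )


def tagged_column(name: str, col_type: str, policy_tag: str) -> dict:
    """Build a schema field that carries one policy tag for column-level security."""
    return {"name": name, "type": col_type, "policyTags": {"names": [policy_tag]}}


# All identifiers below are hypothetical placeholders.
ddl = biglake_table_ddl(
    table="example-project.lake.customers",
    connection="example-project.us.lake-connection",
    uri="gs://example-lake-bucket/customers/*.parquet",
)
column = tagged_column(
    name="email",
    col_type="STRING",
    policy_tag="projects/example-project/locations/us/taxonomies/123/policyTags/456",
)

print(ddl)
print(json.dumps(column, indent=2))
```

The data stays in Cloud Storage throughout; only the table definition and the tag on the `email` column live in BigQuery, which is why the same access rule applies whether the column is read through BigQuery SQL or through the Spark-BigQuery connector.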

[Reference: Google Cloud Documentation: BigLake > Overview. "BigLake lets you unify your data warehouses and data lakes. BigLake tables provide fine-grained access control for tables based on data in Cloud Storage, while preserving access through other Google Cloud services like BigQuery, GoogleSQL, Spark, Trino, and TensorFlow." Google Cloud Documentation: BigLake > Introduction to BigLake tables. "BigLake tables bring BigQuery features to your data in Cloud Storage. You can query external data with fine-grained security (including row-level and column-level security) without needing to move or duplicate data." Google Cloud Documentation: Data Catalog > Overview of policy tags. "You can use policy tags to enforce column-level access control for BigQuery tables, including BigLake tables." Google Cloud Blog: "Announcing BigLake – Unifying data lakes and warehouses" (and similar articles) highlight how BigLake enables querying data in place on Cloud Storage with BigQuery's governance features.]

Questions # 23:

You need to create a near real-time inventory dashboard that reads the main inventory tables in your BigQuery data warehouse. Historical inventory data is stored as inventory balances by item and location. You have several thousand updates to inventory every hour. You want to maximize performance of the dashboard and ensure that the data is accurate. What should you do?

Options:

Leverage BigQuery UPDATE statements to update the inventory balances as they are changing.

Partition the inventory balance table by item to reduce the amount of data scanned with each inventory update.

Use BigQuery streaming to stream changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.

Use the BigQuery bulk loader to batch load inventory changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.


Questions # 24:

You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure. What should you do?

Options:

Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.

Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.

Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.

Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.


Questions # 25:

Your team is building a data lake platform on Google Cloud. As a part of the data foundation design, you are planning to store all the raw data in Cloud Storage. You are expecting to ingest approximately 25 GB of data a day, and your billing department is worried about the increasing cost of storing old data. The current business requirements are:

• The old data can be deleted anytime

• You plan to use the visualization layer for current and historical reporting

• The old data should be available instantly when accessed

• There should not be any charges for data retrieval.

What should you do to optimize for cost?

Options:

Create the bucket with the Autoclass storage class feature.

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 90 days to coldline, and 365 days to archive storage class. Delete old data as needed.

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to coldline, 90 days to nearline, and 365 days to archive storage class. Delete old data as needed.

Create an Object Lifecycle Management policy to modify the storage class for data older than 30 days to nearline, 45 days to coldline, and 60 days to archive storage class. Delete old data as needed.


Questions # 26:

You are implementing a chatbot to help an online retailer streamline their customer service. The chatbot must be able to respond to both text and voice inquiries. You are looking for a low-code or no-code option, and you want to be able to easily train the chatbot to provide answers to keywords. What should you do?

Options:

Use the Speech-to-Text API to build a Python application in App Engine.

Use the Speech-to-Text API to build a Python application in a Compute Engine instance.

Use Dialogflow for simple queries and the Speech-to-Text API for complex queries.

Use Dialogflow to implement the chatbot, defining the intents based on the most common queries collected.


Questions # 27:

You want to optimize your queries for cost and performance. How should you structure your data?

Options:

Partition table data by create_date, location_id, and device_version.

Partition table data by create_date; cluster table data by location_id and device_version.

Cluster table data by create_date, location_id, and device_version.

Cluster table data by create_date; partition by location_id and device_version.


Questions # 28:

You have a data analyst team member who needs to analyze data by using BigQuery. The data analyst wants to create a data pipeline that would load 200 CSV files with an average size of 15MB from a Cloud Storage bucket into BigQuery daily. The data needs to be ingested and transformed before being accessed in BigQuery for analysis. You need to recommend a fully managed, no-code solution for the data analyst. What should you do?

Options:

Create a Cloud Run function and schedule it to run daily using Cloud Scheduler to load the data into BigQuery.

Use the BigQuery Data Transfer Service to load files from Cloud Storage to BigQuery, create a BigQuery job which transforms the data using BigQuery SQL and schedule it to run daily.

Build a custom Apache Beam pipeline and run it on Dataflow to load the file from Cloud Storage to BigQuery and schedule it to run daily using Cloud Composer.

Create a pipeline by using BigQuery pipelines and schedule it to load the data into BigQuery daily.

Expert Solution

Answer

Explanation

The requirements are for a daily scheduled load, ingest, and transformation, and specifically a fully managed, no-code solution.

Ingest (Load): The BigQuery Data Transfer Service (DTS) is the fully managed, serverless, and no-code solution for batch loading files (including CSV from Cloud Storage) into BigQuery on a schedule. This is the "ingest" part.

Transform: After loading the raw data into a staging table using DTS, the transformation can be done using BigQuery SQL. This transformation query can then be automated using a Scheduled Query in BigQuery, which is also a fully managed and no-code feature that runs on a schedule.

Fully Managed & No-Code: Both DTS for Cloud Storage and Scheduled Queries are native BigQuery features that are fully managed and configured through the console without requiring code, directly meeting the constraints.
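Although the analyst would enter these settings in the console rather than in code, it can help to see them written out. The sketch below is only an illustration of the two configurations and the daily volume involved: the bucket, dataset, table, and column names are hypothetical, the `scheduled_query` dictionary is an invented summary rather than an API payload, and the `params` keys are the ones commonly shown for Cloud Storage transfers, so verify them against the current Data Transfer Service documentation before relying on them.

```python
# Daily volume from the scenario: 200 CSV files averaging 15 MB each.
FILES_PER_DAY = 200
AVG_FILE_MB = 15
daily_mb = FILES_PER_DAY * AVG_FILE_MB

# Step 1 (ingest): a Cloud Storage transfer into a staging table.
# Bucket and table names are hypothetical placeholders.
transfer_config = {
    "data_source_id": "google_cloud_storage",
    "schedule": "every 24 hours",
    "params": {
        "data_path_template": "gs://example-bucket/incoming/*.csv",
        "destination_table_name_template": "staging_orders",
        "file_format": "CSV",
        "skip_leading_rows": "1",
    },
}

# Step 2 (transform): a scheduled query that cleans the staged rows.
# The columns and the cleanup logic are assumptions for illustration.
scheduled_query = {
    "schedule": "every 24 hours",
    "destination_table": "orders_clean",
    "query": (
        "SELECT order_id, TRIM(customer) AS customer, "
        "SAFE_CAST(amount AS NUMERIC) AS amount "
        "FROM `example_dataset.staging_orders`"
    ),
}

print(f"Daily load: {daily_mb} MB")
print(transfer_config["params"]["data_path_template"])
print(scheduled_query["query"])
```

At roughly 3 GB a day, the workload is small enough that a scheduled transfer followed by a scheduled query covers it without any custom pipeline.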

Correcting other options:

A (Cloud Run function): A Cloud Run function requires writing and maintaining custom code, which violates the no-code requirement.

C (Dataflow + Apache Beam + Cloud Composer): This is a powerful, highly scalable ETL solution, but it requires writing custom Apache Beam code and running a workflow orchestrator (Cloud Composer/Airflow). That violates the no-code requirement and, because the custom pipeline and the Composer environment both need maintenance, the fully managed requirement as well, even though Dataflow itself is serverless.

D (BigQuery pipelines): "BigQuery pipelines" is not a distinct, official product name in the Google Cloud documentation that fulfills a no-code scheduled ETL. The closest product is the combination of DTS and Scheduled Queries, as described in option B.

[Reference: Google Cloud Documentation on BigQuery Data Transfer Service and Scheduled Queries: "The BigQuery Data Transfer Service automates data movement into BigQuery on a scheduled, managed basis... The BigQuery Data Transfer Service supports loading data from Cloud Storage in one of the following formats: Comma-separated values (CSV)..." (Source: What is BigQuery Data Transfer Service? and Introduction to Cloud Storage transfers) "A scheduled query is a query that BigQuery automatically runs at regular intervals. When you configure a scheduled query, you specify the GoogleSQL SELECT statement to run, the destination table for the query results, and the frequency of the query." (Source: Scheduling queries) This combination delivers a fully managed, no-code ELT (Extract-Load-Transform) pipeline.]

Questions # 29:

Each analytics team in your organization is running BigQuery jobs in their own projects. You want to enable each team to monitor slot usage within their projects. What should you do?

Options:

Create a Stackdriver Monitoring dashboard based on the BigQuery metric query/scanned_bytes

Create a Stackdriver Monitoring dashboard based on the BigQuery metric slots/allocated_for_project

Create a log export for each project, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric

Create an aggregated log export at the organization level, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric


Questions # 30:

MJTelco’s Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

Options:

The zone

The number of workers

The disk size per worker

The maximum number of workers

