Google Professional-Data-Engineer Exam Questions Free Practice Test

Viewing page 3 out of 7 pages

Viewing questions 21-30 out of questions

Questions # 21:

You are developing an Apache Beam pipeline to extract data from a Cloud SQL instance by using JdbcIO. You have two projects running in Google Cloud. The pipeline will be deployed and executed on Dataflow in Project A. The Cloud SQL instance is running in Project B and does not have a public IP address. After deploying the pipeline, you noticed that the pipeline failed to extract data from the Cloud SQL instance due to connection failure. You verified that VPC Service Controls and Shared VPC are not in use in these projects. You want to resolve this error while ensuring that the data does not go through the public internet. What should you do?

Options:

Set up VPC Network Peering between Project A and Project B. Add a firewall rule to allow the peered subnet range to access all instances on the network.

Turn off the external IP addresses on the Dataflow worker. Enable Cloud NAT in Project A.

Set up VPC Network Peering between Project A and Project B. Create a Compute Engine instance without external IP address in Project B on the peered subnet to serve as a proxy server to the Cloud SQL database.

Add the external IP addresses of the Dataflow worker as authorized networks in the Cloud SQL instance.

Questions # 22:

You have a data pipeline with a Dataflow job that aggregates and writes time series metrics to Bigtable. You notice that data is slow to update in Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. What should you do?

Choose 2 answers

Options:

Configure your Dataflow pipeline to use local execution.

Modify your Dataflow pipeline to use the Flatten transform before writing to Bigtable.

Modify your Dataflow pipeline to use the CoGroupByKey transform before writing to Bigtable.

Increase the maximum number of Dataflow workers by setting maxNumWorkers in PipelineOptions.

Increase the number of nodes in the Bigtable cluster.

Questions # 23:

Your startup has a web application that currently serves customers out of a single region in Asia. You are targeting funding that will allow your startup to serve customers globally. Your current goal is to optimize for cost, and your post-funding goal is to optimize for global presence and performance. You must use a native JDBC driver. What should you do?

Options:

Use Cloud Spanner to configure a single region instance initially, and then configure multi-region Cloud Spanner instances after securing funding.

Use a Cloud SQL for PostgreSQL highly available instance first, and Bigtable with US, Europe, and Asia replication after securing funding.

Use a Cloud SQL for PostgreSQL zonal instance first, and Bigtable with US, Europe, and Asia after securing funding.

Use a Cloud SQL for PostgreSQL zonal instance first, and Cloud SQL for PostgreSQL with highly available configuration after securing funding.

Questions # 24:

You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness. What should you do?

Options:

Create authorized datasets to publish shared data in the subscribing team's project.

Create a new dataset for sharing in each individual team's project. Grant the subscribing team the bigquery.dataViewer role on the dataset.

Use BigQuery Data Transfer Service to copy datasets to a centralized BigQuery project for sharing.

Use Analytics Hub to facilitate data sharing.

Expert Solution

Answer

Explanation

To provide a cost-effective storage and processing solution that allows data scientists to explore data similarly to using the on-premises HDFS cluster with SQL on the Hive query engine, deploying a Dataproc cluster is the best choice. Here’s why:

Compatibility with Hive:

Dataproc is a fully managed Apache Spark and Hadoop service that provides native support for Hive, making it easy for data scientists to run SQL queries on the data as they would in an on-premises Hadoop environment.

This ensures that the transition to Google Cloud is smooth, with minimal changes required in the workflow.

Cost-Effective Storage:

Storing the ORC files in Cloud Storage is cost-effective and scalable, providing a reliable and durable storage solution that integrates seamlessly with Dataproc.

Cloud Storage allows you to store large datasets at a lower cost compared to other storage options.

Hive Integration:

Dataproc supports running Hive directly, which is essential for data scientists familiar with SQL on the Hive query engine.

This setup enables the use of existing Hive queries and scripts without significant modifications.

Steps to Implement:

Copy ORC Files to Cloud Storage:

Transfer the ORC files from the on-premises HDFS cluster to Cloud Storage, ensuring they are organized in a similar directory structure.

Deploy Dataproc Cluster:

Set up a Dataproc cluster configured to run Hive. Ensure that the cluster has access to the ORC files stored in Cloud Storage.

Configure Hive:

Configure Hive on Dataproc to read from the ORC files in Cloud Storage. This can be done by setting up external tables in Hive that point to the Cloud Storage location.

Provide Access to Data Scientists:

Grant the data scientist team access to the Dataproc cluster and the necessary permissions to interact with the Hive tables.
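The "Configure Hive" step above can be sketched as a small Python helper that renders the HiveQL statement for an external ORC table. This is only an illustration under stated assumptions: the function name `hive_external_table_ddl`, the table name, the column list, and the `gs://example-bucket/...` path are all hypothetical and do not come from the original explanation.

```python
def hive_external_table_ddl(table, columns, location):
    """Render a HiveQL CREATE EXTERNAL TABLE statement for ORC files.

    table    -- hypothetical Hive table name
    columns  -- list of (column_name, hive_type) pairs
    location -- Cloud Storage directory that already holds the ORC files
    """
    if not location.startswith("gs://"):
        raise ValueError("location must be a gs:// URI")
    if not columns:
        raise ValueError("at least one column is required")

    # One "`name` TYPE" entry per column, separated by commas and newlines.
    column_lines = ",\n  ".join(
        f"`{name}` {hive_type}" for name, hive_type in columns
    )

    # EXTERNAL means Hive only records metadata; dropping the table
    # leaves the ORC files in Cloud Storage untouched.
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_lines}\n"
        f")\n"
        f"STORED AS ORC\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table, schema, and bucket path, for illustration only.
    print(
        hive_external_table_ddl(
            "sales",
            [("id", "BIGINT"), ("city", "STRING")],
            "gs://example-bucket/warehouse/sales/",
        )
    )
```

The rendered statement could then be run from a Hive client on the cluster. Because the table is external, the data stays in Cloud Storage and the cluster itself can remain disposable.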

Questions # 25:

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?

Options:

Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.

In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.

In Cloud SQL, with separate database user names for each user. The Cloud SQL Admin activity logs will be used to provide the auditability.

In a bucket on Cloud Storage that is accessible only by an App Engine service that collects user information and logs the access before providing a link to the bucket.

Questions # 26:

Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.

SELECT person FROM `project1.example.table1` WHERE city = "London"

How would you correct the error?

Options:

Add ", UNNEST(person)" before the WHERE clause.

Change "person" to "person.city".

Change "person" to "city.person".

Add ", UNNEST(city)" before the WHERE clause.

Questions # 27:

Which methods can be used to reduce the number of rows processed by BigQuery?

Options:

Splitting tables into multiple tables; putting data in partitions

Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause

Putting data in partitions; using the LIMIT clause

Splitting tables into multiple tables; using the LIMIT clause

Questions # 28:

To give a user read permission for only the first three columns of a table, which access control method would you use?

Options:

Primitive role

Predefined role

Authorized view

It's not possible to give access to only the first three columns of a table.

Questions # 29:

The Dataflow SDKs have been recently transitioned into which Apache service?

Options:

Apache Spark

Apache Hadoop

Apache Kafka

Apache Beam

Questions # 30:

Which of the following is NOT true about Dataflow pipelines?

Options:

Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner

Dataflow pipelines can consume data from other Google Cloud services

Dataflow pipelines can be programmed in Java

Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources
