Pass the Google Cloud Certified Professional-Data-Engineer questions and answers with CertsForce

Viewing page 2 of 6 (questions 11-20)
Question # 11:

You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?

Options:

A.

Cloud SQL


B.

Cloud Bigtable


C.

Cloud Spanner


D.

Cloud Datastore


Question # 12:

The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error. What should you do?

Options:

A.

Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit.


B.

Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console.


C.

Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job.


D.

Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.
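
If the staging-table approach in option D is taken, a MERGE-based variant of it can be sketched with the google-cloud-bigquery Python client as below; the bucket path, dataset, table, and column names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# Load the 1M-record CSV into a staging table: this is a load job, not a DML statement.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.load_table_from_uri(
    "gs://marketing-updates/customers.csv",      # hypothetical CSV location
    "my-project.crm.customer_updates_staging",   # hypothetical staging table
    job_config=load_config,
).result()

# Apply all updates as a single MERGE, i.e. one DML statement per run.
merge_sql = """
MERGE `my-project.crm.customers` AS t
USING `my-project.crm.customer_updates_staging` AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.segment = s.segment
WHEN NOT MATCHED THEN
  INSERT (customer_id, segment) VALUES (s.customer_id, s.segment)
"""
client.query(merge_sql).result()
```

One load job plus one MERGE stays well inside BigQuery's quotas, regardless of how many rows the CSV contains.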


Question # 13:

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster’s local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

Options:

A.

Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.


B.

Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.


C.

Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.


D.

Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.


E.

Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.
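
As a rough illustration of the external-table route in option D, the sketch below submits a Hive job to an existing Dataproc cluster with the google-cloud-dataproc Python client; the region, cluster name, bucket, and table schema are assumptions.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
job_client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# An external table over the ORC files in Cloud Storage (served by the Cloud
# Storage connector), then a native table materialized in the cluster's Hive
# warehouse (HDFS by default on Dataproc) for local reads.
hive_queries = """
CREATE EXTERNAL TABLE events_ext (event_id STRING, event_ts TIMESTAMP, payload STRING)
STORED AS ORC
LOCATION 'gs://my-orc-bucket/events/';

CREATE TABLE events STORED AS ORC AS SELECT * FROM events_ext;
"""

job = {
    "placement": {"cluster_name": "hive-cluster"},  # hypothetical cluster
    "hive_job": {"query_list": {"queries": [hive_queries]}},
}
job_client.submit_job_as_operation(
    request={"project_id": "my-project", "region": region, "job": job}
).result()
```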


Question # 14:

You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do?

Options:

A.

Use bq load to load a batch of sensor data every 60 seconds.


B.

Use a Cloud Dataflow pipeline to stream data into the BigQuery table.


C.

Use the INSERT statement to insert a batch of data every 60 seconds.


D.

Use the MERGE statement to apply updates in batch every 60 seconds.
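
For the streaming route in option B, a minimal Apache Beam (Python SDK) pipeline run on Dataflow could look roughly like this; the project, bucket, Pub/Sub topic, and table names are placeholders, and the sensor payload is assumed to be JSON that already matches the table schema.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",                # hypothetical project
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadSensorData" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-readings")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:iot.sensor_readings",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```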


Question # 15:

You migrated your on-premises Apache Hadoop Distributed File System (HDFS) data lake to Cloud Storage. The data scientist team needs to process the data by using Apache Spark and SQL. Security policies need to be enforced at the column level. You need a cost-effective solution that can scale into a data mesh. What should you do?

Options:

A.

1. Load the data to BigQuery tables.

2. Create a taxonomy of policy tags in Data Catalog.

3. Add policy tags to columns.

4. Process with the Spark-BigQuery connector or BigQuery SQL.


B.

1. Deploy a long-living Dataproc cluster with Apache Hive and Ranger enabled.

2. Configure Ranger for column level security.

3. Process with Dataproc Spark or Hive SQL.


C.

1. Apply an Identity and Access Management (IAM) policy at the file level in Cloud Storage.

2. Define a BigQuery external table for SQL processing.

3. Use Dataproc Spark to process the Cloud Storage files.


D.

1. Define a BigLake table.

2. Create a taxonomy of policy tags in Data Catalog.

3. Add policy tags to columns.

4. Process with the Spark-BigQuery connector or BigQuery SQL.
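
Whichever of the BigQuery-based options is preferred, column-level security comes down to attaching Data Catalog policy tags to columns. A small sketch with the google-cloud-bigquery client follows; the table, column, and policy-tag resource names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.lake.customers")  # hypothetical table

# Resource name of an existing policy tag in a Data Catalog taxonomy (hypothetical IDs).
pii_tag = bigquery.PolicyTagList(
    names=["projects/my-project/locations/us/taxonomies/1234567890/policyTags/9876543210"]
)

# Rebuild the schema, attaching the policy tag to the sensitive column.
new_schema = []
for field in table.schema:
    if field.name == "email":  # hypothetical sensitive column
        field = bigquery.SchemaField(
            field.name, field.field_type, mode=field.mode, policy_tags=pii_tag
        )
    new_schema.append(field)

table.schema = new_schema
client.update_table(table, ["schema"])
```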


Question # 16:

You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec. How should you migrate this data to Cloud Storage?

Options:

A.

Use Transfer Appliance to copy the data to Cloud Storage


B.

Use gsutil cp -J to compress the content being uploaded to Cloud Storage


C.

Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage


D.

Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic
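
The deciding factor here is simple transfer arithmetic; a quick back-of-the-envelope check:

```python
# How long would 2 PB take over a 20 Mb/s link?
data_bits = 2 * 1000**5 * 8         # 2 PB (decimal) expressed in bits
link_bps = 20 * 1_000_000           # 20 megabits per second
seconds = data_bits / link_bps
print(seconds / (3600 * 24 * 365))  # roughly 25 years
```

At roughly 25 years of continuous transfer, neither compression nor bandwidth shaping closes the gap to a six-month window, which is the usual argument for an offline transfer.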


Question # 17:

An aerospace company uses a proprietary data format to store its flight data. You need to connect this new data source to BigQuery and stream the data into BigQuery. You want to efficiently import the data into BigQuery while consuming as few resources as possible. What should you do?

Options:

A.

Use a standard Dataflow pipeline to store the raw data in BigQuery and then transform the format later when the data is used.


B.

Write a shell script that triggers a Cloud Function that performs periodic ETL batch jobs on the new data source


C.

Use Apache Hive to write a Dataproc job that streams the data into BigQuery in CSV format


D.

Use an Apache Beam custom connector to write a Dataflow pipeline that streams the data into BigQuery in Avro format
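
To make the custom-connector idea in option D concrete, the sketch below shows the shape of a Beam (Python) decode step feeding BigQuery; decode_flight_record, the table name, and the sample fields are placeholders for the vendor-specific format.

```python
import apache_beam as beam


def decode_flight_record(raw_bytes):
    """Hypothetical stand-in for the vendor-specific parser; a real version
    would unpack the proprietary binary layout into a dict whose keys match
    the destination BigQuery schema."""
    return {"tail_number": "N123AB", "recorded_at": "2024-01-01T00:00:00Z"}


class DecodeFlightRecords(beam.DoFn):
    def process(self, raw_bytes):
        yield decode_flight_record(raw_bytes)


def write_flight_data(pcoll):
    """Attach the decode step and the BigQuery sink to an existing streaming
    PCollection (for example, one read from Pub/Sub)."""
    return (
        pcoll
        | "Decode" >> beam.ParDo(DecodeFlightRecords())
        | "ToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:telemetry.flight_data",  # hypothetical table
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```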


Question # 18:

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?

Options:

A.

Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.


B.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Cloud Storage bucket.


C.

Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.


D.

Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.
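
For the aggregated-sink approach in option D, a sketch against the Cloud Logging v2 API (Python client) is shown below; the organization ID, bucket, and sink name are placeholders, and restricting the audit project itself is a separate IAM step.

```python
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client

client = ConfigServiceV2Client()

# An organization-level sink with include_children=True captures BigQuery data
# access logs from every project under the organization.
sink = {
    "name": "bq-data-access-to-gcs",                             # hypothetical sink name
    "destination": "storage.googleapis.com/audit-logs-archive",  # hypothetical bucket
    "filter": (
        'log_id("cloudaudit.googleapis.com/data_access") '
        'AND protoPayload.serviceName="bigquery.googleapis.com"'
    ),
    "include_children": True,
}

client.create_sink(
    request={
        "parent": "organizations/123456789012",  # hypothetical organization ID
        "sink": sink,
        "unique_writer_identity": True,
    }
)
```

A retention or lifecycle rule on the destination bucket then covers the six-month requirement, and IAM on the audit project controls who can read the exported logs.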


Question # 19:

You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure. What should you do?

Options:

A.

Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.


B.

Create a Pub/Sub subscription that writes to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink them to the Pub/Sub topic.


C.

Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle Change Data Capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.


D.

Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.


Question # 20:

You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.

You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

Options:

A.

Redis


B.

HBase


C.

MySQL


D.

MongoDB


E.

Cassandra


F.

HDFS with Hive

