
Amazon Web Services AWS Certified Data Engineer - Associate (Data-Engineer-Associate) Practice Questions and Answers

Viewing page 6 of 9 (questions 51-60)
Question # 51:

A company wants to ingest streaming data into an Amazon Redshift data warehouse from an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster. A data engineer needs to develop a solution that provides low data access time and that optimizes storage costs.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.

Create an external schema that maps to the MSK cluster. Create a materialized view that references the external schema to consume the streaming data from the MSK topic.


B.

Develop an AWS Glue streaming extract, transform, and load (ETL) job to process the incoming data from Amazon MSK. Load the data into Amazon S3. Use Amazon Redshift Spectrum to read the data from Amazon S3.


C.

Create an external schema that maps to the streaming data source. Create a new Amazon Redshift table that references the external schema.


D.

Create an Amazon S3 bucket. Ingest the data from Amazon MSK. Create an event-driven AWS Lambda function to load the data from the S3 bucket to a new Amazon Redshift table.


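To make option A concrete, here is a minimal sketch of Redshift streaming ingestion from Amazon MSK, issued through the Redshift Data API. Every ARN, name, workgroup, and topic below is a hypothetical placeholder, not part of the question.

```python
import boto3

rsd = boto3.client("redshift-data")

# External schema that maps a Redshift database to the MSK cluster.
create_schema = """
CREATE EXTERNAL SCHEMA msk_events
FROM MSK
IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftMskAccessRole'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:us-east-1:111122223333:cluster/demo/xxxx';
"""

# Materialized view over the Kafka topic; AUTO REFRESH keeps it near real time,
# so data lands in Redshift-managed storage with low access latency.
create_mv = """
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT kafka_partition,
       kafka_offset,
       kafka_timestamp,
       JSON_PARSE(kafka_value) AS payload
FROM msk_events."clickstream-topic";
"""

for sql in (create_schema, create_mv):
    rsd.execute_statement(
        WorkgroupName="demo-workgroup",  # placeholder; use ClusterIdentifier for provisioned
        Database="dev",
        Sql=sql,
    )
```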
Question # 52:

A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.

Which solution will meet these requirements with the LEAST development effort?

Options:

A.

Use Amazon EMR and Apache Ranger.


B.

Use a Hive metastore on an EMR cluster.


C.

Use the AWS Glue Data Catalog.


D.

Use a metastore on an Amazon RDS for MySQL DB instance.


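To illustrate option C, the sketch below registers a Hive-style table definition in the AWS Glue Data Catalog with boto3 so that both Amazon EMR and Athena can resolve it; for bulk imports, AWS publishes Hive-metastore-to-Glue migration scripts in the aws-glue-samples repository. All names and the S3 location are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

glue.create_database(DatabaseInput={"Name": "central_catalog"})

# Table definition mirroring a Hive table: columns, Parquet SerDe, S3 location.
glue.create_table(
    DatabaseName="central_catalog",
    TableInput={
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "order_date", "Type": "date"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-bucket/warehouse/orders/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```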
Question # 53:

A company stores Apache Parquet files in an Amazon S3 data lake. The data lake receives thousands of files from multiple sources every hour. The files range in size from 50 KB to 100 KB.

The company is evaluating the implementation of Apache Iceberg tables for the data lake. The company is using AWS Glue Data Catalog as part of the evaluation. The company needs a solution to optimize query performance in Iceberg. The solution must ensure that Iceberg table performance does not degrade when more files are added over time.

Which solution will meet these requirements?

Options:

A.

Use an AWS Glue job to compact the files into a standard size of 512 MB at the end of each day. Run an AWS Glue crawler to update the Data Catalog.


B.

Configure the Data Catalog to automatically compact the files every minute.


C.

Configure Iceberg table properties to enable automatic compaction based on thresholds for file size and the number of files.


D.

Implement a partition strategy in Amazon S3. Run an AWS Glue crawler to update the Data Catalog every 5 minutes.


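As context for the compaction options: the AWS Glue Data Catalog exposes managed automatic compaction for Iceberg tables through its table-optimizer API, which rewrites small files into larger ones as thresholds are crossed. A minimal sketch, assuming hypothetical account, database, table, and role names (field casing follows the boto3 documentation):

```python
import boto3

glue = boto3.client("glue")

# Enable managed compaction so small hourly files are merged in the
# background and query planning does not degrade as files accumulate.
glue.create_table_optimizer(
    CatalogId="111122223333",
    DatabaseName="lake_db",
    TableName="events_iceberg",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::111122223333:role/GlueOptimizerRole",
        "enabled": True,
    },
)
```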
Question # 54:

A data engineer needs to onboard a new data producer into AWS. The data producer needs to migrate data products to AWS.

The data producer maintains many data pipelines that support a business application. Each pipeline must have service accounts and their corresponding credentials. The data engineer must establish a secure connection from the data producer's on-premises data center to AWS. The data engineer must not use the public internet to transfer data from the on-premises data center to AWS.

Which solution will meet these requirements?

Options:

A.

Instruct the new data producer to create Amazon Machine Images (AMIs) on Amazon Elastic Container Service (Amazon ECS) to store the code base of the application. Create security groups in a public subnet that allow connections only to the on-premises data center.


B.

Create an AWS Direct Connect connection to the on-premises data center. Store the service account credentials in AWS Secrets manager.


C.

Create a security group in a public subnet. Configure the security group to allow only connections from the CIDR blocks that correspond to the data producer. Create Amazon S3 buckets that contain presigned URLs that have one-day expiration dates.


D.

Create an AWS Direct Connect connection to the on-premises data center. Store the application keys in AWS Secrets Manager. Create Amazon S3 buckets that contain presigned URLs that have one-day expiration dates.


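The Secrets Manager half of options B and D can be sketched as follows; the Direct Connect link itself is a network resource, not an API call made by the pipeline. All names and credential values are hypothetical placeholders.

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# One secret per pipeline service account, managed centrally.
secrets.create_secret(
    Name="pipelines/orders-pipeline/service-account",
    Description="Service account for the orders ingestion pipeline",
    SecretString=json.dumps({"username": "svc_orders", "password": "EXAMPLE-ONLY"}),
)

# Pipelines retrieve credentials at run time instead of hardcoding them.
value = secrets.get_secret_value(SecretId="pipelines/orders-pipeline/service-account")
creds = json.loads(value["SecretString"])
```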
Question # 55:

A company has a data lake in Amazon S3. The company uses AWS Glue to catalog data and AWS Glue Studio to implement data extract, transform, and load (ETL) pipelines.

The company needs to ensure that data quality issues are checked every time the pipelines run. A data engineer must enhance the existing pipelines to evaluate data quality rules based on predefined thresholds.

Which solution will meet these requirements with the LEAST implementation effort?

Options:

A.

Add a new transform that is defined by a SQL query to each Glue ETL job. Use the SQL query to implement a ruleset that includes the data quality rules that need to be evaluated.


B.

Add a new Evaluate Data Quality transform to each Glue ETL job. Use Data Quality Definition Language (DQDL) to implement a ruleset that includes the data quality rules that need to be evaluated.


C.

Add a new custom transform to each Glue ETL job. Use the PyDeequ library to implement a ruleset that includes the data quality rules that need to be evaluated.


D.

Add a new custom transform to each Glue ETL job. Use the Great Expectations library to implement a ruleset that includes the data quality rules that need to be evaluated.


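To show what a DQDL ruleset from option B looks like, here is a minimal sketch that registers one against a catalog table; the Evaluate Data Quality transform in Glue Studio accepts the same DQDL inline. Ruleset, table, and column names are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Data Quality Definition Language: each rule is a threshold check that
# passes or fails when the pipeline runs.
dqdl = """
Rules = [
    IsComplete "order_id",
    ColumnValues "amount" > 0,
    RowCount > 1000
]
"""

glue.create_data_quality_ruleset(
    Name="orders-quality-rules",
    Ruleset=dqdl,
    TargetTable={"DatabaseName": "lake_db", "TableName": "orders"},
)
```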
Question # 56:

A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in the AWS Glue Data Catalog. The company needs to convert these files to Apache Parquet format and store them in an S3 bucket.

The solution requires a long-running workflow with 15 GiB of memory capacity to process the data concurrently, followed by a correlation process that begins only after the first two processes complete.

Which solution will meet these requirements?

Options:

A.

Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the workflow by using AWS Glue. Configure AWS Glue to begin the third process after the first two processes have finished.


B.

Use Amazon EMR to run each process in the workflow. Create an Amazon Simple Queue Service (Amazon SQS) queue to handle messages that indicate the completion of the first two processes. Configure an AWS Lambda function to process the SQS queue by running the third process.


C.

Use AWS Glue workflows to run the first two processes in parallel. Ensure that the third process starts after the first two processes have finished.


D.

Use AWS Step Functions to orchestrate a workflow that uses multiple AWS Lambda functions. Ensure that the third process starts after the first two processes have finished.


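A minimal sketch of the Glue workflow pattern from option C: two jobs fan out in parallel from a start trigger, and a conditional trigger gates the third job on both succeeding. Workflow and job names are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="csv-to-parquet-workflow")

# Start trigger fans out to the two conversion jobs in parallel.
glue.create_trigger(
    Name="start-conversions",
    WorkflowName="csv-to-parquet-workflow",
    Type="ON_DEMAND",
    Actions=[{"JobName": "convert_audience"}, {"JobName": "convert_ads"}],
)

# Conditional trigger starts the correlation job only after BOTH succeed.
glue.create_trigger(
    Name="start-correlation",
    WorkflowName="csv-to-parquet-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": "convert_audience", "State": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS", "JobName": "convert_ads", "State": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "correlate_results"}],
)
```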
Question # 57:

A data engineer needs to optimize the performance of a data pipeline that handles retail orders. Data about the orders is ingested daily into an Amazon S3 bucket.

The data engineer runs queries once each week to extract metrics from the orders data based on the order date for multiple date ranges. The data engineer needs an optimization solution that ensures the query performance will not degrade when the volume of data increases.

Which solution will meet these requirements?

Options:

A.

Partition the data based on order date. Use Amazon Athena to query the data.


B.

Partition the data based on order date. Use Amazon Redshift to query the data.


C.

Partition the data based on load date. Use Amazon EMR to query the data.


D.

Partition the data based on load date. Use Amazon Aurora to query the data.


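To make the partition-plus-Athena pattern from options A and B concrete, a minimal sketch: an order_date-partitioned table means each weekly query scans only the requested date range, regardless of total data volume. Bucket, database, and column names are hypothetical placeholders, and Parquet storage is assumed for illustration.

```python
import boto3

athena = boto3.client("athena")

# Table partitioned on the column the weekly queries filter by.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS retail.orders (
    order_id STRING,
    amount DOUBLE
)
PARTITIONED BY (order_date STRING)
STORED AS PARQUET
LOCATION 's3://example-bucket/orders/';
"""

# Athena prunes to the matching order_date partitions only.
query = """
SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
FROM retail.orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY order_date;
"""

for sql in (ddl, query):
    athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
```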
Question # 58:

A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour.

Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)

Options:

A.

Configure AWS Glue triggers to run the ETL jobs every hour.


B.

Use AWS Glue DataBrew to clean and prepare the data for analytics.


C.

Use AWS Lambda functions to schedule and run the ETL jobs every hour.


D.

Use AWS Glue connections to establish connectivity between the data sources and Amazon Redshift.


E.

Use the Redshift Data API to load transformed data into Amazon Redshift.


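Options A and D can be sketched together: a Glue connection that stores JDBC connectivity details for the sources, plus a scheduled trigger using Glue's cron syntax to run the job hourly. Connection details, credentials, and job names are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Glue connection holding JDBC connectivity details (option D).
glue.create_connection(
    ConnectionInput={
        "Name": "rds-orders-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://example-host:3306/orders",
            "USERNAME": "etl_user",
            "PASSWORD": "EXAMPLE-ONLY",
        },
    }
)

# Scheduled trigger runs the ETL job at the top of every hour (option A).
glue.create_trigger(
    Name="hourly-etl",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    StartOnCreation=True,
    Actions=[{"JobName": "rds-mongodb-to-redshift"}],
)
```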
Question # 59:

A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy instead of the refactor strategy.

Which solution will meet these requirements with the LEAST management overhead?

Options:

A.

Amazon Kinesis Data Streams


B.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned cluster


C.

Amazon Data Firehose


D.

Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless


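For context on option D, a minimal sketch of provisioning an MSK Serverless cluster with boto3, a natural replatform target because the application keeps its Kafka APIs while AWS manages capacity. Cluster name, subnet, and security-group IDs are hypothetical placeholders.

```python
import boto3

kafka = boto3.client("kafka")

kafka.create_cluster_v2(
    ClusterName="replatformed-kafka",
    Serverless={
        "VpcConfigs": [
            {
                "SubnetIds": ["subnet-0example1", "subnet-0example2"],
                "SecurityGroupIds": ["sg-0example"],
            }
        ],
        # MSK Serverless requires IAM-based SASL authentication for clients.
        "ClientAuthentication": {"Sasl": {"Iam": {"Enabled": True}}},
    },
)
```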
Question # 60:

A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.

Configure an AWS Lambda function to load data from the S3 bucket into a pandas DataFrame. Write a SQL SELECT statement on the DataFrame to query the required column.


B.

Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.


C.

Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.


D.

Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.


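A minimal sketch of the S3 Select approach from option B, pulling a single column straight out of a Parquet object with no infrastructure to manage; bucket, key, and column names are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="example-bucket",
    Key="data/orders.parquet",
    ExpressionType="SQL",
    Expression='SELECT s."order_id" FROM s3object s',
    InputSerialization={"Parquet": {}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; Records events carry the rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```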