
Amazon Web Services AWS Certified Generative AI Developer-Professional AIP-C01 Question # 18 Topic 2 Discussion

AIP-C01 Exam Topic 2 Question 18 Discussion:
Question #: 18
Topic #: 2

A publishing company is developing a chat assistant that uses a containerized large language model (LLM) that runs on Amazon SageMaker AI. The architecture consists of an Amazon API Gateway REST API that routes user requests to an AWS Lambda function. The Lambda function invokes a SageMaker AI real-time endpoint that hosts the LLM.

Users report uneven response times, and analytics show that many chats are abandoned after users wait 2 seconds for the first token. The company wants to ensure that p95 latency for interactive requests to the chat assistant is under 800 ms.

Which combination of solutions will meet this requirement? (Select TWO.)


A. Enable model preload upon container startup. Implement dynamic batching to process multiple user requests together in a single inference pass.

B. Select a larger GPU instance type for the SageMaker AI endpoint. Set the minimum number of instances to 0. Continue to perform per-request processing. Lazily load model weights on the first request.

C. Switch to a multi-model endpoint. Use lazy loading without request batching.

D. Set the minimum number of instances to greater than 0. Enable response streaming.

E. Switch to Amazon SageMaker Asynchronous Inference for all requests. Store requests in an Amazon S3 bucket. Set the minimum number of instances to 0.
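For reference, the response streaming mentioned in option D maps to the SageMaker runtime API `InvokeEndpointWithResponseStream`, which returns generated tokens as they are produced instead of waiting for the full completion. Below is a minimal client sketch, assuming a streaming-capable serving container (such as a TGI- or LMI-based image), a hypothetical endpoint name, and a TGI-style JSON request schema; adjust the payload to whatever your container expects.

```python
import boto3

# Hypothetical endpoint name for illustration only.
ENDPOINT_NAME = "chat-llm-endpoint"

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_with_response_stream(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    # Assumed TGI-style request body; the schema depends on the container.
    Body=b'{"inputs": "Summarize chapter one.", "parameters": {"max_new_tokens": 256}}',
)

# The body is an event stream: each PayloadPart carries a chunk of generated
# text, so the first token can reach the user long before generation finishes.
for event in response["Body"]:
    part = event.get("PayloadPart")
    if part and "Bytes" in part:
        print(part["Bytes"].decode("utf-8"), end="", flush=True)
```

Because abandonment in this scenario is tied to time-to-first-token, streaming the earliest chunks is what brings perceived latency under the 800 ms target even when the full answer takes longer to generate.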



Contribute your Thoughts:


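One way to reason about the options: B, C, and E each introduce a cold path before the first token (scale-to-zero instance launch, lazy weight loading, or an asynchronous S3-backed queue), which works against a sub-second p95 target. By contrast, A removes model load time from the request path and D keeps capacity warm while streaming early tokens, so A and D look like the intended pair. The "minimum number of instances greater than 0" in option D is configured through Application Auto Scaling by registering the endpoint variant with a MinCapacity of at least 1. A minimal sketch, assuming hypothetical endpoint and variant names and an illustrative invocations-per-instance target:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names for illustration only.
RESOURCE_ID = "endpoint/chat-llm-endpoint/variant/AllTraffic"

# MinCapacity >= 1 keeps an instance warm, so interactive requests never
# pay the cold-start penalty of launching an instance and loading weights.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out on sustained traffic so request queueing does not erode p95.
autoscaling.put_scaling_policy(
    PolicyName="chat-llm-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,  # illustrative invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```

Keeping MaxCapacity above MinCapacity lets the endpoint absorb bursts without ever dropping to zero warm instances, which is what distinguishes D from the scale-to-zero choices.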