Amazon Web Services AWS Certified Generative AI Developer-Professional AIP-C01 Question # 18 Topic 2 Discussion

AIP-C01 Exam Topic 2 Question 18 Discussion:

Question #: 18

Topic #: 2

A publishing company is developing a chat assistant that uses a containerized large language model (LLM) that runs on Amazon SageMaker AI. The architecture consists of an Amazon API Gateway REST API that routes user requests to an AWS Lambda function. The Lambda function invokes a SageMaker AI real-time endpoint that hosts the LLM.
Users report uneven response times. Analytics show that a high number of chats are abandoned after 2 seconds of waiting for the first token. The company wants a solution to ensure that p95 latency is under 800 ms for interactive requests to the chat assistant.
Which combination of solutions will meet this requirement? (Select TWO.)

Enable model preload upon container startup. Implement dynamic batching to process multiple user requests together in a single inference pass.

Select a larger GPU instance type for the SageMaker AI endpoint. Set the minimum number of instances to 0. Continue to perform per-request processing. Lazily load model weights on the first request.

Switch to a multi-model endpoint. Use lazy loading without request batching.

Set the minimum number of instances to greater than 0. Enable response streaming.

Switch to Amazon SageMaker Asynchronous Inference for all requests. Store requests in an Amazon S3 bucket. Set the minimum number of instances to 0.

Get Premium AIP-C01 Questions

Explanation

The correct answers are A and D because they directly reduce time-to-first-token and stabilize p95 latency for interactive, real-time chat workloads hosted on Amazon SageMaker AI real-time endpoints.

Option D addresses the biggest driver of uneven latency: cold starts and scale-to-zero behavior. By setting the minimum number of instances to greater than 0, the endpoint always has warm capacity and loaded runtime resources, eliminating the first-request penalty that causes users to wait multiple seconds. Enabling response streaming improves perceived latency by returning the first tokens as soon as they are generated rather than waiting for the complete response. This directly targets the abandonment problem described (users leaving after waiting for the first token).

Option A further improves p95 latency and throughput by removing model loading overhead during inference and improving GPU utilization. Preloading model weights during container startup ensures the model is ready before traffic arrives and avoids unpredictable on-demand weight loading. Dynamic batching increases efficiency by grouping compatible requests into a single inference pass, reducing per-request overhead and improving GPU saturation. When tuned properly for interactive workloads, batching can reduce tail latency while preserving responsiveness by enforcing small batch windows.

Option B makes latency worse because setting minimum instances to 0 and lazily loading weights guarantees cold-start delays and unpredictable first-token performance. Option C similarly increases cold-start behavior through lazy loading and offers no batching benefits. Option E is designed for non-interactive workloads and introduces queueing and storage latency, which conflicts with the 800 ms p95 requirement for interactive chat.

Therefore, A and D are the best combination to achieve consistently low p95 latency and fast first-token streaming for a SageMaker-hosted chat assistant.

Actual exam question for Amazon Web Services AIP-C01 exam by Lumen4873 at Feb 2, 2026, 9:10:43 PM

Contribute your Thoughts:

Chosen Answer: A B C D E
This is a voting comment (?). It is better to Upvote an existing comment if you don't have anything to add.

Pre-Winter Sale Limited Time 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: pass65

Amazon Web Services AWS Certified Generative AI Developer-Professional AIP-C01 Question # 18 Topic 2 Discussion

Amazon Web Services AWS Certified Generative AI Developer-Professional AIP-C01 Question # 18 Topic 2 Discussion

Correct Answer:

Options Selected by Other Users:

Contribute your Thoughts:

Pre-Winter Sale Limited Time 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: pass65

Amazon Web Services AWS Certified Generative AI Developer-Professional AIP-C01 Question # 18 Topic 2 Discussion

Amazon Web Services AWS Certified Generative AI Developer-Professional AIP-C01 Question # 18 Topic 2 Discussion

Correct Answer:

Options Selected by Other Users:

Contribute your Thoughts:

Awaiting moderator approval