Most generative AI language model pricing is based on token consumption, which measures the amount of text processed by the model. Tokens are sub-word units used internally by language models (for example, parts of words, whole words, or punctuation). When you send a prompt, the model consumes input tokens (your prompt + any system instructions + retrieved grounding context). When it generates a response, it consumes output tokens (the generated completion). Costs typically scale with the total input + output tokens processed, which is why long prompts, large grounding passages, and lengthy responses increase spend. This also explains why prompt optimization, response length limits, caching, and careful grounding are common cost-control techniques in enterprise solutions.
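To make the "costs scale with input + output tokens" point concrete, here is a minimal sketch of a per-request cost estimate. The per-1K-token prices below are hypothetical placeholders (real rates vary by provider and model), and `estimate_cost` is an illustrative helper, not a provider API:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float = 0.0005,
                  output_price_per_1k: float = 0.0015) -> float:
    """Estimate one request's cost from its token counts.

    Prices are hypothetical per-1K-token rates; providers typically
    charge output tokens at a higher rate than input tokens.
    """
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)

# Same response length, but the second request carries a long prompt
# plus grounding context -- so it costs more:
short_request = estimate_cost(input_tokens=200, output_tokens=300)
long_request = estimate_cost(input_tokens=4000, output_tokens=300)
print(f"short prompt: ${short_request:.5f}, long prompt: ${long_request:.5f}")
```

This is why trimming prompts and capping response length directly reduce spend: both shrink the token totals the bill is computed from.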
By contrast, “documents” is too coarse a unit (a document can be 1 page or 500 pages). “Requests” is not the primary unit for most LLM pricing models because request sizes vary dramatically. “Words” is not used because the model’s actual compute unit is tokens, and tokenization differs across languages and text patterns. Therefore, the most accurate completion is tokens.
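The gap between words and tokens can be illustrated with a common rule of thumb: English text averages roughly 4 characters per token. This is only an approximation, not a real tokenizer (actual sub-word tokenizers like BPE produce different counts per model), but it shows why a word count understates what the model actually bills:

```python
def rough_token_estimate(text: str) -> int:
    # Rough heuristic only: English text averages ~4 characters per
    # token in many sub-word vocabularies. Real tokenizers vary.
    return max(1, round(len(text) / 4))

sentence = "Tokenization differs across languages and text patterns."
word_count = len(sentence.split())
token_estimate = rough_token_estimate(sentence)
print(f"words: {word_count}, estimated tokens: {token_estimate}")
```

Note the token estimate is roughly double the word count here: long words like “Tokenization” typically split into several sub-word tokens, which is exactly why providers meter tokens rather than words.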