Incorrect usage calculation for Gemini models in stream mode #1736

Closed
@zahariash

Description

Usage for Gemini models in stream mode is calculated incorrectly: request_tokens appears to be multiplied by the number of streamed chunks, and as a result total_tokens is also too high.
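
For illustration, here is a minimal sketch of the accumulation pattern I suspect. This is not pydantic-ai's actual implementation: the ChunkUsage class and the per-chunk numbers are hypothetical, assuming Gemini repeats the full prompt token count on every streamed chunk and the client sums per-chunk usage.

from dataclasses import dataclass


@dataclass
class ChunkUsage:
    prompt_tokens: int      # assumed to carry the full prompt count on every chunk
    completion_tokens: int  # tokens emitted by this chunk only


# Four chunks mirroring the reproduction below; the numbers are illustrative.
chunks = [ChunkUsage(prompt_tokens=36, completion_tokens=8) for _ in range(4)]

# Suspected buggy accumulation: the prompt count is added once per chunk.
buggy_request_tokens = sum(c.prompt_tokens for c in chunks)  # 4 * 36 = 144, not 36

# Correct handling: the prompt count is identical on every chunk, so it
# should be taken once (e.g. from the final chunk) rather than summed.
correct_request_tokens = chunks[-1].prompt_tokens  # 36

# Response tokens genuinely accumulate across chunks, so summing them is fine.
response_tokens = sum(c.completion_tokens for c in chunks)  # 32

Under that assumption the streamed request count scales with the number of chunks, which matches the reproduction below.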

Example Code

import asyncio
from pydantic_ai import Agent


async def main():
    agent = Agent(
        "google-gla:gemini-2.0-flash",
    )

    prompt = """return only "word_1 word_2 word_3 word_4 word_5 word_6 word_7 word_8 word_9 word_10" """

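    # Non-streaming run: usage is reported correctly.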
    results = await agent.run(prompt)
    print(f"Run usage:\n {results.usage()}")

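    # Streaming run: request_tokens comes back inflated.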
    async with agent.run_stream(prompt) as results:
        chunks = len([chunk async for chunk in results.stream_text(debounce_by=None)])
        print(f"Stream run usage ({chunks} chunks):\n {results.usage()}")


asyncio.run(main())

# Run usage:
#  Usage(requests=1, request_tokens=36, response_tokens=32, total_tokens=68, details=None)
# Stream run usage (4 chunks):
#  Usage(requests=1, request_tokens=147, response_tokens=32, total_tokens=179, details=None)
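
With 4 chunks, the streamed request_tokens (147) is roughly 4 × the non-streamed value (4 × 36 = 144), consistent with the prompt count being added once per chunk.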

Python, Pydantic AI & LLM client version

Python 3.13.2
pydantic-ai 0.2.4
