Chat Renderer

Unified synchronous HTTP client for the Ollama API with connection pooling.

app/chat_renderer.py

Provides three interfaces:
  • ChatRenderer.render() – best-effort chat call that swallows errors and returns None on any failure (used by the Chat Translation page, matching mud_server behaviour).

  • ChatRenderer.generate() – same transport but raises on failure (used by the Character Description page, where the route handler maps each exception to an HTTPException).

  • ChatRenderer.list_models() – static helper that queries /api/tags and returns a sorted list of pulled model names.

Both generation methods use Ollama’s /api/chat endpoint with an OpenAI-style messages array (system + user roles), which is what the production MUD translation layer also uses. Because the lab exercises the same endpoint as production, any difference in model behaviour between /api/generate (flat prompt) and /api/chat (messages) surfaces during lab testing.

Connection pooling

HTTP connections are reused across requests via a module-level client pool keyed by (host, connect_timeout, read_timeout). This avoids the overhead of a fresh TCP handshake (plus TLS negotiation, when the host is served over HTTPS) on every call and prevents cold-start failures when Character B fires immediately after Character A. Call close_all_clients() at shutdown to release pooled connections cleanly.

Sync rationale

The lab’s route handlers are synchronous (FastAPI runs them in a thread-pool executor), so a blocking httpx call here does not stall the async event loop. Using an async client would require asyncio.run() or restructuring the handler, neither of which is worth the complexity for a single-user tool.
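The effect described above can be illustrated with the standard library alone. This sketch is an analogue, not FastAPI's own mechanism: it uses asyncio.to_thread (Python 3.9+) to run a blocking function in a worker thread, and blocking_call is a hypothetical stand-in for a blocking HTTP request. A concurrent ticker coroutine keeps advancing while the blocking call is in flight, which shows that the event loop is not stalled.

```python
import asyncio
import time


def blocking_call() -> str:
    """Hypothetical stand-in for a blocking HTTP request."""
    time.sleep(0.2)
    return "done"


async def main() -> tuple:
    ticks = 0

    async def ticker() -> None:
        # Counts event-loop iterations while the blocking call runs.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    # The blocking call runs in a worker thread, so the loop stays free.
    result = await asyncio.to_thread(blocking_call)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return result, ticks


result, ticks = asyncio.run(main())
```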

Request structure sent to Ollama

{
  "model": "<model-tag>",
  "stream": false,
  "keep_alive": "5m",
  "messages": [
    {"role": "system", "content": "<rendered system prompt>"},
    {"role": "user",   "content": "<ooc message>"}
  ],
  "options": {
    "temperature": <float>,
    "num_predict": <int>,
    "seed": <int>  // only when seed is not None
  }
}

The stream: false flag is required to get a single JSON response body rather than a series of newline-delimited chunks.

The keep_alive field tells Ollama how long to keep the model loaded in memory after responding (default "5m"). This prevents cold-start latency on back-to-back requests (e.g. Character A then Character B).
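A payload of this shape could be assembled as in the sketch below. The function name build_payload is hypothetical; the defaults mirror the ChatRenderer constructor defaults documented in this module, and the seed key is added only when a seed is supplied.

```python
import json
from typing import Any, Dict, Optional


def build_payload(model: str, system_prompt: str, user_message: str, *,
                  temperature: float = 0.7, max_tokens: int = 128,
                  seed: Optional[int] = None,
                  keep_alive: str = "5m") -> Dict[str, Any]:
    """Build a non-streaming /api/chat request body."""
    options: Dict[str, Any] = {
        "temperature": temperature,
        "num_predict": max_tokens,
    }
    if seed is not None:
        # Omitted entirely when no seed is given, so Ollama picks its own.
        options["seed"] = seed
    return {
        "model": model,
        "stream": False,  # single JSON body instead of NDJSON chunks
        "keep_alive": keep_alive,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "options": options,
    }


# Example: serialise a seeded request for inspection.
example_body = json.dumps(
    build_payload("gemma2:2b", "You are a bard.", "hello", seed=42)
)
```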

Environment variables

OLLAMA_HOST – Base URL of the Ollama server (default: http://localhost:11434).

Read once at import time so the value is consistent for the lifetime of the process.
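Reading the variable once at import time might look like the sketch below. The helper read_ollama_host and the constant DEFAULT_OLLAMA_HOST are hypothetical names used for illustration; only OLLAMA_HOST and its default value come from this documentation.

```python
import os
from typing import Mapping

DEFAULT_OLLAMA_HOST = "http://localhost:11434"


def read_ollama_host(env: Mapping[str, str] = os.environ) -> str:
    """Return OLLAMA_HOST from the given environment, or the default."""
    return env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST)


# Evaluated once when the module is imported; later changes to the
# environment do not affect this constant.
OLLAMA_HOST = read_ollama_host()
```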

app.chat_renderer.close_all_clients()

Close all pooled HTTP clients and clear the pool.

Call this during application shutdown to release TCP connections cleanly.

class app.chat_renderer.ChatRenderer(*, host, model, timeout_seconds=120.0, temperature=0.7, seed=None, max_tokens=128, keep_alive='5m')

Bases: object

Synchronous Ollama client that calls the /api/chat endpoint.

Requests reuse a shared httpx.Client from a module-level pool keyed by (host, connect_timeout, read_timeout). This enables HTTP Keep-Alive across calls and avoids cold-start latency when multiple requests target the same Ollama instance in quick succession.

Parameters:
  • host (str) – Ollama server base URL, e.g. 'http://localhost:11434'. A trailing slash is stripped automatically. /api/chat is appended internally.

  • model (str) – Ollama model tag, e.g. 'gemma2:2b'. Must match a model that has been pulled in Ollama.

  • timeout_seconds (float) – HTTP read timeout in seconds. Applies to waiting for the model to finish generating. Defaults to 120 s to accommodate slow hardware or large models. The connect timeout is always 10 s.

  • temperature (float) – Sampling temperature forwarded to Ollama’s options.temperature. 0.0 is deterministic (greedy decoding); higher values add randomness.

  • seed (int | None) – Optional integer forwarded to Ollama’s options.seed. When provided, Ollama uses this as the random seed for token sampling, which makes the output reproducible for the same input. When None, the seed key is omitted from the options object and Ollama chooses its own seed.

  • max_tokens (int) – num_predict ceiling for the generation. Ollama stops after this many tokens even if the model would continue.

  • keep_alive (str) – Duration string telling Ollama how long to keep the model loaded in memory after responding (e.g. "5m", "1h", "0" to unload immediately). Defaults to "5m" to prevent cold-start latency on back-to-back requests.
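The constructor parameters above can be summarised in a small settings object. This is an illustrative sketch only: RendererSettings, its properties, and CONNECT_TIMEOUT_SECONDS are hypothetical names, and using rstrip("/") for the trailing-slash handling is an assumption about how the stripping is done.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The documentation states the connect timeout is always 10 s.
CONNECT_TIMEOUT_SECONDS = 10.0


@dataclass(frozen=True)
class RendererSettings:
    """Mirror of the documented ChatRenderer parameters and defaults."""

    host: str
    model: str
    timeout_seconds: float = 120.0
    temperature: float = 0.7
    seed: Optional[int] = None
    max_tokens: int = 128
    keep_alive: str = "5m"

    @property
    def endpoint(self) -> str:
        # Assumption: trailing slashes are removed before /api/chat is appended.
        return self.host.rstrip("/") + "/api/chat"

    @property
    def timeouts(self) -> Tuple[float, float]:
        # (connect, read) pair; only the read timeout is configurable.
        return (CONNECT_TIMEOUT_SECONDS, self.timeout_seconds)
```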

__init__(*, host, model, timeout_seconds=120.0, temperature=0.7, seed=None, max_tokens=128, keep_alive='5m')

render(system_prompt, user_message)

POST to Ollama /api/chat and return the raw response content.

Builds the request payload, sends it to self._endpoint, and extracts the model’s response from data["message"]["content"].

The system_prompt and user_message are sent as separate entries in the messages array using the "system" and "user" roles respectively. This matches the format used by the production MUD translation layer.

No content-level validation is performed here; that is handled downstream by OutputValidator.

Parameters:
  • system_prompt (str) – Fully-rendered system prompt text. All {{placeholder}} variables should already have been substituted before this call.

  • user_message (str) – The OOC message (user turn). Sent verbatim as the "user" role message.

Returns:

The stripped message.content string on success, or None on any of the following failure conditions:

  • httpx.TimeoutException: Ollama took longer than timeout_seconds to respond.

  • httpx.ConnectError: Ollama is not reachable at the configured endpoint (wrong host, not running, firewall).

  • Any other exception: Unexpected HTTP or JSON parsing error.

All failure paths log a warning or error via the module logger. A None return causes the endpoint to report "fallback.api_error" in the translation result.

Return type:

str | None
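The error-swallowing contract of render() can be sketched without a network as shown below. The function render_or_none and its send argument are hypothetical, and the built-in TimeoutError and ConnectionError stand in for the httpx timeout and connect exceptions so that the example needs only the standard library.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


def render_or_none(send: Callable[[], Dict[str, Any]]) -> Optional[str]:
    """Call send() and return the stripped content, or None on any failure."""
    try:
        data = send()
        return data["message"]["content"].strip()
    except TimeoutError:
        # Stand-in for the timeout case.
        logger.warning("Ollama request timed out")
    except ConnectionError:
        # Stand-in for the unreachable-host case.
        logger.warning("Ollama is not reachable")
    except Exception:
        # Any other HTTP or JSON parsing problem.
        logger.exception("Unexpected error calling Ollama")
    return None
```

A caller then only has to check for None and fall back, rather than handling individual exception types.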

generate(system_prompt, user_message)

POST to /api/chat; return (text, usage). Raises on any failure.

Same payload structure as render(), but exceptions propagate to the caller instead of being caught. This matches the contract expected by the lab’s own /api/generate route handler (not to be confused with Ollama’s endpoint of the same name), which maps each exception type to an HTTPException.

Parameters:
  • system_prompt (str) – Fully-rendered system prompt text.

  • user_message (str) – The user turn (axis JSON string for description generation, or OOC message for chat).

Returns:

  • str – Stripped message.content.

  • dict – {"prompt_eval_count": int|None, "eval_count": int|None}

Return type:

tuple[str, dict]

Raises:
  • httpx.HTTPStatusError – Non-2xx response from Ollama.

  • httpx.TimeoutException – Request timed out.

  • ValueError – Response is missing the "message" key.
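The response handling implied by the return value and the ValueError above might look like this sketch. The function name parse_chat_response is hypothetical; it assumes the two token counts are read from the top level of Ollama's JSON response and default to None when absent.

```python
from typing import Any, Dict, Optional, Tuple


def parse_chat_response(
    data: Dict[str, Any],
) -> Tuple[str, Dict[str, Optional[int]]]:
    """Extract (text, usage) from a decoded /api/chat response body."""
    if "message" not in data:
        raise ValueError('Ollama response is missing the "message" key')
    text = data["message"]["content"].strip()
    usage = {
        # Token counts may be absent; report None in that case.
        "prompt_eval_count": data.get("prompt_eval_count"),
        "eval_count": data.get("eval_count"),
    }
    return text, usage
```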

static list_models(host=None)

Sorted model names from /api/tags. Returns [] on any error.

Parameters:

host (str | None) – Optional Ollama server base URL. When None, the module-level OLLAMA_HOST constant is used.

Returns:

Sorted list of model name strings, e.g. ["gemma2:2b", "llama3.2:1b"]. Returns an empty list if Ollama is unreachable or returns an error, allowing the frontend to degrade gracefully.

Return type:

list[str]
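The name extraction can be sketched as below, assuming the /api/tags response has the shape {"models": [{"name": ...}, ...]}. The function model_names_from_tags is a hypothetical helper that covers only the parsing step, not the HTTP request itself.

```python
from typing import Any, List


def model_names_from_tags(data: Any) -> List[str]:
    """Return sorted model names from a decoded /api/tags body, or []."""
    try:
        return sorted(entry["name"] for entry in data["models"])
    except (KeyError, TypeError):
        # Malformed or unexpected body: degrade to an empty list.
        return []
```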