Turn-Level Metadata Extraction for Conversational Agents
Abstract
We describe a paired-LLM setup in which a larger model (the primary conversationalist) conducts the dialogue while a smaller model (the extractor) derives compact, structured metadata from each turn in near real time. In our prototype, the primary model is dreamlabs_2 (<30B params; internal), and the extractor is Qwen3-8B. The extractor consumes both the inputs to and outputs from the primary model each turn (rolling context ≈ 6 messages for latency control) and emits metadata that improves downstream UX, orchestration, and embodiment. While we employed scene (setting/location) and pose tags to drive real-time character animation, the approach generalizes to broader conversational signals.
1. Motivation
Most chat systems reason well but present as disembodied text. Lightweight metadata (produced alongside, not within, the main conversation) enables believable agents (e.g., animation hooks), steadier dialogue management (e.g., intent/stance tracking), and safer or more context-aware behavior, without burdening the primary model or adding token overhead to every response.
2. System Summary
- Primary Conversational Model (dreamlabs_2, <30B): Handles all user-facing dialogue. It can optionally receive the current scene for location awareness, though we found this generally unnecessary.
- Turn-Side Extractor (Qwen3-8B): After each user→assistant exchange, receives the paired turn (user input + model reply) plus a short trailing history (~6 messages) to preserve recency while keeping latency low.
- Output: A compact, structured metadata object for the latest turn (no heavy rationale text), optimized for immediate consumption by animation, UI, memory, and policy layers.
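For concreteness, a bundle emitted for one turn might look like the sketch below; the field names, value vocabularies, and per-field confidence scores are illustrative assumptions rather than the deployment's exact schema.

```python
# Illustrative per-turn bundle; keys, value sets, and confidence fields are
# assumptions for exposition, not the production schema.
example_turn_metadata = {
    "scene": {"value": "dorm room", "confidence": 0.82},
    "pose": {"value": "sitting", "confidence": 0.74},
    "user_intent": "ask",
    "assistant_act": "explain",
    "needs_tooling": False,
    "memory_candidate": False,
}
```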
3. Metadata Scope (What We Extract)
Our extractor targets operationally useful signals rather than free-form summarization. The categories we found most valuable include:
- Scene / Setting (used in our deployment) - coarse location or environment implied by the dialogue (e.g., "dorm room," "office," "virtual/unspecified").
- Physical Pose (used in our deployment) - coarse body state suitable for animation state machines (e.g., "sitting," "standing," "walking," "unknown").
- Turn Intent & Act Type - e.g., user request class (ask/confirm/command), assistant act (explain/suggest/comply/escalate).
- Salient Entities & Topics (minimal) - only the top-N entities/themes needed for downstream routing.
- Temporal Anchors - mentions of time that ground the conversation (e.g., "later today," "tomorrow at 5").
- Dialogue Stance Signals - brief sentiment/arousal indicators useful for pacing or tone adaptation (kept coarse to avoid overreach).
- Tooling Hints (binary/lightweight) - whether the assistant appears to reference or require external tools/data next.
- Memory Candidates (flag only) - whether a snippet seems worth long-term storage under predefined rules.
- Safety/Policy Nudges (coarse) - presence of content that warrants a safer style or refusal path downstream (category only, not adjudication).
The extractor prioritizes precision and stability over coverage: it is allowed to output "unknown/none" when evidence is weak, avoiding hallucinated context.
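To make the precision-over-coverage rule concrete, the scope above can be expressed as a typed bundle whose fields all default to unknown/none, so weak evidence yields no claim rather than a guess. The names, value vocabularies, and threshold below are illustrative assumptions, not the deployment's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TurnMetadata:
    """Sketch of the metadata scope above; every field defaults to a non-claim."""
    scene: str = "unknown"                  # coarse setting, e.g. "dorm room", "office"
    pose: str = "unknown"                   # "sitting" | "standing" | "walking" | "unknown"
    user_intent: str = "unknown"            # ask / confirm / command
    assistant_act: str = "unknown"          # explain / suggest / comply / escalate
    entities: list[str] = field(default_factory=list)          # top-N only
    temporal_anchors: list[str] = field(default_factory=list)  # e.g. "tomorrow at 5"
    stance: str = "neutral"                 # coarse sentiment/arousal signal
    needs_tooling: bool = False             # lightweight tooling hint
    memory_candidate: bool = False          # flag only; storage rules live elsewhere
    safety_category: Optional[str] = None   # coarse category, not adjudication

def accept(value: str, confidence: float, threshold: float = 0.5) -> str:
    """Keep a proposed tag only when the extractor is reasonably confident."""
    return value if confidence >= threshold else "unknown"
```

Defaulting to non-claims mirrors the guardrail in Section 6: when a conversation provides no physical cues, the bundle simply carries "unknown" rather than a speculated scene or pose.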
4. Data Flow & Cadence
- Turn completes: user message → primary model reply.
- Extractor pass: Qwen3-8B reads the paired turn plus ~6 trailing messages and emits the metadata bundle with light confidence scores (a sketch of this pass follows the list).
- Consumption:
  - Embodiment: scene/pose feed real-time animation and ambience.
  - Orchestration: intent/stance/tooling flags steer UI affordances and agent behaviors.
  - Memory/Safety: candidates and coarse policy flags inform separate subsystems.
- Optional feedback: the primary model may receive the current scene for location awareness, but we found this rarely necessary to maintain response quality.
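A minimal sketch of the extractor pass under these assumptions: the trailing window is serialized into a single prompt, the 8B model is invoked through whatever serving interface is available (abstracted here as `call_extractor`), and missing or unparseable output falls back to non-claims. The prompt wording, key names, and window handling are illustrative, not the deployment's exact implementation.

```python
import json

# Fallback values used when the extractor's output is missing or unparseable;
# keys mirror the illustrative schema above and are assumptions, not the
# production contract.
DEFAULT_BUNDLE = {
    "scene": "unknown",
    "pose": "unknown",
    "user_intent": "unknown",
    "assistant_act": "unknown",
    "needs_tooling": False,
    "memory_candidate": False,
}

EXTRACTOR_INSTRUCTIONS = (
    "Read the conversation and return ONLY a JSON object with the keys "
    f"{sorted(DEFAULT_BUNDLE)}. Use \"unknown\" when the dialogue gives no evidence."
)

def extract_turn_metadata(history, user_msg, assistant_msg, call_extractor, window=6):
    """Run one extractor pass after a completed user -> assistant turn.

    `history` is the prior message list of {"role", "content"} dicts; only the
    trailing `window` messages are sent to keep latency low. `call_extractor`
    invokes the 8B model (e.g. a local OpenAI-compatible endpoint) and returns
    its raw text completion; it is deliberately left abstract here.
    """
    recent = history[-window:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    transcript += f"\nuser: {user_msg}\nassistant: {assistant_msg}"
    messages = [
        {"role": "system", "content": EXTRACTOR_INSTRUCTIONS},
        {"role": "user", "content": transcript},
    ]
    raw = call_extractor(messages)
    try:
        bundle = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return dict(DEFAULT_BUNDLE)  # never let a bad extraction block the turn
    # Fill missing keys with non-claims rather than speculating.
    return {key: bundle.get(key, default) for key, default in DEFAULT_BUNDLE.items()}
```

In this sketch, a slow or failed extraction degrades only that turn's tags, never the reply itself, consistent with keeping metadata separate from the main conversational flow.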
5. Observations
- The 8B extractor is sufficient for consistent, low-latency tagging when constrained to a short rolling context; it captures explicit and near-explicit cues reliably.
- Keeping the context to ~6 messages materially reduces latency without notable loss of tag quality for our categories.
- Separating metadata from the main reply avoids token-level overhead and keeps the primary model focused on conversation quality.
- Scene + pose alone yield outsized gains for perceived realism in agentic characters; the broader metadata further improves orchestration and UX polish.
6. Limitations & Scope Guardrails
- Many conversations provide no physical cues; the extractor must return unknown/none rather than speculate.
- Metadata is intentionally coarse and avoids sensitive inferences; categories are limited to environmental, structural, and operational signals.
- Long sessions can accumulate stale assumptions; periodic decay or refresh policies are advisable, though their design is outside the scope of this note.
7. Takeaway
A paired-LLM pattern (large model for conversation, small model for turn-level metadata) provides a pragmatic path to embodied, context-aware experiences within tight latency budgets. Our deployment used Qwen3-8B for extraction and dreamlabs_2 for dialogue: scene and pose powered real-time animation, while the additional tags improved routing, tone, and memory decisions, all without burdening the primary conversational flow.