$ cat metadata-extraction.txt

Turn-Level Metadata Extraction for Conversational Agents

Author: James, Dreamlabs
Keywords: conversational AI, metadata extraction, agent embodiment, latency-aware inference

Abstract

We describe a paired-LLM setup in which a larger model (the primary conversationalist) conducts the dialogue while a smaller model (the extractor) derives compact, structured metadata from each turn in near real time. In our prototype, the primary model is dreamlabs_2 (<30B params; internal), and the extractor is Qwen3-8B. The extractor consumes both the inputs to and outputs from the primary model each turn (rolling context ≈ 6 messages for latency control) and emits metadata that improves downstream UX, orchestration, and embodiment. While we employed scene (setting/location) and pose tags to drive real-time character animation, the approach generalizes to broader conversational signals.

1. Motivation

Most chat systems reason well but present as disembodied text. Lightweight metadata, produced alongside rather than within the main conversation, enables believable agents (e.g., animation hooks), steadier dialogue management (e.g., intent/stance tracking), and safer or more context-aware behavior, all without burdening the primary model or bloating every response with extra tokens.

2. System Summary

In our prototype, dreamlabs_2 (the primary conversationalist) conducts the dialogue while Qwen3-8B (the extractor) runs alongside it. After each turn, the extractor reads the paired user/assistant messages plus a rolling window of roughly six trailing messages and emits a compact metadata bundle with light confidence scores. Downstream consumers (embodiment, orchestration, and memory/safety subsystems) act on the bundle, leaving the primary model's prompt and latency budget essentially untouched.

3. Metadata Scope (What We Extract)

Our extractor targets operationally useful signals rather than free-form summarization. The categories we found most valuable include:

  1. Scene / Setting (used in our deployment) - coarse location or environment implied by the dialogue (e.g., "dorm room," "office," "virtual/unspecified").
  2. Physical Pose (used in our deployment) - coarse body state suitable for animation state machines (e.g., "sitting," "standing," "walking," "unknown").
  3. Turn Intent & Act Type - e.g., user request class (ask/confirm/command), assistant act (explain/suggest/comply/escalate).
  4. Salient Entities & Topics (minimal) - only top-N entities/themes necessary for downstream routing.
  5. Temporal Anchors - mentions of time that ground the conversation (e.g., "later today," "tomorrow at 5").
  6. Dialogue Stance Signals - brief sentiment/arousal indicators useful for pacing or tone adaptation (kept coarse to avoid overreach).
  7. Tooling Hints (binary/lightweight) - whether the assistant appears to reference or require external tools/data next.
  8. Memory Candidates (flag only) - whether a snippet seems worth long-term storage under predefined rules.
  9. Safety/Policy Nudges (coarse) - presence of content that warrants a safer style or refusal path downstream (category only, not adjudication).

The extractor prioritizes precision and stability over coverage: it is allowed to output "unknown/none" when evidence is weak, avoiding hallucinated context.
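To make the bundle concrete, the sketch below shows one plausible shape for the per-turn output as a Python dataclass. Field names, types, and defaults are illustrative assumptions rather than the exact schema of our deployment; the point is that every field admits an "unknown"/empty value so the extractor can decline rather than guess.

  # Illustrative per-turn metadata bundle (assumed field names and types).
  from dataclasses import dataclass, field
  from typing import Optional

  @dataclass
  class TurnMetadata:
      scene: str = "unknown"                 # e.g., "dorm room", "office", "virtual/unspecified"
      pose: str = "unknown"                  # e.g., "sitting", "standing", "walking"
      user_intent: str = "unknown"           # e.g., "ask", "confirm", "command"
      assistant_act: str = "unknown"         # e.g., "explain", "suggest", "comply", "escalate"
      entities: list = field(default_factory=list)          # top-N entities/themes only
      temporal_anchors: list = field(default_factory=list)  # e.g., "later today", "tomorrow at 5"
      stance: str = "neutral"                # coarse sentiment/arousal signal
      needs_tool: bool = False               # lightweight tooling hint
      memory_candidate: bool = False         # flag only; storage rules live elsewhere
      safety_category: Optional[str] = None  # coarse category only, not adjudication
      confidence: float = 0.0                # light overall confidence score

Defaulting every field to "unknown"/empty mirrors the precision-over-coverage rule: the extractor fills a slot only when the dialogue actually supports it.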

4. Data Flow & Cadence

  1. Turn completes: user message → primary model reply.
  2. Extractor pass: Qwen3-8B reads the paired turn plus ~6 trailing messages and emits the metadata bundle with light confidence scores.
  3. Consumption:
    • Embodiment: scene/pose feed real-time animation and ambience.
    • Orchestration: intent/stance/tooling flags steer UI affordances and agent behaviors.
    • Memory/Safety: candidates and coarse policy flags inform separate subsystems.
  4. Optional feedback: the primary model may receive the current scene for location awareness, but we found this rarely necessary to maintain response quality.
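For illustration, the following sketch wires the cadence above into a single per-turn function. Here call_primary and call_extractor are hypothetical wrappers around dreamlabs_2 and Qwen3-8B respectively (not our internal API), and the extractor is prompted to return JSON matching the bundle shape sketched in Section 3.

  import json

  ROLLING_WINDOW = 6  # trailing messages given to the extractor (latency control)

  def handle_turn(history, user_message, call_primary, call_extractor):
      # 1. Turn completes: user message -> primary model reply.
      history.append({"role": "user", "content": user_message})
      reply = call_primary(history)
      history.append({"role": "assistant", "content": reply})

      # 2. Extractor pass: paired turn plus ~6 trailing messages.
      window = history[-ROLLING_WINDOW:]
      prompt = (
          "Return a JSON metadata bundle (scene, pose, intent, entities, ...) "
          'for the latest turn. Use "unknown" or empty values when evidence is weak.\n'
          + json.dumps(window)
      )
      metadata = json.loads(call_extractor(prompt))

      # 3. Consumption: embodiment, orchestration, and memory/safety subsystems
      #    read the bundle; the primary conversational flow is not blocked on it.
      return reply, metadata

The rolling window bounds the extractor's input per turn, which is what keeps the pass near real time regardless of conversation length.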

5. Observations

6. Limitations & Scope Guardrails

7. Takeaway

A paired-LLM pattern (large model for conversation, small model for turn-level metadata) provides a pragmatic path to embodied, context-aware experiences with tight latency budgets. Our deployment used Qwen3-8B for extraction and dreamlabs_2 for dialogue, with scene and pose powering real-time animation and additional tags improving routing, tone, and memory decisions, all without burdening the primary conversational flow.
