Research

Making AI agents reliable enough to ship.

Our research team works on the layer above the model: the loops, evaluations, memory designs, and verification patterns that turn frontier reasoning into a teammate you can actually trust with real work.

What we focus on

The reliability problems behind every production agent

Verify-before-act

Catching wrong actions before they happen

Renderell agents propose an action, then verify it against a separate model and a set of hard constraints before executing. Internal evals show this cuts hallucinated tool calls by more than 75% in the workflows that matter.
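The shape of the loop can be sketched in a few lines. This is a minimal illustration, not Renderell's implementation: the tool names, constraint checks, and return values are all placeholders.

```python
# Verify-before-act: a proposed action passes through a verifier that
# enforces hard constraints; nothing executes until the check passes.

ALLOWED_TOOLS = {"lookup_order", "send_reply"}

def verify(action: dict) -> bool:
    """Reject actions that violate hard constraints."""
    if action["tool"] not in ALLOWED_TOOLS:
        return False
    if action["tool"] == "send_reply" and not action["args"].get("body"):
        return False
    return True

def act(action: dict, execute) -> str:
    """Run the action only if it verifies; otherwise escalate."""
    if not verify(action):
        return "escalated"  # blocked before any side effect occurs
    execute(action)
    return "executed"
```

The point of the pattern is that the verifier sits between the proposal and the side effect, so a hallucinated tool call is caught before it touches anything.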

Escalation under uncertainty

Knowing when to ask

An agent that always tries is dangerous. An agent that knows when to stop and ask is useful. We've built calibrated uncertainty estimation into the decision loop.
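In skeleton form, the decision loop compares a calibrated confidence score against a threshold and chooses between acting and asking. A hedged sketch only: the threshold value and function names are illustrative, not Renderell's actual calibration.

```python
# Escalation under uncertainty: below the threshold, the agent stops
# and asks a human instead of acting on a low-confidence guess.

ASK_THRESHOLD = 0.85  # placeholder value, tuned per workflow in practice

def decide(action: str, confidence: float, threshold: float = ASK_THRESHOLD):
    """Return ('act', action) when confident, else ('ask', question)."""
    if confidence >= threshold:
        return ("act", action)
    return ("ask", f"Unsure about: {action} (p={confidence:.2f})")
```

The threshold only works if the confidence score is calibrated: a 0.85 should mean the agent is right about 85% of the time, which is what makes the escalation decision trustworthy.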

Routing & cost

Right model per step

Cheaper models handle classification and lookups; frontier models handle planning and decisions. Our router picks per step — typical jobs run at 30% of single-model cost with higher accuracy.
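The routing idea reduces to a per-step dispatch table. In this illustrative sketch the model names, per-call costs, and step categories are invented placeholders, not real pricing.

```python
# Per-step model routing: cheap model for classification and lookups,
# frontier model for planning and decisions.

CHEAP = {"name": "small-model", "cost_per_call": 0.001}
FRONTIER = {"name": "frontier-model", "cost_per_call": 0.030}

CHEAP_STEPS = {"classify", "lookup", "extract"}

def route(step_kind: str) -> dict:
    """Pick the model for one step of the job."""
    return CHEAP if step_kind in CHEAP_STEPS else FRONTIER

def job_cost(steps: list) -> float:
    """Total cost of a job under per-step routing."""
    return sum(route(kind)["cost_per_call"] for kind in steps)
```

For a job that is mostly classification and lookups with a few planning steps, the routed cost lands well below running every step on the frontier model, which is where the quoted savings come from.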

Long-horizon memory

Agents that remember what you taught them

Per-user, per-account, per-workflow memory. Preferences, glossaries, prior decisions, and edge cases persist — so the agent compounds in usefulness instead of resetting every session.
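The key design choice is the scope of the memory key. A minimal sketch, assuming a simple in-process key-value store keyed by (user, workflow) so facts never bleed across users; the class and field names are illustrative.

```python
# Scoped agent memory: every fact is stored under a (user, workflow)
# key, so one user's preferences can never leak into another's session.

class ScopedMemory:
    def __init__(self):
        self._store = {}

    def remember(self, user: str, workflow: str, key: str, value: str):
        """Persist a fact within one user's workflow scope."""
        self._store.setdefault((user, workflow), {})[key] = value

    def recall(self, user: str, workflow: str, key: str):
        """Retrieve a fact; None if it was never taught in this scope."""
        return self._store.get((user, workflow), {}).get(key)
```

A production store would add persistence, retrieval, and expiry, but the scoping rule is the part that keeps memory from resetting every session while still isolating tenants.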

Tool-use reliability

Typed connectors over freeform APIs

Typed tools with schema-checked arguments fail loudly instead of silently. Idempotency keys and dry-runs make actions safe to retry without double-billing your customer or double-filing your ticket.

Continuous evaluation

The eval set is the product

We treat task-specific evals as a first-class artifact. Every customer workflow gets its own eval set; every model upgrade is gated by it. No silent regressions.
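The gating rule itself is simple to state: an upgrade ships only if it matches or beats the current pass rate on every workflow's eval set. A sketch under that assumption; the data shapes and names are illustrative.

```python
# Eval-gated upgrades: block a model change if any customer workflow's
# eval set regresses, regardless of how the others improve.

def pass_rate(results: list) -> float:
    """Fraction of eval cases passed (results are 1/0 per case)."""
    return sum(results) / len(results)

def gate_upgrade(baseline: dict, candidate: dict) -> bool:
    """Ship only if no workflow's pass rate drops below baseline."""
    return all(
        pass_rate(candidate[wf]) >= pass_rate(baseline[wf])
        for wf in baseline
    )
```

The per-workflow quantifier is what "no silent regressions" means in practice: an upgrade that improves the average but breaks one customer's workflow is still blocked.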

Why agents need infrastructure

A great model isn't a great agent.

Models keep getting smarter. The reliability gap between "impressive demo" and "production teammate" hasn't closed at the same pace — because that gap is filled by infrastructure, not intelligence.

You need a reasoning loop that doesn't double-act. Memory that doesn't bleed across users. Tools that fail loudly. Approvals that route to the right human. Rollback when an action lands wrong. We've spent years on each of these, separately and together — and Renderell is the system that wires them up.

Reliability is the moat

The frontier model is rented from the labs. The reliability layer is built by us — and it's what customers actually pay for.

Smaller specialists win

One mega-prompt is brittle. A graph of small renderers, each tuned to one task, beats it across every quality and cost axis.

Audits are a feature

If an agent did something, you should be able to see what, why, and how to undo it — in one click, not one investigation.

Recent notes

From the research team

Verify-before-act loops

How we cut hallucinated tool calls by 78% on our internal eval set — and why that number is misleading without the right benchmark.

Routing under cost constraints

Picking the right model per step is harder than picking the cheapest. A discussion of what we tried and what stuck.

Memory designs for agents

Notes on per-user vs per-workflow memory, fact extraction vs retrieval, and the tradeoffs we landed on.

Evaluating agent reliability

Why pass-rate isn't enough, and what we measure instead: action-fidelity, escalation-precision, rollback-rate.

Tool schemas as guardrails

How strongly-typed tools change the safety profile of an agent — and why we generate ours from OpenAPI specs.

Long-horizon workflows

Agents that run for hours or days need different architecture than agents that run for seconds. Notes on what changes.

Want to dig deeper?

Read the research notes, or talk to the team about applying these ideas to your domain.