About the role

Shipping AI products is not the same as shipping software. The model layer changes every few weeks. The same prompt produces different answers tomorrow. A "small" model upgrade can silently break the feature your top customer depends on. The discipline that catches all of this - before customers do - is evals.

PingAura is hiring our first Evals Engineer to own that layer end to end. Our AI Coworker ships to paying enterprise customers and runs across OpenAI, Google Gemini, and Anthropic. We need someone who treats evals as a product, not a script - someone who can build the bench, run it as a gate on every release, and turn quality into a measurable curve.

We are pre-seed and backed by 14 CXO angel investors, Google for Startups, and AWS. The product is live. The evals problem is real and ours to solve.

This is not a prompt-tweaking role. It is for someone who can build the systems around evals: datasets, graders, traces, dashboards, regression gates, and the reliability loops that decide whether an AI feature is ready to ship.

Responsibilities

Own the eval bench. Design eval datasets, write graders (including LLM-as-judge), and turn customer escalations into permanent regression tests
Gate every model upgrade, prompt change, and agent behavior change on a passing eval run. No eval, no merge
Measure quality, latency, and cost across model versions and providers. Surface the tradeoffs the team can act on
Detect regressions across releases - when something breaks, your bench catches it first
Partner with the AI Agent Engineer on tool-use, structured output, and long-horizon agent evaluation
Build the data pipeline: capture production traces in Langfuse, label them, and feed them back into the bench
Make evals first-class internally - dashboards, alerts, weekly quality reports the whole team reads
Instrument human-in-the-loop labeling for the cases evals alone cannot judge
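
To make "no eval, no merge" concrete, here is a minimal sketch of a release gate in TypeScript. It is illustrative only and not our actual harness: the names EvalCase, Grader, runGate, and exactMatch are hypothetical, the model is stubbed as a plain synchronous function, and the pass-rate threshold is an assumed parameter.

```typescript
// Hypothetical types and names for illustration; not PingAura's real harness.
interface EvalCase {
  id: string;
  input: string;
  expected: string;
}

// A grader decides whether one output is acceptable.
// An LLM-as-judge grader could share this signature.
type Grader = (output: string, expected: string) => boolean;

interface GateResult {
  passRate: number;
  failedIds: string[];
  passed: boolean;
}

// Simple grader: exact match after trimming and lowercasing.
const exactMatch: Grader = (output, expected) =>
  output.trim().toLowerCase() === expected.trim().toLowerCase();

// Runs every case through the model and blocks the release
// when the pass rate falls below minPassRate.
function runGate(
  cases: EvalCase[],
  model: (input: string) => string,
  grader: Grader,
  minPassRate: number,
): GateResult {
  const failedIds: string[] = [];
  for (const c of cases) {
    if (!grader(model(c.input), c.expected)) failedIds.push(c.id);
  }
  const passRate =
    cases.length === 0 ? 0 : (cases.length - failedIds.length) / cases.length;
  // An empty dataset never passes: no eval, no merge.
  return {
    passRate,
    failedIds,
    passed: cases.length > 0 && passRate >= minPassRate,
  };
}
```

In a sketch like this, a customer escalation becomes one more EvalCase, and failedIds tells the team which cases regressed rather than only that the gate failed.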

You may be a good fit if

You have 2+ years of experience working with large language models in production, with at least one year focused on quality, evaluation, or AI testing
You have built or substantially extended an eval harness for a real product - not a course project or paper reproduction
You think in distributions, not anecdotes. "The agent failed once" is not a bug report you accept
You write clean, reproducible TypeScript and can build a small framework when one is needed
You can work across APIs, databases, background jobs, observability, and deployment workflows when evals need production plumbing
You use AI coding tools such as Cursor, Claude Code, or Codex seriously in your development workflow
You are comfortable reviewing AI-generated code instead of blindly accepting it
You are comfortable with statistics - confidence intervals, sample sizes, A/B significance - enough to defend a quality claim under scrutiny
You understand the difference between offline evals (golden sets), online evals (production traces), and human evals, and you know when to use which
You have used at least one eval framework (Langfuse, Braintrust, RAGAS, DeepEval, OpenAI Evals, HELM, or comparable) and have opinions about it
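
As an example of what "defend a quality claim under scrutiny" can look like, here is a small sketch that puts a confidence interval around an eval pass rate using the Wilson score interval. The function name wilsonInterval is hypothetical, and the default z = 1.96 (roughly a 95% level) is an assumption; this is one reasonable choice of interval, not a prescribed method.

```typescript
// Wilson score interval for a binomial pass rate.
// passes: number of passing cases; n: total cases; z: normal quantile (1.96 is about 95%).
function wilsonInterval(
  passes: number,
  n: number,
  z: number = 1.96,
): [number, number] {
  if (n <= 0) throw new Error("n must be positive");
  if (passes < 0 || passes > n) throw new Error("passes must be between 0 and n");
  const p = passes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  // Center is pulled toward 0.5, which matters for small samples.
  const center = (p + z2 / (2 * n)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point error at the edges.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}
```

With 90 passes out of 100 cases, the interval is roughly 0.83 to 0.94. The same 90% rate from only 9 of 10 cases gives a much wider interval, which is why a single failure or a small sample is not enough to call a regression.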

Strong candidates may also have

An ML research background or experience with LLM-as-judge calibration
Work on agentic evaluation: multi-step, tool-using, long-horizon
Working knowledge of Python for eval scripts, automation, or AI tooling
Experience with PostgreSQL, Redis, Supabase, or similar infrastructure
Background in search, content, or marketing technology
Published work or open-source contributions on AI evaluation methodology

What we work with

Language: TypeScript across the stack
Web: Next.js 16 (App Router), React 19, Server Actions, Tailwind CSS, shadcn/ui
Database: PostgreSQL on Supabase with row-level security; pg_cron and pgmq for scheduled and queued work
Cache and rate limiting: Redis on Memorystore - caching, distributed rate limiting, queue patterns
AI: OpenAI, Gemini, and Anthropic via a multi-provider routing layer
Observability: Langfuse for LLM traces, Sentry for errors, plus standard cloud logging and monitoring
Cloud: GCP for compute, data, and storage; AWS for CDN
Workflow: Turborepo monorepo, pnpm, Cursor and Claude Code daily

Compensation

Competitive salary - you're joining at the ground floor
Founding-tier ownership over a discipline that will only grow in importance
Direct collaboration with the AI Agent Engineer, the founders, and our enterprise customers

Why this team

Every AI product team eventually faces one question: is the product better or worse than last week? Evals are how you answer it. You will build the function inside PingAura that makes the answer trustworthy.

Evals Engineer