Evals Engineer

Engineering
Remote / Mumbai, India
Full-time

Own the eval bench at PingAura — measure how well our AI Coworker actually works, catch regressions before customers do, and make evals the gate on every release.

About the role

Shipping AI products is not the same as shipping software. The model layer changes every few weeks. The same prompt produces different answers tomorrow. A "small" model upgrade can silently break the feature your top customer depends on. The discipline that catches all of this — before customers do — is evals.

PingAura is hiring our first Evals Engineer to own that layer end to end. Our AI Coworker ships to paying enterprise customers and runs across OpenAI, Gemini, and Anthropic. We need someone who treats evals as a product, not a script — someone who can build the bench, run it as a gate on every release, and turn quality into a measurable curve.

We are pre-seed and backed by 14 CXO angel investors, Google for Startups, and AWS. The product is live. The evals problem is real and ours to solve.


Responsibilities

  • Own the eval bench. Design eval datasets, write graders (including LLM-as-judge), and turn customer escalations into permanent regression tests
  • Gate every model upgrade, prompt change, and agent behavior change on a passing eval run. No eval, no merge
  • Measure quality, latency, and cost across model versions and providers. Surface the tradeoffs the team can act on
  • Detect regressions across releases — when something breaks, your bench catches it first
  • Partner with the AI Agent Engineer on tool-use, structured output, and long-horizon agent evaluation
  • Build the data pipeline: capture production traces on Langfuse, label them, and feed them back into the bench
  • Make evals first-class internally — dashboards, alerts, weekly quality reports the whole team reads
  • Instrument human-in-the-loop labeling for the cases evals alone cannot judge

You may be a good fit if

  • You have 2+ years working with large language models in production, with at least one year focused on quality, evaluation, or AI testing
  • You have built or substantially extended an eval harness for a real product — not a course project or paper reproduction
  • You think in distributions, not anecdotes. "The agent failed once" is not a bug report you accept
  • You write clean, reproducible Python or TypeScript and can build a small framework when one is needed
  • You are comfortable with statistics — confidence intervals, sample sizes, A/B significance — enough to defend a quality claim under scrutiny
  • You understand the difference between offline evals (golden sets), online evals (production traces), and human evals, and you know when to use which
  • You have used at least one eval framework (Langfuse, Braintrust, RAGAS, DeepEval, OpenAI Evals, HELM, or comparable) and have opinions about it

Strong candidates may also have

  • An ML research background or experience with LLM-as-judge calibration
  • Work on agentic evaluation: multi-step, tool-using, long-horizon
  • Background in search, content, or marketing technology
  • Published work or open-source contributions on AI evaluation methodology

What we work with

  • Language: TypeScript across the stack; Python for ML and eval tooling
  • Web: Next.js 16 (App Router), React 19, Server Actions, Tailwind CSS, shadcn/ui
  • Database: PostgreSQL on Supabase with row-level security; pg_cron and pgmq for scheduled and queued work
  • Cache and rate limiting: Redis on Memorystore — caching, distributed rate limiting, queue patterns
  • AI: OpenAI, Gemini, and Anthropic via a multi-provider routing layer
  • Observability: Langfuse for LLM traces, Sentry for errors, plus standard cloud logging and monitoring
  • Cloud: GCP for compute, data, and storage; AWS for CDN
  • Workflow: Turborepo monorepo, pnpm, Cursor and Claude Code daily

Compensation

  • Competitive salary — you're joining at the ground floor
  • Founding-tier ownership over a discipline that will only grow in importance
  • Direct collaboration with the AI Agent Engineer, the founders, and our enterprise customers

Why this team

Every AI product team eventually faces one question: is the product better or worse than last week? Evals are how you answer it. You will build the function inside PingAura that makes the answer trustworthy.

Interested in this role?

Apply now and join our founding team.

Questions about this role? Email us at careers@pingaura.ai