AI Agent Quality Engineer - Chaps

Budapest


About Craft & Chaps

At Craft, we rethink productivity from first principles. Our products disappear into the background so people can do their life’s work—fast, joyfully, and without friction.

Chaps is our new AI-first product, focused on turning a constellation of large-language-model agents into a seamless personal productivity assistant.


About the role

Our AI Product team is looking for an engineer who obsesses over making multi-agent systems robust, observable, and continuously improving. You’ll build the test harnesses, evaluation pipelines, and monitoring layers that keep dozens of collaborating agents on-task, on-budget, and on-time.

In practice, that means:

  • Designing automated evals that exercise complete agent workflows—catching regressions before they reach users.

  • Instrumenting every prompt, tool-call, and model hop with rich telemetry so we can trace root causes in minutes, not days.

  • Creating feedback loops that turn logs, user ratings, and synthetic tests into better prompts and safer behaviors.

  • Future-proofing agentic systems so quality bars, evals, and safeguards keep pace as underlying LLM capabilities improve.

You will partner with product, research, and infra to ship an AI assistant users can trust—no surprises, no downtime.

What we’re looking for

You must have:

  • Hands-on experience with LLM evaluation frameworks (e.g., OpenAI Evals, LangSmith, LLM-Harness) and a track record of turning eval results into release gates.

  • Observability chops—you’ve wired up tracing/metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter.

  • Prompt-engineering fluency—few-shot, function-calling, RAG orchestration—and an instinct for spotting ambiguity or jailbreak vectors.

  • Production-grade Python/TypeScript skills and comfort shipping through CI/CD (GitHub Actions, Terraform, Docker/K8s).

  • A bias for experimentation: you automate A/B tests, cost–latency trade-off studies, and rollback safeguards as part of the dev cycle.

It would be great if you have:

  • Experience scaling multi-agent planners or tool-using agents in real products.

  • Familiarity with vector databases, semantic diff tooling, or RLHF/RLAIF pipelines.

  • A knack for weaving human feedback (support tickets, thumbs-downs) into automated regression tests.

Our Culture

  • Think differently. We value novel ideas over legacy playbooks—and we give you room to explore.

  • People first. You instrument systems so users never feel the bumps; you collaborate so teammates never feel stuck.

  • Pragmatic craftsmanship. We ship fast, but we measure twice—data accuracy, latency budgets, and reliability all matter.

  • Clear communicators. You translate metrics into stories that product managers and designers understand, sparking better decisions.

Join us if you want to make AI that works—every request, every time.




Tags: A/B testing CI/CD Distributed Systems Docker Engineering GitHub Grafana Kubernetes LLMs OpenAI Pipelines Python RAG Research RLHF Terraform TypeScript

Region: Europe
Country: Hungary
