Applied AI Engineer

Infer

Date listed

2 months ago

Employment Type

Full time

Found on:

YCombinator Startups

Keywords: pandas llm huggingface scikit ai agents ml python pytorch prompting

About us

Infer is building the operating system for insurance agencies. We make AI agents (including voice agents) that handle the work agencies have always done by hand: qualifying inbound leads, helping producers during live calls, auditing calls afterward, running renewals, and bringing churned customers back.

Our long bet is that AI eventually sells insurance directly. Agencies are the wedge because that is where the work, the data, and the customer relationships actually live. Get good there, and the rest follows.

We are a YC company and have raised from Stellaris Venture Partners and others. The founders are Vaibhav, Urvin, and Suneel. Vaibhav was an architect and AI researcher (at Purdue) and is now a licensed insurance agent. Urvin worked at BCG and is a surfer with six-pack abs. Suneel is an IITian and a philomath.

A few reasons to join us:

We like pushing each other on the team to test the limits, because that's when you rediscover yourself.
We’re paranoid about making customers succeed (we challenge what's already good).
We love people who question-challenge-build.
We're highly transparent founders to work with, and we love getting challenged.
Finally, we love people who’re interdisciplinary.

About the role

We're hiring an Applied AI Engineer to own the system that tells us whether our voice agents are getting better, and to keep them getting better on their own.

Voice quality is the product. If an agent stutters, hallucinates a quote, or misses a disclosure, we lose trust, deals, and sometimes compliance footing. The system that catches all of that before customers do is the most important infrastructure we will build this year.

Today we run thousands of conversations a day with real prospects. We need a harness that scores every change end to end, a benchmark suite that runs against any new model the day it drops, a red-team pipeline that probes our agents for failure modes, and self-improvement loops that feed production failures back into the eval set.

This is an evals and infrastructure role with deep LLM work. You will touch audio, but the center of gravity is the harness and the loops around it. Think of the harness as CI for voice conversations: it runs synthetic and real calls through our stack and scores agent behavior at every layer (STT, LLM, tools, TTS, full call outcomes), so we catch regressions before customers do. New models are coming out every few weeks, so the question is not just whether ours is good today, but whether we can tell within a week if a new open source release should replace it.

What you'll do

Build and maintain the eval framework that scores voice agent quality across transcription, LLM reasoning, tool use, TTS, and full-conversation outcomes
Design voice agent behavior: system prompts, tool use, conversation flow, error recovery, and guardrails for real-time interactions
Drive STT and TTS accuracy improvements by comparing providers, tuning configurations, and running rigorous A/B experiments the team can act on
Drive TTS quality improvements: voice selection, latency vs. fidelity tradeoffs, prosody, and edge cases
Curate and grow our evaluation datasets, including hard-case mining from production traffic
Build benchmarks we can run against any new model in days, and run a red-team pipeline that probes for jailbreaks, hallucinated quotes, and compliance failures
Partner with backend engineers to wire eval signals into CI so regressions block merges and get caught before they ship
Build self-improvement loops where hard cases from production auto-feed the eval set and our prompts optimize themselves over time

What success looks like

Day 30

You understand how our agents work across prompts, tools, evals, telephony, and customer systems.
You have shipped a v1 of evals with at least one end-to-end metric the team trusts.
You are sitting in on customer call reviews and tagging failure modes by hand to learn where the real problems live.
You have one new model (open or closed) benchmarked against our production stack with numbers we can defend.

Day 60

The eval system runs on every update and blocks merges that regress on a known set of cases.
We have a first red-team suite covering at least three classes of failure modes (jailbreaks, hallucinated quotes, compliance), running on a schedule.
Hard-case mining from production calls is automated, so the eval set grows without anyone triaging every example by hand.
At least one open source model (Qwen, DeepSeek, or similar) is benchmarked against our production stack with a defensible recommendation on whether to switch.

Day 90

We can swap in any new LLM and have a numbers-backed answer on whether to ship it within a week.
DSPy or GEPA-style prompt optimization is running over at least one production voice flow, and you have shown measurable lift.
Self-improvement v1 is live for at least one failure pattern. The same problem does not get solved twice because the system feeds the fix back into the platform.
You are spotting failure patterns across customer accounts and turning them into product fixes the rest of the team builds on.

Must-haves

ML engineering experience shipping production systems
Strong Python and a working ML stack (PyTorch, Hugging Face, pandas, scikit-learn)
Hands-on experience designing LLM-based agents: prompting, tool/function calling, multi-turn state, structured outputs
Hands-on experience building evals or eval frameworks for ML, LLM, or voice systems. You have built LLM-as-judge eval pipelines and know their failure modes
Practical experience with ASR/STT: comparing providers, fine-tuning, or running open models like Whisper
Practical experience with TTS systems (ElevenLabs or open models)
Comfortable working with audio data: sample rates, codecs, noise, alignment

Nice-to-haves

Designed voice agents specifically: handled barge-in, interruption recovery, disfluencies, and natural turn-taking at the prompt/behavior layer
Experience with diarization, VAD, or endpointing models
Audio dataset curation, labeling, or annotation pipelines
Trained or fine-tuned ASR or TTS models from scratch or on domain audio
Experience with active learning or data-flywheel patterns over production traffic
Open-source contributions to AI/ML frameworks
Familiarity with cost/latency tradeoffs across model providers for real-time voice