Senior ML Infrastructure Engineer

Parametric

Date listed: 2 months ago

Employment type: Full time

Found on: YCombinator Startups


About Us

Parametric is building robots to reliably automate frontline physical labor, starting with laundry folding. We are moving beyond traditional hard-coded automation by developing generalizable, learning-based agents capable of operating in unstructured environments. We have spent the last few months validating our core technology and fundraising, and we are now building a team to solve one of the most persistent challenges in robotics: manipulating deformable objects at production scale.

About the Role

As a Senior ML Infrastructure Engineer, you will build the distributed nervous system of our research operations. You will design the underlying architecture that transforms raw robot logs into deployed policies, managing the complexity of asynchronous workflows at scale.

This is a deep systems engineering role. You will be responsible for orchestrating the interplay between massive data ingestion, large-scale distributed training, and parallelized simulation. You will build the tooling that ensures our experiments are reproducible, our compute is saturated, and our lineage is traceable.

What You’ll Do

Orchestrate Complex DAG Workflows: Design and implement fault-tolerant, multi-stage DAGs (using Airflow, Argo, or Temporal) that coordinate asynchronous tasks across distributed systems—from log ingestion and synthetic data generation to training and evaluation.
Architect Parallelized Compute Clusters: Provision and manage high-performance compute clusters (Kubernetes/Slurm) tailored for multi-node distributed training. You will optimize for scheduling efficiency, handling spot instance interruptions and maximizing GPU interconnect throughput.
Engineer A/B Experimentation Infrastructure: Build the platform for rigorous model comparison, enabling simultaneous A/B testing of policies in simulation and real-world shadow mode. You will implement automated statistical profiling to gate deployments based on key performance metrics.
Enforce End-to-End Data Lineage: Architect a strict provenance system that tracks the full lifecycle of a model artifact—mapping every deployed policy back to the exact dataset version, hyperparameters, and training code commit used to generate it.
Optimize Training Throughput: Debug and resolve bottlenecks in distributed training loops (e.g., NCCL timeouts, straggler nodes, I/O blocking) to ensure our research team iterates at maximum velocity.

What We’re Looking For

Experience: Three or more years (or equivalent) working in DevOps, ML infrastructure, or platform engineering roles.
AWS Core Infrastructure: Expert-level proficiency with AWS primitives. You have deep experience automating EC2 lifecycles (specifically GPU instances like P4/P5), managing Spot Fleets for cost-efficient compute, and tuning S3 for high-throughput parallel data access.
Modern Workflow Orchestration: Strong experience building event-driven DAGs using tools like Temporal, Dagster, or Airflow. You understand how to manage complex dependencies, retries, and state in long-running distributed workflows.
Production Kubernetes: Hands-on experience architecting and managing EKS clusters. You are comfortable writing custom Helm charts, configuring cluster autoscalers (Karpenter/Cluster Autoscaler), and managing resource quotas for multi-tenant research teams.
Infrastructure as Code (IaC): You don't click buttons in the console. You have a portfolio of complex infrastructure defined in Terraform, Pulumi, or AWS CDK.
CI/CD & Automation: You have built robust pipelines (GitHub Actions, Jenkins) that handle not just code deployment, but also Docker image builds and artifact versioning.
Scripting & Linux Internals: Strong proficiency in Python and Bash. You can debug networking issues (VPC, Security Groups, DNS) and optimize Linux kernel parameters for high-performance computing.

Parametric PBC is a public benefit corporation building robots to benefit all humans. We’re a proud equal-opportunity employer and encourage applications from all individuals regardless of race, color, religion, sex, gender, national origin, disability, age, or veteran status.

We firmly believe the best version of the future includes everyone, so we encourage you to apply even if you don’t strictly meet all the requirements.