Date listed: 5 days ago
Employment Type: Full time
Remote: Yes
AI and HPC infrastructure is scarce and expensive, so failures are costly in both time and money. Cluster productivity directly determines research output and revenue. Achieving high utilization and throughput is increasingly challenging due to the complexity of workloads, hardware, and operations.
Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable transparent and fast migration of GPU workloads across instances, without losing work. Workloads automatically migrate to achieve new levels of reliability and throughput while accelerating time to results. Our system is at the kernel/OS level, requiring no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're deploying into leading inference platforms, neoclouds, enterprise, and research clusters.
Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI. Our research appears in NeurIPS and CVPR. We published some of the earliest formal methods for guaranteeing convergence in distributed training. At Shopify, we deployed warehouse automation and robot fleets, building behavior trees, fleet control planes, and OTA infrastructure that performs reliably over constrained networks. We bring repeat-founder experience, having built and exited a healthcare AI company.
As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagements from end to end. You’ll work with customers to understand their environments and deploy into them: from production SLURM at a university, to bare-metal Kubernetes at an inference provider, to a hybrid setup at a Fortune 100 pharma enterprise. You’ll rapidly identify their key pain points and use Cedana to solve them. For each customer, you own everything from the OS up: SLURM plugins, Kubernetes operators, node configuration, networking, and observability.
This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research and commercial customers to deliver a breakthrough solution.
Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.