Upbound.io is a cloud native computing startup on a mission to create the first open, community-driven cloud computing platform. We are passionately leading Crossplane.io, a CNCF open source control plane project, while building a commercial SaaS offering that enables enterprises to build, deploy, and manage cloud native infrastructure for applications.
As a Senior Site Reliability Engineer (SRE) at Upbound, you’ll play a vital role in the production services the company is building its business on. You’ll apply engineering principles to design and build highly reliable, scalable infrastructure and services; deployment pipelines and processes that release updates frequently and safely; and monitoring and alerting systems that keep it all healthy.
In this role, you will be…
- Taking ownership of the health and reliability of the live production service and infrastructure, ensuring that SLOs/SLAs are consistently met
- Designing, building, and automating critical portions of the Upbound Cloud service infrastructure
- Troubleshooting and remediating infrastructure-related issues that affect service health
- Reporting and fixing bugs in private and public projects
- Providing routine maintenance and support of Kubernetes-based infrastructure, including extending the Kubernetes API and its functionality via CRDs and controllers
- Making technology decisions for the business, procuring the right technology, and designing and implementing a self-service solution for the teams that consume Upbound infrastructure
- Collaborating with the development teams to assess and recommend technologies that support the company's organizational needs
- Balancing tradeoffs between enterprise and open source technologies to better serve Upbound
- Supporting the full project lifecycle: discovery, analysis, architecture, design, documentation, building, migration, automation, and production-readiness
You are a good fit if you have...
- Worked in teams that have deeply internalized SRE philosophies in their culture, environments, and processes
- Written lots of code and automation in modern languages (Go is ideal)
- Managed production Kubernetes deployments or been responsible for deploying and managing workloads running on Kubernetes in production
- Architected and deployed highly scalable and reliable services, solutions, and infrastructure in multiple major cloud providers
- Become intimately familiar with public cloud infrastructure: AWS, Azure, and GCP
- Incorporated modern operational and application delivery tools and methodologies into your production deployment workflows, such as HashiCorp tools (e.g. Terraform), CI/CD, IaC, and GitOps
It is a plus if…
- You have worked in a startup and on a distributed/remote team before, and understand the unique challenges of a startup environment.
- You are actively involved in, or have contributed to, the upstream Kubernetes community.
- You have a history of speaking at technology conferences, blogging/writing technical articles, and/or contributing to a popular open source project.
While building amazing technology is important, Upbound has an intense commitment to building a great culture. With company values like Champion Others, Be Collaborative, Stay on Target, Stay Hungry, Have Fun, and Empower Others, you'll find yourself in a place where learning, growth, impact, and fun finally intersect. Like the open source community we serve, we look to each other to constantly iterate and improve on what we're building, and you will be a key contributor in this effort.
We encourage people of all backgrounds, gender identities, ethnicities, ages, or any other descriptors that make you uniquely you, to apply with enthusiasm and confidence. Upbound is a place where you can be 100% comfortable being you.