We are looking for skilled Site Reliability Engineers to develop, maintain, and support container orchestration (k8s), distributed ML workloads, network services, storage layers, and the configuration of a petabyte-scale AWS storage and Kafka streaming stack.
- Develop and maintain high-performance k8s clusters across multiple regions
- Develop and maintain telemetry infrastructure and visualization for metrics, logs, and distributed tracing.
- Develop and maintain the operational configuration of a petabyte-scale storage and stream analysis service
- Work with Audio and Speech AI Engineers to accelerate development and deployment of heterogeneous analysis and distributed training pipelines
- Participate in the definition and management of SLIs, SLOs and error budgets for infrastructure and production services.
- Design and implement infrastructure-as-code pipelines
- 5+ years of experience configuring, supporting, and optimizing Linux systems
- 2+ years of AWS SRE experience designing, implementing, and supporting cloud-based infrastructure
- 2+ years of experience architecting, deploying, and supporting k8s in cloud environments.
- 2+ years experience designing and supporting distributed systems.
- Familiarity with running distributed ML workloads in cluster-orchestrated environments
- Experience building and supporting telemetry infrastructure (Grafana, Prometheus, and similar tools)
- Experience designing and implementing infrastructure-as-code pipelines
- Experience in one or more languages such as Python, Java, or Go
- 2+ years of pub/sub experience (Kafka, SQS)