Staff Site Reliability Engineer

Mcd Tech Labs | Mountain View

Date listed: 2 months ago

Employment type: Full-time

Glassdoor rating: 3/5 (27 reviews)

We are looking for skilled Site Reliability Engineers to develop, maintain, and support container orchestration (k8s), distributed ML workloads, network services, storage layers, and the configuration of a petabyte-scale AWS storage and Kafka stream stack.

Responsibilities:

  • Develop and maintain high-performance k8s clusters across multiple regions
  • Develop and maintain telemetry infrastructure and visualization for metrics, logs, and distributed tracing
  • Develop and maintain the operational configuration of a petabyte-scale storage and stream analysis service
  • Work with Audio and Speech AI Engineers to accelerate development and deployment of heterogeneous analysis and distributed training pipelines
  • Participate in the definition and management of SLIs, SLOs, and error budgets for infrastructure and production services
  • Design and implement infrastructure-as-code pipelines

Required Qualifications:

  • 5+ years of experience configuring, supporting, and optimizing Linux systems
  • 2+ years of AWS SRE experience designing, implementing, and supporting cloud-based infrastructure
  • 2+ years of experience architecting, deploying, and supporting k8s in cloud environments
  • 2+ years of experience designing and supporting distributed systems

Desired Qualifications:

  • Familiarity with running distributed ML workloads in cluster-orchestrated environments
  • Experience building and supporting telemetry infrastructure (Grafana, Prometheus, and similar tools)
  • Experience designing and implementing infrastructure-as-code pipelines
  • Experience in one or more languages such as Python, Java, or Go
  • 2+ years of pub/sub experience (Kafka, SQS)

