Site Reliability Engineer

Insight Engines | Oregon City, United States

At Insight Engines, cloud operations is the backbone of delivering the engineering team’s product to the world. Our goal is to sleep peacefully through the night while terabytes of data fly through our systems. We are looking for someone to help us maintain this large-scale data processing platform and lead site reliability engineering.

But enough about us, let’s talk about you. Do you enjoy developing, deploying, and monitoring auto-scaling clusters in the cloud? How about digging into and analyzing metrics for process optimization and resource tuning? As an integral member of our technology team, you’ll do all of the above for everything from ETL pipelines to OLAP datastores -- along with operationalizing the infrastructure that powers our groundbreaking natural language platform. You’ll wear different hats, touch many parts of our system, and have a significant impact on our products.

The kinds of problems you’ll work on include:

  • Scaling high-volume data systems
  • Deploying, maintaining, and owning systems
  • Designing, implementing, and maintaining robust monitoring and alerting to improve performance and reliability

Technologies we use on the operations team:

  • Apache Druid
  • Apache Kafka
  • Apache Spark
  • ArgoCD
  • Docker
  • Elasticsearch
  • Git
  • Go
  • Grafana
  • Jaeger
  • Kubernetes
  • Postgres
  • Prometheus
  • Python

As you can tell, we’re big fans of open source. We don’t expect you to have deep knowledge of every one of these technologies. But if you have a growth mindset, experience with several of these or related technologies, and a solid understanding of networking and GNU/Linux fundamentals, we would love to hear from you!

When applying, tell us about your real-world experience with GNU/Linux, cloud computing, and distributed systems, as well as with monitoring and maintaining those systems. Women, People of Color, Minorities, and LGBTQIA+ candidates are encouraged to apply.

Requirements

  • BS in a related discipline, or 3+ years of equivalent technology experience
  • 3+ years of GNU/Linux and/or remote system administration experience, or equivalent
  • Experience designing, operating, and maintaining robust, large-scale distributed systems
  • Cloud deployment experience on AWS, Azure, GCP, or similar platform
  • Experience with automation, configuration management, and developing infrastructure as code
  • Software development experience is a plus (Go, Python, Java, or equivalent)
  • Authorized to work in the United States
  • This is a remote opportunity, with a preference for candidates located in the Pacific Time Zone or +/- 3 hours

Company benefits

  • Open vacation policy
  • Health care insurance
  • Dental & vision insurance
  • Life insurance
  • Short-term & long-term disability insurance
  • Health care FSA
  • Transit & parking FSA
  • Free lunch at SF office
  • Flexible work hours
  • Holiday time off