The Core Platform Team is responsible for many areas of Medium’s day-to-day operation. This includes managing Medium’s infrastructure, where we apply SRE (Site Reliability Engineering) methodologies to reduce toil wherever possible. While we often work through reactive, operational problems, we are also software engineers who aim to build or improve tooling.
The Core Platform Team is also the first line of defense when a production incident arises. We prioritize reliability and the user experience above all else.

Key Responsibilities

As a Senior Core Platform engineer, you will build and support tooling that includes, but is not limited to, CI/CD pipelines, in-house and third-party Kubernetes Operators / Controllers, and IaC that interacts with AWS and other cloud services. You will also be part of the on-call rotation, during which you are responsible for taking action on all alerts. Outside of that rotation, you will be involved in many areas, from working with product engineers to ensure reliability as features are shipped to supporting in-house tools that tie into infrastructure, developer experience, and beyond.

A senior engineer has significant mastery of their craft. They mentor others and help to expand their impact, while also having the knowledge and expertise to design and execute technical solutions. They have a strong sense of ownership over the environment — when they see a problem, they do what’s necessary to make sure it gets handled.
  • Identify areas of improvement (like toil, or inefficiencies) and propose solutions (including coding new tools or updating existing ones)
  • Monitor, triage, and break-fix Medium’s infrastructure and production Kubernetes-based environment
  • Work in the week-long on-call rotation with fellow engineers to ensure site stability and availability
  • Become embedded in product code to enhance observability and reliability where possible
  • Mentor other engineers and participate in code reviews
  • Lead and manage production incidents when they arise, engaging with engineers and other teams to bring incidents to resolution
  • Clearly document technical decisions and collaborate on solutions through written and verbal communication
  • Independently understand and work with existing codebases and system configurations

Skills, Knowledge and Expertise

Technical Skills and Competencies:
  • 3+ years of experience in software engineering, specifically in Golang or a similar backend language, including interacting with APIs
  • 5+ years of experience in Site Reliability Engineering, or equivalent experience supporting a production environment
  • You understand backend infrastructure
  • You are comfortable working in a Kubernetes environment and know how to troubleshoot within it
  • You have experience with AWS or other cloud providers
Non-Technical Skills:
  • Strong communication and interpersonal skills, fostering a positive and collaborative work environment
  • Can respectfully challenge peers, and welcomes being challenged in return
  • Remains calm and collected under pressure while troubleshooting production incidents
  • Proactive and inquisitive, questioning established practices with the aim of identifying and implementing improvements

We Would Especially Love It If:

  • You have experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef)
  • You are familiar with monitoring and alerting tools (e.g., Prometheus, Grafana) and can troubleshoot issues using logs and other diagnostic data
  • You are familiar with Site Reliability Engineering principles, and understand the impact and importance of reducing toil