Staff ML Infrastructure SRE

Skills

Ansible DevOps Docker Engineer Helm Kubernetes Python PyTorch Security Terraform

The Wikimedia Foundation seeks a Staff Site Reliability Engineer (SRE) specializing in Machine Learning Infrastructure. Join a distributed, remote-first team supporting the platforms that empower Wikimedia’s Machine Learning Engineers and Researchers to build, deploy, and operate production-grade ML models. You will play a vital role in scaling, securing, and evolving the ML infrastructure that supports Wikipedia and other Wikimedia projects worldwide.

Job Overview

As a Staff SRE, you will architect, develop, and maintain robust ML infrastructure, collaborating closely with engineers, researchers, and open-source communities. Your work will directly impact the reliability, scalability, and performance of Wikimedia’s ML systems, supporting the Foundation’s mission to make free knowledge accessible globally.

Key Responsibilities
  • Design, implement, and maintain scalable infrastructure for ML model training, deployment, monitoring, and scaling.
  • Enhance reliability, availability, and scalability of ML systems to ensure seamless workflows for internal users.
  • Collaborate with ML engineers, product teams, SREs, and volunteers to gather requirements and resolve operational challenges.
  • Monitor and optimize system performance, capacity, and security to uphold high service standards.
  • Provide expert guidance, documentation, and mentorship on infrastructure management and operational best practices.

Required Skills & Qualifications
  • 7+ years in SRE, DevOps, or infrastructure engineering roles with experience in production ML systems.
  • Expertise in on-premises ML infrastructure (Kubernetes, Docker, GPU acceleration, distributed training systems).
  • Proficiency with automation and configuration management tools (Terraform, Ansible, Helm, Argo CD).
  • Experience implementing observability, monitoring, and logging solutions (Prometheus, Grafana, ELK stack).
  • Familiarity with Python-based ML frameworks (PyTorch, TensorFlow, scikit-learn).
  • Strong English communication skills and ability to work effectively across global, remote teams.
  • Collaborative, proactive, and motivated approach with a commitment to open-source communities.

Job Type: Remote

Salary: Not Disclosed

Experience: Entry

Duration: 12 Months
