Senior Infrastructure & DevOps Engineer

Pittsburgh, Pennsylvania, United States
Full-Time
Remote

Job Description:

Overview

Build and maintain the automated production line for PHIN's Physical Superintelligence. You will own the plumbing that lets our simulation engine scale seamlessly, so that our team can deploy updates multiple times a day and ingest massive amounts of simulation data without friction.

Core Responsibilities

Greenfield Observability: Architect and implement a comprehensive logging, monitoring, and alerting stack across our platform from the ground up.
Compute Architecture, Scaling & FinOps: Provision, manage, and optimize highly concurrent, scalable clusters. Think cloud-agnostically to guide future architecture, and implement rigorous FinOps practices to minimize the cost of running thousands of simultaneous jobs.
Infrastructure as Code (IaC): Own, maintain, and expand our Terraform footprint.
Continuous Deployment (CD): Design and maintain high-velocity CI/CD pipelines supporting multiple deployments per day. Ensure "code to production" is a seamless, automated journey.
Backend Robustness: Manage the API layer that sits between the infrastructure and the application layer. Read and refactor services to optimize data movement, squash bottlenecks, and maintain security.
Data Pipeline Architecture: Build the underlying pipelines to move, store, and process the massive datasets generated by atomic-scale simulations.
Platform DevEx & MLOps: Build self-serve tooling and event-driven pipelines that empower the entire organization. Create seamless abstractions so our developers can focus on what they do best.
DevOps & Intelligence Automation: Ruthlessly automate manual toil. Use and build AI-driven tools to manage logs, infrastructure provisioning, and business workflows.
Standard Enterprise Security: Implement and maintain security best practices (SOC 2/ISO focus) required for enterprise-grade contracts.

Candidate Profile

Experience: 5–8 years as a high-output Individual Contributor in Infrastructure or Backend roles.
Generalist Capability: Comfortable touching any part of the system, from networking and security to API design and data engineering. Familiar with Python and TypeScript/Node.js.
Cloud & HPC Familiarity: Deep experience with major cloud providers. Familiarity with high-performance computing (HPC) schedulers like Slurm is a major plus.
Tool Agnostic: Not married to one framework; you choose the best tool for the job (K8s, Serverless, HPC Schedulers, etc.).
AI-Native: Expert user of intelligence tools (Claude, Cursor, Codex, Copilot, Agents, etc.) to 10x your own productivity and automate business tasks.
ML Collaboration: Previous experience working closely with machine learning teams, supporting ML workflows, or building MLOps pipelines is highly desirable.