Senior Solutions Architect, Platform Infrastructure - PST (Remote)

United States

Weights & Biases

At Weights & Biases, our mission is to build the best tools for AI developers. We founded our company on the insight that while there were excellent tools for developers to build better code, there were no similarly great tools to help ML practitioners build better models. Starting with our first experiment tracking product, we have since expanded our solution into a comprehensive AI developer platform for organizations focused on building their own deep learning models and generative AI applications.
Weights & Biases is a Series C company with $250M in funding and over 200 employees. We proudly serve over 1,000 customers and more than 30 foundation model builders, including OpenAI, NVIDIA, Microsoft, and Toyota.
The Senior Solutions Architect role at Weights & Biases is a unique hybrid, blending the technical expertise of a Site Reliability Engineer (SRE) with the communication and advisory skills of a Solutions Architect. In this role, you will focus on all aspects of the Weights & Biases Platform, managing customer deployments across various cloud infrastructures and on-prem environments to ensure scalability, reliability, and operational excellence.
You will work closely with customers to debug issues, provide best practices, and help them unlock the full potential of Weights & Biases. Additionally, you will produce technical content such as blog posts, documentation updates, and internal enablement material to support the Field Engineering team. This role requires deep collaboration with Support, Product, and Engineering teams to drive product improvements based on customer insights.

Responsibilities:

  • Deployment & Operations:
    • Work with customer operations teams to provision Weights & Biases services in Dedicated Cloud, Private Cloud, and on-prem environments.
    • Manage complex infrastructure implementations, partnering with highly skilled customer engineers.
    • Monitor and ensure the reliability, performance, and scalability of customer deployments using SRE best practices.
  • Debugging & Troubleshooting:
    • Diagnose and resolve issues in customer environments, documenting resolutions to accelerate future problem-solving.
    • Provide hands-on support for containerized and distributed systems using Docker, Kubernetes, and related technologies.
  • Customer Engagement:
    • Lead technical discussions with customers, acting as a trusted advisor for infrastructure reliability and operational excellence.
    • Deliver training sessions, product demos, and workshops to help customers maximize the value of Weights & Biases.
    • Collaborate with customers to uncover desired outcomes and recommend solutions tailored to their needs.
  • Enablement & Collaboration:
    • Partner with AI Solution Engineers to streamline post-sales processes, including onboarding, adoption, and training.
    • Collaborate with Sales Engineering to ensure a seamless transition from POC to onboarding.
    • Provide insights to the Product team based on customer feedback to influence the product roadmap.

Requirements:

  • Based in the Pacific Standard Time (PST) timezone.
  • A proven track record of systematically diagnosing and resolving infrastructure issues.
  • Prior experience in a customer-facing technical role.
  • Expertise with Docker, Kubernetes, Helm charts, networking, and cloud-managed services (e.g., MySQL, Object Stores).
  • Strong fundamentals in Infrastructure as Code (IaC), preferably Terraform.
  • Proficiency with at least one cloud platform (AWS, GCP, Azure); experience with multiple platforms is a plus.
  • Strong Linux/Unix command line experience.
  • Basic proficiency in Python and familiarity with ML workflows or tools.
  • Exceptional communication skills, both written and verbal, with the ability to simplify complex topics for diverse audiences.
  • Proven ability to prioritize and manage multiple competing tasks in a dynamic environment.

Strong plus:

  • Deep proficiency in Kubernetes design patterns, including Operators.
  • Familiarity with data engineering and MLOps tooling.
  • Experience as an educator or facilitator for technical training sessions, workshops, or demos.
  • SaaS, web service, or distributed systems operations experience.

Our Benefits:

  • 🏝️ Flexible time off
  • 🩺 Medical, dental, and vision coverage for employees and their families
  • 🏠 Remote-first culture with in-office flexibility in San Francisco
  • 💵 Home office budget with a new high-powered laptop
  • 🥇 Truly competitive salary and equity
  • 🚼 12 weeks of parental leave (U.S. specific)
  • 📈 401(k) (U.S. specific)
  • Supplemental benefits may be available depending on your location
We encourage you to apply even if your experience doesn't perfectly align with the job description as we seek out diverse and creative perspectives. Team members who love to learn and collaborate in an inclusive environment will flourish with us. We are an equal opportunity employer and do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status. If you need additional accommodations to feel comfortable during your interview process, reach out at careers@wandb.com.

Category: Architecture Jobs

Tags: AWS, Azure, Deep Learning, Distributed Systems, Docker, Engineering, GCP, Generative AI, Helm, Kubernetes, Linux, Machine Learning, MLOps, MySQL, OpenAI, Python, Terraform, Weights & Biases

Perks/benefits: Career development, Competitive pay, Equity / stock options, Flex vacation, Gear, Health care, Medical leave, Parental leave

Regions: Remote/Anywhere, North America
Country: United States
