Software Engineer, Machine Learning Platform & Operations
CHICAGO, IL, US, 60654-4203
Grainger
Work Location Type: Hybrid
As a leading industrial distributor with operations primarily in North America, Japan and the United Kingdom, We Keep The World Working® by serving more than 4.5 million customers worldwide with products delivered through innovative technology and deep customer relationships. With 2023 sales of $16.5 billion, we’re dedicated to providing value for customers, fostering an engaging culture for team members and driving strong financial results.
Our welcoming workplace enables you to learn, grow and make a difference by keeping businesses running and their people safe. As a 2024 Glassdoor Best Place to Work and a Great Place to Work-Certified™ company, we’re looking for passionate people to join our team as we continue leading the industry over our next 100 years.
The Machine Learning Operations & Platform team is focused on enabling machine learning scientists and engineers at Grainger to continuously develop, deploy, monitor, and refine machine learning models, and on improving the ML software development process. You will work with machine learning, data engineering, network, security, and platform engineering teams to build core components of a scalable, self-service machine learning platform that powers customer-facing applications. You will play an important part in developing the tools and services that form the backbone of Grainger’s AI-driven features, leveraging methods in Deep Learning, Natural Language Processing / Generative AI, Computer Vision, and beyond. This is an exciting opportunity to join a team and fuel the next phase in Grainger Technology Group’s data- and AI-driven modernization.
Our team is organized around three focus areas:
- Machine Learning Operations & Infrastructure: Build and maintain core infrastructure components (e.g., Kubernetes clusters) and tooling that enable self-service development and deployment of a variety of applications using GitOps practices.
- Machine Learning Platform: Design and develop user-friendly software systems and interfaces supporting all stages of the machine learning development lifecycle.
- Machine Learning Effectiveness & Enablement: Guide, partner, and consult with machine learning, product, and business domain teams from across the organization to foster responsible, scalable, and efficient development of high-quality ML systems.
As such, we seek individuals with the ability to contribute to one or more of these focus areas. Successful candidates need not have prior hands-on experience with machine learning system development if they have relevant experience in other areas. Similarly, individuals with ML expertise and a solid understanding of software engineering best practices can be successful in this role without deep Kubernetes or DevOps knowledge, for example. While the details below are reflective of the skills present across our team, candidates need not possess them all. We are ultimately interested in speaking with curious individuals with an intuition for technical problem solving and are hiring across levels and focus areas.
Responsibilities:
- Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
- Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
- Collaborate with machine learning, network, security, infrastructure, and platform engineers to ensure performant access to data, compute, and networked services.
- Ensure a rigorous deployment process using DevOps standards and mentor users in software development best practices.
- Partner with teams across the business to drive broader adoption of ML, enabling teams to improve the pace and quality of ML system development.
Qualifications:
- Bachelor’s degree and 5+ years’ relevant work experience or an equivalent combination of education and experience.
- Strong software engineering fundamentals and experience developing production-grade software; experience with Python, Golang, or a similar language preferred.
- Working knowledge of cloud-based services as well as their capabilities and usage; AWS preferred.
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm).
- Experience with DevOps or GitOps practices and tools (e.g., GitHub Actions, ArgoCD).
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, DataDog, ELK).
- Familiarity with containerization as well as container management and orchestration technologies (e.g., Docker, Kubernetes).
- Ability to work collaboratively in a team environment.
Bonus:
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs).
- Experience working with machine learning applications and associated frameworks/tools, particularly for monitoring and observability.
- Experience with big data technologies, distributed computing frameworks, and/or streaming data processing tools (e.g., Spark, Kafka, Presto, Flink).
- Experience deploying, evaluating, testing, or otherwise supporting GenAI applications and their components (e.g., LLMs, Vector DBs).
Don’t meet every single qualification? Studies show people are hesitant to apply if they don’t meet all requirements listed in a job posting. If you feel you don’t have all the desired experience, but it otherwise aligns with your background and you’re excited about this role, we encourage you to apply. You could be a great candidate for this or other roles on our team.
We are committed to equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender, gender identity or expression, or veteran status. We are proud to be an equal opportunity workplace.
We are committed to fostering an inclusive, accessible environment, which includes providing reasonable accommodations to individuals with disabilities both during the application and hiring process and throughout the course of one’s employment. With this in mind, should you need a reasonable accommodation during the application and selection process, please advise us so that we can provide appropriate assistance.