Site Reliability Engineer
New York, New York, United States - Remote
Dstillery
Dstillery is the leading AI ad targeting company. We empower brands and agencies to target their best prospects for high-performing programmatic advertising campaigns. Backed by our award-winning Data Science, Dstillery has earned 24 patents (and counting) for the AI technology that powers our precise, scalable audiences. Our newest technology, ID-free®, is patented, privacy-safe behavioral targeting that reaches 100% of ad impressions and can be used with any Dstillery product. Our premier user segment product, Custom AI Audiences, is a just-for-your-brand targeting solution that refreshes hundreds of millions of users every 24 hours to deliver the best performance.
We currently handle billions of events per day. We're growing fast and need someone who can help us scale our systems to handle even more data. Our engineering team is still fairly small, so everyone gets their hands dirty and makes a real impact.
Our engineering culture is focused on shipping scalable, practical systems. We use small, agile teams that can touch any part of the system. We place a high value on maintaining a good work-life balance, avoiding grinds and focusing on getting things done rather than putting in long hours.
We are looking to hire a Site Reliability Engineer to join our team in supporting both our on-premises and cloud infrastructure. In this role, you will work on standalone tasks and collaborate with other SRE team members on larger, more complex projects.
Responsibilities
- Contribute to initiatives aligned with the systems roadmap in a collaborative team environment.
- Work cross-functionally with software engineers, ML engineers, and data scientists to build and support reliable systems.
- Build and refine monitoring and alerting systems to ensure high availability and performance.
- Lead incident response, conduct root cause analysis, and drive remediation to prevent recurrence.
- Participate in design sessions, code reviews, and knowledge sharing.
- Advocate for SRE principles and best practices, including infrastructure as code and automation.
- Contribute to and enhance our evolving systems documentation.
- Participate in a scheduled rotation to support production systems during office hours.
Qualifications
We are looking for a candidate who has:
- Familiarity with security best practices and experience implementing security measures across infrastructure.
- Experience in performance tuning and optimizing systems for scalability and efficiency.
- Experience in designing and implementing disaster recovery and business continuity plans.
- Strong communication skills to effectively collaborate with cross-functional teams.
- Strong analytical and problem-solving skills to troubleshoot complex issues.
- Ability to mentor junior team members and share knowledge to foster a collaborative learning environment.
and who also has experience with a significant subset of the following tools, and an interest in learning the rest:
- Linux system administration on RHEL derivatives
- Deployment and monitoring across bare metal, cloud VMs, cloud-native platforms, and Kubernetes
- Configuration management tools, e.g. Salt, Ansible
- Infrastructure as code, e.g. Terraform
- Linux installation tools, e.g. Cobbler
- VM image building tools, e.g. Packer
- Open source networking, e.g. Quagga/FRR, keepalived, iptables
- Cloud networking on AWS and GCP
- Automation using Python
- Source control management using Git
Perks/benefits: Team events