LLM Ops Engineer - Serverless & CI/CD (AWS)
India - Remote
Expedite Commerce
This isn't your average DevOps role. This isn't just about pipelines or cloud provisioning. This is about engineering the backbone of Agentic AI systems that drive the next generation of enterprise SaaS—where conversational interfaces, dynamic UIs, and intelligent agents operate seamlessly on AWS Serverless infrastructure, with deep integration into Salesforce and cross-agent protocols.
This is for builders with something to prove. For engineers who’ve gone beyond cloud fluency to orchestrate complex, multi-agent ecosystems—who want to shape how enterprise applications are deployed, debugged, scaled, and observed in real time.
If you’re driven by deep automation, passionate about creating fault-tolerant agentic systems, and thrive where innovation is the expectation—not the exception—you’re in the right place. Join us to redefine SaaS infrastructure and champion a new era of AI-powered, product-led enterprise experiences.
The Role
We are seeking a hands-on Agentic AI Ops Engineer who thrives at the intersection of cloud infrastructure, AI agent systems, and DevOps automation. In this role, you will build and maintain the CI/CD infrastructure for Agentic AI solutions using Terraform on AWS, while also developing, deploying, and debugging intelligent agents and their associated tools. This position is critical to ensuring scalable, traceable, and cost-effective delivery of agentic systems in production environments.
The Responsibilities
CI/CD Infrastructure for Agentic AI
- Design, implement, and maintain CI/CD pipelines for Agentic AI applications using Terraform, AWS CodePipeline, CodeBuild, and related tools.
- Automate deployment of multi-agent systems and associated tooling, ensuring version control, rollback strategies, and consistent environment parity across dev/test/prod.
Agent Development & Debugging
- Collaborate with ML/NLP engineers to develop and deploy modular, tool-integrated AI agents in production.
- Lead the effort to create debuggable agent architectures, with structured logging, standardized agent behaviors, and feedback integration loops.
- Build agent lifecycle management tools that support quick iteration, rollback, and debugging of faulty behaviors.
Monitoring, Tracing & Reliability
- Implement end-to-end observability for agents and tools, including runtime performance metrics, tool invocation traces, and latency/accuracy tracking.
- Design dashboards and alerting mechanisms to capture agent failures, degraded performance, and tool bottlenecks in real time.
- Build lightweight tracing systems that help visualize agent workflows and simplify root cause analysis.
Cost Optimization & Usage Analysis
- Monitor and manage cost metrics associated with agentic operations, including API call usage, toolchain overhead, and model inference costs.
- Set up proactive alerts for usage anomalies, implement cost dashboards, and propose strategies for reducing operational expenses without compromising performance.
Collaboration & Continuous Improvement
- Work closely with product, backend, and AI teams to evolve the agentic infrastructure design and tool orchestration workflows.
- Drive the adoption of best practices for Agentic AI DevOps, including retraining automation, secure deployments, and compliance in cloud-hosted environments.
- Participate in design reviews, postmortems, and architectural roadmap planning to continuously improve reliability and scalability.
Requirements
- 2+ years of experience in DevOps, MLOps, or Cloud Infrastructure with exposure to AI/ML systems.
- Deep expertise in AWS serverless architecture, including hands-on experience with:
  - AWS Lambda – function design, performance tuning, cold-start optimization.
  - Amazon API Gateway – managing REST/HTTP APIs and integrating with Lambda securely.
  - Step Functions – orchestrating agentic workflows and managing execution states.
  - S3, DynamoDB, EventBridge, SQS – event-driven and storage patterns for scalable AI systems.
- Strong proficiency in Terraform to build and manage serverless AWS environments using reusable, modular templates.
- Experience deploying and managing CI/CD pipelines for serverless and agent-based applications using AWS CodePipeline, CodeBuild, CodeDeploy, or GitHub Actions.
- Hands-on experience with agent and tool development in Python, including debugging and performance tuning in production.
- Solid understanding of IAM roles and policies, VPC configuration, and least-privilege access control for securing AI systems.
- Deep understanding of monitoring, alerting, and distributed tracing systems (e.g., CloudWatch, Grafana, OpenTelemetry).
- Ability to manage environment parity across dev, staging, and production using automated infrastructure pipelines.
- Excellent debugging, documentation, and cross-team communication skills.
Benefits
- Equity participation program
- Health insurance, PTO, and leave time
- Ongoing paid professional training and certifications
- Fully remote work opportunity
- Strong onboarding and training program
Work Timings: 1 pm to 10 pm IST
Next Steps
We’re looking for someone who embodies the spirit of a boundary-pushing Principal Architect—ready to own ambitious projects, craft scalable multi-cloud solutions, and skillfully integrate AI where it truly elevates outcomes.
- Apply Now: Send us your resume and a brief summary of your experience leading teams, including notable multi-platform or AI-driven projects.
- Show Us Your Ingenuity: Be prepared to discuss your boldest cross-platform solutions, how you integrated new technologies, and how you overcame tough technical hurdles.
- Collaborate & Ideate: If selected, you’ll workshop a real-world scenario with our leadership—so we can see firsthand how you approach challenges across AWS, AI, and beyond.
This is your opportunity to shape the future of enterprise solutions—across AWS, emerging AI platforms, and the occasional Salesforce ecosystem. We can’t wait to hear from you!
Our Belief
We believe extraordinary things happen when technology and human creativity unite. By empowering teams with cloud solutions, AI insights, and thoughtful architecture, we free them to focus on meaningful relationships, innovative strategies, and real impact. It’s more than just code—it’s about sparking a revolution in how people interact with systems, solve problems, and propel businesses forward.
If this resonates with you—if you’re driven, daring, and ready to build the next wave of multi-platform innovation—then let’s do this. Apply now and help us shape the future.
About Expedite Commerce
At Expedite Commerce, we believe that people achieve their best when technology enables them to build relationships and explore new ideas. So we build systems that free you up to focus on your customers and drive innovations. We have a great commerce platform that changes the way you do business!
See more about us at expeditecommerce.com. You can also read about us on G2 (G2/products/expedite-commerce) and on Salesforce AppExchange (ExpediteCommerce).
EEO Statement
All qualified applicants to Expedite Commerce are considered for employment without regard to race, color, religion, age, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other protected characteristic.