Platform Engineer (GenAI)
District 3, Vietnam
TechX
AWS Advanced Partner in Vietnam | Learn how TechX and our services are pioneering business transformation with Data and GenAI as the leading AWS Advanced Partner in Vietnam.
🚀 Overview
Join TechX as we continue expanding our AI infrastructure team and delivering impactful GenAI-powered products for enterprise and industry clients.
We are looking for an experienced Platform Engineer to build and operate the core infrastructure that powers the safe, reliable, and efficient delivery of our GenAI solutions. This role is at the heart of how we scale AI applications in production environments — ensuring observability, automation, cost control, and compliance for our large language model (LLM) operations.
⚡ Note: This is not a prompt engineering or model tuning role. Instead, you will architect and manage the infrastructure that enables AI teams to operate Gemini Pro/Flash models at scale.
🎯 Key Responsibilities
✅ Own LLM-Oriented Platform Architecture
Design platform components that abstract LLM APIs (e.g., Gemini) into a consistent, testable, and production-ready interface.
Handle retries, latency tracking, fallback switching, and configuration routing logic.
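To give a flavor of this responsibility, here is a minimal sketch of a call wrapper with retries, latency tracking, and fallback switching. The function names, backoff parameters, and return shape are illustrative assumptions, not an existing TechX interface:

```python
import random
import time

def call_with_fallback(primary, fallback, max_retries=2, base_delay=0.1):
    """Call `primary` (e.g., a Gemini Pro wrapper) with retries; on
    exhaustion, switch to `fallback` (e.g., a Flash wrapper). Hypothetical
    interface for illustration only."""
    last_err = None
    for attempt in range(max_retries + 1):
        try:
            start = time.monotonic()
            result = primary()
            latency = time.monotonic() - start  # latency-tracking hook
            return {"result": result, "latency_s": latency, "route": "primary"}
        except Exception as err:
            last_err = err
            # exponential backoff with jitter before the next attempt
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    # all retries exhausted: switch routes
    try:
        return {"result": fallback(), "latency_s": None, "route": "fallback"}
    except Exception:
        raise last_err
```

In production this logic would also feed the observability pipeline described below, tagging each record with the route taken.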
✅ Design Multi-Version Prompt Configuration Management
Manage prompt and parameter versions across deployments.
Track version statuses (active, canary, deprecated), maintain changelogs, and ensure rollback safety.
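As a sketch of the kind of version management involved, the registry below tracks prompt statuses (active, canary, deprecated), keeps an activation changelog, and supports rollback to the previously active version. The class, its field names, and the in-memory store are assumptions for illustration:

```python
class PromptRegistry:
    """Illustrative in-memory prompt-version registry; a real system would
    back this with a versioned config store (e.g., Git-tracked YAML)."""

    def __init__(self):
        self._versions = {}   # version id -> {"text": ..., "status": ...}
        self._history = []    # ordered activation log (changelog)

    def register(self, version, text, status="canary"):
        self._versions[version] = {"text": text, "status": status}

    def promote(self, version):
        # demote the current active version, then activate the new one
        for v in self._versions.values():
            if v["status"] == "active":
                v["status"] = "deprecated"
        self._versions[version]["status"] = "active"
        self._history.append(version)

    def active(self):
        return next((k for k, v in self._versions.items()
                     if v["status"] == "active"), None)

    def rollback(self):
        # revert to the previously activated version, if any exists
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        self.promote(self._history.pop())
```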
✅ Build Observability & Cost Intelligence for Gemini Usage
Define structured logs and metrics for Gemini interactions.
Monitor latency, feedback scores, token usage, and cost estimates.
Develop dashboards and alerts to catch performance regressions or anomalies.
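One structured (JSON-lines) log record of the kind described might look like the sketch below. The per-token rates are placeholders, not Google's actual Gemini pricing, and all field names are illustrative:

```python
import json
import time

# Placeholder USD rates per 1K tokens -- NOT real Gemini pricing.
PRICE_PER_1K = {"input": 0.000125, "output": 0.000375}

def log_record(model, prompt_tokens, output_tokens, latency_ms, feedback=None):
    """Emit one JSON log line per LLM call, with a toy cost estimate."""
    cost = (prompt_tokens * PRICE_PER_1K["input"]
            + output_tokens * PRICE_PER_1K["output"]) / 1000
    record = {
        "ts": time.time(),
        "model": model,
        "tokens": {"prompt": prompt_tokens, "output": output_tokens},
        "latency_ms": latency_ms,
        "cost_usd_est": round(cost, 6),
        "feedback_score": feedback,
    }
    return json.dumps(record)  # one object per line, dashboard-friendly
```

Records in this shape can be aggregated directly by log-based metrics for the dashboards and regression alerts mentioned above.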
✅ Enable Safe, Automated Rollbacks
Implement health scoring, statistical deviation logic, and automated rollback mechanisms.
Maintain robust audit logs, cooldown strategies, and “last known good” states.
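A minimal sketch of the statistical-deviation piece: compare recent health samples (e.g., latencies) against a baseline and flag a rollback when the mean drifts beyond a z-score threshold. The threshold and sampling scheme are assumptions; a real system would add cooldowns and audit logging around this check:

```python
import statistics

def should_rollback(baseline, recent, z=3.0):
    """Flag a rollback when the recent mean deviates from the baseline
    mean by more than `z` baseline standard deviations. Illustrative only."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(recent) != mu
    return abs(statistics.mean(recent) - mu) > z * sigma
```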
✅ Secure Integration & Configuration Safety
Manage API keys and configuration securely using GCP-native tools (Secret Manager, IAM).
Enforce log redaction and PII masking.
Design version-aware deployment hooks and readiness checks.
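For the redaction requirement, a minimal regex-based sketch is shown below. The two patterns (emails, phone-like digit runs) are illustrative, not a complete PII taxonomy; production deployments would pair this with GCP-native DLP tooling:

```python
import re

# Minimal illustrative PII patterns -- not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\d{9,11}\b")

def redact(text):
    """Mask common PII patterns before a log record leaves the service."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```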
🛠️ Key Requirements
Must-Have Skills
- GCP + Gemini Integration: Proven experience integrating with Google Gemini APIs (Pro/Flash), with a deep understanding of request structures, cost models, latency behaviors, and operational best practices.
- Python Engineering: Strong Python backend development skills, particularly with asynchronous frameworks like FastAPI or similar, capable of building robust and scalable backend services.
- Observability Design: Expertise in designing structured logging and metrics for APIs, using formats like JSON or EMF, and implementing structured feedback tracking systems to ensure reliable monitoring and performance analysis.
- Prompt and Configuration Versioning: Hands-on experience working with version-controlled configuration systems or registries, such as YAML or JSON-based setups, GitOps workflows, or similar, to manage prompt versions and deployment safety.
- Automation and CLI Tooling: Ability to develop internal tooling and automation scripts (e.g., CLI tools for configuration management or rollback operations), including audit logging and safety mechanisms.
- Security and Compliance: Familiarity with GCP Identity and Access Management (IAM), secure API key handling, log masking and redaction strategies for PII, configuration gating, and readiness for audit compliance in production environments.
✨ Extra / Nice-to-Have Skills
Experience working with OpenAI, Claude, or AWS Bedrock (in addition to Gemini).
Experience designing model abstraction layers or runtime LLM routing.
Exposure to token cost modeling or billing/reporting APIs for LLMs.
Familiarity with AI security best practices in cloud environments.
🤝 Collaboration Scope
Work closely with Prompt Engineers to monitor version health and feedback.
Partner with AI Architects to optimize Gemini performance and integration.
Coordinate with Product & Operations for cost reporting, SLAs, and system health.
Engage with the DevOps (AWS) Team for hybrid observability and CI/CD processes.
🌟 Experience Level
4–6+ years in backend engineering, platform engineering, or SRE roles.
Prior experience deploying and monitoring AI/ML workloads (GCP preferred; multi-cloud a plus).
Bonus: Direct hands-on usage of Gemini APIs or managing LLM configurations in production.
💎 Why Join TechX?
Take ownership of Gemini observability and integration at scale.
Lead the GCP / Gemini-first strategy while collaborating across hybrid cloud environments.
Be part of a forward-thinking team, building mission-critical GenAI platforms for regulated industries.
Competitive salary, modern engineering culture, and career growth opportunities.