GPU Engineer
Remote, NY, US
AllCloud
AllCloud combines the expertise of cloud integration with custom solutions for proven success across top technologies including AWS and Salesforce.
Description
Location: US / Canada (Eastern Time) - Home based
Job Type: Full-time, Permanent
About AllCloud
AllCloud is a global professional services company providing organizations with cloud enablement and transformation tools. As an AWS Premier Consulting Partner and audited MSP, a Salesforce Platinum Partner, and a Snowflake Premier Partner, AllCloud helps clients connect their front and back offices by building a new operating model that harnesses the benefits of cloud technology, data, and analytics.
Job Summary
We are seeking an experienced GPU Engineer to join our innovative AI team at AllCloud. In this role, you will design, implement, and optimize GPU-based infrastructure for large-scale LLM training and inference. The ideal candidate will have deep expertise in GPU architecture, parallel computing, and performance optimization for machine learning workloads. You'll work closely with our LLM Architects and ML Engineers to build and maintain the high-performance computing environment required for training our custom transformer-based language models.
Responsibilities
- Design and implement scalable GPU clusters on AWS infrastructure for distributed LLM training
- Optimize GPU memory usage, computational throughput, and inter-node communication for transformer model training
- Configure and tune GPU acceleration libraries (CUDA, cuDNN, NCCL) for maximum performance
- Implement mixed precision training and other optimization techniques to improve training efficiency
- Architect and deploy GPU-based inference solutions that balance latency, throughput, and cost
- Create benchmarking tools to measure and improve model training and inference performance
- Establish monitoring and management systems for GPU resources to maximize utilization and reliability
- Collaborate with LLM Architects to implement parallelization strategies (model, data, and pipeline parallelism)
- Troubleshoot hardware and software issues affecting GPU performance
- Keep current with advancements in GPU technology and AI accelerator hardware
Requirements
Summary of Key Requirements
- 5+ years of experience optimizing GPU infrastructure for machine learning workloads
- Advanced knowledge of NVIDIA GPU architecture and CUDA programming
- Strong understanding of high-performance computing (HPC), AI network architecture, and physical layer management
- Experience with AWS GPU instances (e.g., P4d, P5, G5) and AWS Batch for ML workloads
- Strong background in distributed computing and parallel processing techniques
- Familiarity with transformer architecture and deep learning frameworks like PyTorch or TensorFlow
- Expertise in performance profiling and bottleneck identification in GPU workloads
- Experience with containerization (Docker) and orchestration (Kubernetes)
- Understanding of memory optimization techniques for large language models
- Bachelor's degree in Computer Science, Electrical Engineering, or related field (Master's preferred)
Certifications
- AWS Certified Solutions Architect - Professional (Strongly Preferred)
- NVIDIA-Certified Professional: Accelerated Data Science (Preferred)
- NVIDIA-Certified Professional: AI Infrastructure or AI Networking (NCP-AIN) (Preferred)
Why work for us?
Our team inspires progress in each other and in our customers through our relentless pursuit of excellence; you will work with leaders who promote learning and personal development.
AllCloud is an Equal Opportunity Employer and considers applicants for employment without regard to race, color, religion, sex, orientation, national origin, age, disability, genetics or any other basis forbidden under federal, provincial, or local law.