AI/HPC System Performance Engineer
Tasks
- Analyze and resolve performance regressions in distributed training
- Automate performance reporting and root cause analysis
- Build monitoring tooling and automation
- Conduct technical design reviews for system changes
- Design capacity and scalability experiments
- Develop performance analysis frameworks and dashboards
- Establish service level objectives for cluster network performance
- Identify network, compute, and memory bottlenecks
- Investigate RDMA fabrics and collective communication issues
- Mentor engineers on HPC performance methodologies
- Profile and benchmark AI training and inference workloads
Perks/Benefits
- N/A
Skills/Tech-stack
AI Workload Optimization | AI workload | Alerting | C++ | Capacity Planning | Collective communication | GPU Utilization | Instrumentation | Latency | MPI | Monitoring | NCCL | Network bandwidth | Performance Analysis | PyTorch | RDMA | Scalability testing | Telemetry | TensorFlow | Workload optimization