Data Center Technician
Menlo Park
Lamini
Lamini helps enterprises reduce hallucinations by 95%, enabling them to build smaller, faster LLMs and agents based on their proprietary data. Lamini can be deployed in secure environments, on-premise (even air-gapped) or VPC, so your data...
Lamini enables every enterprise to safely, quickly, and cost-effectively build its own Expert AI. Our customers own their own models, trained on their data. Lamini optimizes for Expert AI workloads with minimal hallucination, enterprise-grade security, and enterprise flexibility, running on any infrastructure. Our team is made up of highly committed engineers, researchers, and tech industry veterans excited by our mission and technology. We’re backed by leading VCs as well as computing and technology companies.
We are looking for a skilled Data Center Technician to oversee the physical and technical aspects of our GPU cluster. This role is essential for maintaining a stable and efficient computing environment, optimizing system performance, and minimizing downtime. You’ll be hands-on with hardware, responsible for troubleshooting, maintenance, and upgrades, and you’ll collaborate with our engineering teams to support their research and production workloads.
Key Responsibilities:
- Cluster Management: Oversee day-to-day operations of our GPU cluster, including hardware and software maintenance, troubleshooting, and repairs to ensure optimal performance.
- Deployment & Configuration: Assist with the deployment, configuration, and calibration of GPU servers, racks, and networking equipment.
- Hardware Upgrades: Implement and support hardware upgrades, including new GPU installations, networking installations, and other critical infrastructure updates.
- Monitoring & Optimization: Continuously monitor system performance, capacity, and health using tools and alerts, and take proactive steps to optimize resource allocation and prevent downtime.
- Troubleshooting: Quickly diagnose and resolve hardware and network issues, coordinating with team members to minimize disruptions.
- Documentation: Maintain accurate records of configurations, maintenance schedules, and hardware inventory for efficient and organized data center management.
- Collaboration: Serve as the data center liaison for vendor support personnel and manage support tickets with hardware vendors as needed. Work closely with AI researchers and engineers to understand their hardware requirements and support them in running large-scale ML and DL workloads. Prioritize and communicate issues, and provide clear and accurate SLAs.
- On-call: Work in an environment that operates 24/7, participate in the on-call rotation, and provide after-hours support as needed.
Requirements:
- Technical Education/Experience: Bachelor’s degree in Computer Science, IT, Electrical Engineering, or a related field, or equivalent hands-on experience.
- Data Center Expertise: 2+ years of experience in a data center environment, with a strong understanding of server maintenance, networking, and hardware troubleshooting.
- GPU Knowledge: Experience working with GPU hardware, preferably in an AI or high-performance computing environment (experience with AMD GPUs is a plus).
- Networking Skills: Familiarity with networking concepts (TCP/IP, DNS, DHCP, RoCE, redundancy) and experience with network hardware in a data center setting.
- Problem-Solving Skills: Strong analytical skills and the ability to quickly diagnose and resolve technical issues.
- Team Player: Effective communication skills and the ability to work collaboratively with engineering and research teams.
Preferred Skills:
- Scripting & Automation: Basic scripting skills (Python, Bash) to automate routine tasks.
- Hands-On Data Center Experience: A deep understanding of server hardware, BMC-based manageability, BIOS settings, and firmware deployment. Familiarity with InfiniBand switches and network topology.
- Monitoring Tools: Familiarity with monitoring and logging tools, such as Prometheus, Grafana, or similar. Basic Linux system administration expertise.
Tags: Computer Science Engineering GPU Grafana InfiniBand Linux Machine Learning Python Research Security Testing
Region: North America
Country: United States