Data Center Technician

Menlo Park

Lamini

Lamini is the enterprise LLM platform for existing software teams to quickly develop and control their own LLMs. Lamini has built-in best practices for specializing LLMs on billions of proprietary documents to improve performance, reduce...


Lamini enables every enterprise to safely, quickly, and cost-effectively build its own Expert AI. Our customers own their own models, trained on their data. Lamini optimizes for Expert AI workloads with minimal hallucination, enterprise-grade security, and enterprise flexibility, running on any infrastructure. Our team is made up of highly committed engineers, researchers, and tech industry veterans excited by our mission and technology. We’re backed by leading VCs as well as computing and technology companies.
We are looking for a skilled Data Center Technician to oversee the physical and technical aspects of our GPU cluster. This role is essential for maintaining a stable and efficient computing environment, optimizing system performance, and minimizing downtime. You’ll be hands-on with hardware, responsible for troubleshooting, maintenance, and upgrades, and collaborate with our engineering teams to support their research and production workloads.

Key Responsibilities:

  • Cluster Management: Oversee day-to-day operations of our GPU cluster, including hardware and software maintenance, troubleshooting, and repairs to ensure optimal performance.
  • Deployment & Configuration: Assist with the deployment, configuration, and calibration of GPU servers, racks, and networking equipment.
  • Hardware Upgrades: Implement and support hardware upgrades, including new GPU installations, networking installations, and other critical infrastructure updates.
  • Monitoring & Optimization: Continuously monitor system performance, capacity, and health using tools and alerts, and take proactive steps to optimize resource allocation and prevent downtime.
  • Troubleshooting: Quickly diagnose and resolve hardware and network issues, coordinating with team members to minimize disruptions.
  • Documentation: Maintain accurate records of configurations, maintenance schedules, and hardware inventory for efficient and organized data center management.
  • Collaboration: Serve as the data center liaison for vendor support personnel and manage support tickets with hardware vendors as needed. Work closely with AI researchers and engineers to understand their hardware requirements and support them in running large-scale ML and DL workloads. Prioritize and communicate issues, and provide clear and accurate SLAs.
  • On-call: Work in an environment that operates 24/7, participating in the on-call rotation and providing after-hours support as needed.

Requirements:

  • Technical Education/Experience: Bachelor’s degree in Computer Science, IT, Electrical Engineering, or a related field, or equivalent hands-on experience.
  • Data Center Expertise: 2+ years of experience in a data center environment, with a strong understanding of server maintenance, networking, and hardware troubleshooting.
  • GPU Knowledge: Experience working with GPU hardware, preferably in an AI or high-performance computing environment (experience with AMD GPUs is a plus).
  • Networking Skills: Familiarity with networking concepts (TCP/IP, DNS, DHCP, RoCE, redundancy) and experience with network hardware in a data center setting.
  • Problem-Solving Skills: Strong analytical skills and the ability to quickly diagnose and resolve technical issues.
  • Team Player: Effective communication skills and the ability to work collaboratively with engineering and research teams.

Preferred Skills:

  • Scripting & Automation: Basic scripting skills (Python, Bash) to automate routine tasks.
  • Hands-On Data Center Experience: A deep understanding of server hardware, BMC-based manageability, BIOS settings, and firmware deployment. Familiarity with InfiniBand switches and network topology.
  • Monitoring Tools: Familiarity with monitoring and logging tools, such as Prometheus, Grafana, or similar. Basic Linux system administration expertise.

At Lamini AI, we are committed to providing an environment of mutual respect where equal employment opportunities are available to all applicants without regard to race, color, religion, sex, pregnancy (including childbirth, lactation and related medical conditions), national origin, age, physical and mental disability, marital status, sexual orientation, gender identity, gender expression, genetic information (including characteristics and testing), military and veteran status, and any other characteristic protected by applicable law. Lamini AI believes that diversity and inclusion among our employees is critical to our success as a company, and we seek to recruit, develop and retain the most talented people from a diverse candidate pool. Selection for employment is decided on the basis of qualifications, merit, and business need.


Tags: Computer Science Engineering GPU Grafana InfiniBand Linux Machine Learning Python Research Security Testing

Region: North America
Country: United States
