Data Center Technician
Menlo Park
Lamini
Lamini is the enterprise LLM platform for existing software teams to quickly develop and control their own LLMs. Lamini has built-in best practices for specializing LLMs on billions of proprietary documents to improve performance, reduce...
Lamini enables every enterprise to safely, quickly, and cost-effectively build its own Expert AI. Our customers own their own models, trained on their data. Lamini optimizes for Expert AI workloads with minimal hallucination, enterprise-grade security, and enterprise flexibility, running on any infrastructure. Our team is made up of highly committed engineers, researchers, and tech industry veterans excited by our mission and technology. We’re backed by leading VCs as well as computing and technology companies.
We are looking for a skilled Data Center Technician to oversee the physical and technical aspects of our GPU cluster. This role is essential for maintaining a stable and efficient computing environment, optimizing system performance, and minimizing downtime. You’ll be hands-on with hardware, responsible for troubleshooting, maintenance, and upgrades, and you’ll collaborate with our engineering teams to support their research and production workloads.
Key Responsibilities:
- Cluster Management: Oversee day-to-day operations of our GPU cluster, including hardware and software maintenance, troubleshooting, and repairs to ensure optimal performance.
- Deployment & Configuration: Assist with the deployment, configuration, and calibration of GPU servers, racks, and networking equipment.
- Hardware Upgrades: Implement and support hardware upgrades, including new GPU installations, networking installations, and other critical infrastructure updates.
- Monitoring & Optimization: Continuously monitor system performance, capacity, and health using tools and alerts, and take proactive steps to optimize resource allocation and prevent downtime.
- Troubleshooting: Quickly diagnose and resolve hardware and network issues, coordinating with team members to minimize disruptions.
- Documentation: Maintain accurate records of configurations, maintenance schedules, and hardware inventory for efficient and organized data center management.
- Collaboration: Serve as the data center liaison for vendor support personnel and manage support tickets with hardware vendors as needed. Work closely with AI researchers and engineers to understand their hardware requirements and support them in running large-scale ML and DL workloads. Prioritize and communicate issues, and provide clear and accurate SLAs.
- On-call: Work in an environment that operates 24/7, participate in the on-call rotation, and provide after-hours support as needed.
Requirements:
- Technical Education/Experience: Bachelor’s degree in Computer Science, IT, Electrical Engineering, or a related field, or equivalent hands-on experience.
- Data Center Expertise: 2+ years of experience in a data center environment, with a strong understanding of server maintenance, networking, and hardware troubleshooting.
- GPU Knowledge: Experience working with GPU hardware, preferably in an AI or high-performance computing environment (experience with AMD GPUs is a plus).
- Networking Skills: Familiarity with networking concepts (TCP/IP, DNS, DHCP, RoCE, redundancy) and experience with network hardware in a data center setting.
- Problem-Solving Skills: Strong analytical skills and the ability to quickly diagnose and resolve technical issues.
- Team Player: Effective communication skills and the ability to work collaboratively with engineering and research teams.
Preferred Skills:
- Scripting & Automation: Basic scripting skills (Python, Bash) to automate routine tasks.
- Hands-On Data Center Experience: A deep understanding of server hardware, BMC-based manageability, BIOS settings, and firmware deployment. Familiarity with InfiniBand switches and network topology.
- Monitoring Tools: Familiarity with monitoring and logging tools, such as Prometheus, Grafana, or similar. Basic Linux system administration expertise.
Tags: Computer Science Engineering GPU Grafana InfiniBand Linux Machine Learning Python Research Security Testing
Region: North America
Country: United States