HPC Systems & Operations Manager
Hyde Park Campus
University of Chicago
One of the world’s leading research universities, the University of Chicago inspires scholars to pursue field-defining research, while providing a transformative education for students.
Job Summary
The University of Chicago is seeking a highly qualified HPC Systems & Operations Manager to oversee the systems and operations team responsible for designing, configuring, deploying, and maintaining the Research Computing Center (RCC) High Performance Computing (HPC) systems, as well as managing facility operations. This hands-on role will involve active participation in day-to-day systems operations. The individual in this position will also be involved in the procurement and management of HPC hardware and software.
This is a hybrid position requiring 3 days onsite.
Responsibilities
Lead the design, configuration, deployment, and management of RCC HPC systems.
Ensure the stability, integrity, and efficient operation of RCC HPC systems that support core organizational functions.
Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
Manage a growing team of HPC system administrators and systems programmers to ensure reliable service delivery.
Oversee the project management of the team’s initiatives, ensuring that all projects receive the necessary management oversight and resources for successful completion.
Serve as the primary point of contact for other university units regarding systems and operations-related matters.
Diagnose and resolve system operational problems promptly and effectively, coordinating with vendors to address hardware and software issues.
Foster automation within HPC systems.
Troubleshoot and identify failed hardware, replace parts, and resolve system failures.
Develop and implement strategies for HPC data management, backup, disaster recovery, and security.
Create standard operating procedures for routine and complex system tasks.
Maintain and monitor the security of HPC systems and servers, implementing robust security measures as necessary.
Provide technical leadership, guidance, and support to the HPC systems and operations team.
Manages a single team’s progress by maintaining accurate and up-to-date logs, and ensures that all projects have the necessary management oversight and approvals for successful completion.
Ensures the implementation of approved best practices and information technology policies that result in the highest quality systems administration.
Performs other related work as needed.
Minimum Qualifications
Education:
Minimum requirements include a college or university degree in a related field.
Work Experience:
---
Certifications:
---
Preferred Qualifications
Education:
Advanced degree strongly preferred.
Experience:
A minimum of seven years of Linux system administration experience in a large, distributed computing environment.
At least three years’ experience supporting a Linux HPC cluster used for scientific research strongly preferred.
Technical Skills or Knowledge:
Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
Proficiency in the installation, maintenance, operation, tuning and troubleshooting of Linux and related systems and software.
Experience in installing, configuring, and maintaining a job scheduler/workload manager (such as SLURM, TORQUE, or PBS).
Experience in configuring, installing and troubleshooting MPI and OpenMP.
Experience with at least one HPC cluster management tool (e.g., xCAT, Confluent, Warewulf, or Bright).
Experience in configuring, administering, and supporting network storage subsystems.
Hands-on experience with at least one parallel file system (e.g., Spectrum Scale-GPFS, Lustre, BeeGFS, or Ceph).
Direct experience working with InfiniBand, including a working knowledge of InfiniBand concepts, OFED layers, and subnet managers, as well as Gigabit Ethernet.
Experience with networking and security.
Experience with systems automation tools such as Ansible or Puppet.
Experience with versioning tools such as Git or Subversion.
Experience configuring, installing, maintaining and using monitoring and optimization tools.
Strong knowledge of scripting languages such as Python or bash.
Preferred Competencies
Ability to communicate effectively.
Ability to work well with faculty and researchers.
Ability to identify and gain expertise in appropriate new technologies and/or software tools.
Ability to function as part of an interactive team while demonstrating self-initiative to achieve project goals and the Research Computing Center's mission.
Strong analytical skills and problem-solving ability.
Application Documents
Cover letter (preferred)
Resume (required)
When applying, the document(s) MUST be uploaded via the My Experience page, in the section titled Application Documents of the application.
Posting Statement
The University of Chicago is an Affirmative Action/Equal Opportunity/Disabled/Veterans Employer and does not discriminate on the basis of race, color, religion, sex, sexual orientation, gender, gender identity, national or ethnic origin, age, status as an individual with a disability, military or veteran status, genetic information, or other protected classes under the law. For additional information please see the University's Notice of Nondiscrimination.
Staff Job seekers in need of a reasonable accommodation to complete the application process should call 773-702-5800 or submit a request via Applicant Inquiry Form.
We seek a diverse pool of applicants who wish to join an academic community that places the highest value on rigorous inquiry and encourages a diversity of perspectives, experiences, groups of individuals, and ideas to inform and stimulate intellectual challenge, engagement, and exchange.
All offers of employment are contingent upon a background check that includes a review of conviction history. A conviction does not automatically preclude University employment. Rather, the University considers conviction information on a case-by-case basis and assesses the nature of the offense, the circumstances surrounding it, the proximity in time of the conviction, and its relevance to the position.
The University of Chicago's Annual Security & Fire Safety Report (Report) provides information about University offices and programs that provide safety support, crime and fire statistics, emergency response and communications plans, and other policies and information. The Report can be accessed online at: http://securityreport.uchicago.edu. Paper copies of the Report are available, upon request, from the University of Chicago Police Department, 850 E. 61st Street, Chicago, IL 60637.