Linux Administrator

Allentown, PA

Full Time USD 80K - 100K

In Time Tec

Posted 3 weeks ago

In Time Tec is a global, people-first software services company dedicated to creating lasting value for our partners, employees, and the communities we serve. We’re not just a company—we’re a high-performing team that attributes our success to the core values that guide us: trust, transparency, integrity, and leadership. We’re looking for talented individuals who share these values and are eager to help build on the strong foundation we’ve already established, leveraging our proven infrastructure, resources, and expertise to reach new levels. 

Here’s how our values shape our work: 

Trust: We build strong and authentic relationships with our clients and our team by doing what we say we’ll do, on time, with quality, consistently and reliably.

Transparency: We believe in open communication at all levels, with clients and teammates. We share information honestly to drive informed decisions and create mutual success.

Integrity: We define integrity as workability—the ability to make things work effectively in every scenario, situation, and setting, for everyone involved. For the system to function as designed, every component must have integrity. This foundation provides seamless delivery and powerful results.

Leadership: Leadership isn’t about titles—it’s about inspiring others, taking ownership, figuring out solutions, and making an impact. We empower each team member to lead from wherever they are, driving growth and results.

If you are considering this position, you should read our blog posts, look at our socials, listen to our podcasts, and read reviews to get a better sense of who we are. Here are some resources to explore:

https://www.intimetec.com/

https://www.glassdoor.com/Reviews/In-Time-Tec-Reviews-E484136.htm

https://www.youtube.com/watch?v=03gQSvmG2Ps&feature=youtu.be

https://www.youtube.com/watch?v=mG3zwQhrigo

Role Overview:

We are seeking a skilled and dedicated Linux Administrator. The primary focus of this role is to support and maintain a GPU cluster and backend infrastructure, ensuring seamless operation for AI work processes. The successful candidate will be responsible for ensuring that all hardware and software components function effectively, providing support to data scientists and external contractors as they develop and utilize AI solutions.

Key Responsibilities: 

Manage and Maintain GPU Cluster: Oversee the installation, configuration, and maintenance of GPU hardware and software. Ensure optimal performance and availability of the GPU cluster for AI workloads.

System Administration: Perform routine system administration tasks on Linux servers, including updates, patches, and security configurations. Monitor system performance and troubleshoot issues as they arise.

Support AI Work Processes: Collaborate with data scientists and external contractors to provide technical support and ensure that AI development and deployment processes run smoothly. Assist in the setup and configuration of AI tools and frameworks.

Infrastructure Management: Maintain and optimize the backend infrastructure, including storage, networking, and compute resources. Ensure high availability and reliability of all systems.

Automation and Scripting: Develop and maintain scripts and automation tools to streamline system administration tasks and improve efficiency.

Security and Compliance: Implement and enforce security best practices to protect sensitive data and ensure compliance with company policies and industry standards.

Documentation and Reporting: Create and maintain comprehensive documentation for system configurations, procedures, and troubleshooting guides. Provide regular reports on system performance and utilization.

Technical Skills:

Deep proficiency in Linux system administration (preferably RHEL, CentOS, or Ubuntu Server).

Experience with kernel tuning, filesystem management, and system hardening.

Expertise in managing system logs, cron jobs, system services (systemd), and boot process diagnostics.

Hands-on experience with GPU cluster configuration and maintenance, particularly NVIDIA-based architectures.

Familiarity with GPU monitoring and debugging tools (e.g., nvidia-smi, DCGM).

Understanding of multi-GPU workload scheduling using platforms like SLURM, Kubernetes, or similar.

Strong understanding of compute, storage, and networking components of an AI compute cluster.

Experience with NFS, Ceph, or other distributed/shared file systems.

Knowledge of Ethernet, InfiniBand, or RoCE networks as applicable to high-performance computing (HPC) setups.

Proficiency in RAID configuration, LVM, and system provisioning tools (e.g., Cobbler, PXE boot environments).

Expertise in scripting languages such as Bash, Python, or Perl for automation of routine tasks.

Experience with configuration management and infrastructure-as-code tools such as Ansible, Puppet, or Terraform.

Ability to automate provisioning, patching, and application deployment pipelines.

Familiarity with AI frameworks and toolkits (e.g., TensorFlow, PyTorch, CUDA) and the dependencies needed for successful deployment and execution.

Comfortable working alongside data science teams to install, optimize, and troubleshoot machine learning frameworks.

Experience with containerization technologies like Docker, and orchestration using Docker Compose or Kubernetes for AI workloads.

Knowledge of Linux security tools (e.g., SELinux, AppArmor, auditd, Fail2ban).

Familiarity with vulnerability scanning, system patching, and maintaining compliance with organizational and regulatory standards (e.g., CIS benchmarks, ISO 27001).