Lab Facilities & Resources

Access the lab's high-performance computing resources and follow best practices for research computing.

The Cyber Innovation Compute Cluster

The Cyber Innovation Cluster is managed with SLURM and optimized for PyTorch Distributed Data Parallel (DDP) training.

Architecture

1 Login Node, 4 Compute Nodes

GPU Hardware

NVIDIA RTX 5070 Ti (Single GPU/node)

CPU Hardware

~20 Cores per Compute Node

Orchestration

SLURM Workload Manager

Cluster Documentation Guide

Prepared by Ashim Dahal. Covers DDP, SLURM, and Multi-Node Training.

Download PDF Guide

SLURM Quick Reference

sbatch <script.sh>

Submit a job to the cluster queue.

squeue -u $USER

List your running and queued jobs.

scancel <job_id>

Cancel a specific job.

sinfo -Nel

View node information and status.
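As a rough illustration of how these commands fit a multi-node DDP run, the sketch below writes a batch script that requests all four compute nodes with one GPU each and starts one torchrun launcher per node. The file name train_ddp.sbatch, the job name, the CPU count, the time limit, port 29500, and train.py are placeholders, and the --gres=gpu:1 line assumes the GPUs are exposed as a "gpu" generic resource. Check the cluster guide for the settings actually used here.

```shell
# Write a hypothetical batch script; adjust names and limits to your project.
cat > train_ddp.sbatch <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=ddp-train
# All four compute nodes, one launcher task on each.
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
# Assumes GPUs are exposed as a "gpu" generic resource.
#SBATCH --gres=gpu:1
# Assumed value; leaves headroom on a ~20-core node.
#SBATCH --cpus-per-task=8
# Assumed wall-clock limit.
#SBATCH --time=01:00:00
# Log file named <job-name>-<job-id>.out.
#SBATCH --output=%x-%j.out

# Use the first allocated node as the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# One torchrun per node, one worker per GPU; train.py is a placeholder.
srun torchrun \
  --nnodes="$SLURM_JOB_NUM_NODES" \
  --nproc_per_node=1 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$MASTER_ADDR:29500" \
  train.py
EOF

# Submit from the login node once the script looks right:
# sbatch train_ddp.sbatch
```

After submitting, squeue -u $USER shows whether the job is pending or running, and scancel with the reported job ID stops it.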

Data Management & Storage Hygiene

# Upload datasets with progress tracking

rsync -avhP /local/path/ user@login:~/datasets/folder/

# Purge pip and Conda caches

pip cache purge && conda clean --all -y
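Before purging, it can help to see which caches are actually taking up space. The sketch below defines a hypothetical helper, report_cache_usage, that prints the size of each cache directory it finds. The three paths are the usual Linux defaults for Hugging Face, Torch, and pip; they will differ if HF_HOME, TORCH_HOME, or PIP_CACHE_DIR is set in your environment.

```shell
# Print the size of each cache directory that exists for the current user.
# Paths are common Linux defaults and are assumptions about this cluster.
report_cache_usage() {
  local dir
  for dir in "$HOME/.cache/huggingface" "$HOME/.cache/torch" "$HOME/.cache/pip"; do
    if [ -d "$dir" ]; then
      # -s: one total per directory, -h: human-readable units
      du -sh "$dir"
    fi
  done
}

report_cache_usage
```

Directories that do not exist are skipped silently, so the helper is safe to run in a fresh account.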

Cluster Etiquette & Policies

  • Never run training on the login node. Use it only for compiling code and submitting jobs.
  • Request only what you need. Release nodes promptly after experiments.
  • Maintain reproducible environments (Conda/requirements.txt).
  • Clean up shared storage (Hugging Face, Torch, and pip caches).
  • Use physical desk nodes (c1, c2) for interactive work.
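One way to make the first rule harder to break by accident is a small guard at the top of a launch script. The sketch below defines a hypothetical function, require_slurm_job, which relies on the fact that SLURM sets SLURM_JOB_ID inside batch and srun-launched jobs. It is an illustration only, not an enforced cluster policy.

```shell
# Return success only when running inside a SLURM allocation.
# SLURM sets SLURM_JOB_ID for batch jobs and srun-launched steps.
require_slurm_job() {
  if [ -z "${SLURM_JOB_ID:-}" ]; then
    echo "Not inside a SLURM job; submit with sbatch instead of running on the login node." >&2
    return 1
  fi
}

# Typical use at the top of a launch script:
# require_slurm_job || exit 1
```

The guard checks only for a SLURM allocation, so a script run directly in a shell outside SLURM would also be stopped by it.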