Lab Facilities & Resources

Access high-performance computing resources, technical documentation, and best practices for conducting research at the Cyber Innovation Lab.

The Cyber Innovation Compute Cluster

The Cyber Innovation Cluster is a dedicated high-performance computing (HPC) environment designed to accelerate advanced machine learning and AI research. Managed via the SLURM workload manager, the cluster is optimized for Distributed Data Parallel (DDP) PyTorch training, allowing lab members to scale complex models and run multi-node experiments.
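
As a sketch, a multi-node DDP job on this cluster might be submitted with a SLURM batch script along these lines. The entry point `train.py`, the rendezvous port, and the resource values are illustrative assumptions; one GPU per node matches the hardware listed below.

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train        # illustrative job name
#SBATCH --nodes=2                   # two of the four compute nodes
#SBATCH --ntasks-per-node=1         # one torchrun launcher per node
#SBATCH --gpus-per-node=1           # each node has a single GPU
#SBATCH --cpus-per-task=16          # nodes have ~20 cores; leave headroom

# Use the first allocated node as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Launch one DDP worker per node (hypothetical entry point train.py;
# port 29500 is an arbitrary free port)
srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=1 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:29500" \
  train.py
```

Because `srun` starts one task per node, each node runs its own `torchrun` launcher and the workers rendezvous through the first node in the allocation.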

  • Architecture: 1 login node, 4 compute nodes
  • GPU Hardware: NVIDIA RTX 5070 Ti (single GPU per node)
  • CPU Hardware: ~20 cores per compute node
  • Orchestration: SLURM Workload Manager

Cluster Documentation Guide

Prepared by Ashim Dahal; covers DDP, SLURM, and multi-node training. Available as a downloadable PDF guide.

SLURM Quick Reference

Essential commands for submitting and monitoring your jobs on the cluster.

  • `sbatch <script.sh>`: submit a job to the cluster queue.
  • `squeue -u $USER`: list your currently active and queued jobs.
  • `scancel <job_id>`: cancel a specific job.
  • `sinfo -Nel`: view detailed node information and cluster status.

Data Management & Storage Hygiene

Shared storage fills quickly. Please use the following commands to safely transfer data and purge your caches after large runs.

# Upload large datasets to the login node with progress tracking
# (-P shows per-file progress and keeps partial transfers resumable)
rsync -avhP /local/path/ <user>@login:~/datasets/folder/

# Download results to your local machine
scp -r <user>@login:~/projects/cil-ddp/results/ ./results

# Purge Pip Cache to save space
pip cache purge

# Clean Conda packages and tarballs
conda clean --all -y
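
The Hugging Face and Torch hub caches under `~/.cache` also grow quickly after large runs. The paths below are the libraries' default cache locations (overridable via the `HF_HOME` and `TORCH_HOME` environment variables), so check what is there before deleting:

```shell
# Inspect cache sizes first (silently skips directories that do not exist)
du -sh ~/.cache/huggingface ~/.cache/torch 2>/dev/null || true

# Remove downloaded model weights and hub checkouts (default locations;
# anything still needed is re-downloaded automatically on next use)
rm -rf ~/.cache/huggingface/hub
rm -rf ~/.cache/torch/hub
```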

Cluster Etiquette & Policies

To ensure fair access and optimal performance for all lab members, please adhere to the following guidelines:

  • Never run training on the login node. Use it only to compile code, move data, and submit SLURM jobs via `sbatch`.
  • Request only what you need. Monitor your resource usage and release nodes promptly when experiments finish.
  • Maintain reproducible environments. Always use Conda environments (e.g., `~/myenv`) and export `requirements.txt` or `environment.yml` files.
  • Clean up shared storage. Delete temporary files and clear Hugging Face, Torch, Pip, and Conda caches regularly.
  • Log out when finished. The cluster uses passwordless SSH inside the fabric. Logging out prevents accidental reuse of your session and frees resources.
  • Use desk nodes appropriately. Reserve the physical desk nodes (c1, c2) for interactive work, and steer large multi-node batch jobs toward the non-desk nodes (c3, c4) where possible using `#SBATCH --nodelist`.
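
For the desk-node guideline, pinning a queued job to the non-desk nodes is a single directive in the batch script header (job name and node count here are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=big-run          # illustrative job name
#SBATCH --nodes=2
#SBATCH --nodelist=c3,c4            # keep multi-node batch work off desk nodes c1, c2
```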