Lab Facilities & Resources
Access high-performance computing resources, technical documentation, and best practices for conducting research at the Cyber Innovation Lab.
The Cyber Innovation Compute Cluster
The Cyber Innovation Cluster is a dedicated high-performance computing (HPC) environment designed to accelerate advanced machine learning and AI research. Managed via the SLURM workload manager, the cluster is specifically optimized for Distributed Data Parallel (DDP) PyTorch training. This infrastructure lets lab members scale complex models across nodes and run multi-node experiments.
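As a minimal sketch, a multi-node DDP job on this cluster might look like the batch script below. The script name, environment path, `train.py`, and port 29500 are placeholders, not lab conventions; the resource requests mirror the architecture described on this page (one GPU and ~20 cores per compute node).

```shell
#!/bin/bash
#SBATCH --job-name=ddp-example     # hypothetical job name
#SBATCH --nodes=2                  # two of the four compute nodes
#SBATCH --ntasks-per-node=1        # one launcher process per node
#SBATCH --gres=gpu:1               # each node has a single GPU
#SBATCH --cpus-per-task=20         # ~20 cores per compute node

# Activate your reproducible Conda environment (see etiquette guidelines)
conda activate ~/myenv

# Pick the first allocated node as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# torchrun starts one worker per GPU on every node and wires up DDP
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=1 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$head_node:29500" \
    train.py
```

Submitted with `sbatch`, this allocates the nodes, launches one `torchrun` per node via `srun`, and lets the c10d rendezvous backend coordinate the DDP process group.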
Architecture
- Login nodes: 1
- Compute nodes: 4
- GPU hardware: NVIDIA RTX 5070 Ti (single GPU per compute node)
- CPU hardware: ~20 cores per compute node
- Orchestration: SLURM Workload Manager
SLURM Quick Reference
Essential commands for submitting and monitoring your jobs on the cluster.
sbatch <script.sh>
Submit a job to the cluster queue.
squeue -u $USER
List your currently active and queued jobs.
scancel <job_id>
Cancel a specific job.
sinfo -Nel
View detailed node information and cluster status.
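Putting the commands above together, a typical job lifecycle looks like the following (the script name `job.sh` and job ID `12345` are illustrative):

```shell
# Submit; SLURM prints the assigned job ID
sbatch job.sh            # "Submitted batch job 12345"

# Watch your queue; the ST column shows R (running) or PD (pending)
squeue -u $USER

# Cancel the run using the ID reported by sbatch
scancel 12345

# Check node states before requesting resources for the next job
sinfo -Nel
```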
Data Management & Storage Hygiene
Shared storage fills quickly. Please use the following commands to safely transfer data and purge your caches after large runs.
# Upload data from your local machine to the cluster
rsync -avhP /local/path/ <user>@login:~/datasets/folder/
# Download results to your local machine
scp -r <user>@login:~/projects/cil-ddp/results/ ./results
# Purge Pip Cache to save space
pip cache purge
# Clean Conda packages and tarballs
conda clean --all -y
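The pip and conda commands above cover package caches; Hugging Face and Torch keep separate download caches under `~/.cache` by default. The paths below assume default settings (no `HF_HOME` or `TORCH_HOME` overrides):

```shell
# Inspect cache sizes before deleting anything
du -sh ~/.cache/huggingface ~/.cache/torch 2>/dev/null || true

# Remove cached models and checkpoints (re-downloaded on next use)
rm -rf ~/.cache/huggingface/hub
rm -rf ~/.cache/torch/hub/checkpoints
```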
Cluster Etiquette & Policies
To ensure fair access and optimal performance for all lab members, please adhere to the following guidelines:
- Never run training on the login node. Use it only to compile code, move data, and submit SLURM jobs via `sbatch`.
- Request only what you need. Monitor your resource usage and release nodes promptly when experiments finish.
- Maintain reproducible environments. Always use Conda environments (e.g., `~/myenv`) and export `requirements.txt` or `environment.yml` files.
- Clean up shared storage. Delete temporary files and clear Hugging Face, Torch, Pip, and Conda caches regularly.
- Log out when finished. The cluster uses passwordless SSH inside the fabric. Logging out prevents accidental reuse of your session and frees resources.
- Respect desk nodes. Use the physical desk nodes (c1, c2) for interactive work, and steer large, multi-node queued jobs toward the non-desk nodes (c3, c4) when possible using `#SBATCH --nodelist`.
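The desk-node guidance can be expressed directly in a job script. Either directive below is standard SLURM syntax; the node names c1–c4 are those listed above (use one directive or the other, not both):

```shell
# Pin a multi-node batch job to the non-desk nodes
#SBATCH --nodelist=c3,c4

# ...or equivalently, keep the job off the desk nodes
#SBATCH --exclude=c1,c2
```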