System Overview
System Identity
whoami
Walid Abu Al-Afia
Computational Engineer @ St. Jude Children's Research Hospital
M.S. Computer Science @ The University of Texas at Austin
CPU Cores
GPUs
Cluster Environments
Data Migrated
Active Roles
systemctl status
Quick Actions
System Uptime
since June 2022
Recent Events
tail -f /var/log/career
Experience
Job Queue
squeue -u walid --format="%.8i %.12P %.30j %.8T %.12M %.6D"
- Designs and implements large-scale monitoring infrastructure using Prometheus and Grafana for comprehensive cluster observability
- Piloted secure adoption of AI Agents across all St. Jude HPC clusters; negotiated contracts with OpenAI, Anthropic, GitHub, and Cursor
- Serves as the institutional resource and subject matter expert for AI Agents at St. Jude
- Architected 20,000-line MLOps-focused Python package for low-code/no-code ML model training
- Expanded Open OnDemand from single to multi-cluster deployment spanning 4 environments (Slurm HPC, SCCE/GDPR, colocation, main HPC)
- Led SCCE cluster (GDPR-compliant) and Slurm HPC Model Training cluster implementation
- Sole resource for 19 CryoSPARC instances; built On-The-Fly cryo-EM/ET processing pipeline
- Conducted 4-month, 30 PB data migration to dedicated Imaging Storage system
- Transformed module installation system from bare-metal to container-based builds
- Manages internship program, reviewing 300-500 applications yearly
- Conducts interviews and coordinates hiring of 15 interns annually
- Mentors interns throughout the summer program
- Established intern-to-full-time pipeline creating career tracks for permanent positions
- Built and deployed Open OnDemand instance serving 20,000+ core cluster
- Authored multiple Interactive Applications (Maestro, VMD, Scipion)
- Managed all software and module installations across RHEL7/RHEL8
- Organized and taught seminars on HPC programming tools
- Optimized parallel programs (MPI, OpenMP, CUDA) for researchers
- Built Prometheus + Grafana metrics collection environment
- Designed cluster monitoring dashboards for resource utilization insights
- Performed routine module installations on RHEL7 HPC Cluster
- Gained proficiency in LSF workload manager administration
- Developed VR application using Unity + Meta SDK for pediatric patient training
- Built REST API for AlphaFold-based protein structure prediction
- Compiled deployment documentation for HPC engineering team
- CS Head Tutor: hired, trained, managed 9 tutors; built TutoringBot queue system
- CS Tutor: tutored ~20 students in Python, Java, C, data structures, algorithms
- Cloud Admin: managed JupyterHub Kubernetes cluster on GCP; integrated OneLogin SSO
- Research Fellow: co-authored IROS 2023 paper on human-robot interaction; built AR app in Unity
- Created Bitcoin trading bot with live price data and technical indicators
- Integrated cryptocurrency exchange APIs for automated trading
- Developed data processing pipelines in Python and Java
Resource Allocation Timeline
sacct --starttime=2018-06-01 --format=JobName,Start,End,State
Skills & Technologies
GPU Utilization
nvidia-smi
+-----------------------------------------------------------------------------------------+
| WALID-SMI 550.127              Driver Version: 550.127          CUDA Version: 12.4      |
|-----------------------------------------+------------------------+----------------------+
| Skill                                   | Proficiency            | Utilization          |
|=========================================+========================+======================|
| Python                                  | Expert                 | 95%                  |
| C/C++                                   | Advanced               | 80%                  |
| Rust                                    | Intermediate           | 55%                  |
| Go                                      | Intermediate           | 50%                  |
| Bash/Shell                              | Expert                 | 92%                  |
| JavaScript/React                        | Advanced               | 70%                  |
| Java                                    | Advanced               | 75%                  |
|-----------------------------------------+------------------------+----------------------|
| Slurm/LSF                               | Expert                 | 95%                  |
| MPI/OpenMP/CUDA                         | Advanced               | 82%                  |
| Prometheus/Grafana                      | Expert                 | 90%                  |
| Docker/K8s/Apptainer                    | Advanced               | 80%                  |
| MLOps/Deep Learning                     | Advanced               | 78%                  |
| AI Agents/LLMs                          | Advanced               | 85%                  |
| CryoSPARC/Cryo-EM                       | Expert                 | 90%                  |
+-----------------------------------------------------------------------------------------+
Loaded Modules
module avail
--- /opt/languages ---
--- /opt/hpc ---
--- /opt/devops ---
--- /opt/ml ---
--- /opt/editors ---
Language Interfaces
ip link show
Certifications
/etc/certs
NVIDIA Certified Associate
AI Infrastructure and Operations
2025
Focus Areas
research interests
Education
System Log
journalctl -u education --no-pager
Coursework Modules
lsmod | grep coursework
--- UT Austin (Graduate) ---
--- Rhodes College (Undergraduate) ---
Awards & Honors
achievements unlocked
Research
Publication Record
SELECT * FROM publications ORDER BY year DESC
Development and Evaluation of Exploratory Experiences to Facilitate Reasoning About Robotic Systems
Research Keywords
indexed terms
External Data Source
curl -s https://ieeexplore.ieee.org
Projects
Running Services
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
mlops-toolkit
20,000-line MLOps Python package for low-code ML model training with comprehensive Jupyter notebooks
hpc-monitoring-stack
Prometheus + Grafana monitoring infrastructure for CPU, GPU, and Slurm job-level metrics
ondemand-multi-cluster
Open OnDemand deployment spanning 4 cluster environments with LSF and Slurm support
cryosparc-fleet
19 CryoSPARC instances with On-The-Fly processing pipeline for structural biology research
vr-patient-training
Unity + Meta SDK VR application for pediatric radiation oncology patient preparation
alphafold-api
REST API endpoint for AlphaFold-based protein structure prediction on HPC
Container Registry
github.com/walidabualafia
Distributed systems, HPC tools, system-level programming, and more
View Repositories
Build Pipeline
CI/CD Status
More projects in development. Stay tuned.
Contact
Network Interfaces
ip addr show
Connection Status
health check
Open to relocation • Remote-friendly
Download Artifacts
build output
abualafia-curriculum-vitae.pdf
Latest build • 6 pages
Node Location
hostname --fqdn
Memphis, TN
Open to Relocation
Originally from Amman, Jordan