RunPod Integration
RunPod is the primary GPU compute provider for Hubify Labs. This guide covers connecting your RunPod account, configuring pods, and optimizing for cost.
Connecting RunPod
Create a RunPod account
Sign up at runpod.io and add billing information.
Available GPU Types
| GPU | VRAM | Best For | Approx. Cost/hr |
|---|---|---|---|
| H200 | 141 GB | Large models, full-dataset anomaly detection | $3.89 |
| H100 | 80 GB | MCMC chains, training, most experiments | $2.49 |
| A100 | 80 GB | General GPU compute | $1.64 |
| A40 | 48 GB | Medium workloads, figure generation | $0.79 |
| RTX 4090 | 24 GB | Small models, prototyping | $0.44 |
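As a rough illustration of matching a GPU to a workload, the sketch below picks the cheapest GPU from the table above that has enough VRAM and estimates the cost of a run. The prices are the approximate figures from the table; `Gpu`, `cheapest_gpu`, and `estimate_cost` are hypothetical helper names for illustration, not part of any Hubify or RunPod API.

```python
from typing import NamedTuple, Optional


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    cost_per_hr: float  # approximate USD per hour, copied from the table above


GPUS = [
    Gpu("H200", 141, 3.89),
    Gpu("H100", 80, 2.49),
    Gpu("A100", 80, 1.64),
    Gpu("A40", 48, 0.79),
    Gpu("RTX 4090", 24, 0.44),
]


def cheapest_gpu(min_vram_gb: float) -> Optional[Gpu]:
    """Return the lowest-cost GPU with at least min_vram_gb of VRAM, or None."""
    candidates = [g for g in GPUS if g.vram_gb >= min_vram_gb]
    return min(candidates, key=lambda g: g.cost_per_hr, default=None)


def estimate_cost(gpu: Gpu, hours: float, num_gpus: int = 1) -> float:
    """Estimate the cost in USD of running num_gpus pods for the given hours."""
    return round(gpu.cost_per_hr * hours * num_gpus, 2)
```

This selection looks only at VRAM and hourly price. It ignores throughput differences (for example between H100 and A100), so treat the result as a starting point rather than a recommendation.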
Pod Configuration
Default Settings
Docker Images
Hubify provides pre-built images with common scientific packages:

| Image | Contents |
|---|---|
| `hubify/base:latest` | Python 3.11, CUDA 12, PyTorch 2.1 |
| `hubify/cosmo:latest` | Base + Cobaya, GetDist, Astropy, HEALPy |
| `hubify/ml:latest` | Base + Transformers, Accelerate, Datasets |
| `hubify/astro:latest` | Base + Astropy, Photutils, SEP, Source Extractor |
SSH Access
Performance Tips
- Use DataLoader for GPU inference: `num_workers=16`, `pin_memory=True`, `prefetch_factor=4` gives a 32x speedup over serial processing
- Pre-stage large datasets on persistent storage so pods start instantly
- Use spot instances for non-urgent experiments (set the `--spot` flag)
- Match GPU to workload: do not use an H200 for figure generation
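The DataLoader settings in the first tip can be sketched as a small helper that builds the keyword arguments for PyTorch's `torch.utils.data.DataLoader`. The helper name `loader_kwargs` and the clamping of workers to the pod's CPU count are assumptions made for illustration; only `num_workers=16`, `pin_memory=True`, and `prefetch_factor=4` come from the tip itself.

```python
import os
from typing import Any, Dict, Optional


def loader_kwargs(batch_size: int, max_workers: int = 16,
                  cpu_count: Optional[int] = None) -> Dict[str, Any]:
    """Build keyword arguments for torch.utils.data.DataLoader.

    The worker count is capped at the number of CPUs on the pod
    (an assumption of this sketch, not a Hubify requirement).
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    workers = max(0, min(max_workers, cpus))
    kwargs: Dict[str, Any] = {
        "batch_size": batch_size,
        "num_workers": workers,
        "pin_memory": True,  # page-locked host memory for faster GPU copies
    }
    if workers > 0:
        # DataLoader only accepts prefetch_factor when worker processes are used.
        kwargs["prefetch_factor"] = 4
    return kwargs


# Usage, with PyTorch installed and `dataset` defined elsewhere:
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, **loader_kwargs(batch_size=256))
```

The speedup you actually see depends on how expensive per-sample loading is relative to the GPU work, so it is worth measuring on your own data.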
Cost Management
Persistent Storage
Upload datasets to RunPod persistent storage so they survive pod restarts:
Troubleshooting
Pod stuck in provisioning
The requested GPU type may be sold out. Try a different GPU or region:
Out of memory (OOM)
Upgrade to a GPU with more VRAM, or reduce batch size. H200 (141 GB) handles the largest workloads.
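One way to apply the "reduce batch size" advice automatically is to retry a step with half the batch size whenever it runs out of memory. The following is a minimal sketch under stated assumptions: `run_with_smaller_batches` is a hypothetical helper, and `MemoryError` is a stand-in for whatever exception your framework raises (with PyTorch, that would typically be `torch.cuda.OutOfMemoryError`).

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_smaller_batches(
    step: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
    oom_errors: Tuple[type, ...] = (MemoryError,),
) -> Tuple[T, int]:
    """Call step(batch_size), halving batch_size after each out-of-memory error.

    Returns (result, batch_size_that_worked). Re-raises the last error if
    even min_batch_size does not fit.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except oom_errors:
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

If no batch size fits, the error propagates, which is the signal to move to a GPU with more VRAM.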
Spot instance preempted
Spot instances can be reclaimed. Use checkpointing for long experiments:
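A minimal sketch of such checkpointing, assuming the experiment state fits in a small JSON file stored on persistent storage. The function names, the state layout, and the checkpoint interval are all hypothetical; a real experiment would save model weights or chain state with its own framework's tools.

```python
import json
import os
from typing import Any, Dict


def save_checkpoint(path: str, state: Dict[str, Any]) -> None:
    """Write state atomically so a preemption mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path: str, num_steps: int, every: int = 100) -> Dict[str, Any]:
    """Resume from the last checkpoint and run up to num_steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        state["total"] += step  # placeholder for the real work
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

After a preemption, calling `run` again with the same path continues from the last saved step, so at most `every` steps of work are repeated.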