Weights & Biases (W&B) is a tool for tracking machine learning experiments and for visualizing and sharing results. On the PIK cluster, where compute nodes have no internet access, runs must be recorded offline and synced to the W&B servers from the login node.
## Getting Started
**Step 1: Create a W&B Account**
- Academic researchers can apply for a free academic W&B account, which includes 100GB of free storage (https://wandb.ai/site/pricing/)
**Step 2: Install W&B**
- Install the W&B package using pip:
```bash
pip install wandb
```
**Step 3: Configure W&B**
- Log in to your W&B account using the following command:
```bash
wandb login
```

- Enter your API key, which can be obtained from https://wandb.ai/authorize. You can also find it under the user settings in your W&B account.
**Step 4: Sync Data from the Login Node**
- To synchronize data collected offline on the compute nodes to the W&B servers, run the following command on the login node:
```bash
wandb sync /path/to/offline-run-data
```

## PyTorch Integration Example
**Step 1: Import W&B Library**
```python
import wandb
```
**Step 2: Initialize W&B**
- Make sure data is saved locally during training by setting the mode to `offline`:
```python
wandb.init(project="my_project", entity="my_entity", mode="offline")
```
**Step 3: Training Loop**
```python
for epoch in range(num_epochs):
    training_loss = 0
    for data, target in train_loader:
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        training_loss += loss.item()
    # Log training metrics to W&B (stored locally in offline mode)
    wandb.log({"epoch": epoch, "train/loss": training_loss / len(train_loader)})
```
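Once training is done, the run should be closed so the local files are finalized (a brief sketch, continuing the snippet above):

```python
# Close the run; in offline mode this finalizes the local run directory
# (by default under ./wandb/offline-run-<timestamp>-<id>), which is the
# path to pass to `wandb sync` on the login node (Step 4).
wandb.finish()
```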
## PyTorch Lightning Integration Example
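A minimal sketch of the same offline setup with PyTorch Lightning's `WandbLogger`; here `offline=True` plays the role of `mode="offline"` above, and `LitModel`, `train_loader`, and `num_epochs` are placeholders for your own module, data loader, and epoch count:

```python
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

# offline=True stores the run locally so it can be synced
# later from the login node with `wandb sync`
wandb_logger = WandbLogger(project="my_project", entity="my_entity", offline=True)

trainer = pl.Trainer(logger=wandb_logger, max_epochs=num_epochs)
trainer.fit(LitModel(), train_loader)  # LitModel: your LightningModule (placeholder)
```

Metrics recorded inside the `LightningModule` with `self.log(...)` are then forwarded to W&B automatically.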
## Uploading Model Checkpoints to W&B
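One way to do this is with W&B artifacts, sketched below; the `model` variable and file name are placeholders, and in offline mode the checkpoint is only uploaded once the run is synced from the login node:

```python
import torch
import wandb

run = wandb.init(project="my_project", entity="my_entity", mode="offline")

# Save the checkpoint locally, then attach it to the run as an artifact
torch.save(model.state_dict(), "model.pt")  # model: your trained network (placeholder)
artifact = wandb.Artifact("model-checkpoint", type="model")
artifact.add_file("model.pt")
run.log_artifact(artifact)
run.finish()
```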