Losses & Optimizers
API reference for loss functions and parameter optimizers.
Loss
class vanillanets.losses.Loss
Base class. Subclasses implement forward(y_pred, y_true), returning a per-sample array, and backward(dvalues, y_true), setting .dinputs.
def calculate(self, output, y):
    sample_losses = self.forward(output, y)
    return np.mean(sample_losses)
model.fit() and model.evaluate() call .calculate() for the reported scalar loss.
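As a rough illustration of the subclass contract, the sketch below re-creates a stand-in `Loss` base class (not the library source) and a hypothetical `MeanAbsoluteError` subclass that is not part of vanillanets. It shows `forward` returning one value per sample and `calculate` reducing that array to a scalar.

```python
import numpy as np


class Loss:
    """Stand-in for the base class described above, for illustration only."""

    def forward(self, y_pred, y_true):
        raise NotImplementedError

    def calculate(self, output, y):
        # Reduce the per-sample losses to the scalar that gets reported.
        sample_losses = self.forward(output, y)
        return np.mean(sample_losses)


class MeanAbsoluteError(Loss):
    """Hypothetical subclass; the name and formula are assumptions."""

    def forward(self, y_pred, y_true):
        # One loss value per sample: mean over the output dimension.
        return np.mean(np.abs(y_true - y_pred), axis=-1)

    def backward(self, dvalues, y_true):
        samples = len(dvalues)
        outputs = len(dvalues[0])
        # Gradient of the mean absolute error, normalized like the other losses.
        self.dinputs = np.sign(dvalues - y_true) / outputs / samples


y_pred = np.array([[1.0, 2.0], [3.0, 5.0]])
y_true = np.array([[1.0, 4.0], [3.0, 4.0]])

loss = MeanAbsoluteError()
per_sample = loss.forward(y_pred, y_true)  # shape (2,)
scalar = loss.calculate(y_pred, y_true)    # mean of the per-sample values
```

Here `per_sample` holds one entry per row of `y_pred`, and `scalar` is their mean.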
BinaryCrossEntropy
class vanillanets.losses.BinaryCrossEntropy(Loss)
Forward - predictions are clipped on both sides to [1e-7, 1 - 1e-7], so neither log(p) nor log(1 - p) evaluates log(0):
p = np.clip(y_pred, 1e-7, 1 - 1e-7)
sample_losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
Backward - gradient normalized twice, by n_outputs then by n_samples:
samples = len(dvalues)      # n_samples
outputs = len(dvalues[0])   # n_outputs
p = np.clip(dvalues, 1e-7, 1 - 1e-7)
self.dinputs = -(y_true / p - (1 - y_true) / (1 - p)) / outputs
self.dinputs = self.dinputs / samples
Pairing: Sigmoid output activation, y_true shape (n, 1).
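The double normalization can be sanity-checked numerically. The sketch below restates the forward and backward formulas as standalone functions (`bce_forward` and `bce_backward` are illustrative names, not library API) and compares the analytic gradient with a central finite difference of the mean loss. It also confirms that clipping keeps the loss finite at predictions of exactly 0 and 1.

```python
import numpy as np


def bce_forward(y_pred, y_true):
    # Clip so that log() never sees 0.
    p = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def bce_backward(dvalues, y_true):
    samples, outputs = dvalues.shape
    p = np.clip(dvalues, 1e-7, 1 - 1e-7)
    # Normalize by n_outputs, then by n_samples.
    return -(y_true / p - (1 - y_true) / (1 - p)) / outputs / samples


y_pred = np.array([[0.9], [0.2], [0.6]])
y_true = np.array([[1.0], [0.0], [1.0]])

analytic = bce_backward(y_pred, y_true)

# Central finite difference of the scalar (mean) loss, one entry at a time.
h = 1e-6
numeric = np.zeros_like(y_pred)
for idx in np.ndindex(*y_pred.shape):
    up, down = y_pred.copy(), y_pred.copy()
    up[idx] += h
    down[idx] -= h
    numeric[idx] = (
        np.mean(bce_forward(up, y_true)) - np.mean(bce_forward(down, y_true))
    ) / (2 * h)

# Worst-case predictions: confidently wrong at exactly 0 and 1.
edge_loss = bce_forward(np.array([[0.0], [1.0]]), np.array([[1.0], [0.0]]))
```

If the two gradients agree, the backward pass is the derivative of the scalar that `calculate()` reports, which is why both divisions are needed.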
CategoricalCrossEntropy
class vanillanets.losses.CategoricalCrossEntropy(Loss)
Accepts y_true as either one-hot (n, C) or integer labels (n,).
Forward
samples = len(y_pred)
p = np.clip(y_pred, 1e-7, 1 - 1e-7)
if y_true.ndim == 1:
    # Integer labels: pick the predicted probability of the true class.
    correct_confidences = p[range(samples), y_true]
else:
    # One-hot labels: mask out every class except the true one.
    correct_confidences = np.sum(p * y_true, axis=1)
sample_losses = -np.log(correct_confidences)
Backward
samples = len(dvalues)
labels = len(dvalues[0])
if y_true.ndim == 1:
    # Convert integer labels to one-hot.
    y_true = np.eye(labels)[y_true]
p = np.clip(dvalues, 1e-7, 1 - 1e-7)
self.dinputs = (-y_true / p) / samples
This is the standalone, Jacobian-based backward path - used only when the fused shortcut below isn't active.
Pairing: Softmax output activation.
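To illustrate the two label formats, the sketch below restates the forward and backward formulas as standalone functions (`cce_forward` and `cce_backward` are illustrative names, not library API) and runs them on the same predictions with integer and one-hot labels.

```python
import numpy as np


def cce_forward(y_pred, y_true):
    samples = len(y_pred)
    p = np.clip(y_pred, 1e-7, 1 - 1e-7)
    if y_true.ndim == 1:
        # Integer labels: index the true-class probability directly.
        correct = p[range(samples), y_true]
    else:
        # One-hot labels: the mask leaves only the true-class probability.
        correct = np.sum(p * y_true, axis=1)
    return -np.log(correct)


def cce_backward(dvalues, y_true):
    samples, labels = dvalues.shape
    if y_true.ndim == 1:
        y_true = np.eye(labels)[y_true]
    p = np.clip(dvalues, 1e-7, 1 - 1e-7)
    return (-y_true / p) / samples


probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
int_labels = np.array([0, 1, 2])
one_hot = np.eye(3)[int_labels]

loss_int = cce_forward(probs, int_labels)
loss_hot = cce_forward(probs, one_hot)
grad_int = cce_backward(probs, int_labels)
grad_hot = cce_backward(probs, one_hot)
```

Both label formats should give the same per-sample losses and the same gradient, since the integer branch selects exactly the entries that the one-hot mask keeps.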
SparseCategoricalCrossEntropy
class vanillanets.losses.SparseCategoricalCrossEntropy(CategoricalCrossEntropy)
    pass
An alias for CategoricalCrossEntropy, for callers who want explicit naming when working with integer-encoded labels (y_true shape (n,)). forward and backward are inherited unchanged - CategoricalCrossEntropy already detects y_true.ndim == 1 and handles it directly. There is no behavioral difference between:
CategoricalCrossEntropy().calculate(y_pred, integer_labels)
SparseCategoricalCrossEntropy().calculate(y_pred, integer_labels)
Because it subclasses CategoricalCrossEntropy, isinstance(loss, CategoricalCrossEntropy) is True - the fused Softmax+CrossEntropy backward pass (below) activates for SparseCategoricalCrossEntropy exactly as it does for CategoricalCrossEntropy.
Pairing: Softmax output activation, integer y_true.
MeanSquaredError
class vanillanets.losses.MeanSquaredError(Loss)
Forward - mean over the output dimension, per sample:
sample_losses = np.mean((y_true - y_pred) ** 2, axis=-1)
Backward
samples = len(dvalues)      # n_samples
outputs = len(dvalues[0])   # n_outputs
self.dinputs = -2 * (y_true - dvalues) / outputs
self.dinputs = self.dinputs / samples
Pairing: Linear output activation.
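A small worked example may help with the shapes. The sketch below applies the forward and backward formulas to a batch of two samples with two outputs each. The function names `mse_forward` and `mse_backward` are illustrative, not library API.

```python
import numpy as np


def mse_forward(y_pred, y_true):
    # Mean over the output dimension gives one loss per sample.
    return np.mean((y_true - y_pred) ** 2, axis=-1)


def mse_backward(dvalues, y_true):
    samples, outputs = dvalues.shape
    dinputs = -2 * (y_true - dvalues) / outputs
    return dinputs / samples


y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.5, 2.0], [2.0, 6.0]])

sample_losses = mse_forward(y_pred, y_true)  # shape (2,)
dinputs = mse_backward(y_pred, y_true)       # shape (2, 2)
```

The gradient is positive where the prediction overshoots the target and negative where it undershoots, scaled by 2 / (n_outputs * n_samples).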
Fused Softmax + CrossEntropy backward pass
class vanillanets.softmax_loss.Activation_Softmax_Loss_CategoricalCrossentropy
model.finalize() activates this shortcut when:
isinstance(self.layers[-1], Softmax) and isinstance(self.loss, CategoricalCrossEntropy)
Since SparseCategoricalCrossEntropy subclasses CategoricalCrossEntropy, this condition holds for both. When active:
- self.loss.backward() is never called.
- The gradient is computed directly: dL/dz = (ŷ - y) / n_samples
- Backprop runs over self.layers[:-1] - the trailing Softmax's own backward() is skipped, since its gradient is already folded into the result above.
This avoids the O(batch_size × n_classes²) Softmax Jacobian for both one-hot and integer label formats.
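The equivalence of the two paths can be checked directly. The sketch below is plain NumPy, independent of the library classes, with a local `softmax` helper. It pushes the standalone loss gradient through each sample's Softmax Jacobian and compares the result with the fused expression (ŷ - y) / n_samples.

```python
import numpy as np


def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 2])
n_samples, n_classes = logits.shape

y_hat = softmax(logits)
y = np.eye(n_classes)[labels]

# Standalone path: loss gradient w.r.t. probabilities, then one
# (n_classes x n_classes) Softmax Jacobian per sample.
dloss = (-y / y_hat) / n_samples
slow = np.empty_like(logits)
for i, (p, d) in enumerate(zip(y_hat, dloss)):
    jacobian = np.diagflat(p) - np.outer(p, p)
    slow[i] = jacobian @ d

# Fused path: a single elementwise expression.
fused = (y_hat - y) / n_samples
```

The per-sample Jacobian is what produces the n_classes² factor in the standalone path. The fused expression touches each logit once.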
Optimizer_SGD
class vanillanets.optimizers.Optimizer_SGD(learning_rate=1.0, decay=0.0, momentum=0.0)
pre_update_lr()
if self.decay:
    self.current_learning_rate = self.learning_rate / (1 + self.decay * self.iterations)
update_params(layer) - without momentum:
weight_updates = -lr * layer.dweights
bias_updates = -lr * layer.dbiases
With momentum (weight_momentums/bias_momentums lazily initialized to zeros on first call):
weight_updates = momentum * layer.weight_momentums - lr * layer.dweights
bias_updates = momentum * layer.bias_momentums - lr * layer.dbiases
post_update_params()
self.iterations += 1
Optimizer_Adam
class vanillanets.optimizers.Optimizer_Adam(learning_rate=0.001, decay=0.0, epsilon=1e-7, beta_1=0.9, beta_2=0.999)
weight_momentums, weight_cache, bias_momentums, bias_cache are lazily initialized to zeros on first call (checked via hasattr(layer, 'weight_cache')).
m_w = beta_1 * m_w + (1 - beta_1) * dweights
v_w = beta_2 * v_w + (1 - beta_2) * dweights**2
m_w_hat = m_w / (1 - beta_1 ** (iterations + 1))
v_w_hat = v_w / (1 - beta_2 ** (iterations + 1))
weights += -lr * m_w_hat / (sqrt(v_w_hat) + epsilon)
(Biases follow the same update using dbiases.) self.iterations is incremented in post_update_params() - the first update for every layer uses iterations=0.
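One consequence of the bias correction is worth seeing in numbers. The sketch below wraps the update equations in a standalone function (`adam_step` is an illustrative name, not library API) and applies a single step from zero-initialized moments with iterations=0. After correction, m_w_hat equals the gradient and v_w_hat equals its square, so the first step has magnitude close to lr for every weight regardless of gradient scale.

```python
import numpy as np


def adam_step(w, g, m, v, iterations,
              lr=0.001, epsilon=1e-7, beta_1=0.9, beta_2=0.999):
    # Exponential moving averages of the gradient and its square.
    m = beta_1 * m + (1 - beta_1) * g
    v = beta_2 * v + (1 - beta_2) * g ** 2
    # Bias correction; the exponent is iterations + 1, so it is 1 on the first call.
    m_hat = m / (1 - beta_1 ** (iterations + 1))
    v_hat = v / (1 - beta_2 ** (iterations + 1))
    w = w - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return w, m, v


w = np.array([0.5, -0.3, 2.0])
g = np.array([10.0, -0.01, 0.5])  # gradients on very different scales
m = np.zeros_like(w)
v = np.zeros_like(w)

w1, m, v = adam_step(w, g, m, v, iterations=0)
step = w1 - w
```

Each entry of `step` is approximately -lr * sign(g). Without the bias correction, the zero-initialized moments would shrink this first step.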
Which layers get updated
model.fit() calls optimizer.update_params(layer) only for layers where hasattr(layer, 'weights') is True. A custom stateful layer must expose .weights, .biases, .dweights, .dbiases under those exact names to receive updates - it is silently skipped otherwise, with no error.
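The silent skip is easy to reproduce. In the sketch below, every class and the `sgd_update` helper are hypothetical stand-ins, and the loop only imitates the hasattr(layer, 'weights') filter described above, using the plain SGD step without momentum. The layer that stores its parameters as `kernel` never receives an update.

```python
import numpy as np


class DenseLike:
    """Hypothetical trainable layer using the expected attribute names."""

    def __init__(self):
        self.weights = np.ones((2, 2))
        self.biases = np.zeros((1, 2))
        self.dweights = np.full((2, 2), 0.5)
        self.dbiases = np.full((1, 2), 0.5)


class MisnamedLayer:
    """Hypothetical custom layer that stores its parameters as `kernel`."""

    def __init__(self):
        self.kernel = np.ones((2, 2))
        self.dkernel = np.full((2, 2), 0.5)


class ActivationLike:
    """Hypothetical stateless layer with nothing to update."""


def sgd_update(layer, lr=0.1):
    # Plain SGD step, no momentum.
    layer.weights += -lr * layer.dweights
    layer.biases += -lr * layer.dbiases


layers = [DenseLike(), ActivationLike(), MisnamedLayer()]
updated = []
for layer in layers:
    # Same filter as described above: only layers exposing .weights are updated.
    if hasattr(layer, 'weights'):
        sgd_update(layer)
        updated.append(type(layer).__name__)
```

After the loop, `updated` contains only `DenseLike`, and `MisnamedLayer.kernel` is unchanged even though a gradient was available. A quick check for custom layers is to assert hasattr(layer, 'weights') before training.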
Choosing a combination
| Task | Output activation | Loss |
|---|---|---|
| Binary classification | Sigmoid | BinaryCrossEntropy |
| Multiclass (one-hot labels) | Softmax | CategoricalCrossEntropy |
| Multiclass (integer labels) | Softmax | SparseCategoricalCrossEntropy |
| Regression | Linear | MeanSquaredError |
Example
from vanillanets.losses import SparseCategoricalCrossEntropy
from vanillanets.optimizers import Optimizer_Adam
model.set(
    loss=SparseCategoricalCrossEntropy(),
    optimizer=Optimizer_Adam(learning_rate=0.001)
)
model.finalize() # fused Softmax+CrossEntropy backward pass active