Losses & Optimizers

Loss

class vanillanets.losses.Loss

Base class. Subclasses implement forward(y_pred, y_true), returning a per-sample array, and backward(dvalues, y_true), setting .dinputs.

def calculate(self, output, y):
    sample_losses = self.forward(output, y)
    return np.mean(sample_losses)

model.fit() and model.evaluate() call .calculate() for the reported scalar loss.
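
The subclass contract above can be illustrated with a minimal sketch. The Loss class below is a local stand-in that mirrors the calculate() shown above, so the snippet runs without vanillanets installed. MeanAbsoluteError is a hypothetical subclass invented for this example and is not part of the library.

```python
import numpy as np


class Loss:
    """Local stand-in mirroring the calculate() shown above."""

    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        return np.mean(sample_losses)


class MeanAbsoluteError(Loss):
    """Hypothetical subclass, used only to illustrate the contract."""

    def forward(self, y_pred, y_true):
        # One loss value per sample: mean over the output dimension.
        return np.mean(np.abs(y_true - y_pred), axis=-1)

    def backward(self, dvalues, y_true):
        samples = len(dvalues)
        outputs = len(dvalues[0])
        # d|y - p| / dp = -sign(y - p); normalize by outputs, then samples.
        self.dinputs = -np.sign(y_true - dvalues) / outputs / samples


y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.5, 2.0], [2.0, 6.0]])

loss = MeanAbsoluteError()
scalar_loss = loss.calculate(y_pred, y_true)  # mean of [0.25, 1.5] = 0.875
loss.backward(y_pred, y_true)                 # result stored in loss.dinputs
```

forward() returns one value per sample, calculate() reduces those to the reported scalar, and backward() stores its result on .dinputs instead of returning it.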

BinaryCrossEntropy

class vanillanets.losses.BinaryCrossEntropy(Loss)

Forward - y_pred is clipped at both ends, to [1e-7, 1 - 1e-7], so that neither log(p) nor log(1 - p) evaluates log(0):

p = np.clip(y_pred, 1e-7, 1 - 1e-7)
sample_losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

Backward - gradient normalized twice, by n_outputs then by n_samples (outputs = dvalues.shape[1], samples = dvalues.shape[0]):

p = np.clip(dvalues, 1e-7, 1 - 1e-7)
self.dinputs = -(y_true / p - (1 - y_true) / (1 - p)) / outputs
self.dinputs = self.dinputs / samples
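
One way to sanity-check the double normalization is to compare the backward formula with finite differences of the scalar loss that calculate() reports. The sketch below re-implements the two formulas as free functions. The names bce_forward and bce_backward are illustrative, not library API.

```python
import numpy as np

EPS = 1e-7


def bce_forward(y_pred, y_true):
    p = np.clip(y_pred, EPS, 1 - EPS)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def bce_backward(dvalues, y_true):
    samples, outputs = dvalues.shape
    p = np.clip(dvalues, EPS, 1 - EPS)
    dinputs = -(y_true / p - (1 - y_true) / (1 - p)) / outputs
    return dinputs / samples


y_true = np.array([[1.0], [0.0], [1.0]])
y_pred = np.array([[0.9], [0.2], [0.6]])

analytic = bce_backward(y_pred, y_true)

# Central finite differences of the scalar loss np.mean(sample_losses).
numeric = np.zeros_like(y_pred)
h = 1e-6
for i in range(y_pred.shape[0]):
    for j in range(y_pred.shape[1]):
        up, down = y_pred.copy(), y_pred.copy()
        up[i, j] += h
        down[i, j] -= h
        numeric[i, j] = (np.mean(bce_forward(up, y_true))
                         - np.mean(bce_forward(down, y_true))) / (2 * h)

max_abs_diff = np.max(np.abs(analytic - numeric))

# Clipping keeps the loss finite even for a saturated, wrong prediction.
saturated_loss = bce_forward(np.array([[0.0]]), np.array([[1.0]]))
```

If the two normalizations match the mean taken in calculate(), max_abs_diff should be very small.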

Pairing: Sigmoid output activation, y_true shape (n, 1).

CategoricalCrossEntropy

class vanillanets.losses.CategoricalCrossEntropy(Loss)

Accepts y_true as either one-hot (n, C) or integer labels (n,).

Forward - samples is the number of rows in y_pred:

p = np.clip(y_pred, 1e-7, 1 - 1e-7)
if y_true.ndim == 1:
    correct_confidences = p[range(samples), y_true]
else:
    correct_confidences = np.sum(p * y_true, axis=1)
sample_losses = -np.log(correct_confidences)
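
As a quick check that the two label formats are interchangeable, the forward logic above can be wrapped in a standalone function and called with both encodings of the same labels. The name cce_forward is illustrative only.

```python
import numpy as np


def cce_forward(y_pred, y_true):
    samples = len(y_pred)
    p = np.clip(y_pred, 1e-7, 1 - 1e-7)
    if y_true.ndim == 1:
        # Integer labels: pick the predicted probability of the true class.
        correct_confidences = p[range(samples), y_true]
    else:
        # One-hot labels: the mask zeroes out every other class.
        correct_confidences = np.sum(p * y_true, axis=1)
    return -np.log(correct_confidences)


y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.5, 0.4]])
integer_labels = np.array([0, 1])
one_hot_labels = np.eye(3)[integer_labels]

losses_int = cce_forward(y_pred, integer_labels)
losses_hot = cce_forward(y_pred, one_hot_labels)
# Both equal [-log(0.7), -log(0.5)].
```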

Backward - samples is the number of rows in dvalues, labels the number of classes (columns):

if y_true.ndim == 1:
    y_true = np.eye(labels)[y_true]
p = np.clip(dvalues, 1e-7, 1 - 1e-7)
self.dinputs = (-y_true / p) / samples

This is the standalone backward path: its result is then passed through Softmax.backward(), which applies the full Jacobian. It is used only when the fused shortcut below isn't active.

Pairing: Softmax output activation.

SparseCategoricalCrossEntropy

class vanillanets.losses.SparseCategoricalCrossEntropy(CategoricalCrossEntropy)
    pass

An alias for CategoricalCrossEntropy, for callers who want explicit naming when working with integer-encoded labels (y_true shape (n,)). forward and backward are inherited unchanged - CategoricalCrossEntropy already detects y_true.ndim == 1 and handles it directly. There is no behavioral difference between:

CategoricalCrossEntropy().calculate(y_pred, integer_labels)
SparseCategoricalCrossEntropy().calculate(y_pred, integer_labels)

Because it subclasses CategoricalCrossEntropy, isinstance(loss, CategoricalCrossEntropy) is True - the fused Softmax+CrossEntropy backward pass (below) activates for SparseCategoricalCrossEntropy exactly as it does for CategoricalCrossEntropy.
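
The isinstance behavior is ordinary Python inheritance and can be sketched with empty stand-in classes. The classes below only reuse the library's names, and would_fuse is a hypothetical helper that covers just the loss half of the activation check.

```python
class Loss:
    pass


class CategoricalCrossEntropy(Loss):
    pass


class SparseCategoricalCrossEntropy(CategoricalCrossEntropy):
    pass


class MeanSquaredError(Loss):
    pass


def would_fuse(loss):
    # Hypothetical helper: only the loss half of the condition.
    return isinstance(loss, CategoricalCrossEntropy)


results = {
    type(loss).__name__: would_fuse(loss)
    for loss in (CategoricalCrossEntropy(),
                 SparseCategoricalCrossEntropy(),
                 MeanSquaredError())
}
```

Note that an exact type comparison such as type(loss) is CategoricalCrossEntropy would exclude the subclass. The isinstance check is what lets the alias share the fused path.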

Pairing: Softmax output activation, integer y_true.

MeanSquaredError

class vanillanets.losses.MeanSquaredError(Loss)

Forward - mean over the output dimension, per sample:

sample_losses = np.mean((y_true - y_pred) ** 2, axis=-1)

Backward - normalized by outputs (dvalues.shape[1]), then by samples (dvalues.shape[0]):

self.dinputs = -2 * (y_true - dvalues) / outputs
self.dinputs = self.dinputs / samples
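
A small worked example, written as plain NumPy outside any class, shows the two formulas on a single sample with two outputs. The input values are arbitrary.

```python
import numpy as np

y_true = np.array([[1.0, 2.0]])
y_pred = np.array([[1.5, 1.0]])

# Forward: squared errors are [0.25, 1.0], so the per-sample loss is 0.625.
sample_losses = np.mean((y_true - y_pred) ** 2, axis=-1)

# Backward: -2 * (y_true - y_pred) = [1.0, -2.0], divided by outputs (2)
# and then by samples (1), giving [[0.5, -1.0]].
samples, outputs = y_pred.shape
dinputs = -2 * (y_true - y_pred) / outputs
dinputs = dinputs / samples
```

The gradient is positive where the prediction overshoots the target and negative where it undershoots, so a descent step moves each output toward y_true.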

Pairing: Linear output activation.

Fused Softmax + CrossEntropy backward pass

class vanillanets.softmax_loss.Activation_Softmax_Loss_CategoricalCrossentropy

model.finalize() activates this shortcut when:

isinstance(self.layers[-1], Softmax) and isinstance(self.loss, CategoricalCrossEntropy)

Since SparseCategoricalCrossEntropy subclasses CategoricalCrossEntropy, this condition holds for both. When active:

self.loss.backward() is never called.
The gradient is computed directly:

dL/dz = (ŷ - y) / n_samples

Backprop runs over self.layers[:-1] - the trailing Softmax's own backward() is skipped, since its gradient is already folded into the result above.

This avoids the O(batch_size × n_classes²) Softmax Jacobian for both one-hot and integer label formats.
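
The equivalence between the fused gradient and the two-step path can be checked numerically. The sketch below is independent of vanillanets: it builds the per-sample Softmax Jacobian explicitly and compares the chained result with (ŷ - y) / n_samples. Clipping is omitted on the assumption that the probabilities stay well inside the clip bounds.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 2])

# Softmax forward (shifted by the row max for numerical stability).
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
y_hat = exp / exp.sum(axis=1, keepdims=True)

samples, n_classes = y_hat.shape
y = np.eye(n_classes)[labels]

# Standalone path: loss backward, then one Softmax Jacobian per sample.
dloss = (-y / y_hat) / samples
jacobian_path = np.empty_like(y_hat)
for i, (s, d) in enumerate(zip(y_hat, dloss)):
    jacobian = np.diagflat(s) - np.outer(s, s)  # (n_classes, n_classes)
    jacobian_path[i] = jacobian @ d

# Fused path: no Jacobian is ever materialized.
fused = (y_hat - y) / samples
```

The loop builds one n_classes × n_classes matrix per sample, which is the cost the fused expression avoids.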

Optimizer_SGD

class vanillanets.optimizers.Optimizer_SGD(learning_rate=1.0, decay=0.0, momentum=0.0)

pre_update_lr()

if self.decay:
    self.current_learning_rate = self.learning_rate / (1 + self.decay * self.iterations)
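
To get a feel for the schedule, the decay formula above can be evaluated at a few iteration counts. The decay value of 0.1 is chosen only to make the numbers easy to read and is not a recommended setting.

```python
learning_rate = 1.0
decay = 0.1  # illustrative value only


def decayed_lr(iterations):
    # Same inverse-time formula as in pre_update_lr().
    return learning_rate / (1 + decay * iterations)


# Roughly 1.0 at iteration 0, 0.5 at iteration 10, 0.1 at iteration 90.
schedule = {i: decayed_lr(i) for i in (0, 10, 90)}
```

Because of the if self.decay: guard, the default decay=0.0 skips this computation.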

update_params(layer) - without momentum (lr below is self.current_learning_rate):

weight_updates = -lr * layer.dweights
bias_updates   = -lr * layer.dbiases

With momentum (weight_momentums/bias_momentums lazily initialized to zeros on first call):

weight_updates = momentum * layer.weight_momentums - lr * layer.dweights
bias_updates   = momentum * layer.bias_momentums   - lr * layer.dbiases
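
The momentum formulas above stop at computing the updates. The sketch below fills in the remaining steps under an explicit assumption: the computed update is stored as the next step's momentum and then added to the parameters. That is the usual arrangement for this style of optimizer, but it is not confirmed by this page. The sgd_momentum_step function and the SimpleNamespace layer are hypothetical stand-ins.

```python
import numpy as np
from types import SimpleNamespace


def sgd_momentum_step(layer, lr, momentum):
    # Lazy zero-initialization on the first call for this layer.
    if not hasattr(layer, "weight_momentums"):
        layer.weight_momentums = np.zeros_like(layer.weights)
        layer.bias_momentums = np.zeros_like(layer.biases)

    weight_updates = momentum * layer.weight_momentums - lr * layer.dweights
    bias_updates = momentum * layer.bias_momentums - lr * layer.dbiases

    # Assumption: the update becomes the stored momentum and is then
    # added to the parameters.
    layer.weight_momentums = weight_updates
    layer.bias_momentums = bias_updates
    layer.weights += weight_updates
    layer.biases += bias_updates


layer = SimpleNamespace(
    weights=np.array([[1.0]]), biases=np.array([[0.0]]),
    dweights=np.array([[1.0]]), dbiases=np.array([[1.0]]),
)

sgd_momentum_step(layer, lr=0.1, momentum=0.9)  # update: -0.1
sgd_momentum_step(layer, lr=0.1, momentum=0.9)  # update: 0.9 * -0.1 - 0.1 = -0.19
```

With a constant gradient the second step is larger than the first, which is the accelerating effect momentum is meant to provide.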

post_update_params()

self.iterations += 1

Optimizer_Adam

class vanillanets.optimizers.Optimizer_Adam(learning_rate=0.001, decay=0.0,
                                             epsilon=1e-7, beta_1=0.9, beta_2=0.999)

weight_momentums, weight_cache, bias_momentums, bias_cache are lazily initialized to zeros on first call (checked via hasattr(layer, 'weight_cache')).

m_w = beta_1 * m_w + (1 - beta_1) * dweights
v_w = beta_2 * v_w + (1 - beta_2) * dweights**2

m_w_hat = m_w / (1 - beta_1 ** (iterations + 1))
v_w_hat = v_w / (1 - beta_2 ** (iterations + 1))

weights += -lr * m_w_hat / (sqrt(v_w_hat) + epsilon)

(Biases follow the same update using dbiases.) self.iterations is incremented in post_update_params() - the first update for every layer uses iterations=0.
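
The effect of bias correction on the first step can be seen by running the equations above once with iterations=0 on plain arrays. The weight and gradient values are arbitrary. The hyperparameters are the defaults listed in the signature.

```python
import numpy as np

lr, epsilon, beta_1, beta_2 = 0.001, 1e-7, 0.9, 0.999
iterations = 0  # first update, before post_update_params() increments it

weights = np.array([0.5, -0.5, 2.0])
dweights = np.array([10.0, -0.01, 3.0])

# Lazily initialized state starts at zero.
m_w = np.zeros_like(weights)
v_w = np.zeros_like(weights)

m_w = beta_1 * m_w + (1 - beta_1) * dweights
v_w = beta_2 * v_w + (1 - beta_2) * dweights**2

# At iterations=0 the correction divides by (1 - beta), undoing the
# (1 - beta) factor applied just above.
m_w_hat = m_w / (1 - beta_1 ** (iterations + 1))
v_w_hat = v_w / (1 - beta_2 ** (iterations + 1))

step = -lr * m_w_hat / (np.sqrt(v_w_hat) + epsilon)
new_weights = weights + step
```

On the first step m_w_hat equals the raw gradient and v_w_hat its square, so every parameter moves by roughly lr against the sign of its gradient, whatever the gradient's magnitude.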

Which layers get updated

model.fit() calls optimizer.update_params(layer) only for layers where hasattr(layer, 'weights') is True. A custom stateful layer must expose .weights, .biases, .dweights, .dbiases under those exact names to receive updates - it is silently skipped otherwise, with no error.
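
The silent skip can be reproduced with the same hasattr filter applied to a few hypothetical layer classes. DenseLike, ReLULike and ScaleLayer are invented for this sketch. ScaleLayer stores its trainable parameter as gamma, which is the kind of naming that would be missed.

```python
import numpy as np


class DenseLike:
    """Hypothetical layer exposing the expected attribute names."""

    def __init__(self):
        self.weights = np.ones((2, 2))
        self.biases = np.zeros((1, 2))
        self.dweights = np.ones((2, 2))
        self.dbiases = np.ones((1, 2))


class ReLULike:
    """Hypothetical stateless layer: nothing to update."""


class ScaleLayer:
    """Hypothetical stateful layer with non-standard attribute names."""

    def __init__(self):
        self.gamma = np.ones((1, 2))
        self.dgamma = np.ones((1, 2))


layers = [DenseLike(), ReLULike(), ScaleLayer()]

# Same test the page describes for model.fit().
updated = [type(l).__name__ for l in layers if hasattr(l, "weights")]
skipped = [type(l).__name__ for l in layers if not hasattr(l, "weights")]
```

ScaleLayer ends up in skipped alongside the stateless ReLULike, so its gamma would never change during training and no error would point to the cause.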

Choosing a combination

Task	Output activation	Loss
Binary classification	Sigmoid	BinaryCrossEntropy
Multiclass (one-hot labels)	Softmax	CategoricalCrossEntropy
Multiclass (integer labels)	Softmax	SparseCategoricalCrossEntropy
Regression	Linear	MeanSquaredError

Example

from vanillanets.losses import SparseCategoricalCrossEntropy
from vanillanets.optimizers import Optimizer_Adam

model.set(
    loss=SparseCategoricalCrossEntropy(),
    optimizer=Optimizer_Adam(learning_rate=0.001)
)
model.finalize()  # fused Softmax+CrossEntropy backward pass active
