Activation Functions

Explore the activation functions available in VanillaNets and learn when to use each one in classification and regression models.

Activation functions introduce non-linearity into a neural network, allowing it to learn patterns and relationships that cannot be represented through linear transformations alone.

All activation classes in VanillaNets expose a consistent interface:

activation.forward(inputs)
activation.backward(dvalues)
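A minimal sketch of how this interface is typically driven, assuming (as the per-class snippets on this page suggest) that `forward` stores its result on `self.output` and `backward` stores its result on `self.dinputs`. `DemoTanh` is a hypothetical stand-in defined here so the example runs without the library; it is not the VanillaNets class itself.

```python
import numpy as np

class DemoTanh:
    """Hypothetical stand-in mirroring the documented forward/backward interface."""

    def forward(self, inputs):
        # The result is stored on the instance rather than returned.
        self.output = np.tanh(inputs)

    def backward(self, dvalues):
        # Chain rule: upstream gradient times the local derivative.
        self.dinputs = dvalues * (1 - self.output ** 2)

activation = DemoTanh()
batch = np.array([[-1.0, 0.0, 2.0],
                  [3.0, -0.5, 1.0]])   # shape (batch_size, n_features)

activation.forward(batch)              # read the result from activation.output
upstream = np.ones_like(batch)         # gradient arriving from the next layer
activation.backward(upstream)          # read the result from activation.dinputs
```

Both `activation.output` and `activation.dinputs` keep the shape of the input batch.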

Available Activations

Activation	Output Range	Common Use Case
`ReLU`	`[0, ∞)`	Default hidden-layer activation
`LeakyReLU`	`(-∞, ∞)`	Alternative to ReLU with improved gradient flow
`Sigmoid`	`[0, 1]`	Binary classification outputs
`Tanh`	`[-1, 1]`	Zero-centered activations
`Softmax`	`[0, 1]`	Multiclass classification outputs
`Linear`	`(-∞, ∞)`	Regression outputs

Linear

class vanillanets.activations.Linear

Forward

self.output = inputs

Backward

self.dinputs = dvalues.copy()

ReLU

class vanillanets.activations.ReLU

Forward

self.inputs = inputs  # cached for the mask in the backward pass
self.output = np.maximum(0, inputs)

Backward

self.dinputs = dvalues.copy()
self.dinputs[self.inputs <= 0] = 0

The mask uses <= 0 rather than < 0, so inputs exactly equal to 0 receive zero gradient.
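To see the effect of the mask, this small standalone sketch applies the same two backward lines to plain NumPy arrays (not the library class), using an input that contains an exact zero:

```python
import numpy as np

inputs = np.array([[-2.0, 0.0, 3.0]])
dvalues = np.array([[5.0, 5.0, 5.0]])   # arbitrary upstream gradient

dinputs = dvalues.copy()
dinputs[inputs <= 0] = 0                # same mask as the Backward snippet above

print(dinputs)                          # [[0. 0. 5.]]
```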

LeakyReLU

class vanillanets.activations.LeakyReLU

Forward

self.inputs = inputs  # cached for the mask in the backward pass
self.output = np.where(inputs > 0, inputs, 0.1 * inputs)

Backward

self.dinputs = dvalues.copy()
self.dinputs[self.inputs <= 0] *= 0.1

The negative slope of 0.1 is hardcoded; it is not a constructor argument.
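As an illustration of the fixed slope, the following standalone sketch reproduces the forward and backward expressions on plain NumPy arrays (the variable names are chosen for this example only):

```python
import numpy as np

inputs = np.array([[-4.0, 0.0, 2.0]])
dvalues = np.ones_like(inputs)          # upstream gradient of 1 for every element

# Forward: positive inputs pass through, the rest are scaled by 0.1.
output = np.where(inputs > 0, inputs, 0.1 * inputs)

# Backward: non-positive inputs keep one tenth of the upstream gradient.
dinputs = dvalues.copy()
dinputs[inputs <= 0] *= 0.1
```

Here `output` is approximately `[[-0.4, 0.0, 2.0]]` and `dinputs` is approximately `[[0.1, 0.1, 1.0]]`. Unlike ReLU, the gradient for non-positive inputs is reduced rather than zeroed.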

Sigmoid

class vanillanets.activations.Sigmoid

Forward

self.output = 1 / (1 + np.exp(-inputs))

Backward

self.dinputs = dvalues * self.output * (1 - self.output)

Pairs with BinaryCrossEntropy as the output activation for binary classification.
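The backward expression relies on the identity σ'(x) = σ(x)(1 - σ(x)). The sketch below checks that identity against a central finite difference; the helper `sigmoid` is defined here for the example and is not a library function.

```python
import numpy as np

def sigmoid(x):
    # Same expression as the Forward snippet above.
    return 1 / (1 + np.exp(-x))

x = np.array([[-2.0, 0.0, 1.5]])
out = sigmoid(x)
analytic = out * (1 - out)              # local derivative used in Backward

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central difference

print(np.allclose(analytic, numeric, atol=1e-8))   # True
```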

Tanh

class vanillanets.activations.Tanh

Forward

self.output = np.tanh(inputs)

Backward

self.dinputs = dvalues * (1 - self.output ** 2)

Implemented via np.tanh rather than the algebraic form (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ). The algebraic form evaluates to NaN for large |x|, because the numerator and denominator both overflow to infinite values and inf / inf is undefined.
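The difference is easy to reproduce. This standalone sketch evaluates both forms at a moderate and a large input; the `np.errstate` context only silences the overflow warnings that the algebraic form is expected to raise.

```python
import numpy as np

x = np.array([1.0, 1000.0])

with np.errstate(over="ignore", invalid="ignore"):
    # np.exp(1000.0) overflows to inf, so the quotient becomes inf / inf.
    algebraic = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

stable = np.tanh(x)

print(np.isnan(algebraic[1]))   # True
print(stable[1])                # 1.0
```

For the moderate input, the two forms agree to floating-point precision.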

Softmax

class vanillanets.activations.Softmax

Forward, with row-wise max subtraction for numerical stability:

exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
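Subtracting the row maximum does not change the result, because softmax is invariant to adding a constant to every logit in a row, but it keeps every exponent at or below zero. The sketch below illustrates both points with a hypothetical helper `softmax` that wraps the two lines above:

```python
import numpy as np

def softmax(inputs):
    # Shift each row so its largest logit is 0 before exponentiating.
    shifted = inputs - np.max(inputs, axis=1, keepdims=True)
    exp_values = np.exp(shifted)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = small + 1000.0                  # same logits shifted by a constant

with np.errstate(over="ignore", invalid="ignore"):
    # Without the shift, every exponential overflows to inf.
    naive = np.exp(large) / np.sum(np.exp(large), axis=1, keepdims=True)

print(np.isnan(naive).all())                        # True
print(np.allclose(softmax(small), softmax(large)))  # True
```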

Backward, using the full per-sample Jacobian computed in a loop over the batch:

self.dinputs = np.empty_like(dvalues)  # allocate before filling row by row
for i, (out, dv) in enumerate(zip(self.output, dvalues)):
    out = out.reshape(-1, 1)
    jacobian = np.diagflat(out) - out @ out.T
    self.dinputs[i] = jacobian @ dv

This costs O(batch_size × n_classes²) time. It is bypassed entirely when Softmax is the final layer and the loss is CategoricalCrossEntropy or SparseCategoricalCrossEntropy; see Losses & Optimizers for the fused backward pass.
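The library's fused implementation is not shown on this page. The sketch below only illustrates the standard identity that usually motivates such a shortcut: with one-hot targets, pushing the cross-entropy gradient through the softmax Jacobian reduces to `y_pred - y_true`. Batch averaging is left out here, and all variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
y_true = np.eye(3)[[0, 2, 1, 2]]        # one-hot targets, one row per sample

exp_values = np.exp(logits - np.max(logits, axis=1, keepdims=True))
y_pred = exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Per-sample cross-entropy gradient with respect to the softmax output.
dvalues = -y_true / y_pred

# Jacobian loop, as in the Backward snippet above.
dinputs = np.empty_like(dvalues)
for i, (out, dv) in enumerate(zip(y_pred, dvalues)):
    out = out.reshape(-1, 1)
    jacobian = np.diagflat(out) - out @ out.T
    dinputs[i] = jacobian @ dv

print(np.allclose(dinputs, y_pred - y_true))   # True
```

The closed form needs only an elementwise subtraction, which avoids building an n_classes × n_classes matrix for every sample.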

Choosing an activation

The task determines the output activation and its paired loss. Hidden layers typically use ReLU or LeakyReLU in every case.

Task	Hidden layers	Output layer	Paired loss
Binary classification	ReLU / LeakyReLU	Sigmoid	`BinaryCrossEntropy`
Multiclass classification	ReLU / LeakyReLU	Softmax	`CategoricalCrossEntropy` / `SparseCategoricalCrossEntropy`
Regression	ReLU / LeakyReLU	Linear	`MeanSquaredError`
