
Activation Functions

Explore the activation functions available in VanillaNets and learn when to use each one in classification and regression models.

Activation functions introduce non-linearity into a neural network, allowing it to learn patterns and relationships that cannot be represented through linear transformations alone.

All activation classes in VanillaNets expose a consistent interface:

activation.forward(inputs)
activation.backward(dvalues)
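As a rough illustration of this calling pattern, the sketch below uses a hypothetical stand-in class (`IdentitySketch`, which is not part of VanillaNets). It assumes that results are stored on `self.output` and `self.dinputs`, as in the snippets on this page, and that inputs and gradients are 2-D NumPy arrays of shape (batch_size, n_features).

```python
import numpy as np


class IdentitySketch:
    """Hypothetical stand-in that follows the forward/backward pattern."""

    def forward(self, inputs):
        # Cache the inputs and store the result on the instance.
        self.inputs = inputs
        self.output = inputs

    def backward(self, dvalues):
        # dvalues is the gradient of the loss with respect to self.output.
        self.dinputs = dvalues.copy()


activation = IdentitySketch()
batch = np.array([[1.0, -2.0], [0.5, 3.0]])

activation.forward(batch)                  # result is read from .output
activation.backward(np.ones_like(batch))   # gradient is read from .dinputs

print(activation.output.shape, activation.dinputs.shape)
```

The gradient array has the same shape as the input batch, which is what lets activations be chained with other layers during backpropagation.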

Available Activations

| Activation | Output Range | Common Use Case |
|---|---|---|
| ReLU | [0, ∞) | Default hidden-layer activation |
| LeakyReLU | (-∞, ∞) | Alternative to ReLU that keeps a small gradient for negative inputs |
| Sigmoid | (0, 1) | Binary classification outputs |
| Tanh | (-1, 1) | Zero-centered activations |
| Softmax | (0, 1), each row sums to 1 | Multiclass classification outputs |
| Linear | (-∞, ∞) | Regression outputs |

Linear

class vanillanets.activations.Linear

Forward

self.output = inputs

Backward

self.dinputs = dvalues.copy()

ReLU

class vanillanets.activations.ReLU

Forward

self.output = np.maximum(0, inputs)

Backward

self.dinputs = dvalues.copy()
self.dinputs[self.inputs <= 0] = 0

The mask uses <= 0 rather than < 0, so inputs exactly equal to 0 receive zero gradient.
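The boundary case can be checked with a standalone NumPy sketch that reproduces the two expressions above on plain arrays instead of the library class. The variable names are illustrative only.

```python
import numpy as np

inputs = np.array([[-1.5, 0.0, 2.0]])
dvalues = np.ones_like(inputs)

# Forward: negative values are clipped to zero.
output = np.maximum(0, inputs)

# Backward: pass the upstream gradient only where the input was positive.
dinputs = dvalues.copy()
dinputs[inputs <= 0] = 0

print(output)   # [[0. 0. 2.]]
print(dinputs)  # [[0. 0. 1.]]
```

The middle element shows the convention: an input of exactly 0 gets a gradient of 0.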

LeakyReLU

class vanillanets.activations.LeakyReLU

Forward

self.output = np.where(inputs > 0, inputs, 0.1 * inputs)

Backward

self.dinputs = dvalues.copy()
self.dinputs[self.inputs <= 0] *= 0.1

The negative slope of 0.1 is hardcoded; it is not a constructor argument.
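A small standalone sketch of the same expressions on plain NumPy arrays shows how the fixed slope affects both passes. `NEGATIVE_SLOPE` is a local name used only for this illustration.

```python
import numpy as np

NEGATIVE_SLOPE = 0.1  # the fixed slope described on this page

inputs = np.array([[-2.0, 0.0, 3.0]])
dvalues = np.ones_like(inputs)

# Forward: positive inputs pass through, the rest are scaled by the slope.
output = np.where(inputs > 0, inputs, NEGATIVE_SLOPE * inputs)

# Backward: scale the upstream gradient wherever the input was not positive.
dinputs = dvalues.copy()
dinputs[inputs <= 0] *= NEGATIVE_SLOPE

print(output)
print(dinputs)
```

Unlike ReLU, an input of exactly 0 here receives the slope (0.1) as its gradient rather than 0, because the same <= 0 mask scales the gradient instead of zeroing it.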

Sigmoid

class vanillanets.activations.Sigmoid

Forward

self.output = 1 / (1 + np.exp(-inputs))

Backward

self.dinputs = dvalues * self.output * (1 - self.output)

Pairs with BinaryCrossEntropy as the output activation for binary classification.
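The backward expression relies on the identity σ'(x) = σ(x)(1 - σ(x)). As a sanity check, the sketch below compares that analytic gradient with a central finite difference. It is a standalone NumPy example; the helper `sigmoid` and the step size `eps` are choices made for this illustration, not library names.

```python
import numpy as np


def sigmoid(x):
    # Same expression as the forward pass above.
    return 1 / (1 + np.exp(-x))


inputs = np.array([[-2.0, 0.0, 1.5]])
dvalues = np.ones_like(inputs)

# Analytic gradient, as in the backward pass above.
output = sigmoid(inputs)
analytic = dvalues * output * (1 - output)

# Central finite difference as an independent estimate of the derivative.
eps = 1e-6
numeric = (sigmoid(inputs + eps) - sigmoid(inputs - eps)) / (2 * eps)

max_error = np.max(np.abs(analytic - numeric))
print(max_error)
```

The gradient peaks at 0.25 for an input of 0 and shrinks toward 0 as |x| grows.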

Tanh

class vanillanets.activations.Tanh

Forward

self.output = np.tanh(inputs)

Backward

self.dinputs = dvalues * (1 - self.output ** 2)

Implemented via np.tanh rather than the algebraic form (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ). For large |x| one of the exponentials overflows to inf, so the numerator and denominator are both infinite and the quotient evaluates to NaN.
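The difference can be seen in a short standalone NumPy sketch. The input 800.0 is an arbitrary value chosen because np.exp(800.0) exceeds the float64 range.

```python
import numpy as np

x = np.array([1.0, 800.0])

# Algebraic form: np.exp(800.0) overflows float64 to inf, so the second
# element becomes inf / inf = nan. errstate silences the expected warnings.
with np.errstate(over="ignore", invalid="ignore"):
    naive = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

# np.tanh saturates cleanly at 1.0 for the same input.
stable = np.tanh(x)

print(naive)
print(stable)
```

For moderate inputs the two forms agree; only the saturated region differs.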

Softmax

class vanillanets.activations.Softmax

Forward, with row-wise max subtraction for numerical stability:

exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
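To see why the shift matters, the sketch below compares an unshifted softmax with the shifted form on plain NumPy arrays. The input values are arbitrary and chosen so that the first row overflows without the shift.

```python
import numpy as np

inputs = np.array([[1000.0, 1001.0, 1002.0],
                   [1.0, 2.0, 3.0]])

# Naive form: np.exp(1000.0) is inf in float64, so the first row is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(inputs) / np.sum(np.exp(inputs), axis=1, keepdims=True)

# Shifted form: the largest exponent in each row is exp(0) = 1.
exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
stable = exp_values / np.sum(exp_values, axis=1, keepdims=True)

print(naive)
print(stable)
```

Subtracting a per-row constant does not change the result mathematically, which is why both rows of `stable` are identical here: they differ only by a constant offset of 999.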

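Backward, using the full per-sample Jacobian, computed in a loop over the batch:

# Preallocate the gradient array; each row is filled in by the loop.
self.dinputs = np.empty_like(dvalues)
for i, (out, dv) in enumerate(zip(self.output, dvalues)):
    out = out.reshape(-1, 1)
    # Softmax Jacobian for one sample: diag(out) - out · outᵀ
    jacobian = np.diagflat(out) - out @ out.T
    self.dinputs[i] = jacobian @ dv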

The loop costs O(batch_size × n_classes²) time. It is bypassed entirely when Softmax is the final layer and the loss is CategoricalCrossEntropy or SparseCategoricalCrossEntropy; see Losses & Optimizers for the fused backward pass.
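The reason a fused pass is possible is a standard identity: chaining the cross-entropy gradient through the softmax Jacobian collapses to `probs - y_true`. The sketch below checks this numerically on plain NumPy arrays. It is not the library's fused implementation (that is documented in Losses & Optimizers). It assumes one-hot targets and omits any batch averaging, which would scale both results by the same factor.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
y_true = np.eye(3)[[0, 2, 1, 2]]  # one-hot targets

exp_values = np.exp(logits - np.max(logits, axis=1, keepdims=True))
probs = exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Gradient of the per-sample cross-entropy -sum(y_true * log(probs))
# with respect to the softmax output (no batch averaging in this sketch).
dvalues = -y_true / probs

# Unfused route: one Jacobian-vector product per sample.
dinputs_loop = np.empty_like(dvalues)
for i, (out, dv) in enumerate(zip(probs, dvalues)):
    out = out.reshape(-1, 1)
    jacobian = np.diagflat(out) - out @ out.T
    dinputs_loop[i] = jacobian @ dv

# Fused route: the same gradient without building any Jacobian.
dinputs_fused = probs - y_true

print(np.allclose(dinputs_loop, dinputs_fused))
```

The fused expression needs only O(batch_size × n_classes) work and avoids dividing by small probabilities.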

Choosing an activation

The task mostly determines the output-layer activation and its paired loss; the hidden-layer recommendation is the same for every task.

| Task | Hidden layers | Output layer | Paired loss |
|---|---|---|---|
| Binary classification | ReLU / LeakyReLU | Sigmoid | BinaryCrossEntropy |
| Multiclass classification | ReLU / LeakyReLU | Softmax | CategoricalCrossEntropy / SparseCategoricalCrossEntropy |
| Regression | ReLU / LeakyReLU | Linear | MeanSquaredError |
