Activation Functions
Explore the activation functions available in VanillaNets and learn when to use each one in classification and regression models.
Activation functions introduce non-linearity into a neural network, allowing it to learn patterns and relationships that cannot be represented through linear transformations alone.
All activation classes in VanillaNets expose a consistent interface:
activation.forward(inputs)
activation.backward(dvalues)
Available Activations
| Activation | Output Range | Common Use Case |
|---|---|---|
| ReLU | [0, ∞) | Default hidden-layer activation |
| LeakyReLU | (-∞, ∞) | Alternative to ReLU with improved gradient flow |
| Sigmoid | [0, 1] | Binary classification outputs |
| Tanh | [-1, 1] | Zero-centered activations |
| Softmax | [0, 1] | Multiclass classification outputs |
| Linear | (-∞, ∞) | Regression outputs |
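The output ranges in the table can be spot-checked with plain NumPy. The sketch below applies the forward formulas documented on this page to a small spread of inputs. It does not import VanillaNets, and the sample array x is made up for illustration.

```python
import numpy as np

# Made-up sample inputs: one row spanning negative, zero, and positive values.
x = np.array([[-5.0, -1.0, 0.0, 1.0, 5.0]])

relu = np.maximum(0, x)
leaky = np.where(x > 0, x, 0.1 * x)
sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)

# Softmax normalizes each row, so every row sums to 1.
exp_values = np.exp(x - np.max(x, axis=1, keepdims=True))
softmax = exp_values / np.sum(exp_values, axis=1, keepdims=True)

results = {
    "ReLU": relu,
    "LeakyReLU": leaky,
    "Sigmoid": sigmoid,
    "Tanh": tanh,
    "Softmax": softmax,
    "Linear": x,
}
for name, values in results.items():
    print(f"{name:10s} min={values.min():+.4f} max={values.max():+.4f}")
```

On this sample, only LeakyReLU, Tanh, and Linear produce negative values, which is why the table lists Tanh as the zero-centered option.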
Linear
class vanillanets.activations.Linear
Forward
self.output = inputs
Backward
self.dinputs = dvalues.copy()
ReLU
class vanillanets.activations.ReLU
Forward
self.output = np.maximum(0, inputs)
Backward
self.dinputs = dvalues.copy()
self.dinputs[self.inputs <= 0] = 0
The mask is <= 0, not < 0: inputs exactly equal to 0 receive zero gradient.
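The effect of the <= 0 mask can be seen in a standalone sketch. The class below is a stand-in written for this example, not the VanillaNets source: the name ReLUSketch is hypothetical, and caching self.inputs in forward is an assumption implied by the backward mask rather than shown in the listing.

```python
import numpy as np

class ReLUSketch:
    """Hypothetical stand-in that follows the forward/backward interface."""

    def forward(self, inputs):
        self.inputs = inputs  # assumed: cached so backward can build the mask
        self.output = np.maximum(0, inputs)

    def backward(self, dvalues):
        self.dinputs = dvalues.copy()
        self.dinputs[self.inputs <= 0] = 0  # zero gradient at and below 0

relu = ReLUSketch()
relu.forward(np.array([[-2.0, 0.0, 3.0]]))
relu.backward(np.ones((1, 3)))

print(relu.output)   # [[0. 0. 3.]]
print(relu.dinputs)  # [[0. 0. 1.]]
```

The middle input is exactly 0, and its gradient is zeroed along with the negative input.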
LeakyReLU
class vanillanets.activations.LeakyReLU
Forward
self.output = np.where(inputs > 0, inputs, 0.1 * inputs)
Backward
self.dinputs = dvalues.copy()
self.dinputs[self.inputs <= 0] *= 0.1
The 0.1 negative slope is hardcoded; it is not a constructor argument.
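To illustrate the gradient flow, the sketch below re-implements the two formulas above as plain functions. The function names and the NEGATIVE_SLOPE constant are hypothetical and exist only in this example; VanillaNets itself exposes the class documented above.

```python
import numpy as np

NEGATIVE_SLOPE = 0.1  # mirrors the hardcoded value documented above

def leaky_relu_forward(inputs):
    # Positive inputs pass through; the rest are scaled by the slope.
    return np.where(inputs > 0, inputs, NEGATIVE_SLOPE * inputs)

def leaky_relu_backward(inputs, dvalues):
    # Gradients for inputs at or below 0 are scaled rather than zeroed.
    dinputs = dvalues.copy()
    dinputs[inputs <= 0] *= NEGATIVE_SLOPE
    return dinputs

x = np.array([[-4.0, 0.0, 2.0]])
out = leaky_relu_forward(x)
grad = leaky_relu_backward(x, np.ones_like(x))

print(out)
print(grad)
```

Unlike ReLU, the negative and zero inputs here keep a tenth of the upstream gradient, so a unit stuck in the negative region still receives updates.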
Sigmoid
class vanillanets.activations.Sigmoid
Forward
self.output = 1 / (1 + np.exp(-inputs))
Backward
self.dinputs = dvalues * self.output * (1 - self.output)
Pairs with BinaryCrossEntropy as the output activation for binary classification.
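One way to convince yourself that the backward formula is right is to compare it with a finite-difference estimate. The sketch below does this in plain NumPy; the helper name sigmoid, the sample inputs, and the step size eps are choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([[-2.0, 0.0, 1.5]])
dvalues = np.ones_like(x)  # upstream gradient of 1 isolates the local derivative

output = sigmoid(x)
analytic = dvalues * output * (1 - output)  # backward formula from this page

eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference

print(analytic)
print(np.max(np.abs(analytic - numeric)))  # largest disagreement between the two
```

The derivative peaks at 0.25 for an input of 0 and shrinks toward 0 as the input grows in magnitude, which is the saturation behavior usually cited as a reason to avoid Sigmoid in hidden layers.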
Tanh
class vanillanets.activations.Tanh
Forward
self.output = np.tanh(inputs)
Backward
self.dinputs = dvalues * (1 - self.output ** 2)
Implemented via np.tanh, not the algebraic (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ) form. The algebraic form overflows to NaN for large |x|, since both numerator and denominator reach inf.
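The overflow is easy to reproduce. The sketch below compares a direct transcription of the algebraic form (the helper name tanh_algebraic is made up for this example) with np.tanh on a moderate and a very large input.

```python
import numpy as np

def tanh_algebraic(x):
    # Textbook form: np.exp overflows to inf for large |x| in float64.
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([1.0, 1000.0])

# Silence the overflow and invalid-value warnings that the naive form triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive = tanh_algebraic(x)

stable = np.tanh(x)

print(naive)   # second entry is nan (inf / inf)
print(stable)  # second entry saturates at 1.0
```

Both versions agree on the moderate input; only the large one separates them.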
Softmax
class vanillanets.activations.Softmax
Forward, with row-wise max subtraction for numerical stability:
exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)
Backward, using the full per-sample Jacobian, computed in a loop over the batch:
for i, (out, dv) in enumerate(zip(self.output, dvalues)):
    out = out.reshape(-1, 1)
    jacobian = np.diagflat(out) - out @ out.T
    self.dinputs[i] = jacobian @ dv
This is O(batch_size × n_classes²). It is bypassed entirely when Softmax is the final layer and the loss is CategoricalCrossEntropy or SparseCategoricalCrossEntropy; see Losses & Optimizers for the fused backward pass.
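The sketch below reproduces the Jacobian loop in plain NumPy and checks it against two standard identities: a closed form that avoids building the Jacobian, and the (softmax - targets) / N gradient that results when the upstream gradient comes from mean-reduced cross-entropy with one-hot targets. The function names, the random batch, and the targets are made up for this example. Whether the VanillaNets fused pass normalizes by the batch size at the same point is an assumption; the Losses & Optimizers page is the authority on that.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))   # made-up batch: 4 samples, 3 classes
dvalues = rng.normal(size=(4, 3))  # made-up upstream gradient

# Forward pass, as documented above.
exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

def backward_loop(output, dvalues):
    # One (n_classes x n_classes) Jacobian per sample, as documented above.
    dinputs = np.empty_like(dvalues)
    for i, (out, dv) in enumerate(zip(output, dvalues)):
        out = out.reshape(-1, 1)
        jacobian = np.diagflat(out) - out @ out.T
        dinputs[i] = jacobian @ dv
    return dinputs

def backward_closed_form(output, dvalues):
    # (diag(p) - p p^T) @ dv simplifies to p * (dv - p . dv).
    return output * (dvalues - np.sum(dvalues * output, axis=1, keepdims=True))

loop_grad = backward_loop(output, dvalues)
closed_grad = backward_closed_form(output, dvalues)

# Cross-entropy case: one-hot targets, loss averaged over the batch (assumed).
targets = np.eye(3)[[0, 2, 1, 2]]
n = len(targets)
ce_dvalues = -targets / output / n          # dL/d(softmax output)
via_jacobian = backward_loop(output, ce_dvalues)
fused = (output - targets) / n              # standard fused gradient

print(np.allclose(loop_grad, closed_grad))  # True
print(np.allclose(via_jacobian, fused))     # True
```

The fused form needs no Jacobian at all, which is why skipping the loop for a final Softmax layer is worthwhile.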
Choosing an activation
The choice of activation function is largely determined by the task being solved.
| Task | Hidden layers | Output layer | Paired loss |
|---|---|---|---|
| Binary classification | ReLU / LeakyReLU | Sigmoid | BinaryCrossEntropy |
| Multiclass classification | ReLU / LeakyReLU | Softmax | CategoricalCrossEntropy / SparseCategoricalCrossEntropy |
| Regression | ReLU / LeakyReLU | Linear | MeanSquaredError |
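If you want the table available in code, for example to validate a model configuration, it can be expressed as a small lookup. The dictionary, the task keys, and the recommend helper below are hypothetical and not part of VanillaNets; the values are plain strings copied from the table.

```python
# Hypothetical lookup mirroring the table above (strings only, no imports).
RECOMMENDED = {
    "binary_classification": {
        "hidden": ("ReLU", "LeakyReLU"),
        "output": "Sigmoid",
        "loss": ("BinaryCrossEntropy",),
    },
    "multiclass_classification": {
        "hidden": ("ReLU", "LeakyReLU"),
        "output": "Softmax",
        "loss": ("CategoricalCrossEntropy", "SparseCategoricalCrossEntropy"),
    },
    "regression": {
        "hidden": ("ReLU", "LeakyReLU"),
        "output": "Linear",
        "loss": ("MeanSquaredError",),
    },
}

def recommend(task):
    """Return the suggested activations and losses for a task key."""
    try:
        return RECOMMENDED[task]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None

choice = recommend("multiclass_classification")
print(choice["output"], "+", " / ".join(choice["loss"]))
```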