{ "cells": [ { "cell_type": "markdown", "id": "813f2756", "metadata": {}, "source": [ "# Multi-Layer Perceptron & Backpropagation\n", "\n", "This notebook demonstrates:\n", "- MLP architecture and forward propagation\n", "- Activation functions and their derivatives\n", "- Backpropagation algorithm step-by-step\n", "- Training a simple MLP on real data" ] }, { "cell_type": "code", "execution_count": null, "id": "ce2ab819", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_moons, make_circles\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "plt.style.use('seaborn-v0_8-darkgrid')\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "id": "9d6c3d7e", "metadata": {}, "source": [ "## 1. Activation Functions and Their Derivatives\n", "\n", "Understanding activation functions is crucial for neural networks:\n", "- **Sigmoid**: $\\sigma(z) = \\frac{1}{1 + e^{-z}}$\n", "- **Tanh**: $\\tanh(z) = \\frac{e^z - e^{-z}}{e^z + e^{-z}}$\n", "- **ReLU**: $\\text{ReLU}(z) = \\max(0, z)$" ] }, { "cell_type": "code", "execution_count": null, "id": "0cf33da5", "metadata": {}, "outputs": [], "source": [ "# Activation functions and derivatives\n", "def sigmoid(z):\n", " return 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n", "\n", "def sigmoid_derivative(z):\n", " s = sigmoid(z)\n", " return s * (1 - s)\n", "\n", "def tanh(z):\n", " return np.tanh(z)\n", "\n", "def tanh_derivative(z):\n", " return 1 - np.tanh(z)**2\n", "\n", "def relu(z):\n", " return np.maximum(0, z)\n", "\n", "def relu_derivative(z):\n", " return (z > 0).astype(float)\n", "\n", "# Visualize activation functions\n", "z = np.linspace(-5, 5, 100)\n", "\n", "fig, axes = plt.subplots(2, 3, figsize=(15, 8))\n", "\n", "# Sigmoid\n", "axes[0, 0].plot(z, sigmoid(z), 'b-', linewidth=2, label='sigmoid(z)')\n", "axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 0].axhline(y=1, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 0].set_title('Sigmoid Function', fontsize=12, fontweight='bold')\n", "axes[0, 0].set_xlabel('z')\n", "axes[0, 0].set_ylabel('σ(z)')\n", "axes[0, 0].grid(True, alpha=0.3)\n", "axes[0, 0].legend()\n", "\n", "axes[1, 0].plot(z, sigmoid_derivative(z), 'r-', linewidth=2, label=\"σ'(z)\")\n", "axes[1, 0].set_title('Sigmoid Derivative', fontsize=12, fontweight='bold')\n", "axes[1, 0].set_xlabel('z')\n", "axes[1, 0].set_ylabel(\"σ'(z)\")\n", "axes[1, 0].grid(True, alpha=0.3)\n", "axes[1, 0].legend()\n", "\n", "# Tanh\n", "axes[0, 1].plot(z, tanh(z), 'g-', linewidth=2, label='tanh(z)')\n", "axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 1].axhline(y=1, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 1].axhline(y=-1, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 1].set_title('Tanh Function', fontsize=12, fontweight='bold')\n", "axes[0, 1].set_xlabel('z')\n", "axes[0, 1].set_ylabel('tanh(z)')\n", "axes[0, 1].grid(True, alpha=0.3)\n", "axes[0, 1].legend()\n", "\n", "axes[1, 1].plot(z, tanh_derivative(z), 'orange', linewidth=2, label=\"tanh'(z)\")\n", "axes[1, 1].set_title('Tanh Derivative', fontsize=12, fontweight='bold')\n", "axes[1, 1].set_xlabel('z')\n", "axes[1, 1].set_ylabel(\"tanh'(z)\")\n", "axes[1, 1].grid(True, alpha=0.3)\n", "axes[1, 1].legend()\n", "\n", "# ReLU\n", "axes[0, 2].plot(z, relu(z), 'm-', linewidth=2, label='ReLU(z)')\n", "axes[0, 2].axhline(y=0, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 
2].set_title('ReLU Function', fontsize=12, fontweight='bold')\n", "axes[0, 2].set_xlabel('z')\n", "axes[0, 2].set_ylabel('ReLU(z)')\n", "axes[0, 2].grid(True, alpha=0.3)\n", "axes[0, 2].legend()\n", "\n", "axes[1, 2].plot(z, relu_derivative(z), 'c-', linewidth=2, label=\"ReLU'(z)\")\n", "axes[1, 2].set_title('ReLU Derivative', fontsize=12, fontweight='bold')\n", "axes[1, 2].set_xlabel('z')\n", "axes[1, 2].set_ylabel(\"ReLU'(z)\")\n", "axes[1, 2].grid(True, alpha=0.3)\n", "axes[1, 2].legend()\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"Key Observations:\")\n", "print(\"- Sigmoid: Saturates at 0 and 1, derivative peaks at 0.25\")\n", "print(\"- Tanh: Zero-centered, saturates at -1 and 1, stronger gradients than sigmoid\")\n", "print(\"- ReLU: No saturation for positive inputs; zero gradient for negative inputs can cause dead neurons\")" ] },
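{ "cell_type": "markdown", "id": "3f1c9a20", "metadata": {}, "source": [ "Before using these functions in a network, a quick optional sanity check: compare each analytic derivative against a central-difference approximation $(f(z+\\epsilon) - f(z-\\epsilon)) / (2\\epsilon)$. This is a minimal sketch using only the functions defined above; the step size and evaluation grid are arbitrary choices." ] },
{ "cell_type": "code", "execution_count": null, "id": "8d2e4b71", "metadata": {}, "outputs": [], "source": [ "# Sanity check (illustrative): compare analytic derivatives with central differences\n", "eps = 1e-5\n", "z_check = np.linspace(-4, 4, 50) # grid avoids z = 0, where ReLU is not differentiable\n", "\n", "for name, f, df in [('sigmoid', sigmoid, sigmoid_derivative), ('tanh', tanh, tanh_derivative), ('relu', relu, relu_derivative)]:\n", "    numerical = (f(z_check + eps) - f(z_check - eps)) / (2 * eps)\n", "    max_err = np.max(np.abs(numerical - df(z_check)))\n", "    print(f\"{name:8s}: max |analytic - numerical| = {max_err:.2e}\")" ] },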
{ "cell_type": "markdown", "id": "82a62a7a", "metadata": {}, "source": [ "## 2. Simple MLP Implementation\n", "\n", "A 2-layer MLP with:\n", "- Input layer: 2 features\n", "- Hidden layer: tanh activation (4 neurons by default; we use 8 for training below)\n", "- Output layer: 1 neuron with sigmoid activation" ] },
{ "cell_type": "code", "execution_count": null, "id": "51d37e66", "metadata": {}, "outputs": [], "source": [ "class SimpleMLP:\n", " def __init__(self, input_size=2, hidden_size=4, output_size=1):\n", " # Initialize weights with He-style scaling (sqrt(2 / fan_in))\n", " self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)\n", " self.b1 = np.zeros((1, hidden_size))\n", " self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)\n", " self.b2 = np.zeros((1, output_size))\n", " \n", " # For storing intermediate values during forward pass\n", " self.cache = {}\n", " \n", " def forward(self, X):\n", " \"\"\"Forward propagation\"\"\"\n", " # Layer 1: Input -> Hidden\n", " self.cache['Z1'] = np.dot(X, self.W1) + self.b1\n", " self.cache['A1'] = tanh(self.cache['Z1'])\n", " \n", " # Layer 2: Hidden -> Output\n", " self.cache['Z2'] = np.dot(self.cache['A1'], self.W2) + self.b2\n", " self.cache['A2'] = sigmoid(self.cache['Z2'])\n", " \n", " return self.cache['A2']\n", " \n", " def backward(self, X, y, learning_rate=0.01):\n", " \"\"\"Backpropagation\"\"\"\n", " m = X.shape[0]\n", " \n", " # Output layer gradients\n", " dZ2 = self.cache['A2'] - y\n", " dW2 = (1/m) * np.dot(self.cache['A1'].T, dZ2)\n", " db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)\n", " \n", " # Hidden layer gradients\n", " dA1 = np.dot(dZ2, self.W2.T)\n", " dZ1 = dA1 * tanh_derivative(self.cache['Z1'])\n", " dW1 = (1/m) * np.dot(X.T, dZ1)\n", " db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)\n", " \n", " # Update parameters\n", " self.W2 -= learning_rate * dW2\n", " self.b2 -= learning_rate * db2\n", " self.W1 -= learning_rate * dW1\n", " self.b1 -= learning_rate * db1\n", " \n", " return dW2, db2, dW1, db1\n", " \n", " def compute_loss(self, y_pred, y_true):\n", " \"\"\"Binary cross-entropy loss\"\"\"\n", " m = y_true.shape[0]\n", " loss = -(1/m) * np.sum(y_true * np.log(y_pred + 1e-8) + \n", " (1 - y_true) * np.log(1 - y_pred + 1e-8))\n", " return loss\n", " \n", " def train(self, X, y, epochs=1000, learning_rate=0.1, verbose=True):\n", " \"\"\"Training loop\"\"\"\n", " losses = []\n", " \n", " for epoch in range(epochs):\n", " # Forward pass\n", " y_pred = self.forward(X)\n", " \n", " # Compute loss\n", " loss = self.compute_loss(y_pred, y)\n", " losses.append(loss)\n", " \n", " # Backward pass\n", " self.backward(X, y, learning_rate)\n", " \n", " if verbose and (epoch % 100 == 0 or epoch == epochs - 1):\n", " accuracy = np.mean((y_pred > 0.5) == y)\n", " print(f\"Epoch {epoch:4d}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}\")\n", " \n", " return losses\n", "\n", "print(\"SimpleMLP class defined successfully!\")" ] },
{ "cell_type": "markdown", "id": "2f948fb7", "metadata": {}, "source": [ "## 3. Training on Moons Dataset\n", "\n", "Let's train our MLP on a dataset that is not linearly separable." ] },
{ "cell_type": "code", "execution_count": null, "id": "a8e305c3", "metadata": {}, "outputs": [], "source": [ "# Generate moons dataset\n", "X, y = make_moons(n_samples=300, noise=0.2, random_state=42)\n", "y = y.reshape(-1, 1) # Reshape for consistency\n", "\n", "# Split data first, then standardize: fit the scaler on the training set only\n", "# so that test-set statistics do not leak into preprocessing\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "scaler = StandardScaler()\n", "X_train = scaler.fit_transform(X_train)\n", "X_test = scaler.transform(X_test)\n", "X = scaler.transform(X) # standardized copy of all points, reused later for plot limits\n", "\n", "# Visualize dataset\n", "plt.figure(figsize=(10, 4))\n", "\n", "plt.subplot(1, 2, 1)\n", "plt.scatter(X_train[y_train.ravel()==0, 0], X_train[y_train.ravel()==0, 1], \n", " c='blue', label='Class 0', alpha=0.6, edgecolors='k')\n", "plt.scatter(X_train[y_train.ravel()==1, 0], X_train[y_train.ravel()==1, 1], \n", " c='red', label='Class 1', alpha=0.6, edgecolors='k')\n", "plt.title('Training Data', fontsize=14, fontweight='bold')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.scatter(X_test[y_test.ravel()==0, 0], X_test[y_test.ravel()==0, 1], \n", " c='blue', label='Class 0', alpha=0.6, edgecolors='k')\n", "plt.scatter(X_test[y_test.ravel()==1, 0], X_test[y_test.ravel()==1, 1], \n", " c='red', label='Class 1', alpha=0.6, edgecolors='k')\n", "plt.title('Test Data', fontsize=14, fontweight='bold')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(f\"Training samples: {X_train.shape[0]}\")\n", "print(f\"Test samples: {X_test.shape[0]}\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "9626516f", "metadata": {}, "outputs": [], "source": [ "# Train the model\n", "mlp = SimpleMLP(input_size=2, hidden_size=8, output_size=1)\n", "losses = mlp.train(X_train, y_train, epochs=2000, learning_rate=0.5, verbose=True)\n", "\n", "# Evaluate on test set\n", "y_test_pred = mlp.forward(X_test)\n", "test_accuracy = np.mean((y_test_pred > 0.5) == y_test)\n", "print(f\"\\nTest Accuracy: {test_accuracy:.4f}\")" ] },
{ "cell_type": "markdown", "id": "6bee4e6b", "metadata": {}, "source": [ "## 4. 
Visualizing Training Progress and Decision Boundary" ] }, { "cell_type": "code", "execution_count": null, "id": "a7310670", "metadata": {}, "outputs": [], "source": [ "# Plot loss curve\n", "plt.figure(figsize=(15, 5))\n", "\n", "plt.subplot(1, 3, 1)\n", "plt.plot(losses, linewidth=2)\n", "plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Binary Cross-Entropy Loss')\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.subplot(1, 3, 2)\n", "plt.plot(losses[100:], linewidth=2, color='orange')\n", "plt.title('Training Loss (After Epoch 100)', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.grid(True, alpha=0.3)\n", "\n", "# Decision boundary\n", "plt.subplot(1, 3, 3)\n", "h = 0.02\n", "x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n", "y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", "Z = mlp.forward(np.c_[xx.ravel(), yy.ravel()])\n", "Z = Z.reshape(xx.shape)\n", "\n", "plt.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)\n", "plt.colorbar(label='Prediction Probability')\n", "plt.scatter(X_test[y_test.ravel()==0, 0], X_test[y_test.ravel()==0, 1], \n", " c='blue', label='Class 0', edgecolors='k', s=60)\n", "plt.scatter(X_test[y_test.ravel()==1, 0], X_test[y_test.ravel()==1, 1], \n", " c='red', label='Class 1', edgecolors='k', s=60)\n", "plt.title('Decision Boundary', fontsize=14, fontweight='bold')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e56a5cec", "metadata": {}, "source": [ "## 5. Gradient Flow Visualization\n", "\n", "Understanding how gradients flow through the network" ] }, { "cell_type": "code", "execution_count": null, "id": "cd47bc5f", "metadata": {}, "outputs": [], "source": [ "# Train a fresh model and track gradient magnitudes\n", "mlp_grad = SimpleMLP(input_size=2, hidden_size=8, output_size=1)\n", "\n", "grad_magnitudes_W1 = []\n", "grad_magnitudes_W2 = []\n", "\n", "for epoch in range(500):\n", " y_pred = mlp_grad.forward(X_train)\n", " dW2, db2, dW1, db1 = mlp_grad.backward(X_train, y_train, learning_rate=0.5)\n", " \n", " # Track gradient magnitudes\n", " grad_magnitudes_W1.append(np.linalg.norm(dW1))\n", " grad_magnitudes_W2.append(np.linalg.norm(dW2))\n", "\n", "# Plot gradient magnitudes\n", "plt.figure(figsize=(12, 4))\n", "\n", "plt.subplot(1, 2, 1)\n", "plt.plot(grad_magnitudes_W1, label='Layer 1 (Input → Hidden)', linewidth=2)\n", "plt.plot(grad_magnitudes_W2, label='Layer 2 (Hidden → Output)', linewidth=2)\n", "plt.title('Gradient Magnitude Over Training', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Gradient Norm')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(grad_magnitudes_W1[50:], label='Layer 1', linewidth=2)\n", "plt.plot(grad_magnitudes_W2[50:], label='Layer 2', linewidth=2)\n", "plt.title('Gradient Magnitude (After Epoch 50)', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Gradient Norm')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"Gradient flow analysis:\")\n", "print(f\"Final W1 gradient magnitude: {grad_magnitudes_W1[-1]:.6f}\")\n", "print(f\"Final W2 gradient magnitude: {grad_magnitudes_W2[-1]:.6f}\")" ] }, { "cell_type": "markdown", "id": "3558dc78", "metadata": 
{}, "source": [ "## 6. Comparing Different Hidden Layer Sizes\n", "\n", "Effect of network capacity on learning" ] }, { "cell_type": "code", "execution_count": null, "id": "f57fa080", "metadata": {}, "outputs": [], "source": [ "# Compare different hidden layer sizes\n", "hidden_sizes = [2, 4, 8, 16, 32]\n", "results = []\n", "\n", "fig, axes = plt.subplots(2, 3, figsize=(18, 10))\n", "axes = axes.ravel()\n", "\n", "for idx, h_size in enumerate(hidden_sizes):\n", " mlp_temp = SimpleMLP(input_size=2, hidden_size=h_size, output_size=1)\n", " losses_temp = mlp_temp.train(X_train, y_train, epochs=1000, learning_rate=0.5, verbose=False)\n", " \n", " y_test_pred = mlp_temp.forward(X_test)\n", " test_acc = np.mean((y_test_pred > 0.5) == y_test)\n", " results.append({'hidden_size': h_size, 'test_accuracy': test_acc, 'final_loss': losses_temp[-1]})\n", " \n", " # Plot decision boundary\n", " h = 0.02\n", " x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n", " y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", " Z = mlp_temp.forward(np.c_[xx.ravel(), yy.ravel()])\n", " Z = Z.reshape(xx.shape)\n", " \n", " axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)\n", " axes[idx].scatter(X_test[y_test.ravel()==0, 0], X_test[y_test.ravel()==0, 1], \n", " c='blue', edgecolors='k', s=40, alpha=0.8)\n", " axes[idx].scatter(X_test[y_test.ravel()==1, 0], X_test[y_test.ravel()==1, 1], \n", " c='red', edgecolors='k', s=40, alpha=0.8)\n", " axes[idx].set_title(f'Hidden Size = {h_size}\\nAccuracy = {test_acc:.3f}', \n", " fontsize=12, fontweight='bold')\n", " axes[idx].set_xlabel('Feature 1')\n", " axes[idx].set_ylabel('Feature 2')\n", "\n", "# Remove extra subplot\n", "axes[5].axis('off')\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "# Print results\n", "print(\"\\nComparison of Hidden Layer Sizes:\")\n", "print(\"=\"*50)\n", "for r in results:\n", " print(f\"Hidden Size: {r['hidden_size']:2d} | Test Acc: {r['test_accuracy']:.4f} | Loss: {r['final_loss']:.4f}\")" ] }, { "cell_type": "markdown", "id": "e119905c", "metadata": {}, "source": [ "## 7. 
Backpropagation Step-by-Step Example\n", "\n", "A manual forward and backward pass for a single training example." ] },
{ "cell_type": "code", "execution_count": null, "id": "a7094692", "metadata": {}, "outputs": [], "source": [ "# Create a tiny network for demonstration\n", "print(\"Manual Backpropagation Calculation\")\n", "print(\"=\"*60)\n", "\n", "# Initialize small weights for clarity\n", "W1 = np.array([[0.5, 0.2], [0.1, 0.3]])\n", "b1 = np.array([[0.1, 0.2]])\n", "W2 = np.array([[0.4], [0.6]])\n", "b2 = np.array([[0.1]])\n", "\n", "# Single input example\n", "x = np.array([[1.0, 0.5]])\n", "y = np.array([[1.0]])\n", "\n", "print(\"\\nInput: x =\", x[0])\n", "print(\"Target: y =\", y[0, 0])\n", "\n", "# Forward pass\n", "print(\"\\n--- Forward Pass ---\")\n", "Z1 = np.dot(x, W1) + b1\n", "print(f\"Z1 = x @ W1 + b1 = {Z1[0]}\")\n", "\n", "A1 = tanh(Z1)\n", "print(f\"A1 = tanh(Z1) = {A1[0]}\")\n", "\n", "Z2 = np.dot(A1, W2) + b2\n", "print(f\"Z2 = A1 @ W2 + b2 = {Z2[0, 0]:.4f}\")\n", "\n", "A2 = sigmoid(Z2)\n", "print(f\"A2 = sigmoid(Z2) = {A2[0, 0]:.4f}\")\n", "\n", "loss = -(y * np.log(A2 + 1e-8) + (1 - y) * np.log(1 - A2 + 1e-8))\n", "print(f\"\\nLoss = {loss[0, 0]:.4f}\")\n", "\n", "# Backward pass\n", "print(\"\\n--- Backward Pass ---\")\n", "dZ2 = A2 - y\n", "print(f\"dZ2 = A2 - y = {dZ2[0, 0]:.4f}\")\n", "\n", "dW2 = np.dot(A1.T, dZ2)\n", "print(f\"dW2 = A1.T @ dZ2 = {dW2.ravel()}\")\n", "\n", "db2 = dZ2\n", "print(f\"db2 = {db2[0, 0]:.4f}\")\n", "\n", "dA1 = np.dot(dZ2, W2.T)\n", "print(f\"dA1 = dZ2 @ W2.T = {dA1[0]}\")\n", "\n", "dZ1 = dA1 * tanh_derivative(Z1)\n", "print(f\"dZ1 = dA1 * tanh'(Z1) = {dZ1[0]}\")\n", "\n", "dW1 = np.dot(x.T, dZ1)\n", "print(f\"dW1 = x.T @ dZ1 = \\n{dW1}\")\n", "\n", "db1 = dZ1\n", "print(f\"db1 = {db1[0]}\")\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"Gradients computed successfully!\")" ] },
{ "cell_type": "markdown", "id": "f7fc41d1", "metadata": {}, "source": [ "## 8. Key Takeaways\n", "\n", "1. **Forward Pass**: Compute activations layer by layer\n", " - $Z^{(l)} = A^{(l-1)} W^{(l)} + b^{(l)}$ (row-vector convention, matching the code above)\n", " - $A^{(l)} = \\sigma(Z^{(l)})$\n", "\n", "2. **Backward Pass**: Compute gradients using the chain rule\n", " - Start from the output: $dZ^{(L)} = A^{(L)} - y$ (for a sigmoid output with cross-entropy loss)\n", " - Propagate back: $dZ^{(l)} = dA^{(l)} \\odot \\sigma'(Z^{(l)})$\n", "\n", "3. **Activation Functions**:\n", " - Sigmoid/Tanh: Can saturate, causing vanishing gradients\n", " - ReLU: Better gradient flow, but can have dead neurons\n", "\n", "4. **Network Capacity**: More hidden neurons → more complex decision boundaries\n", "\n", "5. **Gradient Magnitude**: Monitor to detect vanishing/exploding gradients" ] },
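{ "cell_type": "markdown", "id": "c4e7d1f9", "metadata": {}, "source": [ "## 9. Bonus: Numerical Gradient Check\n", "\n", "As a final sanity check, we can compare the analytic gradients returned by `backward()` against a central-difference approximation of the loss. This is a minimal illustrative sketch: the helper `numerical_grad`, the batch size, and the step size `eps` are assumptions made for this check, not part of the implementation above. If backpropagation is consistent with the loss, the differences should be tiny (roughly $10^{-7}$ or smaller)." ] },
{ "cell_type": "code", "execution_count": null, "id": "9b5f3a6c", "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (illustrative): analytic vs. numerical gradients\n", "np.random.seed(0)\n", "check_mlp = SimpleMLP(input_size=2, hidden_size=3, output_size=1)\n", "X_check, y_check = X_train[:5], y_train[:5]\n", "\n", "# Analytic gradients; learning_rate=0.0 leaves the parameters unchanged\n", "check_mlp.forward(X_check)\n", "dW2, db2, dW1, db1 = check_mlp.backward(X_check, y_check, learning_rate=0.0)\n", "\n", "def numerical_grad(param, analytic, eps=1e-6):\n", "    \"\"\"Max abs difference between analytic and central-difference gradients.\"\"\"\n", "    num = np.zeros_like(param)\n", "    for idx in np.ndindex(param.shape):\n", "        original = param[idx]\n", "        param[idx] = original + eps\n", "        loss_plus = check_mlp.compute_loss(check_mlp.forward(X_check), y_check)\n", "        param[idx] = original - eps\n", "        loss_minus = check_mlp.compute_loss(check_mlp.forward(X_check), y_check)\n", "        param[idx] = original  # restore the parameter\n", "        num[idx] = (loss_plus - loss_minus) / (2 * eps)\n", "    return np.max(np.abs(num - analytic))\n", "\n", "for name, param, grad in [('W1', check_mlp.W1, dW1), ('b1', check_mlp.b1, db1), ('W2', check_mlp.W2, dW2), ('b2', check_mlp.b2, db2)]:\n", "    print(f\"d{name}: max |analytic - numerical| = {numerical_grad(param, grad):.2e}\")" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }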