{ "cells": [ { "cell_type": "markdown", "id": "813f2756", "metadata": {}, "source": [ "# Multi-Layer Perceptron & Backpropagation\n", "\n", "This notebook demonstrates:\n", "- MLP architecture and forward propagation\n", "- Activation functions and their derivatives\n", "- Backpropagation algorithm step-by-step\n", "- Training a simple MLP on real data" ] }, { "cell_type": "code", "execution_count": null, "id": "ce2ab819", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.datasets import make_moons, make_circles\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "plt.style.use('seaborn-v0_8-darkgrid')\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "id": "9d6c3d7e", "metadata": {}, "source": [ "## 1. Activation Functions and Their Derivatives\n", "\n", "Understanding activation functions is crucial for neural networks:\n", "- **Sigmoid**: $\\sigma(z) = \\frac{1}{1 + e^{-z}}$\n", "- **Tanh**: $\\tanh(z) = \\frac{e^z - e^{-z}}{e^z + e^{-z}}$\n", "- **ReLU**: $\\text{ReLU}(z) = \\max(0, z)$" ] }, { "cell_type": "code", "execution_count": null, "id": "0cf33da5", "metadata": {}, "outputs": [], "source": [ "# Activation functions and derivatives\n", "def sigmoid(z):\n", " return 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n", "\n", "def sigmoid_derivative(z):\n", " s = sigmoid(z)\n", " return s * (1 - s)\n", "\n", "def tanh(z):\n", " return np.tanh(z)\n", "\n", "def tanh_derivative(z):\n", " return 1 - np.tanh(z)**2\n", "\n", "def relu(z):\n", " return np.maximum(0, z)\n", "\n", "def relu_derivative(z):\n", " return (z > 0).astype(float)\n", "\n", "# Visualize activation functions\n", "z = np.linspace(-5, 5, 100)\n", "\n", "fig, axes = plt.subplots(2, 3, figsize=(15, 8))\n", "\n", "# Sigmoid\n", "axes[0, 0].plot(z, sigmoid(z), 'b-', linewidth=2, label='sigmoid(z)')\n", "axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 0].axhline(y=1, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 0].set_title('Sigmoid Function', fontsize=12, fontweight='bold')\n", "axes[0, 0].set_xlabel('z')\n", "axes[0, 0].set_ylabel('σ(z)')\n", "axes[0, 0].grid(True, alpha=0.3)\n", "axes[0, 0].legend()\n", "\n", "axes[1, 0].plot(z, sigmoid_derivative(z), 'r-', linewidth=2, label=\"σ'(z)\")\n", "axes[1, 0].set_title('Sigmoid Derivative', fontsize=12, fontweight='bold')\n", "axes[1, 0].set_xlabel('z')\n", "axes[1, 0].set_ylabel(\"σ'(z)\")\n", "axes[1, 0].grid(True, alpha=0.3)\n", "axes[1, 0].legend()\n", "\n", "# Tanh\n", "axes[0, 1].plot(z, tanh(z), 'g-', linewidth=2, label='tanh(z)')\n", "axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 1].axhline(y=1, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 1].axhline(y=-1, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 1].set_title('Tanh Function', fontsize=12, fontweight='bold')\n", "axes[0, 1].set_xlabel('z')\n", "axes[0, 1].set_ylabel('tanh(z)')\n", "axes[0, 1].grid(True, alpha=0.3)\n", "axes[0, 1].legend()\n", "\n", "axes[1, 1].plot(z, tanh_derivative(z), 'orange', linewidth=2, label=\"tanh'(z)\")\n", "axes[1, 1].set_title('Tanh Derivative', fontsize=12, fontweight='bold')\n", "axes[1, 1].set_xlabel('z')\n", "axes[1, 1].set_ylabel(\"tanh'(z)\")\n", "axes[1, 1].grid(True, alpha=0.3)\n", "axes[1, 1].legend()\n", "\n", "# ReLU\n", "axes[0, 2].plot(z, relu(z), 'm-', linewidth=2, label='ReLU(z)')\n", "axes[0, 2].axhline(y=0, color='k', linestyle='--', alpha=0.3)\n", "axes[0, 
2].set_title('ReLU Function', fontsize=12, fontweight='bold')\n", "axes[0, 2].set_xlabel('z')\n", "axes[0, 2].set_ylabel('ReLU(z)')\n", "axes[0, 2].grid(True, alpha=0.3)\n", "axes[0, 2].legend()\n", "\n", "axes[1, 2].plot(z, relu_derivative(z), 'c-', linewidth=2, label=\"ReLU'(z)\")\n", "axes[1, 2].set_title('ReLU Derivative', fontsize=12, fontweight='bold')\n", "axes[1, 2].set_xlabel('z')\n", "axes[1, 2].set_ylabel(\"ReLU'(z)\")\n", "axes[1, 2].grid(True, alpha=0.3)\n", "axes[1, 2].legend()\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"Key Observations:\")\n", "print(\"- Sigmoid: Saturates at 0 and 1, derivative peaks at 0.25\")\n", "print(\"- Tanh: Zero-centered, saturates at -1 and 1, stronger gradients than sigmoid\")\n", "print(\"- ReLU: No saturation for positive inputs; zero gradient for negative inputs can cause dead neurons\")" ] },
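{ "cell_type": "markdown", "id": "3f1c9a20", "metadata": {}, "source": [ "Before using these functions in a network, a quick optional sanity check: compare each analytic derivative against a central-difference approximation $(f(z+\\epsilon) - f(z-\\epsilon)) / (2\\epsilon)$. This is a minimal sketch using only the functions defined above; the step size and evaluation grid are arbitrary choices." ] },
{ "cell_type": "code", "execution_count": null, "id": "8d2e4b71", "metadata": {}, "outputs": [], "source": [ "# Sanity check (illustrative): compare analytic derivatives with central differences\n", "eps = 1e-5\n", "z_check = np.linspace(-4, 4, 50) # grid avoids z = 0, where ReLU is not differentiable\n", "\n", "for name, f, df in [('sigmoid', sigmoid, sigmoid_derivative), ('tanh', tanh, tanh_derivative), ('relu', relu, relu_derivative)]:\n", "    numerical = (f(z_check + eps) - f(z_check - eps)) / (2 * eps)\n", "    max_err = np.max(np.abs(numerical - df(z_check)))\n", "    print(f\"{name:8s}: max |analytic - numerical| = {max_err:.2e}\")" ] },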
{ "cell_type": "markdown", "id": "82a62a7a", "metadata": {}, "source": [ "## 2. Simple MLP Implementation\n", "\n", "A 2-layer MLP with:\n", "- Input layer: 2 features\n", "- Hidden layer: tanh activation (4 neurons by default; we use 8 for training below)\n", "- Output layer: 1 neuron with sigmoid activation" ] },
{ "cell_type": "code", "execution_count": null, "id": "51d37e66", "metadata": {}, "outputs": [], "source": [ "class SimpleMLP:\n", " def __init__(self, input_size=2, hidden_size=4, output_size=1):\n", " # Initialize weights with He-style scaling (sqrt(2 / fan_in))\n", " self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2.0 / input_size)\n", " self.b1 = np.zeros((1, hidden_size))\n", " self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2.0 / hidden_size)\n", " self.b2 = np.zeros((1, output_size))\n", " \n", " # For storing intermediate values during forward pass\n", " self.cache = {}\n", " \n", " def forward(self, X):\n", " \"\"\"Forward propagation\"\"\"\n", " # Layer 1: Input -> Hidden\n", " self.cache['Z1'] = np.dot(X, self.W1) + self.b1\n", " self.cache['A1'] = tanh(self.cache['Z1'])\n", " \n", " # Layer 2: Hidden -> Output\n", " self.cache['Z2'] = np.dot(self.cache['A1'], self.W2) + self.b2\n", " self.cache['A2'] = sigmoid(self.cache['Z2'])\n", " \n", " return self.cache['A2']\n", " \n", " def backward(self, X, y, learning_rate=0.01):\n", " \"\"\"Backpropagation\"\"\"\n", " m = X.shape[0]\n", " \n", " # Output layer gradients\n", " dZ2 = self.cache['A2'] - y\n", " dW2 = (1/m) * np.dot(self.cache['A1'].T, dZ2)\n", " db2 = (1/m) * np.sum(dZ2, axis=0, keepdims=True)\n", " \n", " # Hidden layer gradients\n", " dA1 = np.dot(dZ2, self.W2.T)\n", " dZ1 = dA1 * tanh_derivative(self.cache['Z1'])\n", " dW1 = (1/m) * np.dot(X.T, dZ1)\n", " db1 = (1/m) * np.sum(dZ1, axis=0, keepdims=True)\n", " \n", " # Update parameters\n", " self.W2 -= learning_rate * dW2\n", " self.b2 -= learning_rate * db2\n", " self.W1 -= learning_rate * dW1\n", " self.b1 -= learning_rate * db1\n", " \n", " return dW2, db2, dW1, db1\n", " \n", " def compute_loss(self, y_pred, y_true):\n", " \"\"\"Binary cross-entropy loss\"\"\"\n", " m = y_true.shape[0]\n", " loss = -(1/m) * np.sum(y_true * np.log(y_pred + 1e-8) + \n", " (1 - y_true) * np.log(1 - y_pred + 1e-8))\n", " return loss\n", " \n", " def train(self, X, y, epochs=1000, learning_rate=0.1, verbose=True):\n", " \"\"\"Training loop\"\"\"\n", " losses = []\n", " \n", " for epoch in range(epochs):\n", " # Forward pass\n", " y_pred = self.forward(X)\n", " \n", " # Compute loss\n", " loss = self.compute_loss(y_pred, y)\n", " losses.append(loss)\n", " \n", " # Backward pass\n", " self.backward(X, y, learning_rate)\n", " \n", " if verbose and (epoch % 100 == 0 or epoch == epochs - 1):\n", " accuracy = np.mean((y_pred > 0.5) == y)\n", " print(f\"Epoch {epoch:4d}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}\")\n", " \n", " return losses\n", "\n", "print(\"SimpleMLP class defined successfully!\")" ] },
{ "cell_type": "markdown", "id": "2f948fb7", "metadata": {}, "source": [ "## 3. Training on Moons Dataset\n", "\n", "Let's train our MLP on a dataset that is not linearly separable." ] },
{ "cell_type": "code", "execution_count": null, "id": "a8e305c3", "metadata": {}, "outputs": [], "source": [ "# Generate moons dataset\n", "X, y = make_moons(n_samples=300, noise=0.2, random_state=42)\n", "y = y.reshape(-1, 1) # Reshape for consistency\n", "\n", "# Split data first, then standardize: fit the scaler on the training set only\n", "# so that test-set statistics do not leak into preprocessing\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "scaler = StandardScaler()\n", "X_train = scaler.fit_transform(X_train)\n", "X_test = scaler.transform(X_test)\n", "X = scaler.transform(X) # standardized copy of all points, reused later for plot limits\n", "\n", "# Visualize dataset\n", "plt.figure(figsize=(10, 4))\n", "\n", "plt.subplot(1, 2, 1)\n", "plt.scatter(X_train[y_train.ravel()==0, 0], X_train[y_train.ravel()==0, 1], \n", " c='blue', label='Class 0', alpha=0.6, edgecolors='k')\n", "plt.scatter(X_train[y_train.ravel()==1, 0], X_train[y_train.ravel()==1, 1], \n", " c='red', label='Class 1', alpha=0.6, edgecolors='k')\n", "plt.title('Training Data', fontsize=14, fontweight='bold')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.scatter(X_test[y_test.ravel()==0, 0], X_test[y_test.ravel()==0, 1], \n", " c='blue', label='Class 0', alpha=0.6, edgecolors='k')\n", "plt.scatter(X_test[y_test.ravel()==1, 0], X_test[y_test.ravel()==1, 1], \n", " c='red', label='Class 1', alpha=0.6, edgecolors='k')\n", "plt.title('Test Data', fontsize=14, fontweight='bold')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(f\"Training samples: {X_train.shape[0]}\")\n", "print(f\"Test samples: {X_test.shape[0]}\")" ] },
{ "cell_type": "code", "execution_count": null, "id": "9626516f", "metadata": {}, "outputs": [], "source": [ "# Train the model\n", "mlp = SimpleMLP(input_size=2, hidden_size=8, output_size=1)\n", "losses = mlp.train(X_train, y_train, epochs=2000, learning_rate=0.5, verbose=True)\n", "\n", "# Evaluate on test set\n", "y_test_pred = mlp.forward(X_test)\n", "test_accuracy = np.mean((y_test_pred > 0.5) == y_test)\n", "print(f\"\\nTest Accuracy: {test_accuracy:.4f}\")" ] },
{ "cell_type": "markdown", "id": "6bee4e6b", "metadata": {}, "source": [ "## 4. 
Visualizing Training Progress and Decision Boundary" ] }, { "cell_type": "code", "execution_count": null, "id": "a7310670", "metadata": {}, "outputs": [], "source": [ "# Plot loss curve\n", "plt.figure(figsize=(15, 5))\n", "\n", "plt.subplot(1, 3, 1)\n", "plt.plot(losses, linewidth=2)\n", "plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Binary Cross-Entropy Loss')\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.subplot(1, 3, 2)\n", "plt.plot(losses[100:], linewidth=2, color='orange')\n", "plt.title('Training Loss (After Epoch 100)', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Loss')\n", "plt.grid(True, alpha=0.3)\n", "\n", "# Decision boundary\n", "plt.subplot(1, 3, 3)\n", "h = 0.02\n", "x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n", "y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", "Z = mlp.forward(np.c_[xx.ravel(), yy.ravel()])\n", "Z = Z.reshape(xx.shape)\n", "\n", "plt.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)\n", "plt.colorbar(label='Prediction Probability')\n", "plt.scatter(X_test[y_test.ravel()==0, 0], X_test[y_test.ravel()==0, 1], \n", " c='blue', label='Class 0', edgecolors='k', s=60)\n", "plt.scatter(X_test[y_test.ravel()==1, 0], X_test[y_test.ravel()==1, 1], \n", " c='red', label='Class 1', edgecolors='k', s=60)\n", "plt.title('Decision Boundary', fontsize=14, fontweight='bold')\n", "plt.xlabel('Feature 1')\n", "plt.ylabel('Feature 2')\n", "plt.legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e56a5cec", "metadata": {}, "source": [ "## 5. Gradient Flow Visualization\n", "\n", "Understanding how gradients flow through the network" ] }, { "cell_type": "code", "execution_count": null, "id": "cd47bc5f", "metadata": {}, "outputs": [], "source": [ "# Train a fresh model and track gradient magnitudes\n", "mlp_grad = SimpleMLP(input_size=2, hidden_size=8, output_size=1)\n", "\n", "grad_magnitudes_W1 = []\n", "grad_magnitudes_W2 = []\n", "\n", "for epoch in range(500):\n", " y_pred = mlp_grad.forward(X_train)\n", " dW2, db2, dW1, db1 = mlp_grad.backward(X_train, y_train, learning_rate=0.5)\n", " \n", " # Track gradient magnitudes\n", " grad_magnitudes_W1.append(np.linalg.norm(dW1))\n", " grad_magnitudes_W2.append(np.linalg.norm(dW2))\n", "\n", "# Plot gradient magnitudes\n", "plt.figure(figsize=(12, 4))\n", "\n", "plt.subplot(1, 2, 1)\n", "plt.plot(grad_magnitudes_W1, label='Layer 1 (Input → Hidden)', linewidth=2)\n", "plt.plot(grad_magnitudes_W2, label='Layer 2 (Hidden → Output)', linewidth=2)\n", "plt.title('Gradient Magnitude Over Training', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Gradient Norm')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.subplot(1, 2, 2)\n", "plt.plot(grad_magnitudes_W1[50:], label='Layer 1', linewidth=2)\n", "plt.plot(grad_magnitudes_W2[50:], label='Layer 2', linewidth=2)\n", "plt.title('Gradient Magnitude (After Epoch 50)', fontsize=14, fontweight='bold')\n", "plt.xlabel('Epoch')\n", "plt.ylabel('Gradient Norm')\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "print(\"Gradient flow analysis:\")\n", "print(f\"Final W1 gradient magnitude: {grad_magnitudes_W1[-1]:.6f}\")\n", "print(f\"Final W2 gradient magnitude: {grad_magnitudes_W2[-1]:.6f}\")" ] }, { "cell_type": "markdown", "id": "3558dc78", "metadata": 
{}, "source": [ "## 6. Comparing Different Hidden Layer Sizes\n", "\n", "Effect of network capacity on learning" ] }, { "cell_type": "code", "execution_count": null, "id": "f57fa080", "metadata": {}, "outputs": [], "source": [ "# Compare different hidden layer sizes\n", "hidden_sizes = [2, 4, 8, 16, 32]\n", "results = []\n", "\n", "fig, axes = plt.subplots(2, 3, figsize=(18, 10))\n", "axes = axes.ravel()\n", "\n", "for idx, h_size in enumerate(hidden_sizes):\n", " mlp_temp = SimpleMLP(input_size=2, hidden_size=h_size, output_size=1)\n", " losses_temp = mlp_temp.train(X_train, y_train, epochs=1000, learning_rate=0.5, verbose=False)\n", " \n", " y_test_pred = mlp_temp.forward(X_test)\n", " test_acc = np.mean((y_test_pred > 0.5) == y_test)\n", " results.append({'hidden_size': h_size, 'test_accuracy': test_acc, 'final_loss': losses_temp[-1]})\n", " \n", " # Plot decision boundary\n", " h = 0.02\n", " x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n", " y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", " Z = mlp_temp.forward(np.c_[xx.ravel(), yy.ravel()])\n", " Z = Z.reshape(xx.shape)\n", " \n", " axes[idx].contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)\n", " axes[idx].scatter(X_test[y_test.ravel()==0, 0], X_test[y_test.ravel()==0, 1], \n", " c='blue', edgecolors='k', s=40, alpha=0.8)\n", " axes[idx].scatter(X_test[y_test.ravel()==1, 0], X_test[y_test.ravel()==1, 1], \n", " c='red', edgecolors='k', s=40, alpha=0.8)\n", " axes[idx].set_title(f'Hidden Size = {h_size}\\nAccuracy = {test_acc:.3f}', \n", " fontsize=12, fontweight='bold')\n", " axes[idx].set_xlabel('Feature 1')\n", " axes[idx].set_ylabel('Feature 2')\n", "\n", "# Remove extra subplot\n", "axes[5].axis('off')\n", "\n", "plt.tight_layout()\n", "plt.show()\n", "\n", "# Print results\n", "print(\"\\nComparison of Hidden Layer Sizes:\")\n", "print(\"=\"*50)\n", "for r in results:\n", " print(f\"Hidden Size: {r['hidden_size']:2d} | Test Acc: {r['test_accuracy']:.4f} | Loss: {r['final_loss']:.4f}\")" ] }, { "cell_type": "markdown", "id": "e119905c", "metadata": {}, "source": [ "## 7. 
Backpropagation Step-by-Step Example\n", "\n", "A manual forward and backward pass for a single training example." ] },
{ "cell_type": "code", "execution_count": null, "id": "a7094692", "metadata": {}, "outputs": [], "source": [ "# Create a tiny network for demonstration\n", "print(\"Manual Backpropagation Calculation\")\n", "print(\"=\"*60)\n", "\n", "# Initialize small weights for clarity\n", "W1 = np.array([[0.5, 0.2], [0.1, 0.3]])\n", "b1 = np.array([[0.1, 0.2]])\n", "W2 = np.array([[0.4], [0.6]])\n", "b2 = np.array([[0.1]])\n", "\n", "# Single input example\n", "x = np.array([[1.0, 0.5]])\n", "y = np.array([[1.0]])\n", "\n", "print(\"\\nInput: x =\", x[0])\n", "print(\"Target: y =\", y[0, 0])\n", "\n", "# Forward pass\n", "print(\"\\n--- Forward Pass ---\")\n", "Z1 = np.dot(x, W1) + b1\n", "print(f\"Z1 = x @ W1 + b1 = {Z1[0]}\")\n", "\n", "A1 = tanh(Z1)\n", "print(f\"A1 = tanh(Z1) = {A1[0]}\")\n", "\n", "Z2 = np.dot(A1, W2) + b2\n", "print(f\"Z2 = A1 @ W2 + b2 = {Z2[0, 0]:.4f}\")\n", "\n", "A2 = sigmoid(Z2)\n", "print(f\"A2 = sigmoid(Z2) = {A2[0, 0]:.4f}\")\n", "\n", "loss = -(y * np.log(A2 + 1e-8) + (1 - y) * np.log(1 - A2 + 1e-8))\n", "print(f\"\\nLoss = {loss[0, 0]:.4f}\")\n", "\n", "# Backward pass\n", "print(\"\\n--- Backward Pass ---\")\n", "dZ2 = A2 - y\n", "print(f\"dZ2 = A2 - y = {dZ2[0, 0]:.4f}\")\n", "\n", "dW2 = np.dot(A1.T, dZ2)\n", "print(f\"dW2 = A1.T @ dZ2 = {dW2.ravel()}\")\n", "\n", "db2 = dZ2\n", "print(f\"db2 = {db2[0, 0]:.4f}\")\n", "\n", "dA1 = np.dot(dZ2, W2.T)\n", "print(f\"dA1 = dZ2 @ W2.T = {dA1[0]}\")\n", "\n", "dZ1 = dA1 * tanh_derivative(Z1)\n", "print(f\"dZ1 = dA1 * tanh'(Z1) = {dZ1[0]}\")\n", "\n", "dW1 = np.dot(x.T, dZ1)\n", "print(f\"dW1 = x.T @ dZ1 = \\n{dW1}\")\n", "\n", "db1 = dZ1\n", "print(f\"db1 = {db1[0]}\")\n", "\n", "print(\"\\n\" + \"=\"*60)\n", "print(\"Gradients computed successfully!\")" ] },
{ "cell_type": "markdown", "id": "f7fc41d1", "metadata": {}, "source": [ "## 8. Key Takeaways\n", "\n", "1. **Forward Pass**: Compute activations layer by layer\n", " - $Z^{(l)} = A^{(l-1)} W^{(l)} + b^{(l)}$ (row-vector convention, matching the code above)\n", " - $A^{(l)} = \\sigma(Z^{(l)})$\n", "\n", "2. **Backward Pass**: Compute gradients using the chain rule\n", " - Start from the output: $dZ^{(L)} = A^{(L)} - y$ (for a sigmoid output with cross-entropy loss)\n", " - Propagate back: $dZ^{(l)} = dA^{(l)} \\odot \\sigma'(Z^{(l)})$\n", "\n", "3. **Activation Functions**:\n", " - Sigmoid/Tanh: Can saturate, causing vanishing gradients\n", " - ReLU: Better gradient flow, but can have dead neurons\n", "\n", "4. **Network Capacity**: More hidden neurons → more complex decision boundaries\n", "\n", "5. **Gradient Magnitude**: Monitor to detect vanishing/exploding gradients" ] },
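{ "cell_type": "markdown", "id": "c4e7d1f9", "metadata": {}, "source": [ "## 9. Bonus: Numerical Gradient Check\n", "\n", "As a final sanity check, we can compare the analytic gradients returned by `backward()` against a central-difference approximation of the loss. This is a minimal illustrative sketch: the helper `numerical_grad`, the batch size, and the step size `eps` are assumptions made for this check, not part of the implementation above. If backpropagation is consistent with the loss, the differences should be tiny (roughly $10^{-7}$ or smaller)." ] },
{ "cell_type": "code", "execution_count": null, "id": "9b5f3a6c", "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (illustrative): analytic vs. numerical gradients\n", "np.random.seed(0)\n", "check_mlp = SimpleMLP(input_size=2, hidden_size=3, output_size=1)\n", "X_check, y_check = X_train[:5], y_train[:5]\n", "\n", "# Analytic gradients; learning_rate=0.0 leaves the parameters unchanged\n", "check_mlp.forward(X_check)\n", "dW2, db2, dW1, db1 = check_mlp.backward(X_check, y_check, learning_rate=0.0)\n", "\n", "def numerical_grad(param, analytic, eps=1e-6):\n", "    \"\"\"Max abs difference between analytic and central-difference gradients.\"\"\"\n", "    num = np.zeros_like(param)\n", "    for idx in np.ndindex(param.shape):\n", "        original = param[idx]\n", "        param[idx] = original + eps\n", "        loss_plus = check_mlp.compute_loss(check_mlp.forward(X_check), y_check)\n", "        param[idx] = original - eps\n", "        loss_minus = check_mlp.compute_loss(check_mlp.forward(X_check), y_check)\n", "        param[idx] = original  # restore the parameter\n", "        num[idx] = (loss_plus - loss_minus) / (2 * eps)\n", "    return np.max(np.abs(num - analytic))\n", "\n", "for name, param, grad in [('W1', check_mlp.W1, dW1), ('b1', check_mlp.b1, db1), ('W2', check_mlp.W2, dW2), ('b2', check_mlp.b2, db2)]:\n", "    print(f\"d{name}: max |analytic - numerical| = {numerical_grad(param, grad):.2e}\")" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }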