Deep Learning Concepts for Interviews: A Guide for ML Engineers

Introduction

Deep Learning is no longer a niche subfield of AI; it is the engine driving the most significant breakthroughs in technology, from natural language understanding to computer vision. For any Machine Learning Engineer, from junior to senior, a solid grasp of its core concepts is not just beneficial—it’s essential for cracking interviews and excelling on the job.

While our previous post, “ML Foundations: From Neural Networks to Transformer Models,” covered the architectural evolution to Transformers, this guide serves as a deep dive into the practical and theoretical concepts that frequently appear in technical interviews. We will answer the “why” and “how” behind these powerful techniques.

1. Neural Network Fundamentals Revisited

Let’s start with two fundamental questions that cut to the heart of how networks “learn.”

How does back-propagation work in a neural network?

Back-propagation, short for “backward propagation of errors,” is the algorithm used to train neural networks. While forward propagation pushes input values through the network to get an output and calculate an error, back-propagation pushes the error back through the network to update the weights.

The core idea is to assign blame. It calculates the gradient of the loss function with respect to each weight and bias in the network. This gradient tells us how a small change in a specific weight would affect the overall error. The magic behind this is the chain rule from calculus.

Imagine the network as a long chain of nested functions. The chain rule allows us to compute the derivative of the final loss function with respect to any weight, layer by layer, starting from the output and moving backward.

In simple steps:

  1. Forward Pass: Make a prediction and calculate the loss (e.g., using Mean Squared Error).
  2. Compute Output Gradient: Calculate the gradient of the loss with respect to the output layer’s activations.
  3. Propagate Backward: Move backward layer by layer. For each layer, use the gradient from the subsequent layer to compute the gradient with respect to the current layer’s weights, biases, and activations.
  4. Update Weights: Use an optimizer (like SGD or Adam) to take a step in the direction opposite to the gradient, thereby minimizing the error.
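
To make these four steps concrete, here is a minimal NumPy sketch for a toy one-hidden-layer network trained with mean squared error (the layer sizes, learning rate, and plain SGD update are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 input features, 1 target value each.
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# One hidden layer with ReLU, then a linear output layer.
W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros((1, 5))
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros((1, 1))
lr = 0.01

for step in range(100):
    # 1. Forward pass: make a prediction and compute the mean squared error.
    z1 = X @ W1 + b1
    a1 = np.maximum(0, z1)                 # ReLU activation
    y_hat = a1 @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)

    # 2. Gradient of the loss with respect to the output.
    d_yhat = 2 * (y_hat - y) / len(X)

    # 3. Propagate backward, layer by layer, using the chain rule.
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0, keepdims=True)
    d_a1 = d_yhat @ W2.T
    d_z1 = d_a1 * (z1 > 0)                 # ReLU derivative: 1 for positive inputs, else 0
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0, keepdims=True)

    # 4. Update weights: step opposite to the gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.4f}")
```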

Why is ReLU preferred over sigmoid in hidden layers?

The Rectified Linear Unit (ReLU), defined as f(x) = max(0, x), has become the default activation function for hidden layers, largely replacing the sigmoid function. There are two primary reasons:

  1. It Mitigates the Vanishing Gradient Problem: The derivative of the sigmoid function is a bell-shaped curve that maxes out at 0.25. When you have many layers, you multiply these small gradients together during back-propagation. The result is that the gradients in the early layers of the network become infinitesimally small (“vanish”), meaning those layers learn very slowly or not at all. The derivative of ReLU, in contrast, is either 0 (for negative inputs) or 1 (for positive inputs). This constant 1 means the gradient can flow backward through the network without shrinking, allowing deeper networks to train effectively.
  2. Computational Efficiency: The ReLU function is computationally trivial—it’s just a max(0, x) operation. Sigmoid, on the other hand, involves an exponential, which is more computationally expensive. This makes training with ReLU significantly faster.
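
To make the vanishing-gradient point concrete, here is a purely illustrative numeric check: even in sigmoid's best case (derivative at its peak of 0.25), chaining that factor across 20 layers collapses the gradient toward zero, while a chain of ReLU derivatives equal to 1 does not shrink at all.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

layers = 20
# Best case for sigmoid: every pre-activation sits at x = 0, where the derivative peaks at 0.25.
sigmoid_chain = sigmoid_derivative(0.0) ** layers
# ReLU with positive inputs: the derivative is exactly 1 at every layer.
relu_chain = 1.0 ** layers

print(f"sigmoid gradient factor after {layers} layers: {sigmoid_chain:.2e}")  # ~9.09e-13
print(f"ReLU gradient factor after {layers} layers:    {relu_chain:.1f}")     # 1.0
```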

2. Convolutional Neural Networks (CNNs) for Vision

CNNs are the cornerstone of modern computer vision. They are designed to automatically and adaptively learn spatial hierarchies of features from images.

What is the role of filters in a CNN?

A filter (or kernel) is a small matrix of weights that slides over the input image. Its role is to detect specific features. In the early layers, filters might learn to detect simple features like edges, corners, or colors. In deeper layers, these filters combine the features from earlier layers to detect more complex patterns, like eyes, noses, or even entire objects like faces or cars.

The “convolution” operation takes the element-wise product of the filter and the image patch it currently covers, then sums the results into a single value in the output feature map. The network learns the values of these filter weights during training.
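
As a small illustration, here is a minimal NumPy convolution of a hand-written vertical-edge filter over a toy grayscale image (no padding, stride 1). In a real CNN the filter values would be learned rather than fixed:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid (no padding) convolution with stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise product of filter and image patch, summed into one value.
            out[i, j] = np.sum(patch * kernel)
    return out

# A toy 6x6 "image": dark on the left half, bright on the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A simple Sobel-like vertical-edge filter; a CNN would learn such weights.
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)  # Large-magnitude responses where the dark-to-bright edge sits.
```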

How does max pooling reduce computational complexity?

Max pooling is a down-sampling operation. A pooling window (e.g., a 2x2 window) slides over the feature map, and in each window, only the maximum value is passed on to the next layer. This has two key benefits:

  1. Reduces Computational Complexity: By reducing the spatial dimensions (height and width) of the feature maps, it dramatically decreases the number of parameters and computations in subsequent layers. This makes the network faster and more memory-efficient.
  2. Provides Basic Translational Invariance: By taking the maximum value in a neighborhood, the network becomes less sensitive to the exact location of a feature. If an edge shifts slightly, the max pooling output is likely to remain the same. This helps the model recognize an object regardless of where it appears in the image.
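
A short sketch of 2x2 max pooling with stride 2 over a small feature map (the values and dimensions here are arbitrary, chosen only to show the down-sampling):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep only the largest value in each window."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h - 1, 2):
        for j in range(0, w - 1, 2):
            pooled[i // 2, j // 2] = np.max(feature_map[i:i + 2, j:j + 2])
    return pooled

feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 1],
                        [0, 2, 9, 7],
                        [1, 1, 3, 5]], dtype=float)

print(max_pool_2x2(feature_map))
# [[6. 2.]
#  [2. 9.]]
# Half the height and width -> only a quarter of the values flow to the next layer.
```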

3. Recurrent Neural Networks (RNNs) & LSTMs

While Transformers are now dominant, understanding RNNs and LSTMs is crucial for grasping the history and challenges of sequence modeling.

How does an LSTM solve the vanishing gradient problem?

Standard RNNs suffer from the vanishing gradient problem discussed earlier: during back-propagation through time, the gradient is multiplied by the same recurrent weights at every time step, so over long sequences it shrinks toward zero (or explodes). Long Short-Term Memory (LSTM) networks were specifically designed to combat this.

An LSTM introduces a cell state and a series of gates (the forget gate, input gate, and output gate).

  • Cell State: Think of this as the model’s long-term memory. It’s a “conveyor belt” that runs straight down the entire chain, with only minor linear interactions. Information can flow along it unchanged.
  • Gates: These are neural networks with sigmoid activations that control the flow of information. They can learn what information to add to or remove from the cell state.
    • The Forget Gate decides what information from the old cell state should be thrown away.
    • The Input Gate decides which new information should be stored in the cell state.
    • The Output Gate decides what part of the cell state should be passed on as the hidden state (the short-term memory).

Because the cell state is updated through gated addition rather than repeated multiplication, LSTMs can preserve the error signal over long sequences, allowing gradients to flow backward without vanishing.
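
For intuition, here is a single LSTM time step written out in NumPy, showing the three gates and the additive cell-state update (the weight shapes and random initialization are arbitrary for this sketch; real implementations fuse these operations and learn the weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a single example."""
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([h_prev, x_t])        # previous hidden state + current input

    f = sigmoid(Wf @ z + bf)                 # forget gate: what to erase from the cell state
    i = sigmoid(Wi @ z + bi)                 # input gate: what new information to write
    o = sigmoid(Wo @ z + bo)                 # output gate: what to expose as hidden state
    c_tilde = np.tanh(Wc @ z + bc)           # candidate values for the cell state

    c_t = f * c_prev + i * c_tilde           # additive update: the "conveyor belt"
    h_t = o * np.tanh(c_t)                   # short-term memory passed to the next step
    return h_t, c_t

# Toy dimensions: 4 input features, 3 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = [rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for _ in range(4)] + \
         [np.zeros(n_hid) for _ in range(4)]

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):       # a sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```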

4. The Transformer Era: BERT vs. GPT

BERT and GPT are both based on the Transformer architecture, but their designs and training objectives make them suited for different tasks.

How is BERT different from GPT?

The fundamental difference lies in their architecture and pre-training strategy:

  • BERT (Bidirectional Encoder Representations from Transformers):

    • Architecture: BERT uses only the Encoder part of the Transformer.
    • Training Objective: It’s pre-trained using a Masked Language Model (MLM) objective. It takes a sentence, masks out about 15% of the words, and its goal is to predict only those masked words.
    • Bidirectionality: Because it can see the entire sentence at once (both left and right context), it learns a deep, bidirectional representation of language.
    • Best For: Tasks that require a deep understanding of context, such as text classification, sentiment analysis, and question answering. It’s an excellent feature extractor.
  • GPT (Generative Pre-trained Transformer):

    • Architecture: GPT uses only the Decoder part of the Transformer.
    • Training Objective: It’s pre-trained using a standard Causal Language Model (CLM) objective. Its goal is to predict the next word in a sentence, given all the previous words.
    • Unidirectionality (Autoregressive): It can only see the context to its left. This makes it inherently generative.
    • Best For: Tasks that involve generating coherent text, such as summarization, translation, dialogue (chatbots), and creative writing.

In short: BERT is for analysis, GPT is for generation.
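
The difference is easy to see in code. Below is a sketch using the Hugging Face transformers library (this assumes transformers and a backend such as PyTorch are installed; bert-base-uncased and gpt2 are simply small, convenient checkpoints):

```python
from transformers import pipeline

# BERT: fill in a masked token using context from both sides (analysis / understanding).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# GPT-2: continue the text left to right, one token at a time (generation).
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=20)[0]["generated_text"])
```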

5. The Art of Training: Optimization and Regularization

Training a deep learning model is an art that involves carefully choosing loss functions, optimizers, and regularization techniques.

Why is Adam the go-to optimizer?

Adam (Adaptive Moment Estimation) is the default choice for most deep learning tasks. It’s popular because it combines the strengths of two earlier optimizers: AdaGrad and RMSProp.

Adam computes adaptive learning rates for each parameter. It does this by keeping track of two moving averages:

  1. The first moment (the mean): This is like momentum. It helps accelerate the optimizer in the correct direction and dampens oscillations.
  2. The second moment (the uncentered variance): This is the “adaptive” part. Each parameter’s step is divided by the square root of this running average of squared gradients, so parameters with consistently large gradients take smaller steps and parameters with small or sparse gradients take relatively larger ones.

Adam is popular because it’s efficient, requires little memory, and generally works well with default settings, requiring less manual tuning of the learning rate.
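
A stripped-down version of the Adam update for a single parameter vector makes the two moving averages explicit (default hyperparameters shown; real implementations also handle parameter groups, weight decay, and so on):

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step. m and v are the running first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad             # first moment: momentum-like mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment: mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction (matters early in training)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return param, m, v

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([5.0, -3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 1001):
    grad = 2 * w
    w, m, v = adam_update(w, grad, m, v, t, lr=0.05)
print(w)  # close to [0, 0]
```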

What is the purpose of dropout, and how does it help prevent overfitting?

Dropout is a powerful regularization technique to prevent overfitting. During training, on each forward pass, dropout randomly sets the activations of a fraction of neurons (e.g., 20% or 50%) to zero.

This has two main effects:

  1. Forces the Network to Learn Redundant Representations: Since any neuron can be dropped out, the network cannot rely on any single neuron to be present. It is forced to learn more robust features that are distributed across the network.
  2. Acts as an Ensemble: Training a network with dropout is like training a large ensemble of smaller networks. Each training step involves a different “thinned” network. At test time, all neurons are used (with activations scaled so their expected magnitude matches training), which is analogous to averaging the predictions of this massive ensemble.

By preventing neurons from co-adapting too much, dropout significantly reduces overfitting and improves the model’s ability to generalize to unseen data.
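
Here is a minimal sketch of “inverted” dropout, the variant most modern frameworks use: activations are zeroed during training and the survivors are scaled up by 1/(1-p), so nothing needs rescaling at test time (the rate and sizes below are arbitrary for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero a fraction p of activations and rescale the rest."""
    if not training or p == 0.0:
        return activations                      # at test time all neurons are used
    mask = rng.random(activations.shape) >= p   # keep each neuron with probability 1 - p
    return activations * mask / (1.0 - p)       # rescale so the expected value is unchanged

a = np.ones((2, 8))                       # activations from some hidden layer
print(dropout(a, p=0.5))                  # roughly half the values zeroed, survivors doubled
print(dropout(a, p=0.5, training=False))  # unchanged at inference time
```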
