ML Foundations: A Guide to Natural Language Processing (NLP)
Introduction
Welcome back to our machine learning foundations series. So far, we’ve dealt mostly with numerical data. But what about the most human form of data—language? Natural Language Processing (NLP) is the fascinating field of AI that enables computers to understand, interpret, process, and generate human language.
The fundamental challenge of NLP is that computers don’t understand words and sentences; they understand numbers. Therefore, the entire field is built around clever ways to convert unstructured text into a structured, numerical format that machine learning models can digest.
This guide will walk you through the classic NLP pipeline, from cleaning raw text to building a model that can understand its sentiment. We will cover:
- Text Preprocessing: The essential cleanup steps like tokenization, stopword removal, stemming, and lemmatization.
- Word Embeddings: The revolutionary idea of representing words as meaningful vectors using Word2Vec and GloVe.
- Sequence Modeling: How Recurrent Neural Networks (RNNs) and LSTMs learn from the order of words.
- Sentiment Analysis: A practical application that puts all these concepts together.
1. The Cleanup: Basic Text Preprocessing
Before any analysis can be done, raw text must be cleaned and standardized. Let’s take a sample sentence and see how it’s transformed: "The quick brown foxes are jumping over the lazy dogs!"
a. Tokenization & Lowercasing
First, we break the sentence down into individual units, or tokens—usually words. At the same time, we convert everything to lowercase to ensure the model treats “The” and “the” as the same word.
["the", "quick", "brown", "foxes", "are", "jumping", "over", "the", "lazy", "dogs", "!"]
b. Stopword Removal
Stopwords are extremely common words that add little semantic value to a sentence (e.g., “a”, “an”, “the”, “is”, “in”). Removing them helps reduce noise and allows the model to focus on the more meaningful words. In practice, you can use a library’s built-in stopword list (such as NLTK’s) or define a small custom list tailored to your task, as sketched below.
After removing stopwords (and punctuation), our list becomes:
["quick", "brown", "foxes", "jumping", "over", "lazy", "dogs"]
c. Stemming and Lemmatization
Next, we want to reduce words to their root form. There are two main ways to do this:
- Stemming: A crude, rule-based process that chops off word endings. It’s fast but can sometimes create non-existent words.
jumping → jump; studies → studi (incorrect root)
- Lemmatization: A more sophisticated process that uses a dictionary and morphological analysis to reduce words to their proper base form, called a lemma.
jumping → jump; are → be; better → good
Lemmatization is usually preferred over stemming because it results in actual words, though it is computationally slower. After lemmatization, our final processed list of tokens is:
["quick", "brown", "fox", "jump", "over", "lazy", "dog"]
This clean list of tokens is now ready for numerical representation.
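As a rough sketch of the difference between the two approaches, here is how NLTK’s Porter stemmer and WordNet lemmatizer handle the examples above (this assumes NLTK is installed and its WordNet data has been downloaded):

```python
# Assumes NLTK is installed and the WordNet data is available, e.g.:
#   import nltk; nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("jumping"))   # 'jump'
print(stemmer.stem("studies"))   # 'studi'  <- not a real word

# The WordNet lemmatizer needs a part-of-speech hint to do its best work.
print(lemmatizer.lemmatize("jumping", pos="v"))  # 'jump'
print(lemmatizer.lemmatize("are", pos="v"))      # 'be'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```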
2. Representing Words as Numbers: Word Embeddings
How do we convert these tokens into numbers? A simple approach is one-hot encoding (from our preprocessing post), but this creates huge, sparse vectors and treats every word as an independent, unrelated unit. The words “king” and “queen” would be just as different as “king” and “apple.”
A far better approach is Word Embeddings. This technique represents words as dense, low-dimensional vectors (e.g., a vector of 300 numbers) in such a way that the geometry of the vector space captures semantic relationships.
Word2Vec
Developed at Google, Word2Vec is based on a simple but powerful idea, famously summed up by the linguist J.R. Firth: “You shall know a word by the company it keeps.” It learns embeddings by training a shallow neural network either to predict a word from its neighbors (the CBOW architecture) or to predict the neighbors from the word (the Skip-gram architecture). The learned weights of the network’s hidden layer become the word vectors.
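A minimal sketch of training Word2Vec on a toy corpus with the gensim library (assuming gensim 4.x; real embeddings need a corpus of millions of sentences, not three):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["quick", "brown", "fox", "jump", "over", "lazy", "dog"],
    ["dog", "chase", "fox", "in", "forest"],
    ["cat", "sleep", "on", "lazy", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)

print(model.wv["fox"].shape)         # (100,)
print(model.wv.most_similar("fox"))  # nearest neighbours in embedding space
```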
GloVe (Global Vectors)
Developed at Stanford, GloVe takes a different approach. It first builds a giant co-occurrence matrix that records how frequently each pair of words appears together in the corpus. It then learns low-dimensional vectors whose dot products approximate the logarithm of these co-occurrence counts, which amounts to factorizing the (log) co-occurrence matrix into the final word embeddings.
The incredible result of both methods is a vector space where:
vector('King') - vector('Man') + vector('Woman') ≈ vector('Queen')
This shows that the model has learned complex analogies and relationships entirely from the text data.
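You can check this analogy yourself with pre-trained vectors. The sketch below assumes gensim and its downloadable “glove-wiki-gigaword-100” vectors (a sizeable one-time download):

```python
import gensim.downloader as api

# One-time download of pre-trained 100-dimensional GloVe vectors.
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', <similarity score>)]
```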
3. Modeling Sequences: RNNs and LSTMs
Word embeddings give us meaningful vectors for words, but what about word order? “The dog bit the man” means something very different from “The man bit the dog.” A standard feed-forward network can’t capture this sequential information.
Recurrent Neural Networks (RNNs)
RNNs are designed specifically for sequential data. They have a “loop” that allows them to maintain a hidden state, or a form of memory. When processing a sequence, the RNN takes the embedding for the current word and the hidden state from the previous word as input. It then produces an output and updates its hidden state to be passed to the next step.
This allows the network’s understanding of a word to be influenced by all the words that came before it.
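To make the recurrence concrete, here is a bare-bones sketch of a single RNN layer in NumPy, with randomly initialized weights and no training loop (the dimensions are arbitrary choices for illustration):

```python
import numpy as np

embedding_dim, hidden_dim = 50, 64
rng = np.random.default_rng(0)

# Randomly initialized parameters (learned during training in practice).
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embedding_dim))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))     # hidden -> hidden
b_h = np.zeros(hidden_dim)

def rnn_forward(word_vectors):
    """Process a sequence of word embeddings, one time step at a time."""
    h = np.zeros(hidden_dim)                    # initial hidden state
    for x in word_vectors:                      # one embedding per word
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)  # new state depends on input AND old state
    return h                                    # summary of the whole sequence

sentence = rng.normal(size=(7, embedding_dim))  # e.g. 7 words, each a 50-d embedding
print(rnn_forward(sentence).shape)              # (64,)
```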
The Problem: Simple RNNs suffer from the vanishing gradient problem. When training on long sequences, the gradient signal connecting early inputs to later outputs shrinks toward zero, so the network struggles to learn long-range dependencies. It might effectively forget the beginning of a long paragraph by the time it reaches the end.
Long Short-Term Memory (LSTM) Networks
LSTMs are a special, more complex type of RNN designed to solve the vanishing gradient problem. At a high level, an LSTM has:
- A Cell State: A “conveyor belt” that carries information straight down the entire sequence, making it easy for long-term memories to persist.
- Gates: Three “gates” (the forget gate, input gate, and output gate) that act as regulators. They are small neural networks that learn which information to add to, remove from, or read from the cell state at each time step.
This structure allows LSTMs to effectively remember and make use of information over very long sequences, making them the classic choice for most serious NLP tasks before the rise of Transformers.
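At the risk of oversimplifying, here is a single LSTM time step written out in NumPy so the three gates and the cell state are explicit. This is a sketch with random weights and zero biases, not a full implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

embedding_dim, hidden_dim = 50, 64
rng = np.random.default_rng(0)

# One weight matrix per gate (plus the candidate update), acting on [h_prev, x].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + embedding_dim))
                      for _ in range(4))
b_f = b_i = b_o = b_c = np.zeros(hidden_dim)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])      # previous hidden state + current input
    f = sigmoid(W_f @ z + b_f)           # forget gate: what to erase from the cell state
    i = sigmoid(W_i @ z + b_i)           # input gate: what new information to write
    o = sigmoid(W_o @ z + b_o)           # output gate: what to expose as the hidden state
    c_tilde = np.tanh(W_c @ z + b_c)     # candidate values to add
    c = f * c_prev + i * c_tilde         # the "conveyor belt" cell state update
    h = o * np.tanh(c)                   # new hidden state
    return h, c

x = rng.normal(size=embedding_dim)
h, c = lstm_step(x, np.zeros(hidden_dim), np.zeros(hidden_dim))
print(h.shape, c.shape)  # (64,) (64,)
```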
4. A Practical Application: Sentiment Analysis
Let’s tie all these concepts together to build a sentiment analysis model. The goal is to classify a movie review as “positive” or “negative.”
A typical deep learning model for this task would look like this:
- Input Layer: Takes the preprocessed and tokenized text, where each word is represented by an integer index from a vocabulary.
- Embedding Layer: This layer looks up the corresponding word embedding (e.g., from a pre-trained GloVe model) for each integer index. The output is a sequence of dense vectors.
- LSTM Layer: This layer processes the sequence of word embeddings, capturing the context and order of the words. It outputs a final hidden state vector that summarizes the meaning of the entire review.
- Dense Output Layer: A standard fully-connected neural network layer that takes the LSTM’s summary vector as input. It uses a Sigmoid activation function to output a single probability between 0 (negative) and 1 (positive).
This architecture effectively translates raw text into a sophisticated, context-aware prediction.
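A minimal Keras sketch of this architecture follows. The vocabulary size, sequence length, and layer sizes are illustrative assumptions; in practice you would load pre-trained GloVe weights into the Embedding layer and train on a labelled dataset such as IMDB reviews:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20_000  # number of words kept in the vocabulary (assumption)
max_len = 200        # reviews padded/truncated to 200 tokens (assumption)

model = keras.Sequential([
    keras.Input(shape=(max_len,)),                            # sequence of word indices
    layers.Embedding(input_dim=vocab_size, output_dim=100),   # index -> 100-d dense vector
    layers.LSTM(64),                                          # summarizes the review in its final state
    layers.Dense(1, activation="sigmoid"),                    # probability the review is positive
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then look like: model.fit(x_train, y_train, validation_split=0.2, epochs=3)
```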
Conclusion
The classic NLP pipeline is a journey from unstructured chaos to structured insight.
- We start by cleaning and standardizing text through preprocessing.
- We then convert words into meaningful numerical representations using Word Embeddings.
- We model the order and context of these words using sequence models like RNNs and LSTMs.
- Finally, we use the output of our sequence model to perform a downstream task like Sentiment Analysis.
While the state-of-the-art in NLP has now largely moved to the Transformer architecture (which we discussed in the previous post), this pipeline represents the foundational concepts upon which modern NLP is built. Understanding LSTMs and the classic pipeline is still essential for any serious practitioner in the field.
Suggested Reading
- “Speech and Language Processing” by Dan Jurafsky and James H. Martin: The definitive textbook on NLP, covering all these topics in great depth.
- “Deep Learning with Python” by François Chollet: Provides excellent, practical examples of building NLP models with Keras.
- Chris Olah’s Blog: His post “Understanding LSTMs” is a famous and brilliant visual explanation of how LSTMs work.