A Beginner’s Guide to Text Classification using PyTorch and Hugging Face

Text classification is a common task in natural language processing (NLP) that involves assigning predefined categories or labels to a given text. It is a supervised learning problem, where the goal is to train a model to predict the correct label for new, unseen examples. Text classification has many practical applications, such as sentiment analysis, spam detection, and topic classification.

In this guide, we will explore how to perform text classification using PyTorch, a popular deep learning library, and the Hugging Face Transformers library, which provides pre-trained models and easy-to-use interfaces for NLP tasks. We will also cover some basic concepts and techniques used in text classification, such as tokenization and embeddings.

Prerequisites

Before diving into the guide, you should have some familiarity with:

  • Python programming
  • PyTorch basics
  • Basic understanding of Machine Learning

Text Classification Basics

Text classification can be performed in various ways, but the most common approach is to represent the text as a fixed-length vector and train a model to predict the label based on this representation. The main steps involved in this process are:

  1. Data preparation: Collect and preprocess the text data, such as removing stop words, punctuation, and special characters, and splitting the text into tokens.
  2. Feature extraction: Convert the text into a numerical representation that can be used as input to a machine learning model. This is usually done by creating a matrix of token counts (bag-of-words) or by using pre-trained word embeddings.
  3. Model training: Train a machine learning model, such as a logistic regression or a neural network, on the prepared data to learn the relationship between the features and the labels.
  4. Evaluation: Evaluate the performance of the trained model on a separate test dataset, and use metrics such as accuracy, precision, and recall to assess the model’s performance.
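
To make these four steps concrete before we move to PyTorch, here is a minimal end-to-end sketch using scikit-learn's CountVectorizer and LogisticRegression on a tiny made-up dataset. scikit-learn is not used elsewhere in this guide, and the texts and labels below are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Data preparation: a tiny, made-up labeled dataset
train_texts = ["a wonderful film", "what a waste of time", "great acting", "boring and slow"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative
test_texts = ["wonderful acting", "slow and boring"]
test_labels = [1, 0]

# 2. Feature extraction: bag-of-words token counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# 3. Model training: logistic regression on the count features
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

# 4. Evaluation: accuracy on the held-out examples
print(accuracy_score(test_labels, classifier.predict(X_test)))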

Tokenization and Embeddings

Before we can start training a model, we need to convert the text into a numerical representation that the model can understand. The first step in this process is tokenization, which involves splitting the text into smaller units, called tokens. Tokens are usually words, but they can also be characters, subwords, or any other unit of text that makes sense for the specific task.
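
As a quick illustration, here is the same sentence tokenized at the word level with a naive whitespace split, and at the subword level with a pre-trained Hugging Face tokenizer (the bert-base-uncased tokenizer is used here purely as an example and is downloaded on first use):

from transformers import AutoTokenizer

sentence = "Tokenization splits text into smaller units."

# word-level tokens via a naive whitespace split
print(sentence.lower().split())

# subword tokens from a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize(sentence))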

The next step is to create a numerical representation of the tokens, called embeddings. Embeddings are dense vectors that capture the meaning of the tokens in a continuous, low-dimensional space. There are various ways to create embeddings, but the most common approach is to use pre-trained word embeddings, such as word2vec or GloVe. These embeddings are trained on large corpora of text and have been found to be very effective in many NLP tasks.
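
For example, with the GloVe vectors that torchtext can download, every word in the vocabulary maps to a 100-dimensional vector, and the cosine similarity between two vectors gives a rough measure of how related the words are. A small sketch (the glove.6B vectors are downloaded on first use, which takes a while):

import torch
import torchtext.vocab as vocab

# load 100-dimensional GloVe vectors trained on a 6-billion-token corpus
glove = vocab.GloVe(name='6B', dim=100)

good = glove['good']    # a 100-dimensional tensor
great = glove['great']

# cosine similarity measures how close two word vectors are
print(torch.cosine_similarity(good.unsqueeze(0), great.unsqueeze(0)))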

PyTorch and Hugging Face

PyTorch is a popular deep learning library that provides a high-level interface for building and training neural networks. It is widely used in research and industry, and has a growing community of users and contributors.

Hugging Face develops the Transformers library, which provides pre-trained models and easy-to-use interfaces for NLP tasks. It works with PyTorch (as well as other backends) and offers a wide range of models for tasks such as text classification, language translation, and question answering.
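
As a small taste of how little code this can take, the pipeline helper in the transformers library wraps a pre-trained model together with its tokenizer behind a single call (the default sentiment model is downloaded on first use):

from transformers import pipeline

# load a default pre-trained sentiment-analysis model and its tokenizer
classifier = pipeline("sentiment-analysis")
print(classifier("This guide makes text classification easy to follow."))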

Text Classification using PyTorch and Hugging Face

In this section, we will see how to perform text classification using PyTorch and Hugging Face. We will start by loading the dataset, then we will preprocess the data, create the embeddings, and train a model.

Loading the Dataset

First, we need to load the dataset. For this guide, we will use the IMDB dataset, which consists of movie reviews labeled as positive or negative. The dataset is available in the torchtext library, a PyTorch library for working with text data. Note that the snippets below use the classic Field/Iterator API, which moved to torchtext.legacy in torchtext 0.9 and was removed in torchtext 0.12, so adjust the imports to match your installed version.

import torch
import torchtext

# define the fields (legacy torchtext API; see the note above)
text_field = torchtext.data.Field(lower=True)
label_field = torchtext.data.LabelField(dtype=torch.float)

# load the dataset
train_data, test_data = torchtext.datasets.IMDB.splits(text_field, label_field)

Preprocessing the Data

Next, we need to preprocess the data. This includes tokenizing the text, creating the vocabulary, and converting the text into numerical representations.

# create the vocabulary
text_field.build_vocab(train_data, max_size=10000)
label_field.build_vocab(train_data)

# convert the data into numerical representations
train_iterator, test_iterator = torchtext.data.BucketIterator.splits(
    (train_data, test_data),
    batch_size=32,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
)
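
To see what the iterator actually produces, each batch carries a tensor of token indices with shape [sequence length, batch size] and a matching tensor of labels. A quick sanity check:

# peek at one batch from the iterator
batch = next(iter(train_iterator))
print(batch.text.shape)   # [sequence length, batch size] of token indices
print(batch.label.shape)  # [batch size] of 0/1 labels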

Creating the Embeddings

Once the data has been preprocessed, we need to create the embeddings. For this guide, we will use pre-trained GloVe embeddings, which can be loaded with the torchtext library and attached to the vocabulary so that they can later be copied into the model's embedding layer.

import torch
import torchtext.vocab as vocab

# load the pre-trained embeddings
glove = vocab.GloVe(name='6B', dim=100)
text_field.vocab.set_vectors(glove.stoi, glove.vectors, dim=100)

Training the Model

Finally, we can train the model. We will use a simple LSTM model for this guide, but other models such as CNNs can also be used.

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # text has shape [sequence length, batch size]
        embedded = self.embedding(text)
        output, (hidden, cell) = self.lstm(embedded)
        last_hidden = hidden[-1, :, :]  # final hidden state of the last layer
        return self.fc(last_hidden)

# Initialize the model
model = LSTMClassifier(len(text_field.vocab), 100, 256, 1)

# Copy the pre-trained GloVe vectors into the embedding layer
model.embedding.weight.data.copy_(text_field.vocab.vectors)

# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters())

# Training the model
model.train()
for epoch in range(10):
    for batch in train_iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        loss.backward()
        optimizer.step()

Evaluation

Once the model has been trained, we need to evaluate its performance on the held-out test dataset. We will use accuracy as the evaluation metric.

# Evaluation of the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for batch in test_iterator:
        predictions = model(batch.text).squeeze(1)
        predictions = torch.round(torch.sigmoid(predictions))
        correct += (predictions == batch.label).sum().item()
        total += len(predictions)
    print(f'Test Accuracy: {correct / total * 100:.2f}%')
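
As a quick usage check, a single new review can be classified by pushing it through the same field, so that it is tokenized and numericalized with the training vocabulary (the review text below is made up):

def predict_sentiment(model, sentence):
    # tokenize and numericalize the sentence with the training vocabulary
    tokens = text_field.preprocess(sentence)
    tensor = text_field.process([tokens])  # shape: [sequence length, 1]
    prediction = torch.sigmoid(model(tensor))
    return 'positive' if prediction.item() >= 0.5 else 'negative'

print(predict_sentiment(model, "An absolutely wonderful film with a great cast."))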

Fine-Tuning Pre-trained Models

In addition to training models from scratch, it's also possible to fine-tune pre-trained models on your own dataset. This can save a significant amount of time and resources, as pre-trained models have already learned general-purpose features from large amounts of data. Hugging Face provides a wide range of pre-trained models that can be fine-tuned for text classification with minimal changes to the code. One thing to keep in mind is that a model like BERT ships with its own subword tokenizer, so the raw reviews have to be encoded with that tokenizer rather than with the vocabulary we built for the LSTM.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load a pre-trained model and its matching tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)

# Fine-tune the model
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
label_map = {'neg': 0, 'pos': 1}  # map the IMDB string labels to class ids

for epoch in range(3):  # a few epochs are usually enough for fine-tuning
    for i in range(0, len(train_data), 16):
        examples = train_data.examples[i:i + 16]
        # BERT uses its own subword tokenizer, so re-tokenize the raw reviews
        # instead of reusing the torchtext vocabulary indices
        texts = [' '.join(example.text) for example in examples]
        labels = torch.tensor([label_map[example.label] for example in examples]).to(device)
        encodings = tokenizer(texts, padding=True, truncation=True,
                              max_length=256, return_tensors='pt').to(device)

        optimizer.zero_grad()
        outputs = model(**encodings, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
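
After fine-tuning, the model can classify a new review directly from raw text through its tokenizer (the review below is made up, and the class order follows the label_map defined above):

model.eval()
with torch.no_grad():
    encoding = tokenizer("A surprisingly moving and well-acted film.",
                         return_tensors='pt').to(device)
    logits = model(**encoding).logits
    predicted_class = logits.argmax(dim=-1).item()
    print('positive' if predicted_class == 1 else 'negative')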

Conclusion

Text classification is a common task in natural language processing that involves assigning predefined categories or labels to a given text. In this guide, we have seen how to perform text classification using PyTorch and Hugging Face. We have covered some basic concepts and techniques used in text classification, such as tokenization, embeddings, and model training. We have also seen how to fine-tune pre-trained models to save time and resources.