NLP - Classification

date
Aug 26, 2024
type
Post
AI summary
The document discusses classification in NLP, addressing the difference between structured and unstructured data, introducing the Bag of Words model for feature representation, and describing various examples of classification. It examines the data labeling process, Bayesian classification, binary and multinomial logistic regression, and the importance of feature engineering and probability normalization. The conclusion emphasizes how logistic regression learns parameters that weight the importance of features.
slug
nlp-classification
status
Published
tags
NLP
summary
Addresses the challenges of structured vs. unstructured data, and introduces the Bag of Words model for feature representation. Explains why each document can be represented as a point in a d-dimensional feature space.

Examples of Classification

  • Document topics: politics? sports? reports?
  • Sentiment classification: movie review positive or negative?
  • Spam detection.
  • Text complexity: is the text formal or informal? How clear is it?
  • Language and dialect detection.
  • Social media toxicity detection.
  • Multiple-choice question answering.
  • Language generation — what words come next?

Features of Classification

 
The classification task can be viewed as a function $\text{classify}: \mathcal{V}^* \rightarrow \mathcal{L}$, where:
  • Input ($x$): a word, a sentence, a paragraph, or a document.
  • Input space ($\mathcal{V}^*$): the set of all sequences of words.
  • Output ($y$): a label from a finite set of labels $\mathcal{L}$.
Consider the following:
  • $X$ is a random variable over inputs, such that each value of $X$ is from $\mathcal{V}^*$.
  • $Y$ is a random variable over outputs, taking values from $\mathcal{L}$.
  • $P(X, Y)$ is the true distribution of labeled texts.
    • The joint probability over all possible text documents and all possible labels.
  • $P(Y)$ is the distribution of labels.
    • Irrespective of documents, how frequently we would see each label.

Data Labeling

Problem:
We don’t typically know $P(X, Y)$ or $P(Y)$ except through data.
Solution:
Human experts label some data.
Feed the labeled data to a supervised machine learning algorithm that approximates the function $\text{classify}: \mathcal{V}^* \rightarrow \mathcal{L}$.
Test the learned classifier on a portion of the data withheld for testing.

Structured vs. Unstructured Data

Problem: text is generally considered unstructured data.
Supervised learning likes structured inputs, which we call features ($x$). Features are what the algorithm is allowed to see and know about.
Problem: in a perfect world, we would simply throw away all the unnecessary parts of the input and keep the useful stuff. But we don’t always know what is useful.
We often have some insight, though, which is where feature engineering comes in — for example removing stop words such as “the” and “are”.

Bag Of Words

Consider a sentence as the input.

Features as word presence

  • Discard word order
We can then build unigram features (single words) and bigram features (pairs of adjacent words); sketches of both follow below.
Now $x$ can be represented as a $d$-dimensional vector of features.
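As a concrete illustration, here is a minimal sketch of presence-based unigram and bigram features. The example sentence and feature list are assumptions, not from the original notes:

```python
# Minimal sketch of presence-based bag-of-words features.
# The example sentence and feature list are illustrative assumptions.

sentence = "the movie was not good"            # hypothetical input x
tokens = sentence.split()

unigrams = set(tokens)                                   # single-word features
bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])}   # adjacent word pairs

# A fixed feature list (vocabulary) defines the d dimensions.
feature_list = ["good", "bad", "not", "movie", ("not", "good"), ("not", "bad")]

# x as a d-dimensional 0/1 presence vector
x = [1 if (f in unigrams or f in bigrams) else 0 for f in feature_list]
print(x)   # [1, 0, 1, 1, 1, 0]
```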

Features as word frequency
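Instead of 0/1 presence, each feature can hold the word’s count. A small sketch, again with an assumed example sentence:

```python
from collections import Counter

sentence = "the movie was not good , not good at all"   # hypothetical input
counts = Counter(sentence.split())

feature_list = ["good", "bad", "not", "movie"]
x = [counts[f] for f in feature_list]   # frequency instead of presence
print(x)   # [2, 0, 2, 1]
```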

Conclusion

A feature vector may look like this: $x \in \mathbb{R}^d$, where $d$ is the number of features. $\mathbb{R}$ is typically used to represent the set of all real numbers in mathematics.
Now every document or input is a point in the d-dimensional space.

Bayesian Classification

We can translate the classification function
$$\hat{y} = \text{classify}(x)$$
into a probability question:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{L}} P(y \mid x) = \operatorname*{argmax}_{y \in \mathcal{L}} \frac{P(x \mid y)\,P(y)}{P(x)}$$
Apply the naive Bayes assumption: word features are independent from each other, so
$$P(x \mid y) \approx \prod_{j=1}^{d} P(x_j \mid y)$$
Where:
  • These probabilities can be estimated (counted) from the labeled training dataset.
  • $P(x)$ typically does not need to be computed for NLP classification: we are only comparing labels (e.g., positive vs. negative), and the denominator is the same for every label.
Turning the probability into a classification prediction for the binary case, compare the two posteriors and apply the sign (+ or −):
$$\hat{y} = \operatorname{sign}\big(P(+ \mid x) - P(- \mid x)\big)$$
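A minimal sketch of this decision rule follows. The prior and likelihood tables are made-up placeholders, not estimates from real data:

```python
# Hypothetical estimated probabilities (placeholders, not real data)
prior = {"+": 0.5, "-": 0.5}
likelihood = {
    "+": {"good": 0.10, "bad": 0.01, "movie": 0.05},
    "-": {"good": 0.02, "bad": 0.12, "movie": 0.05},
}

def posterior_score(tokens, label):
    # P(y) * prod_j P(x_j | y); the shared denominator P(x) is ignored
    score = prior[label]
    for t in tokens:
        score *= likelihood[label].get(t, 1e-6)  # tiny floor for unseen words
    return score

tokens = "good movie".split()
pred = "+" if posterior_score(tokens, "+") >= posterior_score(tokens, "-") else "-"
print(pred)   # "+"
```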

Log Probabilities and Smoothing

Two problems:
  1. Multiplying probabilities makes numbers very small very fast and eventually floating point precision becomes a problem.
  2. Sometimes probabilities are zero and multiplying zeros against other probabilities makes everything zero.
Log Scale Probability
In log scale the product becomes a sum:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{L}} \Big( \log P(y) + \sum_{j=1}^{d} \log P(x_j \mid y) \Big)$$
The zero problem can still appear if some $P(x_j \mid y) = 0$, since $\log 0$ is undefined. We can modify the counts so that we never encounter features with zero probability — we pretend there is never a zero count of a feature in a text. This process is called smoothing, and it introduces a small amount of error.
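A common concrete choice (the standard example, not specified in the original notes) is add-one (Laplace) smoothing, which adds one to every count before normalizing:

$$P(x_j \mid y) = \frac{\operatorname{count}(x_j, y) + 1}{\Big(\sum_{x' \in V} \operatorname{count}(x', y)\Big) + |V|}$$

With this adjustment, every word in the vocabulary $V$ has a small non-zero probability for every label.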

Binary Logistic Regression

Recall that we have a collection of features $x_1, x_2, \ldots, x_d$, each of which maps to a word (or n-gram) from the vocabulary $V$. Using Bayesian probabilities, we have $P(y \mid x) \propto P(y) \prod_{j} P(x_j \mid y)$. However, certain words should obviously hold more weight than others rather than contributing by mere frequency: a strongly opinion-bearing word deserves more attention than a common function word. To solve this problem, we introduce binary logistic regression.

Parameters / Weights

Consider a set of coefficients (also called weights or parameters), one per feature: $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ with $\theta \in \mathbb{R}^d$. If you set the coefficients just right, you can design a system that gets classification correct for a majority of inputs. But... you don't generally know the coefficients, so you have to learn them from the training data.
Consider a binary label $y \in \{+1, -1\}$ and a score function:
$$\text{score}(x; \theta) = \sum_{j=1}^{d} x_j \theta_j = x \cdot \theta$$
Where:
  • $x$, as an array or vector, represents the document;
  • $\theta$ is an array or vector containing all the weights.
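A minimal sketch of the score and its sigmoid squashing. The feature values and weights below are made up for illustration:

```python
import numpy as np

def score(x, theta):
    # Dot product of feature vector and weights
    return np.dot(x, theta)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 2.0, 1.0])        # hypothetical feature counts
theta = np.array([1.5, -0.2, -1.1, 0.3])  # hypothetical learned weights

s = score(x, theta)                        # 1.5 - 2.2 + 0.3 = -0.4
print(s, sigmoid(s))                       # negative score -> P(y=+1|x) < 0.5
```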

The Probability Perspective

How do we learn the weights, though? Let’s look at the score function from a probabilistic perspective:
$$P(y = +1 \mid x) = \sigma\big(\text{score}(x; \theta)\big) = \frac{1}{1 + e^{-x \cdot \theta}}$$
Where:
  • the sigmoid (logistic) function $\sigma$ converts the score into a probability between 0 and 1;
  • if the score is high, $P(y = +1 \mid x) \to 1$; if the score is low or negative, $P(y = +1 \mid x) \to 0$.
Consider the negative label scenario:
$$P(y = -1 \mid x) = 1 - \sigma\big(\text{score}(x; \theta)\big) = \sigma\big(-\text{score}(x; \theta)\big)$$
Where:
  • if the score is negative, $P(y = -1 \mid x) \to 1$; if the score is positive, $P(y = -1 \mid x) \to 0$.
Generally, we can express both the positive and negative scenarios in the same equation:
$$P(y \mid x) = \sigma\big(y \cdot \text{score}(x; \theta)\big) = \frac{1}{1 + e^{-y (x \cdot \theta)}}$$

Parameter Learning

To learn the best set of parameters $\hat{\theta}$, we use the following objective:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} P(y_i \mid x_i; \theta)$$
Where:
  • arg max: search for the set of parameters $\theta$ that maximizes the probability of the labels given the inputs;
  • the product runs over all $N$ examples in the training data;
  • $P(y_i \mid x_i; \theta)$ is the probability that the label for data point $i$ matches the input for data point $i$.
Converting the equation into log space:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$
We flip things around into a minimization:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \sum_{i=1}^{N} -\log P(y_i \mid x_i; \theta)$$
Where:
  • $-\log P(y_i \mid x_i; \theta)$ is called the negative log loss, or cross-entropy loss.
Considering that $P(y \mid x) = \sigma\big(y \cdot (x \cdot \theta)\big)$, the sigmoid function in log scale can be written as $\log \sigma(z) = -\log(1 + e^{-z})$, so the loss function for supervised learning is:
$$\mathcal{L}(x, y; \theta) = -\log \sigma\big(y \cdot (x \cdot \theta)\big) = \log\big(1 + e^{-y (x \cdot \theta)}\big)$$
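A sketch of that loss in code (pure NumPy, with made-up values), matching the $\{-1, +1\}$ label convention above:

```python
import numpy as np

def log_loss(x, y, theta):
    # L = log(1 + exp(-y * (x . theta))), with y in {-1, +1}
    return np.log1p(np.exp(-y * np.dot(x, theta)))

x = np.array([1.0, 0.0, 2.0, 1.0])        # hypothetical features
theta = np.array([1.5, -0.2, -1.1, 0.3])  # hypothetical weights
print(log_loss(x, +1, theta))  # larger loss: the score is negative but the label is +1
print(log_loss(x, -1, theta))  # smaller loss: the score agrees with the label
```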

Parameter Learning With Neural Networks

Construct a single layer neural network with a sigmoid function at the top:
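A sketch of such a network in PyTorch, standing in for the original figure (the framework choice and feature dimension are assumptions):

```python
import torch.nn as nn

d = 1000  # hypothetical number of features
model = nn.Sequential(
    nn.Linear(d, 1),   # score(x; theta) = x . theta + bias
    nn.Sigmoid(),      # squashes the score into a probability
)
```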
Loss Function
Recall the loss function $\mathcal{L}(x, y; \theta) = \log\big(1 + e^{-y (x \cdot \theta)}\big)$, which multiplies the score ($x \cdot \theta$) by the target $y = +1$ or $y = -1$ before sending it through the sigmoid: $\sigma\big(y \cdot (x \cdot \theta)\big)$.
However, when the labels are $0$ or $1$ instead of $-1$ and $+1$, we introduce binary cross-entropy:
$$\mathcal{L}_{\text{BCE}} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$
Where:
  • $\hat{y} = \sigma(x \cdot \theta)$ is the predicted probability and $y$ is $0$ or $1$;
  • $y$ and $(1 - y)$ are used to select the positive and negative labels;
  • when $y = 1$, the loss is $-\log \hat{y}$;
  • when $y = 0$, the loss is $-\log(1 - \hat{y})$.
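A NumPy sketch of binary cross-entropy (the values are illustrative):

```python
import numpy as np

def bce(y_true, y_hat, eps=1e-12):
    # y_true in {0, 1}; y_hat is the predicted probability sigma(x . theta)
    y_hat = np.clip(y_hat, eps, 1 - eps)     # avoid log(0)
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

print(bce(1, 0.9))   # ~0.105: confident and correct -> small loss
print(bce(1, 0.1))   # ~2.303: confident and wrong   -> large loss
```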

Training

Batching and training:
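The original pseudocode figures are not reproduced here; a minimal NumPy sketch of mini-batch gradient descent for the 0/1-label model (all names and values are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=100, batch_size=32):
    # X: (N, d) feature matrix, y: (N,) labels in {0, 1}
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            y_hat = sigmoid(X[batch] @ theta)
            # Gradient of binary cross-entropy with respect to theta
            grad = X[batch].T @ (y_hat - y[batch]) / len(batch)
            theta -= alpha * grad            # step down the gradient
    return theta
```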
Where:
  • $\alpha$ is the learning rate (`alpha` in the sketch above): the higher the learning rate, the faster we move down the gradient, but we may overshoot the true value of $\theta$.

Conclusion

To summarize,
  • Binary logistic regression learns a set of parameters $\theta$ that weight the importance of the features $x$.
  • The parameters can be used to approximate a conditional probability distribution $P(y \mid x)$ over labels given features.

Multinomial Logistic Regression

Scoring function

Consider a set of labels $\mathcal{L} = \{1, 2, \ldots, k\}$ instead of just a binary one; we construct a neural network with one output score per label.
Different from binary classification, we have the following:
  • a set of $d$ features;
  • input-output parameters $\theta \in \mathbb{R}^{k \times d}$, where:
    • $k$ is the number of labels;
    • $d$ is the number of features;
  • for a document with label $y$, iterate through all labels and all features and set the joint feature $f_{j, y'}(x, y) = 1$ if feature $j$ is in the document and $y' = y$ (and $0$ otherwise).
Similar to binary classification, we have the following score function:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{L}} \text{score}(x, y; \theta) = \operatorname*{argmax}_{y \in \mathcal{L}} \sum_{j=1}^{d} \theta_{y, j}\, x_j$$
Where we try all the different labels in $\mathcal{L}$ and return the label that maximizes the score for the input-output features.
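A NumPy sketch of the multinomial scoring step (the shapes and values are illustrative assumptions):

```python
import numpy as np

k, d = 3, 4                       # hypothetical: 3 labels, 4 features
theta = np.random.randn(k, d)     # one weight vector per label
x = np.array([1.0, 0.0, 2.0, 1.0])

scores = theta @ x                # score(x, y) for every label y, shape (k,)
y_hat = int(np.argmax(scores))    # pick the label with the highest score
print(scores, y_hat)
```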

Softmax

In contrast to the sigmoid (logistic) function, we now need a function that takes a whole vector of scores and pushes one of them close to 1.0 and the rest close to 0.0, with the results summing to 1.0: the softmax function.
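Softmax exponentiates each score and normalizes by the sum; a small sketch (the max-subtraction trick for numerical stability is my addition):

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    z = z - np.max(z)             # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, -1.0])))  # ~[0.71, 0.26, 0.04], sums to 1
```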

Probability approximation

What is the probability for all labels?
$$P(\cdot \mid x) = \text{softmax}\big([\,\text{score}(x, 1; \theta), \ldots, \text{score}(x, k; \theta)\,]\big)$$
What is the probability of a single label?
$$P(y \mid x) = \frac{e^{\text{score}(x, y; \theta)}}{\sum_{y' \in \mathcal{L}} e^{\text{score}(x, y'; \theta)}}$$
The denominator is what makes multinomial logistic regression computationally expensive: to compute the probability of a single label, one must sum across all labels to normalize.

Parameter Learning

We are still going to minimize the negative log of the probability, just like we did for binary logistic regression:
$$\mathcal{L}(x, y; \theta) = -\log P(y \mid x) = -\text{score}(x, y; \theta) + \log \sum_{y' \in \mathcal{L}} e^{\text{score}(x, y'; \theta)}$$
Where:
  • the first part maximizes the score (minimizes the negative score) for the features that correlate with the correct label $y$;
  • the second part minimizes the scores for all the other labels.
As such, we can construct a neural network with a softmax output and a negative log likelihood loss on top.
Negative Log Likelihood Loss
The predicted vector is the softmax output:
$$\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k]$$
The true label is a one-hot vector:
$$y = [0, \ldots, 1, \ldots, 0]$$
Multiply them together element-wise, then sum and negate:
$$\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
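A NumPy sketch of that computation (the values are placeholders):

```python
import numpy as np

y_hat = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
y_true = np.array([1.0, 0.0, 0.0])  # one-hot true label

nll = -np.sum(y_true * np.log(y_hat))   # selects -log(0.7)
print(nll)   # ~0.357
```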

Neural Net Training

 
The pseudocode is the same as for binary classification, except that the model is a function with more parameters and a softmax layer, and the loss is the negative log likelihood instead of binary cross-entropy.
Note: most APIs will implement a CrossEntropyLoss as the top three layers: Softmax + Log + NLLLoss.
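For example, in PyTorch (one concrete API; the shapes below are illustrative), `nn.CrossEntropyLoss` takes the raw scores (logits) and applies log-softmax plus NLL internally:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, -1.0]])   # raw scores for k = 3 labels
target = torch.tensor([0])                  # index of the true label

loss = nn.CrossEntropyLoss()(logits, target)
print(loss.item())   # ~0.349
```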

Multilayer Classification Networks

We built the neural networks based on pure math for both binary and multinomial classification problems. However, we can add layers to these neural networks to create more complex models.
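A sketch of what adding a hidden layer might look like in PyTorch (the layer sizes and the framework are my assumptions, not from the notes):

```python
import torch.nn as nn

d, h, k = 1000, 128, 3   # hypothetical feature, hidden, and label dimensions
model = nn.Sequential(
    nn.Linear(d, h),     # hidden layer learns intermediate features
    nn.ReLU(),
    nn.Linear(h, k),     # output layer produces one score per label
)
# Train with nn.CrossEntropyLoss on the raw output scores, as above.
```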

About Me

I'm Qiwei Mao, a geotechnical engineer with a passion for IoT systems. I'm exploring low-power microcontrollers and LoRa communication systems to enable both hobbyist remote monitoring solutions and industrial-grade monitoring or control systems.
This note series on Natural Language Processing documents my journey in the OMSCS Program at Georgia Institute of Technology.
Qiwei Mao

© Qiwei Mao 2024