NLP - Classification

date
Aug 26, 2024
type
Post
AI summary
The document discusses classification in NLP, addressing the difference between structured and unstructured data, introducing the Bag of Words model for feature representation, and describing various examples of classification. It examines the data labeling process, Bayesian classification, binary and multinomial logistic regression, and the importance of feature engineering and probability normalization. The conclusion emphasizes how logistic regression learns parameters that weight the importance of features.
slug
nlp-classification
status
Published
tags
NLP
summary
Addresses the challenges of structured vs. unstructured data, and introduces the Bag of Words model for feature representation. Explains why each document can be represented as a point in a d-dimensional feature space.

Examples of Classification

  • Document topics: politics? sports? reports?
  • Sentiment classification: movie review positive or negative?
  • Spam detection.
  • Text complexity: is the text formal or informal? How clear is it?
  • Language and dialect detection.
  • Social media toxicity detection.
  • Multiple-choice question answering.
  • Language generation — what words come next?

Features of Classification

 
The classification task can be viewed as a function $\text{classify}: \mathcal{V}^* \rightarrow \mathcal{L}$, where:
  • Input ($x$): a word, a sentence, a paragraph, or a document.
  • Input space ($\mathcal{V}^*$): the set of all sequences of words.
  • Output ($y$): a label from a finite set of labels $\mathcal{L}$.
Consider the following:
  • $X$ is a random variable over inputs, such that each value of $X$ is from $\mathcal{V}^*$.
  • $Y$ is a random variable over outputs, taking values from $\mathcal{L}$.
  • $P(X, Y)$ is the true distribution of labeled texts.
    • The joint probability over all possible text documents and all possible labels.
  • $P(Y)$ is the distribution of labels.
    • Irrespective of documents, how frequently we would see each label.

Data Labeling

Problem:
We don’t typically know $P(X, Y)$ or $P(Y)$ except through data.
Solution:
Human experts label some data.
Feed the labeled data to a supervised machine learning algorithm that approximates the function $\text{classify}: \mathcal{V}^* \rightarrow \mathcal{L}$.
Test the learned classifier on a portion of the data withheld for testing.

Structured vs. Unstructured Data

Problem: text is generally considered unstructured data.
Supervised learning likes structured inputs, which we call features ($x$). Features are what the algorithm is allowed to see and know about.
Problem: in a perfect world, we would simply throw away all the unnecessary parts of the input and keep the useful stuff. But we don’t always know what is useful.
We often have some insight, though, which is where feature engineering comes in — for example removing stop words such as “the” and “are”.

Bag Of Words

Consider a sentence as the input.

Features as word presence

  • Discard word order
We can then build unigram features (single words) and bigram features (pairs of adjacent words); sketches of both follow below.
Now $x$ can be represented as a $d$-dimensional vector of features.
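As a concrete illustration, here is a minimal sketch of presence-based unigram and bigram features. The example sentence and feature list are assumptions, not from the original notes:

```python
# Minimal sketch of presence-based bag-of-words features.
# The example sentence and feature list are illustrative assumptions.

sentence = "the movie was not good"            # hypothetical input x
tokens = sentence.split()

unigrams = set(tokens)                                   # single-word features
bigrams = {(a, b) for a, b in zip(tokens, tokens[1:])}   # adjacent word pairs

# A fixed feature list (vocabulary) defines the d dimensions.
feature_list = ["good", "bad", "not", "movie", ("not", "good"), ("not", "bad")]

# x as a d-dimensional 0/1 presence vector
x = [1 if (f in unigrams or f in bigrams) else 0 for f in feature_list]
print(x)   # [1, 0, 1, 1, 1, 0]
```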

Features as word frequency
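Instead of 0/1 presence, each feature can hold the word’s count. A small sketch, again with an assumed example sentence:

```python
from collections import Counter

sentence = "the movie was not good , not good at all"   # hypothetical input
counts = Counter(sentence.split())

feature_list = ["good", "bad", "not", "movie"]
x = [counts[f] for f in feature_list]   # frequency instead of presence
print(x)   # [2, 0, 2, 1]
```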

Conclusion

A feature vector may look like this: $x \in \mathbb{R}^d$, where $d$ is the number of features. $\mathbb{R}$ is typically used to represent the set of all real numbers in mathematics.
Now every document or input is a point in the d-dimensional space.

Bayesian Classification

We can translate the classification function
$$\hat{y} = \text{classify}(x)$$
into a probability question:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{L}} P(y \mid x) = \operatorname*{argmax}_{y \in \mathcal{L}} \frac{P(x \mid y)\,P(y)}{P(x)}$$
Apply the naive Bayes assumption: word features are independent from each other, so
$$P(x \mid y) \approx \prod_{j=1}^{d} P(x_j \mid y)$$
Where:
  • These probabilities can be estimated (counted) from the labeled training dataset.
  • $P(x)$ typically does not need to be computed for NLP classification: we are only comparing labels (e.g., positive vs. negative), and the denominator is the same for every label.
Turning the probability into a classification prediction for the binary case, compare the two posteriors and apply the sign (+ or −):
$$\hat{y} = \operatorname{sign}\big(P(+ \mid x) - P(- \mid x)\big)$$
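A minimal sketch of this decision rule follows. The prior and likelihood tables are made-up placeholders, not estimates from real data:

```python
# Hypothetical estimated probabilities (placeholders, not real data)
prior = {"+": 0.5, "-": 0.5}
likelihood = {
    "+": {"good": 0.10, "bad": 0.01, "movie": 0.05},
    "-": {"good": 0.02, "bad": 0.12, "movie": 0.05},
}

def posterior_score(tokens, label):
    # P(y) * prod_j P(x_j | y); the shared denominator P(x) is ignored
    score = prior[label]
    for t in tokens:
        score *= likelihood[label].get(t, 1e-6)  # tiny floor for unseen words
    return score

tokens = "good movie".split()
pred = "+" if posterior_score(tokens, "+") >= posterior_score(tokens, "-") else "-"
print(pred)   # "+"
```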

Log Probabilities and Smoothing

Two problems:
  1. Multiplying probabilities makes numbers very small very fast and eventually floating point precision becomes a problem.
  2. Sometimes probabilities are zero and multiplying zeros against other probabilities makes everything zero.
Log Scale Probability
In log scale the product becomes a sum:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{L}} \Big( \log P(y) + \sum_{j=1}^{d} \log P(x_j \mid y) \Big)$$
The zero problem can still appear if some $P(x_j \mid y) = 0$, since $\log 0$ is undefined. We can modify the counts so that we never encounter features with zero probability — we pretend there is never a zero count of a feature in a text. This process is called smoothing, and it introduces a small amount of error.
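A common concrete choice (the standard example, not specified in the original notes) is add-one (Laplace) smoothing, which adds one to every count before normalizing:

$$P(x_j \mid y) = \frac{\operatorname{count}(x_j, y) + 1}{\Big(\sum_{x' \in V} \operatorname{count}(x', y)\Big) + |V|}$$

With this adjustment, every word in the vocabulary $V$ has a small non-zero probability for every label.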

Binary Logistic Regression

Recall that we have a collection of features $x_1, x_2, \ldots, x_d$, each of which maps to a word (or n-gram) from the vocabulary $V$. Using Bayesian probabilities, we have $P(y \mid x) \propto P(y) \prod_{j} P(x_j \mid y)$. However, certain words should obviously hold more weight than others rather than contributing by mere frequency: a strongly opinion-bearing word deserves more attention than a common function word. To solve this problem, we introduce binary logistic regression.

Parameters / Weights

Consider a set of coefficients (also called weights or parameters), one per feature: $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$ with $\theta \in \mathbb{R}^d$. If you set the coefficients just right, you can design a system that gets classification correct for a majority of inputs. But... you don't generally know the coefficients, so you have to learn them from the training data.
Consider a binary label $y \in \{+1, -1\}$ and a score function:
$$\text{score}(x; \theta) = \sum_{j=1}^{d} x_j \theta_j = x \cdot \theta$$
Where:
  • $x$, as an array or vector, represents the document;
  • $\theta$ is an array or vector containing all the weights.
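A minimal sketch of the score and its sigmoid squashing. The feature values and weights below are made up for illustration:

```python
import numpy as np

def score(x, theta):
    # Dot product of feature vector and weights
    return np.dot(x, theta)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 2.0, 1.0])        # hypothetical feature counts
theta = np.array([1.5, -0.2, -1.1, 0.3])  # hypothetical learned weights

s = score(x, theta)                        # 1.5 - 2.2 + 0.3 = -0.4
print(s, sigmoid(s))                       # negative score -> P(y=+1|x) < 0.5
```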

The Probability Perspective

How do we learn the weights, though? Let’s look at the score function from a probabilistic perspective:
$$P(y = +1 \mid x) = \sigma\big(\text{score}(x; \theta)\big) = \frac{1}{1 + e^{-x \cdot \theta}}$$
Where:
  • the sigmoid (logistic) function $\sigma$ converts the score into a probability between 0 and 1;
  • if the score is high, $P(y = +1 \mid x) \to 1$; if the score is low or negative, $P(y = +1 \mid x) \to 0$.
Consider the negative label scenario:
$$P(y = -1 \mid x) = 1 - \sigma\big(\text{score}(x; \theta)\big) = \sigma\big(-\text{score}(x; \theta)\big)$$
Where:
  • if the score is negative, $P(y = -1 \mid x) \to 1$; if the score is positive, $P(y = -1 \mid x) \to 0$.
Generally, we can express both the positive and negative scenarios in the same equation:
$$P(y \mid x) = \sigma\big(y \cdot \text{score}(x; \theta)\big) = \frac{1}{1 + e^{-y (x \cdot \theta)}}$$

Parameter Learning

To learn the best set of parameters $\hat{\theta}$, we use the following objective:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} P(y_i \mid x_i; \theta)$$
Where:
  • arg max: search for the set of parameters $\theta$ that maximizes the probability of the labels given the inputs;
  • the product runs over all $N$ examples in the training data;
  • $P(y_i \mid x_i; \theta)$ is the probability that the label for data point $i$ matches the input for data point $i$.
Converting the equation into log space:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta)$$
We flip things around into a minimization:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \sum_{i=1}^{N} -\log P(y_i \mid x_i; \theta)$$
Where:
  • $-\log P(y_i \mid x_i; \theta)$ is called the negative log loss, or cross-entropy loss.
Considering that $P(y \mid x) = \sigma\big(y \cdot (x \cdot \theta)\big)$, the sigmoid function in log scale can be written as $\log \sigma(z) = -\log(1 + e^{-z})$, so the loss function for supervised learning is:
$$\mathcal{L}(x, y; \theta) = -\log \sigma\big(y \cdot (x \cdot \theta)\big) = \log\big(1 + e^{-y (x \cdot \theta)}\big)$$
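A sketch of that loss in code (pure NumPy, with made-up values), matching the $\{-1, +1\}$ label convention above:

```python
import numpy as np

def log_loss(x, y, theta):
    # L = log(1 + exp(-y * (x . theta))), with y in {-1, +1}
    return np.log1p(np.exp(-y * np.dot(x, theta)))

x = np.array([1.0, 0.0, 2.0, 1.0])        # hypothetical features
theta = np.array([1.5, -0.2, -1.1, 0.3])  # hypothetical weights
print(log_loss(x, +1, theta))  # larger loss: the score is negative but the label is +1
print(log_loss(x, -1, theta))  # smaller loss: the score agrees with the label
```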

Parameter Learning With Neural Networks

Construct a single layer neural network with a sigmoid function at the top:
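A sketch of such a network in PyTorch, standing in for the original figure (the framework choice and feature dimension are assumptions):

```python
import torch.nn as nn

d = 1000  # hypothetical number of features
model = nn.Sequential(
    nn.Linear(d, 1),   # score(x; theta) = x . theta + bias
    nn.Sigmoid(),      # squashes the score into a probability
)
```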
Loss Function
Recall the loss function $\mathcal{L}(x, y; \theta) = \log\big(1 + e^{-y (x \cdot \theta)}\big)$, which multiplies the score ($x \cdot \theta$) by the target $y = +1$ or $y = -1$ before sending it through the sigmoid: $\sigma\big(y \cdot (x \cdot \theta)\big)$.
However, when the labels are $0$ or $1$ instead of $-1$ and $+1$, we introduce binary cross-entropy:
$$\mathcal{L}_{\text{BCE}} = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$
Where:
  • $\hat{y} = \sigma(x \cdot \theta)$ is the predicted probability and $y$ is $0$ or $1$;
  • $y$ and $(1 - y)$ are used to select the positive and negative labels;
  • when $y = 1$, the loss is $-\log \hat{y}$;
  • when $y = 0$, the loss is $-\log(1 - \hat{y})$.
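A NumPy sketch of binary cross-entropy (the values are illustrative):

```python
import numpy as np

def bce(y_true, y_hat, eps=1e-12):
    # y_true in {0, 1}; y_hat is the predicted probability sigma(x . theta)
    y_hat = np.clip(y_hat, eps, 1 - eps)     # avoid log(0)
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

print(bce(1, 0.9))   # ~0.105: confident and correct -> small loss
print(bce(1, 0.1))   # ~2.303: confident and wrong   -> large loss
```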

Training

Batching and training:
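The original pseudocode figures are not reproduced here; a minimal NumPy sketch of mini-batch gradient descent for the 0/1-label model (all names and values are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, epochs=100, batch_size=32):
    # X: (N, d) feature matrix, y: (N,) labels in {0, 1}
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            y_hat = sigmoid(X[batch] @ theta)
            # Gradient of binary cross-entropy with respect to theta
            grad = X[batch].T @ (y_hat - y[batch]) / len(batch)
            theta -= alpha * grad            # step down the gradient
    return theta
```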
Where:
  • $\alpha$ is the learning rate (`alpha` in the sketch above): the higher the learning rate, the faster we move down the gradient, but we may overshoot the true value of $\theta$.

Conclusion

To summarize,
  • Binary logistic regression learns a set of parameters $\theta$ that weight the importance of the features $x$.
  • The parameters can be used to approximate a conditional probability distribution $P(y \mid x)$ over labels given features.

Multinomial Logistic Regression

Scoring function

Consider a set of labels $\mathcal{L} = \{1, 2, \ldots, k\}$ instead of just a binary one; we construct a neural network with one output score per label.
Different from binary classification, we have the following:
  • a set of $d$ features;
  • input-output parameters $\theta \in \mathbb{R}^{k \times d}$, where:
    • $k$ is the number of labels;
    • $d$ is the number of features;
  • for a document with label $y$, iterate through all labels and all features and set the joint feature $f_{j, y'}(x, y) = 1$ if feature $j$ is in the document and $y' = y$ (and $0$ otherwise).
Similar to binary classification, we have the following score function:
$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{L}} \text{score}(x, y; \theta) = \operatorname*{argmax}_{y \in \mathcal{L}} \sum_{j=1}^{d} \theta_{y, j}\, x_j$$
Where we try all the different labels in $\mathcal{L}$ and return the label that maximizes the score for the input-output features.
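A NumPy sketch of the multinomial scoring step (the shapes and values are illustrative assumptions):

```python
import numpy as np

k, d = 3, 4                       # hypothetical: 3 labels, 4 features
theta = np.random.randn(k, d)     # one weight vector per label
x = np.array([1.0, 0.0, 2.0, 1.0])

scores = theta @ x                # score(x, y) for every label y, shape (k,)
y_hat = int(np.argmax(scores))    # pick the label with the highest score
print(scores, y_hat)
```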

Softmax

In contrast to the sigmoid (logistic) function, we now need a function that takes a whole vector of scores and pushes one of them close to 1.0 and the rest close to 0.0, with the results summing to 1.0: the softmax function.
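Softmax exponentiates each score and normalizes by the sum; a small sketch (the max-subtraction trick for numerical stability is my addition):

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    z = z - np.max(z)             # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, -1.0])))  # ~[0.71, 0.26, 0.04], sums to 1
```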

Probability approximation

What is the probability for all labels?
$$P(\cdot \mid x) = \text{softmax}\big([\,\text{score}(x, 1; \theta), \ldots, \text{score}(x, k; \theta)\,]\big)$$
What is the probability of a single label?
$$P(y \mid x) = \frac{e^{\text{score}(x, y; \theta)}}{\sum_{y' \in \mathcal{L}} e^{\text{score}(x, y'; \theta)}}$$
The denominator is what makes multinomial logistic regression computationally expensive: to compute the probability of a single label, one must sum across all labels to normalize.

Parameter Learning

We are still going to minimize the negative log of the probability, just like we did for binary logistic regression:
$$\mathcal{L}(x, y; \theta) = -\log P(y \mid x) = -\text{score}(x, y; \theta) + \log \sum_{y' \in \mathcal{L}} e^{\text{score}(x, y'; \theta)}$$
Where:
  • the first part maximizes the score (minimizes the negative score) for the features that correlate with the correct label $y$;
  • the second part minimizes the scores for all the other labels.
As such, we can construct a neural network with a softmax output and a negative log likelihood loss on top.
Negative Log Likelihood Loss
The predicted vector is the softmax output:
$$\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_k]$$
The true label is a one-hot vector:
$$y = [0, \ldots, 1, \ldots, 0]$$
Multiply them together element-wise, then sum and negate:
$$\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
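A NumPy sketch of that computation (the values are placeholders):

```python
import numpy as np

y_hat = np.array([0.7, 0.2, 0.1])   # hypothetical softmax output
y_true = np.array([1.0, 0.0, 0.0])  # one-hot true label

nll = -np.sum(y_true * np.log(y_hat))   # selects -log(0.7)
print(nll)   # ~0.357
```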

Neural Net Training

 
The pseudocode is the same as for binary classification, except that the model is a function with more parameters and a softmax layer, and the loss is the negative log likelihood instead of binary cross-entropy.
Note: most APIs will implement a CrossEntropyLoss as the top three layers: Softmax + Log + NLLLoss.
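For example, in PyTorch (one concrete API; the shapes below are illustrative), `nn.CrossEntropyLoss` takes the raw scores (logits) and applies log-softmax plus NLL internally:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 1.0, -1.0]])   # raw scores for k = 3 labels
target = torch.tensor([0])                  # index of the true label

loss = nn.CrossEntropyLoss()(logits, target)
print(loss.item())   # ~0.349
```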

Multilayer Classification Networks

We built the neural networks based on pure math for both binary and multinomial classification problems. However, we can add layers to these neural networks to create more complex models.
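A sketch of what adding a hidden layer might look like in PyTorch (the layer sizes and the framework are my assumptions, not from the notes):

```python
import torch.nn as nn

d, h, k = 1000, 128, 3   # hypothetical feature, hidden, and label dimensions
model = nn.Sequential(
    nn.Linear(d, h),     # hidden layer learns intermediate features
    nn.ReLU(),
    nn.Linear(h, k),     # output layer produces one score per label
)
# Train with nn.CrossEntropyLoss on the raw output scores, as above.
```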

About Me

I'm Qiwei Mao, a geotechnical engineer with a passion for IoT systems. I'm exploring low-power microcontrollers and LoRa communication systems to enable both hobbyist remote monitoring solutions and industrial-grade monitoring or control systems.
This note series on Natural Language Processing documents my journey in the OMSCS Program at Georgia Institute of Technology.
Qiwei Mao

© Qiwei Mao 2024