Large Language Models Excel at Multi-Class Classification but Still Make Mistakes

Large language models (LLMs) have demonstrated remarkable performance on large-scale multi-class classification tasks. However, they still make classification errors and sometimes even generate out-of-vocabulary (OOV) class labels.

To address these issues, researchers have proposed a method called “Paraphrase and AGgregate (PAG)-LLM”:

  1. The LLM generates multiple paraphrases (parallel queries) of the input query
  2. It performs multi-class classification on the original query and each paraphrase
  3. It aggregates all the classification labels based on their confidence scores
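The three steps above can be sketched in a few lines of Python. The callables `llm_classify` and `llm_paraphrase` are hypothetical stand-ins for whatever LLM interface is used; they are not part of the paper's code.

```python
from collections import Counter

def pag_llm(query, llm_classify, llm_paraphrase, n_paraphrases=3):
    """Sketch of the PAG-LLM pipeline: paraphrase, classify, aggregate.

    `llm_classify(text)` is a hypothetical callable returning a
    (label, confidence) pair; `llm_paraphrase(text, n)` returns n
    paraphrases of the input text.
    """
    # Step 1: generate paraphrases of the input query.
    paraphrases = llm_paraphrase(query, n_paraphrases)
    # Step 2: classify the original query and each paraphrase.
    predictions = [llm_classify(t) for t in [query] + paraphrases]
    # Step 3: aggregate the labels, weighting each vote by its confidence.
    scores = Counter()
    for label, conf in predictions:
        scores[label] += conf
    return scores.most_common(1)[0][0]
```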

This approach is particularly effective for difficult examples where the LLM is uncertain, reducing critical misclassification errors and hallucinated label generation.


How PAG-LLM Works

Figure 1(A) illustrates the PAG-LLM process. The LLM first classifies the original query. Only if the classification confidence falls below a threshold τ is the original query fed back to the LLM to generate paraphrases, which are then classified by the LLM in turn. Finally, the LLM aggregates the predicted class labels from the original query and its paraphrases.
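The threshold gating can be expressed as a small routing function. Again, `llm_classify` and `pag_llm` are hypothetical callables standing in for the direct classifier and the full paraphrase-and-aggregate path.

```python
def classify_with_fallback(query, llm_classify, pag_llm, tau=0.98):
    """Route a query through the expensive paraphrase-and-aggregate
    path only when the first-pass confidence falls below tau
    (hypothetical callables, not the authors' actual interface)."""
    label, conf = llm_classify(query)
    if conf >= tau:
        return label       # high confidence: keep the direct prediction
    return pag_llm(query)  # low confidence: paraphrase and aggregate
```

This gating is what keeps the extra paraphrasing cost limited to the hard, low-confidence examples.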

Figure 2 shows examples from CLINC where the LLM misclassified the query (top) and generated an OOV class label (bottom). In the top example, the paraphrases generated by PAG-LLM allow the correct class to be predicted with high confidence, so even a simple majority-voting aggregation yields the correct label.
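Majority voting over the predicted labels is the simplest aggregation rule, shown here as a minimal sketch:

```python
from collections import Counter

def majority_vote(labels):
    """Majority-vote aggregation over the predicted labels of the
    original query and its paraphrases. Ties are broken by whichever
    label was seen first (Counter preserves insertion order)."""
    return Counter(labels).most_common(1)[0][0]
```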


In the bottom example, only paraphrase 2 receives a correct, in-vocabulary label, while the rest produce OOV class labels. PAG-LLM aggregates the input text, its paraphrases, and their labels and confidence scores to ultimately predict the correct class label.
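In the paper the aggregation itself is performed by the LLM; a simplified rule-based stand-in that captures the same behavior is to discard OOV votes and weight the remaining ones by confidence, so a single in-vocabulary prediction can still win:

```python
def aggregate_with_oov(predictions, label_set):
    """Confidence-weighted aggregation that discards out-of-vocabulary
    (OOV) labels. `predictions` is a list of (label, confidence) pairs;
    `label_set` is the set of known class labels. This is a simplified
    stand-in for the LLM-based aggregation step described in the paper."""
    scores = {}
    for label, conf in predictions:
        if label in label_set:  # skip hallucinated OOV labels
            scores[label] = scores.get(label, 0.0) + conf
    if not scores:
        return None  # every prediction was OOV
    return max(scores, key=scores.get)
```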

Evaluating PAG-LLM on Large Multi-Class Classification Datasets

PAG-LLM was evaluated on two large multi-class classification datasets, CLINC and Banking, where it reduced classification errors by 22.7% and 15.1%, respectively.
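For clarity, "error reduction" here is the relative drop in error rate, i.e. the fraction of the baseline's mistakes that the method eliminates. A small helper makes the arithmetic explicit (the accuracies in the test are illustrative, not figures from the paper):

```python
def error_reduction(baseline_acc, new_acc):
    """Relative error reduction: the fraction of the baseline's errors
    eliminated by the new method. Accuracies are given in [0, 1]."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err
```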


Table 1 shows the in-domain (ID) performance on CLINC and Banking(50%) datasets. The same supervised fine-tuned (SFT) LLaMa-7B was also used in self-consistency runs (rows 2 and 3). “Rand-seed” denotes an ensemble of 6 different SFT LLaMas trained with different random seeds. “Vote” means using a majority voting strategy to select the final label.

For rows 6 and 8, the confidence threshold was tuned on development data (τ = 0.98 for CLINC and τ = 0.90 for Banking), and only queries with low classification confidence were routed through PAG-LLM. Best numbers are in bold. P1-P4 represent 4 in-context learning (ICL) baselines from prior work.
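Tuning τ on held-out data amounts to a small grid search: for each candidate threshold, route low-confidence queries through the paraphrase-and-aggregate path and keep the τ with the best dev accuracy. A hypothetical sketch (not the authors' code):

```python
def tune_threshold(dev_examples, llm_classify, pag_llm, candidates):
    """Grid-search the confidence threshold tau on development data.
    `dev_examples` is a list of (query, gold_label) pairs; `llm_classify`
    and `pag_llm` are hypothetical stand-ins for the direct classifier
    and the paraphrase-and-aggregate path."""
    def accuracy(tau):
        correct = 0
        for query, gold in dev_examples:
            label, conf = llm_classify(query)
            if conf < tau:  # low confidence: fall back to PAG-LLM
                label = pag_llm(query)
            correct += (label == gold)
        return correct / len(dev_examples)
    return max(candidates, key=accuracy)
```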

Reducing Out-of-Vocabulary Label Generation Errors

PAG-LLM is particularly effective at reducing out-of-vocabulary (OOV) class label generation errors.


Table 2 shows the performance on the full test sets of CLINC and Banking(50%), including both in-domain (ID) and out-of-domain (OOD) inputs. Notations are the same as in the previous table. "All F1" denotes the macro F1 score over all ID classes plus one OOD class (i.e., 151 classes for CLINC and 38 classes for Banking). "Avg" is the average of the ID and OOD F1 scores.
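Macro F1 over a fixed label set treats every class equally, which is what lets the single OOD class carry as much weight as any in-domain class. A minimal reference implementation:

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over a fixed label set (e.g. all in-domain
    classes plus one out-of-domain class): per-class F1 scores are
    computed independently and averaged with equal weight."""
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        if 2 * tp + fp + fn == 0:
            f1s.append(0.0)  # class absent from gold and predictions
        else:
            f1s.append(2 * tp / (2 * tp + fp + fn))
    return sum(f1s) / len(labels)
```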

In summary, the PAG-LLM approach of paraphrasing a query, classifying each paraphrase, and aggregating the resulting labels by confidence score is a promising technique for reducing intent classification errors made by large language models, especially out-of-vocabulary label generation. The method has been shown to be effective on large multi-class datasets such as CLINC and Banking.

Paper: Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors (https://arxiv.org/pdf/2406.17163)
