Metrics to measure the gap between neural text and human text
Recently, large language models have shown a tremendous ability to generate human-like text. Many metrics exist to measure how close a text generated by a large language model is to a reference human text, and narrowing this gap is an active area of research.
In this post, we look into two well-known metrics for automatically evaluating machine-generated text.
Suppose you are given a reference text that is human-written and a candidate text that is generated by an LLM. To compute the semantic similarity between these two texts, BERTScore computes pairwise cosine similarities between their token embeddings. See the image below:
Here the reference text is “the weather is cold today” and the candidate (machine-generated) text is “it is freezing today”. If we compute n-gram overlap, these two texts receive a low score, even though we know they are semantically very similar. BERTScore instead computes a contextual embedding for each token in both the reference and the candidate text and then, based on these embedding vectors, computes the pairwise cosine similarities.
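As a minimal sketch of this step, here is how one might obtain contextual token embeddings and the pairwise cosine similarity matrix using Hugging Face's transformers with bert-base-uncased. The official BERTScore implementation chooses a specific model and hidden layer, so treat this as illustrative rather than the exact setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: bert-base-uncased via Hugging Face transformers; BERTScore itself
# selects a particular model and layer, so this is only an illustrative sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    """Return contextual embeddings for each token, dropping [CLS] and [SEP]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    return hidden[1:-1]                                # strip special tokens

reference = token_embeddings("the weather is cold today")
candidate = token_embeddings("it is freezing today")

# Pairwise cosine similarities: L2-normalize the rows, then take dot products.
ref_norm = torch.nn.functional.normalize(reference, dim=-1)
cand_norm = torch.nn.functional.normalize(candidate, dim=-1)
similarity = ref_norm @ cand_norm.T  # shape: (num_ref_tokens, num_cand_tokens)
```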
From the pairwise cosine similarities, we can compute precision, recall, and an F1 score as follows (see the sketch after the list):
- Recall: for every token in the reference text, take the maximum cosine similarity over the candidate tokens, then average these maxima
- Precision: for every token in the candidate text, take the maximum cosine similarity over the reference tokens, then average these maxima
- F1 score: the harmonic mean of precision and recall
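A short sketch of these three scores, assuming the `similarity` matrix from the snippet above (rows are reference tokens, columns are candidate tokens):

```python
# Recall: best match for each reference token; precision: best match for each
# candidate token; F1: harmonic mean of the two.
recall = similarity.max(dim=1).values.mean().item()
precision = similarity.max(dim=0).values.mean().item()
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```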
BERTScore [1] also proposes a modification to the above scores called “importance weighting”. Importance weighting accounts for the fact that rare words which are common between the two sentences are more…
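In the paper, this is realized with inverse document frequency (IDF) weights, so rare reference tokens contribute more to the score. A rough sketch of an IDF-weighted recall is below; `ref_tokens` (the reference tokens aligned with the rows of `similarity`) and `idf` (a precomputed token-to-IDF mapping) are hypothetical inputs, not part of the original post:

```python
def weighted_recall(similarity: torch.Tensor, ref_tokens: list[str], idf: dict[str, float]) -> float:
    """IDF-weighted recall: average of best matches, weighted by each reference token's IDF."""
    best = similarity.max(dim=1).values                          # best match per reference token
    weights = torch.tensor([idf.get(t, 1.0) for t in ref_tokens])
    return ((weights * best).sum() / weights.sum()).item()
```

The same weighting scheme applies symmetrically to precision, using the candidate tokens and the column-wise maxima instead.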