Metrics to measure the gap between neural text and human text
Recently, large language models have shown a tremendous ability to generate human-like text. Many metrics exist to measure how close a text generated by a large language model is to a reference human text, and narrowing this gap is an active area of research.
In this post, we look into two well-known metrics for automatically evaluating machine-generated text.
Suppose you are given a reference text that is human-written and a candidate text that is generated by an LLM. To compute the semantic similarity between these two texts, BERTScore computes pairwise cosine similarities between their token embeddings. See the image below:
Here the reference text is “the weather is cold today” and the candidate (machine-generated) text is “it is freezing today”. If we compute n-gram overlap, these two texts receive a low score, even though we know they are semantically very similar. BERTScore instead computes a contextual embedding for each token in both the reference and the candidate text and then, based on these embedding vectors, computes the pairwise cosine similarities.
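As a minimal sketch of this step, here is how one might obtain contextual token embeddings and the pairwise cosine similarity matrix using Hugging Face's transformers with bert-base-uncased. The official BERTScore implementation chooses a specific model and hidden layer, so treat this as illustrative rather than the exact setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: bert-base-uncased via Hugging Face transformers; BERTScore itself
# selects a particular model and layer, so this is only an illustrative sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embeddings(text: str) -> torch.Tensor:
    """Return contextual embeddings for each token, dropping [CLS] and [SEP]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    return hidden[1:-1]                                # strip special tokens

reference = token_embeddings("the weather is cold today")
candidate = token_embeddings("it is freezing today")

# Pairwise cosine similarities: L2-normalize the rows, then take dot products.
ref_norm = torch.nn.functional.normalize(reference, dim=-1)
cand_norm = torch.nn.functional.normalize(candidate, dim=-1)
similarity = ref_norm @ cand_norm.T  # shape: (num_ref_tokens, num_cand_tokens)
```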
From the pairwise cosine similarities, we can compute precision, recall, and an F1 score as follows (see the sketch after the list):
- Recall: for every token in the reference text, take the maximum cosine similarity over the candidate tokens, then average these maxima
- Precision: for every token in the candidate text, take the maximum cosine similarity over the reference tokens, then average these maxima
- F1 score: the harmonic mean of precision and recall
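A short sketch of these three scores, assuming the `similarity` matrix from the snippet above (rows are reference tokens, columns are candidate tokens):

```python
# Recall: best match for each reference token; precision: best match for each
# candidate token; F1: harmonic mean of the two.
recall = similarity.max(dim=1).values.mean().item()
precision = similarity.max(dim=0).values.mean().item()
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```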
BERTScore [1] also proposes a modification to the above scores called “importance weighting”. Importance weighting accounts for the fact that rare words which are common between the two sentences are more…
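In the paper, this is realized with inverse document frequency (IDF) weights, so rare reference tokens contribute more to the score. A rough sketch of an IDF-weighted recall is below; `ref_tokens` (the reference tokens aligned with the rows of `similarity`) and `idf` (a precomputed token-to-IDF mapping) are hypothetical inputs, not part of the original post:

```python
def weighted_recall(similarity: torch.Tensor, ref_tokens: list[str], idf: dict[str, float]) -> float:
    """IDF-weighted recall: average of best matches, weighted by each reference token's IDF."""
    best = similarity.max(dim=1).values                          # best match per reference token
    weights = torch.tensor([idf.get(t, 1.0) for t in ref_tokens])
    return ((weights * best).sum() / weights.sum()).item()
```

The same weighting scheme applies symmetrically to precision, using the candidate tokens and the column-wise maxima instead.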