Large Language Models and Vector Databases for News Recommendations | by João Felipe Guedes | Dec, 2023

Now that the collections are finally populated with vectors, we can start querying the database. There are many ways to feed information into a query, but I think two inputs are especially useful:

  • An input text
  • An input vector ID

3.1 Querying vectors with an input text

Let’s say we built this vector database to be used in a search engine. In this case, we expect the user’s input to be an input text and we have to return the most relevant items.

Since all operations in a vector database are done with... VECTORS, we first need to transform the user’s input text into a vector so we can find similar items based on that input. Recall that we used Sentence Transformers to encode textual data into vectors, so we can use the very same encoder to generate a numerical representation for the user’s input text.

Since the NPR dataset contains news articles, let’s say the user typed “Donald Trump” to learn about US elections:

query_text = "Donald Trump"
query_vector = encoder.encode(query_text).tolist()
print(query_vector)
# output: [-0.048, -0.120, 0.695, ...]

Once the input query vector is computed, we can search for the closest vectors in the collection and define what sort of output we want from those vectors, like their newsId, title, and topics:

from qdrant_client.models import Filter
from qdrant_client.http import models

client.search(
    collection_name="news-articles",
    query_vector=query_vector,
    with_payload=["newsId", "title", "topics"],
    query_filter=None
)

Note: by default, Qdrant uses Approximate Nearest Neighbors search to find similar embeddings quickly, but you can also do a full scan and retrieve the exact nearest neighbors. Just bear in mind that this is a much more expensive operation.
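To make the cost of an exact search concrete, here is a minimal pure-Python sketch of a full scan: every stored vector is scored against the query and the best ones are kept. It is not how Qdrant is implemented. The cosine similarity metric, the toy 3-dimensional vectors, and the names exact_search and toy_points are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_search(query_vector, points, limit=3):
    """Score every stored vector (a full scan) and return the top `limit` (id, score) pairs."""
    scored = [
        (point_id, cosine_similarity(query_vector, vector))
        for point_id, vector in points.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy 3-dimensional "embeddings" (hypothetical data); real ones have hundreds of dimensions.
toy_points = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [-1.0, 0.0, 0.0],
}

top_hits = exact_search([1.0, 0.0, 0.0], toy_points, limit=2)
print(top_hits)
```

The work grows linearly with the number of stored vectors, which is why approximate indexes are the default for large collections. In the Qdrant client, an exact scan can usually be requested through the search parameters (for example a SearchParams object with exact=True), but check the documentation for the version you are running.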

After running this search, here are the output titles (translated into English for better comprehension):

  • Input Sentence: Donald Trump
  • Output 1: Paraguayans go to the polls this Sunday (30) to choose a new president
  • Output 2: Voters say Biden and Trump should not run in 2024, Reuters/Ipsos poll shows
  • Output 3: Writer accuses Trump of sexually abusing her in the 1990s
  • Output 4: Mike Pence, former vice president of Donald Trump, gives testimony in court that could complicate the former president

It seems that besides bringing news related to Trump himself, the embedding also managed to represent topics related to presidential elections. Notice that in the first output, there is no direct reference to the input term “Donald Trump” other than the presidential election.

Also, I left the query_filter parameter unset. This is a very useful tool if you want to specify that the output must satisfy a given condition. For instance, in a news portal, it is often important to return only the most recent articles (say, those published in the past 7 days). In that case, you could query for news articles that satisfy a minimum publication timestamp.
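The condition described above can be sketched in plain Python as a minimum-timestamp check on each item's payload. The payload field name publishedAt, the fixed reference time, and the helper filter_recent are all hypothetical and not taken from the dataset. In Qdrant, the equivalent condition would normally be built with the Filter, FieldCondition, and Range models and passed as query_filter, so that the database applies it as part of the search.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def filter_recent(payloads, now, max_age_days=7, field="publishedAt"):
    """Keep only payloads whose timestamp falls within the last `max_age_days` days."""
    min_timestamp = now - max_age_days * SECONDS_PER_DAY
    return [payload for payload in payloads if payload[field] >= min_timestamp]


# Fixed "current time" in Unix seconds so the example is reproducible.
now = 1_700_000_000

# Hypothetical payloads; "publishedAt" is an assumed field name.
toy_payloads = [
    {"newsId": "n1", "publishedAt": now - 1 * SECONDS_PER_DAY},
    {"newsId": "n2", "publishedAt": now - 10 * SECONDS_PER_DAY},
    {"newsId": "n3", "publishedAt": now - 6 * SECONDS_PER_DAY},
]

recent = filter_recent(toy_payloads, now)
recent_ids = [payload["newsId"] for payload in recent]
print(recent_ids)
```

Only the articles published within the last 7 days survive the check, and the similarity ranking would then be computed over that subset.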

Note: in the news recommendation domain, there are several sensitive aspects to consider, such as fairness and diversity. This is an open topic of discussion but, should you be interested in this area, take a look at the articles from the NORMalize Workshop.

3.2 Querying vectors with an input vector ID

Lastly, we can ask the vector database to “recommend” items that are closer to some desired vector IDs but far from undesired vector IDs. The desired and undesired IDs are called positive and negative examples, respectively, and they are thought of as seeds for the recommendation.

For instance, let’s say we have the following positive ID:

seed_id = '8bc22460-532c-449b-ad71-28dd86790ca2'
# title (translated): 'Learn why Joe Biden launched his bid for re-election this Tuesday'

We can then ask for items similar to this example:

client.recommend(
    collection_name="news-articles",
    positive=[seed_id],
    negative=None,
    with_payload=["newsId", "title", "topics"]
)

After running this operation, here are the translated output titles:

  • Input item: Learn why Joe Biden launched his bid for re-election this Tuesday
  • Output 1: Biden announces he will run for re-election
  • Output 2: USA: the 4 reasons that led Biden to run for re-election
  • Output 3: Voters say Biden and Trump should not run in 2024, Reuters/Ipsos poll shows
  • Output 4: Biden’s advisor’s gaffe that raised doubts about a possible second term after the election
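
To build some intuition for how positive and negative examples can steer a recommendation, here is a minimal pure-Python sketch. It averages the positive vectors, averages the negative vectors, and searches with 2 * avg_positive - avg_negative, excluding the seeds themselves from the results. Qdrant documents a similar "average vector" strategy, but treat the exact formula, the cosine metric, the toy 2-dimensional vectors, and all names below as assumptions rather than a description of the library's internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_recommendation_query(positive, negative):
    """Combine seed vectors into a single query: 2 * avg_positive - avg_negative."""
    avg_positive = mean_vector(positive)
    if not negative:
        return avg_positive
    avg_negative = mean_vector(negative)
    return [2 * p - n for p, n in zip(avg_positive, avg_negative)]


def recommend(points, positive_ids, negative_ids=None, limit=3):
    """Rank all non-seed points by similarity to the combined query vector."""
    negative_ids = negative_ids or []
    seeds = set(positive_ids) | set(negative_ids)
    query = build_recommendation_query(
        [points[i] for i in positive_ids],
        [points[i] for i in negative_ids],
    )
    candidates = [
        (point_id, cosine_similarity(query, vector))
        for point_id, vector in points.items()
        if point_id not in seeds
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:limit]


# Toy 2-dimensional "embeddings" with hypothetical IDs.
toy_points = {
    "biden-1": [1.0, 0.0],
    "biden-2": [0.9, 0.2],
    "trump-1": [0.6, 0.8],
    "sports-1": [0.0, 1.0],
}

recs = recommend(toy_points, positive_ids=["biden-1"], negative_ids=["sports-1"], limit=2)
rec_ids = [point_id for point_id, _ in recs]
print(rec_ids)
```

With no negative examples, the query is simply the average of the positive seeds, which corresponds to the negative=None call shown earlier.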
