Domain Adaptation of a Large Language Model

Adapt a pre-trained LLM to a new domain using the HuggingFace library

Mina Ghashami · Towards Data Science · Nov 2023

Large language models (LLMs) like BERT are usually pre-trained on general-domain corpora such as Wikipedia and BookCorpus. When we apply them to more specialized domains like medicine, there is often a drop in performance compared to models adapted for those domains.

In this article, we will explore how to adapt a pre-trained LLM like DeBERTa base to the medical domain using the HuggingFace library. Specifically, we will cover an effective technique called continued pre-training, where we do further pre-training of the LLM on text from our target domain. This adapts the model to the new domain and improves its performance.

This is a simple yet effective technique for tuning LLMs to your domain and gaining significant improvements in downstream task performance.
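Before walking through the steps in detail, here is a minimal sketch of what this continued pre-training step can look like with HuggingFace Transformers. The model checkpoint, toy corpus, and hyperparameters below are illustrative assumptions rather than the article's actual setup; the point is simply that we keep training DeBERTa with its original masked-language-modeling objective, but on in-domain text.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Illustrative model choice: DeBERTa base, continued with its original MLM objective.
model_name = "microsoft/deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Toy stand-in for the real in-domain corpus built during data preparation below.
corpus = Dataset.from_dict({
    "text": [
        "<patient>name:John, surname:Doer, patientID:1234, age:34</patient>",
        "<patient>name:Jane, surname:Smith, patientID:5678, age:52</patient>",
    ]
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens so the model keeps learning the MLM objective on domain text.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="deberta-medical",    # hypothetical output directory
    num_train_epochs=1,              # illustrative hyperparameters only
    per_device_train_batch_size=2,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```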

Let’s get started.

The first step in any such project is to prepare the data. Since our dataset is from the medical domain, it contains the following fields, among many others:

[Image by author: a sample of the dataset's fields]

Listing all the fields here is impractical, as there are many. But even this glimpse into the existing fields helps us form the input sequence for an LLM.
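For illustration, a record-based dataset like this could be loaded and inspected with the HuggingFace datasets library. The file name and format below are hypothetical, since the article's exact data source is not shown here:

```python
from datasets import load_dataset

# Hypothetical example: assume the medical records live in a CSV file
# with one structured record per row.
dataset = load_dataset("csv", data_files="medical_records.csv", split="train")

# Inspect the available fields before deciding how to serialize them into text.
print(dataset.column_names)   # e.g. ['name', 'surname', 'patientID', 'age', ...]
print(dataset[0])             # one raw record as a dict
```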

The first point to keep in mind is that the input has to be a single sequence, because LLMs read their input as text sequences.

To form a record into a sequence, we can inject special tags that tell the LLM what piece of information comes next. Consider the following example: <patient>name:John, surname:Doer, patientID:1234, age:34</patient>. Here, <patient> is a special tag that tells the LLM that what follows is information about a patient.

So we form the input sequence as follows:

[Image by author: the input sequence format with injected tags]

As you see, we have injected four tags:

  1. <patient> </patient>: to contain…
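Putting this together, below is a minimal sketch of serializing one record into a tagged sequence. Only the <patient> tag comes from the example above; the other tags and field names are hypothetical stand-ins for the remaining fields:

```python
# A minimal sketch of serializing one record into a tagged text sequence.
# Only <patient> appears in the article's example; <symptoms>, <diagnosis>,
# and <treatment> are hypothetical stand-ins for the remaining tags.
def record_to_sequence(record: dict) -> str:
    patient = (
        f"<patient>name:{record['name']}, surname:{record['surname']}, "
        f"patientID:{record['patientID']}, age:{record['age']}</patient>"
    )
    symptoms = f"<symptoms>{record['symptoms']}</symptoms>"
    diagnosis = f"<diagnosis>{record['diagnosis']}</diagnosis>"
    treatment = f"<treatment>{record['treatment']}</treatment>"
    return patient + symptoms + diagnosis + treatment


example = {
    "name": "John", "surname": "Doer", "patientID": 1234, "age": 34,
    "symptoms": "persistent cough, fever",
    "diagnosis": "acute bronchitis",
    "treatment": "rest, fluids, bronchodilator",
}
print(record_to_sequence(example))
```

Since tags like <patient> are not part of the pre-trained vocabulary, a common follow-up (not shown in the excerpt above) is to register them with tokenizer.add_tokens(...) and call model.resize_token_embeddings(len(tokenizer)) so the model learns embeddings for them during continued pre-training.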
