Adapt a pre-trained model to a new domain using HuggingFace
Large language models (LLMs) like BERT are usually pre-trained on general-domain corpora such as Wikipedia and BookCorpus. When we apply them to more specialized domains such as medicine, they often perform worse than models adapted to those domains.
In this article, we will explore how to adapt a pre-trained LLM such as DeBERTa base to the medical domain using the HuggingFace Transformers library. Specifically, we will cover a technique called intermediate pre-training, in which we continue pre-training the LLM on data from our target domain. This adapts the model to the new domain and improves its performance.
It is a simple way to tune an LLM to your domain, and it can yield significant improvements in downstream task performance.
Let’s get started.
The first step in any project is to prepare the data. Since our dataset is in the medical domain, it contains the following fields and many more:
Listing every field here is impractical because there are so many. But even this glimpse of the existing fields helps us form the input sequence for an LLM.
The first point to keep in mind is that the input has to be a sequence, because LLMs read their input as text sequences.
To turn the fields into a sequence, we can inject special tags that tell the LLM what piece of information is coming next. Consider the following example:
<patient>name:John, surname:Doer, patientID:1234, age:34</patient>
Here, <patient> is a special tag that tells the LLM that what follows is information about a patient.
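As a minimal sketch of how a tagged sequence like the one above could be built from a record, the snippet below wraps key:value pairs in an opening and a closing tag. The helper name `wrap_in_tag` and the comma-separated `key:value` serialization are assumptions made for illustration, not necessarily the preprocessing used in this article; only the <patient> example itself comes from the text.

```python
from typing import Mapping


def wrap_in_tag(tag: str, fields: Mapping[str, object]) -> str:
    """Serialize fields as 'key:value' pairs and wrap them in <tag>...</tag>.

    Hypothetical helper for illustration only.
    """
    # Join the fields in insertion order, separated by ", ".
    body = ", ".join(f"{key}:{value}" for key, value in fields.items())
    # Surround the serialized fields with the opening and closing tag.
    return f"<{tag}>{body}</{tag}>"


# Example record mirroring the patient example in the text.
patient = {"name": "John", "surname": "Doer", "patientID": 1234, "age": 34}
sequence = wrap_in_tag("patient", patient)
print(sequence)
# prints: <patient>name:John, surname:Doer, patientID:1234, age:34</patient>
```

The same helper could be reused for other groups of fields by passing a different tag name, so that each block of information in the sequence is introduced by its own tag.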
So we form the input sequence as follows:
As you can see, we have injected four tags:
<patient> </patient>: to contain…