The research papers on GIT models explain that a strong vision encoder is used while random parameters are adopted for the language model. This time, since the goal is ultimately to use a 7B-class language model, a pre-trained model will be applied to the language model as well. The following modules will be examined for fine-tuning. The GIT Projection, being a newly initialized module, is always included. Some combinations may seem redundant, but they are explored without much concern in this trial.
Modules selected for training have their gradients enabled, while gradients are disabled for all other modules.
```python
# Specify the parameters to train (training everything would increase memory usage)
for name, p in model.model.named_parameters():
    if any(k in name for k in keys_finetune):
        p.requires_grad = True
    else:
        p.requires_grad = False
```
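The loop above can be sanity-checked by counting which parameters remain trainable. The sketch below uses a toy stand-in for the full model; the module names and `keys_finetune` contents are illustrative only:

```python
import torch.nn as nn

# Toy stand-in for the real model: a projection module and a language-model module
model = nn.ModuleDict({
    "git_projection": nn.Linear(8, 4),
    "lm": nn.Linear(4, 4),
})

keys_finetune = ["git_projection"]  # illustrative; real keys match the actual module names

for name, p in model.named_parameters():
    p.requires_grad = any(k in name for k in keys_finetune)

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the projection's weight and bias remain trainable
```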
The Vision Encoder and LLM used for this examination are:
Training uses the COCO dataset and lasts for 5 epochs.
Here are the target modules trained during each experiment:
- Proj: GIT Projection. Initialized randomly, so it’s always trained.
- LoRA: Applied to the Query, Key, and Value projections of the self-attention in the language model.
- OPT: All layers were trained.
- ViT: All layers were trained.
- Head: The final lm_head of OPT was trained.
(Note: LoRA can also be applied to the ViT, but to avoid making the experiments too complicated, it wasn’t included this time.)
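For reference, the low-rank update that LoRA adds to each attention projection can be sketched in plain PyTorch. This is a minimal illustration of the mechanism, not the actual library implementation used in the experiments; the rank and scaling values are arbitrary:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen Linear layer with a trainable low-rank update:
    y = W x + (B A) x * (alpha / r). Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for q in self.base.parameters():
            q.requires_grad = False  # freeze the pre-trained projection
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

# Wrapping a (hypothetical) attention projection
layer = LoRALinear(nn.Linear(16, 16))
out = layer(torch.randn(2, 16))
```

Because `lora_B` starts at zero, the wrapped layer initially behaves exactly like the frozen original, which is what makes LoRA safe to bolt onto a pre-trained LLM.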
As shown in the training loss plot, it’s apparent that some groups are not performing well. These are the cases where OPT is included in the training. Although all experiments were conducted under fairly similar conditions, more detailed adjustments, such as the learning rate, might be necessary when fine-tuning the language model itself. The results, excluding the models where OPT was included in training, will be examined next.
Both training and validation loss decreased most with the Projection + LoRA model. Fine-tuning only the final Head layer showed nearly identical outcomes. If the ViT is also trained, the loss appears slightly higher and the results seem unstable. Even when adding LoRA while training the ViT, the loss still tends to be high. For fine-tuning with this data, using a pre-trained ViT model without updating its parameters seems to yield more stable results. The effectiveness of LoRA has been acknowledged in various places, and this experiment also makes it evident that adding LoRA to the LLM improved both training and validation loss.
Reviewing the inference results on some test data:
When training OPT itself, the results are as poor as the loss curves suggest, leaving the model at a loss for words. Additionally, when training ViT, the output makes semantic sense but describes something entirely different from the given image. However, the other results seem to capture the features of the images to some extent. For instance, the first image mentions “cat” and “banana”, and the second one identifies “traffic sign”. Comparing results with and without LoRA, the model without LoRA tends to repeat similar words, while adding LoRA makes the output slightly more natural. Training the Head results in intriguing outputs, like using “playing” instead of “eating” for the first image. While there are some unnatural elements in these results, it can be deduced that the training was successful in capturing image features.
In the earlier experiments, a slightly smaller language model, OPT-350m, was used for fine-tuning. Now the intention is to switch to a 7B-class language model. Rather than settling for OPT, stronger LLMs, LLaMA and MPT, will also be introduced.
Integrating these two models can be done in a similar fashion to OPT. Referring to the forward functions of LlamaModel and MPTModel, combine the projected image vectors with the text tokens, and change the mask from a Causal Attention Mask to GIT’s Attention Mask. One thing to note: for MPT, the mask isn’t (0, -inf) but (False, True). The subsequent processes can be implemented similarly.
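The mask swap described above can be sketched as follows. This is a rough illustration: the function name is hypothetical, and the exact shape and broadcasting each model expects is an assumption; the key point is that image tokens attend bidirectionally among themselves while text tokens attend to all image tokens plus a causal window over text:

```python
import torch

def make_git_attention_mask(num_image: int, num_text: int) -> torch.Tensor:
    """Additive GIT-style mask (0 = attend, -inf = blocked), as used for OPT/LLaMA.
    For MPT the same layout would be built with booleans (False = attend, True = blocked)."""
    total = num_image + num_text
    mask = torch.full((total, total), float("-inf"))
    # Image tokens attend bidirectionally among themselves (and never to text)
    mask[:num_image, :num_image] = 0.0
    # Text tokens attend to all image tokens
    mask[num_image:, :num_image] = 0.0
    # Text tokens attend causally within the text positions
    text = torch.zeros(num_text, num_text)
    text[torch.triu(torch.ones(num_text, num_text), diagonal=1).bool()] = float("-inf")
    mask[num_image:, num_image:] = text
    return mask

mask = make_git_attention_mask(num_image=3, num_text=2)
```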
To use the 7B-class model with OPT, merely change the model name from facebook/opt-350m to facebook/opt-6.7b.
For LLaMA, with the availability of LLaMA2, that will be the model of choice. To use this pre-trained model, approvals from both Meta and Hugging Face are needed. An account is necessary for Hugging Face, so make sure to set that up. Approvals typically come within a few hours. Afterwards, log into Hugging Face on the terminal where training is executed.
You can log in using the token created in Hugging Face account → Settings → Access Token.
The training setup is mostly unchanged: the COCO dataset is used again, this time for 3 epochs. Based on the results from Experiment 1, the modules set for fine-tuning were Projection + LoRA.
Let’s take a look at the results.
Reviewing the loss, it’s apparent that the models using LLaMA2 and MPT as LLM show a more satisfactory reduction. Let’s also observe the inference results.
Regarding the first image, for all models, the expressions seem more natural compared to OPT-350m. There are no bizarre expressions like “a banana with a banana”, highlighting the strength of LLM. For the second image, there’s still some difficulty with phrases like “a traffic light” or “a building”. For such complex images, there might be a need to consider upgrading the ViT model.
Finally, let’s run inference on images that became popular with GPT-4.
Although fluent responses were anticipated since an LLM is in use, the outcomes are quite simple. This might be because the model was trained solely on COCO.
Given the underwhelming results of the previous experiment, it was decided to incorporate data other than COCO for training. The M3IT dataset is quite comprehensive, and it provides a significant amount of data in the same format as COCO.
The plan is to use data from this source, excluding the “Chinese” and “Video” categories. Originally, the COCO training dataset contained 566,747 examples. By combining it with additional sources, this increased to 1,361,650. Although the size has roughly doubled, the dataset is believed to be of higher quality due to the increased diversity of tasks.
Handling multiple PyTorch datasets can be easily achieved using ConcatDataset.
```python
dataset_list = [
    datasets.load_dataset("MMInstruction/M3IT", i) for i in m3it_name_list
]
train_dataset = torch.utils.data.ConcatDataset([d["train"] for d in dataset_list])
```
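As a minimal illustration of how ConcatDataset behaves, the toy TensorDatasets below stand in for the actual COCO and M3IT splits: lengths simply add up, and indexing walks through the constituent datasets in order.

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

a = TensorDataset(torch.zeros(3, 2))   # stand-in for the COCO split
b = TensorDataset(torch.ones(5, 2))    # stand-in for an added M3IT task

combined = ConcatDataset([a, b])

print(len(combined))   # total length is the sum of the parts
print(combined[3])     # index 3 falls past dataset a, so it comes from b
```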
The training was conducted for 1 epoch, using the LLaMA2 model and fine-tuning the Projection and LoRA, as in Experiment 2.
As there’s no loss to compare to this time, let’s dive straight into the inference results.
Along with solving simple problems, the model now handles more complex challenges. By adding datasets for tasks more intricate than just captioning, the capabilities have expanded significantly. Achieving this level of accuracy with only 1 epoch of training was surprising.
Let’s test it with the following example image. Given the increased variety in the dataset, the way the questions were presented was slightly modified.
While the description “Umbrella” is still weird, it feels like it’s getting better. To improve further, one could increase the number of training epochs, add more types or volumes of datasets, and leverage a more powerful ViT or LLM. Nonetheless, it’s impressive that such a model could be developed in just half a day given the computational and data resources.