Researchers from China Introduce ImageBind-LLM: A Multi-Modality Instruction Tuning Method of Large Language Models (LLMs) via ImageBind

Researchers have recently made significant progress in instruction tuning of large language models (LLMs). ChatGPT and GPT-4 are general-purpose conversational systems that follow human instructions in language and vision, yet they remain unreproducible because of their closed-source constraint. In response, Alpaca, LLaMA-Adapter, and related efforts propose to adapt the publicly available LLaMA into language instruction models using self-generated data. LLaVA, LLaMA-Adapter, and others integrate visual understanding capabilities into LLMs for image-conditioned generation, accomplishing image instruction tuning.

Despite the success of current instruction tuning techniques, an LLM that follows broad multimodality instructions, spanning text, image, audio, 3D point clouds, and video, is still lacking. The authors of this study from Shanghai Artificial Intelligence Laboratory, CUHK MMLab, and vivo AI Lab introduce ImageBind-LLM, a multimodality instruction-following model that efficiently fine-tunes LLaMA under the guidance of the joint embedding space of the pre-trained ImageBind. As shown in Figure 1, their ImageBind-LLM (b) can respond to input instructions of numerous modalities beyond images, unlike earlier visual instruction models (a), demonstrating promising extensibility and generalization capacity.

Thanks to ImageBind’s image-aligned multimodality embedding space, they propose using only vision-language data for multimodality instruction tuning. For an image-caption pair, they first extract the global image feature with ImageBind’s frozen image encoder, then transform the embedding with a learnable bind network. The transformed image feature is subsequently added to the word tokens at every transformer layer of LLaMA, creating the visual context for generating the corresponding textual caption. In contrast to the zero-initialized attention in the LLaMA-Adapter series, their visual injection mechanism is simpler and is weighted by a trainable, zero-initialized gating factor.
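The injection mechanism described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: `bind_network` stands in for the learnable bind network (here just one linear projection), the dimensions are toy sizes rather than real ImageBind/LLaMA sizes, and all names are hypothetical.

```python
import numpy as np

def bind_network(image_feat, W):
    """Hypothetical bind network: a single linear projection that maps an
    ImageBind image embedding into the language model's hidden dimension.
    (The real bind network is a learnable multi-layer module.)"""
    return W @ image_feat

def inject_visual_context(word_tokens, image_feat, W, gate):
    """Add the projected image feature to every word token, scaled by a
    trainable gating factor. Because the gate starts at zero, training
    begins from the unmodified language model."""
    visual = bind_network(image_feat, W)   # shape: (d_model,)
    return word_tokens + gate * visual     # broadcast over all tokens

# Toy dimensions for illustration only.
d_bind, d_model, seq_len = 4, 6, 3
rng = np.random.default_rng(0)
tokens = rng.normal(size=(seq_len, d_model))  # word tokens in one layer
img = rng.normal(size=d_bind)                 # frozen ImageBind image feature
W = rng.normal(size=(d_model, d_bind))

out0 = inject_visual_context(tokens, img, W, gate=0.0)  # zero-init: no-op
out = inject_visual_context(tokens, img, W, gate=0.3)   # gate opens with training
```

With the gate at zero the injection leaves the tokens untouched, which is what lets the multimodal conditioning be introduced gradually without disturbing LLaMA's original language understanding.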

In this way, as training progresses, the instruction cues from ImageBind’s multimodality embeddings are gradually introduced into LLaMA without interfering with its original language understanding. After this basic vision-language training, ImageBind-LLM acquires the ability to follow instructions of diverse modalities, using ImageBind for modality-specific encoding of text, image, audio, and video. For instructions in 3D domains, they use the pre-trained 3D encoder in Point-Bind to encode the input 3D point clouds. They also propose a training-free visual cache approach for embedding augmentation during inference, which addresses the modality gap between image-only training and text-, audio-, 3D-, or video-conditioned generation.

Figure 1 compares ImageBind-LLM’s multi-modality instruction model with prior visual instruction models. In contrast to earlier efforts [1-3] that are conditioned exclusively on the image modality, ImageBind-LLM performs universal multi-modality instruction tuning for image, text, audio, video, and 3D.

The cache model comprises millions of image features from the training datasets, extracted by ImageBind; it enhances text/audio/3D/video embeddings by retrieving similar visual features, in the spirit of Tip-Adapter. As a result, textual responses to multimodal instructions are of higher quality. They test ImageBind-LLM’s multimodality instruction-following capabilities in various scenarios and consistently find it to perform well.
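The training-free cache retrieval can be sketched as a simple nearest-neighbor lookup in the shared embedding space. This is a hedged, simplified sketch: the function names, the blend weight `alpha`, and the toy cache below are assumptions, and the real cache holds millions of ImageBind image features.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis (cosine-similarity space)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cache_augment(query, cache, k=3, alpha=0.5):
    """Training-free cache augmentation (a Tip-Adapter-style sketch):
    retrieve the k image features most similar to the query embedding
    and blend their mean back into the query, reducing the modality gap
    between image-only training and other-modality inference."""
    q = normalize(query)
    c = normalize(cache)
    sims = c @ q                   # cosine similarity to every cached image
    topk = np.argsort(sims)[-k:]   # indices of the k most similar features
    retrieved = c[topk].mean(axis=0)
    return normalize(alpha * q + (1 - alpha) * retrieved)

# Toy usage: augment a hypothetical audio embedding at inference time.
rng = np.random.default_rng(1)
cache = rng.normal(size=(100, 8))  # stand-in for cached ImageBind image features
audio_emb = rng.normal(size=8)     # e.g. an ImageBind audio embedding
aug = cache_augment(audio_emb, cache, k=3)
```

The augmented embedding stays unit-norm and is pulled toward visually similar cached features, so it looks more like the image embeddings the bind network was trained on.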

Overall, their ImageBind-LLM demonstrates the four qualities listed below.

• Multi-modality Instructions. Unlike earlier language-only and image instruction models, ImageBind-LLM is tuned to respond to general multimodality inputs, such as image, text, audio, 3D point clouds, and video, as well as their embedding-space arithmetic as represented by ImageBind and Point-Bind.

• Efficient Tuning. During training, they freeze ImageBind’s image encoder and adjust partial weights in LLaMA using parameter-efficient approaches such as LoRA and bias-norm tuning. They additionally train the zero-initialized gating factors and the bind network.

• Attention-free Zero-initialized Injection. Instead of introducing additional instruction signals through attention layers, they add the multimodality conditions directly to all word tokens of LLaMA and employ a learnable gating mechanism for progressive knowledge injection, which is simpler and more efficient.

• Cross-modality Cache Retrieval. They propose a visual cache model built from image features extracted by ImageBind, which performs cross-modality retrieval for embedding augmentation to address the modality gap between training (single image) and inference (many modalities).
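The parameter-efficient tuning mentioned in the bullets above can be illustrated with a minimal LoRA-style layer. This is a generic sketch of the LoRA technique, not the paper's code: the class name, rank, and initialization scale are assumptions.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: a frozen pre-trained weight W0 plus a trainable
    low-rank update B @ A. Only A and B (r * (d_in + d_out) parameters) are
    tuned, mirroring the parameter-efficient fine-tuning of LLaMA described
    above, while W0 stays frozen."""
    def __init__(self, W0, r=2, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = W0                                     # frozen weight
        self.A = rng.normal(size=(r, W0.shape[1])) * 0.01  # small random init
        self.B = np.zeros((W0.shape[0], r))              # zero init: no change at start

    def __call__(self, x):
        return (self.W0 + self.B @ self.A) @ x

# Toy usage: at initialization the layer matches the frozen model exactly.
rng = np.random.default_rng(2)
W0 = rng.normal(size=(6, 4))
x = rng.normal(size=4)
layer = LoRALinear(W0, r=2)
y = layer(x)
```

Because `B` is zero-initialized, fine-tuning starts from the pre-trained model's behavior, the same design principle as the zero-initialized gating factor used for visual injection.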

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


Aneesh Tickoo

Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.

