Chameleon: The Next Evolution in Mixed-Modal Modeling
Meta's paper "Chameleon: Mixed-Modal Early-Fusion Foundation Models" is an intriguing advancement in multimodal machine learning.
Read the paper here: https://arxiv.org/html/2405.09818v1
Unlike previous models such as Idefics, GPT-4V, and Flamingo, which rely on modality-specific encoders and connectors, Chameleon adopts a unified, early-fusion approach: both images and text are represented as discrete tokens in a single sequence. Here's a breakdown of its implementation and the insights behind it:
Implementation:
- Tokenizers: Chameleon employs two tokenizers: a dedicated image tokenizer that encodes a 512 × 512 image into 1,024 tokens drawn from an 8,192-entry codebook, and a Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 65,536 that includes the image codebook tokens (a toy sketch of this unified token space follows the list below).
- Decoder Architecture: Chameleon's decoder is based on Llama 2 and incorporates query-key normalization (QK-Norm) and a reordering of layer norms to keep training stable in mixed-modal settings (see the attention sketch after this list).
- Pretraining: Conducted in two stages. The first stage, roughly 80% of training, is fully unsupervised and mixes text-only data (2.9T tokens), text-image pairs, and interleaved text/image data. The…
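To make the tokenizer setup concrete, here is a minimal, hypothetical sketch of how image codes and text tokens could share one vocabulary. The offset scheme, sentinel token ids, and the random stand-in for the VQ image tokenizer are my own illustrative assumptions, not Meta's actual implementation; only the sizes (8,192 image codes, 1,024 tokens per 512 × 512 image, 65,536 total vocabulary) come from the paper.

```python
# Hypothetical sketch of Chameleon-style unified tokenization.
# Names, offsets, and sentinel ids below are illustrative assumptions.
import random

TEXT_VOCAB_SIZE = 65_536        # BPE vocabulary (includes image codes)
IMAGE_CODEBOOK_SIZE = 8_192     # discrete image codebook
TOKENS_PER_IMAGE = 1_024        # a 512x512 image -> 1024 codes
# Assume the image codes occupy a reserved block at the top of the vocab.
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE - IMAGE_CODEBOOK_SIZE

BOI_TOKEN = 1  # hypothetical "begin of image" sentinel id
EOI_TOKEN = 2  # hypothetical "end of image" sentinel id


def tokenize_image(image) -> list[int]:
    """Stand-in for a VQ image tokenizer: returns 1024 codebook indices."""
    # A real tokenizer would quantize patch embeddings against the codebook;
    # here we draw random codes just to illustrate the shapes involved.
    return [random.randrange(IMAGE_CODEBOOK_SIZE) for _ in range(TOKENS_PER_IMAGE)]


def to_unified_ids(text_ids: list[int], image) -> list[int]:
    """Interleave BPE text ids with image codes mapped into the shared vocab."""
    image_ids = [IMAGE_TOKEN_OFFSET + c for c in tokenize_image(image)]
    return text_ids + [BOI_TOKEN] + image_ids + [EOI_TOKEN]


sequence = to_unified_ids(text_ids=[101, 204, 57], image=None)
print(len(sequence), max(sequence) < TEXT_VOCAB_SIZE)  # 1029 True
```

Once everything lives in one token space, the decoder never needs to know which positions came from pixels and which came from text: it is a single autoregressive stream.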
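The QK-Norm idea mentioned above is easy to show in code. Below is a minimal PyTorch sketch of self-attention with query-key normalization: queries and keys are normalized per head before the dot product, which bounds the attention logits and helps stabilize training. The dimensions, the use of LayerNorm, and the module layout are my assumptions for illustration, not Meta's exact configuration.

```python
# Minimal PyTorch sketch of query-key normalization (QK-Norm) in self-attention.
# Dimensions and the LayerNorm choice are illustrative, not Meta's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # QK-Norm: normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)           # the QK-Norm step
        attn = F.scaled_dot_product_attention(q, k, v)  # standard attention
        return self.out(attn.transpose(1, 2).reshape(b, t, d))


x = torch.randn(2, 16, 256)  # (batch, tokens, model dim)
print(QKNormAttention(dim=256, num_heads=8)(x).shape)  # torch.Size([2, 16, 256])
```

The motivation is that mixed-modal sequences push token norms apart across modalities, so keeping the query-key dot products in a controlled range matters more than in text-only training.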