Chameleon: The Next Evolution in Mixed-Modal Modeling
Meta's paper "Chameleon: Mixed-Modal Early-Fusion Foundation Models" is an intriguing advancement in multimodal machine learning.
Read the paper here: https://arxiv.org/html/2405.09818v1
Unlike previous models such as Idefics, GPT-4V, and Flamingo, which rely on modality-specific encoders and connectors, Chameleon adopts a unified, early-fusion approach: both images and text are represented as discrete tokens in a single sequence. Here's a breakdown of its implementation and the insights behind it:
Implementation:
- Tokenizers: Chameleon employs two tokenizers: a dedicated image tokenizer that encodes a 512 × 512 image into 1,024 tokens drawn from an 8,192-entry codebook, and a Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 65,536 that includes the image codebook tokens (a toy sketch of this unified token space follows the list below).
- Decoder Architecture: Chameleon's decoder is based on Llama 2 and incorporates query-key normalization (QK-Norm) and a reordering of layer norms to keep training stable in mixed-modal settings (see the attention sketch after this list).
- Pretraining: Conducted in two stages. The first stage, roughly 80% of training, is fully unsupervised and mixes text-only data (2.9T tokens), text-image pairs, and interleaved text/image data. The…
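To make the tokenizer setup concrete, here is a minimal, hypothetical sketch of how image codes and text tokens could share one vocabulary. The offset scheme, sentinel token ids, and the random stand-in for the VQ image tokenizer are my own illustrative assumptions, not Meta's actual implementation; only the sizes (8,192 image codes, 1,024 tokens per 512 × 512 image, 65,536 total vocabulary) come from the paper.

```python
# Hypothetical sketch of Chameleon-style unified tokenization.
# Names, offsets, and sentinel ids below are illustrative assumptions.
import random

TEXT_VOCAB_SIZE = 65_536        # BPE vocabulary (includes image codes)
IMAGE_CODEBOOK_SIZE = 8_192     # discrete image codebook
TOKENS_PER_IMAGE = 1_024        # a 512x512 image -> 1024 codes
# Assume the image codes occupy a reserved block at the top of the vocab.
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE - IMAGE_CODEBOOK_SIZE

BOI_TOKEN = 1  # hypothetical "begin of image" sentinel id
EOI_TOKEN = 2  # hypothetical "end of image" sentinel id


def tokenize_image(image) -> list[int]:
    """Stand-in for a VQ image tokenizer: returns 1024 codebook indices."""
    # A real tokenizer would quantize patch embeddings against the codebook;
    # here we draw random codes just to illustrate the shapes involved.
    return [random.randrange(IMAGE_CODEBOOK_SIZE) for _ in range(TOKENS_PER_IMAGE)]


def to_unified_ids(text_ids: list[int], image) -> list[int]:
    """Interleave BPE text ids with image codes mapped into the shared vocab."""
    image_ids = [IMAGE_TOKEN_OFFSET + c for c in tokenize_image(image)]
    return text_ids + [BOI_TOKEN] + image_ids + [EOI_TOKEN]


sequence = to_unified_ids(text_ids=[101, 204, 57], image=None)
print(len(sequence), max(sequence) < TEXT_VOCAB_SIZE)  # 1029 True
```

Once everything lives in one token space, the decoder never needs to know which positions came from pixels and which came from text: it is a single autoregressive stream.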
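The QK-Norm idea mentioned above is easy to show in code. Below is a minimal PyTorch sketch of self-attention with query-key normalization: queries and keys are normalized per head before the dot product, which bounds the attention logits and helps stabilize training. The dimensions, the use of LayerNorm, and the module layout are my assumptions for illustration, not Meta's exact configuration.

```python
# Minimal PyTorch sketch of query-key normalization (QK-Norm) in self-attention.
# Dimensions and the LayerNorm choice are illustrative, not Meta's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # QK-Norm: normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)           # the QK-Norm step
        attn = F.scaled_dot_product_attention(q, k, v)  # standard attention
        return self.out(attn.transpose(1, 2).reshape(b, t, d))


x = torch.randn(2, 16, 256)  # (batch, tokens, model dim)
print(QKNormAttention(dim=256, num_heads=8)(x).shape)  # torch.Size([2, 16, 256])
```

The motivation is that mixed-modal sequences push token norms apart across modalities, so keeping the query-key dot products in a controlled range matters more than in text-only training.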