Composed image retrieval (CIR) aims to retrieve target images given a reference image
and a modification text. Recent CIR methods leverage large pretrained vision-language models
(VLMs) and achieve strong performance on general-domain concepts such as color and texture. However,
they still struggle in application domains like fashion, where the rich and diverse vocabulary
requires fine-grained, domain-specific vision and language understanding.
An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant
annotations, owing to the high cost of manual annotation by specialists.
To address these challenges, we introduce FACap, a large-scale,
automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images
and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate
accurate and detailed modification texts.
We then propose FashionBLIP-2, a new CIR model that fine-tunes the
general-domain BLIP-2 model on FACap with lightweight adapters and
multi-head query-candidate matching to better capture fine-grained, fashion-specific information.
FashionBLIP-2 is evaluated with and without additional fine-tuning on
the Fashion IQ benchmark and on enhFashionIQ, an enhanced evaluation dataset
built with our annotation pipeline to obtain higher-quality annotations. Experimental results show that
combining FashionBLIP-2 with pretraining on
FACap significantly improves fashion CIR performance,
especially for retrieval with fine-grained modification texts, demonstrating the value of our
dataset and approach in highly demanding settings such as e-commerce websites.