FACap: A Large-Scale Fashion Dataset for Fine-grained Composed Image Retrieval

¹Louis Vuitton

²Inria, École normale supérieure, CNRS, PSL Research University

³Courant Institute of Mathematical Sciences and Center for Data Science, New York University

Abstract

The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent CIR methods leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle in application domains like fashion, where the rich and diverse vocabulary requires fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the high cost of manual annotation by specialists.

To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts.

We then propose FashionBLIP-2, a new CIR model that fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better capture fine-grained fashion-specific information.

FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and on enhFashionIQ, an enhanced evaluation dataset built with our annotation pipeline to obtain higher-quality annotations. Experimental results show that combining FashionBLIP-2 with pretraining on FACap significantly improves performance on fashion CIR, especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in highly demanding settings such as e-commerce websites.

FACap - Dataset presentation

Our dataset FACap consists of 227,680 (reference image, modification text, target image) triplets which can be used to train a model on the CIR task in the fashion domain. The image pairs are built from existing fashion datasets, and the modification texts are generated with a custom pipeline using a large vision-language model and a large language model.

The resulting dataset is larger than the Fashion IQ dataset, with more detailed modification texts and comparable faithfulness, as our comparative quality evaluation shows.
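As a rough illustration of the two-stage annotation scheme described above, below is a minimal Python sketch: a VLM first captions both images of a pair, then an LLM turns the pair of captions into a modification text. The vlm_caption and llm_complete callables, the prompt, and the CIRTriplet fields are hypothetical placeholders for illustration, not the exact models, prompts, or schema used to build FACap.

# Minimal sketch of a two-stage VLM + LLM annotation pipeline for CIR triplets.
# `vlm_caption` and `llm_complete` are hypothetical placeholders standing in for
# whatever captioning VLM and instruction-following LLM are used; they are NOT
# the exact models or prompts behind FACap.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CIRTriplet:
    reference_image: str    # path or URL of the reference fashion image
    target_image: str       # path or URL of the target fashion image
    modification_text: str  # text describing how to go from reference to target

def build_triplet(
    reference_image: str,
    target_image: str,
    vlm_caption: Callable[[str], str],   # stage 1: image -> detailed caption
    llm_complete: Callable[[str], str],  # stage 2: prompt -> modification text
) -> CIRTriplet:
    # Stage 1: describe each image in detail with the VLM.
    ref_caption = vlm_caption(reference_image)
    tgt_caption = vlm_caption(target_image)

    # Stage 2: ask the LLM to express the difference as an edit instruction.
    prompt = (
        "Reference garment: " + ref_caption + "\n"
        "Target garment: " + tgt_caption + "\n"
        "Write a short instruction that turns the reference into the target."
    )
    modification_text = llm_complete(prompt)
    return CIRTriplet(reference_image, target_image, modification_text)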

FashionBLIP-2 - Method

Global pipeline

The input images are encoded by a pre-trained image encoder equipped with adapter modules, then further processed by a Q-Former module. The similarity between the two resulting representations is computed by our dedicated matching module (detailed in the Matching module section below).

FashionBLIP-2 global pipeline.
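The following PyTorch sketch shows one way to wire such a pipeline together, under assumed interfaces: image_encoder, adapter, qformer, and matching stand in for the frozen vision encoder, the lightweight adapters, the Q-Former, and the matching module; names, signatures, and tensor shapes are illustrative assumptions, not the actual FashionBLIP-2 implementation.

# Minimal PyTorch sketch of the global pipeline, under assumed interfaces.
# Tensor shapes in the comments (B = batch, N = image tokens, Q = query tokens,
# D = feature dim) are illustrative only.
import torch
import torch.nn as nn

class FashionCIRPipeline(nn.Module):
    def __init__(self, image_encoder: nn.Module, adapter: nn.Module,
                 qformer: nn.Module, matching: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder  # frozen pre-trained vision encoder
        self.adapter = adapter              # lightweight trainable adapters
        self.qformer = qformer              # Q-Former producing query tokens
        self.matching = matching            # multi-head query-candidate matching

    def encode_query(self, ref_image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.adapter(self.image_encoder(ref_image))  # (B, N, D)
        return self.qformer(feats, text_tokens)               # (B, Q, D)

    def encode_candidate(self, cand_image: torch.Tensor) -> torch.Tensor:
        feats = self.adapter(self.image_encoder(cand_image))  # (B, N, D)
        return self.qformer(feats)                             # (B, Q, D)

    def forward(self, ref_image, text_tokens, cand_image) -> torch.Tensor:
        query = self.encode_query(ref_image, text_tokens)
        candidate = self.encode_candidate(cand_image)
        return self.matching(query, candidate)                 # (B,) similarity scores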

Matching module

The number of tokens and the token dimensionality are reduced by token mixing and channel mixing, respectively. The final similarity score is the sum of the cosine similarities between corresponding pairs of reduced query and candidate vectors.

FashionBLIP-2 matching module.
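Below is a minimal PyTorch sketch of this matching step, assuming 32 input tokens of dimension 768 reduced to 8 paired vectors of dimension 128. The head count, reduction sizes, and the choice to share the mixing weights between query and candidate are illustrative assumptions, not the exact FashionBLIP-2 configuration.

# Minimal sketch of the matching module: token mixing reduces the number of
# tokens, channel mixing reduces their dimensionality, and the score is the
# sum of cosine similarities between paired reduced vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingModule(nn.Module):
    def __init__(self, num_tokens: int = 32, dim: int = 768,
                 num_heads: int = 8, head_dim: int = 128):
        super().__init__()
        # Token mixing: mixes along the token axis, num_tokens -> num_heads vectors.
        self.token_mix = nn.Linear(num_tokens, num_heads)
        # Channel mixing: mixes along the feature axis, dim -> head_dim.
        self.channel_mix = nn.Linear(dim, head_dim)

    def reduce(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, num_tokens, dim)
        x = self.token_mix(x.transpose(1, 2)).transpose(1, 2)  # (B, num_heads, dim)
        return self.channel_mix(x)                              # (B, num_heads, head_dim)

    def forward(self, query: torch.Tensor, candidate: torch.Tensor) -> torch.Tensor:
        q = self.reduce(query)       # (B, num_heads, head_dim)
        c = self.reduce(candidate)   # (B, num_heads, head_dim)
        # Cosine similarity per paired vector, summed over heads -> (B,)
        return F.cosine_similarity(q, c, dim=-1).sum(dim=-1)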

Qualitative results

Note: for each query, the ground-truth target image is framed in green.

On the Fashion IQ dataset

On the enhFashionIQ dataset

BibTeX

@article{TBD,
  author    = {Garderes, Francois and Chen, Shizhe and Gauthier, Camille-Sovanneary and Ponce, Jean},
  title     = {FACap: A Large-Scale Fashion Dataset for Fine-grained Composed Image Retrieval},
  journal   = {TBD},
  year      = {2025},
}