FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval

1Louis Vuitton

2Inria, École normale supérieure, CNRS, PSL Research University

3Courant Institute of Mathematical Sciences and Center for Data Science, New York University

Abstract

Composed image retrieval (CIR) aims to retrieve a target image that depicts a reference image modified by a textual description. While recent vision-language models (VLMs) achieve promising CIR performance by embedding images and text into a shared space for retrieval, they often fail to reason about what to preserve and what to change. This limitation hinders interpretability and yields suboptimal results, particularly in fine-grained domains like fashion.

In this paper, we introduce FIRE-CIR, a model that brings compositional reasoning and interpretability to fashion CIR. Instead of relying solely on embedding similarity, FIRE-CIR performs question-driven visual reasoning: it automatically generates attribute-focused visual questions derived from the modification text, and verifies the corresponding visual evidence in both reference and candidate images.

To train such a reasoning system, we automatically construct a large-scale fashion-specific visual question answering dataset, containing questions requiring either single- or dual-image analysis. During retrieval, our model leverages this explicit reasoning to re-rank candidate results, filtering out images inconsistent with the intended modifications.

Experimental results on the Fashion IQ benchmark show that FIRE-CIR outperforms state-of-the-art methods in retrieval accuracy. It also provides interpretable, attribute-level insights into retrieval decisions.

Reasoning method

Architecture of the FIRE-CIR methodology.

Overview of the FIRE-CIR model. Left: VQA score computation. The modification text is decomposed into a set of visual questions about the candidate image, which a VQA model then answers. The predicted answers are used to compute the VQA score, which measures how relevant the candidate image is to the modification text. The questions and their answers provide interpretable insight into the relevance assessment of each image. Right: CIR inference. Given the CIR query and a set of candidate images, FIRE-CIR computes the VQA score of each candidate image and combines it with the score returned by another CIR method to refine the ranking of the retrieved images.
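The re-ranking step described in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The overview does not give the exact scoring formula, so the sketch assumes that the VQA score is the fraction of questions whose predicted answer matches the answer implied by the modification text, and that the two scores are combined by a weighted sum with a hypothetical weight `alpha`. The names `VisualQuestion`, `vqa_score`, `rerank`, and `answer_fn` are illustrative, and a lookup table stands in for the VQA model. For brevity, the sketch only asks questions about the candidate image.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VisualQuestion:
    """Hypothetical container for one attribute-focused question."""
    text: str             # e.g. "Is the dress sleeveless?"
    expected_answer: str  # answer implied by the modification text


def vqa_score(
    questions: Sequence[VisualQuestion],
    answer_fn: Callable[[str, str], str],
    candidate_id: str,
) -> float:
    """Assumed score: fraction of questions answered as expected."""
    if not questions:
        return 0.0
    hits = sum(
        answer_fn(candidate_id, q.text).strip().lower()
        == q.expected_answer.strip().lower()
        for q in questions
    )
    return hits / len(questions)


def rerank(
    base_scores: dict[str, float],
    questions: Sequence[VisualQuestion],
    answer_fn: Callable[[str, str], str],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Assumed combination: weighted sum of base CIR score and VQA score."""
    combined = {
        cid: (1.0 - alpha) * score + alpha * vqa_score(questions, answer_fn, cid)
        for cid, score in base_scores.items()
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy stand-in for a VQA model: answers are looked up, not predicted.
toy_answers = {
    ("img_a", "Is the dress sleeveless?"): "yes",
    ("img_a", "Is the dress red?"): "no",
    ("img_b", "Is the dress sleeveless?"): "yes",
    ("img_b", "Is the dress red?"): "yes",
}


def toy_answer_fn(candidate_id: str, question: str) -> str:
    return toy_answers[(candidate_id, question)]


questions = [
    VisualQuestion("Is the dress sleeveless?", "yes"),
    VisualQuestion("Is the dress red?", "yes"),
]
# Made-up scores from some other CIR method, for illustration only.
base_scores = {"img_a": 0.8, "img_b": 0.7}

ranking = rerank(base_scores, questions, toy_answer_fn, alpha=0.5)
print(ranking)
```

In this toy example, `img_b` starts with the lower base score but satisfies both questions, so it moves ahead of `img_a` after re-ranking. The per-question answers are what make the decision inspectable at the attribute level.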

Qualitative results

Note: for each query, the ground-truth target image is framed in green.

Results overview

Qualitative example on Fashion IQ dress.

Detailed results

Inference speed

This method makes it possible to trade retrieval performance against inference speed.

Trade-off between performance and inference speed.

BibTeX

@inproceedings{garderes2026fire,
  title={FIRE-CIR: Fine-grained Reasoning for Composed Fashion Image Retrieval},
  author={Gard{\`e}res, Fran{\c{c}}ois and Gauthier, Camille-Sovanneary and Ponce, Jean and Chen, Shizhe},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5694--5703},
  year={2026}
}