Composed image retrieval (CIR) aims to retrieve target images given a reference image
and a modification text. Recent CIR methods leverage large pretrained vision-language models
(VLMs) and achieve strong performance on general-domain concepts such as color and texture. However,
they still struggle in application domains like fashion, where the rich and diverse vocabulary
requires fine-grained, domain-specific vision and language understanding.
An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant
annotations, owing to the high cost of manual annotation by specialists.
To address these challenges, we introduce FACap, a large-scale,
automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images
and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate
accurate and detailed modification texts.
We then propose FashionBLIP-2, a new CIR model that fine-tunes the
general-domain BLIP-2 model on FACap with lightweight adapters and
multi-head query-candidate matching to better capture fine-grained, fashion-specific information.
FashionBLIP-2 is evaluated with and without additional fine-tuning on
the Fashion IQ benchmark and on enhFashionIQ, an enhanced evaluation dataset
built with our annotation pipeline to obtain higher-quality annotations. Experimental results show that
combining FashionBLIP-2 with pretraining on
FACap significantly improves fashion CIR performance,
especially for retrieval with fine-grained modification texts, demonstrating the value of our
dataset and approach in highly demanding settings such as e-commerce websites.