Learning User Preferences for Image Generation Models

1Renmin University of China, 2iN2X
*Equal contribution

Figure 1: Our task is to predict target images that align with users' tastes based on their historical data.

Abstract

User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste.

To address these limitations, we propose an approach built upon Multimodal Large Language Models (MLLMs), introducing a contrastive preference loss and learnable preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users.

Extensive experiments demonstrate that our model outperforms other methods in preference prediction accuracy, effectively identifies users with similar aesthetic inclinations, and provides more precise guidance for generating images that align with individual tastes.

Method

Figure 2: Overview of our MLLM-based preference learning framework.

(a) The visual encoder and text embedding module extract preference representations \( x_u^{+/-} \) by processing the preference history \( \mathcal{S} \) and a target item \( z_{\text{pos/neg}} \).

(b) The framework is trained using a base loss \( L_{\text{base}} \) to predict preference labels, and a contrastive preference loss \( L_{\text{CP}} \) that enhances separability between liked and disliked items. Additionally, learnable preference tokens \( P_v \) are introduced to model shared user interests.
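To make the two objectives concrete, here is a minimal PyTorch-style sketch, assuming a linear preference head over the MLLM features and a margin-based stand-in for the contrastive term; the names (preference_head, margin, lam) and the exact loss form are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def preference_losses(x_pos, x_neg, preference_head, margin=1.0, lam=0.5):
    """Toy version of the two objectives sketched in Figure 2(b).

    x_pos, x_neg    -- MLLM features of the liked / disliked target item,
                       each conditioned on the user's history, shape (B, D)
    preference_head -- e.g. torch.nn.Linear(D, 1), producing a like logit
    """
    s_pos = preference_head(x_pos).squeeze(-1)   # (B,) liked-item logits
    s_neg = preference_head(x_neg).squeeze(-1)   # (B,) disliked-item logits

    # L_base: predict the preference label (like = 1, dislike = 0).
    logits = torch.cat([s_pos, s_neg])
    labels = torch.cat([torch.ones_like(s_pos), torch.zeros_like(s_neg)])
    l_base = F.binary_cross_entropy_with_logits(logits, labels)

    # L_CP (stand-in): a margin loss that pushes each liked item's score
    # above the matching disliked item's, improving their separability.
    l_cp = F.relu(margin - (s_pos - s_neg)).mean()

    return l_base + lam * l_cp
```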

Figure 3: Mining Similar Users via Attention Mechanism.

(a) Attention scores \(\mathcal{A}\) capture interactions between the preference tokens and the target-image tokens for individual users. Each user has a unique reference history, and the same target image is appended to each user's input sequence. For each user, the horizontal axis indexes target-image tokens and the vertical axis indexes preference tokens; each user is shown with five different random re-orderings of the reference images.

(b) Examples of images liked (✓) or disliked ( × ) by each user.
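As a rough illustration of how the attention maps in Figure 3 could be turned into a user-similarity measure, the sketch below pools preference-token-to-image-token attention into a single vector per user and compares users by cosine similarity; the pooling and indexing scheme are assumptions made for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def user_signature(attn, pref_idx, img_idx):
    """attn: attention weights from one forward pass, shape (heads, L, L).
    pref_idx / img_idx: positions of the preference tokens and of the
    target-image tokens in the input sequence (index tensors or lists)."""
    a = attn.mean(0)[pref_idx][:, img_idx]   # preference -> image attention
    return F.normalize(a.flatten(), dim=0)   # one signature vector per user

def user_similarity(attn_u, attn_v, pref_idx, img_idx):
    """Cosine similarity of two users' signatures for the same target image;
    higher values suggest more similar tastes (cf. Figure 3)."""
    return torch.dot(user_signature(attn_u, pref_idx, img_idx),
                     user_signature(attn_v, pref_idx, img_idx)).item()
```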

Experiment Results

User-Specific Preference Prediction

Table 1: Preference classification accuracy on pairwise comparisons between liked and disliked images.
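For reference, the pairwise accuracy reported in Table 1 can be computed as in the short sketch below, assuming the model exposes a scalar preference score for a (history, image) pair; score_fn is a hypothetical wrapper around such a scorer.

```python
def pairwise_accuracy(score_fn, pairs):
    """pairs: list of (history, liked_image, disliked_image) triplets.
    score_fn(history, image) -> scalar preference score (hypothetical API).
    A pair counts as correct when the liked image outranks the disliked one."""
    correct = sum(score_fn(h, liked) > score_fn(h, disliked)
                  for h, liked, disliked in pairs)
    return correct / len(pairs)
```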

Personalizing Generation with User Preferences

Figure 4: Qualitative comparison of text-to-image generation for three users. Each row shows the user's reference preferences (Ref-dislike/like) and generation results from our personalized preference model versus models based on image-text alignment (CLIP Score), aesthetic quality (Aesthetic Score), general human preference (ImageReward, PickScore), and personalized preference (ViPer). Images are generated under preference-model guidance, incorporating both positive and negative user feedback.
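One simple way to plug such a preference model into generation, shown below as a best-of-N reranking sketch, is to sample several candidates and keep the one the user-conditioned scorer ranks highest; generator and score_fn are assumed interfaces, and this generic recipe is not necessarily the guidance procedure used to produce Figure 4.

```python
def generate_personalized(prompt, history, generator, score_fn, n_candidates=8):
    """Best-of-N reranking with a personalized preference scorer.

    generator(prompt) -> image          (placeholder text-to-image model)
    score_fn(history, image) -> float   (user-conditioned preference score;
                                         history holds the user's liked and
                                         disliked reference images)
    """
    candidates = [generator(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda img: score_fn(history, img))
```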

Figure 5: Human expert evaluation of generated images from different methods on SD1.5-Turbo.