GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

CVPR 2024

* Equal contribution
1 The Chinese University of Hong Kong, 2 Stanford University, 3 Adobe Research, 4 S-Lab, Nanyang Technological University, 5 Shanghai Artificial Intelligence Laboratory

Abstract

Despite recent advances in text-to-3D generative methods, there is a notable absence of reliable evaluation metrics. Existing metrics usually focus on a single criterion each, such as how well the asset aligns with the input text. These metrics lack the flexibility to generalize to different evaluation criteria and might not align well with human preferences. Conducting user preference studies is an alternative that offers both adaptability and human-aligned results. User studies, however, can be very expensive to scale. This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models. To this end, we first develop a prompt generator using GPT-4V to generate evaluating prompts, which serve as input to compare text-to-3D models. We further design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria. Finally, we use these pairwise comparison results to assign these models Elo ratings. Experimental results suggest our metric strongly aligns with human preference across different evaluation criteria.
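To make the last step concrete, here is a minimal sketch of how pairwise comparison outcomes can be turned into Elo ratings. It uses the textbook online Elo update, which may differ from the fitting procedure used in the paper. The function names, the K-factor of 32, the initial rating of 1000, and the model names in the usage note are illustrative assumptions, not values taken from this work.

```python
def expected_score(r_a, r_b, scale=400.0):
    """Probability that a model rated r_a beats a model rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def update_elo(ratings, comparisons, k=32.0, initial=1000.0):
    """Apply the standard online Elo update over a sequence of pairwise outcomes.

    ratings:     dict mapping model name -> current rating (may be empty)
    comparisons: iterable of (model_a, model_b, score_a), where score_a is
                 1.0 if model_a wins, 0.0 if it loses, and 0.5 for a tie
    k, initial:  assumed hyperparameters, chosen here only for illustration
    """
    ratings = dict(ratings)  # do not mutate the caller's dict
    for model_a, model_b, score_a in comparisons:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        # The two updates are symmetric, so the total rating is conserved.
        ratings[model_a] = r_a + k * (score_a - e_a)
        ratings[model_b] = r_b - k * (score_a - e_a)
    return ratings
```

For example, `update_elo({}, [("model_x", "model_y", 1.0)])` moves two previously unrated (hypothetical) models apart symmetrically around the initial rating. Because the online update depends on the order of comparisons, a practical setup might shuffle and average over several passes, or fit all ratings jointly.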

Motivation

GPT-4V (or other MLLMs) can understand 3D content via multi-view images as input.

GPT-4V Caption: Intricately detailed steampunk apparatus, primarily of mechanical design nature, appearing three-dimensional. With worn metallic and glassy texture. Showcasing a central clock face and multiple gauges, and accentuated by pipes, gears, and levers. Crafted mainly from aged bronze and accented with glass and wood. Intended for time display and possible atmospheric measurements, and is static. Exhibiting a Victorian steampunk style, set in an industrial workshop environment with a nostalgic and inventive mood & atmosphere.

GPT-4V Caption: Detailed potted plant on a rugged terrain, primarily of organic and naturalistic structure, appearing full and lifelike. Showcasing a vibrant green plant with yellow flowers and accompanied by smaller pink blossoms, and accentuated by a scattering of pebbles and rocks. Crafted mainly from digital textures mimicking natural materials and accented with subtle shading. Intended for environmental visualization and is static. Exhibiting a contemporary and natural aesthetic, set in an outdoor-like setting with a serene and peaceful atmosphere.

Prompt Distribution

Controllable prompt generator. More complex or more creative prompts often lead to a more challenging evaluation setting. Our prompt generator can produce prompts with various levels of creativity and complexity. This allows us to examine text-to-3D models’ performance in different cases more efficiently.

Different Complexity Levels

A sleeping cat.

A large, multi-layered, symmetrical wedding cake, with smooth fondant, delicate piping, and lifelike sugar flowers in full bloom, displayed on a silver stand.

A solid, symmetrical, smooth stone fountain, with water cascading over its edges into a clear, circular pond surrounded by blooming lilies, in the center of a sunlit courtyard.

Different Creativity Levels

Orange monarch butterfly resting on a dandelion.

A dancing elephant.

Frog with a translucent skin displaying a mechanical heart beating.

Method Overview

We create a customizable instruction template that contains necessary information for GPT-4V to conduct comparison tasks for two 3D assets. We complete this template with different evaluation criteria, input 3D images, and random seeds to create the final 3D-aware prompts for GPT-4V. GPT-4V will then consume these inputs to output its assessments. Finally, we assemble GPT-4V’s answers to create a robust final estimate of the task.
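The pipeline described above can be sketched roughly as follows: fill an instruction template with a criterion, build several perturbed queries from a random seed, and combine the answers by majority vote. This is only an illustrative sketch. The template wording, the left/right swap used as the perturbation, the answer vocabulary (`left`, `right`, `tie`), and all function names are assumptions, and the actual call to GPT-4V is left out.

```python
import random
from collections import Counter

# Hypothetical instruction template; the wording used in the paper may differ.
TEMPLATE = (
    "Two 3D assets were generated for the prompt: '{text_prompt}'.\n"
    "Compare them according to this criterion: {criterion}.\n"
    "Answer with exactly one word: 'left', 'right', or 'tie'."
)


def build_queries(text_prompt, criterion, views_a, views_b, num_queries=3, seed=0):
    """Create several perturbed queries for one pair of assets.

    views_a / views_b are lists of rendered multi-view images (any objects).
    The perturbation assumed here is a random swap of which asset is shown
    on the left, so that a position bias in the answers can cancel out.
    """
    rng = random.Random(seed)
    queries = []
    for _ in range(num_queries):
        swapped = rng.random() < 0.5
        left, right = (views_b, views_a) if swapped else (views_a, views_b)
        queries.append({
            "instruction": TEMPLATE.format(text_prompt=text_prompt, criterion=criterion),
            "left": left,
            "right": right,
            "swapped": swapped,
        })
    return queries


def aggregate(queries, answers):
    """Majority vote over the answers, mapped back to asset 'a' or 'b'."""
    votes = Counter()
    for query, answer in zip(queries, answers):
        if answer == "tie":
            votes["tie"] += 1
        else:
            picked_left = answer == "left"
            # If the pair was swapped, the left slot holds asset b.
            votes["b" if picked_left == query["swapped"] else "a"] += 1
    ranked = votes.most_common()
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return "tie"  # no answers, or no clear majority
    return ranked[0][0]
```

In use, each query's instruction and images would be sent to the multimodal model, and the returned answers passed to `aggregate` in the same order as the queries. Varying the seed and the criterion yields different sets of queries from the same template.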

Examples

BibTeX

@inproceedings{wu2023gpteval3d,
   author = {Tong Wu and Guandao Yang and Zhibing Li and Kai Zhang and 
      Ziwei Liu and Leonidas Guibas and Dahua Lin and Gordon Wetzstein},
   title = {GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation},
   booktitle = {CVPR},
   year = {2024},
}
