3D Magic Mirror: Clothing Reconstruction from a Single Image via a Causal Perspective

Abstract

This paper aims to study a self-supervised 3D clothing reconstruction method, which recovers the geometric shape and texture of human clothing from a single image. Compared with existing methods, we observe that three primary challenges remain: (1) 3D ground-truth meshes of clothing are usually inaccessible due to annotation difficulty and time cost; (2) conventional template-based methods are limited in modeling non-rigid objects, e.g., handbags and dresses, which are common in fashion images; (3) inherent ambiguity compromises model training, such as the dilemma between a large shape with a remote camera and a small shape with a close camera. To address these limitations, we propose a causality-aware self-supervised learning method that adaptively reconstructs 3D non-rigid objects from 2D images without 3D annotations. In particular, to resolve the inherent ambiguity among four implicit variables, i.e., camera position, shape, texture, and illumination, we introduce an explainable structural causal map (SCM) to build our model. The proposed model structure follows the spirit of the causal map, explicitly considering the prior template in camera estimation and shape prediction. During optimization, the causal intervention tool, i.e., two expectation-maximization (EM) loops, is deeply embedded in our algorithm to (1) disentangle the four encoders and (2) refine the prior template. Extensive experiments on two 2D fashion benchmarks (ATR and Market-HQ) show that the proposed method yields high-fidelity 3D reconstructions. Furthermore, we verify the scalability of the proposed method on a fine-grained bird dataset, i.e., CUB.

Locations

  • arXiv (Cornell University)

Summary

Reconstructing the 3D shape and texture of clothing from a single 2D image presents significant challenges, primarily due to the scarcity of 3D ground-truth data, the highly deformable nature of garments (unlike rigid human bodies), and inherent ambiguities where multiple 3D interpretations could project to the same 2D image (e.g., a large object far away or a small object close up).

A novel self-supervised learning approach addresses these limitations by adopting a causal perspective to disentangle the underlying factors responsible for a 2D image: camera viewpoint, clothing shape, texture, and illumination. The core innovation lies in designing the model structure and its training strategy based on an explainable structural causal map (SCM). This map explicitly defines the dependencies between these implicit 3D attributes and guides the architecture to employ four independent encoders—one for each attribute.
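
To make the dependency structure concrete, the SCM described here can be written down as a simple parent table. The node names below and the edges from the template into camera and shape are one illustrative reading of the summary, not the paper's exact notation.

```python
# Hypothetical encoding of the structural causal map (SCM) described above.
# Node and edge names are illustrative; they are not taken from the authors' code.
SCM_PARENTS = {
    "camera":       ["image", "template"],   # the template helps resolve scale/distance ambiguity
    "shape":        ["image", "template"],   # final shape = prior template + predicted offsets
    "texture":      ["image"],
    "illumination": ["image"],
    "rendered":     ["camera", "shape", "texture", "illumination"],  # generative direction
}

def topological_order(parents):
    """Return nodes in an order where every parent precedes its children."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for p in parents.get(node, []):
            visit(p)
        order.append(node)
    for node in parents:
        visit(node)
    return order

print(topological_order(SCM_PARENTS))
# ['image', 'template', 'camera', 'shape', 'texture', 'illumination', 'rendered']
```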

Crucially, this framework rethinks the role of the prior 3D template. Unlike previous methods, which either keep the template fixed or use it only as a base for offsets, here the template is explicitly fed as an input to both the camera-pose and shape encoders. This aligns with human intuition: knowing a person’s general size helps estimate their distance and pose. Furthermore, the prior template itself is adaptively updated during training, starting from a simple ellipsoid, making it highly flexible.
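
A minimal sketch of this template conditioning is given below, assuming a PyTorch-style setup: an ellipsoid provides the initial prior template, and a small encoder fuses its vertex coordinates with image features before predicting either the camera pose or per-vertex offsets. The layer sizes, the ellipsoid sampling, and the fusion-by-concatenation scheme are illustrative assumptions, not the paper's exact Integration Module.

```python
import math
import torch
import torch.nn as nn

def ellipsoid_template(n_theta=16, n_phi=32, radii=(0.3, 0.6, 0.2)):
    """Sample vertices of an axis-aligned ellipsoid as the initial prior template."""
    a, b, c = radii
    verts = []
    for i in range(1, n_theta):                       # skip the poles for simplicity
        theta = math.pi * i / n_theta
        for j in range(n_phi):
            phi = 2 * math.pi * j / n_phi
            verts.append([a * math.sin(theta) * math.cos(phi),
                          b * math.cos(theta),
                          c * math.sin(theta) * math.sin(phi)])
    return torch.tensor(verts)                        # (V, 3)

class TemplateFusionEncoder(nn.Module):
    """One attribute encoder that fuses image features with the prior template's coordinates."""
    def __init__(self, out_dim, feat_dim=64):
        super().__init__()
        self.image_net = nn.Sequential(nn.Conv2d(3, feat_dim, 3, 2, 1), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.template_net = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, out_dim)

    def forward(self, image, template):
        img_feat = self.image_net(image)                         # (B, feat_dim)
        tpl_feat = self.template_net(template).mean(dim=0)       # pooled over vertices
        fused = torch.cat([img_feat, tpl_feat.expand_as(img_feat)], dim=1)
        return self.head(fused)

template = ellipsoid_template()                       # adaptive prior, starts as a simple ellipsoid
V = template.size(0)
camera_encoder = TemplateFusionEncoder(out_dim=6)     # e.g. rotation + translation parameters
shape_encoder = TemplateFusionEncoder(out_dim=3 * V)  # per-vertex offsets

images = torch.rand(2, 3, 64, 64)
camera = camera_encoder(images, template)
verts = template.unsqueeze(0) + shape_encoder(images, template).view(-1, V, 3)
```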

To overcome the inherent ambiguities and enable effective disentanglement, the method introduces two causal intervention mechanisms, inspired by expectation-maximization (EM) loops, during optimization (sketched in code after the list):
1. Encoder Loop: During training, only one attribute encoder (e.g., shape) is optimized at a time, while the outputs of the other encoders (e.g., camera, texture, illumination) are held fixed. This intervention effectively "cuts" the causal links from the input image to the fixed attributes, ensuring that the loss signal specifically targets the active encoder and prevents entangled correlations between the attributes from hindering learning. This is particularly effective in resolving the shape-camera ambiguity.
2. Prototype Loop: This loop addresses the challenge of modeling non-rigid objects by disentangling the influence of the overall deformable prior template from the per-vertex shape offsets. It iteratively refines the global prior template based on the learned shape offsets, allowing the model to adaptively learn a more accurate and expressive base shape for clothing, thereby accommodating highly variable garments like dresses and handbags.
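
The sketch below shows how these two loops could be wired up in a PyTorch-style trainer: `detach()` plays the role of the causal "cut" for the non-active encoders, and the template update stands in for the prototype loop. The toy encoders, the dummy rendering loss, and the mean-offset update rule are assumptions for illustration only (the template conditioning from the earlier sketch is omitted for brevity).

```python
import torch
import torch.nn as nn

V = 480                                     # vertex count of the prior template (assumed)

def make_encoder(out_dim):
    """A tiny stand-in for one attribute encoder (each attribute gets its own parameters)."""
    return nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, out_dim))

encoders = nn.ModuleDict({
    "camera": make_encoder(6), "shape": make_encoder(3 * V),
    "texture": make_encoder(64), "light": make_encoder(9),
})
optimizers = {k: torch.optim.Adam(m.parameters(), lr=1e-4) for k, m in encoders.items()}
template = torch.randn(V, 3) * 0.01         # stands in for the adaptive prior template

def render_loss(images, preds, template):
    """Placeholder for differentiable rendering + 2D reconstruction loss."""
    verts = template.unsqueeze(0) + preds["shape"].view(-1, V, 3)
    # A real pipeline would render (verts, texture, light, camera) and compare with `images`;
    # this scalar only keeps the sketch differentiable end to end.
    return (verts.mean() + sum(p.mean() for k, p in preds.items() if k != "shape")
            - images.mean()).pow(2)

images = torch.rand(4, 3, 64, 64)           # a dummy batch of fashion images

# --- Encoder loop: optimize one causal factor at a time, "cutting" the rest ---
for active in encoders:
    preds = {}
    for name, enc in encoders.items():
        out = enc(images)
        # Detaching the non-active outputs severs the gradient path image -> attribute,
        # the code-level analogue of the causal intervention described above.
        preds[name] = out if name == active else out.detach()
    loss = render_loss(images, preds, template)
    optimizers[active].zero_grad()
    loss.backward()
    optimizers[active].step()

# --- Prototype loop: fold the average learned offset back into the prior template ---
with torch.no_grad():
    offsets = encoders["shape"](images).view(-1, V, 3)
    template = template + offsets.mean(dim=0)    # the template adapts away from the ellipsoid
```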

The foundation for this system relies on several established prior ingredients. Differentiable rendering is indispensable, enabling an end-to-end learning pipeline by translating 3D mesh properties (shape, texture, illumination, camera parameters) back into 2D images, thus allowing 2D image reconstruction losses to supervise the 3D reconstruction. Self-supervised learning principles, leveraging readily available 2D image datasets and consistency losses (like image reconstruction and adversarial losses), bypass the need for expensive 3D annotations. The attribute encoders are built upon standard Convolutional Neural Networks (CNNs) and incorporate an Integration Module that effectively fuses spatial information from the adaptive prior template with visual features extracted from the input image. Finally, standard mesh regularization techniques (e.g., Laplacian, symmetry, deformation losses) are applied to ensure geometrically plausible and stable 3D reconstructions.
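
As an illustration of the regularization side, the snippet below sketches plain-PyTorch versions of the three regularizers named above: a uniform Laplacian term, a bilateral-symmetry term, and an offset-magnitude (deformation) term. The exact formulations and loss weights in the paper may differ; in a full pipeline these terms would be added to the rendering-based reconstruction and adversarial losses.

```python
import torch

def laplacian_loss(verts, neighbors):
    """Encourage each vertex to stay close to the centroid of its neighbors (uniform Laplacian).

    verts:     (B, V, 3) predicted vertex positions
    neighbors: list of V lists with each vertex's neighbor indices (from the mesh topology)
    """
    centroids = torch.stack([verts[:, idx, :].mean(dim=1) for idx in neighbors], dim=1)  # (B, V, 3)
    return ((verts - centroids) ** 2).sum(dim=-1).mean()

def symmetry_loss(verts):
    """Penalize asymmetry about the x = 0 plane (assumes a laterally symmetric object)."""
    mirrored = verts * torch.tensor([-1.0, 1.0, 1.0])
    d = torch.cdist(verts, mirrored)                 # Chamfer-style distance to the mirror image
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def deformation_loss(offsets):
    """Keep per-vertex offsets small so the shape stays near the prior template."""
    return (offsets ** 2).sum(dim=-1).mean()

# Example usage with a dummy 4-vertex toy "mesh"; the 0.1 / 0.01 weights are placeholders.
B, V = 2, 4
verts = torch.rand(B, V, 3, requires_grad=True)
offsets = torch.rand(B, V, 3, requires_grad=True)
neighbors = [[1, 2], [0, 3], [0, 3], [1, 2]]
reg = (0.1 * laplacian_loss(verts, neighbors)
       + 0.1 * symmetry_loss(verts)
       + 0.01 * deformation_loss(offsets))
reg.backward()
```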

The result is a self-supervised system capable of high-fidelity 3D clothing reconstruction from single images, demonstrating superior performance on fashion benchmarks and exhibiting scalability to other non-rigid objects like birds. Its ability to disentangle and manipulate individual attributes like texture, shape, and camera viewpoint also enables powerful applications such as novel-view synthesis and clothing transfer.
