Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image

Jiajing Lin  Zhenzhong Wang  Shu Jiang  Yongjie Hou  Min Jiang
School of Informatics, Xiamen University
Corresponding author

Abstract

The task of 4D content generation involves creating dynamic 3D models that evolve over time in response to specific input conditions, such as images. Existing methods rely heavily on pre-trained video diffusion models to guide 4D content dynamics, but these approaches often fail to capture essential physical principles, as video diffusion models lack a robust understanding of real-world physics. Moreover, these models face challenges in providing fine-grained control over dynamics and exhibit high computational costs. In this work, we propose Phys4DGen, a novel, high-efficiency framework that generates physics-compliant 4D content from a single image with enhanced control capabilities. Our approach uniquely integrates physical simulations into the 4D generation pipeline, ensuring adherence to fundamental physical laws. Inspired by the human ability to infer physical properties visually, we introduce a Physical Perception Module (PPM) that discerns the material properties and structural components of the 3D object from the input image, facilitating accurate downstream simulations. Phys4DGen significantly accelerates the 4D generation process by eliminating iterative optimization steps in the dynamics modeling phase. It allows users to intuitively control the movement speed and direction of generated 4D content by adjusting external forces, achieving finely tunable, physically plausible animations. Extensive evaluations show that Phys4DGen outperforms existing methods in both inference speed and physical realism, producing high-quality, controllable 4D content. Our project page is available at https://jiajinglin.github.io/Phys4DGen/.

1 Introduction

The generation of 4D content is becoming increasingly valuable in a variety of fields, such as animation, gaming, and virtual reality[3]. Recent advancements in diffusion models[10] have revolutionized image generation[24, 37] and video generation[30, 4]. These models’ robust visual priors have significantly propelled progress in 4D content generation[39, 2]. Such progress has made the automated production of high-quality 4D content not only achievable but also increasingly efficient.

[Figure 1]

Generating 4D content from a single image presents a formidable challenge due to the inherent absence of both temporal and spatial information in a single image. Unlike image-to-3D generation, which focuses on spatially consistent shapes and appearances, image-to-4D generation prioritizes the creation of realistic and temporally consistent 4D dynamics. Existing methods[39, 23, 34, 35, 16] predominantly rely on pre-trained video diffusion models to capture dynamic information. Animate124[39] is a pioneering framework that adopts temporal score distillation sampling (SDS) from a video diffusion model to optimize 4D dynamics. However, Animate124 struggles to converge and has difficulty generating 4D content that aligns with the input images. To improve the efficiency and quality of generation, DreamGaussian4D[23] employs deformable 3D Gaussians[32] as the 4D representation: it generates reference videos through a video diffusion model to determine the 4D dynamics, and then uses a supervised loss and SDS from a 3D-aware diffusion model to refine the 4D representation to match the reference video. To further enhance spatial and temporal consistency, some methods[27, 35, 16] use diffusion models to pre-generate image sequences from multiple viewpoints of the reference video, offering richer reference information for optimization.

Despite these advancements, existing methods still face several critical challenges. Firstly, the reliance on video diffusion models, which often fail to effectively learn physical principles, results in 4D content that frequently violates physical laws, as illustrated in the left part of Fig. 1. Secondly, the large scale of iterative optimization results in excessively time-consuming generation processes, as shown in the right part of Fig. 1. Additionally, the inherently stochastic nature of video diffusion models often leads to uncontrollable dynamics within the generated 4D content.

To address these challenges, we introduce Phys4DGen, a physics-driven method capable of generating controllable 4D content from a single image. Our key insight is the integration of physical simulation into the 4D generation process. To achieve this, we first generate static Gaussians. Since physical simulation requires predefined physical properties, such as material types and attributes, we propose a Physical Perception Module (PPM) to accurately predict these properties for different components of the 3D objects described in the images. Finally, using the predicted physical properties and applying external forces, we perform the physical simulation to generate the 4D content.

Unlike most current methods[39, 23, 16, 27] that rely on video diffusion models to determine dynamics in 4D content, Phys4DGen applies physical simulation to ensure that the generated 4D content adheres to physical laws. In previous methods[31], 3D objects were typically treated as a whole, with the same material types and properties manually assigned for simulation. In this paper, we use PPM to automatically segment materials and assign the corresponding material types and properties, enabling more accurate and realistic simulations. Moreover, by eliminating the large iterative optimization steps during the 4D dynamics generation phase, Phys4DGen achieves rapid 4D content generation from a single image. To provide fine-grained control over the dynamics in the generated 4D content, we introduce external forces. By fine-tuning these external forces, Phys4DGen can generate 4D content that aligns with user intent. The main contributions of our work are summarized as follows:

  • We present a physics-driven image-to-4D generation framework that integrates physical simulation seamlessly into the 4D generation process, enabling the rapid and fine-grained controllable production of 4D content that adheres to physical principles.

  • We propose, for the first time, a Physical Perception Module (PPM) that infers the material types and properties of the various components of a 3D object from the input image, enabling accurate downstream simulations.

  • Qualitative and quantitative comparisons demonstrate that our method generates 4D content that is physically accurate, spatially and temporally consistent, high-fidelity, and controllable, with significantly reduced generation times.

[Figure 2]

2 Related Work

2.1 4D Generation

4D generation[40, 19, 9] aims to generate dynamic 3D content that aligns with input conditions such as text, images, and videos. Currently, most 4D generation work[2, 12, 39, 34, 7] heavily depends on diffusion models[26]. Based on the input conditions, 4D generation can be categorized into three types: text-to-4D, video-to-4D, and image-to-4D. For instance, MAV3D[25] is the first text-to-4D work; it trains HexPlane[6] using temporal SDS from a text-to-video diffusion model. 4D-fy[2] tackles the text-to-4D task by integrating various diffusion priors. However, these methods all suffer from time-consuming optimization. Instead of text input, Consistent4D[12] generates 360-degree dynamic objects from monocular video by introducing a cascade DyNeRF and an interpolation-driven consistency loss. 4DGen[34] is supervised using pseudo labels generated by a multi-view diffusion model. In practice, acquiring high-quality reference videos can be challenging. Animate124[39] pioneered an image-to-4D framework using a coarse-to-fine strategy that combines different diffusion priors. However, Animate124 struggles to generate 4D content that accurately matches the input image (see Fig. 1). DreamGaussian4D[23] avoids temporal SDS when generating 4D content from an image and instead performs optimization based on reference videos generated by a video diffusion model. Yet, generating high-quality reference videos with video diffusion models is often challenging. Compared to previous works, our framework can efficiently generate 4D content from a single image, avoiding the time-consuming optimization process during the 4D dynamics generation phase and enabling high-fidelity and controllable 4D generation.

2.2 Physical Simulation

The explicit representation of 3D Gaussian Splatting (3DGS)[13] through a set of anisotropic Gaussian kernels, which can be interpreted as particles in space, enables many applications of physical simulation[31, 5, 8, 41]. PhysGaussian[31] is the first study to apply physical simulation to given static 3D Gaussians for simulating realistic dynamic effects. Subsequently, several works combining 3DGS with physics simulation have gradually emerged. PhysDreamer[38] utilizes reference videos from video diffusion models as supervision to autonomously determine material property values. In contrast, [11, 17] use score distillation sampling from video diffusion models to optimize physical properties. [21] explores incorporating 3D perception to autonomously identify specific parts within a scene and then edit or simulate those parts. The aforementioned research has promoted the combination of 3DGS and physics simulation.

2.3 3D Perception

3D perception is crucial for robotics, autonomous driving, and physical simulation. With the significant progress in 2D scene understanding achieved by SAM[15], recent studies[14, 20, 33, 42] have attempted to distill the prior knowledge of 2D perception to achieve 3D perception. LERF[14] is the first to embed CLIP features[22] into NeRF[18], enabling open-vocabulary 3D queries. However, due to its NeRF-based representation, LERF suffers from significant limitations in speed. Recently, more work has focused on integrating 2D perception priors with 3D Gaussians to create a real-time, editable 3D scene representation. LangSplat[20] introduces hierarchical semantics (subparts, parts, and wholes) constructed using SAM, which addresses point ambiguity and facilitates scene understanding. Gaussian Grouping[33] introduces the concept of Gaussian groups and assigns an identity encoding to each Gaussian kernel to identify its respective group, achieving better editing. However, these methods are unable to perceive physical characteristics. Recently, [36] attempted to integrate NeRF with large language models to predict the physical properties of objects; however, because different materials cannot be accurately distinguished, the predictions are often imprecise. In this work, we propose a Physical Perception Module (PPM) to accurately perceive the physical characteristics of the different components of a 3D object.

3 Methodology

The proposed Phys4DGen is illustrated in Fig. 2. The framework includes three phases: 3D Gaussians Generation, Physical Perception, and 4D Dynamics Generation. In 3D Gaussians Generation, static 3D Gaussians are generated from an input image under the guidance of an image diffusion model. Subsequently, in Physical Perception, given the input image and the static 3D Gaussians, we utilize the PPM to infer and assign material types and properties to different parts of the static 3D Gaussians. Finally, in 4D Dynamics Generation, given external forces, dynamics are generated from the static 3D Gaussians with physical properties through physical simulation. Specifically, Phys4DGen enables control of the 4D dynamics, including movement speed and direction, by manipulating external forces. In the following parts, we provide details of 3D Gaussians Generation, Physical Perception, and 4D Dynamics Generation.

3.1 3D Gaussians Generation

For 3D Gaussians generation, static 3D Gaussians can be generated from a single image using any 3DGS-based image-to-3D generation method[28, 29], providing plug-and-play capability. Additionally, the quality of the 4D content improves with the quality of the static 3D Gaussians generated from the input image. We choose 3DGS for its explicit representation. 3DGS represents 3D objects using a collection of anisotropic Gaussian kernels[13], which can be interpreted as particles in space. Thus, 3D Gaussians can be viewed as a discretization of the continuum, which is highly beneficial for integrating particle-based physical simulation algorithms. In this phase, we obtain static 3D Gaussians $\boldsymbol{\mathcal{G}}^{0}=\{(\mathbf{x}_{p},\mathbf{\Sigma}_{p},\alpha_{p},\mathbf{c}_{p})\}_{p=1}^{P}$ for subsequent simulation. Here, $\mathbf{x}_{p}$, $\mathbf{\Sigma}_{p}$, $\alpha_{p}$, and $\mathbf{c}_{p}$ represent the center position, covariance matrix, opacity, and color of Gaussian kernel $p$, respectively, and $P$ denotes the total number of Gaussian kernels.
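
Since each Gaussian kernel is treated as a particle, a simple array-of-fields container is enough to carry $\boldsymbol{\mathcal{G}}^{0}$ into the later phases. The NumPy sketch below follows the paper's notation; the class and field names are our own illustration, not part of any released implementation.

```python
# Array-of-fields container for the static 3D Gaussians G^0 = {(x_p, Sigma_p, alpha_p, c_p)}.
from dataclasses import dataclass
import numpy as np

@dataclass
class StaticGaussians:
    x: np.ndarray      # (P, 3)    center positions x_p
    sigma: np.ndarray  # (P, 3, 3) covariance matrices Sigma_p
    alpha: np.ndarray  # (P,)      opacities alpha_p
    c: np.ndarray      # (P, 3)    RGB colors c_p

    @property
    def num_kernels(self) -> int:
        return self.x.shape[0]

# Example: P random kernels standing in for the output of an image-to-3D generator such as LGM.
P = 4
g0 = StaticGaussians(
    x=np.random.randn(P, 3),
    sigma=np.tile(np.eye(3) * 0.01, (P, 1, 1)),
    alpha=np.ones(P),
    c=np.random.rand(P, 3),
)
print(g0.num_kernels)  # -> 4
```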

3.2 Physical Perception

Physical simulation based on 3DGS requires predefined physical properties, such as material types and properties, for the Gaussian kernels representing the simulated objects. In practical applications, these objects typically consist of components with varying physical properties. However, previous studies[31] typically treat the simulated object as a single material type with manually assigned properties, often resulting in imprecise simulation outcomes. In this section, we propose the PPM. Our core idea is to integrate large vision and language models to first segment the simulated object into multiple material parts and then infer the material type and properties of each part. The specific process is shown in Fig. 2.

3.2.1 Material Segmentation

In practice, objects are usually made up of various materials. Accurately predicting the physical properties of different parts of a 3D object first requires precise segmentation of its materials. Specifically, we need to assign a material group $E^{id}$ to each Gaussian kernel. The user-specified input image contains richer and more accurate semantic information than rendered images. Thus, we define material groups based on the segmentation results of the input image, ensuring that each mask region corresponds to a specific material group. We propose a material segmentation method that assigns material groups from the input image to the corresponding Gaussian kernels.

Pre-process. In the 3D Gaussians Generation phase, we generate static 3D Gaussians from the input image $\mathbf{I}_{0}$; we then render images and depth from $N$ given viewpoints to produce the image sequence $\mathcal{I}=\{\mathbf{I}_{o}\}_{o=1}^{N}$. SAM[15] is a powerful foundation model for image segmentation that can accurately group pixels belonging to the same part. Given SAM's ability to distinguish various materials, we use it to segment both the input image and the rendered sequence, obtaining the segmentation map $\mathbf{M}_{0}$ for the input image and $\{\mathbf{M}_{o}\}_{o=1}^{N}$ for the rendered sequence.
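
A minimal sketch of this pre-processing step is shown below. It assumes the official segment_anything package and a locally downloaded ViT-H checkpoint; the helper name segment_views is ours, not the paper's.

```python
# Segment the input image and each rendered view with SAM's automatic mask generator.
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint path is an assumption; download the official SAM ViT-H weights beforehand.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def segment_views(images):
    """images: list of HxWx3 uint8 RGB arrays (input image I_0 plus N rendered views).
    Returns, per image, a list of boolean region masks produced by SAM."""
    all_masks = []
    for img in images:
        regions = mask_generator.generate(img)                   # list of dicts from SAM
        all_masks.append([r["segmentation"] for r in regions])   # HxW boolean masks
    return all_masks
```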

CLIP Fusion. However, the 2D segmentation maps are generated independently, lacking connections between the maps of different images. To ensure consistency with the material groups defined by the input image, we align the segmentation maps $\{\mathbf{M}_{o}\}_{o=1}^{N}$ of the rendered sequence with the input image's segmentation map $\mathbf{M}_{0}$. Specifically, we first extract the CLIP features for all mask regions in the input image and the rendered image sequence, represented as:

$\mathbf{L}_{0}(m)=\mathbf{V}(\mathbf{I}_{0}\odot\mathbf{M}_{0}(m)),$   (1)
$\mathbf{L}_{o}(k)=\mathbf{V}(\mathbf{I}_{o}\odot\mathbf{M}_{o}(k)),$   (2)

where $\mathbf{V}$ is the CLIP image encoder and $\mathbf{L}$ represents the mask CLIP embedding. $\mathbf{M}(k)$ denotes the $k$-th mask region in the specified segmentation map, where different maps may contain varying numbers of mask regions. Next, we calculate the similarity between the mask CLIP features of the input image and those of the rendered sequence. We then update the segmentation map for each rendered image, assigning each region the material group with the highest CLIP similarity score. This process is expressed as follows:

$\mathbf{M}_{o}(k)=\arg\max\{\mathbf{L}_{o}(k)\cdot\mathbf{L}_{0}(m)\}_{m}^{\mathcal{M}},$   (3)

where $\mathcal{M}$ represents the total number of material groups. In this way, we align the segmentation maps of the rendered sequence with the segmentation map of the input image.
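
The fusion step of Eqs. (1)-(3) reduces to embedding every masked region and relabeling each rendered-view region with the most similar input-image region. The sketch below assumes the OpenAI clip package; the helper names embed_region and fuse_material_groups are illustrative only.

```python
# CLIP fusion: align rendered-view mask regions with the input image's material groups.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_region(image, mask):
    """Eq. (1)/(2): CLIP embedding of image ⊙ mask. image: HxWx3 uint8, mask: HxW bool."""
    masked = (image * mask[..., None]).astype(np.uint8)     # zero out pixels outside the mask
    x = preprocess(Image.fromarray(masked)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(x)
    return torch.nn.functional.normalize(feat, dim=-1)[0]   # unit-norm embedding L

def fuse_material_groups(input_image, input_masks, view_image, view_masks):
    """Eq. (3): assign each view region the material group m maximizing L_o(k) · L_0(m)."""
    L0 = torch.stack([embed_region(input_image, m) for m in input_masks])   # (M, D)
    groups = []
    for mask_k in view_masks:
        Lk = embed_region(view_image, mask_k)                               # (D,)
        groups.append(int(torch.argmax(L0 @ Lk)))
    return groups   # material group id for each rendered-view mask region
```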

Projection and Aggregation. At this stage, we have a sequence of segmentation maps with consistent material groupings, and we need to assign a material group to each Gaussian kernel. For a given Gaussian kernel $\boldsymbol{\mathcal{G}}^{0}_{p}$ and a segmentation map $\mathbf{M}_{o}$, we use the camera's intrinsic and extrinsic parameters to project the Gaussian kernel into 2D space, obtaining its 2D coordinates $\mathbf{x}_{p}^{2d}$ on the segmentation map. This process can be expressed as:

$\mathbf{x}_{p}^{2d}=\mathbf{K}[\mathbf{R}_{o}|\mathbf{T}_{o}]\,\mathbf{x}_{p},$   (4)

where $\mathbf{K}$ and $[\mathbf{R}_{o}|\mathbf{T}_{o}]$ represent the camera's intrinsic and extrinsic parameters, respectively. We use the 3DGS-estimated depth to check whether the Gaussian kernel is visible in the segmentation map $\mathbf{M}_{o}$. After processing the segmentation maps for all viewpoints, we assign to each Gaussian kernel the material group that appears most frequently across all views. Repeating these steps allows us to determine the material groups for all Gaussian kernels, giving $\boldsymbol{\mathcal{G}}^{0}=\{(\mathbf{x}_{p},\mathbf{\Sigma}_{p},\alpha_{p},\mathbf{c}_{p},E_{p}^{id})\}_{p=1}^{P}$.
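
A simplified sketch of this projection-and-voting procedure is given below (Eq. (4) plus a plain depth-threshold visibility test). The function names, the per-pixel group_map encoding, and the 0.05 depth tolerance are all our own assumptions.

```python
# Project Gaussian centers into each view, look up the material group of the pixel
# they land on, and keep the group that wins the majority vote across views.
import numpy as np

def project(K, R, T, x):
    """Eq. (4): x (P, 3) world-space centers -> (P, 2) pixel coords and (P,) camera depths."""
    cam = x @ R.T + T                       # world -> camera frame
    uvw = cam @ K.T                         # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def assign_material_groups(x, views, image_hw, num_groups):
    """views: list of (K, R, T, group_map, depth_map); group_map is an HxW int array
    holding the material group per pixel (-1 for background). Returns (P,) group ids."""
    H, W = image_hw
    votes = np.zeros((x.shape[0], num_groups), dtype=np.int64)
    for K, R, T, group_map, depth_map in views:
        uv, z = project(K, R, T, x)
        u = uv[:, 0].round().astype(int)
        v = uv[:, 1].round().astype(int)
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z > 0)
        for p in np.where(inside)[0]:
            # crude visibility check: the kernel must sit near the rendered depth
            if abs(z[p] - depth_map[v[p], u[p]]) < 0.05 and group_map[v[p], u[p]] >= 0:
                votes[p, group_map[v[p], u[p]]] += 1
    return votes.argmax(axis=1)             # most frequent material group across views
```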

3.2.2 Material Reasoning

There is a wide variety of materials in the world, yet humans can infer their physical properties visually based on past experience. In recent years, large language models have advanced rapidly, incorporating extensive textual and multimodal knowledge and exhibiting reasoning abilities that approach or even surpass human-level understanding. Inspired by this, we leverage GPT-4o[1] for open-vocabulary semantic reasoning about material properties, including types and characteristics.

In Material Segmentation, each Gaussian kernel is assigned a material group based on the input image, where mask regions correspond directly to specific material groups. We extract sub-images from the input image according to these mask regions, which can be expressed as $\{\mathbf{I}_{0}\odot\mathbf{M}_{0}(m)\}_{m}^{\mathcal{M}}$. Following this, we pass both the full input image and the segmented sub-images into GPT-4o, prompting it to reason about the material type and properties of the material shown in each segmented sub-image. Detailed prompts are provided in the appendix. Since the segmented sub-images correspond one-to-one with the material groups, we can assign the inference results to each Gaussian kernel based on these groups, resulting in a material field represented by Gaussian kernels, denoted as $\boldsymbol{\mathcal{G}}^{0}=\{(\mathbf{x}_{p},\mathbf{\Sigma}_{p},\alpha_{p},\mathbf{c}_{p},\boldsymbol{\theta}_{p})\}_{p=1}^{P}$.

This method assigns different material types to each Gaussian kernel in space, enabling more detailed and realistic simulations. In traditional physical simulation workflows, physical properties are typically set manually by users. However, in 4D generation, users often lack expertise in physics. This approach provides users with relatively accurate reference values for material properties, assisting in the creation of 4D content.
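
For illustration, one such query could be issued as in the sketch below. It assumes the OpenAI Python client with GPT-4o; the prompt wording and the JSON fields are our own guesses, since the actual prompts used by the paper are given in its appendix.

```python
# Query GPT-4o with the full image plus one segmented sub-image and parse the answer as JSON.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def b64(path):
    return base64.b64encode(open(path, "rb").read()).decode()

def reason_material(full_image_path, subimage_path):
    prompt = (
        "The first image shows a complete object; the second shows one segmented part of it. "
        "Identify the part's material type (e.g. snow, cloth, plant, metal) and return JSON "
        "with keys: material, density_kg_m3, youngs_modulus_pa, poisson_ratio."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64(full_image_path)}"}},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64(subimage_path)}"}},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)  # e.g. {"material": "snow", ...}
```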

3.3 4D Dynamics Generation

In this work, we introduce physical simulation to drive the dynamics generation. In this part, we briefly introduce the concept of external forces. Then, we present the details of how physical simulation is integrated into our framework.

3.3.1 External Forces

In physical simulations, external forces directly affect the system's behavior. These forces, such as gravity, influence a continuum's motion and deformation. In this paper, we apply two types of external forces. The first type directly modifies the particle's velocity, such as setting the velocity of all particles to simulate translation of the continuum. The second type indirectly affects the particle's velocity by applying a force $\mathbf{f}$; for example, gravity can simulate the continuum's fall. Given a force $\mathbf{f}$, we apply Newton's second law and time integration to compute the particle's velocity at the next time step:

$\mathbf{a}_{p}^{t}=\dfrac{\mathbf{f}}{m_{p}^{t}},\quad\mathbf{v}_{p}^{t+1}=\mathbf{v}_{p}^{t}+\mathbf{a}_{p}^{t}\Delta t,$   (5)

where $\mathbf{a}_{p}^{t}$ denotes the acceleration of particle $p$ at time step $t$, $m_{p}^{t}$ its mass, and $\Delta t$ denotes the time interval between time steps $t$ and $t+1$. By adjusting the external forces, we can control the motion and deformation of the object.
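
Eq. (5) is simply an explicit (forward Euler) velocity update applied to every particle, as in the short sketch below; this is a minimal illustration, not the authors' simulator code.

```python
# Explicit Euler velocity update under an external force f (Eq. 5).
import numpy as np

def apply_external_force(v, m, f, dt):
    """v: (P, 3) particle velocities, m: (P,) particle masses, f: (3,) force, dt: time step."""
    a = f[None, :] / m[:, None]      # a_p^t = f / m_p^t
    return v + a * dt                # v_p^{t+1} = v_p^t + a_p^t * dt

# Example: a constant rightward push along +x, analogous to the force f_1 in Fig. 6.
v = np.zeros((1000, 3))
m = np.full(1000, 0.01)
v = apply_external_force(v, m, f=np.array([1.0, 0.0, 0.0]), dt=1e-3)
```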

[Figure 3]

3.3.2 Dynamics Generation

Phys4DGen can integrate any particle-based physical simulation algorithm. In this paper, we use the Material Point Method (MPM) to simulate the dynamics of 4D content. For the detailed implementation of MPM, please refer to the appendix. In particular, 3D Gaussians can be viewed as a discretization of the continuum, making it straightforward to integrate MPM. We therefore treat each Gaussian kernel as a particle of the continuum and endow it with a time property $t$ and physical properties $\boldsymbol{\theta}$, so each Gaussian kernel can be expressed as:

$\boldsymbol{\mathcal{G}}_{p}^{t}=(\mathbf{x}_{p}^{t},\mathbf{\Sigma}_{p}^{t},\alpha_{p},\mathbf{c}_{p},\boldsymbol{\theta}_{p}^{t}).$   (6)

Here, the subscript $p$ denotes a specific Gaussian kernel within the continuum. Following [31], we employ MPM to perform physical simulation on the continuum represented by the 3D Gaussians. This allows us to track the position and shape changes of each Gaussian kernel at every time step:

$\mathbf{x}^{t+1},\mathbf{F}^{t+1},\mathbf{v}^{t+1}=\text{MPMSimulation}(\boldsymbol{\mathcal{G}}^{t}),$   (7)

where $\mathbf{x}^{t+1}=\{\mathbf{x}_{p}^{t+1}\}_{p=1}^{P}$ denotes the positions of all Gaussian kernels at time step $t+1$, and $\mathbf{F}^{t+1}=\{\mathbf{F}_{p}^{t+1}\}_{p=1}^{P}$ represents the deformation gradients, which describe the local deformation of each Gaussian kernel at time step $t+1$. Intuitively, we can consider the deformation gradient as a local affine transformation applied to the Gaussian kernel. Consequently, we can derive the covariance matrix of Gaussian kernel $p$ at step $t+1$:

$\mathbf{\Sigma}_{p}^{t+1}=\mathbf{F}_{p}^{t+1}\,\mathbf{\Sigma}_{p}^{t}\,(\mathbf{F}_{p}^{t+1})^{T}.$   (8)

At each MPM simulation step, we obtain the motion and deformation information of the continuum represented by static 3D Gaussians.

This allows us to generate 4D dynamics that are consistent with physical constraints. In this phase, we have entirely eliminated the need for iterative optimization, which significantly speeds up the generation of 4D content. By controlling external forces, we can precisely manage the dynamics of 4D content generated from a single image, making our method highly controllable.
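
For reference, the covariance update of Eq. (8) can be applied to all kernels in one batched operation, as in the small NumPy sketch below; the MPM step itself, which produces the deformation gradients, is omitted here.

```python
# Batched covariance update: Sigma_p^{t+1} = F_p^{t+1} Sigma_p^t (F_p^{t+1})^T (Eq. 8).
import numpy as np

def update_covariances(Sigma, F):
    """Sigma: (P, 3, 3) covariances at step t, F: (P, 3, 3) deformation gradients at t+1."""
    return np.einsum("pij,pjk,plk->pil", F, Sigma, F)

# Example: a uniform 10% stretch along x widens every kernel along that axis.
P = 3
Sigma = np.tile(np.eye(3) * 0.01, (P, 1, 1))
F = np.tile(np.diag([1.1, 1.0, 1.0]), (P, 1, 1))
Sigma_next = update_covariances(Sigma, F)
```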

4 Experiments

4.1 Experimental Setup

Phys4DGen is compared with current state-of-the-art (SOTA) approaches, including STAG4D[35] and DG4D[23]. Following [39], we conducted qualitative and quantitative evaluations to demonstrate the effectiveness of our method. For the quantitative evaluation, we used the CLIP-T score, which is the average cosine similarity between the CLIP embeddings of every two adjacent frames of a video rendered from a given view. We use LGM[29] to generate static 3D Gaussians, with LGM settings consistent with those in the original implementation. We use the LangSplat[20] implementation to obtain part-level masks and adopt SAM ViT-H[15] as the base segmentation model. To infer physical properties, we use GPT-4o, with the specific prompts provided in the appendix. For the 4D dynamics generation phase, we employ MPM to generate dynamics, with the MPM settings following those in [31]. All experiments are performed on an NVIDIA A40 (48GB) GPU. For more detailed information on the experimental settings, please refer to the appendix.

[Figure 4]
Table 1. Quantitative comparison: per-view CLIP-T scores (front f, right r, back b, left l; higher is better) and generation time (lower is better).

Method      | STAG4D | DG4D   | Ours
CLIP-T-f ↑  | 0.9874 | 0.9922 | 0.9956
CLIP-T-r ↑  | 0.9833 | 0.9877 | 0.9926
CLIP-T-b ↑  | 0.9831 | 0.9890 | 0.9938
CLIP-T-l ↑  | 0.9845 | 0.9839 | 0.9968
Time ↓      | 6668s  | 780s   | 330s

4.2 Comparisons with State-of-the-Art Methods

[Figure 5]

Qualitative Results. The results of the perception visualization, presented in the appendix, demonstrate that PPM effectively segments materials and assigns material types and properties. Fig. 4 shows the qualitative comparison between our method and other SOTA methods. To compare spatiotemporal consistency, the rendering view changes with each time step. The text description below each input image specifies the desired dynamic effect, for example, a melting snowman. Fig. 4 clearly demonstrates that our generated 4D content achieves high fidelity and excellent spatiotemporal consistency, significantly outperforming the baseline methods. The results show that Phys4DGen can generate 4D content that follows physical laws and provides fine-grained controllability. The process of generating 4D content with STAG4D and DG4D heavily depends on the quality of the reference video. However, it is challenging to obtain a reference video from existing video diffusion models that adheres to physical laws. As shown in Fig. 4, the 4D content generated by STAG4D and DG4D is hard to control and inconsistent with physical laws.

Quantitative Results. Table 1 presents a quantitative comparison between our method and other approaches. We evaluate video quality using the CLIP-T score[22] and generation time. To further quantify spatiotemporal consistency, we rendered videos from the front, right, back, and left views and computed the CLIP-T score for each view. Table 1 shows that our approach outperforms the baseline methods on all metrics. This demonstrates that Phys4DGen generates 4D content that adheres to physical laws and ensures spatiotemporal consistency. Meanwhile, Phys4DGen generates 4D content in just 330 seconds, significantly reducing the time compared to the baseline methods.
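
For clarity, the CLIP-T metric can be computed as in the sketch below: every frame of a fixed-view rendering is embedded with CLIP and the cosine similarity of adjacent frames is averaged. This is our own re-implementation of the metric's definition (using the OpenAI clip package), not the evaluation script used in the paper.

```python
# CLIP-T: mean cosine similarity between CLIP embeddings of adjacent video frames.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_t_score(frames):
    """frames: list of PIL Images rendered from one fixed viewpoint (at least two)."""
    x = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        emb = model.encode_image(x)
    emb = torch.nn.functional.normalize(emb, dim=-1)
    sims = (emb[:-1] * emb[1:]).sum(dim=-1)   # cosine similarity of adjacent frames
    return sims.mean().item()
```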

4.3 Ablation Studies and Analysis

The Effectiveness of the Physical Perception Module. In Fig. 5, we analyze the impact of incorporating the Physical Perception Module on the generation of 4D content. Firstly, as shown in the snowman example in Fig. 5, when PPM is not used to assign distinct material types to different parts of the snowman, the simulation algorithm treats the snowman as a uniform object, simulating the melting effect across the entire structure. As seen in Fig. 5, the scarf melts along with the snowman's body, which clearly contradicts physical reality. PPM accurately segments materials and assigns their respective material types, thereby enabling more precise simulations. Secondly, as demonstrated by the pink rose in Fig. 5, when randomly generated material property values are used, the results lack physical realism. Users of 4D generation often lack specialized knowledge of physics, making it challenging to assign material properties accurately. PPM automates the assignment of realistic property values to materials, enabling users to achieve more autonomous and physically consistent 4D generation.

[Figure 6]

The Ability for Fine-grained Control of 4D Dynamics. Fig. 6 illustrates the 4D content generated under various external forces, showcasing the fine-grained controllability of Phys4DGen. In the first row, an external force $\mathbf{f}_{1}=(1.0,0.0,0.0)$, directed to the right along the x-axis, is applied, causing the orange rose to move to the right. In the second row, a larger force $\mathbf{f}_{2}=(2.0,0.0,0.0)$ is applied in the same direction, resulting in more intense motion. This demonstrates that Phys4DGen can control the strength of motion by adjusting the external force. In the third row, a leftward force $\mathbf{f}_{3}=(-1.0,0.0,0.0)$, with the same magnitude as in the first row, is applied along the x-axis. As a result, the orange rose moves left under this external force, demonstrating that Phys4DGen can control the direction of motion. To summarize, Phys4DGen allows fine-grained control of the dynamics in 4D content by adjusting external forces.

5 Conclusion

In this paper, we have introduced Phys4DGen, a novel, fast, physics-driven framework for generating 4D content from a single image. By integrating physical simulation directly into the 4D generation process, Phys4DGen ensures that the generated 4D content adheres to natural physical laws. We further propose a Physical Perception Module (PPM) to infer the physical properties of different parts of the 3D objects depicted in images, enabling more precise simulations and the generation of more physically realistic 4D content. To achieve controllable 4D generation, Phys4DGen incorporates external forces, allowing precise manipulation of the dynamics, such as movement speed and direction. Furthermore, by removing the need for iterative optimization in the 4D dynamics generation phase, Phys4DGen reduces 4D content generation time. Our extensive experiments show that Phys4DGen generates high-fidelity, physically accurate 4D content while significantly reducing generation time.

References

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4D-fy: Text-to-4D generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024.
[3] Jason Bailey. The tools of generative art, from Flash to neural networks. Art in America, 8:1, 2020.
[4] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[5] Junhao Cai, Yuji Yang, Weihao Yuan, Yisheng He, Zilong Dong, Liefeng Bo, Hui Cheng, and Qifeng Chen. Gaussian-informed continuum for physical property identification and simulation. arXiv preprint arXiv:2406.14927, 2024.
[6] Ang Cao and Justin Johnson. HexPlane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
[7] Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. DreamScene4D: Dynamic multi-object scene generation from monocular videos. arXiv preprint arXiv:2405.02280, 2024.
[8] Yutao Feng, Xiang Feng, Yintong Shang, Ying Jiang, Chang Yu, Zeshun Zong, Tianjia Shao, Hongzhi Wu, Kun Zhou, Chenfanfu Jiang, et al. Gaussian Splashing: Dynamic fluid synthesis with Gaussian splatting. arXiv preprint arXiv:2401.15318, 2024.
[9] Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, and Ulrich Neumann. GaussianFlow: Splatting Gaussian dynamics for 4D content creation. arXiv preprint arXiv:2403.12365, 2024.
[10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[11] Tianyu Huang, Yihan Zeng, Hui Li, Wangmeng Zuo, and Rynson W. H. Lau. DreamPhysics: Learning physical properties of dynamic 3D Gaussians with video diffusion priors. arXiv preprint arXiv:2406.01476, 2024.
[12] Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, and Yao Yao. Consistent4D: Consistent 360° dynamic object generation from monocular video. arXiv preprint arXiv:2311.02848, 2023.
[13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
[14] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. LERF: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19729–19739, 2023.
[15] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[16] Hanwen Liang, Yuyang Yin, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N. Plataniotis, Yao Zhao, and Yunchao Wei. Diffusion4D: Fast spatial-temporal consistent 4D generation via video diffusion models. arXiv preprint arXiv:2405.16645, 2024.
[17] Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3D: Learning physical properties of 3D Gaussians via video diffusion. arXiv preprint arXiv:2406.04338, 2024.
[18] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[19] Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Fast dynamic 3D object generation from a single-view video. arXiv preprint arXiv:2401.08742, 2024.
[20] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. LangSplat: 3D language Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024.
[21] Ri-Zhao Qiu, Ge Yang, Weijia Zeng, and Xiaolong Wang. Feature Splatting: Language-driven physics-based scene synthesis and editing. arXiv preprint arXiv:2404.01223, 2024.
[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[23] Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. DreamGaussian4D: Generative 4D Gaussian splatting. arXiv preprint arXiv:2312.17142, 2023.
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[25] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4D dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.
[26] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[27] Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Shengming Yin, Wengang Zhou, Jing Liao, and Houqiang Li. EG4D: Explicit generation of 4D object without score distillation. arXiv preprint arXiv:2405.18132, 2024.
[28] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. arXiv preprint arXiv:2309.16653, 2023.
[29] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. arXiv preprint arXiv:2402.05054, 2024.
[30] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. LaVie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023.
[31] Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. PhysGaussian: Physics-integrated 3D Gaussians for generative dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024.
[32] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024.
[33] Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian Grouping: Segment and edit anything in 3D scenes. In European Conference on Computer Vision, pages 162–179. Springer, 2025.
[34] Yuyang Yin, Dejia Xu, Zhangyang Wang, Yao Zhao, and Yunchao Wei. 4DGen: Grounded 4D content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225, 2023.
[35] Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. STAG4D: Spatial-temporal anchored generative 4D Gaussians. arXiv preprint arXiv:2403.14939, 2024.
[36] Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, and Shenlong Wang. Physical property understanding from language-embedded feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28296–28305, 2024.
[37] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[38] Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y. Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T. Freeman. PhysDreamer: Physics-based interaction with 3D objects via video generation. arXiv preprint arXiv:2404.13026, 2024.
[39] Yuyang Zhao, Zhiwen Yan, Enze Xie, Lanqing Hong, Zhenguo Li, and Gim Hee Lee. Animate124: Animating one image to 4D dynamic scene. arXiv preprint arXiv:2311.14603, 2023.
[40] Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Otmar Hilliges, and Shalini De Mello. A unified approach for text- and image-guided 4D scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7300–7309, 2024.
[41] Licheng Zhong, Hong-Xing Yu, Jiajun Wu, and Yunzhu Li. Reconstruction and simulation of elastic objects with spring-mass 3D Gaussians. arXiv preprint arXiv:2403.09434, 2024.
[42] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3DGS: Supercharging 3D Gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.