Previously, I was a visiting researcher in the Computer Vision and Geometry Group (CVG) at ETH Zurich, advised by Prof. Marc Pollefeys.
My research goal is to enable machines (robots) to perceive, localize, reconstruct, reason about, and interact with the real world as humans do, a step toward AGI.
I'm interested in 3D vision foundation models, world models, physical world simulators, and embodied AI, with a particular focus on correspondence, 3D/4D reconstruction, rendering, generation, and robotic manipulation.
Proposed a hierarchical approach to generalizable 3D Gaussian Splatting that constructs hierarchical 3D Gaussians via a coarse-to-fine strategy, significantly enhancing reconstruction quality and cross-dataset generalization.
Proposed a novel distributed framework for efficient Gaussian reconstruction of sparse-view vast scenes, leveraging a feed-forward Gaussian model for fast inference and a global alignment algorithm to ensure geometric consistency.
Proposed scalable and consistent text-to-panorama generation with spherical epipolar-aware diffusion.
Established large-scale panoramic video-text datasets with corresponding depth and camera poses.
Achieved long-term, consistent, and diverse panoramic scene generation given unseen text and camera poses with SOTA performance.
MeshAnything mimics human artists in extracting meshes from any 3D representation. It can be combined with various 3D asset production pipelines, such as 3D reconstruction and generation, to convert their results into Artist-Created Meshes that can be seamlessly applied in the 3D industry.
Proposed a flow-based large diffusion-transformer foundation model for transforming text into any modality (image, video, 3D, audio, music, etc.), resolution, and duration.
Proposed a concise, elegant, and robust SfM pipeline with point tracking that produces smooth camera trajectories and dense point clouds from casual monocular videos.
Proposed the first cloud-edge-device hierarchical framework with federated learning for large-scale, high-fidelity surface reconstruction in a distributed manner, achieving a balance between high-precision reconstruction and low memory cost.
Proposed a model for photorealistic rendering and efficient high-fidelity surface reconstruction without any pretrained priors, outperforming 3DGS-based (SuGaR, 2DGS, Gaussian Opacity Fields, etc.) and SDF-based methods on T&T, DTU, etc., with faster training (e.g., ours takes only 1 hour vs. 128+ hours for Neuralangelo).
Based on PGSR, proposed the DynaSurfGS framework, which enables real-time photorealistic rendering and dynamic high-fidelity surface reconstruction, achieving smooth surfaces with meticulous geometry.
Based on PGSR, proposed photorealistic rendering and efficient high-fidelity large-scale surface reconstruction in a divide-and-conquer manner with an LOD structure, outperforming Neuralangelo.
Based on PGSR, proposed photorealistic rendering and efficient high-fidelity large-scale surface reconstruction for urban street scenes with free camera trajectories, outperforming F2NeRF.
Identified two main factors in SDF-based approaches that degrade surface quality and proposed a two-stage neural surface reconstruction framework without any pretrained priors, achieving faster training (only 18 GPU-hours) and high-fidelity surface reconstruction with fine-grained details, outperforming Neuralangelo on T&T, ScanNet++, etc.
Proposed Normal Deflection fields to represent the angular deviation between scene normals and prior normals, achieving smooth surfaces with fine-grained structures and outperforming MonoSDF.
Showed that point cloud observations, i.e., explicit 3D information, matter for robot learning: with point clouds as input, the agent achieved higher mean success rates and exhibited better generalization.
By incorporating semantic cues and perspective-aware depth supervision, NeRF-Det++ outperforms NeRF-Det by +1.9% mAP@0.25 and +3.5% mAP@0.50 on ScanNetV2.
Proposed a plug-and-play, iterative diffusion refinement framework for robust scene flow estimation, achieving unprecedented millimeter-level accuracy on KITTI, with 6.7% and 19.1% EPE3D reductions on FlyingThings3D and KITTI 2015, respectively.
Proposed a novel end-to-end RGB-D SLAM system that adopts a feature-based deep neural tracker as the frontend and a NeRF-based neural implicit mapper as the backend.
Proposed Coxgraph, an efficient system for real-time multi-robot collaborative dense reconstruction. To facilitate transmission, proposed a compact 3D representation that transforms SDF submaps into mesh packs.
Proposed a multi-device integrated cargo loading management system with AR, which monitors cargo by fusing perceptual information from multiple devices in real time.
Proposed a novel saliency-guided subdivision method that achieves a trade-off between detail generation and memory consumption, producing visually pleasing mesh reconstructions with fine details while achieving better performance.
Proposed a bipartite graph network with a Hungarian pooling layer for 2D-3D matching, which finds more correct matches and improves localization in both robustness and accuracy.
Experiences
Research Intern
General 3D Vision Team, Shanghai AI Laboratory
As the first author/corresponding author/project lead, proposed Match Anything (Correspondence Foundation Model), DiffusionSfM, InternVerse (Reconstruction Foundation Models, including SurfelGS, FedSurfGS, GigaGS, StreetSurfGS, InvrenderGS, and NeuRodin), DiffPano (Text-to-Multi-view Panorama Generation), MAIL (Embodied Foundation Model for Imitation Learning), etc.
Working with Dr. Tong He, Prof. Wanli Ouyang, and Prof. Yu Qiao.
Mentoring 10+ junior researchers at Shanghai AI Lab.
2023.10-Present