Blog

Why I Don't Work on 3D Anymore

Thoughts after reading Vincent Sitzmann's "A Bitter Lesson in Vision"

My research career started around 2022 in 3D and graphics. Back then, the field felt like it was at its absolute peak: NeRFs (Neural Radiance Fields), then 3D generation with GANs, followed by score distillation with text-to-image diffusion models like DreamFusion, and then 3DGS (3D Gaussian Splatting). I was really happy working in this direction and thought it was the perfect path for my PhD. The world is 3D, after all, so why not build with 3D as the core mindset? My very first paper, Pix2NeRF, was on generative NeRFs. I was also fortunate enough to get an offer from arguably the best group in the world to work on 3D representations, becoming Vincent's junior labmate, and the future felt incredibly bright.

The turning point that completely changed my mind was the release of Objaverse. From that moment on, seemingly overnight, the entire 3D generation field pivoted to data-driven methods, and the quality of generated content was leaps and bounds better. That was exactly when I started to realize I should move on to something else. The reasoning is simple: scalable training is so important that it can outweigh almost all other factors. If something is not scalable, it might ultimately just be relegated to a side project of a scalable foundation model. Once large-scale training is on the table, the value of explicit inductive biases or explicit representations becomes highly debatable; maybe "no inductive bias is the best inductive bias ;)".

After starting my PhD, influenced by the atmosphere at Stanford, my mindset gradually became more VC-like, or rather, more utilitarian and capital-driven, which is not necessarily a good thing, lol. But as everyone can see, this wave of AI hype has an incredibly strong ability to attract capital, with startups casually raising hundreds of millions of dollars. So why are VCs so willing to throw money at it? Ultimately, I think it is the influence of LLMs. The industry has now seen a highly mature, scaling-based path to success, along with an underlying framework for how that success gets built.

The Scaling Template

FSD

Scale: 4.3M+ hours of driving data.

Cost: Real cars generating new data every single day for free.

LLMs

Scale: Tokens roughly equivalent to more than 10 million books.

Cost: Decades of nearly free scraping from the public internet, plus millions of daily interactions that provide RLHF data.

Image Generation

Scale: 5.85 billion image-text pairs, and that is just the open-source LAION-5B dataset; state-of-the-art industry models likely use much more.

Cost: Continuously scraped for free from alt-text and images across the public web.

Video Generation

Scale: Millions of hours of video data on platforms like YouTube.

Cost: Uploaded daily by billions of global creators and accessible essentially for free.

Behind every one of these domains is a massive, scalable data ecosystem. The current trend is that other fields, whether robotics or anything else, are desperately trying to lean into the same path. If something does not look like it has the potential to replicate this LLM-style scaling trajectory, it is destined to fall out of favor with capital.

What This Meant for 3D

Looking back at 3D in this context, the question becomes: how do we build a true 3D foundation model? Intuitively, that means pre-training a large model to understand, generate, and interact with 3D spatial data across a variety of downstream tasks. But where do we find low-cost, scalable data at that scale?

My conclusion was that if I were to continue in 3D, I would either have to work on acquiring scalable data or research how to leverage existing foundation models to solve specific 3D tasks. I am genuinely interested in both topics, but I just could not see myself committing to them for the long term.

Why I Moved to Video

So I chose to pivot to a direction that at least appears to genuinely possess this scalable property: video, along with the corresponding foundations for scaling up, such as long context and infrastructure. As Vincent pointed out, video might still be a long way from the final solution, but it is the easiest, most scalable pre-training scheme people can think of right now.