3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

Olivia Zumsteg1*†, Jannis Widmer1*, Yann Bourdé1, Norbert Kirchgessner1
Andreas Hund1, Lukas Roth1, Paraskevi Nousi2
1ETH Zurich, Switzerland · 2Swiss Data Science Center, Switzerland
* Equal contribution    † Corresponding author: zumstego@ethz.ch
📄 Paper (coming soon) 📑 arXiv Preprint 💻 Code 🤗 Hugging Face Models

Abstract

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach that uses knowledge distillation during training while keeping inference efficient and image-only. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm³ for the non-distilled RT to 639.93 mm³ and 644.62 mm³, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.

Overview

Overview of the proposed method
Overview of our proposed in-field volume estimation for single wheat spikes. The input consists of 2D multi-view images of wheat spikes (marked in red) taken simultaneously from different angles; the corresponding 3D scan (a) and its volume serve as ground truth. We use knowledge distillation (KD) from ground truth scans (a) to supervise volume estimation from reconstructed field-based point clouds (PC; b). The resulting 3D model is combined with a regulated Transformer (RT) as the image model in an ensemble model. This ensemble is then used as the teacher to distill knowledge into the final model, a distilled regulated Transformer.
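The teacher-student step in this overview can be illustrated with two toy loss functions, one for label-based and one for feature-based distillation. This is a minimal sketch under stated assumptions, not the training code of the project: the function names, the L1 and MSE terms, and the weights `alpha` and `beta` are hypothetical choices made for illustration only.

```python
def mae(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def label_distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Assumed form: blend the ground-truth loss with a loss toward
    the volumes predicted by the teacher (here, the ensemble)."""
    supervised = mae(student_pred, target)
    distill = mae(student_pred, teacher_pred)
    return alpha * supervised + (1 - alpha) * distill


def feature_distillation_loss(student_pred, target,
                              student_feat, teacher_feat, beta=1.0):
    """Assumed form: ground-truth loss plus a penalty that pulls the
    student's latent features toward the teacher's features."""
    per_sample = [mse(s, t) for s, t in zip(student_feat, teacher_feat)]
    feature_term = sum(per_sample) / len(per_sample)
    return mae(student_pred, target) + beta * feature_term
```

In both variants only the student is evaluated at inference time, which is why the distilled model stays image-only.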

Highlights

Methodology

Regulated Transformer (RT): fast, but less accurate for volume estimation. A single-image predictor estimates a per-image volume, which reduces overestimation (regulation). A global predictor learns a volume token from the 12 views of each spike.

Rigid-invariant point cloud model: higher accuracy, but long inference time. KD aligns features from indoor and outdoor point clouds to improve the field point cloud models.

KD is also used to transfer 3D information to the image models.
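To see why distance-based histogram features make a point cloud model rigid-invariant, consider the sketch below. It assumes a normalized histogram of pairwise Euclidean distances; the exact feature definition used by the authors is not given on this page, so `distance_histogram`, `rigid_transform`, and their parameters are hypothetical. Rotations and translations preserve pairwise distances, so the histogram stays the same up to floating-point rounding at bin edges.

```python
import math
from itertools import combinations


def distance_histogram(points, n_bins=8, max_dist=None):
    """Normalized histogram of all pairwise Euclidean distances.

    points: list of (x, y, z) tuples with at least two distinct points.
    max_dist: upper edge of the last bin; defaults to the largest distance.
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    if max_dist is None:
        max_dist = max(dists)
    hist = [0] * n_bins
    for d in dists:
        # Clamp so that distances >= max_dist land in the last bin.
        idx = min(int(d / max_dist * n_bins), n_bins - 1)
        hist[idx] += 1
    return [count / len(dists) for count in hist]


def rigid_transform(points, angle, shift):
    """Rotate about the z-axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]
```

Computing `distance_histogram` on a point set and on its `rigid_transform`-ed copy with a fixed `max_dist` gives matching features, which is the pose robustness the point cloud model relies on. The quadratic number of point pairs also hints at why such 3D features are costly on dense clouds.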

Results

DINOv2 outperforms the other backbone variants. The RT and the rigid-invariant point cloud model outperform the baseline models. Distilling knowledge from the strongest ensemble into the RT minimizes both error and processing time.
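The error and correlation figures quoted in the abstract can be reproduced on your own predictions with standard metrics. The sketch below assumes the Pearson coefficient, since the page does not state which correlation measure is reported; all function names are illustrative.

```python
import math


def mean_absolute_error(pred, target):
    """MAE in the unit of the inputs (mm³ for spike volumes)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def relative_reduction(before, after):
    """Fractional reduction, e.g. 0.1 for a 10% lower value."""
    return (before - after) / before


# Values taken from the abstract.
mae_gain_a = relative_reduction(654.31, 639.93)  # first distilled RT
mae_gain_b = relative_reduction(654.31, 644.62)  # second distilled RT
speedup = 160 / 1.4                              # inference time per spike
```

With the abstract's numbers, the MAE reductions come to roughly 2.2% and 1.5%, while inference is about 114 times faster per spike.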
