Abstract. 3D object geometry reconstruction remains a challenge when
working with transparent, occluded, or highly reflective surfaces. While
recent methods classify shape features using raw audio, we present a multimodal neural network optimized for estimating an object's geometry and material. Our network uses spectrograms of recorded and synthesized object impact sounds, together with voxelized shape estimates, to extend the
capabilities of vision-based reconstruction. We evaluate our method on
multiple datasets of both recorded and synthesized sounds. We further
present an interactive application for real-time scene reconstruction in which a user can strike objects, producing sounds that are used to instantly classify and segment the struck object, even if the object is transparent or visually occluded.