Kranti Kumar Parida1,
Omar Emara1,
Hazel Doughty2,
Dima Damen1
1University of Bristol, 2Leiden University

Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3 \times$ and $4.7 \times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.

We define our task, Collision Sound Source Segmentation (CS3), as follows: given an audio-visual clip containing a single collision sound, we aim to identify the object(s) causing this collision sound, and output pixel-level segmentations of these objects in a frame within the collision segment. We define a collision sound as one produced when two or more objects (or parts of one object) exert force on each other as a result of an interaction. Crucially, such a sound results from the interaction itself and cannot be attributed to either object alone. Our approach is trained with weak supervision, i.e. we require only temporally segmented clips of collisions, without any labels of colliding objects, action semantics or object segmentation masks. Formally, given a collision audio $\mathbf{A}{\in}\mathbb{R}^{1 \times T}$ and an image $\mathbf{I}{\in}\mathbb{R}^{3 \times H \times W}$ from a video, we learn a function to segment the object(s) responsible for the collision: $$f: (\mathbf{I}, \mathbf{A}) \rightarrow \{\mathbf{M}_k\}_{k=1}^n$$ where $n{\in}\{1, 2\}$, and each $\mathbf{M}_k{\in}\{0, 1\}^{H{\times}W}$ denotes the mask for the $k^\text{th}$ colliding object.
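As a minimal sketch of this interface (hypothetical function and variable names, not the released code), assuming PyTorch tensors for the frame and the waveform:

from typing import List
import torch

def segment_collision_sources(
    image: torch.Tensor,   # I, shape (3, H, W): an RGB frame from the collision clip
    audio: torch.Tensor,   # A, shape (1, T): the mono waveform of the collision sound
) -> List[torch.Tensor]:   # {M_k}, k = 1..n with n in {1, 2}; each mask is (H, W) with values in {0, 1}
    """Placeholder for the learned mapping f: (I, A) -> {M_k}."""
    raise NotImplementedError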
Given our videos are egocentric, we use the interaction prior that sounds are often caused by the camera wearer manipulating one of the objects involved in the collision. Guided by this hypothesis, we combine two complementary cues: (1) audio-visual correlation to localise the sound-producing object, and (2) hand-object interaction priors to identify objects held in both hands. Our architecture consists of three main components: (1) audio-conditioned segmentation, (2) hand-object interaction (HOI) and (3) collision verification. The audio-conditioned segmentation model takes an image ($\mathbf{I}$) and its corresponding audio ($\mathbf{A}$) to produce conditioning signals $\mathbf{I}_C$ and $\mathbf{A}_C$. The audio is first encoded into a representation aligned with the text token space, which is used alongside visual features to guide the localisation of sound-producing regions. The model is trained with image-level ($\mathcal{L}_{i}$), feature-level ($\mathcal{L}_{f}$) and area regularisation ($\mathcal{L}_{r}$) losses. The HOI model provides bounding boxes for in-hand left and right objects when present. The collision verification module uses SAM to extract object masks for the audio-conditioned segmentation mask $\mathbf{M}_{av}$ and the in-hand objects $\mathbf{M}_{\textit{left}}$ and $\mathbf{M}_{\textit{right}}$. A contact-based strategy is then applied to estimate the segmentations for the collision sound sources, $\mathbf{M}_{\textit{coll}}$.
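To make the final verification step concrete, the sketch below combines the audio-conditioned mask $\mathbf{M}_{av}$ with the SAM masks of the in-hand objects using a simple contact test; the function names and the dilation-based contact check are illustrative assumptions, not the exact released implementation of the contact-based strategy.

from typing import List
import numpy as np
from scipy.ndimage import binary_dilation

def in_contact(mask_a: np.ndarray, mask_b: np.ndarray, radius: int = 5) -> bool:
    # Treat two binary masks as touching if they overlap after slightly dilating one of them.
    return bool(np.logical_and(binary_dilation(mask_a, iterations=radius), mask_b).any())

def collision_sound_sources(
    m_av: np.ndarray,                 # audio-conditioned mask M_av, shape (H, W), values in {0, 1}
    in_hand_masks: List[np.ndarray],  # SAM masks for the in-hand objects (M_left, M_right) when present
) -> List[np.ndarray]:
    # Keep the audio-conditioned object, plus any in-hand object that is in contact with it.
    selected = [m_av]
    for mask in in_hand_masks:
        if in_contact(m_av, mask):
            selected.append(mask)
    return selected[:2]  # the task defines n in {1, 2}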
If you use the code or dataset from the project, please cite:
@InProceedings{parida2025segmenting,
  author    = {Parida, Kranti and Emara, Omar and Doughty, Hazel and Damen, Dima},
  title     = {Segmenting Collision Sound Sources in Egocentric Videos},
  booktitle = {ArXiv},
  year      = {2025}
}
This project is supported by EPSRC Programme Grant Visual AI (EP/T028572/1). O. Emara is supported by the UKRI CDT in Interactive AI (EP/S022937/1). H. Doughty is supported by the Dutch Research Council (NWO) under a Veni grant (VI.Veni.222.160). We acknowledge the usage of the EPSRC Tier-2 JADE clusters for initial experiments. The authors also acknowledge the use of the Isambard-AI National AI Research Resource (AIRR). Isambard-AI is operated by the University of Bristol and is funded by the UK Government’s Department for Science, Innovation and Technology (DSIT) via UK Research and Innovation; and the Science and Technology Facilities Council [ST/AIRR/I-A-I/1023]. We also extend our gratitude to SURF for granting compute resources on the National Supercomputer Snellius.