Visualizing Paired Image Similarity in Transformer Networks

January 5, 2022

Authored by:

Robert Pless

A lack of explainability or interpretability is often cited as one of the main drawbacks of deep neural networks used in computer vision and machine learning [2, 25, 27]. In vision domains, this problem is primarily addressed with visualization methods that highlight the portion(s) of the input that contributed most to the output prediction. For images, these visualizations typically take the form of heatmaps that depict the relative importance of particular pixel regions.

Generating these visualizations often depends on both (1) the network architecture and (2) the prediction task. Convolutional neural networks (CNNs) have been the dominant architecture in computer vision, and most visualization approaches have been specific to these models. Typically, for CNNs, the feature maps at each layer have local support: the representation at a particular location is computed from a small neighborhood of nearby pixel locations. Because feature map locations correspond to image regions in this way, relevancy maps can be generated and overlaid on the input to produce an intuitive visualization. While most of this work deals with classification networks [31, 30, 37, 45, 29], there has been some recent work on embedding networks [35, 10], which typically use representation learning for downstream tasks such as image retrieval.
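To make the notion of local support concrete, the following is a minimal sketch, not the method of any cited work. It assumes a plain stack of convolutional layers described only by kernel size and stride (no dilation, and padding is ignored), and a grayscale image whose dimensions are integer multiples of the relevancy map's dimensions. The function names `receptive_field` and `overlay_heatmap` are hypothetical, chosen for illustration.

```python
import numpy as np

def receptive_field(layers):
    """Size of the input neighborhood that feeds one output location.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Returns (receptive_field_size, effective_stride), both in input pixels.
    Assumes no dilation; padding does not change the size, only the offset.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the window
        jump *= stride              # spacing between adjacent outputs
    return rf, jump

def overlay_heatmap(image, relevance, alpha=0.5):
    """Blend a coarse relevancy map over a grayscale image.

    image: (H, W) float array in [0, 1].
    relevance: (h, w) array with H % h == 0 and W % w == 0 (an assumption).
    """
    H, W = image.shape
    h, w = relevance.shape
    # Nearest-neighbor upsampling: each coarse cell covers a block of pixels.
    up = np.kron(relevance, np.ones((H // h, W // w)))
    span = up.max() - up.min()
    up = (up - up.min()) / span if span > 0 else np.zeros_like(up)
    return (1 - alpha) * image + alpha * up

# Three 3x3 convolutions, the middle one with stride 2.
rf_size, eff_stride = receptive_field([(3, 1), (3, 2), (3, 1)])

# A 2x2 relevancy map laid over a blank 4x4 image.
blended = overlay_heatmap(np.zeros((4, 4)), np.array([[0, 1], [2, 3]]))
```

Under these assumptions, each output location depends on a bounded window of input pixels, which is why a coarse map computed at the feature level can be upsampled and aligned with the image.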