Usually a Siamese network takes two (raw) images as input. I wonder whether it would also be possible to first train an autoencoder (AE) to encode the images into a latent representation, and then feed the Siamese network the latent representations of the two images instead of the raw images. I could imagine this might work: since the latent representation is sufficient to reconstruct the original image (i.e. it contains most of the information), it should also be sufficient for comparing images. But I am not sure about that point, since the spatial layout might get lost in the AE's encoding. Please let me know if you are aware of anyone doing this or something related.
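To make the idea concrete, here is a minimal sketch of what I have in mind, with a stand-in (randomly initialized, untrained) encoder and Siamese head — the weights, dimensions, and function names are all hypothetical, just to illustrate the data flow from raw image → latent → shared Siamese embedding → distance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained AE encoder: maps a flattened 28x28 image
# to a 32-dimensional latent vector (stand-in for a trained encoder).
W_enc = rng.normal(scale=0.01, size=(784, 32))

def encode(image):
    """Encode a raw image into its latent representation."""
    return np.tanh(image.reshape(-1) @ W_enc)

# Hypothetical Siamese "head": one shared projection applied to each
# latent, then a Euclidean distance between the two projections.
W_head = rng.normal(scale=0.1, size=(32, 16))

def siamese_distance(latent_a, latent_b):
    emb_a = latent_a @ W_head
    emb_b = latent_b @ W_head
    return np.linalg.norm(emb_a - emb_b)

# Usage: compare two images via their latents instead of raw pixels.
img1 = rng.random((28, 28))
img2 = rng.random((28, 28))
d = siamese_distance(encode(img1), encode(img2))
```

The point of the sketch is only the wiring: the Siamese branches never see pixels, only the AE's latent codes.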
My motivation: AFAIK (please correct me if there is a smarter approach; I am generally very interested in this question), after we have trained our Siamese network, or some other similarity-metric-learning network, finding the image in our repository most similar to a given query image requires a pairwise comparison with every single image in the repository. Maybe these costly comparisons could be sped up if the repository were held in compressed form (as latent representations created by the AE).
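The speedup I am imagining looks roughly like this sketch (the repository size, latent dimension, and distance are all made-up assumptions): encode every repository image once offline, store only the latents, and then compare the query's latent against the stored latents:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical repository of 1000 images, each pre-encoded offline into a
# 32-dimensional latent vector by the (assumed already trained) AE encoder.
repo_latents = rng.normal(size=(1000, 32))

def most_similar(query_latent, repo):
    """Index of the repository latent closest (Euclidean) to the query."""
    dists = np.linalg.norm(repo - query_latent, axis=1)
    return int(np.argmin(dists))

# A query whose latent is a slightly perturbed copy of repository item 42,
# so the search should recover index 42.
query = repo_latents[42] + 0.01 * rng.normal(size=32)
idx = most_similar(query, repo_latents)
```

Each lookup then touches only the small latent vectors rather than running the full network on every raw image pair, which is the cost saving I am hoping for.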