I’m aware of papers that train hard or , but I’m not aware of papers that confirm that attention is robust to this sparsification in terms of . In general, how sparse is attention? I’m interested in self-attention on text in particular. Does anyone have any idea?

