I’m aware of papers that train hard or , but I’m not aware of papers that confirm that attention is robust to this sparsification in terms of . In general, how sparse is attention? I’m particularly interested in self-attention on text. Does anyone have any idea?
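One way to probe the question empirically is to compute soft (softmax) attention weights and measure how concentrated each row's distribution is, e.g. the fraction of weight mass carried by the top-k keys. A minimal sketch, using random Gaussian vectors as stand-ins for real token embeddings (real learned Q/K projections would give different numbers):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 128, 64  # hypothetical sequence length and head dimension
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))

# Scaled dot-product self-attention weights (softmax over keys).
A = softmax(Q @ K.T / np.sqrt(d), axis=-1)

# Sparsity proxy: average fraction of each row's total weight
# that falls on its k largest entries. Close to 1 means the
# distribution is effectively sparse; k/seq_len means uniform.
k = 8
topk_mass = np.sort(A, axis=-1)[:, -k:].sum(axis=-1).mean()
print(f"mean mass in top-{k} of {seq_len} keys: {topk_mass:.3f}")
```

Running the same measurement on attention maps extracted from a trained transformer (rather than random vectors) would give a direct answer for text.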

Source link: https://www.reddit.com/r//comments/9e6eaz/d_is_it_possible_to_make_soft_attention_sparse_by/
