Fix the attribution for `attention_probs`
`attention_probs` are tensors of shape `num_heads x seq_len x seq_len`. Currently, they are aggregated only along dimension 1, but they should also be aggregated along dimension 2, leaving effectively 1 neuron per attention head.
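A minimal sketch of the intended change, assuming PyTorch tensors and sum as the aggregation function (the actual attribution code may use a different reduction):

```python
import torch

# Hypothetical shapes for illustration only.
num_heads, seq_len = 12, 128
attention_probs = torch.softmax(torch.randn(num_heads, seq_len, seq_len), dim=-1)

# Current behaviour: aggregating only along dim 1 still leaves a
# (num_heads, seq_len) tensor, i.e. one value per (head, position).
per_position = attention_probs.sum(dim=1)

# Intended behaviour: aggregate along both sequence dimensions (1 and 2),
# leaving a single value per attention head.
per_head = attention_probs.sum(dim=(1, 2))
print(per_head.shape)  # torch.Size([num_heads])
```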
This will be mitigated in the `aggregate-input-type` branch, as we will have 4 or 5 categories per attention head, leaving more room for analysis.