Interpreting Multi-Head Attention

Interpreting Multi-Head Attention: Unique Features or Not?

Do Different Heads Focus on Unique Features?

Understanding how multi-head attention