Fix GPT-2 Attention Scaling Ignored in SDPA/FlashAttention
A silent bug in Hugging Face Transformers caused GPT-2 attention scaling configs to be ignored when using SDPA or FlashAttention backends. Here's how I traced, fixed, and tested it through three rounds of maintainer review.