
The implementation of ROPE seems a little off for k #30900

Closed
2 of 4 tasks
wzhcz8902 opened this issue May 20, 2024 · 1 comment

Comments

wzhcz8902 commented May 20, 2024

System Info

main branch

Who can help?

@ArthurZucker @younesbelkada @bojone

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

```python
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`torch.Tensor`): The query tensor.
        k (`torch.Tensor`): The key tensor.
        cos (`torch.Tensor`): The cosine part of the rotary embedding.
        sin (`torch.Tensor`): The sine part of the rotary embedding.
        position_ids (`torch.Tensor`, *optional*):
            Deprecated and unused.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

    Returns:
        `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```
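
For context, `apply_rotary_pos_emb` relies on a `rotate_half` helper defined next to it in the modeling code, which was not included in the paste above. The sketch below redefines that helper and calls the function once, just to illustrate the `unsqueeze_dim=1` broadcasting described in the docstring; the tensor shapes and the from-scratch inverse-frequency angle schedule are illustrative assumptions, not code taken from the library:

```python
# Shape sketch, assuming the apply_rotary_pos_emb definition above is in scope.
import torch

def rotate_half(x):
    # Mirrors the helper defined next to apply_rotary_pos_emb: splits the last
    # dimension into two halves and maps (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

batch, heads, seq_len, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)

# Build cos/sin of shape [batch_size, seq_len, head_dim] with the usual
# inverse-frequency schedule, duplicated across the two halves of head_dim.
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
angles = torch.cat((angles, angles), dim=-1)
cos = angles.cos()[None].expand(batch, -1, -1)
sin = angles.sin()[None].expand(batch, -1, -1)

# unsqueeze_dim=1 turns cos/sin into [batch, 1, seq_len, head_dim], which
# broadcasts against q and k of shape [batch, heads, seq_len, head_dim].
q_embed, k_embed = apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1)
print(q_embed.shape, k_embed.shape)  # torch.Size([2, 8, 16, 64]) for both
```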

The rotary embedding is applied to q and k in exactly the same way, but shouldn't the two rotations differ, so that when q and k are multiplied together we end up with the relative position, i.e. the difference between their positions? The current implementation seems to give a relative position that is the sum of the positions of q and k, so I think we should put a negative sign in front of k's angle. A corrected implementation for k might then look like:

```python
k_embed = (k * cos) + (rotate_half(k) * (-sin))
```

Expected behavior

The rotation angles applied to q and k should differ by a sign.

@ArthurZucker (Collaborator) commented

Hey, I'd invite you to read https://blog.eleuther.ai/rotary-embeddings/, on which the current implementation of RoPE is based.

A quick test you can do is to change this directly and check the outputs of a Llama 2 or Llama 3 model 😉
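
For anyone who wants to check the math directly first, here is a minimal standalone sketch (not a test from the repository; `rotate_half` and the inverse-frequency angle schedule are re-implemented here rather than imported) showing that the unchanged same-sign formula already gives dot products that depend only on the position offset between q and k:

```python
# Sanity-check sketch: apply the same-sign rotation to both q and k, as in
# apply_rotary_pos_emb above, and verify that the resulting dot product
# depends only on the offset between their positions.
import torch

def rotate_half(x):
    # Same pairing convention as the modeling code: (x1, x2) -> (-x2, x1).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def rope_cos_sin(pos, dim, base=10000.0):
    # Inverse-frequency schedule, duplicated across the two halves of dim.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos * inv_freq
    angles = torch.cat((angles, angles), dim=-1)
    return angles.cos(), angles.sin()

def rotate(x, pos, dim):
    cos, sin = rope_cos_sin(torch.tensor(float(pos)), dim)
    return (x * cos) + (rotate_half(x) * sin)  # same sign for q and k

torch.manual_seed(0)
dim = 64
q = torch.randn(dim)
k = torch.randn(dim)

# Positions (5, 2) and (12, 9) share the same offset of 3, so the rotated dot
# products agree; a pair with a different offset generally gives another value.
same_offset_a = rotate(q, 5, dim) @ rotate(k, 2, dim)
same_offset_b = rotate(q, 12, dim) @ rotate(k, 9, dim)
other_offset = rotate(q, 12, dim) @ rotate(k, 2, dim)

assert torch.allclose(same_offset_a, same_offset_b, atol=1e-4)
print(same_offset_a.item(), same_offset_b.item(), other_offset.item())
```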
