This can be explained by the fact that in matrix factorization models, the norm of an embedding is often correlated with popularity (popular movies have a larger norm), which makes the model more likely to recommend popular items.
This can happen if the embedding of that movie happens to be initialized with a high norm. Then, because the movie has few ratings, its embedding is infrequently updated and can keep its high norm. Regularization alleviates this.
The expected norm of a $d$-dimensional vector with entries $\sim \mathcal{N}(0, \sigma^2)$ is approximately $\sigma \sqrt{d}$.
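The approximation above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `mean_gaussian_norm` and the values of `d`, `sigma`, and `num_samples` are illustrative choices, not taken from the notes.

```python
import numpy as np

def mean_gaussian_norm(d, sigma, num_samples=10_000, seed=0):
    """Monte Carlo estimate of E[||x||] for x with i.i.d. N(0, sigma^2) entries."""
    rng = np.random.default_rng(seed)
    # Each row is one d-dimensional sample vector.
    x = rng.normal(loc=0.0, scale=sigma, size=(num_samples, d))
    return float(np.linalg.norm(x, axis=1).mean())

# Hypothetical embedding dimension and initialization scale.
d, sigma = 30, 0.5
estimate = mean_gaussian_norm(d, sigma)
approx = sigma * np.sqrt(d)
print(f"empirical mean norm: {estimate:.3f}, sigma * sqrt(d): {approx:.3f}")
```

The two numbers should agree closely for moderate d, which is why a larger initialization scale directly produces larger initial embedding norms.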
Folding
The model does not learn how to place the embeddings of irrelevant movies. This phenomenon is known as folding.
Loss Function
Regularization
We can add regularization terms that will address the folding issue.
We use two types of regularization:
ℓ2 regularization term: Regularization of the model parameters.
This is given by $r(U, V) = \frac{1}{N} \sum_i \|U_i\|^2 + \frac{1}{M} \sum_j \|V_j\|^2$.
Gravity term: A global prior that pushes the prediction of any pair towards zero.
This is given by $g(U, V) = \frac{1}{MN} \sum_{i=1}^N \sum_{j=1}^M \langle U_i, V_j \rangle^2$.
The total loss is then given by
$\frac{1}{|\Omega|} \sum_{(i, j) \in \Omega} (A_{ij} - \langle U_i, V_j \rangle)^2 + \lambda_r r(U, V) + \lambda_g g(U, V)$
where $\lambda_r$ and $\lambda_g$ are two regularization coefficients (hyper-parameters).
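The total loss can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the notes' own implementation: U is taken to be an N×d array and V an M×d array with one embedding per row, the observed set Ω is passed as a boolean `mask`, and the function name `regularized_loss` and the default coefficient values are hypothetical. The gravity term uses the identity that the sum of squared dot products over all pairs equals the elementwise sum of (UᵀU)∘(VᵀV), which avoids building the full N×M prediction matrix.

```python
import numpy as np

def regularized_loss(A, mask, U, V, lambda_r=0.1, lambda_g=1.0):
    """Observed-entry MSE plus l2 and gravity regularization.

    A: (N, M) ratings; mask: (N, M) boolean, True where a rating is observed;
    U: (N, d) user embeddings; V: (M, d) movie embeddings.
    lambda_r and lambda_g defaults are placeholder values (assumptions).
    """
    N, M = A.shape
    rows, cols = np.nonzero(mask)
    # Predictions <U_i, V_j> for observed pairs only.
    pred_obs = np.sum(U[rows] * V[cols], axis=1)
    mse = np.mean((A[rows, cols] - pred_obs) ** 2)
    # l2 term: r(U, V) = (1/N) sum_i ||U_i||^2 + (1/M) sum_j ||V_j||^2
    r = np.sum(U ** 2) / N + np.sum(V ** 2) / M
    # Gravity term: sum_{i,j} <U_i, V_j>^2 = sum((U^T U) * (V^T V)),
    # computed from two d x d matrices instead of an N x M one.
    g = np.sum((U.T @ U) * (V.T @ V)) / (N * M)
    return mse + lambda_r * r + lambda_g * g

# Small synthetic example with made-up data.
rng = np.random.default_rng(0)
N, M, d = 6, 8, 3
A = rng.integers(1, 6, size=(N, M)).astype(float)
mask = rng.random((N, M)) < 0.4
mask[0, 0] = True  # make sure at least one entry is observed
U = rng.normal(scale=0.5, size=(N, d))
V = rng.normal(scale=0.5, size=(M, d))
print(regularized_loss(A, mask, U, V))
```

Because the gravity term sums over all pairs, including unobserved ones, it gives the model a reason to keep predictions for unrated movies near zero, which is what addresses folding.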