ML10M Data Splitting

This page describes the analysis used to select the cutoffs for temporally splitting the ML10M data set.

Split Windows

Following Meng et al. (2020), we prepare a global temporal split of the rating data. We target a 70/15/15 train/validation/test split, but round the cutoff timestamps so that the validation and test windows begin at clean calendar dates. The 70% and 85% quantiles of the rating timestamps give the unrounded cutoffs:

                   t_valid                   t_test
0  2005-04-02 16:52:03.300  2006-12-27 19:38:10.050

This suggests that April 2005 is a reasonable validation cutoff and the start of 2007 a good test cutoff.
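The quantile search and rounding described above could be sketched roughly as follows. This is a minimal illustration, not the code used for this page: the function names `quantile_cutoffs` and `round_to_month_start` are hypothetical, and rounding to the nearest month start is an assumption about what a "clean calendar date" means here.

```python
import pandas as pd


def quantile_cutoffs(
    timestamps: pd.Series, train_frac: float = 0.70, valid_frac: float = 0.15
) -> pd.DataFrame:
    """Find the raw timestamps where the validation and test windows would start.

    `timestamps` is assumed to be a datetime64 Series with one entry per rating.
    """
    t_valid = timestamps.quantile(train_frac)
    t_test = timestamps.quantile(train_frac + valid_frac)
    return pd.DataFrame({"t_valid": [t_valid], "t_test": [t_test]})


def round_to_month_start(ts: pd.Timestamp) -> pd.Timestamp:
    """Round a timestamp to the nearest first-of-month (an assumed rounding rule)."""
    period = ts.to_period("M")
    floor = period.start_time          # first instant of the current month
    ceil = (period + 1).start_time     # first instant of the next month
    return floor if (ts - floor) <= (ceil - ts) else ceil
```

Applied to the two quantiles reported above, this rounding rule yields 2005-04-01 and 2007-01-01, which matches the cutoffs chosen on this page.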

    part  n_ratings  n_users  n_items
0   test    1487550    10904    10415
1  train    6974190    54669     8550
2  valid    1538314    10479     8913
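A partition summary like the one above could be produced along these lines. This is a sketch under stated assumptions: the column names `user_id`, `item_id`, and `timestamp` are hypothetical, `summarize_split` is an illustrative helper, and treating a rating exactly at a cutoff as belonging to the later partition is an assumption.

```python
import numpy as np
import pandas as pd


def summarize_split(
    ratings: pd.DataFrame, t_valid: pd.Timestamp, t_test: pd.Timestamp
) -> pd.DataFrame:
    """Label each rating by temporal partition and count ratings, users, and items."""
    ts = ratings["timestamp"]
    # Assumption: ratings at or after a cutoff fall into the later partition.
    part = np.where(ts >= t_test, "test", np.where(ts >= t_valid, "valid", "train"))
    return (
        ratings.assign(part=part)
        .groupby("part")
        .agg(
            n_ratings=("item_id", "size"),
            n_users=("user_id", "nunique"),
            n_items=("item_id", "nunique"),
        )
        .reset_index()
    )
```

Because `groupby` sorts its keys, the rows come out in the order test, train, valid, as in the table above.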

How many users can we use for collaborative filtering in the validation set?

   n_users  n_ratings
0     3128     269925

And for testing?

   n_users  n_ratings
0     3248     431297

This gives us enough data to work with, even if we might like more test users.
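The page does not state the exact criterion for a user being usable for collaborative filtering. One plausible reading is that the user must already have ratings in the training window, since a collaborative filter has no history for anyone else. A minimal sketch under that assumption, where `count_usable_users`, the `user_id` column name, and the `min_train_ratings` threshold are all hypothetical:

```python
import pandas as pd


def count_usable_users(
    train: pd.DataFrame, eval_part: pd.DataFrame, min_train_ratings: int = 1
) -> pd.DataFrame:
    """Count evaluation users (and their ratings) who have enough training history.

    Assumption: a user is usable if they have at least `min_train_ratings`
    ratings in the training partition.
    """
    train_counts = train.groupby("user_id").size()
    known_users = train_counts[train_counts >= min_train_ratings].index
    usable = eval_part[eval_part["user_id"].isin(known_users)]
    return pd.DataFrame(
        {"n_users": [usable["user_id"].nunique()], "n_ratings": [len(usable)]}
    )
```

The same function can be called once with the validation partition and once with the test partition to produce the two counts.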

References

Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.