ML32M Data Splitting

This page describes the analysis used to select the cutoffs for temporally splitting the ML32M data set.

Split Windows

Following Meng et al. (2020), we prepare a global temporal split of the rating data. We target a 70/15/15 train/validation/test split, but round the cutoff timestamps so that the split boundaries fall on clean calendar dates. The 70% and 85% quantiles of the rating timestamps give the unrounded cutoffs.
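The quantile search described above can be sketched as follows. This is a minimal illustration rather than the code behind this page: the function name `quantile_cutoffs` is hypothetical, and it assumes the timestamps are Unix seconds, as in the MovieLens rating files.

```python
import pandas as pd


def quantile_cutoffs(timestamps: pd.Series,
                     train_frac: float = 0.70,
                     valid_frac: float = 0.15) -> pd.DataFrame:
    """Find the timestamps where the validation and test periods would begin.

    Assumes `timestamps` holds Unix epoch seconds (an assumption about the
    input format, not something stated on this page).
    """
    ts = pd.to_datetime(timestamps, unit="s")
    # The validation period starts after `train_frac` of the ratings;
    # the test period starts after `train_frac + valid_frac` of them.
    q = ts.quantile([train_frac, train_frac + valid_frac])
    return pd.DataFrame({"t_valid": [q.iloc[0]], "t_test": [q.iloc[1]]})
```

The returned timestamps are exact quantiles, so they still need to be rounded by hand to a convenient calendar date.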

t_valid                    t_test
2016-10-13 13:27:36.400    2019-11-09 20:15:07.950

This suggests that Oct. 2016 is a reasonable validation set cutoff, and Nov. 2019 a reasonable test set cutoff.
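Applying such cutoffs and summarizing each part might look like the sketch below. The exact cutoff dates are assumptions (the text only names the months, so the first of each month is used for illustration), as are the column names `user_id`, `item_id`, and `timestamp` and the helper names.

```python
import pandas as pd

# Hypothetical cutoffs: the page says only "Oct. 2016" and "Nov. 2019";
# using the first day of each month is an assumption for illustration.
T_VALID = pd.Timestamp("2016-10-01")
T_TEST = pd.Timestamp("2019-11-01")


def assign_parts(ratings: pd.DataFrame,
                 t_valid: pd.Timestamp = T_VALID,
                 t_test: pd.Timestamp = T_TEST) -> pd.DataFrame:
    """Label each rating as train, valid, or test by its timestamp."""
    ts = pd.to_datetime(ratings["timestamp"], unit="s")
    part = pd.Series("train", index=ratings.index)
    part[ts >= t_valid] = "valid"
    part[ts >= t_test] = "test"  # overrides "valid" for the latest ratings
    return ratings.assign(part=part)


def summarize_parts(split: pd.DataFrame) -> pd.DataFrame:
    """Count ratings, distinct users, and distinct items in each part."""
    return (
        split.groupby("part")
        .agg(
            n_ratings=("item_id", "size"),
            n_users=("user_id", "nunique"),
            n_items=("item_id", "nunique"),
        )
        .reindex(["train", "valid", "test"])
    )
```

Because the split is global, every rating before the validation cutoff goes to training regardless of which user made it.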

part    n_ratings    n_users    n_items
train    22355173     154354      36291
valid     4814880      31199      54661
test      4830151      30302      70652

How many users can we use for collaborative filtering in the validation set?
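One way to count such users is sketched below. The page does not state its criterion for a usable user, so this assumes a user is usable when they have at least `min_train_ratings` ratings in the training part; the function name, the threshold parameter, and the `user_id` column are all hypothetical.

```python
import pandas as pd


def usable_users(train: pd.DataFrame,
                 part: pd.DataFrame,
                 min_train_ratings: int = 1) -> pd.DataFrame:
    """Count evaluation users (and their ratings) that have training history.

    Assumption: a user is usable for collaborative filtering if they have
    at least `min_train_ratings` ratings in `train`.
    """
    train_counts = train.groupby("user_id").size()
    known = train_counts.index[train_counts >= min_train_ratings]
    kept = part[part["user_id"].isin(known)]
    return pd.DataFrame({
        "n_users": [kept["user_id"].nunique()],
        "n_ratings": [len(kept)],
    })
```

The same function applies to either evaluation part: pass the validation or the test ratings as `part`.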

n_users    n_ratings
   7760       898685

And in the test set?

n_users    n_ratings
   7863      1229054

This loses a lot of users (only about a quarter of the users in each evaluation set remain), but enough are left that the results should still be reasonably useful.

References

Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.