ML20M Data Splitting

This page describes the analysis used to select the cutoffs for temporally splitting the ML20M data set.

Split Windows

Following Meng et al. (2020), we prepare a global temporal split of the rating data. We target a 70/15/15 train/validation/test split, but round the cutoff timestamps so that the validation and test partitions begin at clean calendar dates. The 70% and 85% quantiles of the rating timestamps give the unrounded cutoffs:

   t_valid                  t_test
0  2007-12-08 01:20:32.600  2010-12-06 22:24:59.600
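
As a rough illustration of the quantile search and rounding described above, here is a minimal sketch in plain Python. It is not the code behind this page: the function names `quantile_cutoffs` and `round_to_year_start` are hypothetical, timestamps are assumed to be UNIX seconds interpreted in UTC, and the quantile uses a simple nearest-rank lookup where a data frame library would typically interpolate.

```python
from datetime import datetime, timezone


def quantile_cutoffs(timestamps, train_frac=0.70, valid_frac=0.15):
    """Return (t_valid, t_test): the timestamps below which roughly
    train_frac and train_frac + valid_frac of the ratings fall."""
    ts = sorted(timestamps)
    n = len(ts)
    # Nearest-rank lookup, clamped to the last element.
    i_valid = min(int(n * train_frac), n - 1)
    i_test = min(int(n * (train_frac + valid_frac)), n - 1)
    return ts[i_valid], ts[i_test]


def round_to_year_start(epoch_seconds):
    """Round a UNIX timestamp to the nearest January 1 (UTC)."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    this_year = datetime(t.year, 1, 1, tzinfo=timezone.utc)
    next_year = datetime(t.year + 1, 1, 1, tzinfo=timezone.utc)
    return this_year if t - this_year <= next_year - t else next_year


# The raw validation cutoff reported above, 2007-12-08, rounds to 2008-01-01.
raw_valid = datetime(2007, 12, 8, 1, 20, 32, tzinfo=timezone.utc).timestamp()
print(round_to_year_start(raw_valid).date())
```

Rounding to the nearest year boundary is only one possible choice of "clean calendar date"; month or quarter boundaries would work the same way.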

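Rounding these cutoffs to the nearest year boundary suggests that 2008 through 2010 (validation) and 2011 through the end of the data (test) are reasonable splits, with everything before 2008 used for training.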

   part   n_ratings  n_users  n_items
0  test     2943856    25167    25806
1  valid    2992504    22797    15040
2  train   14063903   101068     9710
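
A summary like the one above can be computed by labelling each rating with its partition and counting distinct users and items. The following is a sketch under stated assumptions rather than the code used for this page: the cutoffs are taken to be January 1 of 2008 and 2011 in UTC (the page does not state a time zone), ratings are assumed to be `(user, item, timestamp)` tuples with UNIX-second timestamps, and `assign_part` and `summarize` are hypothetical names.

```python
from datetime import datetime, timezone

# Assumed cutoffs: start of 2008 (validation) and start of 2011 (test), in UTC.
T_VALID = datetime(2008, 1, 1, tzinfo=timezone.utc).timestamp()
T_TEST = datetime(2011, 1, 1, tzinfo=timezone.utc).timestamp()


def assign_part(ts):
    """Label a rating timestamp as 'train', 'valid', or 'test'."""
    if ts >= T_TEST:
        return "test"
    if ts >= T_VALID:
        return "valid"
    return "train"


def summarize(ratings):
    """Count ratings, distinct users, and distinct items per partition.

    `ratings` is an iterable of (user, item, timestamp) tuples.
    """
    acc = {}
    for user, item, ts in ratings:
        part = acc.setdefault(
            assign_part(ts), {"n_ratings": 0, "users": set(), "items": set()}
        )
        part["n_ratings"] += 1
        part["users"].add(user)
        part["items"].add(item)
    return {
        name: {
            "n_ratings": p["n_ratings"],
            "n_users": len(p["users"]),
            "n_items": len(p["items"]),
        }
        for name, p in acc.items()
    }
```

Because the split is global, a user can appear in more than one partition, so the per-partition user counts are not expected to sum to the total number of users.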

How many users can we use for collaborative filtering in the validation set?

   n_users  n_ratings
0     5564     619592

And for testing?

   n_users  n_ratings
0     5278     818921
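
The page does not state the criterion behind these counts. One plausible reading, offered here purely as an assumption, is that a user is usable for collaborative filtering in an evaluation window if they have ratings inside that window and at least some earlier ratings to learn from. The sketch below implements that reading; the function name `usable_users` and the `min_history` threshold are hypothetical, and ratings are again assumed to be `(user, item, timestamp)` tuples.

```python
def usable_users(ratings, t_start, t_end=float("inf"), min_history=1):
    """Count users with ratings in [t_start, t_end) who also have at
    least `min_history` ratings before t_start (an assumed criterion)."""
    history = {}  # ratings per user before the window
    window = {}   # ratings per user inside the window
    for user, _item, ts in ratings:
        if ts < t_start:
            history[user] = history.get(user, 0) + 1
        elif ts < t_end:
            window[user] = window.get(user, 0) + 1
    kept = {u: n for u, n in window.items() if history.get(u, 0) >= min_history}
    return {"n_users": len(kept), "n_ratings": sum(kept.values())}
```

Under this reading, the validation count would use the validation window with the training period as history, and the test count would use everything before the test cutoff as history. A stricter `min_history` would shrink both counts.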

This gives us enough data to work with, even if we might like more test users.

References

Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.