ML25M Data Splitting

This page describes the analysis used to select the cutoffs for temporally splitting the ML25M data set.

Split Windows

Following Meng et al. (2020), we prepare a global temporal split of the rating data. We target a 70/15/15 train/validation/test split, but round the cutoff timestamps so that the validation and test sets begin at clean calendar dates. The 70% and 85% quantiles of the rating timestamps give the unrounded cutoffs:

                  t_valid                   t_test
0 2015-01-22 16:43:02.600  2017-03-24 01:50:04.700

This suggests that January 2015 is a reasonable validation-set cutoff, and March/April 2017 a reasonable test-set cutoff.
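The quantile search described above can be sketched roughly as follows. This is a minimal illustration, not the code behind this page: the function name `split_cutoffs`, the column names, and the synthetic rating table are all assumptions, and the timestamps are assumed to be Unix seconds.

```python
import numpy as np
import pandas as pd


def split_cutoffs(ratings, train_frac=0.70, valid_frac=0.15):
    """Return the unrounded timestamps where validation and test data would begin.

    Hypothetical helper: assumes a `timestamp` column holding Unix seconds.
    """
    # Quantiles are taken on the raw numeric timestamps, then converted to datetimes.
    q = ratings["timestamp"].quantile([train_frac, train_frac + valid_frac])
    q = pd.to_datetime(q, unit="s")
    return pd.DataFrame({"t_valid": [q.iloc[0]], "t_test": [q.iloc[1]]})


# Small synthetic stand-in for the rating table (not ML25M data).
rng = np.random.default_rng(42)
ratings = pd.DataFrame({
    "user_id": rng.integers(1, 50, size=1000),
    "item_id": rng.integers(1, 200, size=1000),
    "timestamp": rng.integers(1_000_000_000, 1_500_000_000, size=1000),
})

cutoffs = split_cutoffs(ratings)
print(cutoffs)
```

The resulting timestamps are then rounded by hand to nearby calendar dates, which is why the final split proportions only approximate 70/15/15.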

    part  n_ratings  n_users  n_items
0  train   17436354   121673    22316
1  valid    3850116    27077    36820
2   test    3713625    24324    53059
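A summary like the one above could be produced by labeling each rating with its partition and aggregating. The sketch below is an assumption-laden illustration: `temporal_split` and `summarize` are hypothetical names, the data is synthetic, and the two cutoff dates are placeholders chosen only to fall in the months named above, since this page does not state the exact dates used.

```python
import numpy as np
import pandas as pd


def temporal_split(ratings, t_valid, t_test):
    """Label each rating as train, valid, or test by its timestamp (Unix seconds assumed)."""
    ts = pd.to_datetime(ratings["timestamp"], unit="s")
    # Ratings before t_valid are training data; those in [t_valid, t_test) are validation.
    part = np.where(ts < t_valid, "train", np.where(ts < t_test, "valid", "test"))
    return ratings.assign(part=part)


def summarize(split):
    """Count ratings, distinct users, and distinct items in each partition."""
    return (
        split.groupby("part")
        .agg(
            n_ratings=("item_id", "size"),
            n_users=("user_id", "nunique"),
            n_items=("item_id", "nunique"),
        )
        .reindex(["train", "valid", "test"])
    )


# Synthetic ratings spread between 2010-01-01 and 2019-01-01 (not ML25M data).
rng = np.random.default_rng(7)
ratings = pd.DataFrame({
    "user_id": rng.integers(1, 50, size=1000),
    "item_id": rng.integers(1, 200, size=1000),
    "timestamp": rng.integers(1_262_304_000, 1_546_300_800, size=1000),
})

# Placeholder cutoffs, not the dates actually used for this page.
t_valid = pd.Timestamp("2015-01-01")
t_test = pd.Timestamp("2017-04-01")

split = temporal_split(ratings, t_valid, t_test)
summary = summarize(split)
print(summary)
```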

How many users can we use for collaborative filtering in the testing set?

   n_users  n_ratings
0     6481     642289

And for validation?

   n_users  n_ratings
0     4534     720161

This does lose a lot of users (only 6,481 of the 24,324 test users and 4,534 of the 27,077 validation users remain), but it is enough that we should get reasonably useful results.
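This page does not state the exact criterion for a user being usable for collaborative filtering. One plausible reading, assumed here, is that an evaluation user must also have some ratings in the training partition. The sketch below counts such users; the function `usable_users`, the `min_train` threshold, and the toy data are all hypothetical.

```python
import pandas as pd


def usable_users(split, eval_part, min_train=1):
    """Count evaluation users (and their ratings) who have training history.

    Assumption: "usable" means at least `min_train` ratings in the train partition.
    """
    train = split[split["part"] == "train"]
    train_counts = train.groupby("user_id").size()
    known = train_counts[train_counts >= min_train].index
    # Keep only evaluation ratings from users seen often enough in training.
    ev = split[(split["part"] == eval_part) & split["user_id"].isin(known)]
    return pd.DataFrame({"n_users": [ev["user_id"].nunique()], "n_ratings": [len(ev)]})


# Toy labeled ratings (not ML25M data).
split = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4],
    "item_id": [10, 11, 10, 12, 11, 13, 10],
    "part": ["train", "test", "train", "train", "test", "test", "valid"],
})

test_usable = usable_users(split, "test")
valid_usable = usable_users(split, "valid")
print(test_usable)
print(valid_usable)
```

In the toy data, user 3 appears only in the test partition and user 4 only in validation, so neither is counted. Under a global temporal split this is the usual reason many evaluation users drop out.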

References

Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.