ML32M Data Splitting

This page describes the analysis done to select the cutoffs for temporally splitting the ML32M data set.

Split Windows

Following (Meng et al. 2020), we are going to prepare a global temporal split of the rating data. We will target a 70/15/15 train/validation/test split, but round the timestamps so the split boundaries fall on clean calendar dates. Searching for the 70% and 85% quantiles of the rating timestamps gets us these cutoffs:

| t_valid | t_test |
|---|---|
| 2016-10-13 13:27:36.400 | 2019-11-09 20:15:07.950 |

This suggests that Oct. 2016 is a reasonable validation-set cutoff, and Nov. 2019 a reasonable test-set cutoff.
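As a rough sketch of that quantile search, assuming the ratings are loaded into a pandas DataFrame with a datetime `timestamp` column (the file path and column name here are illustrative, not taken from this page):

```python
import pandas as pd

# Illustrative loading step; the actual data layout isn't shown on this page.
ratings = pd.read_parquet("ml32m-ratings.parquet")
# If the raw timestamps are Unix epochs, convert them first:
# ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")

# For a 70/15/15 split by rating count, the validation window starts at the
# 70% quantile of the rating timestamps and the test window at the 85% quantile.
cutoffs = ratings["timestamp"].quantile([0.70, 0.85])
cutoffs.index = ["t_valid", "t_test"]
print(cutoffs)
```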
Splitting at these boundaries gives the following partitions:

| part | n_ratings | n_users | n_items |
|---|---|---|---|
| train | 22355173 | 154354 | 36291 |
| valid | 4814880 | 31199 | 54661 |
| test | 4830151 | 30302 | 70652 |
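A minimal sketch of how rounded cutoffs could be applied to produce these partitions, continuing from the sketch above (the exact rounded dates and the `user_id`/`item_id` column names are assumptions):

```python
import pandas as pd

# Illustrative rounded cutoffs; the precise calendar dates for the real split
# come from rounding the quantiles above and are not pinned down here.
t_valid = pd.Timestamp("2016-10-01")
t_test = pd.Timestamp("2019-11-01")

# Global temporal split: every rating lands in exactly one partition based on
# its timestamp, regardless of which user produced it.
train = ratings[ratings["timestamp"] < t_valid]
valid = ratings[(ratings["timestamp"] >= t_valid) & (ratings["timestamp"] < t_test)]
test = ratings[ratings["timestamp"] >= t_test]

# Per-partition summary, mirroring the table above.
summary = pd.DataFrame(
    {
        name: {
            "n_ratings": len(part),
            "n_users": part["user_id"].nunique(),
            "n_items": part["item_id"].nunique(),
        }
        for name, part in {"train": train, "valid": valid, "test": test}.items()
    }
).T
print(summary)
```

Because the split is global in time rather than per-user, some users fall entirely on one side of a cutoff, which is what the user counts below look into.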
How many users can we use for collaborative filtering in the validation set?
| n_users | n_ratings |
|---|---|
| 7760 | 898685 |
And for testing?
| n_users | n_ratings |
|---|---|
| 7863 | 1229054 |
This does lose a lot of users, but enough remain that we should get reasonably useful results.
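The exact eligibility rule behind these counts isn't spelled out on this page; one plausible sketch, assuming a usable user needs some training-window history plus a minimum number of ratings in the evaluation window (both the reading and the threshold are assumptions), continuing from the split sketch above:

```python
# Hypothetical threshold; not the criterion actually used for the numbers above.
MIN_EVAL_RATINGS = 5

def usable_users(train_part, eval_part, min_ratings=MIN_EVAL_RATINGS):
    # Users with enough ratings in the evaluation window...
    counts = eval_part.groupby("user_id").size()
    eligible = counts[counts >= min_ratings].index
    # ...who also have training-window history to build a profile from.
    eligible = eligible.intersection(train_part["user_id"].unique())
    kept = eval_part[eval_part["user_id"].isin(eligible)]
    return {"n_users": kept["user_id"].nunique(), "n_ratings": len(kept)}

print(usable_users(train, valid))
print(usable_users(train, test))
```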
References
Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.