ML20M Data Splitting
This page describes the analysis used to select the cutoffs for temporally splitting the ML20M data set.
Split Windows
Following Meng et al. (2020), we prepare a global temporal split of the rating data. We target a 70/15/15 train/validation/test split, but round the cutoff timestamps so that the splits fall on clean calendar dates. The 70% and 85% quantiles of the rating timestamps give the unrounded cutoffs:
| | t_valid | t_test |
|---|---|---|
| 0 | 2007-12-08 01:20:32.600 | 2010-12-06 22:24:59.600 |
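The quantile search can be sketched as follows. This is a minimal, dependency-free illustration rather than the code behind this page: the helper names `quantile` and `split_cutoffs` are hypothetical, timestamps are assumed to be Unix seconds, and the toy data stands in for the real ratings.

```python
from datetime import datetime, timezone


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already-sorted sequence."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])


def split_cutoffs(timestamps, train_frac=0.70, valid_frac=0.15):
    """Return (t_valid, t_test): unrounded start times of the two evaluation windows."""
    ordered = sorted(timestamps)
    t_valid = quantile(ordered, train_frac)
    t_test = quantile(ordered, train_frac + valid_frac)
    # Assumption: timestamps are Unix seconds, reported here in UTC.
    return (
        datetime.fromtimestamp(t_valid, tz=timezone.utc),
        datetime.fromtimestamp(t_test, tz=timezone.utc),
    )


# Toy data (not ML20M): one rating per day for 101 days from 2000-01-01 UTC.
start = 946684800
toy = [start + day * 86400 for day in range(101)]
t_valid, t_test = split_cutoffs(toy)
print(t_valid, t_test)
```

The unrounded cutoffs are then moved to the nearest convenient calendar boundary by hand.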
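The quantiles suggest that 2008-2010 (validation) and 2011 through the end of the data (test) are reasonable windows, with everything before 2008 used for training. Splitting at these dates yields the following partition sizes: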
| | part | n_ratings | n_users | n_items |
|---|---|---|---|---|
| 0 | test | 2943856 | 25167 | 25806 |
| 1 | valid | 2992504 | 22797 | 15040 |
| 2 | train | 14063903 | 101068 | 9710 |
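A partition summary of this kind could be computed along the following lines. This is a sketch under stated assumptions: the window boundaries are taken as midnight UTC on 1 January 2008 and 2011 (the page does not state a time zone), ratings are `(user_id, item_id, timestamp)` tuples with Unix-second timestamps, and `assign_part`, `summarize`, and `utc_seconds` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Window starts suggested by the quantiles; treating them as UTC is an assumption.
T_VALID = datetime(2008, 1, 1, tzinfo=timezone.utc).timestamp()
T_TEST = datetime(2011, 1, 1, tzinfo=timezone.utc).timestamp()


def assign_part(timestamp):
    """Map a Unix timestamp (seconds) to 'train', 'valid', or 'test'."""
    if timestamp >= T_TEST:
        return "test"
    if timestamp >= T_VALID:
        return "valid"
    return "train"


def summarize(ratings):
    """Count ratings, distinct users, and distinct items in each partition."""
    n_ratings = defaultdict(int)
    users = defaultdict(set)
    items = defaultdict(set)
    for user, item, timestamp in ratings:
        part = assign_part(timestamp)
        n_ratings[part] += 1
        users[part].add(user)
        items[part].add(item)
    return {
        part: {
            "n_ratings": n_ratings[part],
            "n_users": len(users[part]),
            "n_items": len(items[part]),
        }
        for part in n_ratings
    }


def utc_seconds(year, month=1, day=1):
    """Unix timestamp for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Toy ratings (not ML20M).
toy = [
    (1, 10, utc_seconds(2005, 6, 1)),
    (1, 11, utc_seconds(2009, 3, 1)),
    (2, 10, utc_seconds(2007, 12, 31)),
    (2, 12, utc_seconds(2011, 1, 1)),
    (3, 12, utc_seconds(2014, 7, 4)),
]
summary = summarize(toy)
print(summary)
```

Because the split is global, a user's ratings can land in more than one partition, so the per-partition user counts do not sum to the number of distinct users.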
How many users can we use for collaborative filtering in the validation set?
| | n_users | n_ratings |
|---|---|---|
| 0 | 5564 | 619592 |
And for testing?
| | n_users | n_ratings |
|---|---|---|
| 0 | 5278 | 818921 |
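The page does not state what makes a user usable for collaborative filtering. One plausible reading, assumed here purely for illustration, is that the user has ratings in the evaluation partition and also at least one rating in the training partition. Under that assumption, the counts could be obtained with a sketch like this (`usable_users` and `min_train` are hypothetical names):

```python
from collections import Counter


def usable_users(ratings, part, min_train=1):
    """Count evaluation users (and their ratings) that also have training history.

    `ratings` is a list of (user_id, item_id, part) tuples, where `part` is
    'train', 'valid', or 'test'. The usability criterion is an assumption.
    """
    train_counts = Counter(user for user, _item, p in ratings if p == "train")
    eval_counts = Counter(user for user, _item, p in ratings if p == part)
    usable = {
        user: n for user, n in eval_counts.items() if train_counts[user] >= min_train
    }
    return {"n_users": len(usable), "n_ratings": sum(usable.values())}


# Toy ratings (not ML20M).
toy = [
    (1, 10, "train"), (1, 11, "valid"), (1, 12, "test"),
    (2, 10, "train"), (2, 13, "test"), (2, 14, "test"),
    (3, 15, "valid"),  # no training history, so not counted
    (4, 16, "test"),   # no training history, so not counted
]
valid_usable = usable_users(toy, "valid")
test_usable = usable_users(toy, "test")
print(valid_usable, test_usable)
```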
These counts give us enough data to work with, even if we might like more test users.
References
Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.