ML25M Data Splitting
This page describes the analysis done to select the cutoffs for temporally splitting the ML25M data set.
Split Windows
Following Meng et al. (2020), we prepare a global temporal split of the rating data. We target a 70/15/15 train/validation/test split, but round the cutoff timestamps so that the splits fall on clean calendar dates. As a starting point, we find the rating timestamps at the 70% and 85% quantiles:
t_valid | t_test |
---|---|
2015-01-22 16:43:02.600 | 2017-03-24 01:50:04.700 |
This suggests that Jan. 2015 is a reasonable validation set cutoff, and March/April 2017 a reasonable test set cutoff.
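The quantile search behind these cutoffs can be sketched roughly as follows. This is a minimal illustration using only the Python standard library and toy timestamps, not the code or data used for the analysis on this page; the helper names `time_quantile` and `fraction_before` are hypothetical, and the nearest-rank quantile rule is an assumption.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def time_quantile(sorted_ts, percent):
    """Return the timestamp at the given integer percentile (nearest-rank style)."""
    # Integer arithmetic avoids floating-point surprises in the index.
    idx = min(len(sorted_ts) - 1, percent * len(sorted_ts) // 100)
    return sorted_ts[idx]


def fraction_before(sorted_ts, cutoff):
    """Fraction of ratings with a timestamp strictly before the cutoff."""
    return bisect_left(sorted_ts, cutoff) / len(sorted_ts)


# Toy data (NOT the real ML25M timestamps): one rating per day for 1000 days.
toy = [datetime(2015, 1, 1) + timedelta(days=i) for i in range(1000)]

# Quantiles marking the start of the validation and test windows (70/15/15).
t_valid = time_quantile(toy, 70)
t_test = time_quantile(toy, 85)

# Check how far candidate "clean" calendar cutoffs drift from the targets.
train_frac = fraction_before(toy, datetime(2016, 12, 1))
pre_test_frac = fraction_before(toy, datetime(2017, 5, 1))
```

Checking the fraction of ratings before each candidate date shows how much the split proportions change when a raw quantile timestamp is replaced by a nearby calendar boundary.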
part | n_ratings | n_users | n_items |
---|---|---|---|
train | 17436354 | 121673 | 22316 |
valid | 3850116 | 27077 | 36820 |
test | 3713625 | 24324 | 53059 |
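A summary like the one above can be produced by assigning each rating to a partition by timestamp and counting distinct users and items per partition. The sketch below is an illustration under stated assumptions: the function names, the toy ratings, and the cutoff dates are hypothetical, and it assumes that each cutoff is an inclusive lower bound for the later partition.

```python
from datetime import datetime

PARTS = ("train", "valid", "test")


def assign_part(ts, valid_cutoff, test_cutoff):
    """Map a timestamp to its partition (assumes cutoffs start the later part)."""
    if ts < valid_cutoff:
        return "train"
    if ts < test_cutoff:
        return "valid"
    return "test"


def summarize_split(ratings, valid_cutoff, test_cutoff):
    """Count ratings, distinct users, and distinct items in each partition."""
    counts = {p: 0 for p in PARTS}
    users = {p: set() for p in PARTS}
    items = {p: set() for p in PARTS}
    for user, item, ts in ratings:
        part = assign_part(ts, valid_cutoff, test_cutoff)
        counts[part] += 1
        users[part].add(user)
        items[part].add(item)
    return {
        p: {"n_ratings": counts[p], "n_users": len(users[p]), "n_items": len(items[p])}
        for p in PARTS
    }


# Toy (user, item, timestamp) ratings; cutoffs are illustrative only.
toy_ratings = [
    ("u1", "i1", datetime(2014, 6, 1)),
    ("u1", "i2", datetime(2014, 7, 1)),
    ("u2", "i1", datetime(2014, 8, 1)),
    ("u2", "i3", datetime(2016, 1, 1)),
    ("u3", "i3", datetime(2016, 2, 1)),
    ("u1", "i4", datetime(2017, 6, 1)),
]
summary = summarize_split(toy_ratings, datetime(2015, 1, 1), datetime(2017, 4, 1))
```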
How many users can we use for collaborative filtering in the testing set?
n_users | n_ratings |
---|---|
6481 | 642289 |
And for validation?
n_users | n_ratings |
---|---|
4534 | 720161 |
This loses a lot of users (6481 of 24324 test users and 4534 of 27077 validation users remain), but enough are left that we should still get reasonably useful results.
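One plausible way to compute the usable-user counts is to keep only the evaluation users who also have ratings in the earlier data. The page does not state the exact criterion, so this is an assumption. The sketch below uses a hypothetical helper, `users_with_history`, on toy data.

```python
def users_with_history(history_ratings, eval_ratings):
    """Count evaluation users (and their ratings) who also appear in the history.

    Assumption: a user is usable for collaborative filtering if they have at
    least one rating in the earlier (history) partition.
    """
    known_users = {user for user, _item in history_ratings}
    kept = [(user, item) for user, item in eval_ratings if user in known_users]
    return {"n_users": len({user for user, _item in kept}), "n_ratings": len(kept)}


# Toy (user, item) pairs: only u1 appears in both partitions.
toy_train = [("u1", "i1"), ("u2", "i1"), ("u2", "i2")]
toy_test = [("u1", "i3"), ("u1", "i4"), ("u3", "i1"), ("u4", "i2")]

usable = users_with_history(toy_train, toy_test)
```

Users who first appear after the cutoff have no history to learn from under this criterion, which is why a global temporal split drops so many evaluation users.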