ML25M Data Splitting

This page describes the analysis done to select the cutoffs for temporally splitting the ML25M data set.

Split Windows

Following Meng et al. (2020), we are going to prepare a global temporal split of the rating data. We target a 70/15/15 train/tune/test split, but round the cutoff timestamps so that the split points fall on clean calendar dates. Searching for the corresponding quantiles of the rating timestamps gives us candidate cutoffs.
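A minimal sketch of that search, assuming the ratings are loaded into a pandas frame with the standard ML25M column names (the variable names and file path are illustrative, not the notebook's actual code):

```python
import pandas as pd

# load the ML25M ratings; 'timestamp' is seconds since the Unix epoch
ratings = pd.read_csv("ml-25m/ratings.csv")
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")

# for a 70/15/15 split, the tuning window starts at the 70th percentile of
# rating times and the test window starts at the 85th
cuts = ratings["timestamp"].quantile([0.70, 0.85])
cuts.index = ["t_tune", "t_test"]
print(cuts)
```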

t_tune                   t_test
2015-01-22 16:54:38.200  2017-03-24 01:52:26.050

This suggests that Jan. 2015 is a reasonable tuning set cutoff, and March/April 2017 a reasonable test set cutoff.
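A sketch of how the partition might be applied, continuing from the code above. The rounded cutoff dates 2015-01-01 and 2017-04-01 are my assumption, chosen to match the description above; the notebook's exact choices may differ slightly:

```python
import numpy as np

# rounded calendar cutoffs (assumed values, see note above)
t_tune = pd.Timestamp("2015-01-01")
t_test = pd.Timestamp("2017-04-01")

# everything before t_tune trains, before t_test tunes, the rest tests
ratings["part"] = np.where(
    ratings["timestamp"] < t_tune,
    "train",
    np.where(ratings["timestamp"] < t_test, "tune", "test"),
)

# per-partition rating, user, and item counts
summary = ratings.groupby("part").agg(
    n_ratings=("rating", "count"),
    n_users=("userId", "nunique"),
    n_items=("movieId", "nunique"),
)
print(summary)
```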

part    n_ratings  n_users  n_items
test      3713624    24324    53058
tune      3850116    27077    36820
train    17436172   121672    22316

How many test users have at least 5 training ratings?
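One way to compute this, continuing from the sketch above (treating n_ratings as the test ratings belonging to those users, which is my reading of the table below):

```python
# per-user rating counts in the training and test partitions
train_counts = ratings[ratings["part"] == "train"].groupby("userId").size()
test_counts = ratings[ratings["part"] == "test"].groupby("userId").size()

# test users with at least 5 training ratings, and their test ratings
eligible = test_counts.index.intersection(train_counts[train_counts >= 5].index)
print(pd.DataFrame({
    "n_users": [len(eligible)],
    "n_ratings": [int(test_counts.loc[eligible].sum())],
}))
```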

n_users  n_ratings
   6481     642289

And for tuning?

n_users  n_ratings
   4534     720161

This gives us enough data to work with, even though we might like more test users. To be more thorough, let's look at how many test users we have at each training rating count:
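A sketch of that tabulation, continuing from the code above; this counts how many test users remain at each minimum training-rating threshold, which supports the conclusion below:

```python
# training rating count for every test user (0 if they never appear in training)
test_users = ratings.loc[ratings["part"] == "test", "userId"].unique()
n_train = train_counts.reindex(test_users, fill_value=0)

# test users remaining at each minimum training-rating threshold
remaining = pd.DataFrame({
    "min_train_ratings": list(range(21)),
    "n_test_users": [int((n_train >= k).sum()) for k in range(21)],
})
print(remaining)
```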

Since we see very little loss up through 10–11 training ratings, we will use all users who appear at least once in training as our test users.

References

Meng, Zaiqiao, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2020. “Exploring Data Splitting Strategies for the Evaluation of Recommendation Models.” In Fourteenth ACM Conference on Recommender Systems, 681–86. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3383313.3418479.