Issues with reproducibility have been identified as a major factor hampering progress in recommender systems research.
In response, researchers increasingly share the code of their models. However, providing only the code of the proposed model
is usually not sufficient to ensure reproducibility. In many works, the central claim is that a new model advances the
state of the art. Thus, it is crucial that the entire experiment is reproducible, including the configuration and the results of the
considered baselines.
With this work, our goal is to gauge the level of reproducibility in algorithms research in recommender systems.
We systematically analyzed the reproducibility level of
65 papers published at a top-ranked conference during the last three years. Our results are sobering. While the model code is shared
in about two thirds of the papers, the code of the baselines is provided in only eight cases. The hyperparameters of the baselines are
reported even less frequently, and no paper explains exactly how they were determined. As a result, it is
commonly impossible to reproduce the full result tables reported in the papers,
and it also remains unclear whether the claimed improvements over the state of the art were actually achieved.
Overall, we conclude that the research community has not yet reached the required level of reproducibility.
We therefore call for more rigorous reproducibility standards to ensure progress in this field.
Note: To avoid singling out individual authors, we refrain from sharing the detailed analysis of the considered papers on GitHub.
However, interested readers may request access to the full analysis via the Zenodo platform.