ACM UMAP '26, Gothenburg, Sweden

Reproducibility of LLM-based Recommender Systems Research: Worrying Observations from an Initial Analysis

Additional material: source code and datasets.

Faisal Shehzad
University of Klagenfurt, Austria
Dietmar Jannach
University of Klagenfurt, Austria

Large language models (LLMs) are disrupting recommender systems research and have become central to modern recommendation architectures. At the same time, LLMs introduce additional stochasticity and architectural complexity, exacerbating existing reproducibility challenges. In this study, we assess the reproducibility of recent LLM-based recommender systems research to highlight gaps in this rapidly evolving area. We define a set of LLM-specific reproducibility variables and systematically analyze a representative sample of 64 papers published at top-ranked conferences. Our findings reveal several concerns: while source code is shared in nearly two-thirds of the papers, none provides all the details required to fully reproduce the reported results. For example, fine-tuning strategies are used in 39 papers, yet only about 15% supply complete fine-tuning materials. Overall, we argue that the community should not overlook emerging reproducibility issues surrounding LLMs and should adopt more rigorous standards to ensure reliable progress.

Note: To avoid singling out individual authors, we refrain from sharing the detailed analysis of the considered papers on GitHub. However, interested readers may request access to the full analysis via the Zenodo platform.