Large language models (LLMs) are disrupting recommender systems research and have become central
to modern recommendation architectures. At the same time, LLMs introduce additional stochasticity
and architectural complexity, exacerbating existing reproducibility challenges.
In this study, we assess the reproducibility of recent LLM-based recommender systems research to
highlight gaps in this rapidly evolving area. We define a set of LLM-specific reproducibility
variables and systematically analyze a representative sample of 64 papers published at top-ranked
conferences. Our findings reveal several concerns: while source code is shared for nearly two-thirds
of the papers, none provides all the details required to fully reproduce the reported results.
For example, fine-tuning strategies are used in 39 papers, yet only about 15% of these supply
complete fine-tuning materials. Overall, we argue that the community should not overlook emerging reproducibility
issues surrounding LLMs and should adopt more rigorous standards to ensure reliable progress.
Note: To avoid singling out individual authors, we refrain from sharing the detailed
analysis of the considered papers on GitHub.
However, interested readers may request access to the full analysis via the Zenodo platform.