Authors :
Niraj Patel
Volume/Issue :
Volume 10 - 2025, Issue 2 - February
Google Scholar :
https://tinyurl.com/33ym29nh
Scribd :
https://tinyurl.com/4d5uvaau
DOI :
https://doi.org/10.5281/zenodo.14987459
Abstract :
This study examines whether Wikipedia edit data can predict opening box office revenues for films released in the US.
After analyzing films from 2007 to 2011, we developed a predictive model based on Wikipedia article edits, using
gradient boosting trees as the primary algorithm. The model's features include the frequency of Wikipedia edits, the
size and content of article revisions, and the revenues of similar films. The results show that Wikipedia activity can
serve as a rough indicator of film popularity, though the model’s predictive accuracy is limited. Wikipedia-based
features, particularly edit runs and content changes, contribute significantly to the model’s performance, and the
model achieves an R² of 0.54 on films released in 2012. This suggests that while Wikipedia data offers valuable insight
into social interest, it is best combined with other predictors to obtain more reliable revenue estimates.
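
To make the modelling approach concrete, the following is a minimal sketch of gradient boosting with squared-error loss, together with the R² score used to report accuracy. It is not the study's implementation: it uses depth-1 trees (stumps) written in plain Python, the data is synthetic, and the three feature columns (edit count, mean revision size, revenue of similar films) are hypothetical stand-ins for the feature groups named in the abstract.

```python
import random


def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the one-feature threshold split with lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no valid split exists: fall back to a constant prediction
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]  # (feature index, threshold, left value, right value)


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_gbt(X, y, n_rounds=100, learning_rate=0.1):
    """Each round fits a stump to the current residuals (the negative gradient of squared error)."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row) for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)


def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic data only. Hypothetical columns: edit count, mean revision size (bytes),
# revenue of similar films. The target is an invented function of them plus noise.
rng = random.Random(0)
X = [[rng.uniform(0, 500), rng.uniform(0, 2000), rng.uniform(1, 60)] for _ in range(120)]
y = [0.05 * edits + 0.5 * similar + rng.gauss(0, 3) for edits, _, similar in X]

X_train, y_train, X_test, y_test = X[:90], y[:90], X[90:], y[90:]
model = fit_gbt(X_train, y_train)
r2_train = r_squared(y_train, [predict(model, row) for row in X_train])
r2_test = r_squared(y_test, [predict(model, row) for row in X_test])
print(f"train R^2 = {r2_train:.2f}, held-out R^2 = {r2_test:.2f}")
```

Scoring on rows the model never saw mirrors the abstract's setup of evaluating on a later year. An R² of 0.54 would mean the model accounts for roughly half the variance in the held-out revenues.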