The game of baseball has no shortage of statistics, from batting average to exit velocity, strikeouts to wins above replacement. Among all sports, Major League Baseball (MLB) arguably has the most analytical and data-driven participants and fans. Whether subconsciously or viscerally, players and managers on the field and those following from anywhere are constantly assessing game play trends and expectations and making decisions based on them: whether a batter will come through with a hit in an important situation, or when a pitcher should be pulled. Less analyzed, however, is what leads fans to become engaged with certain players or teams, and what factors drive their love of the game. This is the motivation behind the problem posed by Major League Baseball in its Kaggle competition for Player Digital Engagement Forecasting. Can you use machine learning to deconstruct baseball fandom?
This competition asks you to predict measures of digital engagement for each active player on a daily basis during the MLB season. So, how large was the surge in fan interest after Joe Musgrove threw the first no-hitter in Padres history? Is Shohei Ohtani’s engagement higher when he pitches well, when he hits a monster home run…or when he does both? You’re provided a wealth of game, team and player information – detailed stats, awards, rosters, and transaction information – as well as social and digital engagement data as your inputs. Data scientists will recognize this as an exciting forecasting problem with both traditional regression and time series components, where having this input data just prior to the prediction date is critical to determining which players will receive the most engagement.
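To make the time series side of the problem a little more concrete, here is a minimal sketch of one common way to turn "what happened just before the prediction date" into model inputs: lag features. Everything in it is an assumption for illustration. The player IDs, dates, scores, and the `add_lag_features` helper are hypothetical and do not reflect the competition's actual data schema or any recommended approach.

```python
from datetime import date, timedelta

# Hypothetical daily records: (player_id, day, engagement_score).
# These names and values are made up and are not the competition's schema.
records = [
    ("player_a", date(2021, 4, 1), 10.0),
    ("player_a", date(2021, 4, 2), 12.0),
    ("player_a", date(2021, 4, 3), 55.0),
    ("player_b", date(2021, 4, 1), 3.0),
    ("player_b", date(2021, 4, 3), 4.0),  # no row for April 2
]


def add_lag_features(rows, lags=(1, 2)):
    """Attach each player's engagement from `lag` days earlier (None if missing)."""
    # Index scores by (player, day) so earlier days can be looked up directly.
    by_key = {(player, day): score for player, day, score in rows}
    out = []
    for player, day, score in rows:
        feats = {"player_id": player, "date": day, "target": score}
        for lag in lags:
            earlier_day = day - timedelta(days=lag)
            feats[f"lag_{lag}"] = by_key.get((player, earlier_day))
        out.append(feats)
    return out


features = add_lag_features(records)
for row in features:
    print(row)
```

Each output row pairs a day's engagement (the target) with the same player's engagement one and two days earlier, which a regression model could then use as inputs. Days with no earlier record are left as `None` rather than guessed.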
With so many variables in the game, countless factors could influence fan engagement. Eleven-time All-Star Miguel Cabrera delighted fans by hitting the first home run of the season – in the snow! Occasionally a lesser-known player like Musgrove or Carlos Rodón “wins the day” with an unlikely no-hitter. And sometimes just getting traded to an iconic franchise like the Yankees generates a ton of fan interest, as it did for Rougned Odor in early April.
As these examples show, a player’s digital engagement can be pretty dynamic during the season, with many different potential contributors to who is “trending” on a given day. How can you use data to uncover which factors have the most influence on engagement with each player’s digital content?
Ready to play ball? Check out the competition on Kaggle for all the details. $50,000 in prizes is up for grabs in two prize categories. The code competition puts your machine learning skills to the test, to see who can build the most accurate forecasting models to predict daily digital engagement for every active player. You’ll have until July 31st to build your models and then be evaluated on a future time frame, which will determine the winners. For data visualization and exploration experts out there, the explainability prizes give you an opportunity to analyze more broadly which factors, even those outside of what we’re providing directly, most influence digital engagement. You’ll be evaluated on how well you can use what the data is telling you to support your findings.
And if you’re looking to get started, we’ve provided an introductory video and some notebook tutorials, including a starting point for harnessing the power of Vertex AI through tools such as Cloud Notebooks, Explainable AI, and Vizier.
With the second half of the season upon us, it’s an exciting time to be an MLB fan. With this Kaggle competition, it’s also a perfect opportunity to use data science to help understand baseball fandom and potentially earn some of your own accolades in the process. Step up to the plate!
Major League Baseball trademarks and copyrights are used with permission of Major League Baseball. Visit MLB.com.