Learn how analyzing stats from professional sports leagues is an instructive use case for data analytics using Apache Spark with SQL. Covered in this installment: data exploration with Apache Impala (incubating) and Hue.
In Part 1 of this series, I introduced the topic of using fantasy sports analytics as an instructive use case for exploring the Apache Hadoop ecosystem. In that installment, we focused on data processing by taking a collection of data from BasketballReference.com and enriching it with z-scores and normalized z-scores to analyze the relative value of NBA players. This time, we’ll continue exploring the data interactively with Apache Impala (incubating) in the Hue UI. This example will help illustrate that with CDH providing a storage layer and tools for data processing and data analytics like Spark and Impala, you can easily transform and explore data in a variety of ways.
All the code for this post is available on GitHub; refer to Part 1 for an overview of the data processing that got us to this point.
Interactive Data Analysis: Finding Trends in Age and Experience
Last time, you learned how to work with a DataFrame and query it using Apache Spark SQL. This time, you’ll learn how to ask some more difficult questions and quickly get answers with Spark and Impala. Because we were thorough in our data processing, you’ll be able to get results from the data with only a few simple queries. This approach is typical of the Apache Hadoop best practice of using data processing frameworks to reduce the complexity of interactive queries.
A burning question that is always on the mind of fantasy sports owners is: “How is age going to affect a player’s season?” Athletes are mere humans, and in time, their skills decline. When does an all-star regress to being an average player? Can we calculate the expected gain/loss of the value of a player as they age? We’ll focus on that topic here.
We’ll look at both zTot and nTot, and consider the player’s age and experience. The latter is potentially important because the age at which players join the league has shifted over the timespan we are considering. It used to be rare for players to skip college, then it wasn’t, and now they are required to play at least one year. It will be interesting to see whether the numbers show a difference between age and experience.
We start with the RDD containing all the raw stats, z-scores, and normalized z-scores. Another piece of data to consider is how a player’s z-score and normalized z-score change each year, so we’ll calculate the change in both from year to year. We’ll save off two sets of data: one of key-value pairs keyed by age, and one of key-value pairs keyed by experience. (Note that in this analysis, we disregard all players who played in 1980, as we don’t have sufficient data to determine their experience level.)

import scala.collection.mutable.ListBuffer

//group data by player name
//tuple fields, as used below: _1 = nTot, _2 = zTot, _3 = age, _5 = per-category values, _6 = experience
val pStats = dfPlayers.sort(dfPlayers("name"), dfPlayers("exp").asc).map(x =>
  (x.getString(1), (x.getDouble(50), x.getDouble(40), x.getInt(2), x.getInt(3),
    Array(x.getDouble(31), x.getDouble(32), x.getDouble(33), x.getDouble(34), x.getDouble(35),
      x.getDouble(36), x.getDouble(37), x.getDouble(38), x.getDouble(39)),
    x.getInt(0)))).groupByKey()
pStats.cache

//for each player, go through all the years and calculate the change in valueZ and valueN, save into two lists
//one for age, one for experience
//exclude players who played in 1980 from experience, as we only have partial data for them
val excludeNames = dfPlayers.filter(dfPlayers("year") === 1980).select(dfPlayers("name")).map(x => x.mkString).collect().toSet
val pStats1 = pStats.map { case (name, stats) =>
  var last = 0        //age in the previous season; 0 means no season seen yet
  var deltaZ = 0.0
  var deltaN = 0.0
  var valueZ = 0.0
  var valueN = 0.0
  var exp = 0
  val aList = ListBuffer[(Int, Array[Double])]()
  val eList = ListBuffer[(Int, Array[Double])]()
  stats.foreach(z => {
    if (last > 0) {
      deltaN = z._1 - valueN
      deltaZ = z._2 - valueZ
    } else {
      //first season for this player: there is no previous year to compare against
      deltaN = Double.NaN
      deltaZ = Double.NaN
    }
    valueN = z._1
    valueZ = z._2
    last = z._3
    aList += ((last, Array(valueZ, valueN, deltaZ, deltaN)))
    //keep experience data only for players who did not appear in 1980
    if (!excludeNames.contains(name)) {
      exp = z._6
      eList += ((exp, Array(valueZ, valueN, deltaZ, deltaN)))
    }
  })
  (aList, eList)
}
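The year-over-year delta logic in the listing above can also be tried out without a cluster. The following is a minimal sketch on plain Scala collections, not the code from the post's repository: the Season case class and the withDeltas function are hypothetical names introduced here for illustration. It sorts a player's seasons by experience explicitly rather than relying on the incoming order, and it uses None where the listing uses Double.NaN for a first season.

```scala
// Hypothetical, Spark-free sketch of the per-player delta calculation.
object SeasonDeltas {
  // One season for a single player: years of experience, zTot and nTot.
  final case class Season(exp: Int, zTot: Double, nTot: Double)

  // Returns (exp, zTot, nTot, deltaZ, deltaN) for each season, ordered by experience.
  // The first season has no predecessor, so both deltas are None.
  def withDeltas(seasons: Seq[Season]): Seq[(Int, Double, Double, Option[Double], Option[Double])] = {
    val ordered = seasons.sortBy(_.exp)
    // Pair every season with the one before it (None for the first).
    val previous: Seq[Option[Season]] = None +: ordered.dropRight(1).map(Some(_))
    ordered.zip(previous).map { case (cur, prev) =>
      (cur.exp, cur.zTot, cur.nTot,
        prev.map(p => cur.zTot - p.zTot),
        prev.map(p => cur.nTot - p.nTot))
    }
  }
}
```

Using Option instead of NaN makes it harder to average a missing delta by accident; whichever marker is used, the aggregation step has to skip first seasons when computing mean deltas.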

We’ll now process the list of age-value pairs. A new function is defined, processStatsAgeOrExperience, which goes through our normal process of mapping raw statistical data to bballStats objects and reducing it by each statistic. This gives our aggregate stats for each statistic. We’ll again need to take our RDD and convert it into a DataFrame. In this example, we’ll load it directly into Apache Hive for querying by Impala in Hue. Hue lets you take advantage of some visualizations that will make analyzing the data a little easier.

//extract out the age list
val pStats2 = pStats1.flatMap { case (x, y) => x }
//create age data frame
val dfAge = processStatsAgeOrExperience(pStats2, "age")
//save as table
dfAge.saveAsTable("Age")
//extract out the experience list
val pStats3 = pStats1.flatMap { case (x, y) => y }
//create experience DataFrame
val dfExperience = processStatsAgeOrExperience(pStats3, "Experience")
//save as table
dfExperience.saveAsTable("Experience")
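The body of processStatsAgeOrExperience is not shown in this post. As a rough illustration of the kind of aggregation it performs, here is a minimal sketch on plain Scala collections that assumes the aggregate is a per-key mean (which matches the mean z-score plotted later) and that NaN deltas from first seasons are skipped. The name meanByKey and both of those choices are assumptions for illustration, not the actual implementation.

```scala
// Hypothetical sketch: average (key, values) pairs per key, position by position.
object MeanByKey {
  // key is an age or an experience level; values are e.g. Array(zTot, nTot, deltaZ, deltaN).
  // NaN entries are left out of the mean for their position; a position with no
  // usable entries yields NaN.
  def meanByKey(pairs: Seq[(Int, Array[Double])]): Map[Int, Array[Double]] =
    pairs.groupBy(_._1).map { case (key, group) =>
      val rows = group.map(_._2)
      val width = rows.map(_.length).max
      val means = Array.tabulate(width) { i =>
        val xs = rows.collect { case r if i < r.length && !r(i).isNaN => r(i) }
        if (xs.isEmpty) Double.NaN else xs.sum / xs.size
      }
      key -> means
    }
}
```

For example, MeanByKey.meanByKey(Seq((24, Array(1.0, Double.NaN)), (24, Array(3.0, 0.5)))) averages the first position over both rows and the second position over the single non-NaN entry.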

Visualizing the Results via Hue
Now let’s hop over to Hue to see our results. We’ll query the Age table and utilize the charting feature of Hue to visualize the results, plotting age on the x-axis and mean z-score on the y-axis:
Select * from Age Order By age asc
Average zTot for each age group
The findings are pretty clear: on average, young players struggle to contribute value until their mid-twenties, peak at 28, create pretty good value in their early 30s, and then tail off quickly starting at 37. If you recall that we earlier looked at the spread of recorded seasons across the different ages, most of the seasons were concentrated between the ages 22-32, which means the tails on either end are working with very small amounts of data, hence the large swings. (The youngest and oldest players also get fewer minutes on average, which affects their values in counting stats [PTS, 3P, TRB, AST, STL, BLK] and results in lower z-scores and normalized z-scores. Our analysis ignores minutes played in calculating values, but a similar exercise can be done on stats averaged out per 36 minutes of play. Players who don’t play much generally don’t significantly contribute to a fantasy team, so ignoring playing time suits our purposes here.)
Looking at nTot values for each age, we see a similar pattern of players peaking in their late 20s and declining through the 30s. Note that the mean nTot is much lower than the mean zTot, which one can interpret as there being more below-average players and only a few “stars” who excel in a majority of categories, making great players a rarity. (This is likely one of the reasons it’s challenging to find deep leagues that dig into more than the top 120-150 players in the league. In 1980, we have 55 players posting net positive normalized z-scores and 115 in 2015.)
Average nTot for each age group
By inspecting the change in z-scores and normalized z-scores, we notice that a player on average continues to improve his game until age 26, then begins to decline:
Change in zTot for each age group compared to the previous year
Change in nTot for each age group compared to the previous year
If you’re wondering why the average player peaks at 28 but only continues getting better until age 26, the reason is that we’re not looking at the same set of players from each age to the next. Recall that the peak year for player participation is age 24, in which 1,626 seasons have been logged. That means that in the beginning we’re calculating average z-scores that include rookies, but those are not included in those years’ delta scores, as they have not yet logged a full season. Similarly, as players begin to drop out of the league, they no longer factor into the delta calculations. At age 25, we have 1,455 seasons logged, down 171 from age 24. These are largely players who were cut from the league (i.e. among the worst players), so removing them results in a net positive across all players of that age group, even if the average player is beginning to decline slightly.
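The effect described above is easy to reproduce with a toy example. The sketch below uses made-up zTot values for five hypothetical players (none of these numbers come from the data set): when the weakest players leave the league, the average for the age group rises even though every remaining player got slightly worse.

```scala
// Toy illustration of the survivorship effect; all values are invented.
object SurvivorshipDemo {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // zTot for five hypothetical players at age 24.
  val age24: Map[String, Double] =
    Map("A" -> 4.0, "B" -> 2.0, "C" -> 0.0, "D" -> -4.0, "E" -> -6.0)

  // At age 25 the two weakest players are out of the league,
  // and each remaining player has dropped by 0.5.
  val age25: Map[String, Double] =
    Map("A" -> 3.5, "B" -> 1.5, "C" -> -0.5)

  // Average level of the age group, over whoever played that season.
  val meanAt24: Double = mean(age24.values.toSeq)
  val meanAt25: Double = mean(age25.values.toSeq)

  // Average year-over-year change, over players present in both seasons.
  val meanDelta: Double =
    mean(age25.toSeq.map { case (name, z) => z - age24(name) })
}
```

Here meanAt25 is higher than meanAt24 while meanDelta is negative, which is the same pattern as a peak in average value arriving later than the last year of average improvement.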
Next we’ll look at the same metrics, now organized by experience:
Select * from Experience Order By experience asc
Average zTot of players by experience level
Recall that by looking at experience, we are removing the ambiguity about what age players enter the league. This approach offers a better view of the longevity of a player and the wear and tear they sustain over time. Additionally, we record 2,738 total seasons of 0 experience, and the count declines each year after that. (Indeed, 626 players never make it to their second year in the NBA!) Most players prove to peak around year 7, but return positive value while steadily declining until around year 13. Note that by year 14 we are down to 196 seasons logged, so we’re really dealing with an elite group that can sustain productivity that long. Looking at nTot tells a similar story, so we’ll omit it. Looking at the delta z-scores shows us that players stop improving, on average, after their 4th year.
Change in zTot among player experience compared to the previous year
Note that in year 4 we are down to 1,225 players, fewer than half the number we started with.
Oddly, looking at age alone seems to indicate that players have a longer period of growth than when just looking purely at experience. This anomaly is due to the fact that players enter the league at different ages, and there is usually a sharp increase in the first couple of years followed by a gradual decline. If we look at the number of rookies per age, we see that there have only been 100 who were 18 or 19, 400 who were 20 or 21, and over 1,400 who were 22 or 23. After that, it drops off quickly. Considering that most players enter the league at 22 or 23 and on average will continue to improve for 4 years, the average player would appear to improve into age 26 or 27, which is what we saw when we looked at the delta stats over age. This fact highlights the importance of knowing the context around the data to avoid making erroneous conclusions. With that in mind, we’ll focus on experience over age going forward.
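The arithmetic behind that reconciliation can be spelled out in a few lines. This is only a sketch using the rounded rookie counts quoted above and the four improving seasons from the experience analysis; the object and value names are hypothetical.

```scala
// Back-of-the-envelope check: most common entry age plus seasons of improvement.
object ApparentPeakAge {
  // Rounded rookie counts quoted in the text, keyed by (youngest, oldest) entry age.
  val rookiesByEntryAge: Map[(Int, Int), Int] =
    Map((18, 19) -> 100, (20, 21) -> 400, (22, 23) -> 1400)

  // Seasons over which players improve on average, per the experience analysis.
  val improvingSeasons = 4

  // The entry bracket with the most rookies dominates the per-age averages.
  val mostCommonEntry: (Int, Int) = rookiesByEntryAge.maxBy(_._2)._1

  // Ages at which a typical player from that bracket is still improving.
  val lastImprovingAges: (Int, Int) =
    (mostCommonEntry._1 + improvingSeasons, mostCommonEntry._2 + improvingSeasons)
}
```

With these inputs the most common entry bracket is 22 to 23, so the last improving ages come out as 26 to 27, in line with the delta-by-age chart.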
Also, don’t forget that we’re speaking of the average player here—some break the rules. Let’s look at a few examples, picking nTot as our metric. (zTot leads to similar conclusions.)
nTot scores for Michael Jordan per experience level
nTot scores for Shaquille O’Neal per experience level
nTot scores for Allen Iverson per experience level
Here we see trends that agree with our analysis above. By Michael Jordan’s 4th year, he had reached fantasy-elite status and his subsequent years were around the same mark. Similarly, Allen Iverson and Shaquille O’Neal were both playing at or near their peak level within 4 years. Being elite players, they maintained high value late into their careers, which is not what we would expect of the “average” player, but even in the examples above, there were a couple of years of growth before stardom was justified.
A few players arguably break from the average trajectory:
 Kyle Korver is an interesting example of a player who had trouble in his early years but emerged as a specialist and has thrived in his 30s.
nTot scores for Kyle Korver per experience level
 Tyreke Evans is a rare player who peaked in his rookie year and has not improved since due to injuries.
nTot scores for Tyreke Evans per experience level
 Kawhi Leonard has not regressed in his short career. He spent many of his first years in reserve, so his value has increased as he has seen more playing time and taken up more of an offensive responsibility. It’s likely that he’s reaching his peak from a fantasy standpoint, as he’s already ranked in the top 10 this year, so his development is likely reaching its conclusion this season.
nTot scores for Kawhi Leonard per experience level
 Stephen Curry, however, was troubled with injuries early in his career and has arguably seen his growth delayed. He’s 27 now and possibly just had the best season he will ever have.
nTot scores for Stephen Curry per experience level
All of this goes to show that it pays to know the context of the players in providing insight past the numbers.
Which brings us to our next point: Who’s the best of all time? The answer differs depending on whether we look at z-scores or normalized z-scores. Looking at raw z-scores, we see that Curry just had the best year ever. Larry Bird’s 1987 campaign comes in a close second. We see other great players in the mix as well, like Michael Jordan and Kevin Durant.
select name, year, age, zTot from players order by zTot desc
When we switch to normalized z-scores, it becomes the Michael Jordan show: he records 5 of the top 10 seasons.
Curry’s 2015-16 season comes in at number 6. That’s certainly impressive, but he still lags Michael Jordan’s peak years. Curry could improve in the next few years, but given his tenure in the league, it’s more likely he will start on a slight downward trend.
Conclusion
In this installment, we used Spark and Impala to determine at what year in a player’s career he is expected to peak. We also calculated how much a player is expected to change in value from one year to the next. These are valuable data points that can be used to help determine which players are expected to increase or decrease in value between seasons.
Jordan Volz is a Systems Engineer at Cloudera.