library(tidyverse)
Last year I found some sample data while reading about PGA ShotLink1. I decided to put together a series of posts using this data. This first post will cover some simple data cleaning and visualization.
The sample contains 4 delimited text files: event-level, hole-level, round-level, and shot-level. I’ll focus on the shot-level data in this post.
<- read.delim(file ='shotlink_sample/StrokeLevelTOURChamp.txt')
shot_level str(shot_level, vec.len = 1)
'data.frame': 8422 obs. of 38 variables:
$ tour_code : chr "R" ...
$ tour_description : chr "PGA TOUR" ...
$ year : int 2011 2011 ...
$ tourn_num : int 410 410 ...
$ Player.. : int 1810 1810 ...
$ Course.. : int 688 688 ...
$ Permanent.Tournament.. : int 60 60 ...
$ Player.First.Name : chr "Phil" ...
$ Player.Last.Name : chr "Mickelson" ...
$ Round : int 1 1 ...
$ Tournament.Name : chr "TOUR Championship by Coca-Cola" ...
$ Course.Name : chr "East Lake GC" ...
$ Hole : int 1 1 ...
$ Hole.Score : int 4 4 ...
$ Par.Value : int 4 4 ...
$ Yardage : int 424 424 ...
$ Shot : int 1 2 ...
$ Shot.Type.S.P.D. : chr "S" ...
$ X..of.Strokes : int 1 1 ...
$ From.Location.Scorer. : chr "Tee Box" ...
$ From.Location.Laser. : chr "" ...
$ To.Location.Scorer. : chr "Primary Rough" ...
$ To.Location.Laser. : chr "Right Rough" ...
$ Distance : int 10677 4932 ...
$ Distance.to.Pin : int 15444 4788 ...
$ In.the.Hole.Flag : chr "N" ...
$ Around.the.Green.Flag : chr "N" ...
$ X1st.Putt.Flag : chr "" ...
$ Distance.to.Hole.after.the.Shot: int 4788 281 ...
$ Time : int 1255 1300 ...
$ Lie : chr "Good" ...
$ Elevation : chr "With" ...
$ Slope : chr "Level" ...
$ X.Coordinate : chr "9,093.97" ...
$ Y.Coordinate : chr "9,859.90" ...
$ Z.Coordinate : num 480 ...
$ Distance.from.Center : int 562 232 ...
$ Distance.from.Edge : int 126 25 ...
This table has a ton of interesting fields to explore, but first I want to clean up the column names - the janitor
package makes this really easy.
<- shot_level %>%
shot_level_clean ::clean_names() %>%
janitorrename(num_of_strokes = x_of_strokes,
first_putt_flag = x1st_putt_flag)
%>%
shot_level_clean colnames() %>%
head(10)
[1] "tour_code" "tour_description" "year"
[4] "tourn_num" "player" "course"
[7] "permanent_tournament" "player_first_name" "player_last_name"
[10] "round"
Next, I’d like to get a feel for the scope of the data.
#How many tournaments/rounds?
%>%
shot_level_clean select(year, tournament_name, round) %>%
distinct()
year tournament_name round
1 2011 TOUR Championship by Coca-Cola 1
2 2011 TOUR Championship by Coca-Cola 2
3 2011 TOUR Championship by Coca-Cola 3
4 2011 TOUR Championship by Coca-Cola 4
#How many players?
%>%
shot_level_clean select(player,
player_first_name,%>%
player_last_name) distinct()
player player_first_name player_last_name
1 1810 Phil Mickelson
2 2206 David Toms
3 6527 Steve Stricker
4 6567 Vijay Singh
5 20645 John Senden
6 21731 Fredrik Jacobson
7 21878 Mark Wilson
8 21961 Charles Howell III
9 22046 Geoff Ogilvy
10 22371 Aaron Baddeley
11 22405 Justin Rose
12 23108 Matt Kuchar
13 23623 Bo Van Pelt
14 23983 Luke Donald
15 24357 K.J. Choi
16 24502 Adam Scott
17 24663 Y.E. Yang
18 24781 Hunter Mahan
19 24924 Bill Haas
20 24925 Jonathan Byrd
21 25686 Jason Dufner
22 25804 Bubba Watson
23 26476 Chez Reavie
24 27095 Nick Watney
25 27649 Brandt Snedeker
26 28089 Jason Day
27 29221 Webb Simpson
28 30925 Dustin Johnson
29 31323 Gary Woodland
30 33141 Keegan Bradley
#Player scores?
%>%
shot_level_clean group_by(player,
player_first_name,
player_last_name,%>%
round) summarize(shots = n(),
.groups = "keep") %>%
ungroup() %>%
mutate(round = paste0("round_",round)) %>%
pivot_wider(names_from = round,
values_from = shots) %>%
as.data.frame() %>%
mutate(final_score = round_1 + round_2 + round_3 + round_4) %>%
arrange(final_score) %>%
select(-player, -player_first_name) %>%
head()
player_last_name round_1 round_2 round_3 round_4 final_score
1 Mahan 67 68 66 71 272
2 Donald 66 68 70 69 273
3 Choi 68 65 70 70 273
4 Haas 68 67 70 68 273
5 Baddeley 68 70 64 73 275
6 Scott 67 65 75 68 275
At first glance, the data looks straightforward: one row for every stroke in the 2011 FedEx Cup Playoffs, with Hunter Mahan defeating the field of 30 with a final score of 272. However, a quick Google search shows that Bill Haas actually won the tournament in a playoff over Hunter Mahan. Referencing the docs, the number of strokes assessed actually comes from the # of Strokes
2 column (renamed to num_of_strokes
above) to account for drops and penalties. Adjusting the code above:
%>%
shot_level_clean group_by(player,
player_first_name,
player_last_name,%>%
round) summarize(shots = sum(num_of_strokes),
.groups = "keep") %>%
ungroup() %>%
mutate(round = paste0("round_",round)) %>%
pivot_wider(names_from = round,
values_from = shots) %>%
as.data.frame() %>%
mutate(final_score = round_1 + round_2 + round_3 + round_4) %>%
arrange(final_score) %>%
select(-player, -player_first_name) %>%
head()
player_last_name round_1 round_2 round_3 round_4 final_score
1 Mahan 67 68 66 71 272
2 Haas 68 67 69 68 272
3 Baddeley 68 69 64 72 273
4 Donald 66 68 70 69 273
5 Choi 68 65 70 70 273
6 Howell III 67 71 68 68 274
These scores match the round-level scores on the Wikipedia page, which is good enough for me.
Next - I want to take a look at the x/y/z coordinate data. The docs give definitions for these fields:
The X coordinate in an X/Y/Z grid system used by lasers to pinpoint the ball when the shot ends. The value is zeros for the shot that finishes in the hole. The numbers in the grid system represent feet.
The Y coordinate in an X/Y/Z grid system used by lasers to pinpoint the ball when the shot ends. The value is zeros for the shot that finishes in the hole. The numbers in the grid system represent feet.
The Z coordinate in an X/Y/Z grid system used by lasers to pinpoint the ball when the shot ends. The value is zeros for the shot that finishes in the hole. The numbers in the grid system represent feet, and ‘Z’ is the vertical component of the position.
The shot that finishes in the hole will get 0 values, but these definitions don’t indicate whether the coordinates are relative to each hole or the overall course. Plotting the shots should help clear this up. Additionally, I need to convert the coordinate columns from character
to numeric
.
library(ggplot2)
%>%
shot_level_clean mutate(x = str_replace_all(string = x_coordinate,
pattern = ",",
replacement = "")) %>%
mutate(y = str_replace_all(string = y_coordinate,
pattern = ",",
replacement = "")) %>%
mutate(z = str_replace_all(string = z_coordinate,
pattern = ",",
replacement = "")) %>%
mutate(x = as.numeric(x),
y = as.numeric(y),
z = as.numeric(z)) %>%
ggplot(mapping = aes(x = x,
y = y)) +
geom_point()
The non-zero shot coordinates look like a golf course, so I think it makes sense to set the 0 values to NA
since the coordinates to not appear to match the grid system.
<- shot_level_clean %>%
shot_level_clean_coords mutate(x = str_replace_all(string = x_coordinate,
pattern = ",",
replacement = "")) %>%
mutate(y = str_replace_all(string = y_coordinate,
pattern = ",",
replacement = "")) %>%
mutate(z = str_replace_all(string = z_coordinate,
pattern = ",",
replacement = "")) %>%
mutate(x = as.numeric(x),
y = as.numeric(y),
z = as.numeric(z)) %>%
mutate(x = ifelse(x == 0,NA,x),
y = ifelse(y == 0,NA,y),
z = ifelse(z == 0,NA,z)) %>%
select(-x_coordinate,
-y_coordinate,
-z_coordinate)
%>%
shot_level_clean_coords filter(!is.na(x)) %>%
mutate(hole = factor(hole, ordered = T)) %>%
ggplot(mapping = aes(x = x,
y = y,
color = hole)) +
geom_point() +
coord_equal() +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.background = element_blank())
Definitely looks like a golf course, especially when you overlay it on top of the Google Maps view of East Lake Golf Club in Atlanta:
Next I’d like to take a look at the cut3 information for each shot. The table has 4 columns with this information:
To Location (Scorer): General location from which the shot hit ended as recorded by the walking scorer
To Location (Laser): Enhanced location from which the shot hit ended as recorded by a laser device which tracks ball position using coordinates on a course map
From Location (Scorer): General location from which the shot began as recorded by the walking scorer
From Location (Laser): Enhanced location from which the shot began as recorded by a laser device which tracks ball position using coordinates on a course map
The x/y coordinate data matches the “to location” - so I’ll focus on those columns for now.
%>%
shot_level_clean_coords filter(!is.na(x)) %>%
group_by(to_location_laser,
%>%
to_location_scorer) summarize(rows = n(),
.groups = "keep") %>%
as.data.frame() %>%
arrange(desc(rows))
to_location_laser to_location_scorer rows
1 Unknown Green 3548
2 Left Fairway Fairway 578
3 Right Fairway Fairway 550
4 Left Rough Primary Rough 436
5 Right Rough Primary Rough 401
6 Right Intermediate Intermediate Rough 145
7 Left Intermediate Intermediate Rough 112
8 Unknown Fairway Bunker 92
9 Right Green Side Bunker Green Side Bunker 77
10 Right Front Green Side Bunker Green Side Bunker 66
11 Front Left Green Side Bunker Green Side Bunker 60
12 Left Green Side Bunker Green Side Bunker 47
13 Front Center Green Side Bunker Green Side Bunker 34
14 Unknown Tree Outline 24
15 Unknown Water 23
16 Unknown Unknown 20
17 Unknown Water Drop 13
18 Left Rear Green Side Bunker Green Side Bunker 3
19 Left Fairway Fringe 2
20 Right Fairway Fringe 2
21 Right Rear Green Side Bunker Green Side Bunker 1
22 Unknown Dirt Outline 1
The columns seem to match up pretty well, but I’d like to consolidate them into more general to_location
column.
<- shot_level_clean_coords %>%
shot_level_clean_coords_cut mutate(to_location = case_when(to_location_scorer %in% c('Fairway','Fringe') ~ 'Fairway',
%in% c('Fairway Bunker','Green Side Bunker') ~ 'Bunker',
to_location_scorer %in% c('Intermediate Rough','Primary Rough') ~ 'Rough',
to_location_scorer == 'Green' ~ 'Green',
to_location_scorer == 'Water' ~ 'Water',
to_location_scorer == 0 ~ 'Hole',
distance_to_hole_after_the_shot ~ 'Other'))
T
%>%
shot_level_clean_coords_cut filter(!is.na(x)) %>%
group_by(to_location_laser,
to_location_scorer,%>%
to_location) summarize(rows = n(),
.groups = "keep") %>%
as.data.frame() %>%
arrange(desc(rows))
to_location_laser to_location_scorer to_location rows
1 Unknown Green Green 3548
2 Left Fairway Fairway Fairway 578
3 Right Fairway Fairway Fairway 550
4 Left Rough Primary Rough Rough 436
5 Right Rough Primary Rough Rough 401
6 Right Intermediate Intermediate Rough Rough 145
7 Left Intermediate Intermediate Rough Rough 112
8 Unknown Fairway Bunker Bunker 92
9 Right Green Side Bunker Green Side Bunker Bunker 77
10 Right Front Green Side Bunker Green Side Bunker Bunker 66
11 Front Left Green Side Bunker Green Side Bunker Bunker 60
12 Left Green Side Bunker Green Side Bunker Bunker 47
13 Front Center Green Side Bunker Green Side Bunker Bunker 34
14 Unknown Tree Outline Other 24
15 Unknown Water Water 23
16 Unknown Unknown Other 20
17 Unknown Water Drop Other 13
18 Left Rear Green Side Bunker Green Side Bunker Bunker 3
19 Left Fairway Fringe Fairway 2
20 Right Fairway Fringe Fairway 2
21 Right Rear Green Side Bunker Green Side Bunker Bunker 1
22 Unknown Dirt Outline Other 1
Next - I want to plot to_location
to see how it looks.
= c('Bunker'='tan',
cut_colors 'Fairway'='#4CBB17',
'Tee Box'='#4CBB17',
'Green'='#90EE90',
'Rough'='#355E3B',
'Other'='darkgray',
'Hole'='black',
'Water'='lightblue')
%>%
shot_level_clean_coords_cut filter(!is.na(x)) %>%
mutate(hole = factor(hole, ordered = T)) %>%
ggplot(mapping = aes(x = x,
y = y)) +
geom_point(mapping = aes(color = to_location),
size = 1) +
coord_equal() +
scale_color_manual(values = cut_colors) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank())
%>%
shot_level_clean_coords_cut filter(!is.na(x)) %>%
filter(hole == 1) %>%
mutate(hole = factor(hole, ordered = T)) %>%
ggplot(mapping = aes(x = x,
y = y)) +
geom_point(mapping = aes(color = to_location),
size = 2) +
coord_equal() +
scale_color_manual(values = cut_colors) +
labs(title = "1st Hole") +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank())
Lastly, I’d like to add columns that indicate how many strokes are remaining for the player before/after each shot, since it does not appear that they are in the current data set.
<- shot_level_clean_coords_cut %>%
strokes_remaining group_by(player,
hole,%>%
round) arrange(player,
round,
hole,%>%
shot) mutate(strokes_rolling = cumsum(num_of_strokes)) %>%
mutate(strokes_remaining_before_shot = hole_score - strokes_rolling + 1,
strokes_remaining_after_shot = hole_score - strokes_rolling) %>%
ungroup() %>%
mutate(yards_out_after_shot = distance_to_hole_after_the_shot/36,
yards_out_before_shot = distance_to_pin/36) %>%
as.data.frame()
%>%
strokes_remaining filter(hole == 1,
== 1,
round == 24924) %>%
player select(shot,
strokes_rem_before = strokes_remaining_before_shot,
yards_out_before = yards_out_before_shot,
strokes_rem_after = strokes_remaining_after_shot,
yards_out_after = yards_out_after_shot)
shot strokes_rem_before yards_out_before strokes_rem_after yards_out_after
1 1 4 429.000000 3 116.444444
2 2 3 116.444444 2 10.722222
3 3 2 10.722222 1 1.194444
4 4 1 1.194444 0 0.000000
%>%
strokes_remaining filter(!is.na(x)) %>%
filter(hole == 1) %>%
mutate(strokes_remaining_after_shot = factor(strokes_remaining_after_shot,
ordered = T)) %>%
ggplot(mapping = aes(x = x,
y = y,
color = strokes_remaining_after_shot)) +
geom_point(size = 3) +
coord_equal() +
labs(title = "1st Hole") +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank())
saveRDS(object = strokes_remaining, file = "shot_level.rds")
saveRDS(object = cut_colors, file = "cut_colors.rds")