This workshop is brought to you by R-Ladies Paris. Today we will explore how to use basic concepts of ggplot2 to create our own plots!
This workshop is divided into 5 sections:
Not sure whether you already have the libraries? No problem. Run the code below to get started: packages you already have installed are skipped, and any missing ones are installed.
Hat tip to Antoine Soetewey for this cool trick.
#list of packages used in this workshop
packages<- c("tidyverse","ggimage","sysfonts","showtext")
# if package not already installed, install package
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
Tidyverse. ggplot2 is not explicitly included among our library import lists below. Instead, we’re loading in tidyverse. Tidyverse is a bundle of packages, which includes ggplot2 and dplyr.
Extensions. There are also several extension packages that complement ggplot2. Today we’ll explore how to work with a ggplot extension using ggimage.
Other Packages. Last but not least, we will use sysfonts and showtext to load different font families and adjust the typography.
#includes dplyr + ggplot
library(tidyverse)
##ggplot extension packages
library(ggimage)
#to add google font libraries
library(sysfonts)
library(showtext)
To customize fonts in our plots, we’ll import a couple of fonts using sysfonts and showtext. We can use sysfonts to add fonts from Google’s Font Library without downloading them locally with font_add_google().
💡 Tip: For chart typefaces, you can’t go wrong with a clean sans-serif font. There’s a great article about this topic by Lisa Charlotte Muth: Which fonts to use for your charts and tables.
#load fonts with sysfonts package
sysfonts::font_add_google("Roboto","Roboto")
sysfonts::font_add_google("Creepster","Creepster")
#to automatically render fonts, use showtext_auto()
showtext::showtext_auto()
The data we’re using today is from The Movie Database and contains information on over 35K Horror Movies from the 1950s until today 😱
Shout out to Georgios Karamanis for inspiring our topic: I decided to wrangle this data after finding a similar data set Georgios submitted for #TidyTuesday back in 2019.
Variable | Type | Definition | Example |
---|---|---|---|
id | int | unique movie id | 4488 |
original_title | char | original movie title | Friday the 13th |
title | char | movie title | Friday the 13th |
original_language | char | movie language | en |
overview | char | movie overview/desc | Camp counselors are stalked… |
tagline | char | tagline | They were warned… |
release_date | date | release date | 1980-05-09 |
poster_path | char | image url | /HzrPn1gEHWixfMOvOehOTlHROo.jpg |
popularity | num | popularity | 58.957 |
vote_count | int | total votes | 2289 |
vote_average | num | average rating | 6.4 |
budget | int | movie budget | 550000 |
revenue | int | movie revenue | 59754601 |
runtime | int | movie runtime (min) | 95 |
status | char | movie status | Released |
genre_names | char | list of genre tags | Horror, Thriller |
collection_id | num | collection id (nullable) | 9735 |
collection_name | char | collection name (nullable) | Friday the 13th Collection |
Let’s load in our data and take a look at what we’re working with.
Our data is stored in a GitHub repository. We can load it with readr::read_csv() by passing in the URL of the CSV file.
#download data from GitHub repository
raw <- readr::read_csv("https://raw.githubusercontent.com/tashapiro/horror-movies/main/data/horror_movies.csv")
#preview data
head(raw,5)
We’ll use dplyr to make a few tweaks, namely to modify existing columns and add a couple of new ones with mutate(). To learn more about dplyr, check out this slick cheatsheet by RStudio.
🧹 To-Dos:
#url for poster images, need this to concatenate with poster path
base_url<-'https://www.themoviedb.org/t/p/w1280'
df <- raw|>
mutate(budget = budget/1000000,
revenue = revenue/1000000,
#convert release date from char to date
release_date = as.Date(release_date),
#truncate date on year and month
release_year = as.numeric(format(release_date, '%Y')),
release_month = as.numeric(format(release_date, '%m')),
#concatenate base url with poster path to get poster url
poster_url = paste0(base_url,poster_path))
#preview data
head(df,5)
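To see what each step of the mutate() call produces, here is a minimal sketch on a one-row toy data frame. The values are taken from the example column of the data dictionary above; the names toy and toy_clean are made up for illustration, and release_date is assumed to arrive as text.

```r
library(dplyr)

# toy stand-in for `raw`, built from the example row in the data dictionary
toy <- tibble(
  title        = "Friday the 13th",
  release_date = "1980-05-09",
  budget       = 550000,
  revenue      = 59754601,
  poster_path  = "/HzrPn1gEHWixfMOvOehOTlHROo.jpg"
)

base_url <- "https://www.themoviedb.org/t/p/w1280"

toy_clean <- toy |>
  mutate(
    # express dollar amounts in millions
    budget        = budget / 1000000,
    revenue       = revenue / 1000000,
    # text -> Date, then pull out year and month as numbers
    release_date  = as.Date(release_date),
    release_year  = as.numeric(format(release_date, "%Y")),
    release_month = as.numeric(format(release_date, "%m")),
    # base url + poster path = full image url
    poster_url    = paste0(base_url, poster_path)
  )

toy_clean$budget         # 0.55
toy_clean$release_year   # 1980
toy_clean$release_month  # 5
```

Because mutate() evaluates its arguments in order, release_year and release_month can use the already converted release_date.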
Building Blocks
ggplot - how we initialize the ggplot object, optional to include data and mapping here (acts as the parent for future geoms).
Geom Layers - the components of our plot. We can specify the chart type with different geom functions, e.g. for a scatter plot we’ll use geom_point() and for a bar chart we can use geom_col() or geom_bar().
Aes - aesthetic mapping of data variables to visual properties (e.g. x, y, color, size) in our plot. Set in ggplot() or individual geom layers.
Scales - fine tuning the details of our plots, e.g. tweaking axis limits, axis labels, color values, etc. Scales are specific to data types (e.g. discrete, continuous).
Labs - plot labeling, e.g. plot title and axis labels
Themes - how we can adjust the look, feel, and overall style of our plot. There are out-of-the-box theme options, e.g. theme_bw(), or we can create our own custom theme using theme() and modify components within the function.
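As a rough sketch of how the building blocks above fit together, here is one small plot that uses each of them once. It uses mtcars, a data set that ships with R, purely as a stand-in for the workshop data; the labels and limits are illustrative choices.

```r
library(ggplot2)

# mtcars is only a placeholder data set for this sketch
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # ggplot + aes
  geom_point(mapping = aes(color = factor(cyl))) +             # geom layer
  scale_y_continuous(limits = c(0, 40)) +                      # scale
  labs(title = "Fuel Efficiency vs. Weight",                   # labs
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  theme_bw()                                                   # theme

p
```

Each block is added with +, so any of them can be swapped out or removed without touching the others.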
Slasher films are a popular sub-genre of Horror. For our scatter plot, let’s visualize how some of the most popular slasher franchise films compare to one another.
We’ll need to slice and dice our data a bit. Using dplyr::filter(), pull in only records for films associated with Halloween, Friday the 13th, Scream, or A Nightmare on Elm Street.
#list of slasher collection titles
slashers<- c("Halloween Collection",
"Friday the 13th Collection",
"Scream Collection",
"A Nightmare on Elm Street Collection")
#create slashers dataframe object
df_slashers<-df|>
#filter to only show movies in slasher collections with a non-zero budget and revenue
filter(collection_name %in% slashers & budget>0 & revenue>0)|>
#remove "Collection" from collection_name
mutate(collection_name = gsub(" Collection","",collection_name))|>
#subset columns with select
select(title, collection_name, budget, revenue, popularity)
#preview data
head(df_slashers,5)
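To make the filter logic easier to follow, here is a small sketch on made-up rows (the titles and amounts are invented, not taken from the data set). It assumes that separating conditions with commas inside filter() is acceptable in place of &, since dplyr combines them the same way.

```r
library(dplyr)

# invented rows for illustration only
toy <- tibble(
  title           = c("Movie A", "Movie B", "Movie C", "Movie D"),
  collection_name = c("Scream Collection", "Halloween Collection", NA, "Scream Collection"),
  budget          = c(15, 0, 10, 40),
  revenue         = c(173, 5, 20, 0)
)

keep <- c("Scream Collection", "Halloween Collection")

toy_slashers <- toy |>
  # comma-separated conditions are combined with AND
  filter(collection_name %in% keep, budget > 0, revenue > 0) |>
  # strip the " Collection" suffix from the name
  mutate(collection_name = gsub(" Collection", "", collection_name))

toy_slashers
```

Only Movie A survives: Movie B has a zero budget, Movie C has no collection, and Movie D has zero revenue.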
The Essentials. We always start with ggplot(). There are two main arguments, data and mapping. For data, let’s use our df_slashers data frame. For mapping, we need aes() with arguments that specify x (budget) and y (revenue).
Plot Titles. We can use labs() to adjust all types of plot labels. In this example, let’s change our title, subtitle, axis titles, and caption.
ggplot(data=df_slashers, mapping=aes(y=revenue, x=budget))+
geom_point()+
labs(title="Slasher Movies: Revenue vs. Budget",
subtitle = "Includes movies from Friday the 13th, A Nightmare on Elm Street, Scream, \nand Halloween collections.",
caption = "Data from The Movie Database",
y="Revenue (millions)",
x="Budget (millions)")
More Aes Arguments. Depending on the geom, aes() can contain more than just the x and y value pairs. In our scatter plot, we can also adjust details like size and color. Here, we’ll map size to popularity and color to collection_name.
Static Aesthetics. Adjust the opacity of points with alpha.
Intro to Scales. To override the default colors from our color mapping, we can add a color scale. In this example, we’ll use scale_color_manual() and pass in a vector of hex color values. Here are some I’ve pre-selected: “#DB2B39”, “#29335C”, “#F3A712”, “#3ED58E”.
Intro to Themes. Try out a new theme. Here’s a full list of default themes (e.g. theme_minimal(), theme_light(), theme_classic()).
💡 Tip: There are a bunch of color palette libraries you can explore. If you want something more custom, palette generator sites like Coolors are also nice!
#create ggplot and store it in a variable
plot_scatter<-ggplot(df_slashers, mapping=aes(y=revenue, x=budget))+
geom_point(mapping=aes(size=popularity, color=collection_name), alpha=0.6)+
scale_color_manual(values = c("#DB2B39","#29335C","#F3A712",'#3ED58E'))+
guides(
color = guide_legend(override.aes = list(size=4)),
size = guide_legend(override.aes=list(shape=21,color="black",fill="white"))
)+
labs(title="Slasher Movies: Revenue vs. Budget",
subtitle = "Comparison of movies in popular slasher franchise series.",
y="Revenue (millions)",
x="Budget (millions)",
caption = "Data from The Movie Database",
size = "Popularity",
color = "Franchise")+
theme_minimal()
plot_scatter
#uncomment to save plot
#ggsave(filename="plot_scatter.png", plot=plot_scatter, width =7 , height=5, units="in", bg="white")
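In the scatter plot above, the unnamed colors are matched to the franchises in level order (alphabetical for text columns). One optional variation, sketched here on invented toy data, is to pass scale_color_manual() a named vector so each color is pinned to a specific value. The names toy and pal_franchise are hypothetical.

```r
library(ggplot2)

# invented values, not the workshop data
toy <- data.frame(
  budget    = c(1, 2, 3, 4),
  revenue   = c(10, 30, 20, 60),
  franchise = c("Halloween", "Friday the 13th", "Scream", "A Nightmare on Elm Street")
)

# names tie each hex value to one franchise, regardless of level order
pal_franchise <- c(
  "A Nightmare on Elm Street" = "#DB2B39",
  "Friday the 13th"           = "#29335C",
  "Halloween"                 = "#F3A712",
  "Scream"                    = "#3ED58E"
)

p <- ggplot(toy, mapping = aes(x = budget, y = revenue, color = franchise)) +
  geom_point(size = 4, alpha = 0.6) +
  scale_color_manual(values = pal_franchise)

p
```

With a named vector, adding or removing a franchise later does not shift the colors of the others.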
Plot how many movies were released by month and year.
First, we’ll need to aggregate our data with dplyr group_by() on release_month and release_year. Since this data set goes back to the 50s, let’s filter() to only show release_year between 2016 and 2021.
df_monthly<-df|>
#aggregate data by year and month
group_by(release_year, release_month)|>
#summarise total movies, n() equivalent of count
summarise(count=n())|>
#filter to only get years between 2016 and 2021
filter(release_year>=2016 & release_year<=2021)
head(df_monthly,5)
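The comment in the chunk above notes that n() is the equivalent of a count. As a small sketch on invented rows, the group_by() plus summarise() pattern and dplyr’s count() shortcut give the same table. The .groups = "drop" argument is added here only so the result is ungrouped and directly comparable.

```r
library(dplyr)

# invented release dates for illustration
toy <- tibble(
  release_year  = c(2020, 2020, 2020, 2021, 2021),
  release_month = c(10, 10, 11, 10, 10)
)

# long form: group, then count rows per group
by_summarise <- toy |>
  group_by(release_year, release_month) |>
  summarise(count = n(), .groups = "drop")

# shortcut: count() groups and tallies in one step
by_count <- toy |>
  count(release_year, release_month, name = "count")

by_count$count  # 2 1 2
```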
How many movies were released month-over-month in 2021?
#subset rows to pull only months from release year 2021
df_monthly_2021<- df_monthly|>filter(release_year==2021)
#create line plot with plot titles
ggplot(data = df_monthly_2021,
mapping=aes(x=release_month, y=count))+
geom_line()+
labs(x="Month",
y="Number of Movies",
title = "Horror Movies by Release Month (2021)",
caption = "Data from The Movie Database")
More Scales. Notice that our first plot has an issue with the x axis: because release_month is numeric, it is plotted on a continuous scale, so the axis breaks are not labeled by month. Our y axis is also starting ~90 instead of 0. We can use scales to modify our x and y axes (e.g. limits, breaks, and labels). We’ll need scale_y_continuous() to change the y axis and scale_x_continuous() to change our x axis.
Annotations. Sometimes it’s helpful to call out specific observations in our plots. We can add annotations with an annotate() layer. For long annotations, use \n in your text to introduce a line break, e.g. “Wrap \ntext.”
Custom Theme. We tested out one of ggplot’s default themes before, now let’s try creating our own custom theme.
💡 Tip: The annotate() layer is useful if you want to include text that isn’t in your data frame (e.g. a single observation about a trend). If the label exists in the data, check out geom_text() or geom_label().
Palette Variables. Storing color palettes as variables can be useful when developing custom themes. Let’s set up our palette variables first.
#set up palette for future use
pal_text <- "white"
pal_subtext <- "#DFDFDF"
pal_grid <- "grey30"
pal_bg<-'#191919'
#expand on line chart
plot_line<-ggplot(data = df_monthly_2021,
mapping=aes(x=release_month, y=count))+
geom_line(color="green")+
annotate(geom="label",
label="Horror movies popular right \naround Halloween",
x=10, y=410, color=pal_text, fill=pal_bg, size=3)+
scale_x_continuous(breaks=1:12, labels=month.abb[1:12])+
scale_y_continuous(limits=c(0,450))+
labs(x="",
y="Number of Movies",
title = "Total Horror Movies by Release Month (2021)",
caption = "Data from The Movie Database")+
theme(
#adjust plot background + gridlines
plot.background = element_rect(fill=pal_bg, color=pal_bg),
panel.background = element_rect(fill=pal_bg),
panel.grid.minor = element_blank(),
panel.grid = element_line(color=pal_grid, size=0.2),
#modify text color
text = element_text(color=pal_text),
axis.text = element_text(color=pal_text),
axis.title.y = element_text(size=10, margin=margin(r=10)),
plot.title = element_text(family="Creepster", size=25, hjust=0.5),
plot.subtitle = element_text(hjust=0.5),
#add padding around the plot
plot.margin = margin(l=20, r=20, b=10, t=20))
plot_line
#uncomment to save plot
#ggsave(filename="plot_line.png", plot=plot_line, width=7, height=5, units="in")
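Since the same dark theme settings come up in more than one plot, one option is to store the theme() call in a variable and add it wherever it is needed. This is only a sketch: theme_horror is a hypothetical name, mtcars stands in for the workshop data, and just a few of the theme elements from the chunk above are repeated here.

```r
library(ggplot2)

# same palette variables as above
pal_text <- "white"
pal_grid <- "grey30"
pal_bg   <- "#191919"

# theme_horror is a made-up name for a reusable theme object
theme_horror <- theme(
  plot.background  = element_rect(fill = pal_bg, color = pal_bg),
  panel.background = element_rect(fill = pal_bg),
  panel.grid.minor = element_blank(),
  panel.grid.major = element_line(color = pal_grid),
  text             = element_text(color = pal_text),
  axis.text        = element_text(color = pal_text)
)

# add the stored theme like any other layer
p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(color = "green") +
  theme_horror

p
```

Plot-specific tweaks can still be layered on afterwards with another theme() call.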
What if we wanted to see seasonality and compare across different years?
To create multiple plots, we can use ggplot’s facet_wrap() to split a plot into panels by a data variable. Using our df_monthly data, use facet_wrap() with release_year to show month-over-month trends per year.
💡 Tip: Applying annotate() on a facet plot will apply the annotation to all plots. To annotate a single plot, we recommend creating a new data frame with the facet variable and using geom_text() or geom_label().
plot_facet<-ggplot(data = df_monthly, mapping=aes(x=release_month, y=count, group=1))+
geom_line(color="green")+
facet_wrap(~release_year)+
scale_x_continuous(breaks=1:12, labels=month.abb[1:12])+
labs(x="",
y="New Releases",
title = "Most Horror Movies released around Halloween",
caption = "Data from The Movie Database",
subtitle = "Total movie releases by month & year. Big spike in October, Halloween falls on October 31st.")+
theme(
#modify plot background + gridlines
plot.background = element_rect(fill=pal_bg, color=pal_bg),
panel.background = element_rect(fill=pal_bg),
strip.background = element_rect(fill="#521EA4"),
panel.grid.minor = element_blank(),
panel.grid = element_line(color=pal_grid, size=0.2),
#modify text
text = element_text(color=pal_text),
axis.text = element_text(color=pal_text),
axis.text.x=element_text(angle=90, size=9),
plot.title = element_text(family="Creepster", size=20, hjust=0.5),
plot.subtitle = element_text(hjust=0.5, size=10),
strip.text=element_text(color=pal_text),
#add padding around plot
plot.margin = margin(l=20, r=20, b=10, t=20))
plot_facet
#uncomment to save plot
#ggsave(filename="plot_facet.png", plot=plot_facet, width =7 , height=5, units="in")
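The tip about annotating a single facet can be sketched as follows. The counts are invented, and df_label is a hypothetical one-row data frame; the key assumption is that it carries the facet variable (release_year), which decides the panel the label lands in.

```r
library(ggplot2)

# invented monthly counts for two years
toy <- data.frame(
  release_year  = rep(c(2020, 2021), each = 3),
  release_month = rep(9:11, times = 2),
  count         = c(100, 300, 120, 110, 400, 150)
)

# one-row label data; release_year = 2021 targets the 2021 panel only
df_label <- data.frame(
  release_year  = 2021,
  release_month = 10,
  count         = 440,
  label         = "October peak"
)

p <- ggplot(toy, mapping = aes(x = release_month, y = count)) +
  geom_line(color = "green") +
  geom_label(data = df_label, mapping = aes(label = label), size = 3) +
  scale_y_continuous(limits = c(0, 450)) +
  facet_wrap(~release_year)

p
```

The 2020 panel stays unlabeled because df_label has no row for that year.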
Visualize the top 10 horror movies by revenue.
To rank our data set in descending order by revenue, we can use dplyr arrange(). We’ll simplify and pick the columns we need for plotting with select(). Finally, use head() to grab the first 10 rows of data.
#get top movies based on revenue
df_top_movies<-df|>
#arrange data in descending order revenue
arrange(-revenue)|>
#select specific columns
select(title, release_year, revenue, budget, poster_url)|>
#select first 10 rows
head(10)
head(df_top_movies,10)
Use geom_col() and map y to title and x to revenue.
💡 Tip: When you encounter discrete variables with long names like movie title, it might be easier to read as a horizontal bar chart instead of a traditional vertical bar chart.
#plot horizontal bar chart, add plot labels
ggplot(data=df_top_movies, mapping=aes(y=title, x=revenue))+
geom_col()+
labs(x="Revenue (millions)", y="",
title="Horror Movies That Killed The Box Office",
subtitle="Top horror movies based on total revenue.",
caption = "Graphic @tanya_shapiro | Data tmdb API"
)
Reordering. For discrete variables like movie title, ggplot will default to alphabetical order. We can use the base R function reorder() within our y mapping to order the titles by revenue.
Layering Geoms. The beauty of ggplot is all in the layering - you can keep building by adding more than one geom! To test this concept, we’ll add the revenue numbers outside the bars with geom_text().
🗒️ Note: Another trick for reordering discrete variables is to convert them to factors before passing them into ggplot.
pal_bar<-'#A70000'
ggplot(data=df_top_movies,
mapping=aes(y=reorder(title,revenue), x=revenue))+
geom_col(fill = pal_bar, width=0.6)+
geom_text(mapping=aes(label=round(revenue,0), x=revenue+20),
color="white", size=4)+
scale_x_continuous(limits=c(0,750), expand=c(0,0))+
labs(x="Revenue (millions)", y="",
title="Horror Movies That Killed The Box Office",
subtitle="Top horror movies based on total revenue. Data from The Movie Database as of October 2022.",
caption = "Graphic @tanya_shapiro | Data tmdb API"
)+
theme(panel.background = element_rect(fill=pal_bg),
axis.text = element_text(color="white", size=10),
axis.ticks = element_blank(),
plot.background = element_rect(fill=pal_bg, color=pal_bg),
plot.subtitle = element_text(color="#DFDFDF", size=12),
text = element_text(color="white"),
panel.grid = element_blank(),
plot.margin = margin(t=20, b=20, l=20, r=20),
plot.title = element_text(family="Creepster", size=30),
panel.grid.major.x= element_line(color="#2C2C2C", size=0.2))
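The note about factors can be sketched like this, as an alternative to calling reorder() inside aes(). The titles and revenue figures are invented for illustration.

```r
library(ggplot2)

# invented rows for illustration
toy <- data.frame(
  title   = c("Movie B", "Movie A", "Movie C"),
  revenue = c(300, 100, 200)
)

# set the factor levels in order of increasing revenue before plotting
toy$title <- factor(toy$title, levels = toy$title[order(toy$revenue)])

levels(toy$title)  # "Movie A" "Movie C" "Movie B"

# ggplot now follows the level order instead of alphabetical order
p <- ggplot(toy, mapping = aes(y = title, x = revenue)) +
  geom_col()

p
```

On a horizontal bar chart the first level is drawn at the bottom, so the highest revenue ends up on top.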
Keep Layering. Add in budget for comparison with revenue. We can create a bullet bar graph by overlaying a thinner geom_col() over our existing setup. Use the width argument to adjust its thickness.
Intro to More Geoms. The world of geoms goes beyond the geoms listed in ggplot2. There are tons of ggplot extensions to explore that work with ggplot2. In this example, we will plot the movie poster images with ggimage’s geom_image().
Annotations + Arrows. Sometimes the placement of annotations is a little tricky, and we might want to draw a line and arrow connecting our text to the observation. To draw straight lines, we can use geom_segment(); to draw a curved line, we can use geom_curve(). Use the arrow argument and arrow() to add an arrowhead.
🗒️ Note: For curved lines, negative curvature values create left-hand curves and positive values create right-hand curves.
plot_bar<-ggplot(data=df_top_movies, mapping=aes(y=reorder(title,revenue), x=revenue))+
#horizontal bar chart with revenue and budget
geom_col(fill=pal_bar, width=0.4)+
geom_col(mapping=aes(x=budget), fill='#600000', width=0.2)+
#text with revenue amounts
geom_text(mapping=aes(x=revenue+18, label=round(revenue,0)),
color="white", size=3.5)+
#text with movie title + year
geom_text(mapping=aes(x=0, y=title, label=paste0(title, " (",release_year,")")),
color="white", vjust=-2.2, hjust=0)+
#poster images - map image to url path
geom_image(mapping=aes(x=-50, image=poster_url))+
#annotation for budget with arrow
annotate(geom="text", color="white", x=250, y=9.5, label="Movie Budget", size=3)+
geom_curve(mapping=aes(x=250, xend=205, y=9.35, yend=9),
color="white", curvature=-0.2, size=0.2, alpha=0.6,
arrow = arrow(length = unit(0.07, "inch")))+
#format x axis to adjust limits and breaks
scale_x_continuous(expand=c(0,0),
limits=c(-70,750),
breaks = c(0, 100,200, 300, 400, 500, 600, 700))+
#titles and labels
labs(y="", x="USD (millions)",
title="Horror Movies That Killed The Box Office",
subtitle="Top horror movies based on total revenue. Movie budget amount added for reference.\nData from The Movie Database as of October 2022.\n",
caption = "Graphic @tanya_shapiro | Data tmdb API")+
#custom theme
theme(
#change background color
plot.background =element_rect(fill=pal_bg, color=pal_bg),
panel.background =element_rect(fill=pal_bg, color=pal_bg),
#modify text font
text = element_text(color=pal_text, family="Roboto"),
axis.text = element_text(color=pal_text),
axis.title.x = element_text(margin=margin(t=10)),
axis.text.y = element_blank(),
plot.subtitle = element_text(color=pal_subtext, size=12),
plot.caption = element_text(color=pal_subtext, size=10),
plot.title=element_text(family="Creepster", size=30, color=pal_text),
#adjust lines for ticks and grid
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.grid.major.x= element_line(color=pal_grid, size=0.2),
#add padding around plot
plot.margin = margin(t=20, b=20, l=20, r=20)
)
plot_bar
#uncomment to save file
#ggsave(filename="plot_bar.png", plot=plot_bar, height=8, width=8, units="in")
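As a standalone sketch of the arrow technique, here is one straight and one curved arrow drawn through annotate() with geom = "segment" and geom = "curve". This is a variation rather than the approach used above: a geom_curve() layer with fixed values inside aes() is drawn once per row of the data, while annotate() draws the shape a single time. mtcars is used as a placeholder data set and the coordinates were picked by eye.

```r
library(ggplot2)

p <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  # curved arrow from a text label down to the heaviest car
  annotate(geom = "text", x = 4.9, y = 30, label = "Heaviest car", size = 3) +
  annotate(geom = "curve", x = 4.9, xend = 5.4, y = 28.5, yend = 11.5,
           curvature = -0.2, alpha = 0.6,
           arrow = arrow(length = unit(0.07, "inch"))) +
  # straight arrow pointing at the car with the highest mpg
  annotate(geom = "text", x = 3, y = 33.9, label = "Highest mpg", size = 3) +
  annotate(geom = "segment", x = 2.6, xend = 1.95, y = 33.9, yend = 33.9,
           arrow = arrow(length = unit(0.07, "inch")))

p
```

Flipping the sign of curvature bends the curve the other way, which is often the quickest fix when an arrow crosses other marks.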
This concludes our workshop session today, but it’s just the beginning of your ggplotting adventure.
Next step is just for you. Create your own ggplot using the Horror Movies data set and share your work with us on Twitter.
If you have any questions regarding our ggplot workshop, please feel free to reach out to me @tanya_shapiro. Thank you!
#create your own ggplot