Coffee Curiosity

Exploring Discernment and Appreciation of Different Roasts

Evaluation of data recovered from coffee taste test

Author

Affiliation

Cristina Lafuente

School of Information, University of Arizona

if (!require("pacman")) 
  install.packages("pacman")

# use this line for installing/loading
pacman::p_load(cowplot,
               dplyr,
               here,
               knitr,
               magick,
               scales,
               shiny,
               tidyverse,
               viridis)

devtools::install_github("tidyverse/dsbox")

Dataset

coffeeData <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-05-14/coffee_survey.csv')

The coffee data comes from a coffee taste test held on YouTube in October 2023. The data was self-reported by participants. It was aggregated by a data blogger named Robert McKeon Aloe, who analyzed the data the following month.

The dimensions of the data are:

dim(coffeeData)

[1] 4042   57

The variables collected are:

colnames(coffeeData)

 [1] "submission_id"                "age"                         
 [3] "cups"                         "where_drink"                 
 [5] "brew"                         "brew_other"                  
 [7] "purchase"                     "purchase_other"              
 [9] "favorite"                     "favorite_specify"            
[11] "additions"                    "additions_other"             
[13] "dairy"                        "sweetener"                   
[15] "style"                        "strength"                    
[17] "roast_level"                  "caffeine"                    
[19] "expertise"                    "coffee_a_bitterness"         
[21] "coffee_a_acidity"             "coffee_a_personal_preference"
[23] "coffee_a_notes"               "coffee_b_bitterness"         
[25] "coffee_b_acidity"             "coffee_b_personal_preference"
[27] "coffee_b_notes"               "coffee_c_bitterness"         
[29] "coffee_c_acidity"             "coffee_c_personal_preference"
[31] "coffee_c_notes"               "coffee_d_bitterness"         
[33] "coffee_d_acidity"             "coffee_d_personal_preference"
[35] "coffee_d_notes"               "prefer_abc"                  
[37] "prefer_ad"                    "prefer_overall"              
[39] "wfh"                          "total_spend"                 
[41] "why_drink"                    "why_drink_other"             
[43] "taste"                        "know_source"                 
[45] "most_paid"                    "most_willing"                
[47] "value_cafe"                   "spent_equipment"             
[49] "value_equipment"              "gender"                      
[51] "gender_specify"               "education_level"             
[53] "ethnicity_race"               "ethnicity_race_specify"      
[55] "employment_status"            "number_children"             
[57] "political_affiliation"

I selected this data set because I enjoy coffee very much! Too much? Probably not. I have 6 different coffee brewing apparatuses in my kitchen.

mokaPot <- here("images/mokaPot.png")
espressoMaker <- here("images/espressoMachine.png")
frenchPress <- here("images/frenchPress.png")
pod <- here("images/kCup.png")
percolator <- here("images/percolator.png")
pourOver <- here("images/pourOver.png")

ggdraw() +
  draw_image(mokaPot, width = 0.33, height = 0.5, y = 0, interpolate = TRUE) +
  draw_image(espressoMaker, width = 0.33, height = 0.5, x = 0.33, y = 0, interpolate = TRUE) +
  draw_image(frenchPress, width = 0.33, height = 0.5, x = 0.67, y = 0, interpolate = TRUE) +
  draw_image(pod, width = 0.33, height = 0.5, y = 0.5, interpolate = TRUE) +
  draw_image(percolator, width = 0.33, height = 0.5, x = 0.33, y = 0.5, interpolate = TRUE) +
  draw_image(pourOver, width = 0.33, height = 0.5, x = 0.67, y = 0.5, interpolate = TRUE)

Images from amazon.com reproduced here without permission

Looking at data on coffee for six weeks seemed like a natural choice.

Questions

Question 1:

Do people tend to quantify acidity and bitterness in coffee correctly, and does their ability to judge depend on their coffee preference?

Question 2:

Does political affiliation play any part in coffee preference?

Analysis plan

Additional data will come from "Acids in coffee: A review of sensory measurements and meta-analysis of chemical composition" by Yeager et al.

It is difficult to know with certainty at this stage which plotting method will best represent this data. Whenever a correlation is examined, scatterplots are an obvious choice, with the possibility of facet wrapping by type.

There are two relevant CSVs. One contains data on total levels of organic acids, which are known to produce the acidic qualities of coffee. The other contains data on total levels of chlorogenic acids (CGAs) in coffee; the total amount depends on roast level and is inversely correlated with the “bitterness” flavor profile.

This data will be joined on the type of roast (a variable in both).
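A minimal sketch of how such a join might look is given below. It uses small toy tables rather than the real CSVs: only the `Roast` column name comes from the datasets, while `oa_toy`, `cga_toy`, `total_oa`, `total_cga` and every value are invented for illustration. Each table is first collapsed to one row per roast so that the join does not multiply rows.

```r
library(dplyr)

# Toy stand-ins for the two acid tables (all values are invented).
oa_toy <- data.frame(
  Roast    = c("Light", "Light", "Medium", "Dark"),
  total_oa = c(10, 12, 8, 5)
)
cga_toy <- data.frame(
  Roast     = c("Light", "Medium", "Medium", "Dark"),
  total_cga = c(30, 20, 22, 9)
)

# Collapse each table to one row per roast before joining.
oa_by_roast <- oa_toy %>%
  group_by(Roast) %>%
  summarise(mean_oa = mean(total_oa, na.rm = TRUE), .groups = "drop")

cga_by_roast <- cga_toy %>%
  group_by(Roast) %>%
  summarise(mean_cga = mean(total_cga, na.rm = TRUE), .groups = "drop")

# Keep only roast levels that appear in both tables.
acids_by_roast <- inner_join(oa_by_roast, cga_by_roast, by = "Roast")
```

Roast labels would need to match exactly (same spelling and case) across tables for the rows to line up, so some recoding may be needed before a join like this works on the real data.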

This study provides definitive data on acidity levels in coffee by roast, as well as a source for what causes both bitterness and acidity in taste, and it will be helpful in determining the subjects' taste accuracy.

They have quite a few very nice plots of their own. I think the one that best summarizes the data I’ll be using is:

plotAcidByRoast <- here("images/acidLevelByRoast.png")

ggdraw() +
  draw_image(plotAcidByRoast)

Image from Yeager et al, Acids in Coffee

This figure will not be used in my report and is included here for informational purposes only. It shows the concentrations of total organic acids (top) for Arabica (left) and Robusta (right) at different roast levels, as well as the concentrations of CGAs for the same (bottom).

As discussed in their paper, CGAs form chlorogenic acid lactones during roasting, which impart bitterness to the coffee. After roasting, the CGAs are no longer present and the lactones are, giving the bitter profile. Similar bitterness arises from the breakdown of other organic acid compounds.

Reformatting Data

Due to some type of formatting error, roughly 1,500 NA columns were appended to one dataset and roughly 250 to the other. Trimming those off allows a more realistic view of the actual column names and size of the data.
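The trimming in this report is done by position (the first 33 columns). As an alternative sketch, columns that are entirely NA could be dropped by content instead, which avoids hard-coding the column count. The toy table and its `extra_1` and `extra_2` columns are invented for illustration.

```r
library(dplyr)

# Toy table with two real-looking columns and two all-NA columns (invented).
raw_toy <- data.frame(
  Source  = c("a", "b"),
  Roast   = c("Light", "Dark"),
  extra_1 = c(NA, NA),
  extra_2 = c(NA, NA)
)

# Keep only the columns that contain at least one non-missing value.
trimmed <- raw_toy %>%
  select(where(~ !all(is.na(.x))))
```

One caveat of this approach is that a genuine variable that happens to be empty would also be dropped, so the positional trim may still be the safer choice when the true column count is known.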

The trimmed data for chlorogenic acids has the dimensions:

## Dimensions of the chlorogenic acids dataset:
dim(coffeeAcidityCGA[1:33])

[1] 1344   33

The columns are called:

## The columns are called:
colnames(coffeeAcidityCGA[1:33])

 [1] "Source"                    "Type"                     
 [3] "Roast"                     "Extraction"               
 [5] "Stats"                     "Other"                    
 [7] "Units"                     "total CQA"                
 [9] "Total FQA"                 "Total diCQA"              
[11] "3-CQA"                     "4-CQA"                    
[13] "5-CQA"                     "3-FQA"                    
[15] "5-pCoQA"                   "5-FQA"                    
[17] "Ferulic Acid"              "4-FQA"                    
[19] "3,5-diCQA"                 "3,4-diCQA"                
[21] "4,5-DiCQA"                 "3-pCo,5-CQA"              
[23] "3-C,4-FQA"                 "3-C,5-FQA"                
[25] "4,5-FQA"                   "3-C,4-FQA and 3-pCo,4-CQA"
[27] "3-C,5-DQA"                 "3-C,4-DQA"                
[29] "3-D,5-FQA"                 "Nicotinic Acid"           
[31] "3-CGA"                     "Total CGA"                
[33] "Notes"

The trimmed data for organic acids has the dimensions:

## The dimensions of the organic acids dataset are:
dim(coffeeAcidityOA[1:33])

[1] 287  33

That data has column names:

## Those columns are called:
colnames(coffeeAcidityOA[1:33])

 [1] "Source"         "Type"           "Roast"          "Extraction"    
 [5] "Stat"           "Other"          "Units"          "Citric"        
 [9] "Formic"         "Malic"          "Pyruvic"        "Quinic"        
[13] "Succinic"       "Acetic"         "Oxalic"         "Fumaric"       
[17] "Tartaric"       "Lactic"         "Glycolic"       "Nitric"        
[21] "Mesaconic"      "Maleic"         "Isocitric"      "Citraconic"    
[25] "Propionic"      "2-Furoic"       "Pyroglutamic"   "Phosphoric"    
[29] "Levulinic Acid" "Methylsuccinic" "Nicotinic"      "Ascorbic"      
[33] "Hydroxybenzoic"

Working with this additional data will require that I combine these three datasets along the “roast” variable and compare the relevant variables.

For Organic Acids, I will need to create a new variable which adds up the total relevant organic acids. This will be a double type as it is a measure of concentration.

In the CGAs Table, I will similarly need to create a new variable which sums the total relevant CGAs. This will be a double type as it is a measure of concentration.
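A rough sketch of building such a total is shown below for the organic acids case; the same pattern would apply to the CGA columns. The column names `Citric`, `Malic` and `Quinic` exist in the organic acids table, but which acids count as "relevant" is not yet decided, so `acid_cols` is only a placeholder selection, and `oa_toy`, `total_oa` and all values are invented.

```r
# Toy rows standing in for the organic acids table (values are invented).
oa_toy <- data.frame(
  Roast  = c("Light", "Medium", "Dark"),
  Citric = c(5.0, 4.0, NA),
  Malic  = c(2.0, 1.5, 1.0),
  Quinic = c(3.0, 4.0, 6.0)
)

# Placeholder choice of which acids to include in the total.
acid_cols <- c("Citric", "Malic", "Quinic")

# Row-wise sum across the chosen columns; the result is a double vector.
oa_totals <- oa_toy
oa_totals$total_oa <- rowSums(oa_toy[acid_cols], na.rm = TRUE)
```

Note that `na.rm = TRUE` treats a missing measurement as zero, which can understate a row's total. Rows with many missing values may need to be flagged or excluded instead.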

Once this has been done, those two variables can be compared, by roast (chr type, qualitative), against what the taste testers believed to be acidic and non-acidic (chr type, qualitative).
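One way this comparison might be drawn is a faceted scatterplot, sketched below on toy data. It assumes the taster responses have already been reduced to a numeric mean rating per roast, which is an assumption on my part; `compare_toy`, `measure`, `measured`, `mean_rating` and all values are invented for illustration.

```r
library(ggplot2)

# Toy summary table: one row per roast and comparison (values are invented).
compare_toy <- data.frame(
  Roast       = rep(c("Light", "Medium", "Dark"), times = 2),
  measure     = rep(c("Organic acids vs. rated acidity",
                      "CGAs vs. rated bitterness"), each = 3),
  measured    = c(10, 8, 5, 30, 21, 9),
  mean_rating = c(3.6, 3.0, 2.2, 2.1, 2.9, 3.8)
)

# One panel per comparison, with a free x axis because the two
# concentration scales differ.
p <- ggplot(compare_toy, aes(x = measured, y = mean_rating, colour = Roast)) +
  geom_point(size = 3) +
  facet_wrap(~ measure, scales = "free_x") +
  labs(x = "Measured concentration (toy values)",
       y = "Mean taster rating (toy values)")
```

Printing `p` renders the plot. With the real data, each panel would show whether perceived acidity or bitterness moves in the same direction as the measured chemistry across roasts.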

After that, I would like to analyze how likely a person is to be correct based on their favorite coffee beverage.
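As a first pass, this could be a simple proportion of correct judgments per favorite drink, sketched below. The `favorite` column exists in the survey, but `correct` is a hypothetical logical variable that would have to be derived first, and the toy rows are invented.

```r
library(dplyr)

# Toy rows: one per taster, with a hypothetical correct/incorrect flag.
taste_toy <- data.frame(
  favorite = c("Latte", "Latte", "Pourover", "Pourover", "Pourover", "Espresso"),
  correct  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Share of correct judgments per favorite drink, with the group size
# kept alongside so small groups are easy to spot.
accuracy_by_favorite <- taste_toy %>%
  group_by(favorite) %>%
  summarise(n = n(), prop_correct = mean(correct), .groups = "drop") %>%
  arrange(desc(prop_correct))
```

Keeping `n` next to the proportion matters because a drink chosen by only a handful of people can look very accurate by chance. A logistic regression such as `glm(correct ~ favorite, family = binomial)` would be one possible next step if the group sizes allow it.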

The second question will be much more straightforward: it involves examining the political affiliation selected by the participants alongside which of the coffees they preferred.
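A possible starting point is a table of preference shares within each affiliation, sketched below. The column names `political_affiliation` and `prefer_overall` come from the survey, but the toy rows and the category labels in them are invented.

```r
library(dplyr)

# Toy survey rows (labels and values are invented for illustration).
survey_toy <- data.frame(
  political_affiliation = c("Party X", "Party X", "Party X",
                            "Party Y", "Party Y", NA),
  prefer_overall        = c("Coffee A", "Coffee A", "Coffee D",
                            "Coffee B", "Coffee D", "Coffee A")
)

# Count each affiliation/preference pair, then convert the counts to
# shares within each affiliation so groups of different sizes compare fairly.
pref_by_affiliation <- survey_toy %>%
  filter(!is.na(political_affiliation), !is.na(prefer_overall)) %>%
  count(political_affiliation, prefer_overall) %>%
  group_by(political_affiliation) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```

Using within-group shares rather than raw counts keeps a large affiliation group from dominating the picture. A chi-squared test on the underlying contingency table, for example via `chisq.test()`, might then indicate whether any differences are larger than chance would suggest.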

Weekly Plan of Attack Table: Last updated: 5/28/2024
Task Name	Status	Assignee	Due	Priority	Summary
Add any extras or polish up wording	not yet started	Cristina Lafuente	June 10, 2024	Low	As possible nice-to-have additions come up, add them to this list
Begin working on presentation and write-up	not yet started	Cristina Lafuente	June 3, 2024	Moderate	Once feedback is in, begin wrangling data and working on presentation
Finish proposal and give feedback	complete	Cristina Lafuente	Wednesday, May 29: 5pm	High	Push completed proposal and complete peer review