Alcohol vs. Happiness: an Investigation into Alcoholism

Author

Adam Del Rio, Dillon Murphy, Ruben Jimenez, Nick Patrick

Introduction

As college students, our group was especially interested regarding the implications alcohol has on the quality of life. So, we set out to investigate the existence of such association. Based on empirical data collected in “The Association between Mental Wellbeing, Levels of Harmful Drinking, and Drinking Motivations: A Cross-Sectional Study of the UK Adult Population,” which demonstrates that low-risk drinking makes for higher well-being than abstinence, harmful drinking, and hazardous drinking, we hypothesize that happiness has a positive association with alcohol consumption until a certain point at which it will being to have a negative association. This certain point will likely be determined in our analysis.

The “happiness score” dataset measures the national average response to a Cantril life ladder survey from 2004 to 2020. The Cantril life ladder is a well-being quantitative assessment based on an imaginary step ladder where the bottom rung is minimum life satisfaction (0) and the top rung is the best possible life satisfaction (10), explained further here. Scores were then scaled from 0-100 to effectively communicate percentages.

The “alcohol consumption per capita” dataset measures projected estimates for the quantity of alcohol consumed in people over the age of 15 from 1999 to 2017, in liters. Data were collected quinquennially, with the exception of the 2014-2017 timespan, where data were collected triennially.

Data

Both datasets were pivoted longer, and “year” columns were converted to numeric variables. We also realized there was not much justification to keeping the many years from the alcohol dataset that were entirely missing.

Code

# Pivoting Longer, convert years to numeric and dropping unnecessary missing alcohol columns
alc_final <- alc |> 
  select("country", "1999", "2004", "2009", "2014", "2017") |> 
  pivot_longer(cols = "1999":"2017",
               names_to = "year",
               values_to = "alcohol_consumption")|>  
  mutate(
    across("year", as.numeric))


happy_final <- happy |> 
  pivot_longer(cols = `2004`:`2020`,
               names_to = "year",
               values_to = "happiness_score")|>  
  mutate(
    across("year", as.numeric))

Then, we joined the datasets using an inner join so we could keep only the years both datasets had in common.

Code

# Inner Joining the data
final_data <- inner_join(happy_final, alc_final, by = c("year", "country"))
final_data |> 
  datatable(class = "cell-border stripe hover", 
                   colnames = c("Country",
                                "Year",
                                "Mean Happiness Score",
                                "Mean Alcohol Consumption"),
                   rownames = FALSE,
                   options = list(paging = TRUE, 
                                  searching = TRUE, 
                                  info = TRUE))

Our table contains 636 observations displaying mean happiness scores over a year against mean alcohol consumptions over a single year within a single country. The maximum mean happiness score came from Denmark in 2004 at 80.2 pts, while the maximum mean alcohol consumption came from Moldova in 2004 at 19.9 L. The minimum mean happiness score came from Afghanistan in 2017 at 26.9 pts, while the minimum mean alcohol consumption came from Kuwait in both 2014 and 2017 at .003 L.

Linear Regression

Code

# Font formatting credit: https://ggplot2.tidyverse.org/articles/faq-customising.html
actual_reg <- ggplot(data = final_data, mapping = aes(x = alcohol_consumption, y = happiness_score)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(title = "Observed Happiness Score\nBy Alcohol Consumption",
       x = "Average Alcohol Consumption (L)",
       subtitle = "Observed Average Happiness Score (pts)",
       y= NULL) +
  theme_bw(base_family = "Times New Roman") +
  scale_x_continuous(limits = c(0, 20)) +
  scale_y_continuous(limits = c(20, 90))
animate(actual_reg + transition_states(year,
    transition_length = 2,
    state_length = 1
  ), fps = 15, duration = 15, renderer = gifski_renderer())

Based on our linear regression model investigating the strength of the association between alcohol consumption per capita and happiness score by each year, we can see that there is a weak positive association between these variables every year. Overall, there is likely a weak positive association between alcohol consumption per capita and happiness score.

Code

average_year <- final_data |> 
  group_by(year) |> 
  summarize(mean_happy = mean(happiness_score, na.rm = TRUE),
            mean_alc = mean(alcohol_consumption, na.rm = TRUE))

plot1 <- ggplot(data = average_year, mapping = aes(x = year, y = mean_happy)) +
  geom_line(color = "blue") +
  labs(title = "Average Happiness Over Time (pts)",
       x = "Year",
       y= "",
       subtitle = "Average Happiness Score") +
  theme_bw(base_family = "Times New Roman")

plot2 <- ggplot(data = average_year, mapping = aes(x = year, y = mean_alc)) +
  geom_line(color = "red") +
  labs(title = "Average Alcohol Consumption \nOver Time (L)",
       x = "Year",
       y= "",
       subtitle = "Average Alcohol Consumption") +
  theme_bw(base_family = "Times New Roman")

plot1 + plot2

For the plots above, we grouped the observations by year and found the mean happiness score and mean alcohol consumption per capita in liters. Based on our time series plots for mean happiness score and mean alcohol consumption per capita, both happiness and alcohol consumption have decreased over time. From 2004-2009, happiness decreased very rapidly from an average score around 64 happiness points all the way down to about 55 happiness points. Meanwhile average alcohol consumption was actually increasing very slightly. Past 2009, average happiness began to level off, and even increases after 2014. On the other hand, average alcohol consumption rapidly dropped from just under 6.35 liters per capita down to less than 6.1 liters.

Model Fit

Code

my_model <- lm(happiness_score ~ alcohol_consumption, 
               data = final_data
               )
my_model |> 
  tidy() |>
  mutate(term = case_when(term == "alcohol_consumption" ~ "Alcohol Consumption",
                   term == "(Intercept)" ~ "Intercept")) |>
  kbl(col.names = c("Term",
                    "Estimate",
                    "Standard Error",
                    "Statistic",
                    "P-value"),
      caption = "Regression Output",
      digits = 4) |>
  kable_classic_2(html_font = "Times New Roman",
                  lightable_options = "striped")

Regression Output
Term	Estimate	Standard Error	Statistic	P-value
Intercept	48.5803	0.9257	52.4776	0
Alcohol Consumption	1.0012	0.1157	8.6560	0

\[ \widehat{happiness~score} = 1.001167(alcohol~consumption) + 48.580332 \]

Based on the model we fit predicting happiness score by alcohol consumption per capita, we can see that for every 1 liter increase in alcohol consumption per capita, happiness score increases by approximately .001 pts. At 0 liters alcohol consumption per capita, we predict happiness score to be approximately 48.58 pts.

Code

augmented <- augment(my_model)
response_var <- var(augmented$happiness_score) |>
  round(digits = 3)
fitted_var <- var(augmented$.fitted) |> 
  round(digits = 3)
residual_var <- var(augmented$.resid) |> 
  round(digits = 3)
table <- data.frame(response_var, fitted_var, residual_var)
table |> 
  kbl(col.names = c("Variation in Response",
                  "Variation in the Fitted Values",
                  "Variation in the Residuals"),
      caption = "Variation",
      digits = 3) |>
  kable_classic_2(html_font = "Times New Roman",
                  lightable_options = "striped")

Variation
Variation in Response	Variation in the Fitted Values	Variation in the Residuals
127.019	19.307	107.711

Code

model_fit <- round((fitted_var/(response_var))*100,
                   digits = 3)

By dividing the amount of variation in the fitted values (values from our predictive model) by the total variation in the observed responses, we obtain an $R^{2}$ statistic representing the amount of variation explained by our model. In other words, this statistic explains how well our model can predict the pattern in the data. Based on the $R^{2}$ we calculated for the model we fit, only 15.2% of variability in the response values were accounted for by the regression model. With such a low $R^{2}$ this confirms that our model is unfit for the data.

Simulation

Code

noise <- function(x, mean = 0, sd){
  x + rnorm(length(x), 
            mean, 
            sd)
}

sim <- function(){
  preds <- predict(my_model)

  sd <- sigma(my_model)

  tibble(sim_happiness_score = noise(preds, sd = sd))
}

Code

set.seed(1234)
final_preds <- sim()
new <- final_data |> na.omit(alcohol_consumption)

simulated <- final_preds |> mutate(
  alcohol_consumption = new$alcohol_consumption
)

Code

sim_reg <- ggplot(data = simulated, mapping = aes(x = alcohol_consumption, y = sim_happiness_score)) +
  geom_point() +
  geom_smooth(method = "lm") + 
  labs(title = "Predicted Happiness Score\nBy Alcohol Consumption Per Capita",
       x = "Average Alcohol Consumption Per Capita (L)",
       subtitle = "Predicted Average Happiness Score (pts)",
       y= NULL) +
  theme_bw(base_family = "Times New Roman") +
  scale_x_continuous(limits = c(0, 20)) +
  scale_y_continuous(limits = c(20, 90))

actual_reg + sim_reg

As we can see by the visualizations of observed data and simulated data, the two plots do look awfully similar, especially the linear regression lines which look almost the same. However, there are still many differences. The simulated data seems to have a larger variance than the observed data. The simulated data has larger and smaller extreme values than the observed data, and these extreme values seem to occur more often. Also, the data is clustered differently within the two plots. While the observed data has a large cluster around 0 liters of average alcohol consumption per capita, the simulated does not and instead has a similar cluster around 10 liters of average alcohol consumption per capita that the observed data does not have.

Code

r2s <- vector()
for (i in c(1:1000)){
  predicted <- sim()
  sim_data <- new |> 
    select(happiness_score) |> 
    na.omit() |> 
    bind_cols(predicted)
  
  sim_r2 <- lm(happiness_score ~ sim_happiness_score,
               data = sim_data
               ) |>
    glance() |>
    select(r.squared) |>
    pull()
  r2s <- append(r2s, sim_r2)
}

Code

ggplot(data = tibble(r2s), mapping = aes(x = r2s)) +
  geom_histogram(aes(y = ..density..), fill = "skyblue") + 
  geom_density(color = "darkgrey", size = 1) +
  theme_minimal(base_family = "Times New Roman") + 
  labs(x = bquote(R^2), y = "", title = "Histogram of 1000 Simulated R Squared Values",
       subtitle = "Density")

Based on the distribution of 1000 very small $R^{2}$ values of < 0.1 from the regressions, we can see that the model does not generate data that is remotely similar to what was observed. Less than 10% of the variation in the observed happiness scores can be explained by the predicted happiness score for all 1000 simulations. This makes sense since there did not end up being a pattern among the data or any large association between alcohol consumption per capita and happiness score. Since the plot was so scattered with no apparent pattern, it would be hard for any model to be able to predict the observations.

Conclusion

Based on the regression coefficient = 1.001 we got from our regression model, we expected there to be a very weak positive association between alcohol consumption per capita and happiness score. However, upon inspecting the regression model’s fit to the data, we found the $R{^2}$ statistic to be .152, which means that only 15.2% of the variability in the happiness scores can actually be explained by alcohol consumption. This means there are likely many sources of extraneous variability; that is, many other factors that may impact happiness scores. So, as it turns out, we can’t necessarily use alcohol consumption per capita to predict happiness score within a country. Experimentation using random assignment, data collection with more observations, and the inclusion of possible interactions would likely improve the accuracy of the model we fit and allow us to get a better understanding of the association between alcohol consumption and happiness score.

--- title: "Alcohol vs. Happiness: an Investigation into Alcoholism" author: "Adam Del Rio, Dillon Murphy, Ruben Jimenez, Nick Patrick" format: html: self-contained: true code-tools: true toc: true code-fold: true mainfont: "Times New Roman" editor: source execute: error: true echo: true message: false warning: false --- ```{r Setup} #| include: false library(tidyverse) library(here) library(broom) library(patchwork) library(DT) library(kableExtra) library(gganimate) library(gifski) library(transformr) alc <- read_csv(here::here("sh_alc_pcap_li.csv")) happy <- read_csv(here::here("hapiscore_whr.csv")) ``` ## Introduction As college students, our group was especially interested regarding the implications alcohol has on the quality of life. So, we set out to investigate the existence of such association. Based on empirical data collected in [“The Association between Mental Wellbeing, Levels of Harmful Drinking, and Drinking Motivations: A Cross-Sectional Study of the UK Adult Population,”](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6069385/) which demonstrates that low-risk drinking makes for higher well-being than abstinence, harmful drinking, and hazardous drinking, we hypothesize that happiness has a positive association with alcohol consumption until a certain point at which it will being to have a negative association. This certain point will likely be determined in our analysis. The “happiness score” dataset measures the national average response to a Cantril life ladder survey from 2004 to 2020. The Cantril life ladder is a well-being quantitative assessment based on an imaginary step ladder where the bottom rung is minimum life satisfaction (0) and the top rung is the best possible life satisfaction (10), explained further [here.](https://news.gallup.com/poll/122453/understanding-gallup-uses-cantril-scale.aspx.) Scores were then scaled from 0-100 to effectively communicate percentages. The “alcohol consumption per capita” dataset measures projected estimates for the quantity of alcohol consumed in people over the age of 15 from 1999 to 2017, in liters. Data were collected quinquennially, with the exception of the 2014-2017 timespan, where data were collected triennially. ## Data Both datasets were pivoted longer, and "year" columns were converted to numeric variables. We also realized there was not much justification to keeping the many years from the alcohol dataset that were entirely missing. ```{r} # Pivoting Longer, convert years to numeric and dropping unnecessary missing alcohol columns alc_final <- alc |> select("country", "1999", "2004", "2009", "2014", "2017") |> pivot_longer(cols = "1999":"2017", names_to = "year", values_to = "alcohol_consumption")|> mutate( across("year", as.numeric)) happy_final <- happy |> pivot_longer(cols = `2004`:`2020`, names_to = "year", values_to = "happiness_score")|> mutate( across("year", as.numeric)) ``` Then, we joined the datasets using an inner join so we could keep only the years both datasets had in common. ```{r} # Inner Joining the data final_data <- inner_join(happy_final, alc_final, by = c("year", "country")) final_data |> datatable(class = "cell-border stripe hover", colnames = c("Country", "Year", "Mean Happiness Score", "Mean Alcohol Consumption"), rownames = FALSE, options = list(paging = TRUE, searching = TRUE, info = TRUE)) ``` Our table contains 636 observations displaying mean happiness scores over a year against mean alcohol consumptions over a single year within a single country. The maximum mean happiness score came from Denmark in 2004 at 80.2 pts, while the maximum mean alcohol consumption came from Moldova in 2004 at 19.9 L. The minimum mean happiness score came from Afghanistan in 2017 at 26.9 pts, while the minimum mean alcohol consumption came from Kuwait in both 2014 and 2017 at .003 L. ## Linear Regression ```{r} # Font formatting credit: https://ggplot2.tidyverse.org/articles/faq-customising.html actual_reg <- ggplot(data = final_data, mapping = aes(x = alcohol_consumption, y = happiness_score)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Observed Happiness Score\nBy Alcohol Consumption", x = "Average Alcohol Consumption (L)", subtitle = "Observed Average Happiness Score (pts)", y= NULL) + theme_bw(base_family = "Times New Roman") + scale_x_continuous(limits = c(0, 20)) + scale_y_continuous(limits = c(20, 90)) animate(actual_reg + transition_states(year, transition_length = 2, state_length = 1 ), fps = 15, duration = 15, renderer = gifski_renderer()) ``` Based on our linear regression model investigating the strength of the association between alcohol consumption per capita and happiness score by each year, we can see that there is a weak positive association between these variables every year. Overall, there is likely a weak positive association between alcohol consumption per capita and happiness score. ```{r} average_year <- final_data |> group_by(year) |> summarize(mean_happy = mean(happiness_score, na.rm = TRUE), mean_alc = mean(alcohol_consumption, na.rm = TRUE)) plot1 <- ggplot(data = average_year, mapping = aes(x = year, y = mean_happy)) + geom_line(color = "blue") + labs(title = "Average Happiness Over Time (pts)", x = "Year", y= "", subtitle = "Average Happiness Score") + theme_bw(base_family = "Times New Roman") plot2 <- ggplot(data = average_year, mapping = aes(x = year, y = mean_alc)) + geom_line(color = "red") + labs(title = "Average Alcohol Consumption \nOver Time (L)", x = "Year", y= "", subtitle = "Average Alcohol Consumption") + theme_bw(base_family = "Times New Roman") plot1 + plot2 ``` For the plots above, we grouped the observations by year and found the mean happiness score and mean alcohol consumption per capita in liters. Based on our time series plots for mean happiness score and mean alcohol consumption per capita, both happiness and alcohol consumption have decreased over time. From 2004-2009, happiness decreased very rapidly from an average score around 64 happiness points all the way down to about 55 happiness points. Meanwhile average alcohol consumption was actually increasing very slightly. Past 2009, average happiness began to level off, and even increases after 2014. On the other hand, average alcohol consumption rapidly dropped from just under 6.35 liters per capita down to less than 6.1 liters. ## Model Fit ```{r} my_model <- lm(happiness_score ~ alcohol_consumption, data = final_data ) my_model |> tidy() |> mutate(term = case_when(term == "alcohol_consumption" ~ "Alcohol Consumption", term == "(Intercept)" ~ "Intercept")) |> kbl(col.names = c("Term", "Estimate", "Standard Error", "Statistic", "P-value"), caption = "Regression Output", digits = 4) |> kable_classic_2(html_font = "Times New Roman", lightable_options = "striped") ``` $$ \widehat{happiness~score} = 1.001167(alcohol~consumption) + 48.580332 $$ Based on the model we fit predicting happiness score by alcohol consumption per capita, we can see that for every 1 liter increase in alcohol consumption per capita, happiness score increases by approximately .001 pts. At 0 liters alcohol consumption per capita, we predict happiness score to be approximately 48.58 pts. ```{r} augmented <- augment(my_model) response_var <- var(augmented$happiness_score) |> round(digits = 3) fitted_var <- var(augmented$.fitted) |> round(digits = 3) residual_var <- var(augmented$.resid) |> round(digits = 3) table <- data.frame(response_var, fitted_var, residual_var) table |> kbl(col.names = c("Variation in Response", "Variation in the Fitted Values", "Variation in the Residuals"), caption = "Variation", digits = 3) |> kable_classic_2(html_font = "Times New Roman", lightable_options = "striped") ``` ```{r} model_fit <- round((fitted_var/(response_var))*100, digits = 3) ``` By dividing the amount of variation in the fitted values (values from our predictive model) by the total variation in the observed responses, we obtain an $R^{2}$ statistic representing the amount of variation explained by our model. In other words, this statistic explains how well our model can predict the pattern in the data. Based on the $R^{2}$ we calculated for the model we fit, only `r model_fit`% of variability in the response values were accounted for by the regression model. With such a low $R^{2}$ this confirms that our model is unfit for the data. ## Simulation ```{r} noise <- function(x, mean = 0, sd){ x + rnorm(length(x), mean, sd) } sim <- function(){ preds <- predict(my_model) sd <- sigma(my_model) tibble(sim_happiness_score = noise(preds, sd = sd)) } ``` ```{r} set.seed(1234) final_preds <- sim() new <- final_data |> na.omit(alcohol_consumption) simulated <- final_preds |> mutate( alcohol_consumption = new$alcohol_consumption ) ``` ```{r} sim_reg <- ggplot(data = simulated, mapping = aes(x = alcohol_consumption, y = sim_happiness_score)) + geom_point() + geom_smooth(method = "lm") + labs(title = "Predicted Happiness Score\nBy Alcohol Consumption Per Capita", x = "Average Alcohol Consumption Per Capita (L)", subtitle = "Predicted Average Happiness Score (pts)", y= NULL) + theme_bw(base_family = "Times New Roman") + scale_x_continuous(limits = c(0, 20)) + scale_y_continuous(limits = c(20, 90)) actual_reg + sim_reg ``` As we can see by the visualizations of observed data and simulated data, the two plots do look awfully similar, especially the linear regression lines which look almost the same. However, there are still many differences. The simulated data seems to have a larger variance than the observed data. The simulated data has larger and smaller extreme values than the observed data, and these extreme values seem to occur more often. Also, the data is clustered differently within the two plots. While the observed data has a large cluster around 0 liters of average alcohol consumption per capita, the simulated does not and instead has a similar cluster around 10 liters of average alcohol consumption per capita that the observed data does not have. ```{r} r2s <- vector() for (i in c(1:1000)){ predicted <- sim() sim_data <- new |> select(happiness_score) |> na.omit() |> bind_cols(predicted) sim_r2 <- lm(happiness_score ~ sim_happiness_score, data = sim_data ) |> glance() |> select(r.squared) |> pull() r2s <- append(r2s, sim_r2) } ``` ```{r} ggplot(data = tibble(r2s), mapping = aes(x = r2s)) + geom_histogram(aes(y = ..density..), fill = "skyblue") + geom_density(color = "darkgrey", size = 1) + theme_minimal(base_family = "Times New Roman") + labs(x = bquote(R^2), y = "", title = "Histogram of 1000 Simulated R Squared Values", subtitle = "Density") ``` Based on the distribution of 1000 very small $R^{2}$ values of < 0.1 from the regressions, we can see that the model does not generate data that is remotely similar to what was observed. Less than 10% of the variation in the observed happiness scores can be explained by the predicted happiness score for all 1000 simulations. This makes sense since there did not end up being a pattern among the data or any large association between alcohol consumption per capita and happiness score. Since the plot was so scattered with no apparent pattern, it would be hard for any model to be able to predict the observations. ## Conclusion Based on the regression coefficient = 1.001 we got from our regression model, we expected there to be a very weak positive association between alcohol consumption per capita and happiness score. However, upon inspecting the regression model’s fit to the data, we found the $R{^2}$ statistic to be .152, which means that only 15.2% of the variability in the happiness scores can actually be explained by alcohol consumption. This means there are likely many sources of extraneous variability; that is, many other factors that may impact happiness scores. So, as it turns out, we can’t necessarily use alcohol consumption per capita to predict happiness score within a country. Experimentation using random assignment, data collection with more observations, and the inclusion of possible interactions would likely improve the accuracy of the model we fit and allow us to get a better understanding of the association between alcohol consumption and happiness score.