class: center, middle, inverse, title-slide .title[ # Introduction to Data Science ] .subtitle[ ## Week 9: Causal inference and Regression Analysis with difference-in-differences (DiD) ] .author[ ### Ugur Aytun ] .institute[ ### METU, Department of Economics | ECON 413 ] --- --- # Causal inference -- - Causal inference is the process of drawing conclusions about causal relationships from data, often using statistical methods to control for confounding variables. -- - It is important to distinguish between correlation and causation, as correlation does not imply causation. -- - Causal inference methods include randomized controlled trials, observational studies, and quasi-experimental designs. -- - Difference-in-differences (DiD) is a quasi-experimental design that compares the changes in outcomes over time between a treatment group and a control group, allowing for the estimation of causal effects. -- - RDD (Regression Discontinuity Design) is another quasi-experimental design that exploits a cutoff or threshold to identify causal effects, often used when random assignment is not feasible. --- # Causal inference -- - Instrumental variables (IV) are used to address endogeneity issues in regression analysis, where an instrument is correlated with the treatment but not directly with the outcome, allowing for causal inference. -- - PSM (Propensity Score Matching) is a method used to control for confounding variables by matching treated and control units based on their propensity scores, which estimate the probability of treatment assignment given observed covariates. -- - As we saw in the last week, `fixest` package can be used to implement causal inference methods in R, including DiD, RDD, IV, and PSM, providing a flexible framework for regression analysis with fixed effects. --- # Popularity of causal inference methods .h-400px[  ] --- # DiD -- - DiD is a method used to estimate the causal effect of a treatment on an outcome by comparing the changes in outcomes over time between a treatment group and a control group. -- - For example, if a new policy is implemented in one state but not in another, the changes in outcomes in the two states can be compared to estimate the effect of the policy. -- - In this lecture, we will focus on the difference-in-differences (DiD) method, which is a popular approach in causal inference. --- # Did Brexit have an effect on the Turkish export flows to the UK? .scrollable[ ``` r # Clear workspace and garbage collection rm(list = ls()) gc() ``` ``` ## used (Mb) gc trigger (Mb) max used (Mb) ## Ncells 540717 28.9 1214712 64.9 660385 35.3 ## Vcells 985614 7.6 8388608 64.0 1769874 13.6 ``` ``` r #library('R.utils') library(data.table) ``` ``` ## Warning: package 'data.table' was built under R version 4.3.3 ``` ``` r library(sf) ``` ``` ## Warning: package 'sf' was built under R version 4.3.3 ``` ``` ## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE ``` ``` r library(ggplot2) ``` ``` ## Warning: package 'ggplot2' was built under R version 4.3.3 ``` ``` r library(maps) ``` ``` ## Warning: package 'maps' was built under R version 4.3.3 ``` ``` r library(scales) # for squish ``` ``` ## Warning: package 'scales' was built under R version 4.3.3 ``` ``` r library(dplyr) ``` ``` ## Warning: package 'dplyr' was built under R version 4.3.3 ``` ``` ## ## Attaching package: 'dplyr' ``` ``` ## The following objects are masked from 'package:data.table': ## ## between, first, last ``` ``` ## The following objects are masked from 'package:stats': ## ## filter, lag ``` ``` ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` ``` r library(fixest) ``` ``` ## Warning: package 'fixest' was built under R version 4.3.3 ``` ``` ## ## Attaching package: 'fixest' ``` ``` ## The following object is masked from 'package:scales': ## ## pvalue ``` ``` r setwd("H:/My Drive/ECON413") # Call the trade dataset trade_data <- setDT(fread("data/trade data.csv")) # drop the unnecessary columns and change the column names trade_data <- trade_data[, .(year = period, partnercode = partnerCode, partnername = partnerDesc, value = primaryValue)] # generate the treatment variable that takes valus of one the partner is the UK after Brexit date trade_data[, treatment := as.integer(partnercode == 826 & year >= 2016)] reg_brexit <- feols(log(value) ~ treatment | # variable of interest partnercode + year, # fixed effects data = trade_data, cluster = c("partnercode")) summary(reg_brexit) ``` ``` ## OLS estimation, Dep. Var.: log(value) ## Observations: 2,661 ## Fixed-effects: partnercode: 235, year: 12 ## Standard-errors: Clustered (partnercode) ## Estimate Std. Error t value Pr(>|t|) ## treatment -0.139658 0.050645 -2.75756 0.006283 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.736856 Adj. R2: 0.950577 ## Within R2: 3.023e-5 ``` ``` r # The coefficient of the treatment variable is 0.12, # which means that the Brexit had a negative effect on the Turkey after 2016 export to the UK compared to other partners. ``` ] --- # Effectiveness of active choice organ donor policy of California .scrollable[ ``` r # install.packages("causaldata") # install.packages("modelsummary") library(data.table) library(fixest) library(causaldata) ``` ``` ## Warning: package 'causaldata' was built under R version 4.3.3 ``` ``` r # call the organ donations dataset od <- causaldata::organ_donations od <- setDT(od) od[, treated := as.integer(State == "California" & (Quarter == "Q32011" | Quarter == "Q42011" | Quarter == "Q12012"))] clfe <- feols(Rate ~ treated | State + Quarter, data = od) summary(clfe) ``` ``` ## OLS estimation, Dep. Var.: Rate ## Observations: 162 ## Fixed-effects: State: 27, Quarter: 6 ## Standard-errors: Clustered (State) ## Estimate Std. Error t value Pr(>|t|) ## treated -0.022459 0.006131 -3.66304 0.0011185 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.021982 Adj. R2: 0.974196 ## Within R2: 0.009221 ``` ``` r # Create dummy for the California od[, California := as.integer(State == "California")] # We implement event study to see the effect of the treatment over time. # it is also a placebo for the treatment effect. There should be no effect before the treatment. # event study clfe <- feols(Rate ~ i(Quarter_Num, California, ref = 3) | State + Quarter_Num, data = od) summary(clfe) ``` ``` ## OLS estimation, Dep. Var.: Rate ## Observations: 162 ## Fixed-effects: State: 27, Quarter_Num: 6 ## Standard-errors: Clustered (State) ## Estimate Std. Error t value Pr(>|t|) ## Quarter_Num::1:California -0.002942 0.005084 -0.578719 0.56775872 ## Quarter_Num::2:California 0.006296 0.002266 2.778832 0.00999724 ** ## Quarter_Num::4:California -0.021565 0.005034 -4.284177 0.00022209 *** ## Quarter_Num::5:California -0.020292 0.004473 -4.536282 0.00011432 *** ## Quarter_Num::6:California -0.022165 0.010013 -2.213610 0.03583451 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## RMSE: 0.021976 Adj. R2: 0.973385 ## Within R2: 0.009787 ``` ``` r coefplot(clfe) ``` <!-- --> ] --- # Event study -- - An event study is a statistical method used to assess the impact of a specific event on the value of a firm or an economy. -- - It serves two thing: parallel trend test that the treatment and control groups have similar trends before the event, and the dynamic effect of the event on the outcome variable. --- # Timing of the treatment -- - Examples above assume that the treatment is applied at the same time for all units, which is defined as non-staggered DiD. -- - If the timing of the treatment is not common across units, staggered DiD can be used to estimate the treatment effect by allowing for different treatment timings across units. -- - Modern methods for staggered DiD include the use of synthetic control methods, which create a synthetic control group that mimics the characteristics of the treatment group before the treatment. --- # Exposure to the treatment -- - In some cases, the treatment may not be applied uniformly across all units, leading to varying levels of exposure to the treatment. -- - In this case treatment is not binary, but continuous, and the treatment effect can be estimated using a continuous treatment variable.