class: center, middle, inverse, title-slide .title[ # Introduction to Data Science ] .subtitle[ ## Week 4: Data Manipulation with data.table ] .author[ ### Ugur Aytun ] .institute[ ### METU, Department of Economics | ECON 413 ] --- --- # Call the data -- - fread function helps us to import the data with external format (txt, csv etc.) ``` r setwd("H:/My Drive/ECON413") library(data.table) ``` ``` ## Warning: package 'data.table' was built under R version 4.3.3 ``` ``` r tariff_dataset <- setDT(fread("Data/import_isic2.csv")) head(tariff_dataset) ``` ``` ## isic2 year value weight sTariff ## <int> <int> <num> <num> <num> ## 1: 230101 1993 29444681 753045995 3 ## 2: 230101 1995 48911100 1163558144 2 ## 3: 230101 1996 57729752 1189007488 1 ## 4: 230101 1997 60687212 1364897536 1 ## 5: 230101 1998 76593112 1695382272 1 ## 6: 230101 1999 47024669 1219192420 1 ``` --- # Call the data (cont'd) -- - We can see the certain number of rows. This is helpful while opening the large datasets. ``` r setwd("H:/My Drive/ECON413") library(data.table) tariff_dataset <- setDT(fread("Data/import_isic2.csv", nrows = 200)) tariff_dataset[, .N] ``` ``` ## [1] 200 ``` --- # Basic characteristics of tariff_dataset object -- - Head of the data (first six observations) ``` r setwd("H:/My Drive/ECON413") tariff_dataset <- setDT(fread("Data/import_isic2.csv")) head(tariff_dataset) ``` ``` ## isic2 year value weight sTariff ## <int> <int> <num> <num> <num> ## 1: 230101 1993 29444681 753045995 3 ## 2: 230101 1995 48911100 1163558144 2 ## 3: 230101 1996 57729752 1189007488 1 ## 4: 230101 1997 60687212 1364897536 1 ## 5: 230101 1998 76593112 1695382272 1 ## 6: 230101 1999 47024669 1219192420 1 ``` --- # Basic characteristics of tariff_dataset object (cont'd) -- - Summary statistics ``` r setwd("H:/My Drive/ECON413") tariff_dataset <- setDT(fread("Data/import_isic2.csv")) summary(tariff_dataset) ``` ``` ## isic2 year value weight ## Min. :230101 Min. :1993 Min. :6.000e+00 Min. :0.000e+00 ## 1st Qu.:312103 1st Qu.:1996 1st Qu.:2.249e+06 1st Qu.:1.165e+06 ## Median :351124 Median :1998 Median :1.580e+07 Median :4.981e+06 ## Mean :342106 Mean :1997 Mean :8.091e+07 Mean :5.318e+07 ## 3rd Qu.:381105 3rd Qu.:2000 3rd Qu.:8.242e+07 3rd Qu.:2.375e+07 ## Max. :390901 Max. :2001 Max. :3.769e+09 Max. :2.376e+09 ## sTariff ## Min. : 0.000 ## 1st Qu.: 2.289 ## Median : 4.900 ## Mean : 8.934 ## 3rd Qu.: 8.836 ## Max. :150.000 ``` --- # Basic characteristics of tariff_dataset object (cont'd) -- - Summary statistics for a given year ``` r setwd("H:/My Drive/ECON413") tariff_dataset <- setDT(fread("Data/import_isic2.csv")) summary(tariff_dataset[year == 1995]) ``` ``` ## isic2 year value weight ## Min. :230101 Min. :1995 Min. :1.880e+02 Min. :0.000e+00 ## 1st Qu.:312103 1st Qu.:1995 1st Qu.:2.141e+06 1st Qu.:8.957e+05 ## Median :351124 Median :1995 Median :1.331e+07 Median :3.247e+06 ## Mean :342003 Mean :1995 Mean :6.707e+07 Mean :4.501e+07 ## 3rd Qu.:381105 3rd Qu.:1995 3rd Qu.:6.834e+07 3rd Qu.:1.760e+07 ## Max. :390901 Max. :1995 Max. :1.346e+09 Max. :1.505e+09 ## sTariff ## Min. : 0.000 ## 1st Qu.: 4.422 ## Median : 7.554 ## Mean : 9.716 ## 3rd Qu.:10.999 ## Max. :89.846 ``` --- # Basic manipulations -- - Let's calculate the median value of simple tariffs for each year: ``` r setwd("H:/My Drive/ECON413") tariff_dataset <- setDT(fread("Data/import_isic2.csv")) tariff_dataset[, median(sTariff), by = c("year")] ``` ``` ## year V1 ## <int> <num> ## 1: 1993 8.000000 ## 2: 1995 7.554153 ## 3: 1996 4.437297 ## 4: 1997 4.419483 ## 5: 1998 3.960000 ## 6: 1999 3.664405 ## 7: 2000 3.515248 ## 8: 2001 3.144226 ``` --- # Basic manipulations (cont'd) -- - Generate this object as data.table ``` r setwd("H:/My Drive/ECON413") tariff_dataset <- setDT(fread("Data/import_isic2.csv")) yearly_tariff <- tariff_dataset[, median(sTariff, na.rm = TRUE), by = c("year")] yearly_tariff ``` ``` ## year V1 ## <int> <num> ## 1: 1993 8.000000 ## 2: 1995 7.554153 ## 3: 1996 4.437297 ## 4: 1997 4.419483 ## 5: 1998 3.960000 ## 6: 1999 3.664405 ## 7: 2000 3.515248 ## 8: 2001 3.144226 ``` --- # Import larger data -- - BACI 2022 trade network data by CEPII: -- - t: year, i: origin (UN country code), j: destination (UN country code), k: six-digit HS product, v: flow value (thousand USD), q: weight (tonnage) ``` r setwd("H:/My Drive/ECON413") trade_dataset_2022 = setDT(fread("Data/BACI_HS02_Y2022_V202401b.csv")) trade_dataset_2022[, .N] # number of observations ``` ``` ## [1] 11113911 ``` ``` r summary(trade_dataset_2022) ``` ``` ## t i j k ## Min. :2022 Min. : 4.0 Min. : 4.0 Min. : 10110 ## 1st Qu.:2022 1st Qu.:246.0 1st Qu.:226.0 1st Qu.:392099 ## Median :2022 Median :442.0 Median :440.0 Median :640399 ## Mean :2022 Mean :460.6 Mean :443.1 Mean :609260 ## 3rd Qu.:2022 3rd Qu.:704.0 3rd Qu.:686.0 3rd Qu.:846791 ## Max. :2022 Max. :894.0 Max. :894.0 Max. :970600 ## ## v q ## Min. : 0 Min. : 0 ## 1st Qu.: 1 1st Qu.: 0 ## Median : 10 Median : 1 ## Mean : 2119 Mean : 1314 ## 3rd Qu.: 105 3rd Qu.: 10 ## Max. :117397713 Max. :743788774 ## NA's :360796 ``` --- # Import larger data (cont'd) -- - Turkey's total export in 2022? ``` r setwd("H:/My Drive/ECON413") trade_dataset_2022 = setDT(fread("Data/BACI_HS02_Y2022_V202401b.csv")) trade_dataset_2022[i == 792, sum(v, na.rm = TRUE)] ``` ``` ## [1] 262640430 ``` -- - Turkey's total export to Germany in 2022? ``` r trade_dataset_2022[i == 792 & j == 276, sum(v, na.rm = TRUE)] ``` ``` ## [1] 21828137 ``` --- # Import larger data (cont'd) -- - Create a dataset showing Turkey's total export by each destination country in descending order ``` r partners <- trade_dataset_2022[i == 792, .(total_export = sum(v, na.rm = TRUE)), by = c("j")][order(-total_export)] partners ``` ``` ## j total_export ## <int> <num> ## 1: 276 21828137.114 ## 2: 842 17135635.726 ## 3: 826 13753341.300 ## 4: 368 13742031.076 ## 5: 380 12567485.738 ## --- ## 215: 304 16.555 ## 216: 534 0.839 ## 217: 500 0.354 ## 218: 162 0.020 ## 219: 520 0.007 ``` --- # Binding two dataset -- - 2021 data? Let's call it and merge with 2022. ``` r setwd("H:/My Drive/ECON413") trade_dataset_2021 <- setDT(fread("Data/BACI_HS02_Y2021_V202401b.csv")) trade_dataset <- rbind(trade_dataset_2022, # 11,113,911 obs. trade_dataset_2021) # 11,358,186 obs. trade_dataset[, .N] ``` ``` ## [1] 22472097 ``` -- - Turkey's total robot export by year ``` r trade_dataset[j == 792 & k == 847950, sum(v, na.rm = TRUE), by = c("t")] ``` ``` ## t V1 ## <int> <num> ## 1: 2022 140851.3 ## 2: 2021 101729.9 ```