class: center, middle, inverse, title-slide

.title[
# Introduction to Data Science
]
.subtitle[
## Week 1: Getting Started
]
.author[
### Ugur Aytun
]
.institute[
### METU, Department of Economics | ECON 413
]

---

# What this course is about

--

- This course mainly aims to give you skills that complement a standard econometrics course.

--

- These include data cleaning, wrangling, visualization, and analysis (regression, machine learning, etc.).

---

# What is data?

--

- Data is a collection of facts, such as numbers, words, measurements, observations or just descriptions of things.

--

- Data can be qualitative (text, maps, photographs, social media such as X, Instagram, Spotify) or quantitative (economic variables such as prices, wages, trade, output, consumption).

--

- Qualitative data can be analyzed in multiple ways. One common method is data coding, which refers to the process of transforming the raw collected data into a set of meaningful categories that describe essential concepts of the data.

--

- Quantitative data can be measured in different ways, for example through e-commerce data (IKEA, hepsiburada). This challenges traditional metrics such as the CPI and GDP because they cannot capture economic activity adequately.

---

.h-300px[  ]

.h-300px[  ]

---

# Flight data

.h-400px[  ]

---

# Trade network

.h-400px[  ]

---

# Trade network

.h-400px[  ]

---

# Satellite data

.h-400px[  ]

---

# Why has data science become relevant in recent years?

--

- The amount of data generated by humans is increasing exponentially. This is due to the digitalization of the economy and society.

--

- Data can be used to predict future events, to understand past events, and to make better decisions.

---

# What is data science?

--

- Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

--

.h-400px[  ]

---

# What is R?

--

- R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing.

--

- It is not just a statistical package but a programming language that can be used for data manipulation, visualization, analysis, and reporting.

--

- It is open source and free to use. This has led to a large community of users and developers. You can find and use the most up-to-date statistical methods and packages.

--

- You can also join discussion groups and ask questions about your problems, contributing to the vibrant community.

---

# Why R?

--

- R is a powerful tool for data analysis and visualization. It is widely used in academia, industry, and government.

--

- R talks to many data sources: Excel, SPSS, SAS, Stata, SQL, Google Sheets, and many more.

---

# What is RStudio?

--

- RStudio is an editor that makes it easier to write R code. It is a powerful tool that helps you to write, run, and debug R code.

--

- You can think of RStudio as similar to other editors, e.g. Overleaf, Sublime Text, Atom, or Jupyter Notebook. Most R users prefer RStudio because it is user-friendly and has many features that make your life easier.

---

# R Screen

.h-400px[  ]

---

# RStudio screen

.h-400px[  ]

---

# RStudio screen

--

- The left panel is the script editor where you write your code.

--

- The right panel is the console where you run your code.

--

- The bottom panel is the environment (one of the most advantageous parts of R) where you can see the objects you created and the history of your commands.

--

- The top panel is the menu where you can find the tools and options.

--

- You can also see the packages you installed and the help files of the functions you use.

--

- You can also see the plots you created.

---

# Simple exercise - revisiting Kaldor Law #1

--

- Kaldor states that the manufacturing industry is the engine of economic growth.

Δ log(y<sub>it</sub>) = β<sub>0</sub> + β<sub>1</sub> Δ log(x<sub>it</sub>) + u<sub>it</sub>

--

- To test this hypothesis in the statistical packages you have used before, we first need to download the data from the internet (WDI, FRED, Eurostat, etc.),

--

- import it into the software,

--

- prepare the data file and make the appropriate transformations,

--

- and then run the regression.

---

# Simple exercise - revisiting Kaldor Law #1

--

- Seven lines of R code can do all of these steps.

``` r
library(WDI) # Call World Development Indicators package
```

```
## Warning: package 'WDI' was built under R version 4.3.3
```

``` r
library(data.table) # Required package to define the dataset as
```

```
## Warning: package 'data.table' was built under R version 4.3.3
```

``` r
# a data.table (we will learn this later)
series <- WDI_data$series # List of the series
countries <- WDI_data$country # List of the countries
# Select the indicators, countries and time range
dat2 = WDI(indicator = c("NY.GDP.MKTP.KD.ZG", # GDP growth
                         "NV.IND.MANF.KD.ZG"), # manufacturing growth
           country = c("all"), # All countries
           start = 1960, end = 2025) # Time range
reg1 <- lm(NY.GDP.MKTP.KD.ZG ~ NV.IND.MANF.KD.ZG, data = dat2) # Run the regression
summary(reg1) # Summarize the results
```

```
## 
## Call:
## lm(formula = NY.GDP.MKTP.KD.ZG ~ NV.IND.MANF.KD.ZG, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.787  -1.865   0.174   2.211  79.084 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.126700   0.055690   56.14   <2e-16 ***
## NV.IND.MANF.KD.ZG 0.109966   0.003525   31.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.929 on 8295 degrees of freedom
##   (8727 observations deleted due to missingness)
## Multiple R-squared:  0.105,  Adjusted R-squared:  0.1049 
## F-statistic:   973 on 1 and 8295 DF,  p-value: < 2.2e-16
```

---

# Simple exercise - revisiting Kaldor Law #1

.h-400px[  ]

---

# Graph

``` r
library(WDI)
library(data.table)
library(ggplot2)
```

```
## Warning: package 'ggplot2' was built under R version 4.3.3
```

``` r
dat = setDT(WDI(indicator = "NY.GDP.PCAP.KD",
                country = c("TR", "KR", "US"),
                start = 1960, end = 2025))
ggplot(dat, aes(year, NY.GDP.PCAP.KD, color = country) ) +
  geom_line() +
  xlab("Year") +
  ylab("GDP per capita")
```

<!-- -->
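
---

# Sketch: log differences and growth rates

The Kaldor equation is written in log differences, while the WDI series used in the regression are annual percentage growth rates. The sketch below shows, on simulated data only, how the two relate and how the equation could be estimated from level series. All variable names, the seed, and the coefficients (`beta0 = 0.02`, `beta1 = 0.5`) are assumptions made for illustration; none of them come from the WDI data.

``` r
# Simulated data only: every name and number here is hypothetical.
set.seed(413)

n_years <- 40
# Assumed yearly log growth of manufacturing output x
g_x <- rnorm(n_years, mean = 0.04, sd = 0.03)
# GDP log growth built from assumed beta0 = 0.02, beta1 = 0.5 plus noise
g_y <- 0.02 + 0.5 * g_x + rnorm(n_years, mean = 0, sd = 0.01)

# Level series, both starting from 100 in year 0
x <- 100 * exp(cumsum(c(0, g_x)))
y <- 100 * exp(cumsum(c(0, g_y)))

# Log differences recover the log growth rates from the levels
dlog_x <- diff(log(x))
dlog_y <- diff(log(y))

# Percentage growth rate of x (the "annual %" form of the WDI growth series)
pct_x <- 100 * (x[-1] / x[-length(x)] - 1)

# Estimate the Kaldor equation on the log-differenced series
fit <- lm(dlog_y ~ dlog_x)
coef(fit)
```

For small changes, `dlog_x` and `pct_x / 100` are close to each other, so regressing one growth series on the other approximates the log-difference specification.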