This exercise was developed by Thierry Mayer for the International Trade and Finance Course. The dataset needed for this exercise is available in Stata format at this dropbox link. Download the file, and read it into R with the function read_stata from the haven package.
What variables are included in the data?
##  [1] "year"        "iso_o"       "iso_d"       "contig"      "comlang_off"
##  [6] "distw"       "pop_o"       "pop_d"       "gdp_o"       "gdp_d"      
## [11] "comcur"      "fta_wto"     "flow"how many observations do we have in total?
## [1] 1106870      13How many unique countries do we have in the columns iso_o and iso_d (origin/destination)?
## [1] 208## [1] 208iso_o do we have by year? How often does each country appear as iso_d within a year? Make a table that counts how often each country appears as iso_d per year!
## # A tibble: 1,106,870 x 3
## # Groups:   year [69]
##     year iso_d     n
##    <dbl> <chr> <int>
##  1  1984 ABW       2
##  2  1984 ABW       2
##  3  1985 ABW       1
##  4  1986 ABW       1
##  5  1987 ABW       1
##  6  1988 ABW       5
##  7  1988 ABW       5
##  8  1988 ABW       5
##  9  1988 ABW       5
## 10  1988 ABW       5
## # ... with 1,106,860 more rowsDo all countries trade with each other? How many country pairs would we observe if each country traded with each other possible country? Produce a graph that illustrates cross country trade. You could think of a square matrix \(M\) with as many row and columns as there are unique countries. rows index origin and cols index destination countries. You could fill the the matrix like this, where \(i,j\) index origin and destination country:
\[
 M(i,j) = \begin{cases} 1 & \text{if flow}_{ij}>0 \\
           0 & \text{else.}
           \end{cases}
 \] Your graph should visualize this matrix somehow. Make the graph for two years, 1948 and 2016, and compute the share of trading countries in each of them. 
Compute a new variable called gravity, defined as
\[ \text{gravity}_{odt} = \frac{GDP_{ot} \cdot GDP_{dt}}{DGP_{wt}\cdot distance_{od}} \]
where indices \(o,d,t\) stand for origin, destination and year. The index \(w\) means world, i.e. here we talk about the sum of all destination countries. You need to be careful here because some countries don’t have any data in certain years (as we know from above), so there will be missing values. When you prepare this computation, apply the following cleaning protocol to your data:
dplyr, I would compute world gdp by year first, and then merge it back onto the main dataset.gdp_o and gdp_d in world gdp and drop observations smaller than the first percentile of either shareflow into flow/1000 i.e. trade flows in thousand dollars.Run a regression of the log of trade flows on the log of gravity, using only data for the year 1995. Interpret the coefficient obtained. In a scatterplot, represent the relationship between the log of trade flows on the log of the gravity prediction, together with the regression line, which is very close to a 45 degree line for the 1995 data. How should we interpret the distance of each point to this 45 degree line?
## 
## Call:
## lm(formula = log(flow1000) ~ log(gravity), data = d95)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.836  -1.192   0.162   1.414   8.501 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.847702   0.052725  -16.08   <2e-16 ***
## log(gravity)  1.036308   0.005989  173.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.295 on 17801 degrees of freedom
## Multiple R-squared:  0.6272, Adjusted R-squared:  0.6271 
## F-statistic: 2.994e+04 on 1 and 17801 DF,  p-value: < 2.2e-16How do the slope coefficient estimates vary by year? You could run the above regression for each year, collect the slopes, and plot them against year.
Do the same scatterplot, but highlighting in a different color the pairs of countries engaged in a Free Trade Agreement (fta_wto = 1 for those in the database). Is it clear what is the effect of agreements graphically? I used function dplyr::sample_frac to randomly select 10% of rows from the 1995 data in order to avoid overplotting.
Run the following regressions using the 1995 data as above.
A classical gravity equation with only GDPs and distance (in logs) explaining the log of trade flows. That is, instead of the computed gravity variable from above, we include the following variables individually: \[\begin{align}
\log(gravity)_{odt} &= \log\left( \frac{GDP_{ot} \cdot GDP_{dt}}{ distance_{od}}\right) \\
           &= \log(GDP_{ot}) + \log(GDP_{dt}) - \log(GDP_{dt}) - \log(distance_{od})
\end{align}\] and so you are supposed to investigate \[
\log \left( \frac{flow_{odt}}{1000} \right) = \log(GDP_{ot}) + \log(GDP_{dt} - \log(distance_{od})
\]
Introduce the fta_wto dummy variable in that regression. What is the impact of becoming a wto member on expected trade flows? To answer that last question, remember that for a zero-one dummy \(d\) , \[\begin{align}
\ln y &= a + b d \\
y =&= \exp(a +b d) \\
E[y|d=0] =& \exp(a)\\
E[y|d=1] =& \exp(a + b )\\
\Delta E[y|d] =& \exp(a + b ) - \exp(a)\\
\%\Delta E[y|d] =& \frac{\exp(a + b ) - \exp(a)}{\exp(a)}\\
  =& e^{a + b - a} - 1 = \exp(  b ) - 1
\end{align}\]
Introduce common language and contiguity. Again compute the impact of having a common official language and of being contiguous contries.