Extending Least Absolute Deviations Regression

This lab builds a generalization of the least absolute deviations (LAD) regression framework. We will again work with the “engel” dataset from the previous lab. Assume that the following linear relationship between food expenditures and income holds: \[foodexp_i=\beta_0+\beta_1income_i +\epsilon_i.\]

Recall that, for \(Y\) a continuous random variable with distribution \(F\) (for example, \(F\)=“the normal distribution”), the \(\tau\) quantile of \(Y\) is given by \(Q_{\tau}=\inf\{y\in \mathbb{R}:\tau\leq F(y)\}\).

Intuitively, the \(\tau\) quantile is the smallest value \(y\) for which the probability mass at or below \(y\) is at least \(\tau\); for a continuous distribution, that mass is exactly \(\tau\).
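As a small illustration of this definition, the sketch below computes a quantile in three ways: from the normal CDF, directly from the infimum definition applied to an empirical CDF, and with quantile(). The value \(\tau=.85\) and the ten-point sample are made up for illustration and are not the lab's data or the quantiles requested below.

```r
# Illustrative value only: tau = 0.85 is an arbitrary choice.
tau <- 0.85

# Theoretical quantile of a standard normal: qnorm() inverts the CDF pnorm().
q_norm <- qnorm(tau)
p_back <- pnorm(q_norm)   # mass to the left of q_norm, which recovers tau

# Empirical version on a small made-up sample.
y <- c(2, 7, 1, 9, 4, 6, 3, 8, 5, 10)
F_hat <- ecdf(y)          # empirical CDF of the sample

# Direct translation of the definition: smallest y with tau <= F(y).
q_def <- min(y[F_hat(y) >= tau])

# quantile() with type = 1 uses the same inverse-ECDF definition;
# the default (type = 7) interpolates between order statistics instead.
q_type1 <- quantile(y, probs = tau, type = 1, names = FALSE)
q_type7 <- quantile(y, probs = tau, names = FALSE)

print(c(definition = q_def, type1 = q_type1, type7 = q_type7))
```

Note that the default quantile() output need not match the infimum definition exactly in small samples, because it interpolates between neighboring observations.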

  1. Using R, for \(Y\) a standard normal random variable, find the \(\tau=.015\) and \(\tau=.985\) quantiles.

  2. Now, find the \(\tau=.05\) and \(\tau=.95\) quantiles of the food expenditures data using the quantile() function.

  3. When the outcome variable in a regression model is symmetrically distributed, OLS can do a great job estimating the regression coefficients. Analyze the shape of the distribution of the food expenditures data. Is the distribution symmetric?

  4. Will the presence of a few very large food expenditures and many very small food expenditures affect the OLS coefficient estimates?

  5. How might the shape of this distribution affect the OLS estimates?

  6. To get an understanding of how asymmetry can affect OLS estimates, fit the above regression model to 1) the full data set and 2) only data corresponding to food expenditures less than the corresponding \(\tau=.25\) quantile.

  7. Compare the estimates you obtained in part 6. In particular, interpret the regression coefficients.

  8. Does your answer from part 7 support or dispel the notion of income inequality?
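The mechanics of checking for skewness and of refitting OLS on a quantile-based subset can be sketched on simulated data. Everything here is an assumption made for illustration: the variables x and y, the log-normal noise, and the seed are hypothetical and unrelated to the engel data, so the numbers say nothing about the lab's answers.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 400
x <- runif(n, 1, 10)               # hypothetical regressor
# Right-skewed outcome: multiplicative log-normal noise around a linear trend.
y <- (2 + 0.5 * x) * exp(rnorm(n, sd = 0.5))
d <- data.frame(x = x, y = y)

# A mean well above the median is a quick numeric sign of right skew;
# hist(d$y) shows the same thing graphically.
skew_gap <- mean(d$y) - median(d$y)

# Fit on all rows, then only on rows whose outcome lies below its 0.25 quantile.
fit_full <- lm(y ~ x, data = d)
low <- subset(d, y < quantile(y, 0.25))
fit_low <- lm(y ~ x, data = low)

# Side-by-side coefficient comparison.
coefs <- rbind(full = coef(fit_full), lower_quarter = coef(fit_low))
print(coefs)
```

The same pattern should carry over to the lab data by swapping in the engel data frame and its column names.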

Recall the differences between the OLS and LAD results from the previous tutorial. From the parts above, we see that the asymmetry of the outcome variable could partly explain the difference between the OLS and LAD regression results.

In addition, it is important to realize that food expenditures should differ depending on where in the income distribution a family sits. That is, in general, families with higher levels of wealth can afford to spend more on food than those with lower wealth. Therefore, it doesn’t necessarily make sense to say that, for all individuals in the population, the coefficients in the regression \[foodexp_i=\beta_0+\beta_1 income_i +\epsilon_i\] are constant. Instead, we need a model that allows \(\beta_0,\beta_1\) to differ between groups of individuals, such as those with high and low wealth.

An approach that can deal with this issue is quantile regression. The quantile regression model assumes that, for each quantile \(\tau\in(0,1)\), the regression parameters \(\beta_0\) and \(\beta_1\) can be different. Heuristically, this means that, conditional on a value of \(income_i\), the parameters \(\beta_0\) and \(\beta_1\) describing the behavior of \(foodexp_i\) are allowed to change with \(\tau\). We can then define the quantile regression model through the quantile function as \[Q_{\tau}(foodexp_i|income_i)=\beta_0(\tau)+\beta_1(\tau) income_i.\] The package quantreg allows us to fit this model at different quantiles using essentially the same syntax as the lm() function. Implementing this method by hand requires numerical optimization, for example with the optim function. However, we won’t bother with this and will instead let R do the work for us.
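For readers curious about what such a by-hand implementation might look like, here is a minimal sketch. It relies on the standard fact that quantile regression estimates minimize the summed check loss \(\rho_{\tau}(u)=u(\tau-1\{u<0\})\) of the residuals; at \(\tau=.5\) this is half the sum of absolute deviations, which is why LAD is the median special case. The simulated data, the seed, and the helper names check_loss, objective, and fit_quantile are hypothetical, and optim()'s default Nelder-Mead search is only one possible choice, so treat the output as approximate.

```r
set.seed(42)                        # arbitrary seed, for reproducibility only
n <- 500
x <- runif(n, 0, 10)                # hypothetical regressor
# Noise whose spread grows with x, so the true quantile lines fan out.
y <- 1 + 2 * x + (0.5 + 0.5 * x) * rnorm(n)

# Check loss: rho_tau(u) = u * (tau - 1{u < 0}).
check_loss <- function(u, tau) u * (tau - (u < 0))

# Sum of check losses for the line b[1] + b[2] * x.
objective <- function(b, tau) sum(check_loss(y - b[1] - b[2] * x, tau))

# Minimize with optim(), starting from the OLS coefficients.
fit_quantile <- function(tau) {
  start <- unname(coef(lm(y ~ x)))
  optim(start, objective, tau = tau)$par
}

taus <- c(0.25, 0.5, 0.75)
est <- t(sapply(taus, fit_quantile))
dimnames(est) <- list(paste0("tau=", taus), c("intercept", "slope"))
print(round(est, 3))

# If quantreg is installed, rq() solves the same problem with a dedicated solver.
if (requireNamespace("quantreg", quietly = TRUE)) {
  print(coef(quantreg::rq(y ~ x, tau = taus)))
}
```

Because the objective is piecewise linear, a general-purpose optimizer can stall slightly short of the exact minimum; that is one practical reason to prefer rq() for real work.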

  9. Using the function rq(), fit the quantile regression model for \(\tau=.5\). Do the coefficient estimates appear similar to those obtained by OLS? HINT: ?rq may be helpful.

  10. For quantiles \(\tau = .25\) and \(\tau=.75\), fit the quantile regression model using the function rq(). As we move from the lower to the upper quantile of food expenditures, does the slope coefficient in the model change? What about the intercept?

  11. Describe the behavior of the confidence intervals for the quantile regression model as \(\tau\) decreases or increases.

  12. Explain the reasoning behind your answer to part 11.