Linear Probabilistic Model and Linear Regression

Min Jean Cho

We frequently encounter paired experimental results, or datasets, that contain input values (cause) and output values (effect), such as the following:

        Input (x)    Output (y)
            0           51.2
           50           62.1
          100           70.3
          150           72.5
          200           69.9
          250           87.3

The input can be, for example, study time or the amount of a particular compound applied to facilitate plant growth; the output can then be test scores or the heights of the plants. Since we (the experimenters) set the values of the input variable, while the values of the output variable depend on the input values, the input variable is called the independent variable and the output variable is called the dependent variable.

Typically, two types of mathematical functions can be used to represent such a cause-and-effect relationship. For example, suppose that the true relationship between the dependent variable \(y\) and the independent variable \(x\) is modeled as follows:

\[y = \beta_0 + \beta_1 x\]

where \(\beta_0\) and \(\beta_1\) are unknown parameters. This model is called a deterministic model because it always returns the same output value for a fixed input value. In other words, a deterministic model does not allow any error in \(y\). Thus, if the true values of the parameters are used, \(y = \beta_0 + \beta_1 x\) is the true output value for \(x\). A probabilistic model, on the other hand, returns the output in the form of a random variable \(Y\) because it allows error in the output:

\[Y = \beta_0 + \beta_1 x + \epsilon\]

The error, denoted by \(\epsilon\), is a random variable, which is why the output is a random variable. \(\epsilon\) follows some probability distribution whose expected value is zero, \(E(\epsilon)=0\) (intuitively, over infinitely many observations the errors average out and the probabilistic model reduces to the deterministic one). Thus, the expected value of \(Y\) is as follows:

\[\begin{aligned} E(Y) &= E(\beta_0 + \beta_1 x + \epsilon) \\ &= \beta_0 + \beta_1 x + E(\epsilon) \\ &= \beta_0 + \beta_1 x \end{aligned}\]

It is important to note that the independent variable \(x\) is not a random variable because we set it to fixed values. Also, the parameters \(\beta_0\) and \(\beta_1\) are unknown but fixed values (from the frequentist perspective). Now, suppose that we want to predict the output value using the probabilistic model.

If the true values of the parameters are \(\beta_0 = 54.16\) and \(\beta_1 = 0.12\), then the prediction for \(x=225\) is \(E(Y) = 54.16 + 0.12 \times 225 = 81.16\). Note that the prediction takes the form of the expected value when the true parameter values are used (in linear regression we will use estimates of the parameter values instead). Thus, we can think of every possible value of \(x\) as having its own population distribution of the random variable \(Y\); these populations have the same variance but different means (expected values) depending on the value of \(x\).
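To make this concrete, below is a minimal simulation sketch of the probabilistic model at \(x = 225\). The parameter values are the ones from the example above; the choice of a normal error distribution with standard deviation 5 is purely illustrative and is not specified in the text.

        import numpy as np

        rng = np.random.default_rng(0)

        beta_0, beta_1 = 54.16, 0.12   # true parameter values from the example
        x = 225
        sigma = 5.0                    # assumed error spread (illustrative only)

        # Draw many realizations of Y = beta_0 + beta_1*x + epsilon at the same x.
        epsilon = rng.normal(loc=0.0, scale=sigma, size=100_000)
        Y = beta_0 + beta_1 * x + epsilon

        print(Y.mean())  # close to E(Y) = 81.16
        print(Y.std())   # close to sigma, regardless of x

Repeating this at a different value of \(x\) shifts the mean of \(Y\) but not its spread, which is exactly the picture of "same variance, different means" described above.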

Linear Probabilistic Model

Look at the probabilistic model \(E(Y) = \beta_0 + \beta_1 x\) carefully. We can say either that \(E(Y)\) is a linear function of \(x\) (because the intercept \(\beta_0\) and coefficient \(\beta_1\) are constants) or that \(E(Y)\) is a linear function (or linear combination) of \(\beta_0\) and \(\beta_1\) (when \(x\) is viewed as the coefficient of \(\beta_1\)). Thus, if \(E(Y) = \beta_0 + \beta_1 \log x\), \(E(Y)\) is a linear function of \(\beta_0\) and \(\beta_1\) but not a linear function of \(x\). When we speak of a linear probabilistic model (or simply, a linear model), we mean that \(E(Y)\) is a linear function of the unknown parameters \(\beta_0\) and \(\beta_1\); \(E(Y)\) does not need to be a linear function of \(x\). Thus, \(E(Y) = \beta_0 + \beta_1 \log x\) is a linear model.

If the linear model takes only one independent variable, the model is called a simple linear model. If the linear model takes more than one independent variable, the model is called a multiple linear model.

\[E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k\]

Remember that the model is linear if \(E(Y)\) is a linear function of the unknown parameters \(\beta_k\). Thus, \(E(Y) = \beta_0 + \beta_1 x + \beta_2 \log x\) is a linear model with \(x_1=x\) and \(x_2 = \log x\), although \(E(Y)\) is not a straight line with respect to \(x\). Similarly, \(E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2\), a second-order polynomial in \(x\), is also a linear model with \(x_1=x\) and \(x_2=x^2\). Thus, polynomial regression is a linear regression even though it fits a non-linear function of \(x\) to the data. A non-linear model is one that is a non-linear combination of the unknown parameters (e.g., \(E(Y) = \frac{\beta_1 x}{\beta_2 + x}\)).
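As a small illustration (a sketch, not from the original text), the polynomial model above can be treated exactly like a model with two independent variables by building a design matrix whose columns are the transformed inputs \(x\) and \(x^2\); the input values below are arbitrary:

        import numpy as np

        x = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary illustrative inputs

        # Treat x1 = x and x2 = x^2 as two separate "independent variables".
        # The model E(Y) = beta_0 + beta_1*x1 + beta_2*x2 is linear in the betas.
        X = np.column_stack((x, x**2))
        print(X)

The least-squares machinery developed below can then be applied to this matrix unchanged.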

Simple Linear Regression

The term 'regression' was originally used to describe the biological observation that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon called regression toward the mean). The analysis of this biological phenomenon was later developed into a general statistical method.

Suppose that we want to fit the simple linear model, \(E(Y) = \beta_0 + \beta_1 x\), to a dataset consisting of \(n\) data points (the values of the independent variable \(\{x_i \}_{i=1}^n\) and the corresponding values of the dependent variable \(\{y_i \}_{i=1}^n\)). The objective of simple linear regression (SLR) is to estimate \(E(Y)\) using estimates of the parameters.

\[\hat{y} = \widehat{E(Y)} = \hat{\beta_0} + \hat{\beta_1}x\]

Because \(\hat{y}\) is an estimate computed from the data, it may deviate from the true value \(E(Y)\); that is, there may be an estimation error.

In order to make the regression line (\(\hat{y}\)) close to the true line (\(E(Y)\)), we need estimators of the parameters with good properties, such as unbiasedness (\(E(\hat{\beta}) = \beta\)). Unbiased estimators of the parameters \(\beta_0\) and \(\beta_1\) can be found by minimizing the sum of squared errors (SSE).

\[\begin{aligned} \text{SSE} &= \sum_{i=1}^n (y_i - \hat{y_i})^2 \\ &= \sum_{i=1}^n [y_i - (\hat{\beta_0} + \hat{\beta_1}x_i)]^2 \end{aligned}\]

Finding the estimators of the parameters by minimizing SSE is called the least-squares method, and it is intuitive in that minimizing SSE fits a straight line as closely as possible through the set of data points.
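As a small sketch (not part of the original text), the SSE of any candidate pair of parameter values can be computed directly; the candidate values below are arbitrary, and the data are the ones used throughout this article:

        import numpy as np

        def sse(b0, b1, x, y):
            # Sum of squared errors for the candidate line y_hat = b0 + b1*x.
            y_hat = b0 + b1 * x
            return np.sum((y - y_hat) ** 2)

        x = np.array([0, 50, 100, 150, 200, 250], dtype=float)
        y = np.array([51.2, 62.1, 70.3, 72.5, 69.9, 87.3])

        # Two arbitrary candidate lines; the least-squares method finds the pair
        # (b0, b1) that makes this quantity as small as possible.
        print(sse(50.0, 0.10, x, y))
        print(sse(54.0, 0.12, x, y))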

If SSE has a minimum, it occurs where the partial derivatives of SSE with respect to the parameter estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are zero (\(\frac{\partial \text{SSE}}{\partial \hat{\beta_0}} = 0\) and \(\frac{\partial \text{SSE}}{\partial \hat{\beta_1}} = 0\)). Let's find the partial derivatives.

\[\frac{\partial \text{SSE}}{\partial \hat{\beta_0}} = \frac{ \partial \left\{ \sum_{i=1}^n \left[y_i - (\hat{\beta_0} + \hat{\beta_1} x_i)\right]^2 \right\}}{\partial \hat{\beta_0}} = -2\left( \sum_{i=1}^n y_i - n\hat{\beta_0} - \hat{\beta_1}\sum_{i=1}^n x_i \right)\]

\[\frac{\partial \text{SSE}}{\partial \hat{\beta_1}} = \frac{ \partial \left\{ \sum_{i=1}^n \left[y_i - (\hat{\beta_0} + \hat{\beta_1} x_i)\right]^2 \right\}}{\partial \hat{\beta_1}} = -2\left( \sum_{i=1}^n x_iy_i - \hat{\beta_0}\sum_{i=1}^n x_i - \hat{\beta_1}\sum_{i=1}^n x_i^2 \right)\]

Then, by equating the partial derivatives to zero, we obtain two equations called the least-squares equations.

\[n\hat{\beta_0} + \hat{\beta_1} \sum_{i=1}^n x_i = \sum_{i=1}^n y_i\]

\[\hat{\beta_0}\sum_{i=1}^n x_i + \hat{\beta_1} \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_iy_i\]

Solving the two least-squares equations simultaneously results in the estimators of parameters \(\beta_0\) and \(\beta_1\).

\[\hat{\beta_1} = \frac{S_{xy}}{S_{xx}}, \text{ where } S_{xy}=\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \text{ and } S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2\]

\[\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\]
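As a quick numerical sketch using the dataset from the beginning of this article, the closed-form estimators can be computed directly from the summary statistics \(S_{xy}\) and \(S_{xx}\); the result should agree with the matrix-based computation further below:

        import numpy as np

        x = np.array([0, 50, 100, 150, 200, 250], dtype=float)
        y = np.array([51.2, 62.1, 70.3, 72.5, 69.9, 87.3])

        x_bar, y_bar = x.mean(), y.mean()
        S_xy = np.sum((x - x_bar) * (y - y_bar))
        S_xx = np.sum((x - x_bar) ** 2)

        beta_1_hat = S_xy / S_xx                 # slope estimate
        beta_0_hat = y_bar - beta_1_hat * x_bar  # intercept estimate
        print(beta_0_hat, beta_1_hat)            # roughly 54.16 and 0.118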

The estimators can be obtained more easily using vectors and matrices. In matrix form, the linear regression model is as follows:

\[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\]

\[\begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ \vdots \\ \hat{y_n} \end{bmatrix} % = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} % \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix}\]

\[\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{b}}\]

Also, by noting that the elements of

\[\mathbf{X}^T \mathbf{X} = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \\ \end{bmatrix} % \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} % = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}\]

are the coefficients of the two least-squares equations and that the elements of

\[\mathbf{X}^T \mathbf{y} = \begin{bmatrix} 1 & 1 & \ldots & 1 \\ x_1 & x_2 & \ldots & x_n \\ \end{bmatrix} % \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} % = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}\]

are the right hand side of the two least-squares equations, the least-squares equations can be written as follows:

\[\begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix} % \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} % = \begin{bmatrix} \left( n\hat{\beta_0} + \hat{\beta_1}\sum_{i=1}^n x_i \right) \\ \left( \hat{\beta_0}\sum_{i=1}^n x_i + \hat{\beta_1} \sum_{i=1}^n x_i^2 \right) \end{bmatrix} % = \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}\]

\[(\mathbf{X}^T \mathbf{X})\hat{\mathbf{b}} = \mathbf{X}^T \mathbf{y}\]

Thus,

\[\hat{\mathbf{b}} = \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \end{bmatrix} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}\]

Let's estimate the parameters of the simple linear regression using the small dataset presented at the beginning of this article.

          
        import numpy as np

        def linear_regression(X, y):
            # Prepend a column of ones so that the intercept beta_0 is estimated.
            n = X.shape[0]
            X = np.hstack((np.ones((n, 1)), X))
            # Least-squares estimate: b = (X^T X)^(-1) X^T y
            b = np.linalg.inv(X.T @ X) @ X.T @ y
            return b

        X = np.array([[  0],
                      [ 50],
                      [100],
                      [150],
                      [200],
                      [250]])

        y = np.array([[51.2],
                      [62.1],
                      [70.3],
                      [72.5],
                      [69.9],
                      [87.3]])

        b = linear_regression(X, y)
        print(b)

The result is as follows:

\[\hat{y} = 54.161 + 0.1178x\]

In regression analysis, we frequently use the statistic \(r^2\) to evaluate the performance of the regression model. \(r^2\) is called the coefficient of determination, and it is the square of the sample correlation coefficient (\(r\)), which is a standardized covariance.

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{ \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \hat{\beta_1}\sqrt{\frac{S_{xx}}{S_{yy}}}\]

Thus,

\[\begin{aligned} r^2 &= \left( \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} \right)^2 = \frac{S_{xy}}{S_{xx}} \cdot \frac{S_{xy}}{S_{yy}} = \frac{\hat{\beta_1} S_{xy}}{S_{yy}} = \frac{S_{yy}-\text{SSE}}{S_{yy}} \\ &= 1 - \frac{\text{SSE}}{S_{yy}} \end{aligned}\]

Note that the correlation coefficient does not make a distinction between the dependent and independent variables because it does not assume a cause-and-effect relationship between them.
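As a brief sketch (not from the original text), \(r^2\) for the fitted line above can be computed via \(1 - \text{SSE}/S_{yy}\):

        import numpy as np

        x = np.array([0, 50, 100, 150, 200, 250], dtype=float)
        y = np.array([51.2, 62.1, 70.3, 72.5, 69.9, 87.3])

        # Fitted values from the estimated line y_hat = 54.161 + 0.1178*x.
        y_hat = 54.161 + 0.1178 * x

        sse = np.sum((y - y_hat) ** 2)      # sum of squared errors
        s_yy = np.sum((y - y.mean()) ** 2)  # total variation of y about its mean

        r_squared = 1 - sse / s_yy
        print(r_squared)  # roughly 0.85 for this dataset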

Multiple Linear Regression

If there are multiple (\(K \ge 2\)) independent variables, a multiple linear regression model can be used.

\[\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_{i,1} + \hat{\beta_2}x_{i,2} + \ldots + \hat{\beta_K}x_{i,K}\]

\[\begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ \vdots \\ \hat{y_n} \end{bmatrix} % = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1K} \\ 1 & x_{21} & x_{22} & \ldots & x_{2K} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nK} \end{bmatrix} % \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \\ \hat{\beta_2} \\ \vdots \\ \hat{\beta_K} \end{bmatrix}\]

\[\hat{\mathbf{y}} = \mathbf{X}\hat{\mathbf{b}}\]

Then, we can employ the same least-squares method used for simple linear regression.

\[\hat{\mathbf{b}} = \begin{bmatrix} \hat{\beta_0} \\ \hat{\beta_1} \\ \vdots \\ \hat{\beta_K} \end{bmatrix} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]

Let's estimate the parameters of the multiple linear regression for the following small dataset (two independent variables), using the same implementation as above:

          
        X = np.array([[  0, 250],
                      [ 50, 200],
                      [100, 120],
                      [150, 100],
                      [200,  80],
                      [250,  60]])

        y = np.array([[51.2],
                      [62.1],
                      [70.3],
                      [72.5],
                      [69.9],
                      [87.3]])

        b = linear_regression(X, y)
        print(b)

The result is as follows:

\[\hat{y} = 68.65 + 0.07x_1 - 0.06x_2\]
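As an optional cross-check (a sketch, not part of the original text), the same coefficients can be obtained with NumPy's built-in least-squares solver, which avoids explicitly inverting \(\mathbf{X}^T\mathbf{X}\):

        import numpy as np

        X = np.array([[  0, 250],
                      [ 50, 200],
                      [100, 120],
                      [150, 100],
                      [200,  80],
                      [250,  60]])
        y = np.array([[51.2],
                      [62.1],
                      [70.3],
                      [72.5],
                      [69.9],
                      [87.3]])

        n = X.shape[0]
        X_design = np.hstack((np.ones((n, 1)), X))  # add the intercept column

        # np.linalg.lstsq solves the least-squares problem without forming the inverse.
        b, residuals, rank, sv = np.linalg.lstsq(X_design, y, rcond=None)
        print(b)  # should match the result above (about 68.65, 0.07, -0.06)

In practice, solving the normal equations with a dedicated solver such as this (or np.linalg.solve) is numerically more stable than computing the matrix inverse explicitly.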