Description
In this lab you will carry out regression fits to data using the matrix approach outlined in Lecture 7. You will use the approach in two ways:
- Carry out a simple linear regression to fit (x,y) data and analyse the result.
- Transform the independent or the dependent variable to find a suitable regression model for the data.
The matrix regression equation is:
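$$Y = X\beta + \epsilon$$
where
$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$
To solve for $\beta$, drop the error term and multiply both sides by the transpose of $X$ (the transpose of a matrix is obtained by interchanging its rows and columns):
$$X^T Y = X^T X \beta$$
$$\beta = (X^T X)^{-1} X^T Y$$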
The steps required to create the X matrix are outlined in the lecture 7 notes.
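The matrix solution above can be sketched in NumPy as follows. This is only an illustrative sketch, not the lecture's code: the function name `fit_linear` and the demo arrays are made up, and `np.linalg.solve` is used on the normal equations instead of forming the inverse explicitly, which gives the same β for a well-conditioned problem.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit of y = b0 + b1*x via the normal equations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix X: a column of ones (intercept) next to the x values.
    X = np.column_stack([np.ones_like(x), x])
    # Solve (X^T X) beta = X^T Y for beta.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta

# Made-up demo data lying exactly on y = 1 + 2x (not the lab data).
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 1.0 + 2.0 * x_demo
b0, b1 = fit_linear(x_demo, y_demo)
print(b0, b1)
```

For the lab, x and y would be the columns read from the data file rather than the demo arrays.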
Problem 1. The Excel file IrisData_slr10.xls contains data for Iris Setosa in 3 columns: entry number, sepal width, sepal length. Using matrix regression as shown above, fit the model $y = \beta_0 + \beta_1 x$, where $x$ is the sepal width and $y$ is the sepal length. Plot the data with the regression line superimposed on it. Comment on the fit of the model.
To read an Excel file into Python, use the Pandas module.
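- import numpy as np
- import pandas as pd
- dataXY = pd.read_excel(excelFileName) # excelFileName is the path to the Excel file.
- dataXY.head() # will display the first few lines of dataXY.
- myDataArray = np.array(dataXY) # will create a 2-D numpy array that you can use.
- x = myDataArray[:,1] # for the Iris file, column 0 is the entry number, column 1 is x (sepal width) and column 2 is y (sepal length).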
Plot the residuals $e_i = y_i - \hat{y}_i$ versus $y$, where $y_i$ is the data value corresponding to $x_i$ and $\hat{y}_i = b_0 + b_1 \cdot x_i$ is the value given by the model at the point $x_i$.
The plot of $e_i$ versus $x$ and the plot of $e_i$ versus $y$ should not show any pattern. You only need to make one of the two plots.
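A minimal sketch of the residual computation is given here, assuming synthetic data (the seed, slope, intercept and noise level are all invented for illustration and are not the Iris values). It fits the line with the matrix method, forms the fitted values and residuals, and prints two quantities that should be numerically zero for any least-squares fit that includes an intercept.

```python
import numpy as np

# Synthetic stand-in data: a straight line plus random noise (assumed values).
rng = np.random.default_rng(0)
x = np.linspace(2.0, 4.5, 30)
y = 2.5 + 0.7 * x + rng.normal(scale=0.2, size=x.size)

# Matrix regression: design matrix with an intercept column, then the normal equations.
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values and residuals e_i = y_i - yhat_i.
y_hat = b0 + b1 * x
e = y - y_hat

# With an intercept in the model, the residuals sum to zero and are
# orthogonal to x (both up to floating-point rounding).
print(e.sum(), e @ x)
```

To look for a pattern, the residuals can then be scattered against x or against the fitted values, for example with matplotlib's `plt.scatter(x, e)` followed by `plt.axhline(0)`.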
Regression with Transformations to a Linear Model
Problem 2. The data file Boyle_P-V.dat contains the air pressure measurements (in inches of Hg) made in a variable volume cylinder by Robert Boyle in 1660. The aim of the exercise is to obtain the correct relation between Pressure and Volume, using linear regression with transformed variables.
- Make a scatter plot of Pressure v/s Volume. Does the data show a linear association?
- Using the matrix regression method (outlined above), fit the model $\hat{P} = b_0 + b_1 \cdot V$.
- Compute the residuals $e_i = P_i - \hat{P}_i$ and make a scatter plot of $e_i$ v/s $\hat{P}_i$. Is there a pattern?
When the y v/s x scatter plot does not show a linear association, a common approach is to transform the y variable according to a ladder of transformations: $-1/y;\ -1/\sqrt{y};\ \log_{10}(y);\ \sqrt{y};\ y;\ y^2 \ldots$ A better guide than the scatter plot is the residual plot, which amplifies any curvature in the original scatter plot. The transformations can be summarized as $y^p$, with $p = 0$ corresponding to $\log_{10}(y)$. (Note that $p$ does not need to be an integer.)
- If your scatter (and residual) plots are:
  - Convex-up (cup open upwards), move to a lower power.
  - Convex-down (cup open downwards), move to a higher power.

This can be summarized by the bulge plot, which suggests transformations for either variable:
[Figure: bulge plot]

In this problem, transform the dependent variable (y) to obtain a suitable fit. With a suitable fit, the residual plot shows no pattern. Show residual plots for each step in the transformation. Finally, identify (and eliminate) any points that may be causing the fit to deteriorate. Fit your final model without the point(s). Does your model conform to expectations?
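Stepping down the ladder for the dependent variable can be sketched as below. Everything here is an assumption for illustration: the helper names `ladder` and `fit_r2`, the noise-free synthetic data obeying P·V = constant, and the constant 1400. The negative powers carry a minus sign so that the transformed values keep the same ordering as y. R² is printed only as a compact numerical stand-in; the lab itself asks you to judge each step from the residual plot.

```python
import numpy as np

def ladder(y, p):
    """Ladder-of-powers transform: log10(y) if p == 0, -(y**p) if p < 0, else y**p."""
    y = np.asarray(y, dtype=float)
    if p == 0:
        return np.log10(y)
    if p < 0:
        return -(y ** p)  # minus sign preserves the ordering of the values
    return y ** p

def fit_r2(x, y):
    """Matrix regression of y on x; returns (beta, R^2)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta
    r2 = 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)
    return beta, r2

# Synthetic, noise-free stand-in for pressure-volume data (assumed constant 1400).
V = np.linspace(10.0, 50.0, 9)
P = 1400.0 / V

# Walk down the ladder from p = 1 to p = -1, refitting at each step.
r2_by_power = {}
for p in (1, 0.5, 0, -0.5, -1):
    beta, r2 = fit_r2(V, ladder(P, p))
    r2_by_power[p] = r2
    print(p, r2)
```

On this synthetic data the fit becomes exactly linear at p = −1, since −1/P is proportional to V; real measurements will include noise and possibly outlying points.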
Problem 3. In 1989, Soviet scientists released data for nuclear weapons tests. Western scientists had previously estimated the size of the explosions using seismic data.
(Aside: Seismic data is measured on the Richter scale, which is used for earthquakes. The Richter magnitude of an earthquake is determined from the logarithm of the amplitude of waves recorded by seismographs. Because of the logarithmic basis of the scale, each whole number increase in magnitude represents a tenfold increase in measured amplitude; in terms of energy, each whole number increase corresponds to an increase of about 31.6 times the amount of energy released, and each increase of 0.2 corresponds to approximately a doubling of the energy released (wiki: Richter_magnitude_scale). This may help explain the model you obtain.)
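The two energy figures in the aside can be checked with a short calculation. It assumes the commonly quoted scaling in which the released energy $E$ grows as $10^{1.5M}$ with magnitude $M$; that relation is not stated in this handout and is brought in here only to show where the numbers come from.

```latex
\[
\frac{E(M+1)}{E(M)} = 10^{1.5} \approx 31.6,
\qquad
\frac{E(M+0.2)}{E(M)} = 10^{1.5 \times 0.2} = 10^{0.3} \approx 2.0
\]
```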
The data is available in sovietNuclearExplosion.dat; the columns are: date, Western magnitude estimate, Soviet reported yield (kilotons).
Make a scatter plot and a simple linear regression fit for estimated magnitude (y) v/s yield (x).
Make a plot of the residuals $e_i = y_i - \hat{y}_i$ v/s $y_i$.
Do you see any pattern in these curves?
For this problem, transform the independent variable x (according to the ladder of transformations).
Make a scatter plot of the data in each step.
Pick a suitable transformation and obtain the model for the nuclear test magnitude as a function of the (transformed) yield.
This is a way to obtain a model for the data using linear regression.
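Transforming the independent variable only changes the second column of the design matrix; the regression itself is unchanged. The sketch below illustrates this under loudly stated assumptions: the yields and the magnitude relation (intercept 4.5, slope 0.75 on log10 of yield) are invented, noise-free values, not the contents of sovietNuclearExplosion.dat, and the helper name `fit` is hypothetical.

```python
import numpy as np

# Hypothetical yields (kilotons) and noise-free magnitudes; NOT the real data.
yield_kt = np.array([2.0, 5.0, 10.0, 20.0, 50.0, 100.0, 150.0])
magnitude = 4.5 + 0.75 * np.log10(yield_kt)

def fit(x, y):
    """Matrix regression of y on x with an intercept; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

# Candidate transformations of the independent variable, moving down the ladder.
candidates = {
    "x": yield_kt,
    "sqrt(x)": np.sqrt(yield_kt),
    "log10(x)": np.log10(yield_kt),
}

# Fit each candidate and record the largest absolute residual.
max_abs_resid = {}
for name, xt in candidates.items():
    b0, b1 = fit(xt, magnitude)
    e = magnitude - (b0 + b1 * xt)
    max_abs_resid[name] = np.max(np.abs(e))
    print(name, b0, b1, max_abs_resid[name])
```

Because the synthetic magnitudes were generated from log10 of the yield, only that transformation leaves residuals at rounding level; with the real data the choice should rest on the scatter and residual plots.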

