------------------------------------------------------------------------------------
Notes on the Project
------------------------------------------------------------------------------------

You need to analyze a regression-type dataset. That is, it must consist of "cases" (rows of a dataframe) measured on several variables (columns), one of which will be the response or outcome (Y) and the rest predictors. The response should be quantitative (for normal regression) or categorical (for GLM). You need to have more cases than predictors. Model selection should be a key component of your analysis.

The data should not already have been analyzed in a publication (book or paper).

You should carefully consider all the methods we have studied in the first half of the course (Hwks 1-7).

Your project must be typed up neatly in the form of a paper. Do not include raw computer output; give only relevant information in tables and figures. (Refer to the Handout on the course website on communicating statistical results.)

Packages with datasets (there are of course many more...)
-----------------------------------------------------------------------------------------
MASS, ISLR, leaps, faraway.

Regression data websites
-------------------------
* http://people.sc.fsu.edu/~jburkardt/datasets/regression/regression.html
* https://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
* https://www.kaggle.com/datasets

Background on some datasets posted here
-----------------------------------------------------------------------------------------

oil.csv
-------------------------------------------------
The file oil.csv can be read into R with the code
> oil = read.csv("oil.csv")
and consists of observations on 7 quantitative variables made at each of 667 oil wells (each row is a different well). The variables co2, c1, c2, c35, and c6 represent various chemical constituents of the oil (measured in mole fraction), while temp and press are the well's temperature and pressure, respectively.
The task here is to build a model to predict press as a function of the remaining variables. For some background as to why this is important, you can read the article "Prediction of Pressure, Temperature, and Velocity Distribution of Two-Phase Flow in Oil Wells", Journal of Petroleum Science and Engineering, 46(3), 2005, pp. 195-208.

SA.dat
---------------------------------------------
The file can be read into R with the code
> SA = data.frame(read.table("SA.dat", sep= "\t", header = T))
The data are on 311 male individuals. The variables in the data set include: age of the patient (age); systolic blood pressure (sbp); fat in adipose tissue (beneath the skin, around organs) (adiposity); body mass index (bmi) (obesity); type A behaviour (aggressive personality) (typea); alcohol units consumed per week (alcohol); an indicator of whether the patient drinks alcohol at all (alcind); cumulative tobacco consumption in kg (tobacco); an indicator of whether the patient is or has been a smoker (tobind); an indicator of whether the patient is diagnosed with heart disease (chd); an indicator of whether the patient has a family member with heart disease (famhist); and finally the cholesterol level of the patient (ldl). The task here is to build a model to predict ldl as a function of the remaining variables.

cars.dat
------------------------------------------------
The file can be read into R with the code
> cars = data.frame(read.table("cars.dat", sep= "\t", header = T))
There are 82 different cars in this data set and 19 variables.
The focus is to predict the price (mid.price) as a function of the other variables, which include mileage (miles per gallon in the city and on the highway), airbag standard (0 if not included, 1 if for the driver, and 2 if passenger/side), the number of cylinders, the engine size, horsepower, maximum rpm of the engine, manual transmission (yes=1, no=0), size of the fuel tank, passenger room, length and width of the car, a measure of the space needed to make a U-turn with the car, size of the rear room of the car, luggage room, weight of the car, and finally an indicator of whether the car is domestic (US built=1) or not.

Concrete: R library "MAVE" [https://cran.r-project.org/web/packages/MAVE/MAVE.pdf]
---------------------------------------------------------------
Modeling concrete strength is very important in civil engineering applications; strength is a highly nonlinear function of the age of the poured mix as well as of several ingredients. The "Concrete" dataset in R library "MAVE" consists of n=1030 observations with a total of 8 features relevant to concrete compressive strength (the outcome y, in MPa).
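
Since model selection should be a key component of the analysis, here is a minimal sketch of one possible workflow for any of the datasets above. It is only an illustration, not a prescribed method: it uses forward stepwise selection with a BIC penalty via base R's step(), which is one choice among the methods covered in the course. The data are simulated stand-ins, and the names dat, x1 to x4, y, and rmse are hypothetical; with a real dataset you would replace dat with, for example, the oil data frame and y with press.

```r
# Sketch only: simulated stand-in data, NOT one of the course datasets.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# By construction only x1 and x2 affect the response; x3 and x4 are noise.
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Hold out a test set so candidate models are compared on unseen cases.
train_idx <- sample(seq_len(n), size = 150)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Smallest and largest models considered.
null_fit <- lm(y ~ 1, data = train)
full_fit <- lm(y ~ ., data = train)

# Forward stepwise search; k = log(n) gives the BIC penalty
# (the default k = 2 would give AIC).
sel_fit <- step(null_fit,
                scope = list(lower = ~ 1, upper = formula(full_fit)),
                direction = "forward",
                k = log(nrow(train)),
                trace = 0)

# Root mean squared prediction error on the held-out cases.
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$y - predict(fit, newdata))^2))
}
rmse_full <- rmse(full_fit, test)
rmse_sel  <- rmse(sel_fit, test)

# Report only what is relevant, not the raw step() trace.
selected_terms <- attr(terms(sel_fit), "term.labels")
print(selected_terms)
print(round(c(full = rmse_full, selected = rmse_sel), 3))
```

In a write-up, the selected terms and the two test errors would go into a small table rather than being pasted as raw output. Residual diagnostics and possible transformations would still need to be checked before settling on a final model.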