Kimberley Hutson
5/13/2020
The purpose of this document is to provide an overview of data analysis and visualization for the different types of cereals.
Things to know:
Type:
- C = Cold
- H = Hot
Manufacturer:
- A = American Home Food Products
- G = General Mills
- K = Kellogg
- N = Nabisco
- P = Post
- Q = Quaker Oats
- R = Ralston Purina
The data set used in this overview was taken from: https://www.kaggle.com/crawford/80-cereals/data
Data set (Cereal)
Name | Manufacturer | Type | Calories | Protein | Fat | Sodium | Fibre | Carbohydrates | Sugar | Potassium | Vitamins | Shelf | Weight | Cups | Rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1 | 0.33 | 68.40297 |
100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1 | 1.00 | 33.98368 |
All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1 | 0.33 | 59.42551 |
All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1 | 0.50 | 93.70491 |
Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1 | 0.75 | 34.38484 |
Apple Cinnamon Cheerios | G | C | 110 | 2 | 2 | 180 | 1.5 | 10.5 | 10 | 70 | 25 | 1 | 1 | 0.75 | 29.50954 |
Summary of Data set (Cereal)
##                         Name    Manufacturer Type      Calories
##  100% Bran                : 1   A: 1         C:74   Min.   : 50.0
##  100% Natural Bran        : 1   G:22         H: 3   1st Qu.:100.0
##  All-Bran                 : 1   K:23                Median :110.0
##  All-Bran with Extra Fiber: 1   N: 6                Mean   :106.9
##  Almond Delight           : 1   P: 9                3rd Qu.:110.0
##  Apple Cinnamon Cheerios  : 1   Q: 8                Max.   :160.0
##  (Other)                  :71   R: 8
##     Protein           Fat            Sodium          Fibre
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000
##
##  Carbohydrates      Sugar          Potassium         Vitamins
##  Min.   :-1.0   Min.   :-1.000   Min.   : -1.00   Min.   :  0.00
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00
##  Median :14.0   Median : 7.000   Median : 90.00   Median : 25.00
##  Mean   :14.6   Mean   : 6.922   Mean   : 96.08   Mean   : 28.25
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00
##
##      Shelf           Weight          Cups           Rating
##  Min.   :1.000   Min.   :0.50   Min.   :0.250   Min.   :18.04
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17
##  Median :2.000   Median :1.00   Median :0.750   Median :40.40
##  Mean   :2.208   Mean   :1.03   Mean   :0.821   Mean   :42.67
##  3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83
##  Max.   :3.000   Max.   :1.50   Max.   :1.500   Max.   :93.70
##
Question 1: Which manufacturer has cereals with the most fat?
The histogram shows that the cereals of manufacturer K (Kellogg) have the highest fat content.
Question 2: What types of cereal do the different manufacturers produce?
The histogram shows that Type C (Cold) cereals are manufactured the most. We can also see that manufacturers N (Nabisco) and Q (Quaker Oats) produce both hot and cold cereals.
Question 3: Which type of cereal do people prefer?
The box plot compares the rating of cereals by type. Reading approximate values from the plot, hot cereals have a minimum rating of 51, Q1 of 53, median of 55, Q3 of 60 and maximum of 65, with a right skew. Cold cereals have a minimum rating of 18, Q1 of 33, median of 40, Q3 of 50 and maximum of about 94 (93.70 in the summary; this is the 1 outlier), with a right skew. The median rating is therefore higher for hot cereals (55) than for cold cereals (40).
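The five values that a box plot displays can be computed directly from the data. The following is a minimal Python sketch, not the code used for the original plot. It uses only the six Rating values printed in the data set preview (all cold cereals), so its results are illustrative and will not match the full-data figures quoted above. The function name `five_number_summary` is hypothetical.

```python
import statistics

# Rating values of the six cereals shown in the data set preview.
ratings = [68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954]


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring sorted values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


summary = five_number_summary(ratings)
for label, value in zip(("min", "Q1", "median", "Q3", "max"), summary):
    print(f"{label:>6}: {value:.2f}")
```

Box plot implementations may use a slightly different quartile rule (for example hinges), so small differences from a plotted box are expected.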
Question 4: How many calories can one get per serving?
The scatter plot shows the calories you get from one cup of cereal, by manufacturer.
Question 5: Which cereal is the unhealthiest?
The scatter plot shows no clear relationship between hot and cold cereals. It also shows that cold cereals have the highest fat and sugar content.
Question 6: Which type of cereal will give you more energy (protein)?
The histograms show that cold cereals from manufacturer K (Kellogg) give you the most energy (protein).
Question 7: Which type of cereal contains the most potassium?
The box plot compares the amount of potassium in the different types of cereal. Hot cereals have a minimum of 0, Q1 of 49, median of 98, Q3 of 101 and maximum of 110, with a left skew. Cold cereals have a minimum of 0, Q1 of 30, median of 80, Q3 of 110 and maximum of 330 (including the 4 outliers), with a right skew.
Question 8: What is the average amount of carbohydrates?
The histogram shows that the average amount of carbohydrates one can get from eating cereal, hot or cold, is 14.5974026 (about 14.6, matching the mean in the summary).
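The summary reports a minimum of -1 for Carbohydrates, Sugar and Potassium, which cannot be a real quantity. Assuming that -1 is a placeholder for a missing value (an assumption, not something the document states), such rows pull the average down slightly. The minimal Python sketch below compares the two averages. It uses only the twelve Carbohydrates values printed in this document, so the numbers are illustrative rather than the full-data mean, and the helper name `mean_ignoring_missing` is hypothetical.

```python
from statistics import mean

# Carbohydrates values from the twelve rows printed in this document
# (six preview rows, then six rows of the prediction table).
carbohydrates = [5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 13, 15, 21, 14, 21, -1]


def mean_ignoring_missing(values, missing=-1):
    """Average the values after dropping the assumed missing-value code."""
    valid = [v for v in values if v != missing]
    return mean(valid)


naive_mean = mean(carbohydrates)                    # -1 counted as a real value
clean_mean = mean_ignoring_missing(carbohydrates)   # -1 treated as missing

print(f"mean with -1 kept:    {naive_mean:.3f}")
print(f"mean with -1 dropped: {clean_mean:.3f}")
```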
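Question 9: What is the total amount of fibre you can get from eating your cereal cold or hot?
The histogram shows 74 for cold cereals and 3 for hot cereals. These values equal the number of cold and hot cereals in the summary (C: 74, H: 3), so the histogram is counting the cereals of each type rather than summing their fibre content.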
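A total of Fibre per type and a count of cereals per type are different quantities. The minimal Python sketch below computes both side by side. It uses only the (Type, Fibre) pairs of the twelve rows printed in this document, so the totals are for illustration and are not the totals for the full data set.

```python
from collections import defaultdict

# (Type, Fibre) pairs from the twelve rows printed in this document.
rows = [
    ("C", 10.0), ("C", 2.0), ("C", 9.0), ("C", 14.0), ("C", 1.0), ("C", 1.5),
    ("C", 0.0), ("C", 3.0), ("C", 1.0), ("C", 1.0), ("C", 0.0), ("H", 2.7),
]

counts = defaultdict(int)     # number of cereals per type
totals = defaultdict(float)   # summed fibre per type

for cereal_type, fibre in rows:
    counts[cereal_type] += 1
    totals[cereal_type] += fibre

for cereal_type in sorted(counts):
    print(f"Type {cereal_type}: {counts[cereal_type]} cereals, "
          f"total fibre {totals[cereal_type]:.1f}")
```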
Question 10: Which manufacturer has cereals with the most vitamins?
The histogram shows that the cereals of manufacturer G (General Mills) are the richest in vitamins.
- train is rows 1 to 50
- test is rows 51 to 77
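A split by row position like the one described can be sketched as follows. This is a minimal Python illustration, not the original code. The list `row_numbers` is a stand-in for the 77 rows of the data set, and `split_by_position` is a hypothetical helper.

```python
# Stand-in for the 77 rows of the data set, numbered from 1 as in the text.
row_numbers = list(range(1, 78))


def split_by_position(rows, n_train):
    """Return the first n_train rows as train and the remainder as test."""
    return rows[:n_train], rows[n_train:]


train, test = split_by_position(row_numbers, 50)

print(f"train: rows {train[0]} to {train[-1]} ({len(train)} rows)")
print(f"test:  rows {test[0]} to {test[-1]} ({len(test)} rows)")
```

A positional split keeps the file order, so if the file is sorted (here alphabetically by name) the two parts may differ systematically. A random split is a common alternative.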
Simple Linear Regression 1
##
## Call:
## lm(formula = Rating ~ Fat, data = train)
##
## Coefficients:
## (Intercept)          Fat
##      47.725       -5.248
Summary for first Simple Linear Regression
##
## Call:
## lm(formula = Rating ~ Fat, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -20.081  -7.102  -2.116   7.976  25.926
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   47.725      2.560  18.643   <2e-16 ***
## Fat           -5.248      1.963  -2.673   0.0102 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.94 on 48 degrees of freedom
## Multiple R-squared:  0.1295, Adjusted R-squared:  0.1114
## F-statistic: 7.143 on 1 and 48 DF,  p-value: 0.01025
Correlation
## [1] -0.3599192
Anova
## Analysis of Variance Table
##
## Response: Rating
##           Df Sum Sq Mean Sq F value  Pr(>F)
## Fat        1 1018.3 1018.35  7.1434 0.01025 *
## Residuals 48 6842.8  142.56
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
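The F value in the ANOVA table is the ratio of the predictor's mean square to the residual mean square. With a single predictor it also equals the square of the slope's t value. The minimal Python sketch below re-derives both from the rounded figures printed for the Rating ~ Fat model, so agreement is only to within rounding.

```python
# Values copied from the ANOVA table and the summary of Rating ~ Fat.
mean_sq_fat = 1018.35      # Mean Sq for Fat
mean_sq_resid = 142.56     # Mean Sq for Residuals
estimate = -5.248          # slope estimate for Fat
std_error = 1.963          # standard error of the slope

# F statistic: explained variation per degree of freedom divided by the
# unexplained variation per degree of freedom.
f_value = mean_sq_fat / mean_sq_resid

# t statistic of the slope: estimate divided by its standard error.
t_value = estimate / std_error

print(f"F value: {f_value:.4f}")
print(f"t value: {t_value:.3f}")
print(f"t squared: {t_value ** 2:.4f}")
```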
AIC (Akaike’s information criterion)
## [1] 393.8402
BIC (Bayesian information criterion)
## [1] 399.5763
Simple Linear Regression 2
##
## Call:
## lm(formula = Rating ~ Sugar, data = train)
##
## Coefficients:
## (Intercept)        Sugar
##      58.616       -2.324
Summary for second Simple Linear Regression
##
## Call:
## lm(formula = Rating ~ Sugar, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.8051  -5.3921  -0.7764   4.7406  23.7296
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  58.6163     2.1726  26.980  < 2e-16 ***
## Sugar        -2.3238     0.2688  -8.646 2.37e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.002 on 48 degrees of freedom
## Multiple R-squared:  0.609, Adjusted R-squared:  0.6008
## F-statistic: 74.75 on 1 and 48 DF,  p-value: 2.367e-11
Correlation
## [1] -0.7803697
Anova
## Analysis of Variance Table
##
## Response: Rating
##           Df Sum Sq Mean Sq F value    Pr(>F)
## Sugar      1 4787.2  4787.2  74.755 2.367e-11 ***
## Residuals 48 3073.9    64.0
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC (Akaike’s information criterion)
## [1] 353.8275
BIC (Bayesian information criterion)
## [1] 359.5636
Simple Linear Regression 3
##
## Call:
## lm(formula = Rating ~ Calories, data = train)
##
## Coefficients:
## (Intercept)     Calories
##     86.5206      -0.4161
Summary for third Simple Linear Regression
##
## Call:
## lm(formula = Rating ~ Calories, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18.3546  -5.1485  -0.0718   6.5752  23.7289
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.52055    6.95497  12.440  < 2e-16 ***
## Calories    -0.41609    0.06465  -6.436  5.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.376 on 48 degrees of freedom
## Multiple R-squared:  0.4632, Adjusted R-squared:  0.452
## F-statistic: 41.42 on 1 and 48 DF,  p-value: 5.399e-08
Correlation
## [1] -0.6805819
Anova
## Analysis of Variance Table
##
## Response: Rating
##           Df Sum Sq Mean Sq F value    Pr(>F)
## Calories   1 3641.2  3641.2  41.417 5.399e-08 ***
## Residuals 48 4219.9    87.9
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC (Akaike’s information criterion)
## [1] 369.6713
BIC (Bayesian information criterion)
## [1] 375.4073
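The AIC and BIC values printed for the three regressions can be re-derived from the residual sums of squares in the ANOVA tables. The minimal Python sketch below assumes a Gaussian log-likelihood with three estimated parameters per model (intercept, slope and residual variance) and n = 50 training rows. Under that assumption it reproduces the printed values to within rounding of the inputs.

```python
import math

N_TRAIN = 50    # rows in the training set (48 residual df + 2 coefficients)
N_PARAMS = 3    # intercept, slope, residual variance (assumed counting rule)

# Residual sums of squares from the three ANOVA tables.
residual_ss = {"Fat": 6842.8, "Sugar": 3073.9, "Calories": 4219.9}


def gaussian_log_likelihood(rss, n):
    """Maximised log-likelihood of a linear model with normal errors."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic(rss, n, n_params):
    """Akaike's information criterion: -2 logLik + 2 k."""
    return -2 * gaussian_log_likelihood(rss, n) + 2 * n_params


def bic(rss, n, n_params):
    """Bayesian information criterion: -2 logLik + k log(n)."""
    return -2 * gaussian_log_likelihood(rss, n) + n_params * math.log(n)


for predictor, rss in residual_ss.items():
    print(f"Rating ~ {predictor}: "
          f"AIC = {aic(rss, N_TRAIN, N_PARAMS):.2f}, "
          f"BIC = {bic(rss, N_TRAIN, N_PARAMS):.2f}")
```

Because all three models have the same number of parameters and the same n, ranking them by AIC or BIC is equivalent to ranking them by residual sum of squares.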
The residual mean squares from the ANOVA tables are (the lower the value, the better the fit):
- Model 1 (Rating ~ Fat) = 142.56
- Model 2 (Rating ~ Sugar) = 64.0
- Model 3 (Rating ~ Calories) = 87.9
The Adjusted R-squared values are (the higher the value, the better the fit of the model):
- Model 1 = 0.1114
- Model 2 = 0.6008
- Model 3 = 0.452
Correlation (the closer to 1 or -1, the stronger the correlation):
- Model 1= -0.3599192
- Model 2= -0.7803697
- Model 3= -0.6805819
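For a regression with one predictor, the Multiple R-squared in the summary is the square of the correlation between the predictor and Rating. The Adjusted R-squared then follows from the sample size. The minimal Python sketch below re-derives both from the printed correlations, assuming n = 50 training rows (48 residual degrees of freedom plus 2 coefficients). The function name is hypothetical.

```python
N_TRAIN = 50  # rows used to fit each model

# Correlations between Rating and each predictor, as printed above.
correlations = {"Fat": -0.3599192, "Sugar": -0.7803697, "Calories": -0.6805819}


def adjusted_r_squared(r_squared, n, n_predictors=1):
    """Penalise R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)


results = {}
for predictor, r in correlations.items():
    r_squared = r ** 2
    results[predictor] = (r_squared, adjusted_r_squared(r_squared, N_TRAIN))

for predictor, (r_squared, adjusted) in results.items():
    print(f"Rating ~ {predictor}: R-squared = {r_squared:.4f}, "
          f"adjusted = {adjusted:.4f}")
```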
AIC (the model with the lowest AIC score is preferred):
- Model 1 = 393.8402
- Model 2 = 353.8275
- Model 3 = 369.6713
BIC (the model with the lowest BIC score is preferred):
- Model 1 = 399.5763
- Model 2 = 359.5636
- Model 3 = 375.4073
Row | actuals.Name | actuals.Manufacturer | actuals.Type | actuals.Calories | actuals.Protein | actuals.Fat | actuals.Sodium | actuals.Fibre | actuals.Carbohydrates | actuals.Sugar | actuals.Potassium | actuals.Vitamins | actuals.Shelf | actuals.Weight | actuals.Cups | actuals.Rating | predicteds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | Cinnamon Toast Crunch | G | C | 120 | 1 | 3 | 210 | 0.0 | 13 | 9 | 45 | 25 | 2 | 1 | 0.75 | 19.82357 | 37.70189 |
33 | Grape Nuts Flakes | P | C | 100 | 3 | 1 | 140 | 3.0 | 15 | 5 | 85 | 25 | 3 | 1 | 0.88 | 52.07690 | 46.99720 |
22 | Crispix | K | C | 110 | 2 | 0 | 220 | 1.0 | 21 | 3 | 30 | 25 | 3 | 1 | 1.00 | 46.89564 | 51.64485 |
26 | Frosted Flakes | K | C | 110 | 1 | 0 | 200 | 1.0 | 14 | 11 | 25 | 25 | 1 | 1 | 0.75 | 31.43597 | 33.05424 |
73 | Triples | G | C | 110 | 2 | 1 | 250 | 0.0 | 21 | 3 | 60 | 25 | 3 | 1 | 0.75 | 39.10617 | 51.64485 |
58 | Quaker Oatmeal | Q | H | 100 | 5 | 2 | 0 | 2.7 | -1 | -1 | 110 | 0 | 1 | 1 | 0.67 | 50.82839 | 60.94015 |
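The `predicteds` column is consistent with the fitted line of the Rating ~ Sugar model, Rating = 58.6163 - 2.3238 × Sugar. The minimal Python sketch below recomputes the six predictions from the printed coefficients and reports the mean absolute error on these six rows only. The function name `predict_rating` is hypothetical, and small differences from the table come from the rounded coefficients.

```python
# Coefficients from the summary of Rating ~ Sugar.
INTERCEPT = 58.6163
SLOPE = -2.3238


def predict_rating(sugar):
    """Predicted Rating for a given Sugar value under the fitted line."""
    return INTERCEPT + SLOPE * sugar


# (name, Sugar, actual Rating, predicted Rating) copied from the table above.
rows = [
    ("Cinnamon Toast Crunch", 9, 19.82357, 37.70189),
    ("Grape Nuts Flakes", 5, 52.07690, 46.99720),
    ("Crispix", 3, 46.89564, 51.64485),
    ("Frosted Flakes", 11, 31.43597, 33.05424),
    ("Triples", 3, 39.10617, 51.64485),
    # Sugar is -1 here, so this prediction is based on a value that
    # cannot be a real sugar content.
    ("Quaker Oatmeal", -1, 50.82839, 60.94015),
]

recomputed = [predict_rating(sugar) for _, sugar, _, _ in rows]
abs_errors = [abs(actual - pred)
              for (_, _, actual, _), pred in zip(rows, recomputed)]
mean_abs_error = sum(abs_errors) / len(abs_errors)

for (name, _, actual, _), pred in zip(rows, recomputed):
    print(f"{name:<22} actual {actual:8.3f}  predicted {pred:8.3f}")
print(f"mean absolute error on these six rows: {mean_abs_error:.2f}")
```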
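From the comparisons it can be observed that Model 2 (Rating ~ Sugar) is the best fit and the most accurate of the three models, because it has the strongest correlation (-0.78), the highest R-squared value and the lowest residual mean square. All three slopes are negative, so the predicted Rating goes down as Fat, Sugar or Calories increase. Under Model 2, each additional unit of Sugar lowers the predicted Rating by about 2.32.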