Vtyjkyuk

Question: Interpretation of the given box and whisker plot

I would like to check whether I have interpreted the box-and-whisker plot below correctly; this will also confirm my understanding of the topic. (I am learning basic statistics and measures of dispersion.)

Box & Whisker Plot:

[Image: box-and-whisker plot]

Let's say the number line represents the ages of students; then the following is my interpretation.

Students age group is 2-9

There are more students with age 6-7 & 7-8.5

The average student age is 7

Since each group (Least-Q1, Q1-Q2, Q2-Q3 & Q3-Greatest) in a box-and-whisker plot holds roughly the same number of students, the narrowest-looking group should be the densest, or the least variable. So does that mean that, in the example above, the Q3-Greatest group (students aged 8.5-9) is the densest of all and the least variable?

Is my understanding above correct? Also, what other interpretations can I make?

score 0 · Accepted Answer · 2018-09-11 04:42:56Z

Students age group is 2-9

Yes. 2 is the minimum age observed in the sample and 9 is the maximum age.

There are more students with age 6-7 & 7-8.5

Not exactly. Half of the children in the sample have ages within the 'box', that is, between 6 and 8.5. Roughly speaking, a quarter of the students are under 6 years old, a quarter are from 6 to 7 years old, a quarter are between 7 and 8.5 years old, and a quarter are older than 8.5 years.
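To make the "a quarter in each group" reading concrete, here is a small R sketch. The vector `ages` is made up for illustration (it is not the data behind the plot in the question); I only chose it so that its five-number summary is 2, 6, 7, 8.5, 9.

```r
# Hypothetical ages (NOT the asker's data), chosen so that the
# five-number summary matches the plot: 2, 6, 7, 8.5, 9.
ages <- c(2, 3.5, 5, 6, 6.2, 6.6, 7, 7.5, 8, 8.5, 8.7, 8.9, 9)

five <- fivenum(ages)   # min, lower hinge, median, upper hinge, max
print(five)

# Count the students that fall into each of the four groups.
groups <- cut(ages, breaks = five, include.lowest = TRUE)
counts <- table(groups)
print(counts)
print(round(counts / length(ages), 2))   # each share is roughly 1/4
```

With 13 students the groups cannot be exactly equal, which is why the answer says "roughly" a quarter in each.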

The average student age is 7

More precisely, the median age is 7. (At most half of the students are younger than 7 and at most half are older than 7.)

Since each group (Least-Q1, Q1-Q2, Q2-Q3 & Q3-Greatest) in a box-and-whisker plot holds roughly the same number of students, the narrowest-looking group should be the densest, or the least variable. So does that mean that, in the example above, the Q3-Greatest group (students aged 8.5-9) is the densest of all and the least variable?

I don't think it is useful to use a boxplot to talk about 'density' with any precision. Certainly, it is true that about 3/4 of the students are concentrated between 6 and 9 years of age (a span of 3-4 years, depending on how you view age), while only 1/4 are in the longer span from 2 to 6. But a histogram is a better graphical device for showing 'densities'.
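As a rough sketch of what a boxplot can and cannot say here: if we assume each of the four intervals holds exactly 1/4 of the students (an idealization), the average density in an interval is 0.25 divided by its width. The five numbers below are read off the plot as described above.

```r
# Five-number summary read off the plot in the question.
five <- c(min = 2, Q1 = 6, median = 7, Q3 = 8.5, max = 9)

widths  <- diff(five)        # length of each of the four intervals, in years
density <- 0.25 / widths     # average share of students per year of age,
                             # ASSUMING exactly 1/4 of the sample per interval
names(density) <- c("2-6", "6-7", "7-8.5", "8.5-9")
print(round(density, 3))
```

These are only averages over each interval. The boxplot says nothing about how the students are spread within an interval, which is the information a histogram adds.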

Note: A boxplot gives no information about how many students are in the sample. It is best to use boxplots only for samples larger than a dozen or so. The mechanism of making a boxplot depends on finding three numbers that cut the sorted observations into four approximately equal parts. [They are the lower quartile $Q_1$ (left end of the box), the median (heavy line within the box), and the upper quartile $Q_3$ (right end of the box).] If you have a sample of only seven observations, it is difficult to know how to divide them into four approximately equal 'chunks'.
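The difficulty with tiny samples shows up in software as competing conventions. As an illustration, with seven made-up ages, R's `quantile()` offers nine definitions (argument `type`), and `fivenum()` returns Tukey's hinges, which is what `boxplot()` draws. They need not agree:

```r
x <- c(2, 5, 6, 7, 8, 8.5, 9)   # seven made-up ages, for illustration only

# Lower and upper quartile under each of quantile()'s nine conventions.
q1 <- sapply(1:9, function(t) quantile(x, 0.25, type = t, names = FALSE))
q3 <- sapply(1:9, function(t) quantile(x, 0.75, type = t, names = FALSE))
print(rbind(type = 1:9, Q1 = q1, Q3 = q3))

# Tukey's five-number summary (the hinges are the box ends in boxplot()).
print(fivenum(x))
```

For large samples the different conventions give nearly the same answers, which is one reason boxplots are more trustworthy there.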

Here is a histogram of a (fake) dataset of 40 ages that might have made your boxplot. A histogram is based on area: notice that each student
is represented by one 'brick' of area within his or her bar of the histogram.

The tick marks beneath the histogram show 'exact' ages of the students (e.g., to the nearest week). At the resolution of this graph, tick marks for two or more students of very nearly the same age may appear as one mark.

[Image: histogram of the 40 ages, with tick marks for individual students]
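For anyone who wants to experiment, here is a minimal sketch of the 'bricks' idea. The data are my own invention for this example (10 ages drawn uniformly within each quarter of the boxplot), not the dataset used for the figure.

```r
# Made-up data: 10 ages drawn uniformly within each quarter of the boxplot.
# This is an assumption for illustration, not the dataset behind the figure.
set.seed(2018)
ages <- c(runif(10, 2, 6), runif(10, 6, 7),
          runif(10, 7, 8.5), runif(10, 8.5, 9))

# One-year bins from 2 to 9; plot = FALSE returns the numbers without drawing.
h <- hist(ages, breaks = 2:9, plot = FALSE)

print(h$counts)                          # students ('bricks') per bar
print(sum(h$counts))                     # all 40 students are accounted for
print(sum(h$density * diff(h$breaks)))   # total area of the density histogram
```

Dropping `plot = FALSE` draws the histogram, and `rug(ages)` then adds the tick marks for individual students.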

Addendum: A comment expressed interest in means, medians, and modes of skewed distributions. Here are samples from two distributions. The first is $\mathsf{Gamma}(\text{shape}=2, \text{rate}=1/20).$ It is a right-skewed distribution with mode 20, median 33.57, and mean 40. A sample of size $n = 100$ has the following summary statistics:

summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.441  19.121  33.433  40.629  49.972 203.525

The sample mean and median are similar to the population mean and
median. There is no formal mode because no two observations are
exactly the same, but one might say that the modal interval of the histogram
(lower-left in the figure below) is $(20, 40].$

The second distribution is $\mathsf{Beta}(2, 1).$ It is a left-skewed distribution with mode 1, median 0.7071, and mean 2/3. A sample of size $n = 100$ has the following summary statistics:

summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.08611 0.49792 0.71515 0.67491 0.87883 0.99579

Here again, the sample mean and median closely imitate the population mean and median, respectively. The modal histogram interval is $(0.9, 1.0].$
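The population values quoted for the two distributions can be checked in R. This is a sketch using the standard formulas (gamma mode $(\text{shape}-1)/\text{rate}$ and mean $\text{shape}/\text{rate}$; beta mean $a/(a+b)$) together with the quantile functions `qgamma` and `qbeta`; the variable names are my own. If I have the arithmetic right, the gamma median comes out near 33.57.

```r
# Population values for Gamma(shape = 2, rate = 1/20).
shape <- 2
rate  <- 1 / 20
gamma_pop <- c(mode   = (shape - 1) / rate,        # formula valid for shape >= 1
               median = qgamma(0.5, shape = shape, rate = rate),
               mean   = shape / rate)
print(round(gamma_pop, 2))

# Population values for Beta(2, 1), whose density is 2 * x on [0, 1].
a <- 2
b <- 1
beta_pop <- c(mode   = 1,                          # 2 * x is largest at x = 1
              median = qbeta(0.5, shape1 = a, shape2 = b),
              mean   = a / (a + b))
print(round(beta_pop, 4))
```

In both cases the mean is pulled away from the median toward the long tail: above the median for the right-skewed gamma, below it for the left-skewed beta.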

The figure below shows the gamma distribution at left and the
beta distribution at right. The tick marks below the histogram show
the locations of individual points. The curves are the density
functions of the respective distributions.

[Image: boxplots (top) and histograms with density curves (bottom) for the gamma sample (left) and the beta sample (right)]

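set.seed(1234)                          # make the simulated samples reproducible
par(mfcol = c(2, 2))                    # 2 x 2 grid of panels, filled column by column
 x <- rgamma(100, shape = 2, rate = 0.05)
 boxplot(x, horizontal = TRUE, col = "skyblue2")
 hist(x, prob = TRUE, col = "skyblue2"); rug(x)   # rug() marks each observation
 curve(dgamma(x, shape = 2, rate = 0.05), add = TRUE, lwd = 2)   # population density

 y <- rbeta(100, shape1 = 2, shape2 = 1)
 boxplot(y, horizontal = TRUE, col = "skyblue2")
 hist(y, prob = TRUE, col = "skyblue2"); rug(y)
 # Inside curve(), x is the plotting variable, not the gamma sample above.
 curve(dbeta(x, shape1 = 2, shape2 = 1), add = TRUE, lwd = 2)
par(mfrow = c(1, 1))                    # restore the single-panel layout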

Note to @linuxuser: If your textbook does not discuss gamma and beta distributions, you can read about them on Wikipedia. Both families of distributions are widely used in applied probability modeling. [Roughly speaking, the gamma function $\Gamma(\cdot),$ used to define the density functions, is a continuous version of the factorial function, filling in values for non-integers. For a positive integer $k$, we have $\Gamma(k) = (k-1)!;$ for example, $\Gamma(5) = 4! = 24.$]
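The relation between the gamma function and factorials is easy to check with R's built-in `gamma()` and `factorial()`; a short sketch:

```r
k <- 1:6
# gamma(k) should match factorial(k - 1) for positive integers k.
print(cbind(k, gamma_k = gamma(k), factorial_k_minus_1 = factorial(k - 1)))

print(gamma(5))     # equals 4! = 24
print(gamma(2.5))   # a non-integer argument: lies between gamma(2) = 1 and gamma(3) = 2
```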

I am a beginner in statistics, so please excuse me if I ask some silly questions. Regarding "I don't think it is useful to use a boxplot to talk about 'density' with any precision": in my opinion, with respect to density you usually only want to know approximately which part of the population is denser, and a box plot is enough to tell that. Why would you ever want to quantify the density? - Sep 13 at 3:09
Both boxplots and histograms are useful, but it is helpful to understand their relative advantages and disadvantages. - Sep 13 at 5:11

score 0 · Answer 2 · 2018-09-11 03:19:59Z

Students age group is 2-9

There are more students with age 6-7 & 7-8.5

The average student age is 7

Since each group (Least-Q1, Q1-Q2, Q2-Q3 & Q3-Greatest) in a box-and-whisker plot holds roughly the same number of students, the narrowest-looking group should be the densest, or the least variable. So does that mean that, in the example above, the Q3-Greatest group (students aged 8.5-9) is the densest of all and the least variable?

Point $1$ is correct.

Note that point 2 contradicts point 4: each interval holds roughly $25\%$ of the data, so $Q_1$-$Q_3$ holds roughly $50\%$ of the data. Also, the statement is incomplete: "more students with age 6-7 & 7-8.5" than which group? Do you mean more students compared with some other specific interval, or in general?

In point $3$, the word "average" is ambiguous, as there are three types of averages: mean, median and mode. Here $Q_2$ is the median. Depending on the shape of the distribution (positively skewed, negatively skewed, or symmetric), the mean, median and mode can stand in different relationships: usually mode $<$ median $<$ mean, mean $<$ median $<$ mode, and mean $\approx$ median $\approx$ mode, respectively, although for symmetric distributions this does not always hold. The data look negatively skewed, because $75\%$ of the data lie in the interval $6$-$9$ against $25\%$ in $2$-$6$, which implies that the data (the ages, or basically the number of students) are more densely situated in the interval $6$-$9$. Consequently, you can say the data are less variable (more closely situated) in the interval $6$-$9$ than in the interval $2$-$6$.
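One simple way to put numbers on the shape argument is to compare the lengths of the two whiskers and of the two halves of the box. This is only a sketch based on the five numbers read off the plot; the variable names are mine.

```r
# Five-number summary read off the plot in the question.
five <- c(min = 2, Q1 = 6, median = 7, Q3 = 8.5, max = 9)

lower_whisker <- five[["Q1"]] - five[["min"]]      # spread of the youngest quarter
upper_whisker <- five[["max"]] - five[["Q3"]]      # spread of the oldest quarter
lower_box     <- five[["median"]] - five[["Q1"]]   # second quarter
upper_box     <- five[["Q3"]] - five[["median"]]   # third quarter

print(c(lower_whisker = lower_whisker, upper_whisker = upper_whisker,
        lower_box = lower_box, upper_box = upper_box))
```

The lower whisker is much longer than the upper one, which fits a long left tail (negative skew). The two halves of the box point slightly the other way, so the boxplot is suggestive about shape rather than conclusive.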

I believe you mean "positively-skewed, negatively-skewed, and symmetrical." There are many non-normal but symmetrical distribution families (including Cauchy and Laplace). - Sep 11 at 0:35
@BruceET, thank you for the helpful comment; yes, also uniform, bimodal, etc. I wanted to show the relationship of the three averages, which may not hold for symmetric distributions. The comparison may not hold for the others either: sometimes median$<$mean$<$mode in negatively-skewed data. - Sep 11 at 3:22
For positive skewness it's usually mode < median < mean, and for negative skewness it's usually mean < median < mode. (Not always; you can rig distributions to get almost anything.) But boxplots show only the median, so it's not clear to me how they fit into the picture. Maybe histograms would work better. - Sep 11 at 3:35
@BruceET, yes, by "respectively" I meant the cases for the mean, median and mode of the different distributions. The box plot also indicates the shape of the distribution, no? In this case, the mass is shifted to the right, so it is negatively-skewed, so mean$<$median$<$mode. Thank you, indeed. - Sep 11 at 6:07
