Friday, December 26, 2008

Describing data

Bar charts and pie charts

Bar charts are popular for showing the relative occurrence of an attribute. The lengths of the bars represent the frequencies with which the various categories were found in the sample.

Sometimes the frequency for a category of an attribute can be subdivided within that category.
  • Stacked bar chart: the bars for the subcategories are stacked on top of one another
  • Multiple bar chart: the subcategories are displayed as bars adjacent to one another
When the data can be thought of as a breakdown of some whole entity into component parts, a pie chart can be drawn.
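As a small illustration of the pie chart idea, each sector's angle can be taken as the category's share of the whole multiplied by 360 degrees. The category names and counts below are invented for the example and do not come from any data in this post.

```python
# Hypothetical category counts, invented for illustration only.
counts = {"A": 20, "B": 30, "C": 50}

total = sum(counts.values())

# Each sector's angle is its share of the whole, scaled to 360 degrees.
angles = {name: 360 * count / total for name, count in counts.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```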

Numerical measures for raw data

The mean or average is the sum of the data values x divided by the number of values in the sample.

Extraordinarily large or small x values, called outliers, will have a disproportionate effect on the average of a sample.
The median is defined to be the middle value when all the data is arranged in numerical order.

The mode represents the most popular value, the one which occurs more frequently than any other.
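The three measures of location can be compared on a small sample. The sketch below uses Python's standard statistics module; the sample values are hypothetical, chosen only to show how a single outlier moves the mean much further than the median.

```python
from statistics import mean, median, mode

# Hypothetical sample, invented for illustration only.
sample = [3, 5, 5, 6, 7, 8, 8, 8, 9]

sample_mean = mean(sample)      # sum of the values divided by their count
sample_median = median(sample)  # middle value of the sorted data
sample_mode = mode(sample)      # most frequently occurring value

# Adding one extraordinarily large value (an outlier) to the sample.
with_outlier = sample + [100]
mean_with_outlier = mean(with_outlier)
median_with_outlier = median(with_outlier)

print(sample_mean, sample_median, sample_mode)
print(mean_with_outlier, median_with_outlier)
```

Here the outlier more than doubles the mean, while the median only moves from 7 to 7.5.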

To measure the spread of the data:
  • The range equals the largest data value minus the smallest one.
  • The variance is related to the deviations of the data values from their mean.
  • The standard deviation is the square root of the variance.
There are usually two standard deviation options, one marked \sigma_{n-1}, which is the sample standard deviation, and another marked \sigma_{n}, the population standard deviation. The latter applies only when the data relates to the entire population. We shall have no use for it as all our work is with samples which are not complete populations.
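The difference between the two options can be seen numerically. In this sketch, Python's statistics.variance and statistics.stdev divide by n-1 (the \sigma_{n-1} option), while statistics.pvariance and statistics.pstdev divide by n (the \sigma_{n} option). The data values are hypothetical.

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical sample, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)  # largest value minus smallest

sample_var = variance(data)  # squared deviations summed, divided by n-1
sample_sd = stdev(data)      # square root of the sample variance

pop_var = pvariance(data)    # squared deviations summed, divided by n
pop_sd = pstdev(data)        # square root of the population variance

print(data_range, sample_var, sample_sd, pop_var, pop_sd)
```

For the same data the sample standard deviation is always slightly larger than the population one, because the divisor is smaller.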

Similar to the median (greater than 50% of the data), we can define the lower quartile (greater than 25% of the data) and the upper quartile (greater than 75% of the data). The distance between the lower and upper quartiles is the interquartile range, which contains the middle half of the values.
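A minimal sketch of the quartiles, using statistics.quantiles from the Python standard library (Python 3.8 or later). Textbooks and calculators differ slightly in how they interpolate quartiles, so the "inclusive" method used here is one assumed convention rather than the only possible one. The data values are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical sample, invented for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four equal parts, giving three cut points.
lower_quartile, middle, upper_quartile = quantiles(data, n=4, method="inclusive")

interquartile_range = upper_quartile - lower_quartile

print(lower_quartile, middle, upper_quartile, interquartile_range)
```

The middle cut point coincides with the median, as expected.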

Sometimes the size of the numbers in a sample may be too large and lead to numerical errors. It is helpful to subtract a suitable number, an assumed mean, from every data value. After getting the mean of the new sample, it can be adjusted by adding back the assumed mean afterwards. The range, variance and standard deviation all remain the same for the adjusted sample as its spread is precisely the same as that of the original data.
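The assumed mean technique can be checked with a short sketch. The data values and the choice of 100000 as the assumed mean are hypothetical.

```python
from statistics import mean, variance

# Hypothetical large data values, invented for illustration only.
data = [100003, 100005, 100008, 100010]
assumed_mean = 100000  # any convenient number close to the data

# Subtract the assumed mean from every value to get small numbers.
shifted = [x - assumed_mean for x in data]

# The true mean is the mean of the shifted sample plus the assumed mean.
recovered_mean = mean(shifted) + assumed_mean

# Measures of spread are unchanged by the shift.
same_range = (max(shifted) - min(shifted)) == (max(data) - min(data))
same_variance = variance(shifted) == variance(data)

print(recovered_mean, same_range, same_variance)
```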

Grouped data, histograms and ogives

When the sample contains many different values, the data can be grouped into classes. For example, the lengths of the stalks of hundreds of dandelions were measured and classified to form a grouped frequency table:

Stalk length (mm)       1.0-4.0   4.0-6.0   6.0-8.0   ........
Number of dandelions    9         25        34        ........

The diagram must illustrate not only the frequencies with which the data values occur within the different classes, but also the sizes of the class widths. A histogram is a kind of bar chart where the widths of the bars represent the widths of the classes. As areas have more visual impact than lengths, we draw bars whose areas are proportional to the frequencies for each class.
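The height of each histogram bar is therefore the frequency divided by the class width, so that height times width gives back the frequency. The sketch below uses only the three classes visible in the dandelion table; the remaining classes are elided in the table and are not guessed here.

```python
# (lower bound, upper bound, frequency) for the three classes shown
# in the dandelion table; the rest of the table is not reproduced.
classes = [(1.0, 4.0, 9), (4.0, 6.0, 25), (6.0, 8.0, 34)]

# Bar height = frequency / class width.
heights = [freq / (upper - lower) for lower, upper, freq in classes]

# Bar area = height * class width, which recovers the frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

print(heights)
print(areas)
```

Note that the first class has the smallest height partly because it is wider than the others.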

The scale of the vertical axis and its label, 'Frequency/class width', are often omitted, leaving the areas to indicate the relative sizes of the frequencies without conveying their exact values.

In order to estimate the mean, variance, etc., we assume that each data value equals the mid-point of the class to which it belongs.
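A sketch of the mid-point estimate, again restricted to the three classes visible in the dandelion table, so the results describe that partial table only and not the full data set. The n-1 divisor is used for the variance, on the assumption that the data is a sample.

```python
# (lower bound, upper bound, frequency) for the three classes shown
# in the dandelion table; the rest of the table is not reproduced.
classes = [(1.0, 4.0, 9), (4.0, 6.0, 25), (6.0, 8.0, 34)]

total = sum(freq for _, _, freq in classes)
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [freq for _, _, freq in classes]

# Every value in a class is treated as if it sat at the class mid-point.
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / total

# Sample variance estimate (divisor n-1) from the same assumption.
grouped_variance = sum(
    f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, freqs)
) / (total - 1)
grouped_sd = grouped_variance ** 0.5

print(total, midpoints, grouped_mean, grouped_sd)
```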

The cumulative frequency of x is the number of data values which are less than or equal to x. A graph of cumulative frequency against x should be a smoothly varying curve. It is called a cumulative frequency curve or ogive. The ogive can be constructed from the grouped frequency table.
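The points of the ogive can be obtained by taking running totals of the class frequencies and plotting each total against the upper boundary of its class. The sketch uses only the three classes visible in the dandelion table.

```python
from itertools import accumulate

# (lower bound, upper bound, frequency) for the three classes shown
# in the dandelion table; the rest of the table is not reproduced.
classes = [(1.0, 4.0, 9), (4.0, 6.0, 25), (6.0, 8.0, 34)]

upper_bounds = [upper for _, upper, _ in classes]
cumulative = list(accumulate(freq for _, _, freq in classes))

# The curve starts at zero at the lower boundary of the first class.
ogive_points = [(classes[0][0], 0)] + list(zip(upper_bounds, cumulative))

print(ogive_points)
```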

Time series, frequency polygons and indices

A sequence of measurements corresponding to different instants or periods of time is called a time series. It should be noted that a time series is a single sample with two variables being measured, the quantity of interest and time.

Time series can be illustrated by a frequency polygon. It consists of straight lines joining the values of each measurement. It is often felt necessary to focus attention on the movement of the figures in a time series rather than on their absolute size. An index is established for the series by expressing each data value as a percentage of the figure for a base year. The Consumer Price Index (CPI) is an example. A frequency polygon drawn with indices would look exactly the same as one drawn with the actual values; only the scale on the vertical axis changes.
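A minimal sketch of how an index is computed. The years and figures are hypothetical, and 2004 is arbitrarily chosen as the base year.

```python
# Hypothetical yearly figures, invented for illustration only.
series = {2004: 250.0, 2005: 260.0, 2006: 275.0, 2007: 300.0}
base_year = 2004  # arbitrary choice of base year

base_value = series[base_year]

# Each value is expressed as a percentage of the base-year figure.
index = {year: 100 * value / base_value for year, value in series.items()}

for year, value in index.items():
    print(year, round(value, 1))
```

The base year always has an index of 100, and every other year shows its movement relative to that base.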

Scatter diagrams

A scatter diagram is a plot of one variable against another showing how, if at all, they are related.
