Friday, December 26, 2008

Describing data

Bar charts and pie charts

Bar charts are popular for showing the relative occurrence of an attribute. The lengths of the bars represent the frequencies with which the various categories were found in the sample.

Sometimes the frequency for a category of an attribute can be subdivided within that category.
  • Stacked bar chart: the subcategories are stacked on top of one another within a single bar.
  • Multiple bar chart: the subcategories are displayed as adjacent bars.
When the data can be thought of as a breakdown of some whole entity into component parts, a pie chart can be drawn.

Numerical measures for raw data

The mean, or average, is the sum of the data values divided by the number of values: \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.

Extraordinarily large or small values, called outliers, have a disproportionate effect on the mean of a sample.
The median is defined to be the middle value when all the data are arranged in numerical order; with an even number of values, it is the average of the two middle values.

The mode represents the most popular value, the one which occurs more frequently than any other.

To measure the spread of the data:
  • The range is the largest data value minus the smallest.
  • The variance measures the deviations of the data values from their mean; for a sample it is s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.
  • The standard deviation is the square root of the variance.
There are usually two standard deviation options: one marked \sigma_{n-1}, which is the sample standard deviation, and another marked \sigma_{n}, the population standard deviation. The latter applies only when the data relate to the entire population. We shall have no use for it, as all our work is with samples, which are not complete populations.
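
As a minimal sketch (the sample values are invented), these measures can be computed with Python's standard statistics module; note that statistics.variance and statistics.stdev use the n-1 divisor, matching \sigma_{n-1}:

    import statistics

    data = [12, 15, 11, 15, 18, 14, 95]   # invented sample; 95 is an outlier

    print(statistics.mean(data))      # 25.71..., dragged up by the outlier
    print(statistics.median(data))    # 15, robust to the outlier
    print(statistics.mode(data))      # 15, the most frequent value
    print(max(data) - min(data))      # 84, the range
    print(statistics.variance(data))  # sample variance (n-1 divisor)
    print(statistics.stdev(data))     # sample standard deviation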

Just as the median is greater than 50% of the data, we can define the lower quartile (greater than 25% of the data) and the upper quartile (greater than 75% of the data). The distance between the lower and upper quartiles is the interquartile range, which contains the middle half of the values.

Sometimes the numbers in a sample are so large that they lead to numerical errors. It is helpful to subtract a suitable number, an assumed mean, from every data value. After finding the mean of the new sample, adjust it by adding back the assumed mean. The range, variance and standard deviation of the adjusted sample are the same as those of the original, since its spread is precisely that of the original data.
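
A tiny sketch of the trick, with invented numbers:

    data = [10003, 10005, 10010, 10002]    # invented large values
    assumed = 10000                        # the assumed mean
    shifted = [x - assumed for x in data]  # [3, 5, 10, 2]: small, safe numbers
    mean = assumed + sum(shifted) / len(shifted)   # add back: 10005.0
    # the range, variance and standard deviation of `shifted`
    # equal those of `data`, since the spread is unchanged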

Grouped data, histograms and ogives

When the sample contains many different values, the data can be classified into groups. For example, the lengths of the stalks of hundreds of dandelions were measured and classified to form a grouped frequency table:

Stalk length (mm)       1.0-4.0   4.0-6.0   6.0-8.0   ...
Number of dandelions       9        25        34      ...

The diagram must illustrate not only the frequencies with which the data values occur within the different classes, but also the sizes of the class widths. A histogram is a kind of bar chart in which the widths of the bars represent the widths of the classes. As areas have more visual impact than lengths, we draw bars whose areas are proportional to the frequencies for each class, so the height of each bar is the frequency divided by the class width.

The scale of the vertical axis and its label, 'Frequency/class width', are often omitted, leaving the areas to indicate the relative sizes of the frequencies without conveying their exact values.

In order to estimate the mean, variance, etc., we assume that every value equals the mid-point of the class to which it belongs.
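
A minimal sketch of the mid-point estimates for the dandelion table above (only the three classes shown there):

    classes = [(1.0, 4.0), (4.0, 6.0), (6.0, 8.0)]
    freqs = [9, 25, 34]

    midpoints = [(lo + hi) / 2 for lo, hi in classes]        # 2.5, 5.0, 7.0
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n  # estimated mean
    var = sum(f * (m - mean) ** 2
              for f, m in zip(freqs, midpoints)) / (n - 1)   # estimated variance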

The cumulative frequency of x is the number of data values which are less than or equal to x. A graph of cumulative frequency against x should be a smoothly varying curve. It is called a cumulative frequency curve or ogive. The ogive can be constructed from the grouped frequency table.
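
Continuing the sketch above, the points of the ogive are the running totals of the frequencies, plotted at the upper boundary of each class:

    cumulative = []
    total = 0
    for f in freqs:
        total += f
        cumulative.append(total)   # 9, 34, 68
    ogive = [(hi, c) for (lo, hi), c in zip(classes, cumulative)]
    # [(4.0, 9), (6.0, 34), (8.0, 68)]: join these with a smooth curve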

Time series, frequency polygons and indices

A sequence of measurements corresponding to different instants or periods of time is called a time series. Note that a time series is a single sample in which two variables are measured: the quantity of interest and time.

Time series can be illustrated by a frequency polygon, which consists of straight lines joining the successive measurements. It is often necessary to focus attention on the movement of the figures in a time series rather than on their absolute size. An index is established for the series by expressing each data value as a percentage of the figure for a base year; the CPI is an example. A polygon drawn from the indices looks exactly the same as one drawn from the raw figures; only the scale on the vertical axis differs.
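
A minimal sketch of indexing a series (the yearly figures are invented; the first year is taken as the base year):

    series = [240, 252, 268, 281]             # invented yearly figures
    base = series[0]                          # base year figure
    index = [100 * x / base for x in series]  # [100.0, 105.0, ~111.7, ~117.1]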

Scatter diagrams

A scatter diagram is a plot of one variable against another showing how, if at all, they are related.

Thursday, December 25, 2008

The General Two-Person, Zero-Sum Game

The games in this chapter have no equilibrium points; if you are to win anything like the amount that you should, you must start second-guessing your opponent in earnest.

A strategy that prescribes the selection of a pure strategy by means of a random device is called a mixed strategy.

Once a player resorts to randomizing, he/she cannot be held below the value of the game (on average), no matter how well an opponent plays; but neither will he/she gain more than the value, however badly an opponent plays. The more capable your opponent, then, the more attractive the randomizing procedure.

von Neumann's Minimax Theorem

One can assign to every finite, two-person, zero-sum game a value V: the average amount that player I can expect to win from player II if both players act sensibly. This predicted outcome is plausible, for three reasons:
  1. Player I will not settle for anything less than V.
  2. Player I can be prevented from getting more than V.
  3. Since the game is zero-sum, player II is motivated to limit I's average return to V.
We can treat all two-person, zero-sum games as though they had equilibrium points. The only difference between games with actual equilibrium points and those without them is that in one case you can use a pure strategy and obtain the value of the game with certainty, while in the other case you must use a mixed strategy and you obtain the value of the game on average.

Calculating Mixed Strategies

A strategy may be dominated by either a pure strategy or a mixed one.

  1. Calculate the maximin and minimax - if they are equal, this is a game with exact equilibrium point(s). Otherwise, go on to step 2.
  2. Eliminate all strategies that are dominated.
  3. Assign probabilities to each of your strategies so that the outcome of the game will be the same on average whatever your opponent does. Assume your opponent does the same; if the outcome when you use this mixed strategy is the same as the outcome when your opponent uses his/her mixed strategy and if all probabilities are nonnegative, you have solved the game.
If there is a gap between the outcomes or if some of the probabilities are negative, reexamine the game for dominated strategies; if there are none, this method fails.

For example, consider the following payoff matrix (your strategies A, B, C as rows; your opponent's strategies D and E as columns):

                        Your Opponent
                          D     E
           A             15    10
    You    B              6    15
           C              5    20

    1/5 A and 4/5 C       7    18
    p A and (1-p) C    15p + 5(1-p)    10p + 20(1-p)
Your strategy B is dominated by a mixed strategy of A and C: playing A 1/5 of the time and C 4/5 of the time yields 7 against D and 18 against E, better than B's 6 and 15 in both cases. B is therefore eliminated.

Now assume that you play A with probability p and C with probability 1-p, and that your opponent plays D with probability r and E with probability 1-r. We need 15p + 5(1-p) = 10p + 20(1-p); therefore p = 3/4 and the corresponding payoff is 12.5 on average.

By a similar calculation, 15r + 10(1-r) = 5r + 20(1-r) gives r = 1/2: your opponent should play each of his/her strategies half the time, and the payoff is the same 12.5.
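
A minimal sketch of this equalizing calculation for any 2x2 game without a saddle point (the function name and layout are my own; exact fractions avoid rounding):

    from fractions import Fraction

    def solve_2x2(a, b, c, d):
        # Payoff matrix [[a, b], [c, d]]: your two rows, opponent's two columns.
        # Your p equalizes the columns:    a*p + c*(1-p) = b*p + d*(1-p)
        p = Fraction(d - c, (a - c) - (b - d))
        # Opponent's r equalizes the rows: a*r + b*(1-r) = c*r + d*(1-r)
        r = Fraction(d - b, (a - b) - (c - d))
        value = p * (a * r + b * (1 - r)) + (1 - p) * (c * r + d * (1 - r))
        return p, r, value

    # Rows A and C against columns D and E from the example above:
    print(solve_2x2(15, 10, 5, 20))   # p = 3/4, r = 1/2, value = 25/2, i.e. 12.5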

Notes

Player I can be stopped from getting any more; moreover, player II is motivated to stop I from getting more. If player I chooses another strategy, he/she is gambling; if II also gambles, there is no telling what will happen. The minimax strategies are attractive in that they offer security. The appeal of security is a matter of personal taste.

It should be emphasized that the minimax strategy is essentially defensive; when you use it, you often eliminate any chance of doing better.

Experimental studies show that people approach minimax strategies by "learning" during the game: they learn to react effectively to the specific behavior of the experimenter, adopting defensive strategies against clever opponents and aggressive strategies to exploit weaker ones.

The weakest part of the theory is undoubtedly the assumption that a player should always act so as to maximize the average payoff. The justification for the assumption is that, in the long run, not only the average return but the actual return will be maximized, because any pattern or fixed pure strategy will be recognized and exploited by the opponent. But if a game is played only once, a strategy that maximizes the average return is often not desirable, much less compelling.

In summary,
  1. The minimax strategy is secure but defensive, and thus not always what people choose (it depends on the opponent's level).
  2. It relies on the concept of average payoff, which implicitly assumes repeated play; for a game played only once, it may not be suitable.
  3. The zero-sum requirement is strict and does not always hold in real life. For this issue, refer to the next chapter on utility theory.

Some Thoughts

  1. Do we really need to eliminate dominated strategies? Perhaps, after solving the matrix, their corresponding probabilities would simply come out as 0.
  2. The method may fail. What should we do next?
  3. The definition of the payoff matrix may be a subjective process.

The Two-Person, Zero-Sum Game with Equilibrium Points

Zero-sum means the players have diametrically opposed interests.

Two strategies are said to be in equilibrium (they come in pairs, one for each player) if neither player gains by changing strategy unilaterally. The outcome corresponding to this pair of strategies is defined as the equilibrium point, which is considered stable because a player unilaterally picking a new strategy is hurt by the change.

For two-person, zero-sum games there may be more than one equilibrium point, but if there are several, they all have the same payoff.

How to find the equilibrium point

Theoretic method

If equilibrium points exist, they are easy to find. For a given payoff matrix (A's choice is represented as rows, while B's as columns):
  1. Since B will choose the column that yields the minimum value within whatever row A chooses, A should choose the row whose minimum is the largest; this value is called the maximin.
  2. Since A will choose the row that yields the maximum value within whatever column B chooses, B should choose the column whose maximum is the smallest; this value is called the minimax.
  3. If the minimax equals the maximin, that payoff is an equilibrium point and the corresponding strategies are an equilibrium strategy pair.
When an equilibrium point exists in a two-person, zero-sum game, it is called the solution. The reasons why equilibrium points are considered solutions are:
  1. By playing his/her equilibrium strategy, a player will get at least the value of the game.
  2. By playing his/her equilibrium strategy, an opponent can stop a player from getting any more than the value of the game.
  3. Since the game is zero-sum, a player's opponent is motivated to minimize the player's payoff.
In games with equilibrium points, payoffs that are not associated with either equilibrium strategy have no bearing on the outcome. That is, changing these values will not change the players' strategies.
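
A minimal sketch of the maximin/minimax test above (the names are my own; rows are A's strategies, columns are B's, and entries are payoffs to A):

    def equilibrium(matrix):
        row_mins = [min(row) for row in matrix]
        col_maxs = [max(col) for col in zip(*matrix)]
        maximin, minimax = max(row_mins), min(col_maxs)
        if maximin == minimax:   # saddle point: the game has a solution
            return row_mins.index(maximin), col_maxs.index(minimax), maximin
        return None              # no equilibrium point: a mixed strategy is needed

    print(equilibrium([[4, 2], [3, 1]]))              # (0, 1, 2): row 1, column 2, value 2
    print(equilibrium([[15, 10], [6, 15], [5, 20]]))  # None: the A/B/C game above has no saddle point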

Simplification with domination

It is often possible to simplify a game by eliminating dominated strategies.

Strategy A dominates strategy B if a player's payoff with A is
  • always at least as much as that with B (whatever the other players do), and
  • at least some of the time strictly better than with B.
If all strategies but one are dominated for each player, the equilibrium point(s) can be calculated.

If there is no equilibrium point, a mixed strategy is required. Refer to the next chapter.
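
A minimal sketch of the domination test (names invented), reusing the A/B/C matrix from the game above:

    def dominates(a, b):
        # Row `a` dominates row `b`: never worse, and strictly better at least once.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    A, B, C = [15, 10], [6, 15], [5, 20]
    print(dominates(A, B))    # False: no pure strategy dominates B...
    mix = [x / 5 + 4 * y / 5 for x, y in zip(A, C)]   # 1/5 A and 4/5 C -> [7.0, 18.0]
    print(dominates(mix, B))  # True: ...but this mixed strategy does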

Saturday, December 20, 2008

An Overview

The theory of games is a theory of decision making. While decision makers are trying to manipulate their environment, their environment is trying to manipulate them.

Statistical methodology

The first phase of a statistical investigation is the identification of the population together with the variables which will be measured.

A population is a totality of entities about which we hope to draw conclusions.

A property of a member of a population which varies from one individual to another is called a random variable.

Data can be classified by its level of measurement:
  • Nominal: items are classified into categories with no inherent order.
  • Ordinal: the categories are ranked in some sort of logical order.
  • Metric: the quantity measured has a physical significance.
Data can also be classified by the continuity of the scale:
  • Continuous: can take any value within a certain range.
  • Discrete: there are gaps in the scale between allowable values.
A factor is an effect we can control by setting its level before any variables are measured. Notice that a factor is not a random variable, as its value is completely within our control as part of the experimental design.

Each factor can be operative at two or more levels and we define a treatment to be a combination of specific levels of each factor present.

Each treatment defines a separate sample, a set of entities from a population. It is emphasized that the items in a sample must be homogeneous with respect to the characteristics we are studying. The fundamental assumption in a statistical analysis is that members of a sample are identical with each other except for that variability we are prepared to write off as being due to random unexplained variation.

The Law of Large Numbers states that the larger the size of sample, the better its average estimates the corresponding average of the population.
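
A quick simulation sketch of this (fair coin flips, so the population mean is 0.5; the sample sizes are arbitrary):

    import random

    random.seed(1)
    for n in (10, 100, 10_000):
        heads = sum(random.random() < 0.5 for _ in range(n))
        print(n, heads / n)   # the sample mean settles toward the population mean 0.5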

Most statistical techniques assume that the sample is randomly selected, and every single member of the population has an equal chance of being included in the sample. Individual data measurements must be statistically independent of each other and ideally should not interact at all.

Methods of analysis range from the calculation of summary statistics like averages and the drawing of diagrams like bar charts to more sophisticated techniques like analysis of variance and regression. Because the information content of the whole operation is being compressed into a diagram or a simple statement, the result can be like a woman's bikini - what it reveals is interesting but what it covers up is vital.