4 - Statistics and probability (IB Maths: Analysis and Approaches HL 166711)

#4.1

Population, sampling, outliers

Concepts of population, sample, random sample, discrete and continuous data.

This topic is designed to cover the key questions that students should ask when they see a data set or an analysis.

Reliability of data sources and bias in sampling.

Dealing with missing data, errors in the recording of data.

Interpretation of outliers.

An outlier is defined as a data item that lies more than 1.5 × interquartile range (IQR) from the nearest quartile.

Awareness that, in context, some outliers are a valid part of the sample but some outlying data items may be an error in the sample.
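As an illustration of the 1.5 × IQR rule, here is a minimal Python sketch. The function name `find_outliers` and the sample data are hypothetical. The quartiles come from `statistics.quantiles` with the `inclusive` method, which is only one of several quartile conventions, so a calculator may flag slightly different items.

```python
from statistics import quantiles

def find_outliers(data):
    """Return items lying more than 1.5 * IQR below Q1 or above Q3."""
    # Assumption: the 'inclusive' quartile method; other methods can differ.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

sample = [2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data
flagged = find_outliers(sample)
```

Whether a flagged item is a recording error or a valid part of the sample still has to be judged in context.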

Link to: box and whisker diagrams (SL 4.2) and measures of dispersion (SL 4.3).

Sampling techniques and their effectiveness.

Simple random, convenience, systematic, quota and stratified sampling methods.

#4.2

Presentation of data

Presentation of data (discrete and continuous): frequency distributions (tables).

Class intervals will be given as inequalities, without gaps.

Histograms.

Frequency histograms with equal class intervals.

Cumulative frequency; cumulative frequency graphs; use to find median, quartiles, percentiles, range and interquartile range (IQR).

Not required: Frequency density histograms.

Production and understanding of box and whisker diagrams.

Use of box and whisker diagrams to compare two distributions, using symmetry, median, interquartile range or range. Outliers should be indicated with a cross.

Determining whether the data may be normally distributed by consideration of the symmetry of the box and whiskers.

#4.3

Measures of central tendency and dispersion

Measures of central tendency (mean, median and mode).

Estimation of mean from grouped data.

Calculation of mean using formula and technology.

Students should use mid-interval values to estimate the mean of grouped data.
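The mid-interval estimate can be sketched in Python as follows. The class boundaries, frequencies and the name `grouped_mean` are made up for illustration only.

```python
def grouped_mean(classes, frequencies):
    """Estimate the mean using the mid-interval value of each class."""
    midpoints = [(lower + upper) / 2 for lower, upper in classes]
    total = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total

# Hypothetical grouped data: 0 <= x < 10, 10 <= x < 20, 20 <= x < 30
classes = [(0, 10), (10, 20), (20, 30)]
frequencies = [2, 5, 3]
estimate = grouped_mean(classes, frequencies)  # (5*2 + 15*5 + 25*3) / 10
```

The result is only an estimate, because every item in a class is treated as if it sat at the class midpoint.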

Modal class.

For equal class intervals only.

Measures of dispersion (interquartile range, standard deviation and variance).

Calculation of standard deviation and variance of the sample using only technology; however, hand calculations may enhance understanding.

Variance is the square of the standard deviation.

Effect of constant changes on the original data.

Examples: If three is subtracted from each data item, the mean decreases by three but the standard deviation is unchanged.

If all the data items are doubled, the mean is doubled and the standard deviation is also doubled.
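Both effects can be checked numerically. The sketch below uses made-up data and the population standard deviation (`statistics.pstdev`); that choice is an assumption, and the same behaviour holds for the sample version.

```python
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]      # made-up data: mean 5, sd 2
shifted = [x - 3 for x in data]      # subtract three from each item
doubled = [2 * x for x in data]      # double each item

summary = {
    "original": (mean(data), pstdev(data)),
    "shifted": (mean(shifted), pstdev(shifted)),
    "doubled": (mean(doubled), pstdev(doubled)),
}
```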

Quartiles of discrete data.

Using technology. Awareness that different methods for finding quartiles exist and therefore the values obtained using technology and by hand may differ.

#4.4

Linear correlation

Linear correlation of bivariate data.

Pearson’s product-moment correlation coefficient, $r$.

Technology should be used to calculate $r$. However, hand calculations of $r$ may enhance understanding.

Critical values of $r$ will be given where appropriate.

Students should be aware that Pearson’s product-moment correlation coefficient ($r$) is only meaningful for linear relationships.
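A hand calculation of $r$ can be mirrored in a few lines of Python. This is a sketch rather than a replacement for calculator output; the function name `pearson_r` and both data sets are hypothetical. The second data set lies exactly on a parabola, and $r$ still comes out as zero, which illustrates why $r$ only describes linear association.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from the sums S_xy, S_xx and S_yy."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r_linear = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])               # points on y = 2x + 1
r_quadratic = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # points on y = x^2
```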

Scatter diagrams; lines of best fit, by eye, passing through the mean point.

Positive, zero, negative; strong, weak, no correlation.

Students should be able to make the distinction between correlation and causation and know that correlation does not imply causation.

Equation of the regression line of $y$ on $x$.

Technology should be used to find the equation.

Use of the equation of the regression line for prediction purposes.

Interpret the meaning of the parameters, $a$ and $b$, in a linear regression $y=ax+b$.

Students should be aware of the dangers of extrapolation, and that they cannot always reliably make a prediction of $x$ from a value of $y$ when using a $y$ on $x$ line.
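In examinations the equation comes from technology, but a least-squares sketch shows what the technology is doing. The function name `regression_y_on_x` and the data are made up. Here $a$ is the gradient and $b$ the $y$-intercept, matching $y=ax+b$.

```python
def regression_y_on_x(xs, ys):
    """Least-squares line y = a*x + b for predicting y from x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # gradient: change in y per unit change in x
    b = mean_y - a * mean_x    # intercept: the line passes through the mean point
    return a, b

xs = [1, 2, 3, 4, 5]                 # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = regression_y_on_x(xs, ys)
predicted = a * 3.5 + b              # interpolation: 3.5 lies inside the data range
```

Using the same line at, say, $x=50$ would be extrapolation, since nothing in the data supports the linear model that far outside the observed range.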

#4.5

Probability

Concepts of trial, outcome, equally likely outcomes, relative frequency, sample space ($U$) and event.

The probability of an event $A$ is $\text{P}(A)=\dfrac{n(A)}{n(U)}$.

The complementary events $A$ and $A'$ (not $A$).

Sample spaces can be represented in many ways, for example as a table or a list.

Experiments using coins, dice, cards and so on, can enhance understanding of the distinction between experimental (relative frequency) and theoretical probability.

Simulations may be used to enhance this topic.
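One possible simulation, sketched in Python, compares the relative frequency of rolling a six with the theoretical value $\frac{1}{6}$. The function name, the number of trials and the seed are arbitrary choices made for this illustration.

```python
import random

def relative_frequency_of_six(trials, seed=0):
    """Simulate fair die rolls and return the proportion that show a six."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
    return sixes / trials

experimental = relative_frequency_of_six(60_000)
theoretical = 1 / 6
```

With more trials the experimental value tends to settle closer to the theoretical one, although any single run can still differ from it.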

Expected number of occurrences.

Example: If there are 128 students in a class and the probability of being absent is 0.1, the expected number of absent students is 12.8.

#4.6

Probability of combined events

Use of Venn diagrams, tree diagrams, sample space diagrams and tables of outcomes to calculate probabilities.

Combined events:

$\text{P}(A∪B)=\text{P}(A)+\text{P}(B)−\text{P}(A∩B)$.

Mutually exclusive events:

$\text{P}(A∩B)=0$.

The non-exclusivity of “or”.

Conditional probability:

$\text{P}(A|B)=\dfrac{\text{P}(A∩B)}{\text{P}(B)}$.

An alternate form of this is:

$\text{P}(A∩B)=\text{P}(B)\text{P}(A|B)$.

Problems can be solved with the aid of a Venn diagram, tree diagram, sample space diagram or table of outcomes without explicit use of formulae.

Probabilities with and without replacement.

Independent events:

$\text{P}(A∩B)=\text{P}(A)\text{P}(B)$.
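The formulae in this section can be checked by listing a sample space. The sketch below uses two fair dice, with $A$ as "the first die is even" and $B$ as "the total is 7"; these events and all names in the code are chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

U = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def P(event):
    """Probability as n(event) / n(U), kept exact with Fraction."""
    return Fraction(sum(1 for outcome in U if event(outcome)), len(U))

A = lambda o: o[0] % 2 == 0        # first die is even
B = lambda o: o[0] + o[1] == 7     # total is 7
both = lambda o: A(o) and B(o)
either = lambda o: A(o) or B(o)    # inclusive "or"

p_union = P(A) + P(B) - P(both)           # combined events formula
p_A_given_B = P(both) / P(B)              # conditional probability
independent = P(both) == P(A) * P(B)      # independence check
```

For these two events the product rule holds, so knowing that the total is 7 does not change the probability that the first die is even.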

#4.7

Discrete random variables

Concept of discrete random variables and their probability distributions.

Expected value (mean), for discrete data.

Applications.

Probability distributions will be given in the following ways:

$X$	1	2	3	4	5
$\text{P}(X=x)$	0.1	0.2	0.15	0.05	0.5

$\text{P}(X=x)=\dfrac{1}{18}(4+x), x∈\{1,2,3\}$

$\text{E}(X)=0$ indicates a fair game where $X$ represents the gain of a player.
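As a sketch, both forms of distribution shown above can be stored as a mapping from values to probabilities and passed to the same expected-value function. The names `table`, `formula` and `expected_value` are illustrative only.

```python
from fractions import Fraction

# Distribution given as a table
table = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.05, 5: 0.5}

# Distribution given by the formula P(X = x) = (4 + x) / 18 for x in {1, 2, 3}
formula = {x: Fraction(4 + x, 18) for x in (1, 2, 3)}

def expected_value(dist):
    """E(X) = sum of x * P(X = x) over all values x."""
    return sum(x * p for x, p in dist.items())

e_table = expected_value(table)      # approximately 3.65
e_formula = expected_value(formula)  # exactly 19/9
```

In both cases the probabilities sum to 1, which is worth checking before computing $\text{E}(X)$.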

#4.8

Binomial distribution

Binomial distribution.

Mean and variance of the binomial distribution.

Situations where the binomial distribution is an appropriate model.

In examinations, binomial probabilities should be found using available technology.

Not required: Formal proof of mean and variance.

Link to: expected number of occurrences (SL 4.5).
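Although binomial probabilities are found with technology in examinations, a short sketch shows where the numbers come from and confirms numerically (not as a proof) that the mean is $np$ and the variance is $np(1-p)$. The values $n=10$ and $p=0.3$ and the function name are arbitrary.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(pmf))                    # compare with n * p
variance = sum(k**2 * q for k, q in enumerate(pmf)) - mean**2   # compare with n * p * (1 - p)
```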

#4.9

Normal distribution

The normal distribution and curve.

Properties of the normal distribution.

Diagrammatic representation.

Awareness of the natural occurrence of the normal distribution.

Students should be aware that approximately 68% of the data lies within $μ±σ$, 95% within $μ±2σ$ and 99.7% within $μ±3σ$.
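These percentages can be reproduced with Python's `statistics.NormalDist`. The mean of 50 and standard deviation of 4 are arbitrary; the same proportions appear for any normal distribution.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=4)  # hypothetical parameters

def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma)."""
    return X.cdf(X.mean + k * X.stdev) - X.cdf(X.mean - k * X.stdev)

within_1, within_2, within_3 = prob_within(1), prob_within(2), prob_within(3)
```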

Normal probability calculations.

Probabilities and values of the variable must be found using technology.

Inverse normal calculations.

For inverse normal calculations mean and standard deviation will be given.

This does not involve transformation to the standardized normal variable $z$.
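A calculator's normal and inverse normal functions have a rough counterpart in `statistics.NormalDist`. The sketch below assumes a made-up variable with mean 100 and standard deviation 15: `cdf` gives a probability from a value, and `inv_cdf` gives a value from a probability.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)  # hypothetical mean and standard deviation

p_below_110 = X.cdf(110)    # normal probability: P(X < 110)
cutoff = X.inv_cdf(0.9)     # inverse normal: value a such that P(X < a) = 0.9
```

No conversion to $z$ is needed here because the mean and standard deviation are supplied directly.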

#4.10

Regression line

Equation of the regression line of $x$ on $y$.

Use of the equation for prediction purposes.

Students should be aware that they cannot always reliably make a prediction of $y$ from a value of $x$, when using an $x$ on $y$ line.

#4.11

Conditional probabilities, independent events

Formal definition and use of the formulae:

$\text{P}(A|B)=\dfrac{\text{P}(A∩B)}{\text{P}(B)}$ for conditional probabilities, and

$\text{P}(A|B)=\text{P}(A)=\text{P}(A|B')$ for independent events.

An alternate form of the conditional probability formula is: $\text{P}(A∩B)=\text{P}(B)\text{P}(A|B)$.

Testing for independence.
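One way to test for independence from a two-way table is to compare $\text{P}(A)$, $\text{P}(A|B)$ and $\text{P}(A|B')$ directly. The counts below are invented so that the three values agree; with real data they rarely match exactly.

```python
from fractions import Fraction

# Hypothetical two-way table of counts
counts = {("A", "B"): 12, ("A", "B'"): 18, ("A'", "B"): 28, ("A'", "B'"): 42}
total = sum(counts.values())

n_B = counts[("A", "B")] + counts[("A'", "B")]
n_not_B = counts[("A", "B'")] + counts[("A'", "B'")]

p_A = Fraction(counts[("A", "B")] + counts[("A", "B'")], total)
p_A_given_B = Fraction(counts[("A", "B")], n_B)
p_A_given_not_B = Fraction(counts[("A", "B'")], n_not_B)

# Independent exactly when P(A|B) = P(A) = P(A|B')
independent = p_A == p_A_given_B == p_A_given_not_B
```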

#4.12

Standard normal distribution

Standardization of normal variables ($z$-values).

Probabilities and values of the variable must be found using technology.

The standardized value ($z$) gives the number of standard deviations from the mean.

Inverse normal calculations where mean and standard deviation are unknown.

Use of $z$-values to calculate unknown means and standard deviations.
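A typical problem of this kind gives two probabilities and asks for $μ$ and $σ$. The sketch below assumes the made-up conditions $\text{P}(X<10)=0.2$ and $\text{P}(X<20)=0.9$, converts each probability to a $z$-value, and solves the two equations $10=μ+z_1σ$ and $20=μ+z_2σ$.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Hypothetical conditions: P(X < 10) = 0.2 and P(X < 20) = 0.9
z1 = Z.inv_cdf(0.2)
z2 = Z.inv_cdf(0.9)

# Subtracting the two equations eliminates mu
sigma = (20 - 10) / (z2 - z1)
mu = 10 - z1 * sigma
```

Substituting the results back into a normal distribution should reproduce the two given probabilities.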

#4.13 (AHL)

Bayes' theorem

Use of Bayes' theorem for a maximum of three events.

Link to: independent events (SL 4.6).
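For three events, Bayes' theorem divides each product $\text{P}(B_i)\text{P}(A|B_i)$ by the sum of all three products. A minimal sketch follows; the priors, likelihoods and the name `bayes_posteriors` are invented for illustration.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(B_i | A) for each i, given P(B_i) and P(A | B_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B_i) * P(A | B_i)
    total = sum(joint)                                    # P(A), by total probability
    return [j / total for j in joint]

# Hypothetical example: three machines B1, B2, B3 and A = "item is faulty"
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]
posteriors = bayes_posteriors(priors, likelihoods)
```

The same calculation can be read off a tree diagram by dividing one branch product by the sum of the three branch products that end in $A$.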

#4.14 (AHL)

Discrete and continuous random variables

Variance of a discrete random variable.

Link to: discrete random variables (SL 4.7).

Continuous random variables and their probability density functions.

$f(x)≥0$ and $\displaystyle\int_{-∞}^{∞}f(x)\,\text{d}x=1$, including piecewise functions.

Mode and median of continuous random variables.

For a continuous random variable, a value at which the probability density function has a maximum value is called a mode and for the median:

$\displaystyle\int_{-∞}^{m}f(x)\,\text{d}x=\dfrac{1}{2}$.
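The median condition can be explored numerically. The sketch below assumes the made-up density $f(x)=\frac{x}{2}$ on $0≤x≤2$ (zero elsewhere), integrates it with a simple midpoint rule, and finds $m$ by bisection. The function names are hypothetical, and in practice the integral would be done by hand or with a calculator.

```python
def pdf(x):
    """Hypothetical density: f(x) = x / 2 on [0, 2], zero elsewhere."""
    return x / 2 if 0 <= x <= 2 else 0.0

def integrate(f, a, b, n=2000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def median_of_pdf(f, lo, hi, tol=1e-6):
    """Bisection for m such that the area under f from lo to m is 1/2."""
    a, b = lo, hi
    while b - a > tol:
        m = (a + b) / 2
        if integrate(f, lo, m) < 0.5:
            a = m
        else:
            b = m
    return (a + b) / 2

total_area = integrate(pdf, 0, 2)   # should be 1 for a valid density
m = median_of_pdf(pdf, 0, 2)        # analytically, m^2 / 4 = 1/2, so m = sqrt(2)
```

For this density the mode is at $x=2$, where $f$ takes its maximum value.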

Mean, variance and standard deviation of both discrete and continuous random variables.

Use of the notation $\text{E}(X)$, $\text{E}(X^2)$, $\text{Var}(X)$, where $\text{Var}(X)=\text{E}(X^2)−[\text{E}(X)]^2$, and related formulae.

Use of $\text{E}(X)$ for “fair” games.

The effect of linear transformations of $X$.

$\text{E}(aX+b)=a\text{E}(X)+b$
$\text{Var}(aX+b)=a^2\text{Var}(X)$
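Both results can be checked numerically on a small discrete distribution. The sketch below uses an arbitrary five-value distribution and the arbitrary constants $a=2$, $b=3$; the helper names `E` and `Var` are chosen to mirror the notation.

```python
dist = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.05, 5: 0.5}  # illustrative distribution

def E(dist, g=lambda x: x):
    """Expected value of g(X); defaults to E(X)."""
    return sum(g(x) * p for x, p in dist.items())

def Var(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    return E(dist, lambda x: x**2) - E(dist) ** 2

a, b = 2, 3  # arbitrary constants
transformed = {a * x + b: p for x, p in dist.items()}  # distribution of aX + b

lhs_mean, rhs_mean = E(transformed), a * E(dist) + b
lhs_var, rhs_var = Var(transformed), a**2 * Var(dist)
```

The constant $b$ shifts every value equally, so it changes the mean but drops out of the variance.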
