Chapter 9: Sampling Distributions

We often want to know something about an entire population; however, due to time, cost, and other restrictions, we can usually only take a sample of the target population. The idea of a sample (as discussed in Chapter 1) is to reason from the part to the whole, so a sample must be chosen carefully by following established procedures. A population distribution is derived from information on all elements of the population being studied; a sampling distribution, in comparison, is built from random samples that represent the entire population. There are many sampling methodologies, but the five most common are random sampling, systematic sampling, convenience sampling, cluster sampling, and stratified sampling. In Chapter 7, we discussed selecting a random sample, where we illustrated how to draw a simple random sample by using the sample command:

Simple Random sampling in R:

> x <- 1:10   # your dataset, the integers from 1 to 10
> sample(x)   # a random permutation of x
 [1] 3 2 1 10 7 9 4 8 6 5
> sample(10)  # shorthand for sample(1:10): another random permutation
 [1] 10 7 4 8 2 6 1 9 5 3
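sample() can also draw a subset of the data, with or without replacement, and set.seed() makes a draw reproducible. A brief sketch (the seed value 123 is arbitrary):

```r
# Fix the random seed so the results can be reproduced
set.seed(123)

x <- 1:10                           # the dataset from 1 to 10
s1 <- sample(x, 3)                  # 3 elements drawn without replacement
s2 <- sample(x, 3, replace = TRUE)  # with replacement, values may repeat
s1
s2
```

Without replacement, each element can appear at most once; with replacement, the same element may be drawn more than once.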

However, there are other methodologies for sampling, including systematic sampling, cluster sampling, and stratified sampling.

Systematic sampling command in R:

# population of size N
# the values of N and n must be provided
> sys.sample = function(N, n) {
    k = ceiling(N/n)
    # ceiling(x) rounds x up to the nearest integer;
    # this means ceiling(2.1) = 3
    r = sample(1:k, 1)                 # random start between 1 and k
    sys.samp = seq(r, r + k*(n-1), k)  # every k-th unit after the start
    cat("The selected systematic sample is:", sys.samp, "\n")
    # Note: the trailing "\n" prints the result on its own line
  }
# To select a systematic sample, supply the values of N and n:
> sys.sample(50, 5)
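To see what sys.sample(50, 5) is doing internally, here is the same logic step by step: with N = 50 and n = 5 the interval is k = ceiling(50/5) = 10, so the function picks a random start between 1 and 10 and takes every 10th unit after it (set.seed(1) is arbitrary, for reproducibility):

```r
N <- 50; n <- 5
k <- ceiling(N / n)                      # sampling interval: 10
set.seed(1)
r <- sample(1:k, 1)                      # random starting unit between 1 and k
sys.samp <- seq(r, r + k * (n - 1), k)   # every k-th unit from the start
sys.samp                                 # five units, each 10 apart
```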

Convenience sampling in R:

R has no special function for convenience sampling, since it is a non-probability method: the researcher simply uses the units that are easiest to reach. Researchers can, however, use utilities in R to summarize the data collected from a convenience sample.
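For example, ratings gathered by convenience (say, a hypothetical mall-intercept survey on a 1-to-5 scale) can still be summarized with base R utilities; the numbers below are made up for illustration:

```r
# Made-up ratings from a convenience sample (1 = poor, 5 = excellent)
responses <- c(4, 5, 3, 4, 2, 5, 4, 3)
summary(responses)   # five-number summary plus the mean
table(responses)     # frequency of each rating
```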

Cluster sampling in R:

A cluster sample is a probability sample in which each sampling unit is a collection or a group of elements. It is useful when a list of elements of the population is not available but it is easy to obtain a list of clusters.
In R:
> cluster.mu <- function(N, m.vec, y, total = TRUE, M = NA) {
    # N = number of clusters in the population
    # M = number of elements in the population
    # m.vec = vector of the cluster sizes in the sample
    # y = either a vector of totals per cluster, or a list of
    #     the observations per cluster (this is set by total)
    n <- length(m.vec)
    # If M is unknown, m.bar is estimated with the mean of m.vec
    if (is.na(M)) {mbar <- mean(m.vec)} else {mbar <- M/N}
    # If totals per cluster were not supplied, compute them
    if (total == FALSE) {y <- unlist(lapply(y, sum))}
    mu.hat <- sum(y)/sum(m.vec)                    # estimated mean per element
    s2.c <- sum((y - mu.hat*m.vec)^2)/(n - 1)      # between-cluster variance
    var.mu.hat <- ((N - n)/(N*n*mbar^2))*s2.c      # estimated variance of mu.hat
    B <- 2*sqrt(var.mu.hat)                        # bound on the error of estimation
    cbind(mu.hat, s2.c, var.mu.hat, B)
  }
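A small sketch of the estimator at the heart of this function, using made-up numbers: suppose we sampled 3 clusters with sizes 4, 6, and 5 whose totals are 20, 36, and 25. The estimated mean per element is the sum of the cluster totals divided by the total number of elements sampled:

```r
m.vec <- c(4, 6, 5)             # sizes of the sampled clusters (made-up)
y     <- c(20, 36, 25)          # totals per sampled cluster (made-up)
mu.hat <- sum(y) / sum(m.vec)   # 81 / 15
mu.hat                          # 5.4
```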

Stratified sampling:

This type of sampling requires the population to be divided into smaller groups, called 'strata'. A random sample is then taken from each stratum, or group.
There is no built-in R function for this, but we can write one, stratified_sampling, ourselves:
> stratified_sampling <- function(df, id, size) {
    # df is the data to sample from
    # id is the column to use for the groups to sample
    # size is the count you want to sample from each group
    # Order the data based on the groups
    df <- df[order(df[, id], decreasing = FALSE), ]
    # Get the unique groups and the counts per group
    groups <- unique(df[, id])
    group.counts <- c(0, table(df[, id]))
    rows <- mat.or.vec(nr = size, nc = length(groups))
    # Generate a matrix of sampled row numbers, one column per group
    for (i in 1:(length(group.counts) - 1)) {
      start.row <- sum(group.counts[1:i]) + 1
      samp <- sample(group.counts[i + 1], size, replace = FALSE) - 1
      rows[, i] <- start.row + samp
    }
    sample.rows <- as.vector(rows)
    df[sample.rows, ]
  }
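Base R can also do this directly with split() and sample(); a sketch on a made-up data frame with two strata:

```r
# Made-up data: two strata, "A" and "B", with 5 rows each
df <- data.frame(stratum = rep(c("A", "B"), each = 5),
                 value   = 1:10)
set.seed(42)   # arbitrary seed, for reproducibility
# Draw 2 rows at random from each stratum and stack the results
samp <- do.call(rbind, lapply(split(df, df$stratum),
                              function(g) g[sample(nrow(g), 2), ]))
samp
```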

Sampling Distribution

The sampling distribution of a particular statistic is the distribution of all possible values of the statistic, computed from samples of the same size randomly drawn from the same population. To construct a sampling distribution properly, every unit in the population must have an equal chance of selection.

In other words, a sampling distribution is the probability distribution of a statistic computed from samples drawn from a finite population with mean (μ) and variance (σ²). Because the sampling distribution is used to estimate the behavior of the entire population from the sample, we compare the population values, known as parameters, to the sample values, known as statistics. What happens if we do not know the distribution of x? The Central Limit Theorem tells us what we should expect.

The Central Limit Theorem

The Central Limit Theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the Central Limit Theorem is hard to overstate; indeed it is the reason that many statistical procedures work. For any distribution with finite mean and standard deviation, samples taken from that population will tend towards a normal distribution around the mean of the population as sample size increases. Furthermore, as sample size increases, the variation of the sample means will decrease.
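We can watch the theorem at work by averaging samples drawn from a distribution that is far from normal; here is a sketch using the exponential distribution, whose mean and standard deviation are both 1 (the seed and the choices of 1,000 repetitions and n = 30 are arbitrary):

```r
set.seed(7)   # arbitrary seed, for reproducibility
# 1,000 sample means, each computed from n = 30 exponential draws
xbar <- replicate(1000, mean(rexp(30, rate = 1)))
mean(xbar)                # close to the population mean, 1
sd(xbar)                  # close to sigma / sqrt(n) = 1 / sqrt(30), about 0.18
hist(xbar, prob = TRUE)   # roughly bell-shaped despite the skewed parent
```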

The formulas of the Central Limit Theorem:

1. The mean of the sampling distribution equals the population mean:
   μ(x̄) = μ

2. The z-score of a sample mean:
   z = (x̄ − μ) / (σ / √n)

3. The standard deviation of the sampling distribution (the standard error):
   σ(x̄) = σ / √n

Where:
n = sample size
x̄ = the sample mean

In order to calculate the three formulas above, we follow 5 common steps to finalize our result. 
Step 1: Calculate the following values: the mean (average, or μ), the standard deviation (σ), the population size, the sample size (n), and the "greater than" value of interest, represented by x̄.
Step 2:  Draw a graph that identifies the mean.
Step 3: Find the z-score using the second formula by:
3.1: Subtracting the mean (μ in step 1) from the 'greater than' value (x̄ in step 1).
3.2: Dividing the standard deviation (σ in step 1) by the square root of the sample size (n in step 1).
3.3: Dividing the result of step 3.1 by the result of step 3.2.
Step 4: Look up the z-score from step 3 in a z-table to obtain a probability (a decimal).
Step 5: Convert the decimal in step 4 to a percentage.
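The five steps can be sketched in R; the values μ = 100, σ = 10, n = 25, and x̄ = 104 below are made up for illustration:

```r
mu    <- 100   # population mean (step 1)
sigma <- 10    # population standard deviation (step 1)
n     <- 25    # sample size (step 1)
xbar  <- 104   # the "greater than" value (step 1)

z <- (xbar - mu) / (sigma / sqrt(n))   # steps 3.1-3.3
z                                      # 4 / 2 = 2

p <- 1 - pnorm(z)   # step 4: probability a sample mean exceeds 104
p * 100             # step 5: as a percentage, about 2.28%
```

pnorm() plays the role of the z-table, returning the cumulative probability for a z-score.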

To see this in R, we will use the rnorm command to draw 500 random numbers from a normal distribution. The mean for this distribution is 100 and the standard deviation is 10.

> x = rnorm(500, mean = 100, sd = 10)
> x

  [1] 111.55209  97.24050  92.68848  82.52463 109.31802 108.32459  83.00670  89.26875
  [9]  75.70270 123.58992 102.09190 119.77364  88.12079 106.97782  98.59419 103.14760
 [17]  94.61791  88.13449  99.29828  91.85665 104.26802  94.80050 108.50890  84.33962
...   (intermediate values omitted)
[497] 100.51997  82.10726 105.22389 105.46758

When you scan the numbers stored in x, it is hard to tell whether the mean of this distribution is equal to 100, as we asked. To get a better view, we will use the histogram function to visualize the distribution of the 500 random values.

To compare the simulation to the distributional values, we use R to compute the sample mean and standard deviation of x and then create a histogram. How do these values compare to the parameters we specified (μ = 100, σ = 10)?

The code in R:
>mean(x)
[1] 99.86607
>sd(x)
[1] 10.17966

The code for a histogram in R is hist(x, prob=TRUE, col="red"):
> hist(x, prob=TRUE, col="red")
The result is a roughly bell-shaped histogram centered near 100.

The Central Limit Theorem example

Using the sample mean [x̄ = 99.86607] and standard deviation [s = 10.17966] computed above, we can calculate the z-score for any given value of x. In this example, we will use the value x = 103.6256:

> z <- (103.6256 - 99.86607) / 10.17966
> z
[1] 0.3693178
Find the value on the z-score table.
Looking up z = 0.37 gives a cumulative probability of about 0.6443; that is, roughly 64.43% of the values fall below 103.6256, and about 35.57% fall above it.
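Rather than a printed z-table, R's pnorm() returns the cumulative probability for a z-score directly; for example, for a z-score of 0.37:

```r
pnorm(0.37)              # cumulative probability: about 0.6443
(1 - pnorm(0.37)) * 100  # percentage of values above the cutoff: about 35.57
```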

Next, Chapter 10, Confidence Interval Estimation
Previous, Chapter 8, Random Variables and Probability

A Primer for Using Open Source R Software for Accessibility and Visualization