Chapter 7: Probability Theory - Statistics for LIS with Open Source R

The term probability is often associated with winning or losing in gambling. In probability theory, a numerical measure indicates how likely an event is to occur. All probability values are between 0 and 1, inclusive. When the probability of an event is equal to 0, the event is impossible. When the probability of an event is equal to 1, the event is certain to occur. Events with probabilities nearer to 1 are more likely to occur.

In R you can simulate these situations with the sample function. To randomly pick five numbers from the set 1–20, type:

>sample (1:20, 5)
[1] 18 14 8 1 4

R will come back with five randomized numbers, in this case 18, 14, 8, 1, 4. In your case, the numbers might be different since they are random.

Notice that the default behavior of the sample function is sampling without replacement. That is, the sample will not contain the same number twice, and size cannot be bigger than the length of the vector to be sampled. If you want to sample with replacement, add the argument replace = TRUE (or T). When size is omitted, as below, sample draws as many values as the vector contains:

>sample (1:20, replace = T)
[1] 13 1 19 17 8 16 8 1 11 3 5 1 18 20 18 12 15 20 9 3
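
The difference between the two sampling modes can be checked directly. The sketch below is illustrative only: the seed value and the variable names (no_repl, too_many, with_repl, tosses, prop_heads) are assumptions, not from the text, and set.seed is used only so that repeated runs give the same draws.

```r
# Fix the random seed so the draws are repeatable (the value 42 is arbitrary).
set.seed(42)

# Without replacement: five distinct values from 1..20.
no_repl <- sample(1:20, 5)

# Asking for more values than the vector holds fails without replacement;
# tryCatch turns the error into a marker string instead of stopping the script.
too_many <- tryCatch(sample(1:20, 25), error = function(e) "error")

# With replacement the same request succeeds, and repeats must occur
# because 25 draws are taken from only 20 possible values.
with_repl <- sample(1:20, 25, replace = TRUE)

# A fair coin is a two-element vector sampled with replacement; the
# proportion of heads should be close to 0.5 for a large number of tosses.
tosses <- sample(c("H", "T"), 10000, replace = TRUE)
prop_heads <- mean(tosses == "H")
```

Increasing the number of tosses should bring prop_heads closer to 0.5, which is the relative-frequency reading of a probability.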

In fair coin tossing, the probability of heads equals the probability of tails, but the idea of a random event is not restricted to symmetric cases. It applies equally to other situations, such as the successful outcome of any procedure. A sample space in R is usually represented by a data.frame, that is, a collection of variables of equal length. For example, the following variable df is a data.frame containing three vectors n, s, and b:

The code in R
>n = c(2, 3, 5)
>s = c("aa", "bb", "cc")
>b = c(TRUE, FALSE, TRUE)
>df = data.frame(n, s, b)
>df
  n  s     b
1 2 aa  TRUE
2 3 bb FALSE
3 5 cc  TRUE

A data.frame lets us store and compute with the different variables that a probability distribution must take into account.
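
To make the link between a data.frame and a sample space concrete, the following sketch lists every outcome of tossing two fair coins, one outcome per row, and attaches a probability to each. The names space, toss1, toss2, prob and p_event are illustrative assumptions, not from the text.

```r
# Every combination of two coin tosses: one row per outcome.
space <- expand.grid(toss1 = c("H", "T"), toss2 = c("H", "T"),
                     stringsAsFactors = FALSE)

# With a fair coin the four outcomes are equally likely.
space$prob <- 1 / nrow(space)

# An event is a subset of rows; its probability is the sum of their weights.
# Here the event is "at least one head".
at_least_one_head <- space$toss1 == "H" | space$toss2 == "H"
p_event <- sum(space$prob[at_least_one_head])
p_event   # 0.75
```

Three of the four equally likely rows contain a head, so the event has probability 3/4.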

Bayes’ Theorem

Bayes’ Theorem is a way of understanding how the probability that a theory is true is affected by a new piece of evidence. It relates two conditional probabilities that are the reverse of each other. Bayes’ Theorem expresses the conditional probability, or ‘posterior probability’, of an event A after B is observed in terms of the ‘prior probability’ of A, the prior probability of B, and the conditional probability of B given A. Bayes’ Theorem is valid in all common interpretations of probability. The calculation shown in this chapter is one of several forms that are possible with Bayes’ Theorem.

The basic formula for Bayes’ Theorem draws on conditional probability, joint probability, and mutually exclusive and collectively exhaustive events. Researchers often use the theorem to revise the probability that an event will succeed once new information arrives. Bayes’ Theorem measures the impact of a new event A on the probability of B_i:

P(B_i|A) = P(A|B_i)P(B_i) / [P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + ... + P(A|B_k)P(B_k)]

where:

B_i = i^th event of k mutually exclusive and collectively exhaustive events
A = new event that might impact P(B_i)
P(B_i|A) = probability of B_i given that A has occurred

Example:
The success of cookbooks has been a major consideration for publishers. Recent reports from a library support group put the probability of producing a successful cookbook at 0.4. Thus, the probability of an unsuccessful book is 1 – 0.4 = 0.6. Using Bayes’ Theorem, we find the probability that a book is successful given that it received a favorable report:

Event S = successful book
Event F = favorable report
Event S’ = unsuccessful book
Event F’ = unfavorable report

P(S) = 0.40
P(S’) = 0.60
P(F|S) = 0.80
P(F|S’) = 0.30
P(S|F) = (0.8)(0.4) / [(0.8)(0.4) + (0.3)(0.6)] = 0.32/0.50 = 0.64
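
The arithmetic can be checked in R. This is a minimal sketch that assumes the probability of a favorable report is 0.80 for a successful book and 0.30 for an unsuccessful one, the values used in the calculation above; the variable names are illustrative and not from the text.

```r
# Prior probabilities of a successful (S) and an unsuccessful (S') book.
p_S    <- 0.40
p_notS <- 1 - p_S

# Assumed probabilities of a favorable report (F) in each case.
p_F_given_S    <- 0.80
p_F_given_notS <- 0.30

# Total probability of a favorable report, summed over both cases.
p_F <- p_F_given_S * p_S + p_F_given_notS * p_notS

# Bayes' Theorem: probability of success given a favorable report.
p_S_given_F <- p_F_given_S * p_S / p_F
p_S_given_F   # 0.64
```

A favorable report therefore raises the probability of success from the prior value of 0.40 to 0.64.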

Much of the work on probability in software applications rests on the notion of the random sample: a sample drawn at random gives a more representative reading of the cards dealt or of the experiment conducted.

The following example is solved first by hand and then in R:

The editor of a major textbook publishing company is trying to decide whether to publish a proposed statistics textbook for information science. Information on previous textbooks published indicates that 10% are huge successes, 20% are moderate successes, 40% break even and 30% are losers. However, before a publishing decision is made, the book will be reviewed by experts. In the past, 99% of the huge successes received favorable reviews, 70% of the moderate successes received favorable reviews, 40% of the break-even publications received favorable reviews and 20% of the losers received favorable reviews.
A. If the proposed text receives a favorable review, how should the editor revise the probabilities of the various outcomes to take this information into account?
B. What proportion of textbooks receive favorable reviews?

The answer to this question using Bayes’ Theorem:
A.
P(huge success|favorable review) = 0.2157
P(moderate success|favorable review) = 0.3050
P(break even|favorable review) = 0.3486
P(loser|favorable review) = 0.1307
B.
P(favorable review) = 0.459

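In R, we set up arrays that allow us to apply Bayes’ Theorem. An array is an orderly arrangement of data. The arrays are called Prior and Second, and they hold the data from the question: Prior contains the prior probabilities of the four outcomes, and Second contains the probability of a favorable review for each outcome.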
Step 1:
>Prior <- c(0.1, 0.2, 0.4, 0.3)
>Second <- c(0.99, 0.7, 0.4, 0.2)
# Then we define the Bayes function:
>Bayes <- function(Array1, Array2, Sample){
# Expected number of books in each outcome category
Prior <- Array1 * Sample
# Expected number in each category that receive a favorable review
Second <- Prior * Array2
# Expected number in each category that do not receive a favorable review
outofSample <- abs(Second - Prior)
Value1 <- sum(Second)
# Total without a favorable review (computed but not reported)
Value2 <- sum(outofSample)
# Posterior probability of each outcome given a favorable review
Results <- Second / Value1
Answer <- cbind(Prior, Second, outofSample, Results)
# Overall proportion of favorable reviews
Value3 <- Value1 / Sample
cat("Favorable", Value3, "\n")
return(Answer)
}
# In the last part we call the function with the two arrays set up above:
>Bayes(Prior, Second, 10000)
Favorable 0.459
     Prior Second outofSample   Results
[1,]  1000    990          10 0.2156863
[2,]  2000   1400         600 0.3050109
[3,]  4000   1600        2400 0.3485839
[4,]  3000    600        2400 0.1307190

Step 2:
P(favorable review) = 0.99(0.1) + 0.7(0.2) + 0.4(0.4) + 0.2(0.3)
= 0.459
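
Because the Sample argument cancels out of the final ratio, the same posteriors can also be computed directly from the probabilities. The following is a minimal sketch, not the chapter’s function; the names prior, likelihood, joint, p_favorable and posterior are assumptions chosen for illustration.

```r
# Prior probabilities of the four outcomes, taken from the question.
prior <- c(huge = 0.1, moderate = 0.2, break_even = 0.4, loser = 0.3)

# Probability of a favorable review given each outcome.
likelihood <- c(huge = 0.99, moderate = 0.7, break_even = 0.4, loser = 0.2)

# Joint probability of each outcome together with a favorable review.
joint <- prior * likelihood

# Part B: overall proportion of textbooks that receive favorable reviews.
p_favorable <- sum(joint)

# Part A: revised (posterior) probabilities given a favorable review.
posterior <- joint / p_favorable
round(posterior, 4)
```

Naming the vector elements keeps each posterior attached to its outcome, and the posteriors sum to 1 because the four outcomes are mutually exclusive and collectively exhaustive.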
