{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3d303553e1d72eea",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Chapter 3: Mathematical foundations of statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "433dc0d71a2be92e",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.1 Sample Spaces"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e9b4f1ee162fa4d",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "**Definition 3.1:**\n",
    "\n",
    "1. The **sample space** $\\Omega$ is the set of all possible outcomes of an experiment.\n",
    "2. Elements in the sample space are called **outcomes**, and are written in short as $\\omega \\in \\Omega$.\n",
    "3. Subsets of the sample space are called **events**, and are written in short as $A \\subset \\Omega$.\n",
    "\n",
    "Let's start with a simple example: you and your friend just finished the exam for this course. There are two outcomes for each of you, passing $P$ and failing $F$. Then, the set of all possible outcomes is:\n",
    "$$ \\Omega = \\{PP, PF, FP, FF\\} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd72a43bd85c5b38",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Now, you are going to a café to celebrate the end of the exam. There are two possible events: having a something to drink ($A$) and having some cake ($B$).\n",
    "\n",
    "To visualize events, we like to use Venn diagrams. The full rectangle represents the sample space, and the circles represent the events. The shaded areas show the outcome. \n",
    "\n",
    "**Definition 3.2:**  \n",
    "For a given event $A$, let \n",
    "\\begin{equation}\\tag{3.1}\n",
    "A^c = \\{ \\omega \\in \\Omega : \\omega \\notin A \\}\n",
    "\\end{equation} \n",
    "This is called the **complement** of $A$, and is the event 'not $A$'.\n",
    "\n",
    "The first possible event, is that you do not buy a drink:  $\\{ \\omega \\in \\Omega : \\omega \\notin A \\}$. In this case, you could buy a cake, or nothing.  \n",
    "\n",
    "![ss_notA](figures/ss_notA.png)\n",
    "**Figure 3.1**: The sample space for the event 'not $A$'."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28fd3bb84e84ceeb",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "\n",
    "\n",
    "It is also possible that you do not buy a drink, but also no cake. This is written as \n",
    "\\begin{equation} \\tag{3.2}\n",
    "\\{ \\omega \\in \\Omega : \\omega \\notin A  : \\omega \\notin B \\}\n",
    "\\end{equation}\n",
    "This is the **complement** of $A$ **and** $B$.\n",
    "\n",
    "![ss_notAnorB](figures/ss_notAnorB.png)\n",
    "**Figure 3.2**: The sample space for the event 'not $A$ nor $B$'."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1ebcfae19e4f4432",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "**Definition 3.3:**  \n",
    "The **union** of the events $A$ and $B$ is the event '$A$ or $B$' and is defined as:\n",
    "\\begin{equation} \\tag{3.3}\n",
    "A \\cup B = \\{ \\omega \\in \\Omega : \\omega \\in A \\text{ or } \\omega \\in B \\text{ or both} \\}\n",
    "\\end{equation}\n",
    "Here \"or\" is non-exclusive, meaning that $A$ and $B$ can happen separately or simultaneously.\n",
    "\n",
    "In our example, this would mean that you are having cake, or both a drink and cake.\n",
    "![ss_AorB](figures/ss_AorB.png)\n",
    "**Figure 3.3**: The sample space for the event '$A$ or $B$'.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a932bcb0e53dcead",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "**Definition 3.4:**\n",
    "The **intersection** of $A$ and $B$ is defined as:\n",
    "\\begin{equation} \\tag{3.4}\n",
    "A \\cap B = \\{ \\omega \\in \\Omega : \\omega \\in A \\text{ and } \\omega \\in B \\text{ simultaneously} \\}\n",
    "\\end{equation}\n",
    "\n",
    "In our example, this would mean that you are having both a drink and some cake.\n",
    "![ss_AandB](figures/ss_AandB.png)\n",
    "**Figure 3.4**: The sample space for the event '$A$ and $B$'."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9795ea5c353a8798",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "**Definition 3.5:**\n",
    "The **difference** in $A$ and $B$ is the event '$A$ but not $B$'. This is defined as:\n",
    "\\begin{equation} \\tag{3.5}\n",
    "A \\backslash B = \\{ \\omega \\in \\Omega : \\omega \\in A \\text{ and } \\omega \\notin B \\}\n",
    "\\end{equation}\n",
    "\n",
    "![ss_AbutnotB](figures/ss_AbutnotB.png)\n",
    "**Figure 3.5**: The sample space for the event '$A$ but not $B$'.\n",
    "\n",
    "The café has three cakes: chocolate, apple and lemon.\n",
    "The event that you have some cake is $A$. The event that you have chocolate cake is $B$.\n",
    "The event that you have some cake, that is not chocolate is: $A \\backslash B = \\{ \\text{ chocolate, apple, lemon } \\} \\backslash \\{ \\text{ chocolate }\\} = \\{ \\text{ apple, lemon } \\}$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c454f37c2b6d8da6",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.2 Derivation of Bayes theorem\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79bc6cd63e7f45e",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "**Definition 3.6:**\n",
    "    \n",
    "The **Bayes theorem** is a mathematical tool that is used to find the conditional probability. This is the likelihood of a certain event happening based on previous outcomes in similar situations. To find the formula for the Bayes theorem we combine two equations. \n",
    "The first equation calculates the probability of two events happening:\n",
    "\n",
    "$$P(A \\cap B) = P(B)P(A | B) \\tag{3.6}$$ \n",
    "Equation 3.6 says that the probability of both A and B happening depends on the probability of B happening multiplied \n",
    "by the probability of A happening when we already know that B is happening.\n",
    "\n",
    "Similarly, the probability of both A and B happening can also be calculated with equation 3.7:\n",
    "\n",
    "\\begin{equation}\n",
    "P(A \\cap B) = P(A)P(B | A) \\tag{3.7}\n",
    "\\end{equation}\n",
    "\n",
    "Combining these equations gives:\n",
    "\n",
    "\\begin{equation}\n",
    "P(B)P(A | B) = P(A)P(B | A)\n",
    "\\tag{3.8}\n",
    "\\end{equation}\n",
    "Which means\n",
    "\n",
    "\\begin{equation}\n",
    "P(A | B) = \\frac{P(A)P(B | A)}{P(B)} \\tag{3.9}\n",
    "\\end{equation}\n",
    "\n",
    "Equation 3.9 is called the Bayes theorem. To explain this further we will use the following example. Imagine if you want to know the probability of a person named Bob being an astronomy student.\n",
    "\n",
    "B is the prior here (more about priors in [**Chapter 5**](https://bayesian-statistics-for-astrophysics-2024.readthedocs.io/en/latest/lecture_notes/group5/group5.html)), which is that the person is named Bob and A is the person being an astronomy student. \n",
    "To find this, you thus need to know the probability of a person being named Bob. \n",
    "Then you also want to know the probability of a person being an astronomy student. Finally, you want to know the probability of a person being named Bob when you know they are an astronomy student.\n",
    "You can for example find this by looking up the names of people that study astronomy and see how many of those people are named Bob.\n",
    "\n",
    "So now when you meet someone named Bob you know the probability of them being an astronomy student.\n",
    "\n",
    "Reading this example you might not think the Bayes theorem is that groundbreaking. However, one of many important implication of the\n",
    "Bayes theorem is when giving a patient a medical diagnosis. With the Bayes theorem you can calculate the probability of a\n",
    "patient having a certain disease based on the symptoms the patient has. Showing how essential such an equation can be.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d985c00ef160f14a",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "One classic example of Bayes' theorem in astronomy, is identifying the probability of a star being a certain type (variable star, main sequence star, etc.) based on observed properties like brightness. In this example we assume priors, these can be for example taken from previous data. The likelihoods are calculated using a set of simulated stars. This gives us all the data needed to calculate the probability of a star being variable. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "52b34be6f5755026",
   "metadata": {
    "collapsed": false,
    "ExecuteTime": {
     "end_time": "2024-12-19T18:42:18.618828200Z",
     "start_time": "2024-12-19T18:42:18.558439300Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability that the star is a variable star given brightness variability: 0.16\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Priors\n",
    "P_A = 0.05  # Probability of variable star\n",
    "P_not_A = 1 - P_A  # Probability of non-variable star\n",
    "\n",
    "# Sample data: observed variability for variable and non-variable stars (1 = variability observed, 0 = no variability)\n",
    "variable_stars_variability = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]  \n",
    "non_variable_stars_variability = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]\n",
    "\n",
    "# Likelihoods\n",
    "P_B_given_A = np.mean(variable_stars_variability)  # Probability of variability given variable star\n",
    "P_B_given_not_A = np.mean(non_variable_stars_variability)  # Probability of variability given non-variable star\n",
    "\n",
    "# Total probability of observing variability\n",
    "P_B = P_B_given_A * P_A + P_B_given_not_A * P_not_A\n",
    "\n",
    "# Posterior probability using Bayes' theorem\n",
    "P_A_given_B = (P_B_given_A * P_A) / P_B\n",
    "\n",
    "print(f\"Probability that the star is a variable star given brightness variability: {P_A_given_B:.2f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24cf97bdcfc916f",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.3 Law of total probability \n",
    "The law of total probability expresses the probability that a certain outcome will occur; it adds up all the probabilities of distinct events that lead to this same outcome. To make it more clear let's throw two indistinguishable dice and calculate the probability that the sum of the two dice is four. There are two distinct events that have this outcome, namely $2 + 2$ and $3 + 1$. In this case the probabilities of these two distinct events are respectively $\\frac{1}{36}$ and $\\frac{1}{18}$, the law of total probability tells us then that the total probability of having four as the outcome is $\\frac{1}{36} + \\frac{1}{18} = \\frac{1}{12}$.\n",
    "\n",
    "### 3.3.1 Discrete case\n",
    "**Definition 3.7:**\n",
    "\n",
    "Let $A = \\{A_1, A_2, ..., A_n\\}$ be a set of collectively exhaustive and disjoint events, meaning that one and only one of these events will occur. Then the total probability of an event $B$ happening is\n",
    "\\begin{equation} \\tag{3.10}\n",
    "P(B) = \\sum_n P(B|A_n)P(A_n)  \n",
    "\\end{equation}\n",
    "In the example with the two indistinguishable dice $B$ is the event that the two dice sum to four. The distinct events that make that happen are $A_1 = 2 + 2$ and $A_2 = 3 + 1$ with  $P(A_1) = \\frac{1}{36}$ and $P(A_2) = \\frac{1}{12}$, furthermore $P(B|A_n)=1$, because the sum is always four if $A_1$ or $A_2$ happens. Using equation 3.7 we can also write equation 3.10 as\n",
    "\\begin{equation} \\tag{3.11}\n",
    "P(B) = \\sum_n P(B \\cap A_n)  \n",
    "\\end{equation}\n",
    "\n",
    "### 3.3.2 An example in the discrete case\n",
    "Suppose that you have a drawer filled with batteries, 20% of them are of the brand Duracell (D) and 80% are of the brand Panasonic (P). Of the Duracell batteries 50% are fully charged, while only 20% of the Panasonic batteries are charged. We can use the law of total probability to calculate the chance of randomly picking a battery out of the drawer that is charged (event $B$). The separate probabilities are\n",
    "\\begin{equation*} \\\n",
    "P(A_D) = 0.2 \\hspace{1cm}\n",
    "P(A_P) = 0.8 \\hspace{1cm}\n",
    "P(B|A_D) = 0.5 \\hspace{1cm}\n",
    "P(B|A_P) = 0.2 \n",
    "\\end{equation*}\n",
    "These probabilities can then be used to calculate the total probability of getting a charged battery when picking a random one out of the drawer. The total probability is\n",
    "\\begin{equation*}\n",
    "P(B) = P(B|A_D)P(A_D) + P(B|A_P)P(A_P) = 0.5\\cdot0.2 + 0.2\\cdot0.8 = 0.26\n",
    "\\end{equation*}\n",
    "So there is a 26% chance of getting a charged battery out of the drawer.\n",
    "\n",
    "### 3.3.3 Continuous case\n",
    "**Definition 3.8:**\n",
    "\n",
    "We will now discuss the case in which the random variables are not discrete, but continuous. Instead of there being a finite set of distinct possible events ${A_n}$, there is a variable A with a probability density function $f_A(x)$. So $f_A(x)dx$ is the probability that $x < A < x+dx$. In this case the probability of an event $B$ happening is\n",
    "\\begin{equation} \\tag{3.12}\n",
    "P(B) = \\int_{-\\infty}^{\\infty} P(B|A=x)f_A(x)dx \n",
    "\\end{equation}\n",
    "Here again $P(B|A=x)$ is the probability of $B$ happening when $A$ is exactly equal to $x$.\n",
    "\n",
    "### 3.3.4 An example in the continuous case\n",
    "Suppose in a different universe galaxies have a mass A and the distribution of these galaxy masses is given by a normal distribution $f_A(x)$. If the average mass is $\\mu = 1 M_{MW}$ and the width of the normal distribution is $\\sigma = 0.2 M_{MW}$, where $M_{MW}$ is the mass of our Milky Way, then the probability distribution of the masses of the different galaxies is given by\n",
    "\\begin{equation*}\n",
    "f_A(x) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left(-\\frac{(x-\\mu)^2}{2\\sigma^2}\\right)\n",
    "\\end{equation*}\n",
    "In this particular universe, the chance of life to exist in a galaxy is dependent on their total mass, let us call a galaxy containing life outcome $B$. Larger mass galaxies have a larger chance of containing life in this case, so let us assume the chance of life existing in a galaxy with mass A is \n",
    "\\begin{equation*}\n",
    "P(B|A=x) = 0.10\\tanh\\left(\\frac{x}{\\mu}\\right)\n",
    "\\end{equation*}\n",
    "In this case of course $x>0$ as galaxies can't have a negative mass, the probability is thus zero for $x<0$. The expression for $P(B|A=x)$ tells us that less massive galaxies have a smaller chance of containing life, and more massive galaxies have a chance of containing life, tending towards 10% for the supermassive galaxies. Using equation 3.12 we can calculate the total probability of a random galaxy containing life, which is\n",
    "\\begin{equation*}\n",
    "P(B) = \\int_{-\\infty}^{\\infty} P(B|A=x)f_A(x)dx = \\frac{0.1}{\\sqrt{2\\pi\\sigma^2}}  \\int_{0}^{\\infty} \\tanh\\left(\\frac{x}{\\mu}\\right) \\exp\\left(-\\frac{(x-\\mu)^2}{2\\sigma^2}\\right) dx\n",
    "\\end{equation*}\n",
    "This integral can't be solved analytically, so using numerical methods one can find that $P(B) \\approx 0.075$, which means that about 7.5% of the galaxies in this universe contain life."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "edbc401009b25c53",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## 3.4 The difference between frequentist and Bayesian views"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c18f5ad30d90d1bb",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Two competing philosophies on statistics have sparked debates. These two approaches, frequentist and Bayesian, differ in how they treat probabilities and statistical inference.\n",
    "\n",
    "A frequentist assigns probabilities to data rather than hypotheses, focusing on the long-run frequency of events. In this framework, any collected data set is viewed as one of many hypothetical data sets that addresses the same question. Uncertainty arises solely from sampling error. Confidence intervals, therefore, are constructed to contain the true parameter a specific percentage of the time when the experiment is repeated infinitely often. As the number of repetitions increases, the influence of false outliers diminishes, and the statistical methods are designed to perform reliably under such repeatable conditions. Probability is treated the same as frequency. The likelihood of an event is defined by its relative frequency in infinite repetitions of the same experiment. Importantly, in the frequentist perspective, the true parameters of a probability model are treated as fixed values. This implies that probabilistic statements about these parameters are invalid. An event occuring is either true or false, with no intermediate probabilities. \n",
    "\n",
    "In contrast, Bayesians assign probabilities to hypotheses, viewing probabilities as degrees of belief. From the Bayesian perspective, a probability is given to a hypothesis. The parameter is taken as a random variable. This means that there is a probability the event will occur. This perspective allows for probabilistic statements about unknown parameters even before any data is observed, with probabilities ranging from 0% to 100%. An important difference is that Bayesian inference takes prior knowledge into account. The prior probability reflects beliefs about a parameter before data is collected, and Bayesian statistics is designed to update these beliefs in response to new evidence. By incorporating prior knowledge, Bayesian models refine the probabilities of hypotheses as more data becomes available.\n",
    "\n",
    "In many cases, Bayesian methods closely resemble other statistical approaches, especially when working with large samples from a fixed model. For smaller sample sizes, many conventional methods can be understood as approximations to Bayesian inferences based on specific prior distributions. Recognizing the implicit priors in these methods can provide valuable insights into their assumptions. However, some methods, like hypothesis testing, may yield results that differ significantly from those obtained using Bayesian approaches.\n",
    "\n",
    "An example of the different approaches of frequentists and Bayesians is in the medical field when you want to diagnose a patient. In the frequentist approach, the doctor would look at the current complaints the patient has and compare that to previous records of other patients with similar pains and what their diagnosis was to hopefully get a diagnosis for the current patient. In the Bayesian approach, the doctor would also take the patient's previous medical records into account. This means that they will also consider prior knowledge for a diagnosis. \n",
    "\n",
    "In astronomy, marganilisation is often used to account for uncertainties in parameters. For example, let's say we have observed the magnitudes of stars within a cluster and we want to determine the distance to a star using the distance modulus\n",
    "$$ m - M = 5 log_{10}(d) - 5 $$\n",
    "$m$ is the apparent magnitude, $M$ the absolute magnitude and $d$ the distance. In practise $m$ and $M$ will have uncertainties.\n",
    "A frequentist might directly try to fit $d$ to the observed magnitudes using the model. Often they will assume a fixed $M$ (or its mean value). This does not take into account the uncertainties in $M$, leading to an overconfidence of estimates. A Bayesian will calculate the posterior probability distribution for the distance, marganilising over $M$. \n",
    "$$ P(d|m) = \\int P(d, M|m)dM = \\int P(m|d,M)p(d)p(M)dM $$\n",
    "$P(m|d,M)$ is the likelihood of observing $m$ given $d$ and $M$, $p(d)$ is a prior on the distance and $p(M)$ a prior on the absolute magnitude. This now accounts for all possible values. \n",
    "\n",
    "The debate between frequentism and Bayesianism often revolves around the use of priors, which frequentists criticize for introducing potential biases into analysis. Despite this critique, Bayesian methods are widely regarded as reasonable if the chosen prior is sensible. This will be explained further in [**Chapter 5**](https://bayesian-statistics-for-astrophysics-2024.readthedocs.io/en/latest/lecture_notes/group5/group5.html)\n",
    "\n",
    "Chapter 5. Since this course is BASTA, we will continue working on statistics from the Bayesian view. The processes of Bayesian inteference and chosing priors will be explained and shown with examples in the further chapters."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bf0ede3bf773201c",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "**References**:\n",
    "\n",
    "[Fornacon-Wood, I. et al 2022, Understanding the Differences Between Bayesian and Frequentist Statistics, International Journal of Radiation Oncology, Biology, Physics, Volume 112, Issue 5, 1076 - 1082](https://www.redjournal.org/article/S0360-3016(21)03256-9/fulltext#:~:text=%3A%20the%20frequentist%20approach%20assigns%20probabilities,as%20more%20data%20become%20available)\n",
    "\n",
    "[Gelman, A. et al 2021, Bayesian Data Analysis, Third edition (with errors fixed as of 15 February 2021)](www.stat.columbia.edu/~gelman/book/BDA3.pdf)\n",
    "\n",
    "Nagler T. 2021, Statistics for Astronomy and Physics students, Leiden University\n",
    "\n",
    "[Kozyrkov C. 2021, Statistics: Are you Bayesian or Frequentist?, Towards Data Science](https://towardsdatascience.com/statistics-are-you-bayesian-or-frequentist-4943f953f21b)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false
   },
   "id": "ef460ae6ec811699"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}