
Week-1

Math 183 • Statistical Methods • Spring 2026

Siddharth Vishwanath

Learning objectives

  • Understanding data
  • Variable vs. Observation
  • Classification of Variables
  • Population Quantity vs. Sample Statistic
  • Implementation in R
  • Foundations of Data Summarization
  • Data Visualization Techniques

The Big Picture

Anatomy of Data

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa

Observation

An individual unit from which data are collected.

Variable

A characteristic for which different observations can take on different values.

Constant

A characteristic that is the same for all observations.
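The three definitions can be checked on the six iris rows shown above. This is a minimal sketch: `is_constant` is a hypothetical helper written for this illustration, not a built-in R function.

```r
# iris ships with base R; head() returns the six rows shown in the table.
flowers <- head(iris)

# Each row is one observation; each column is one recorded characteristic.
n_observations <- nrow(flowers)
n_columns      <- ncol(flowers)

# A column is constant when it takes a single value across all rows.
# is_constant is a hypothetical helper, not part of base R.
is_constant   <- function(column) length(unique(column)) == 1
constant_cols <- names(flowers)[sapply(flowers, is_constant)]

print(c(observations = n_observations, columns = n_columns))
print(constant_cols)  # "Species": all six of these flowers are setosa
```

Within these six rows Species is a constant, although it is a variable in the full iris data, where three species appear. Whether a characteristic counts as a variable or a constant depends on the set of observations under study.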

Coffee Consumption & Sleep Quality

A research team recruits 100 adults aged 25 to 40 to participate in a 6-month study. Participants log their daily coffee intake and wear sleep trackers at night to record their sleep quality.

  • Observation: The adults
  • Variables: Coffee consumption frequency, Sleep quality
  • Constant: Age group (all adults are between 25 and 40)

Exercise Regime & Stress Levels

150 office workers are surveyed over a 3-month period where they report their weekly exercise routines and undergo monthly stress tests.

  • Observation: The individuals
  • Variables: Exercise regime, Stress levels
  • Constant: Time (all cases are measured in the same span of time)

Type of Cooking Oil & Heart Health

200 households in a city participate in a year-long study where their usage of cooking oil is recorded monthly. Additionally, all adult members undergo quarterly heart health check-ups.

  • Observation: The households
  • Variables: Type of cooking oil used, Heart health indicators
  • Constant: Geographic location (all households are in the same city)

Types of Variables

Image credit: OpenClassrooms

Warning

Sometimes computers can’t (or won’t) understand the difference between different types of variables. It’s up to us to tell them!
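As a small illustration of this warning in R, the sketch below uses the built-in `mtcars` data. Treating the number of cylinders as categorical is an assumption made for this example.

```r
# Cylinder counts in mtcars are stored as plain numbers.
cyl_numeric <- mtcars$cyl
print(class(cyl_numeric))  # "numeric"
print(mean(cyl_numeric))   # R will happily average them, meaningful or not

# Declaring the variable as a factor tells R to treat it as categorical.
cyl_factor <- factor(cyl_numeric)
print(levels(cyl_factor))  # "4" "6" "8"
print(table(cyl_factor))   # number of cars in each category
```

Once a variable is a factor, functions such as `table()` and `barplot()` treat its values as labels rather than quantities.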

Examples

Beverage Preference

A survey in a school asks students their preferred beverage among tea, coffee, or juice.

Daily Screen Time

A study measures the daily screen time in hours of 100 individuals.

Types of Pets Owned

A neighborhood survey asks households about the types of pets they own.

Monthly Savings

Individuals are asked about their monthly savings in dollars.

Work Commute Method

A city survey asks residents about their preferred method of commuting to work.

Number of Books Read

A library conducts a survey asking individuals about the number of books they read in a month.

Favorite Music Genre

A radio station surveys its listeners to know their favorite music genre.

Weekly Exercise Hours

A health app collects data on the weekly exercise hours logged by its users.

Explanatory vs Response Variables

  • Explanatory Variable (Independent):
    • The variable that is manipulated (or, in an observational study, simply recorded) to see how it affects another variable.
  • Response Variable (Dependent):
    • The variable whose values are predicted or explained by the explanatory variable.

Example

A local gym aims to find the most effective workout routine for weight loss. They create a 3-month program where participants are divided into two groups. One group follows a cardio-centric routine, while the other engages in strength training. Participants’ weights are recorded at the start and end of the program. The gym seeks to understand which workout type leads to greater weight loss, to offer better guidance to its members.

  • Explanatory Variable: Type of exercise (e.g., cardio, strength training)
  • Response Variable: Amount of weight loss
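In R, this distinction appears in formula notation, written `response ~ explanatory`. The sketch below uses the built-in `mtcars` data as a stand-in, since the gym data are not available. Treating `mpg` as the response and `cyl` as the explanatory variable is an illustrative assumption.

```r
# Formula notation: response on the left of ~, explanatory on the right.
# mpg (response) and cyl (explanatory) are illustrative choices from mtcars,
# not the gym data described above.
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mean_mpg_by_cyl)
```

The result has one row per value of the explanatory variable, with the average response in each group.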

Exercise Type & Weight Loss

A fitness center compares the effectiveness of two workout routines, HIIT and Yoga, for weight loss. The type of exercise is the explanatory variable, while the amount of weight loss is the response variable.

Teaching Methods & Student Performance

An educator evaluates two teaching methods to understand which one enhances student performance. The teaching method is the explanatory variable, and the students’ performance is the response variable.

Diet Type & Energy Levels

A nutritionist compares vegetarian and non-vegetarian diets to assess their impact on energy levels. The diet type is the explanatory variable, while the energy level is the response variable.

Medication Dosage & Recovery Time

In a clinical trial, different dosages of a medication are administered to patients to observe the effects on recovery time. The dosage is the explanatory variable, and the recovery time is the response variable.

Sleep Hours & Productivity

A company explores the relationship between hours slept and productivity the next day among its employees. The number of hours slept is the explanatory variable, and productivity is the response variable.

The need for a statistical framework

Student Performance Evaluation Data

Study

  • Data collected from two Portuguese schools regarding student achievement in high school.
  • Features include student demographics, social attributes, and school-related features.

Aim

  • Evaluate if there’s a difference in the final grades between male and female students.

What the data looks like

school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 3 yes no yes no yes yes yes no 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes yes yes yes yes yes yes 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes yes no yes yes no no 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes yes yes yes yes yes no 5 4 2 1 2 5 10 15 15 15

Results

sex mean grade
F 30.98
M 33.22
  • The average grade for females is 30.98
  • The average grade for males is 33.22
  • The difference in grades for males vs females is:

\[ 33.22 - 30.98 = \color{green}{2.24} \]

Do males have higher grades than females?
  • What if the study included 10 males and 12 females?
  • What if the study included 10,000 males and 12,000 females?

Intuition vs Statistics

Vaccine efficacy. A vaccine trial is conducted as follows:

💉✅ 💉❌
COVID 🙁 \(n_1\) \(n_2\)
No COVID 🙂 \(m_1\) \(m_2\)

\[ \begin{aligned} \%[🤒+✅💉] = \frac{n_1}{n_1 + m_1} \quad\text{ and }\quad \%[🤒+❌💉] = \frac{n_2}{n_2 + m_2} \end{aligned} \]



\[ \text{Efficacy} = 1 - \frac{\%[🤒+✅💉]}{\%[🤒+❌💉]} \]

Intuition vs Statistics

Consider the following* examples from a vaccine trial. Which is more reliable?

Setting 1
💉✅ 💉❌
COVID 🙁 5 1200
No COVID 🙂 45 4800

  • [🤒+✅💉] = \(5/(5+45) = 10\%\)
  • [🤒+❌💉] = \(1200/(1200+4800) = 20\%\)


Efficacy: \[ E = 1 - \frac{10\%}{20\%} = 50\% \]

Setting 2
💉✅ 💉❌
COVID 🙁 500 1200
No COVID 🙂 4500 4800

  • [🤒+✅💉] = \(500/(500+4500) = 10\%\)
  • [🤒+❌💉] = \(1200/(1200+4800) = 20\%\)


Efficacy: \[ E = 1 - \frac{10\%}{20\%} = 50\% \]

* hypothetical
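The efficacy formula can be checked against both settings with a short R sketch. The function name `efficacy` and its argument names are assumptions chosen to mirror \(n_1, m_1, n_2, m_2\) from the table above.

```r
# efficacy() is a hypothetical helper implementing the formula above.
# n1, m1: infected / not infected among the vaccinated
# n2, m2: infected / not infected among the unvaccinated
efficacy <- function(n1, m1, n2, m2) {
  rate_vaccinated   <- n1 / (n1 + m1)
  rate_unvaccinated <- n2 / (n2 + m2)
  1 - rate_vaccinated / rate_unvaccinated
}

setting1 <- efficacy(n1 = 5,   m1 = 45,   n2 = 1200, m2 = 4800)
setting2 <- efficacy(n1 = 500, m1 = 4500, n2 = 1200, m2 = 4800)
print(c(setting1 = setting1, setting2 = setting2))  # both 0.5

# The estimates agree, but they rest on very different group sizes.
vaccinated_n <- c(setting1 = 5 + 45, setting2 = 500 + 4500)
print(vaccinated_n)
```

Both settings report 50% efficacy, yet Setting 1 is based on only 50 vaccinated participants, so one or two additional infections would change its estimate substantially.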

A Glimpse Ahead

Test of Significance

A principled statistical approach to examining whether an observed effect truly exists, as opposed to being an artefact of random chance.


The Power of Formal Tools:
  • Translate observations into actionable insights.
  • Reduce ambiguity, increase confidence.


Takeaway
  • Observations alone can be misleading; statistical tools provide clarity.

Summarizing your Data

Principles of Data Summarization

Data Summarization

Data summarization is the process of condensing large amounts of data into smaller, more informative representations that capture the essential information of the dataset. This is crucial for making the data more understandable, interpretable, and manageable.

Here are some key methods:

  • Shape: Describes how the data are distributed, using graphs and charts as well as metrics such as skewness and kurtosis.
  • Central Tendency: Measures that describe the center of a dataset, like the mean, median, and mode.
  • Variability: Measures that describe the spread or dispersion of data, such as the variance, standard deviation, and range.

Visualizing Data

Cars Data

library(dplyr)  # provides %>%, select(), and filter() used below
data <- mtcars  # mtcars ships with base R

data %>% head
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Bar chart

data$cyl %>% 
  table %>% 
  barplot(col=c("red", "white", "blue"))
## Equivalently
barplot(table(data$cyl), col=c("red", "white", "blue"))

Scatterplot

data %>% 
  select(mpg, disp) %>% 
  plot(col="dodgerblue", pch=20)
# Equivalently
plot(data[, c("mpg", "disp")], col="dodgerblue", pch=20)

Boxplot

boxplot(mpg~cyl, data)

Boxplot

cyl4 <- data %>% 
  filter(cyl == 4) %>% 
    select(mpg) %>% 
    unlist()

up_q  <- quantile(cyl4, 0.75)
low_q <- quantile(cyl4, 0.25)
med   <- median(cyl4)

Boxplot

# Overlay the upper quartile, lower quartile, and median on the boxplot
abline(h=c(up_q, low_q, med), col=c("red", "blue", "black"))

Histogram

hist(data$disp)

Histogram

hist(data$disp, breaks=20)

Histogram

hist(data$disp, breaks=3)

Histogram

hist(data$disp, freq=FALSE)

Histogram

hist(data$disp, freq=FALSE)
lines(density(data$disp), col="red")
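Setting `freq=FALSE` rescales the bars so that their total area is 1, which is what allows a density curve to be drawn on the same axes. A minimal sketch of that check, using `mtcars$disp` directly and the variable names `bin_widths` and `total_area` chosen for this illustration:

```r
# plot = FALSE returns the bin information without drawing anything.
h <- hist(mtcars$disp, plot = FALSE)

# Area of each bar is height (density) times width; the areas sum to 1.
bin_widths <- diff(h$breaks)
total_area <- sum(h$density * bin_widths)
print(total_area)

# On the default frequency scale the bar heights are counts instead.
print(sum(h$counts) == nrow(mtcars))
```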

Distribution shapes (figure panels): Flat, Right Skewed, Symmetric, Left Skewed

Measures of Central Tendency: Mean

Mean

The mean of a set of observations \(x_1, x_2, \dots, x_n\) of a quantitative variable is given by

\[ \bar{x} = \frac 1n \sum_{i=1}^n x_i = \frac{x_1 + x_2 + \dots + x_n}{n} \]

What is the mean of \(\{8, 1, 4, 4, 3\}\)?

x <- c(8, 1, 4, 4, 3)
mean(x)
[1] 4
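The same answer follows directly from the formula. A minimal sketch, with `x_bar` as a name chosen here to match \(\bar{x}\):

```r
x <- c(8, 1, 4, 4, 3)
n <- length(x)

# Sum the observations and divide by how many there are: 20 / 5.
x_bar <- sum(x) / n
print(x_bar)

# Agrees with the built-in function.
print(x_bar == mean(x))
```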

The mean of \(\{1, 2, 3, 4, 10, x\}\) is \(3.333\). What is \(x\)?

Measures of Central Tendency: Median

Median

Let the data points be \(x_1, x_2, \ldots, x_n\) arranged in non-decreasing order, i.e., \(x_i \le x_{i+1}\) for all \(i\). Then the median, \(M\), is:

\[ M = \begin{cases} x_{\frac{n+1}{2}} & \text{if $n$ is odd}\\ \\ \frac{x_{\frac{n}{2}} + x_{(\frac{n}{2} + 1)}}{2} & \text{if $n$ is even} \end{cases} \]

What is the median of \(\{8, 1, 4, 4, 3\}\)?

x <- c(8, 1, 4, 4, 3)
median(x)
[1] 4
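The piecewise definition translates almost line for line into R. In this sketch, `manual_median` is a hypothetical helper written for illustration. In practice `median()` does the same job.

```r
# manual_median is a hypothetical helper following the definition above.
manual_median <- function(x) {
  x_sorted <- sort(x)        # arrange in non-decreasing order
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]    # odd n: the middle value
  } else {
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2  # even n: average the two middle values
  }
}

odd_case  <- manual_median(c(8, 1, 4, 4, 3))  # sorted: 1 3 4 4 8
even_case <- manual_median(c(8, 1, 4, 3))     # sorted: 1 3 4 8
print(c(odd = odd_case, even = even_case))    # 4 and 3.5
```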

The median of \(\{1, 2, 3, 4, 10, x\}\) is \(2.5\). What is \(x\)?

Measures of Central Tendency: Mode

Mode

The mode of data points \(x_1, x_2, \ldots, x_n\) is the value that appears most frequently.


What is the mode of \(\{8, 1, 4, 4, 3\}\)?

x <- c(8, 1, 4, 4, 3)
Mode = \(x) x %>% table %>% which.max %>% names
Mode(x)
[1] "4"
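One caveat: `which.max` returns only the first maximum, so `Mode` above reports a single value even when several values tie for most frequent. A sketch of a variant that returns every tied value, where `all_modes` is a hypothetical helper name:

```r
# all_modes is a hypothetical helper that returns every most-frequent value.
all_modes <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # keep all values that reach the top count
}

single_mode <- all_modes(c(8, 1, 4, 4, 3))     # "4"
tied_modes  <- all_modes(c(8, 1, 1, 4, 4, 3))  # "1" "4"
print(single_mode)
print(tied_modes)
```

As with `Mode`, the result is a character vector because it comes from the names of the frequency table.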

The mode of \(\{1, 2, 3, 4, 10, x\}\) is \(1\). What is \(x\)?

Measures of Dispersion: Variance

Which of the following histograms exhibits more variability?

Variance, as the name suggests, is a measure of this variability.

What effect does variance have on our perception of the data?

Variance & Standard Deviation

Variance

The variance of a set of observations \(x_1, x_2, \dots, x_n\) of a quantitative variable is given by

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^n({x_i - \bar{x}})^2 \]

The standard deviation, \(s\), is the square root of the variance.

What are the variance and standard deviation of \(\{8, 1, 4, 4, 3\}\)?

x <- c(8, 1, 4, 4, 3)
c(var(x), sd(x))
[1] 6.50000 2.54951
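These values can be reproduced step by step from the formula, including the \(n-1\) divisor. A minimal sketch, with `deviations`, `s2`, and `s` as names chosen for this illustration:

```r
x <- c(8, 1, 4, 4, 3)
n <- length(x)

deviations <- x - mean(x)          # 4, -3, 0, 0, -1
s2 <- sum(deviations^2) / (n - 1)  # (16 + 9 + 0 + 0 + 1) / 4 = 6.5
s  <- sqrt(s2)                     # standard deviation

print(c(variance = s2, sd = s))
```

R's `var()` and `sd()` use the same \(n-1\) divisor, which is why the results match.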

The standard deviation of \(\{1, 2, 3, 4, 10, x\}\) is \(1\). What is \(x\)?

Lower Quantile

Lower Quantile

For \(0 \le \alpha \le 1\), the lower \(\alpha\)–quantile of \(x_1, x_2, \ldots, x_n\) is the value for which at least \(\alpha\) fraction of the points have a value less than or equal to it.
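One way to turn this definition into code is to take the smallest data value that satisfies it. The sketch below makes that assumption explicit; `lower_quantile` is a hypothetical helper. R's built-in `quantile()` interpolates between data points by default, so its answers can differ.

```r
# lower_quantile is a hypothetical helper: the smallest data value with at
# least an alpha fraction of the points less than or equal to it.
lower_quantile <- function(x, alpha) {
  x_sorted <- sort(x)
  k <- max(1, ceiling(alpha * length(x)))  # how many points must lie at or below
  x_sorted[k]
}

x <- c(8, 1, 4, 4, 3)                      # sorted: 1 3 4 4 8
lower_q <- sapply(c(0.1, 0.5, 0.95), function(a) lower_quantile(x, a))
print(lower_q)                             # 1 4 8
```

For example, with \(\alpha = 0.5\) at least 2.5 of the 5 points must lie at or below the quantile, so the third smallest value, 4, is returned.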

Figure panels: \(Q_{0.1}\), \(Q_{0.6}\), and \(Q_{0.95}\) quantiles.

Upper Quantile

Upper Quantile

For \(0 \le \alpha \le 1\), the upper \(\alpha\)–quantile of \(x_1, x_2, \ldots, x_n\) is the value for which at least \(\alpha\) fraction of the points have a value greater than or equal to it.
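The upper quantile can be sketched the same way by counting from the largest value downwards. Here `upper_quantile` is a hypothetical helper that assumes the largest data value satisfying the definition is the one intended.

```r
# upper_quantile is a hypothetical helper: the largest data value with at
# least an alpha fraction of the points greater than or equal to it.
upper_quantile <- function(x, alpha) {
  x_sorted <- sort(x, decreasing = TRUE)
  k <- max(1, ceiling(alpha * length(x)))  # how many points must lie at or above
  x_sorted[k]
}

x <- c(8, 1, 4, 4, 3)                      # sorted downwards: 8 4 4 3 1
upper_q <- sapply(c(0.05, 0.1, 0.95), function(a) upper_quantile(x, a))
print(upper_q)                             # 8 8 1
```

With \(\alpha = 0.95\), at least 4.75 of the 5 points must lie at or above the quantile, so only the smallest value, 1, qualifies.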

Figure panels: \(q_{0.05}\), \(q_{0.1}\), and \(q_{0.95}\) quantiles.