The answer here seems pretty obvious – you must understand the data type of each variable in order to record its values in a consistent manner. This probably won’t require much thought in most cases, but consider the following example. Suppose you are interested in the variable creatinine but plan to analyze it as a binary variable by classifying patients as creatinine < 1.8 or creatinine ≥ 1.8. You could simply collect which of these categories each individual falls into, but this probably isn’t the best choice. If a categorical variable is based on the value of a continuous variable, it is generally a good idea to collect the continuous variable. A continuous variable provides more information than a binary variable, which usually translates into more statistical power to detect differences among patients. If, in the analysis phase, you decide that you really do want to use the binary version of the variable, you can easily use a formula in a spreadsheet or statistical software package to create the binary variable from the continuous one you collected. On the other hand, if you only collect the binary variable, you do not have the source measurement recorded to go back to if necessary.
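As a rough illustration of deriving the binary variable after the fact, the short Python sketch below dichotomizes a continuous creatinine measurement at the 1.8 cutoff mentioned above. The measurements, the function name `dichotomize`, and the 0/1 coding are assumptions made for this example only; they do not come from the text.

```python
# Hypothetical creatinine measurements, made up for illustration.
creatinine = [0.9, 1.8, 2.4, 1.2, 1.79]

CUTOFF = 1.8  # threshold used in the example in the text


def dichotomize(value, cutoff=CUTOFF):
    """Return 1 if value >= cutoff, otherwise 0 (assumed coding)."""
    return 1 if value >= cutoff else 0


# Derive the binary variable from the continuous one that was collected.
high_creatinine = [dichotomize(v) for v in creatinine]

print(high_creatinine)  # [0, 1, 1, 0, 0]
```

The reverse direction is not possible: a list of 0s and 1s cannot be turned back into the original measurements, which is why recording the continuous value is the safer choice.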

You are probably frequently exposed to terms such as mean, median, frequency, proportion, two-sample t-test, chi-square test, regression, correlation, logistic regression, etc. These are all statistical calculations or procedures, but which ones do you use – and when? The appropriate statistical calculation or procedure is driven in large part by the data types

the variables body temperature (°C) and diabetes (0 = No diabetes, 1 = Yes diabetes) among 1420 hospitalized cancer patients. Diabetes is a nominal variable with only two possible values. Thus, we want to know the number (frequency) of patients with diabetes and what proportion of the total sample they represent. Because body temperature is a continuous variable with many possible values, we summarize its distribution by reporting statistics such as the median, minimum, maximum, mean and standard deviation. Clearly, it would not be feasible or helpful to summarize the number and proportion of patients who had each specific body temperature value, just as it would make no sense to calculate the mean of the diabetes variable.
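To make the contrast concrete, here is a minimal Python sketch that summarizes a nominal variable with frequencies and proportions and a continuous variable with the median, minimum, maximum, mean and standard deviation. The eight records and the function names are hypothetical, invented for this example; they are not the 1420-patient data set described above.

```python
import statistics
from collections import Counter

# Hypothetical records, made up for illustration.
diabetes = [0, 1, 0, 0, 1, 0, 0, 0]  # 0 = No diabetes, 1 = Yes diabetes
body_temp_c = [36.8, 37.1, 36.9, 37.0, 38.2, 36.7, 36.9, 37.0]


def summarize_nominal(values):
    """Return {level: (frequency, proportion)} for a nominal variable."""
    counts = Counter(values)
    n = len(values)
    return {level: (count, count / n) for level, count in counts.items()}


def summarize_continuous(values):
    """Return common descriptive statistics for a continuous variable."""
    return {
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }


print(summarize_nominal(diabetes))
print(summarize_continuous(body_temp_c))
```

Each function is matched to one data type on purpose: calling `summarize_continuous` on the 0/1 diabetes codes would run without error, but the resulting "mean" would not be the kind of summary the text recommends for a nominal variable.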

e a continuous variable (body temperature) for each of two groups (females and males). A statistical quantity used to summarize the distribution of a continuous variable is the mean. We see that the mean body temperature for males was 36.90°, compared to 36.99° for females. Just as we compare means in the two groups in our descriptive statistical analysis, we need a procedure that will statistically compare the mean among males to the mean among females. One statistical test for comparing means between two groups is a two-sample t-test.
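The sketch below shows one way the test statistic for a two-sample t-test could be computed by hand in Python. It assumes the classic pooled-variance (equal variances) form of the test, since the text does not say which variant is meant. The temperature values and the function name `pooled_t_statistic` are hypothetical and are not the data summarized above.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    # Standard error of the difference between the two group means.
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df


# Hypothetical body temperatures (°C), made up for illustration.
females = [37.0, 37.1, 36.9, 37.0]
males = [36.9, 36.8, 37.0, 36.9]

t_stat, df = pooled_t_statistic(females, males)
print(f"t = {t_stat:.3f}, df = {df}")
```

Turning the statistic into a p-value requires the t distribution with `df` degrees of freedom, which a statistical software package would normally supply.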