Variables & Visualization
What Is The Story You Are Trying To Tell?
Andy Grogan-Kaylor
2024-10-31
Possibilities
As you move forward through this presentation, you can press b to make
text bigger, or s to make text smaller.
Background
- Deciding upon the right data visualization to represent your data
can be a daunting process.
- I believe that a starting point for this thinking is some
basic statistical thinking about the type of variables that you
have.
- At the broadest level, variables may be conceptualized as
categorical variables, or continuous variables.
Data Often Come From A Survey Questionnaire.
What is Data?
A data set is nothing more than a series of rows and columns that
contain responses to a survey.
- Rows are usually used for individuals, while columns indicate the
questionnaire answers, or measures, from those people.
- Answers to questions are often coded as numbers (e.g. “no” is
frequently coded as “0” and “yes” is frequently coded as “1”).
Hypothetical Data
1 | 1 | 0 | 100 |
2 | 2 | 0 | 200 |
3 | 1 | 1 | -9 |
Some Notes on Data
- In working through our research questions, we’ll constantly be going
back and forth between the actual data (to see the pattern of responses)
and the documentation, to figure out the actual question asked as well
as how the different responses are coded.
- Often in a spreadsheet, you’ll see the full text of a question
written out (e.g. “What is your gender identity?”).
- Most programs that work with data are going to want abbreviations
(e.g. “Q1” or “gender”) for the questions. These abbreviations should
usually have no spaces and be 8 characters or fewer.
Missing Data
- One cell of the sample data set has a negative number.
- Frequently, negative numbers are used to indicate what are called
“missing values”. A missing value is a response like “don’t know” or
“refused to answer” or “did not answer”.
- Before we start doing calculations with our data, we’ll want to
change negative numbers to true missing values (usually symbolized by a
“.” or “NA”), so that they don’t goof up our calculations.
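To make the recoding step concrete, here is a minimal Python sketch using the hypothetical data above. The column names (`id`, `q1`, `q2`, `q3`) are my own illustrative assumptions, as is the rule that every negative value in `q3` is a missing-value code; always check the documentation before recoding real data.

```python
from statistics import mean

# Hypothetical survey rows; the column names are illustrative assumptions.
rows = [
    {"id": 1, "q1": 1, "q2": 0, "q3": 100},
    {"id": 2, "q1": 2, "q2": 0, "q3": 200},
    {"id": 3, "q1": 1, "q2": 1, "q3": -9},
]

def recode_missing(value):
    """Turn a negative missing-value code into a true missing value (None)."""
    return None if value < 0 else value

raw = [row["q3"] for row in rows]
cleaned = [recode_missing(value) for value in raw]

# Missing values are left out of the calculation instead of counted as -9.
observed = [value for value in cleaned if value is not None]

naive_mean = mean(raw)       # treats -9 as if it were a real answer
clean_mean = mean(observed)  # uses only the answers that were given

print("mean with -9 left in:", naive_mean)
print("mean after recoding: ", clean_mean)
```

Leaving the -9 in place pulls the average well below either of the two real answers, which is exactly the kind of distortion the recoding avoids.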
What are Variables?
- By variables, I simply mean the columns of data that you have.
- For our purposes, you may think of variables as synonymous with
questionnaire items, or columns of data.
Variable Types
- Categorical variables represent unordered categories like
neighborhood, or religious affiliation, or place of residence.
- Continuous variables represent a continuous scale like a
mental health scale, or a measure of life expectancy.
A Data Visualization Strategy
Once we have discerned the type of variable that we have, there are two
follow-up questions we may ask before deciding upon a chart strategy:
- Is our graph about one thing at a time?
- How much of x is there?
- What is the distribution of x?
- Is our graph about two things at a time?
- What is the relationship of x and y?
- How are x and y associated?
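One way to think about this strategy is as a small lookup from variable types to a chart. The sketch below is only illustrative: `suggest_chart` is a hypothetical helper, the pairings follow the examples used in this presentation, and the scatterplot for two continuous variables is my own assumption, since the presentation does not name that chart.

```python
def suggest_chart(x_type, y_type=None):
    """Suggest a chart for one or two variables.

    Each type must be "categorical" or "continuous".
    """
    one_variable = {
        "continuous": "histogram or dotplot",
        "categorical": "bar chart of counts",
    }
    # frozenset ignores order, so (x, y) and (y, x) give the same answer.
    two_variables = {
        frozenset(["categorical"]): "grouped bar chart of counts",
        frozenset(["continuous"]): "scatterplot",  # assumption, see lead-in
        frozenset(["categorical", "continuous"]): "bar chart of group means",
    }
    if y_type is None:
        return one_variable[x_type]
    return two_variables[frozenset([x_type, y_type])]


print(suggest_chart("continuous"))
print(suggest_chart("categorical", "continuous"))
```

These are starting points rather than rules; the same data can often be shown well in more than one way.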
More On Strategy
Simulated Data
This example uses simulated data on social work clients, of the kind
that a social service agency might collect.
Simulated Data
25.42 | 105.8 | Group B | Neighborhood B |
25.55 | 93.27 | Group A | Neighborhood B |
23.18 | 131.3 | Group A | Neighborhood B |
25.07 | 112.9 | Group A | Neighborhood B |
51.61 | 110 | Group A | Neighborhood B |
Show One Thing At A Time
We start by visualizing one indicator at a time.
Continuous Variable
Sometimes the most interesting visualizations are those that give us
a sense of the maximum, minimum, and average values. For example, the
histogram and dotplot display information on age.
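To see what a histogram is doing behind the scenes, the following Python sketch computes the minimum, maximum, and average, and then counts values in bins. It assumes that the first column of the simulated data is age, and the bin width of 10 years is an arbitrary choice for illustration.

```python
import math
from collections import Counter
from statistics import mean

# First column of the simulated data, assumed here to be age in years.
ages = [25.42, 25.55, 23.18, 25.07, 51.61]

def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width).

    Bins that contain no values are simply left out of the result.
    """
    counts = Counter(math.floor(value / width) * width for value in values)
    return dict(sorted(counts.items()))

print("min:", min(ages), "max:", max(ages), "mean:", round(mean(ages), 2))

# A text-only histogram: one '#' per observation in each bin.
for start, n in bin_counts(ages, 10).items():
    print(f"{start}-{start + 10}: {'#' * n}")
```

Even with five values, the counts show most clients clustered in one bin with a single much older client, which the average alone would hide.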
Categorical Variable
We would use a slightly different visualization, for example a
bar chart, when our data are grouped into categories.
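The numbers behind a bar chart of a categorical variable are just counts per category. Here is a minimal Python sketch using the group column of the simulated data; the variable names are my own.

```python
from collections import Counter

# Group membership for the five simulated clients.
groups = ["Group B", "Group A", "Group A", "Group A", "Group A"]

# The height of each bar is the number of observations in that category.
group_counts = Counter(groups)

for name, n in sorted(group_counts.items()):
    print(f"{name}: {'#' * n} ({n})")
```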
Show The Relationship Of Two Things
Our task becomes somewhat more complicated when we want to understand
the relationship of one thing to another thing.
Categorical by Categorical
Here, for example, we visualize two categorical variables:
neighborhood by group. In this graph, the height of
the bars represents the count of observations.
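The counts for a chart like this come from tallying each combination of the two categories. The Python sketch below does this for the five simulated rows shown earlier; note that all five happen to be in Neighborhood B, so only two combinations appear.

```python
from collections import Counter

# (neighborhood, group) for each of the five simulated clients.
clients = [
    ("Neighborhood B", "Group B"),
    ("Neighborhood B", "Group A"),
    ("Neighborhood B", "Group A"),
    ("Neighborhood B", "Group A"),
    ("Neighborhood B", "Group A"),
]

# Each bar's height is the count of one neighborhood-by-group combination.
cell_counts = Counter(clients)

for (neighborhood, group), n in sorted(cell_counts.items()):
    print(f"{neighborhood}, {group}: {n}")
```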
Continuous by Continuous
Here, we visualize two continuous variables: mental
health by age.
Continuous by Categorical
Last, we visualize a continuous variable by a categorical
variable: mental health by group. In this graph, the
height of the bars represents the mean score.
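The bar heights in a chart like this are group means. As a rough sketch, the Python below computes them for the five simulated rows, assuming the second column of the simulated data is the mental health score; `group_means` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

# (group, score) pairs; the score column is assumed to be mental health.
scores = [
    ("Group B", 105.8),
    ("Group A", 93.27),
    ("Group A", 131.3),
    ("Group A", 112.9),
    ("Group A", 110.0),
]

def group_means(pairs):
    """Return the mean score for each group."""
    by_group = defaultdict(list)
    for group, score in pairs:
        by_group[group].append(score)
    return {group: mean(values) for group, values in sorted(by_group.items())}

for group, average in group_means(scores).items():
    print(f"{group}: {average:.2f}")
```

A mean based on a single observation, as for Group B here, deserves caution when reading the chart.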
Show Where Something Is
Sometimes our task is different. We want to visualize information,
but add information on spatial location, using a map.
Credits
Graphics made with the ggplot2 graphing library created by Hadley Wickham.