Variables & Visualization
What Is The Story You Are Trying To Tell?
Andy Grogan-Kaylor
2021-11-04
Possibilities
As you move forward through this presentation, you can press b to make text bigger or s to make text smaller.
Background
- Deciding upon the right data visualization to represent your data can be a daunting process.
- I believe that a starting point for this thinking is some basic statistical thinking about the type of variables that you have.
- At the broadest level, variables may be conceptualized as categorical variables, or continuous variables.
Data Often Come From A Survey Questionnaire.
What is Data?
A data set is nothing more than a series of rows and columns that contain responses to a survey.
- Rows are usually used for individuals, while columns indicate the questionnaire answers, or measures, from those people.
- Answers to questions are often given numerical codes (e.g. “no” is frequently coded as “0” and “yes” is frequently coded as “1”).
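As a minimal sketch of the rows-and-columns idea, the snippet below stores each respondent as one row and recodes text answers as numbers. The column names `id` and `q1` and the three answers are hypothetical, chosen only for illustration.

```python
# Each row is one respondent; each key is one questionnaire item (a column).
# The names "id" and "q1" are hypothetical, not taken from any real survey.
rows = [
    {"id": 1, "q1": "yes"},
    {"id": 2, "q1": "no"},
    {"id": 3, "q1": "yes"},
]

# Numeric coding of the answers: "no" -> 0, "yes" -> 1.
codes = {"no": 0, "yes": 1}
coded_rows = [{"id": r["id"], "q1": codes[r["q1"]]} for r in rows]

# A column is the same item read across all rows.
q1_column = [r["q1"] for r in coded_rows]
print(q1_column)  # [1, 0, 1]
```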
Hypothetical Data
1 | 1 | 0 | 100 |
2 | 2 | 0 | 200 |
3 | 1 | 1 | -9 |
Some Notes on Data
- In working through our research questions, we’ll constantly be going back and forth between the actual data (to see the pattern of responses) and the documentation, to figure out the actual question asked as well as how the different responses are coded.
- Often in a spreadsheet, you’ll see the full text of a question written out (e.g. “What is your gender identity?”).
- Most programs that work with data are going to want abbreviations (e.g. “Q1” or “gender”) for the questions. These abbreviations should usually have no spaces and be 8 characters or fewer.
Missing Data
- One cell of the sample data set has a negative number.
- Frequently, negative numbers are used to indicate what are called “missing values”. A missing value is a response like “don’t know” or “refused to answer” or “did not answer”.
- Before we start doing calculations with our data, we’ll want to change negative numbers to true missing values (usually symbolized by a “.” or “NA”), so that they don’t goof up our calculations.
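The recoding step can be sketched as follows. This is only an illustration: it assumes that every negative number is a missing-value code, which should be checked against the survey documentation, and it uses Python's `None` to stand in for “NA”. The helper name `recode_missing` is hypothetical.

```python
def recode_missing(value):
    """Return None for negative (assumed missing-value) codes, else the value."""
    return None if value is not None and value < 0 else value

# Hypothetical column of responses; -9 stands for a missing answer.
responses = [100, 200, -9]
cleaned = [recode_missing(v) for v in responses]

# Average only the observed values, skipping the missing one.
observed = [v for v in cleaned if v is not None]
mean_observed = sum(observed) / len(observed)

# For comparison: the average if -9 is wrongly treated as a real value.
naive_mean = sum(responses) / len(responses)

print(cleaned)        # [100, 200, None]
print(mean_observed)  # 150.0
print(naive_mean)     # 97.0
```

Leaving the -9 in place pulls the average from 150 down to 97, which is the kind of distortion the recoding is meant to prevent.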
What are Variables?
- By variables, I simply mean the columns of data that you have.
- For our purposes, you may think of variables as synonymous with questionnaire items, or columns of data.
Variable Types
- Categorical variables represent unordered categories like neighborhood, religious affiliation, or place of residence.
- Continuous variables represent a continuous scale like a mental health scale, or a measure of life expectancy.
A Data Visualization Strategy
Once we have discerned the type of variable that we have, there are two follow-up questions we may ask before deciding upon a chart strategy:
- Is our graph about one thing at a time?
  - How much of x is there?
  - What is the distribution of x?
- Is our graph about two things at a time?
  - What is the relationship of x and y?
  - How are x and y associated?
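One way to think about this strategy is as a small lookup from variable types to a chart. The sketch below is an illustration, not a rule: the function name `suggest_chart` is hypothetical, most chart names follow the examples later in this presentation, and the scatterplot for two continuous variables is an assumption, since the presentation does not name a chart for that case.

```python
def suggest_chart(x_type, y_type=None):
    """Suggest a chart given variable types: 'categorical' or 'continuous'."""
    valid = {"categorical", "continuous"}
    if x_type not in valid or (y_type is not None and y_type not in valid):
        raise ValueError("variable types must be 'categorical' or 'continuous'")

    # One thing at a time: how much of x is there, or how is x distributed?
    if y_type is None:
        if x_type == "continuous":
            return "histogram or dotplot"
        return "bar chart of counts"

    # Two things at a time: how are x and y related?
    types = {x_type, y_type}
    if types == {"categorical"}:
        return "grouped bar chart of counts"
    if types == {"continuous"}:
        return "scatterplot"  # assumption: not named in the presentation
    return "bar chart of group means"

print(suggest_chart("continuous"))
print(suggest_chart("categorical", "categorical"))
print(suggest_chart("continuous", "categorical"))
```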
More On Strategy
Simulated Data
This example uses simulated data on social work clients, of the kind that a social service agency might collect.
Simulated Data
32.82 | 108.9 | Group A | Neighborhood A |
52.17 | 82.18 | Group A | Neighborhood C |
36.22 | 95.6 | Group A | Neighborhood B |
45.41 | 99.98 | Group A | Neighborhood B |
26.52 | 96.73 | Group A | Neighborhood A |
Show One Thing At A Time
We start by visualizing one indicator at a time.
Continuous Variable
Sometimes the most interesting visualizations are those that give us a sense of the maximum, minimum, and average values. For example, the histogram and dotplot display information on age.
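The numbers behind such a chart can be sketched in a few lines. This is an illustration only: the ages are hypothetical, the 10-year bin width is an arbitrary choice, and the helper `histogram_counts` is not part of any library. It prints a rough text histogram rather than drawing a real graphic.

```python
import statistics

# Hypothetical ages, for illustration only.
ages = [27, 33, 36, 45, 52]

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
}

def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        left = start + bin_width * int((v - start) // bin_width)
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

bins = histogram_counts(ages, bin_width=10, start=20)

print(summary)
for left, n in bins.items():
    # One '#' per observation in the bin.
    print(f"{left:>3}-{left + 10:<3} {'#' * n}")
```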
Categorical Variable
We would use a slightly different visualization, such as a bar chart, when our data are grouped into categories.
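A bar chart of a categorical variable starts from a count of each category. The sketch below tallies hypothetical neighborhood values and prints one text bar per category; it is a stand-in for a real chart, not the method used to produce the graphics in this presentation.

```python
from collections import Counter

# Hypothetical neighborhood values, for illustration only.
neighborhoods = [
    "Neighborhood A",
    "Neighborhood C",
    "Neighborhood B",
    "Neighborhood B",
    "Neighborhood A",
]

# Count how many observations fall in each category.
counts = Counter(neighborhoods)

for name in sorted(counts):
    # Bar length equals the number of observations in the category.
    print(f"{name:<15} {'#' * counts[name]}")
```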
Show The Relationship Of Two Things
Our task becomes somewhat more complicated when we want to understand the relationship of one thing to another thing.
Categorical by Categorical
Here, for example, we visualize two categorical variables: neighborhood by group. In this graph, the height of the bars represents the count of observations.
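The bar heights in a chart like this come from a cross-tabulation, that is, a count for every combination of the two variables. A minimal sketch, using hypothetical records:

```python
from collections import Counter

# Hypothetical (group, neighborhood) pairs, one per client.
records = [
    ("Group A", "Neighborhood A"),
    ("Group A", "Neighborhood B"),
    ("Group B", "Neighborhood A"),
    ("Group B", "Neighborhood A"),
    ("Group B", "Neighborhood C"),
]

# Count observations for each combination of group and neighborhood.
crosstab = Counter(records)

for (group, neighborhood), n in sorted(crosstab.items()):
    print(f"{group} | {neighborhood} | {n}")
```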
Continuous by Continuous
Here, we visualize two continuous variables: mental health by age.
Continuous by Categorical
Last, we visualize a continuous variable by a categorical variable: mental health by group. In this graph, the height of the bars represents the mean score.
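The mean score per group can be computed as sketched below. The scores and group labels are hypothetical, and the variable names are chosen only for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, mental health score) pairs, one per client.
scores = [
    ("Group A", 100.0),
    ("Group A", 90.0),
    ("Group B", 80.0),
    ("Group B", 70.0),
    ("Group B", 90.0),
]

# Collect the scores that belong to each group.
by_group = defaultdict(list)
for group, score in scores:
    by_group[group].append(score)

# The height of each bar is the mean score within that group.
group_means = {group: mean(values) for group, values in by_group.items()}

print(group_means)  # {'Group A': 95.0, 'Group B': 80.0}
```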
Show Where Something Is
Sometimes our task is different. We want to visualize information, but also show its spatial location, using a map.
Credits
Graphics made with the ggplot2 graphing library created by Hadley Wickham.