Why Spreadsheets Are A Bad Format For Storing Data
stats
workflow
Author
Andy Grogan-Kaylor
Published
July 1, 2025
Background
Excel and Google Sheets are sometimes used as programs to collect and store data. Excel and Google Sheets are often used because they are the only programs that some individuals or agencies have for storing data. Excel and other spreadsheet programs may also be used because spreadsheets can be very intuitive and easy ways of managing small amounts of data.
However, Excel and Google Sheets may be problematic as a data storage solution for a number of reasons detailed below, especially as data sets grow in size. Notably, statistical programs like Stata, SAS, or SPSS all store additional information with each variable such as: a variable label, describing the contents of the variable, or the survey question that resulted in the variable; and a value label, which attaches qualitative information to each possible value of the response.1
Excel and Google Sheets do not generally contain this extra information about each variable, or column of data, which may lead to errors in working with quantitative information.
The data below are stored in Stata format, but could as easily be stored in SAS or SPSS format.
Get And Describe The Data
Screenshot of Stata
use"simulated-happiness-data.dta", cleardescribe
Contains data from simulated-happiness-data.dta
Observations: 100
Variables: 2 18 Jul 2025 11:51
-------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------------
id float %9.0g id
Q1 float %9.0g Likert1 How often do you feel happy?
-------------------------------------------------------------------------------------
Sorted by:
Descriptive Statistics and Bar Graph
The Descriptive Statistics And Graph Are Informative
Notice how the descriptive statistics and graph are informative in that they contain information on the variable label and value label. These help us to get an intuitive sense of the information in the data. We see this information when we list out the data as well.
How often |
do you feel |
happy? | Freq. Percent Cum.
------------+-----------------------------------
Never | 21 21.00 21.00
Rarely | 29 29.00 50.00
Sometimes | 28 28.00 78.00
Often | 22 22.00 100.00
------------+-----------------------------------
Total | 100 100.00
Bar Graph
use"simulated-happiness-data.dta", cleargraphbar, over(Q1) title("How often do you feel happy?") asyvarsgraphexport mybar1.png, width(1500) replace
Statistical Formats Save Work and Reduce Error Rates
Adding this valuable information back into the data set may take a great deal of extra effort, which may lead to errors.
Descriptive Statistics and Bar Graph
Notice how the descriptive statistics and graph are much less informative. For example, it is now not immediately clear what Q1 represents.
It is also not clear whether higher values of Q1 represent higher levels of happiness, or higher levels of unhappiness, a crucially important substantive distinction. The information on variable label and value label will have to be added back into the data when preparing a report for dissemination.
If there are several, or many, variables in the data set, this step of adding information back into the data will have to be done multiple times, increasing the time for data processing, as well as introducing a possibility for error.
use"simulated-happiness-data-EXCEL.dta", cleargraphbar, over(Q1) title("How often do you feel happy?") asyvarsgraphexport mybar2.png, width(1500) replace
Bar Graph
A Few Final Issues
Notice how Excel doesn’t enforce the idea of whether variables are numeric, or string, and so would allow storage of different types of information in the same column. Relatedly, numeric variables may be improperly stored as strings, often necessitating recoding before graphical or statistical procedures can be employed.
Excel also doesn’t enforce the idea that data in the same row belong to the same individual, making it easily to accidentally move, or write over, data.
Lastly, Excel would allow some of your columns to have the same name, which might make data difficult to work with in other software.
x
y
verylongvariablename
verylongvariablename
verylongvariablename
100
1
Smith
20
30
200
2
30
NA
40
not applicable
x
yes
60
50
Footnotes
Over time, I’ve realized that the issues discussed in this post also apply to the use of CSV (comma separated value) files. While CSV files are convenient, and can be imported by almost any software, they may lead to many of the same issues discussed in this post.↩︎