Why Spreadsheets Are A Bad Format For Storing Data

stats
workflow
Author

Andy Grogan-Kaylor

Published

July 1, 2025

Background

Excel and Google Sheets are sometimes used as programs to collect and store data. Excel and Google Sheets are often used because they are the only programs that some individuals or agencies have for storing data. Excel and other spreadsheet programs may also be used because spreadsheets can be very intuitive and easy ways of managing small amounts of data.

However, Excel and Google Sheets may be problematic as a data storage solution for a number of reasons detailed below, especially as data sets grow in size. Notably, statistical programs like Stata, SAS, or SPSS all store additional information with each variable such as: a variable label, describing the contents of the variable, or the survey question that resulted in the variable; and a value label, which attaches qualitative information to each possible value of the response.1

Excel and Google Sheets do not generally contain this extra information about each variable, or column of data, which may lead to errors in working with quantitative information.

The data below are stored in Stata format, but could as easily be stored in SAS or SPSS format.

Get And Describe The Data

Screenshot of Stata

Screenshot of Stata

use "simulated-happiness-data.dta", clear

describe
Contains data from simulated-happiness-data.dta
 Observations:           100                  
    Variables:             2                  18 Jul 2025 11:51
-------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------
id              float   %9.0g                 id
Q1              float   %9.0g      Likert1    How often do you feel happy?
-------------------------------------------------------------------------------------
Sorted by: 

Descriptive Statistics and Bar Graph

The Descriptive Statistics And Graph Are Informative

Notice how the descriptive statistics and graph are informative in that they contain information on the variable label and value label. These help us to get an intuitive sense of the information in the data. We see this information when we list out the data as well.

Descriptive Statistics


use "simulated-happiness-data.dta", clear

tabulate Q1
  How often |
do you feel |
     happy? |      Freq.     Percent        Cum.
------------+-----------------------------------
      Never |         21       21.00       21.00
     Rarely |         29       29.00       50.00
  Sometimes |         28       28.00       78.00
      Often |         22       22.00      100.00
------------+-----------------------------------
      Total |        100      100.00

Bar Graph


use "simulated-happiness-data.dta", clear

graph bar, over(Q1) title("How often do you feel happy?") asyvars
    
graph export mybar1.png, width(1500) replace

Bar Graph

Bar Graph

List Out A Sample Of The Data


use "simulated-happiness-data.dta", clear

list in 1/10
     | id          Q1 |
     |----------------|
  1. |  1       Never |
  2. |  2   Sometimes |
  3. |  3       Never |
  4. |  4   Sometimes |
  5. |  5   Sometimes |
     |----------------|
  6. |  6       Often |
  7. |  7      Rarely |
  8. |  8   Sometimes |
  9. |  9       Never |
 10. | 10      Rarely |
     +----------------+

Now Use The Data In Excel Format

Spreadsheets Are All Similar

The example below focuses on Excel, but could as easily apply to a program like Google Sheets.

I’ve saved this simulated data in Excel format.

Screenshot of Excel

Screenshot of Excel

I now import the Excel data file. I use the first row of data as variable names.

Spreadsheet Data Are Less Informative

We see right away–when we list some of the data–that the data are less informative.


use "simulated-happiness-data-EXCEL.dta", clear

list in 1/10
     | id   Q1 |
     |---------|
  1. |  1    1 |
  2. |  2    3 |
  3. |  3    1 |
  4. |  4    3 |
  5. |  5    3 |
     |---------|
  6. |  6    4 |
  7. |  7    2 |
  8. |  8    3 |
  9. |  9    1 |
 10. | 10    2 |
     +---------+
Statistical Formats Save Work and Reduce Error Rates

Adding this valuable information back into the data set may take a great deal of extra effort, which may lead to errors.

Descriptive Statistics and Bar Graph

Notice how the descriptive statistics and graph are much less informative. For example, it is now not immediately clear what Q1 represents.

It is also not clear whether higher values of Q1 represent higher levels of happiness, or higher levels of unhappiness, a crucially important substantive distinction. The information on variable label and value label will have to be added back into the data when preparing a report for dissemination.

If there are several, or many, variables in the data set, this step of adding information back into the data will have to be done multiple times, increasing the time for data processing, as well as introducing a possibility for error.

Descriptive Statistics


use "simulated-happiness-data-EXCEL.dta", clear

tabulate Q1
         Q1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         21       21.00       21.00
          2 |         29       29.00       50.00
          3 |         28       28.00       78.00
          4 |         22       22.00      100.00
------------+-----------------------------------
      Total |        100      100.00

Bar Graph


use "simulated-happiness-data-EXCEL.dta", clear

graph bar, over(Q1) title("How often do you feel happy?") asyvars
    
graph export mybar2.png, width(1500) replace

Bar Graph

Bar Graph

A Few Final Issues

Notice how Excel doesn’t enforce the idea of whether variables are numeric, or string, and so would allow storage of different types of information in the same column. Relatedly, numeric variables may be improperly stored as strings, often necessitating recoding before graphical or statistical procedures can be employed.

Excel also doesn’t enforce the idea that data in the same row belong to the same individual, making it easily to accidentally move, or write over, data.

Lastly, Excel would allow some of your columns to have the same name, which might make data difficult to work with in other software.

x y verylongvariablename verylongvariablename verylongvariablename
100 1 Smith 20 30
200 2 30 NA 40
not applicable x yes 60 50

Footnotes

  1. Over time, I’ve realized that the issues discussed in this post also apply to the use of CSV (comma separated value) files. While CSV files are convenient, and can be imported by almost any software, they may lead to many of the same issues discussed in this post.↩︎