Tidy Data and How to Get It
By John Spencer, MA, Senior GIS Technical Specialist, MEASURE Evaluation
You’re ready to sit down with a newly-obtained dataset, excited about how it will open a world of insight and understanding, and then find you can’t use it. You’ll first have to spend a significant amount of time to restructure the data to even begin to produce a set of basic descriptive statistics or link it to other data you’ve been using.
If you’ve had this experience, you’ve run into an untidy dataset.
Untidy data is a mess. The variable names are weird; observations are stored in columns when they should be in rows; time series data is recorded so that it’s difficult to calculate elapsed time. Bottom line: When data is untidy, a separate effort will likely be required before you can use it for any analysis.
In global health, untidy data is a problem that can undermine the effectiveness of even the strongest health information system and introduce a barrier to effective data use. As health systems grow, it is as important that data coming out of the system is as easy to work with as the data going into the system. This is because more and more people rely on the data for programmatic decisions, specialized analysis, and linking with other systems. Tidy data is essential to a strong health system.
“Tidy data” is a term meant to provide a framework for producing data that conform to standards that make data easier to use. Tidy data may still require some cleaning for analysis, but the job will be much easier. The concepts behind tidy data were described in a 2014 paper: “Tidy Data” by Hadley Wickham.[1] He describes three fundamental attributes of tidy data:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
Untidy data most often look like this:
- Column headers are values, not variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables
I’ve run across all of these issues over the years. It’s such a common problem that people reference the 90/10 rule—90 percent of your effort will be getting the data ready and only 10 percent will be analysis and mapping. There are those who refer to it as the 80/20 rule, but they’re the optimists among us.
Where does untidy data come from? I’ve usually found it originates in two ways:
- Data in independent systems such as national health information systems, donor reporting systems or other government ministries are designed to serve that organization’s needs but little consideration is given to how the data can be integrated with data from other systems.
- Data are originally tidy but become untidy when they are exported. For instance, data stored in a routine health information system is well structured but becomes untidy when it is exported to an Excel file or CSV file.
Resolving both of these issues requires an increased awareness of tidy data among data providers and data users. Data providers can make their data available in one well-defined export format that conforms to tidy data standards. Data users might still need to make some modifications to the data but it’s much easier to do that when the data is tidy to begin with.
Tidy data are an important component of fully realizing the potential that exists as data proliferate. Until the world is rid of untidy data, here are some tools that can help with the tidying:
Jean-Nicholas Hould provides an overview of tools in Python programming language. Find it at (http://www.jeannicholashould.com/tidy-data-in-python.html).
The statistical programming language R is a great tool for data analysis and for tidying data. There are several packages available in R that can help with tidy data. Tidyr by Hadley Wickham is probably the best one to start with. Find it at https://blog.rstudio.org/2014/07/22/introducing-tidyr/
More information about the R programming language in general can be found at https://www.r-project.org/about.html and there are many good R tutorials on the web.
Excel is not necessarily the best tool to tidy data but it can do some things. Microsoft describes how to clean data and includes some plug-ins that could be helpful at https://support.office.com/en-us/article/Top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19.
And, you’ll also find a good overview of some useful functions at http://myexcelonline.com/blog/top-excel-data-cleansing-techniques/.
For more information
MEASURE Evaluation is funded by USAID to strengthen capacity in developing countries to gather, interpret, and use data to improve health. For more information on the project’s work in data science, visit: https://www.measureevaluation.org/our-work/data-science.
And for useful information on tidy data and the importance of privacy and confidentiality regarding geospatial data, see these two FAQs from MEASURE Evaluation.
[1] Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software; Vol 59; Issue 10. Retrieved from https://www.jstatsoft.org/article/view/v059i10