4. Clean data

Data needs to be in a usable format to conduct an analysis. This often requires some preparation and “cleaning” of the data. Cleaning refers to the removal or modification of incorrect, incomplete or irrelevant data. Even open datasets may require some cleaning.

Structured or unstructured

Structured data is data that is well-organised. It follows a specific order, in a specific format. The data is categorised by different characteristics. Unstructured data has a range of information that is not organised into separate categories. It needs to be structured into a machine-readable format so that a computer can “read” the data.

Preparing data

You may have to sort or standardise the data. Dates, names and addresses are common data that require preparation.

Data example

Survey respondents may provide a range of answers to a question:

e.g. “Where do you live?”

  • Brisbane, QLD, Australia
  • Alice Springs, NT
  • Hobart, Aus

You would need to sort the responses into separate fields and standard formats.

Example of standard data
City State Country
Brisbane Queensland Australia
Alice Springs Northern Territory Australia
Hobart Tasmania Australia

Data formats

Data that you analyse by computer must be in a format that software can open and read. The delimited format, where each variable is separated by a comma, tab or space, can be imported into software for analysis.

The *.csv (comma-separated value) extension is useful because it maintains field separations across software. If you have created a table in Microsoft Excel, and want to export it to another type of software, it is better to save the file as a csv.

Save an Excel file as a csv:

  1. Click File
  2. Select Save As
  3. Then select the CSV (Comma delimited) *.csv option from the list
  4. Choose the location to save the file
  5. Click Save.

A csv file may not import correctly into other software if there are:

  • blanks lines
  • special (non-standard) characters e.g. # & @
  • missing commas between fields.

Clean file example

Data in a “clean” csv file should appear like this example of female Oscar winners:

Index, Year, Age, Name, Movie
1, 2016, 26, Brie Larson, Room
2, 2017, 28, Emma Stone, La La Land
3, 2018, 60, Frances McDormand, Three Billboards Outside Ebbing Missouri
4, 2019, 45, Olivia Colman, The Favourite
5, 2020, 50, Renee Zellweger, Judy

Download this example data file (xlsx, 10KB) to see how it appears in a spreadsheet.

A data file may require special statistical software to open it.

Tools for cleaning data

You can manually sort data into the correct fields but if you have a lot of data it can be very time consuming. You can use formulas and functions in software to do the task much faster.

Data cleaning tools

Tools with functions to clean data

Tool Freely available Guides Tutorials
OpenRefine Yes OpenRefine Documentation Introduction to Google Refine (YouTube, 6m47s)
R Yes Essential R resources Cleaning Bad Data in R (LinkedIn Learning, 1h 54m) UQ login required
Microsoft Excel Get Office 365 for personal use (via UQ) Top 10 ways to clean your data Clean data imported into Excel (LinkedIn Learning, 5m 13s) UQ login required

Our Text mining and text analysis guide has a lot of information about preparing text for analysis.

Licence

Icon for the Creative Commons Attribution-NonCommercial 4.0 International License

Work with Data and Files Copyright © 2023 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.

Share This Book