4. Clean data
Data needs to be in a usable format to conduct an analysis. This often requires some preparation and “cleaning” of the data. Cleaning refers to the removal or modification of incorrect, incomplete or irrelevant data. Even open datasets may require some cleaning.
Structured or unstructured
Structured data is data that is well-organised. It follows a specific order, in a specific format. The data is categorised by different characteristics. Unstructured data has a range of information that is not organised into separate categories. It needs to be structured into a machine-readable format so that a computer can “read” the data.
Preparing data
You may have to sort or standardise the data. Dates, names and addresses are common data that require preparation.
Data example
Survey respondents may provide a range of answers to a question:
e.g. “Where do you live?”
- Brisbane, QLD, Australia
- Alice Springs, NT
- Hobart, Aus
You would need to sort the responses into separate fields and standard formats.
City | State | Country |
---|---|---|
Brisbane | Queensland | Australia |
Alice Springs | Northern Territory | Australia |
Hobart | Tasmania | Australia |
Data formats
Data that you analyse by computer must be in a format that software can open and read. The delimited format, where each variable is separated by a comma, tab or space, can be imported into software for analysis.
The *.csv (comma-separated value) extension is useful because it maintains field separations across software. If you have created a table in Microsoft Excel, and want to export it to another type of software, it is better to save the file as a csv.
Save an Excel file as a csv:
- Click File.
- Select Save As.
- Then select the CSV (Comma delimited) *.csv option from the list.
- Choose the location to save the file.
- Click Save.
A csv file may not import correctly into other software if there are:
- blanks lines
- special (non-standard) characters e.g. # & @
- missing commas between fields.
Clean file example
Data in a “clean” csv file should appear like this example of female Oscar winners:
Index, Year, Age, Name, Movie
1, 2016, 26, Brie Larson, Room
2, 2017, 28, Emma Stone, La La Land
3, 2018, 60, Frances McDormand, Three Billboards Outside Ebbing Missouri
4, 2019, 45, Olivia Colman, The Favourite
5, 2020, 50, Renee Zellweger, Judy
Download this example data file (xlsx, 10KB) to see how it appears in a spreadsheet.
A data file may require special statistical software to open it.
Tools for cleaning data
You can manually sort data into the correct fields but if you have a lot of data it can be very time consuming. You can use formulas and functions in software to do the task much faster.
Data cleaning tools
Tools with functions to clean data
Our Text mining and text analysis guide has a lot of information about preparing text for analysis.