"

4. Data processing

The processing phase of a project consists of all the steps involved in transforming your original data files (or the importable versions) into the fully cleaned and processed analysis data files that you use to generate your results.

If you are not using statistical software for your analysis, you may not need to perform all the steps outlined here, but it is still important to document how you have cleaned and processed the data. Remember to use stable, well-documented file formats so that the files can be accessed and the data reused in the long term.

The Library regularly has training sessions on various statistical software packages, including R and RStudio.

Command files folder

If you are using statistical software, the Command files folder should contain one or more files with code written in the syntax of the software you use for the study.  The code in these command files should execute all the data processing and analysis necessary to replicate the study and reproduce the reported results.

The best way to construct and organise your command files may vary depending on the nature of the project or the requirements of the software. In many cases, you will be able to organise the work into three steps, with one or more command files carrying out each step.

Step 1: Processing the data using statistical software

If you are using statistical software, all of the commands necessary for processing your data must be written in a command file, or in several command files that can be run sequentially.  When you have finished writing these command files, executing them will automatically conduct all the procedures necessary to transform your original (or importable) data files into your analysis data files.

Writing, experimenting with and editing these command files, until they successfully carry out the necessary steps of processing, is the focal point of the work you do with your data.

Transform your data file into analysis data files

In one or more command files, write code that transforms your importable data files into your analysis data files. Exactly what steps of processing are required varies but some common procedures include:

  • Having your software open your importable data files
  • Cleaning the data to resolve any errors or discrepancies
  • Removing variables or cases that you do not need
  • Combining data from different importable data files
  • Transposing a data table so that columns become rows and rows become columns
  • Generating new variables
  • Saving intermediate and analysis data files.
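The procedures listed above can be sketched as a short command file. The example below is a minimal illustration only, written in Python with the standard library because this guide does not prescribe a language. The function name process_survey and the column names age, interviewer_notes and age_group are hypothetical, and the rule for dropping cases is an assumption you would replace with your own.

```python
import csv
from pathlib import Path


def process_survey(importable_path, analysis_path):
    """Read an importable CSV file, clean it, and save an analysis data file.

    Column names used here are hypothetical examples.
    Returns the number of cases written.
    """
    # Open the importable data file.
    with open(importable_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    cleaned = []
    for row in rows:
        # Clean: strip stray whitespace from every value.
        row = {key: (value or "").strip() for key, value in row.items()}
        # Remove cases that are not needed (assumption: no recorded age).
        if row["age"] == "":
            continue
        # Remove a variable that is not needed.
        row.pop("interviewer_notes", None)
        # Generate a new variable from an existing one.
        row["age_group"] = "under_40" if int(row["age"]) < 40 else "40_plus"
        cleaned.append(row)

    if not cleaned:
        raise ValueError("No cases left after cleaning")

    # Save the analysis data file, creating the folder if necessary.
    Path(analysis_path).parent.mkdir(parents=True, exist_ok=True)
    with open(analysis_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(cleaned[0].keys()))
        writer.writeheader()
        writer.writerows(cleaned)
    return len(cleaned)
```

Because every step is written down in the command file, rerunning it on the same importable data file should always produce the same analysis data file.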

Decide how to organise the code that processes your data into one or more command files

It is possible to put all the necessary commands in a single file. However, in many cases separating different parts of the processing phase into different command files can help you keep track of what you are doing. The best way of dividing your data processing among command files will depend on the particulars of your project but the following scheme often works well:

  1. For every importable data file you have, write one command file that reads and cleans the data it contains, so that the data is ready to be merged with the data from the other importable data files. Save the cleaned data in a new file, in the native format of the software you are using.
  2. Then write one additional command file that merges all these natively formatted files, processes them as necessary to construct the analysis data files and then saves the analysis data files in your software’s native format.
  3. Depending on the number of importable data files you have and how the data in them is organised, other schemes for dividing the processing phase among your command files may be more convenient. You should use whatever scheme you find works best for your project.
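The merging step in the scheme above might look something like the following sketch. It is written in Python as an illustration only, and it makes two assumptions that may not hold for your data: each case is identified by a single key variable, and the key is unique within the second file. The function name merge_on_key is hypothetical.

```python
def merge_on_key(left_rows, right_rows, key):
    """Combine two lists of row dictionaries on a shared key variable.

    Keeps only cases found in both files. Assumes the key is unique
    within right_rows. Returns the merged rows and the keys from
    left_rows that had no match, so they can be checked.
    """
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    unmatched = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is None:
            unmatched.append(row[key])
        else:
            # Variables from both files end up in one row.
            merged.append({**row, **match})
    return merged, unmatched
```

Returning the unmatched keys makes it easier to notice cases that were silently lost in the merge, which is worth recording when you document your processing.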

Whatever scheme you choose, explain in your Readme file (section 6) the order in which the command files need to be run to replicate your project.
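One way to make the run order explicit, in addition to describing it in the Readme file, is a small master script that runs the command files in sequence. The sketch below assumes your command files are Python scripts; the file names in RUN_ORDER and the function name run_all are hypothetical.

```python
import runpy
from pathlib import Path

# Hypothetical file names: replace with the order given in your Readme file.
RUN_ORDER = [
    "01_clean_survey.py",
    "02_clean_census.py",
    "03_merge_and_construct.py",
]


def run_all(command_folder, run_order=RUN_ORDER):
    """Run each command file in order, stopping at the first failure."""
    completed = []
    for name in run_order:
        script = Path(command_folder) / name
        if not script.is_file():
            raise FileNotFoundError(f"Missing command file: {script}")
        # Execute the script as if it had been run directly.
        runpy.run_path(str(script), run_name="__main__")
        completed.append(name)
    return completed
```

Numbering the file names, as in this example, also lets a reader see the intended order from the folder listing alone.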

Save your files

Save your command files and your analysis data files in the appropriate folders:

  1. The command files that process your data and create your analysis data files in your Command files folder
  2. The analysis data files in your Analysis data folder (The Analysis data folder is explained in section 5).

Step 2: Constructing the Data appendix

The data appendix, saved in the Documents folder, is one of the three documents you created before you began working with your data, in the pre-data phase.

Construct your data appendix after you have finished writing the command files that create your analysis data files and before you begin your analysis, as you may learn things about your data that you should know before you start the analysis.

The data appendix:

  • should provide information about every variable in your analysis data files, including names, definitions and coding (for all variables), summary statistics and histograms (for quantitative variables), and relative frequency tables and charts (for categorical variables).
  • serves as a codebook and users’ guide for your analysis data files.

For every variable, include:

  • The name of the variable and a complete definition (e.g. coding or units of measurement, the wording of a survey question the variable is based on or adjustments made for inflation)
  • The name of the original data file from which the variable was extracted, or from which the variables used to construct it were extracted, and the names of the variables extracted from the original data files
  • The number of observations with valid values for the variable, and the number of observations with missing values.
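The counts of valid and missing values can be produced by code rather than by hand. The following Python sketch is illustrative only: the set of missing-value codes is an assumption that you would replace with the coding used in your own data, and the function name count_valid_and_missing is hypothetical.

```python
# Assumption: these strings mark a missing value. Adjust to your own coding.
MISSING_CODES = {"", "NA", "-99"}


def count_valid_and_missing(rows, variable):
    """Count observations with valid and with missing values for one variable.

    rows is a list of dictionaries, one per observation, as produced by
    csv.DictReader. A variable absent from a row is counted as missing.
    """
    values = [(row.get(variable) or "").strip() for row in rows]
    n_missing = sum(1 for value in values if value in MISSING_CODES)
    return {
        "variable": variable,
        "n_valid": len(values) - n_missing,
        "n_missing": n_missing,
    }
```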

Generate the descriptive statistics, tables and figures

Write a command file that generates all the descriptive statistics, tables and figures needed for the data appendix. These should be created using the data in your analysis data files.

  1. Give this command file the name DataAppendix
  2. Save DataAppendix in your Command files folder
  3. Finish composing the data appendix, inserting the descriptive statistics, tables and figures in the appropriate places
  4. When you have finished, save the data appendix in your Documents folder. Refer to section 6 for more information about the Documents folder.
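A DataAppendix command file could contain functions along the following lines. This is a sketch in Python using only the standard library, with hypothetical function names. It covers summary statistics and relative frequency tables only; histograms and charts would need a plotting library, which is not shown here.

```python
import statistics
from collections import Counter


def summarise_quantitative(values):
    """Summary statistics for a quantitative variable.

    values is a list of numbers with missing values already removed.
    The sample standard deviation needs at least two observations.
    """
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def relative_frequencies(values):
    """Relative frequency table for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in sorted(counts.items())}
```

For example, relative_frequencies(["a", "b", "a", "a"]) gives a share of 0.75 for category "a" and 0.25 for category "b".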

Step 3: Generating the results

Using the data in the analysis data files, write command files that carry out the procedures that generate the results reported in the research project. Before each command that generates a result, include a comment indicating which result it produces (e.g. the table or figure number, or the page on which the numerical result appears).
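The commenting convention described above might look like this in practice. The data, variable names, and table and page numbers below are all invented for illustration, and Python is used only as an example language.

```python
import statistics

# Hypothetical analysis data. In a real project this would be read
# from a file in the Analysis data folder.
analysis_rows = [
    {"group": "treatment", "score": 12.0},
    {"group": "treatment", "score": 15.0},
    {"group": "control", "score": 10.0},
    {"group": "control", "score": 11.0},
]


def scores_for(group):
    """Return the scores of all cases in the given group."""
    return [row["score"] for row in analysis_rows if row["group"] == group]


# Table 1, row 1: mean score, treatment group
mean_treatment = statistics.mean(scores_for("treatment"))

# Table 1, row 2: mean score, control group
mean_control = statistics.mean(scores_for("control"))

# Page 7, in-text result: difference in mean scores
difference = mean_treatment - mean_control
```

With comments like these, someone checking a reported number can search the command file for the table or page reference and find the command that produced it.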

Even if you deviate from this scheme, it provides a useful framework to begin thinking about the most effective way to organise your command files.

Licence

Document your research data Copyright © 2023 by The University of Queensland is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.