Cut continuous variables into discrete categorical

Introduction

Continuous variables often need to be cut into categorical ones along the ranges of their values. For example, some continuous scales in large-scale assessments and surveys can be converted into two, three or more categories depending on the cut points provided by the user. Examples are the complex background scales in TIMSS and PIRLS, which are also provided as “index” variables with fixed categories in the form of “from-to”. The lsa.cut.vars function cuts continuous variables into discrete ones using user-defined ranges. The resulting variables can be numeric or categorical (i.e. factors), depending on whether value labels for the new values are provided.

The continuous variables cutting function and its arguments

The lsa.cut.vars function has the following arguments:

  • data.file – The file containing the lsa.data object. Either this or data.object shall be specified, but not both. See details.
  • data.object – The object in memory containing the lsa.data object. Either this or data.file shall be specified, but not both. See details.
  • src.variables – Names of the variables to cut into categories. Accepts only continuous variables. No PV variables are accepted. See details.
  • new.variables – The names of the new, cut variables to append to the dataset. See details.
  • new.var.labels – Optional, vector of strings to add as variable labels for the new.variables. See details.
  • cut.points – Vector of numeric values to cut the src.variables between. See details.
  • value.labels – Optional, character vector of labels to assign to the new discrete categories in the new.variables. See details.
  • out.file – Full path to the .RData file to be written. If missing, the original object will be overwritten in the memory. See examples.

Notes:

  1. Either data.file or data.object shall be provided as source of data. If both of them are provided, the function will stop with an error message.
  2. The src.variables specifies the variables that shall be cut. Only continuous variables are accepted. Multiple src.variables can be passed. These will be split at the same cut points (see below). PVs are not accepted.
  3. The new.variables argument is optional and specifies the names of the new discrete variables created from the src.variables. The new.variables names follow the same order as the src.variables. If the new.variables argument is omitted, the function will create the names automatically, appending CUT to the end of each src.variables name, and store the discrete variable data under these names. If provided, the number of new.variables must be the same as the number of src.variables.
  4. The new.var.labels argument is optional. Regardless of whether new.variables are provided, if new.var.labels are provided, they will be assigned to the new.variables generated from the discretization. If neither new.variables nor new.var.labels are provided, the function will automatically generate new.variables (see above) and copy the variable labels from src.variables to the newly generated variables, adding Cut at the beginning. The argument takes a vector with the same number of elements as the number of variable names in src.variables.
  5. cut.points is a mandatory argument. It specifies the ranges (from-to) in the original variables to be cut into discrete categories. There can be multiple cut.points; the new values will be the ranges between them. For example, if the cut points 3.29309, 7.97028, 9.98618, and 10.99411 are passed, there will be five categories in the resulting discrete variables, as follows:
    1. From lowest up to 3.29309;
    2. From above 3.29309 up to 7.97028;
    3. From above 7.97028 up to 9.98618;
    4. From above 9.98618 up to 10.99411; and
    5. From above 10.99411 to the highest value.
  6. The cut.points must be within the range of the src.variables. Otherwise the function will stop with an error.
  7. The value.labels is optional. If omitted, the values in the new discrete variables will be numeric (integers). If the data was exported with missing.to.NA = FALSE (i.e. user-defined missings are kept) the missing values will remain as they are. If the value.labels are provided, the new values will be converted to factor levels. If the data was exported with missing.to.NA = FALSE the names of missing values will be assigned to factor levels too. Either way, the missing values will remain as missing values and handled properly by the analysis functions. If missing.to.NA = TRUE (i.e. setting the user-defined missing values to NA), the NA values will remain as NA in the resulting discrete new.variables.
  8. If a full path to an .RData file is provided to out.file, the dataset will be written to that file. If not, the updated data will remain in memory.
  9. The result is an lsa.data object in memory (if out.file is missing) or an .RData file containing the lsa.data object with the new discrete variables.
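
The interval logic described in note 5 can be sketched in base R. This is only an illustration with made-up values: it uses base R's cut() and makes no claim about how lsa.cut.vars works internally. The scale.values vector is hypothetical.

```r
# Hypothetical scale values; not taken from any real dataset.
scale.values <- c(1.5, 3.29309, 5, 9.98618, 10.5, 12)

# The cut points from note 5.
cut.points <- c(3.29309, 7.97028, 9.98618, 10.99411)

# Pad the cut points with -Inf and Inf so that the lowest and highest
# values fall into the first and last category. right = TRUE makes each
# interval include its upper bound ("from above ... up to ...").
categories <- cut(scale.values,
                  breaks = c(-Inf, cut.points, Inf),
                  right = TRUE,
                  labels = FALSE)

print(categories)
```

With four cut points there are five categories. In this sketch a value exactly equal to a cut point (such as 3.29309) lands in the lower of the two neighbouring categories, which matches the “up to” wording of the list above.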

Cutting continuous variables into discrete categorical using the command line

In the examples that follow we will merge a new data file (see how to merge files here) with student data from PIRLS 2021 (Australia and Slovenia), taking all variables from both file types:

lsa.merge.data(inp.folder = "C:/temp",
               file.types = list(acg = NULL, asg = NULL),
               ISO = c("aus", "svn"),
               out.file = "C:/temp/merged/PIRLS_20221_ASG_merged.RData")

Note that the selected variables must be continuous, not categorical. The variables also cannot be PVs. If either condition is not met, lsa.cut.vars will stop with an error message. So, let’s cut the Students Like Reading (ASBGSLR) and the Home Resources for Learning (ASBGHRL) continuous scales into discrete categorical variables. The cut points must be within the range of values of each source variable. To check the ranges (minimum and maximum values) of the source variables, use the lsa.data.diag function (you can see how to do this here).
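
The range requirement can also be checked by hand. The following base R sketch uses a hypothetical helper, cut_points_in_range, which is not part of RALSA, and made-up values. It assumes that missing values are stored as NA and that cut points equal to the minimum or maximum count as being within the range; lsa.cut.vars may treat the boundaries differently.

```r
# Hypothetical helper, not part of RALSA: returns TRUE if every cut point
# lies within the observed range of x (NA values are ignored).
cut_points_in_range <- function(x, cut.points) {
  observed <- range(x, na.rm = TRUE)
  all(cut.points >= observed[1] & cut.points <= observed[2])
}

# Made-up values standing in for a continuous scale.
scale.values <- c(2.8, 6.1, NA, 9.4, 13.2)

# All four cut points lie between 2.8 and 13.2.
inside <- cut_points_in_range(scale.values, c(4.1, 7.9, 9.9, 10.7))

# 14.5 is above the observed maximum of 13.2.
outside <- cut_points_in_range(scale.values, c(4.1, 7.9, 9.9, 14.5))

print(c(inside = inside, outside = outside))
```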

Now that the data from these two countries are merged, we will cut the PIRLS 2021 Students Like Reading (ASBGSLR) and Home Resources for Learning (ASBGHRL) scales. The variables for these two scales in the database are continuous. Multiple variables can be cut into discrete ones at the same time, as in this example. When cutting them, we will assign value labels for the discrete categories, as well as variable labels. Note that, as explained in the previous section, if the new value labels are omitted, the resulting variables will contain numeric (integer) values. If they are provided, the variables will be categorical (factors). If variable labels are provided, they will be assigned as descriptive labels for the new variables. If omitted, the variable labels will be copied over from the source variables, with “Cut” added at the very front. The syntax below provides all these details. The source data file is overwritten with the data containing the two discretized variables.

lsa.cut.vars(data.file = "C:/temp/merged/PIRLS_20221_ASG_merged.RData",
             src.variables = c("ASBGSLR", "ASBGHRL"),
             new.variables = c("ASBGSLRREC", "ASBGHRLREC"),
             new.var.labels = c("Categorical like reading", "Categorical learning resources"),
             cut.points = c(4.1, 7.9, 9.9, 10.7),
             value.labels = c("Very low", "Low", "Medium", "High", "Very high"),
             out.file = "C:/temp/merged/PIRLS_20221_ASG_merged.RData")

The call to this syntax will return the following output in the console:

If the data was exported with missing.to.NA = FALSE (i.e. user-defined missings are kept) the codes for the missing values will remain as they are and they will be marked as such, so that any other functions (data preparation or analysis) will treat them as missing values when using the data.
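
To make the distinction between numeric and factor results concrete, here is a base R sketch that reuses the cut points and value labels from the call above. The values are made up, and cut() stands in for the discretization step; this is an illustration, not the RALSA implementation.

```r
# Made-up scale values for illustration only.
scale.values <- c(3.2, 6.5, 8.8, 10.1, 12.4)

# Cut points from the example call, padded so that all values are covered.
breaks <- c(-Inf, 4.1, 7.9, 9.9, 10.7, Inf)
value.labels <- c("Very low", "Low", "Medium", "High", "Very high")

# Without labels: plain integer category codes.
numeric.result <- cut(scale.values, breaks = breaks, labels = FALSE)

# With labels: a factor whose levels are the supplied labels.
factor.result <- cut(scale.values, breaks = breaks, labels = value.labels)

str(numeric.result)
str(factor.result)
```

The number of labels must equal the number of cut points plus one, because four cut points define five ranges.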

Cutting continuous variables into discrete categorical using the GUI

To start the RALSA user interface, execute the following command in RStudio:

ralsaGUI()

For the examples that follow, merge a new file with PIRLS 2021 data for Australia and Slovenia (Slovenia, not Slovakia) taking all student variables. See how to merge data files here. You can name the merged file PIRLS_2021_ASG_merged.RData.

When done merging the data, select Data preparation > Cut variables from the menu on the left. In the Cut variables section of the GUI, click on the Choose data file button. Navigate to the folder containing the merged PIRLS_2021_ASG_merged.RData file, select it and click the Select button.

Once the file is loaded, you will see the two panels with the available variables and selected variables (the latter is currently empty):

Use the mouse to select individual variables and the single arrow buttons to move them from the list of available variables to the list of selected variables and vice versa. You can use the filter boxes at the top of the panels to find the needed variables quickly. Note that the selected variables must be continuous, not categorical. The variables also cannot be PVs. If either condition is not met, the GUI will not let you continue and warnings will be displayed. So, let’s cut the Students Like Reading (ASBGSLR) and the Home Resources for Learning (ASBGHRL) continuous scales into discrete categorical variables. The cut points must be within the range of values of each source variable. To check the ranges (minimum and maximum values) of the source variables, use the Data diagnostics functionality of RALSA (you can see how to do this here). Find the variables for the two scales in the list of available variables on the left (you can use the filter at the top) and move them to the list of selected variables. Once there are any variables in the Selected variables panel, the following will appear at the bottom of the screen:

As the note at the top states, each selected variable must have a new variable name under which the data will be saved. The variable labels are short descriptions of the variable content. All or none shall be specified. If not specified, the variable labels will be copied over from the source variables, with “Cut” added at the beginning. After the information in the boxes above is completed, the following elements will appear at the bottom of the page:

In the text box, enter the cut points that will define the new categories. For this example, let’s enter 4.1, 7.9, 9.9, and 10.7. Separate them with spaces; do not use commas or other separators. Pressing the Reset button will clear all entered values. After entering the values, the following elements will appear:

The table shows what the new categories will be. The new categories are defined as ranges according to the cut points, from below the lowest cut point to above the highest cut point. If no value labels are defined in the last column, the resulting new variables will be numeric (integers). If new value labels are defined, the resulting new variables will be categorical (factors). Enter the following labels from top to bottom: “Very low”, “Low”, “Medium”, “High”, and “Very high”. Press the Define the new output file name button, navigate to the folder where you want to save the new data file, define the file name and click on the Save button. In this case, we will use the same file name and location as the source file. This will overwrite the file, adding the new variables to it. The final screen should look like the image below:

Click on the Execute syntax button. The GUI console will appear at the bottom and will log all completed operations: