Table of contents
- Introduction
- The continuous variables cutting function and its arguments
- Cutting continuous variables into discrete categorical using the command line
- Cutting continuous variables into discrete categorical using the GUI
Introduction
Continuous variables often need to be cut into categorical ones along the ranges of their values. For example, some continuous scales in large-scale assessments and surveys can be converted into two, three or more categories depending on the cut points provided by the user. Examples are the complex background scales in TIMSS and PIRLS, which are also provided as “index” variables with fixed categories in the form of “from-to”. The lsa.cut.vars function cuts continuous variables into discrete ones using user-defined ranges. The resulting variables can be numeric or categorical (i.e. factors), depending on whether value labels for the new values are provided.
The continuous variables cutting function and its arguments
The lsa.cut.vars function has the following arguments:
- data.file – The file containing lsa.data object. Either this or data.object shall be specified, but not both. See details.
- data.object – The object in the memory containing lsa.data object. Either this or data.file shall be specified, but not both. See details.
- src.variables – Names of the variables to cut into categories. Accepts only continuous variables. No PV variables are accepted. See details.
- new.variables – The names of the new, cut variables to append to the dataset. See details.
- new.var.labels – Optional, vector of strings to add as variable labels for the new.variables. See details.
- cut.points – Vector of numeric values to cut the src.variables between. See details.
- value.labels – Optional, character vector of values to assign to the newly formed categorical discrete values in the new.variables. See details.
- out.file – Full path to the .RData file to be written. If missing, the original object will be overwritten in the memory. See examples.
Notes:
- Either data.file or data.object shall be provided as the source of data. If both are provided, the function will stop with an error message.
- The src.variables argument specifies the variables that shall be cut. Only continuous variables are accepted. Multiple src.variables can be passed; they will all be split at the same cut points (see below). PVs are not accepted.
- The new.variables argument is optional and specifies the names of the new discrete variables created from the src.variables. The sequence of the new.variables names is the same as that of the src.variables. If the new.variables argument is omitted, the function will create the names automatically, appending CUT at the end of the src.variables names, and store the discrete variable data under these names. If provided, the number of new.variables must be the same as the number of src.variables.
- The new.var.labels argument is optional. Regardless of whether new.variables are provided, if new.var.labels are provided, they will be assigned to the new.variables generated from the discretization. If neither new.variables nor new.var.labels are provided, the function will automatically generate new.variables (see above) and copy the variable labels from src.variables to the newly generated variables, adding Cut at the beginning. The argument takes a vector with the same number of elements as the number of variable names in src.variables.
- cut.points is a mandatory argument. It specifies the ranges (from-to) in the original variables to be cut into discrete categories. Multiple cut.points can be passed; the new values will be the ranges between them. For example, if the cut points 3.29309, 7.97028, 9.98618, and 10.99411 are passed, there will be five categories in the resulting discrete variables, as follows:
  - From lowest up to 3.29309;
  - From above 3.29309 up to 7.97028;
  - From above 7.97028 up to 9.98618;
  - From above 9.98618 up to 10.99411; and
  - From above 10.99411 to the highest value.
- The cut.points must be within the range of the src.variables. Otherwise the function will stop with an error.
- The value.labels argument is optional. If omitted, the values in the new discrete variables will be numeric (integers). If the data was exported with missing.to.NA = FALSE (i.e. user-defined missings are kept), the missing values will remain as they are. If the value.labels are provided, the new values will be converted to factor levels. If the data was exported with missing.to.NA = FALSE, the names of the missing values will be assigned to factor levels too. Either way, the missing values will remain missing values and will be handled properly by the analysis functions. If missing.to.NA = TRUE (i.e. the user-defined missing values are set to NA), the NA values will remain NA in the resulting discrete new.variables.
- If a full path to an .RData file is provided to out.file, the dataset will be written to that file. If not, the complemented data will remain in the memory.
- The function produces a lsa.data object in memory (if out.file is missing) or an .RData file containing a lsa.data object with the new discrete variables.
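To make the cut-point example in the notes concrete, here is a minimal sketch using base R's cut() function. It is an illustration only and is not taken from the RALSA source code. The score values are made up, and the assumption that each range includes its upper cut point follows from the "from above ... up to" wording in the notes.

```r
# Illustration only: base R's cut(), not the RALSA implementation.
cut.points <- c(3.29309, 7.97028, 9.98618, 10.99411)
scores <- c(2.5, 3.29309, 5.0, 8.1, 10.5, 12.0)  # made-up scale values

# Four cut points give five right-closed ranges:
# (-Inf, 3.29309], (3.29309, 7.97028], ..., (10.99411, Inf)
breaks <- c(-Inf, cut.points, Inf)

# Without labels the result is a vector of integer category codes.
codes <- cut(scores, breaks = breaks, labels = FALSE, right = TRUE)

# With labels the result is a factor with five levels.
labs <- c("Very low", "Low", "Medium", "High", "Very high")
categories <- cut(scores, breaks = breaks, labels = labs, right = TRUE)

print(codes)
print(categories)
```

The two calls mirror the behavior described for value.labels: omitting the labels yields integers, while providing them yields a factor.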
Cutting continuous variables into discrete categorical using the command line
In the examples that follow, we will first create a new merged data file (see how to merge files here) with student data from PIRLS 2021 (Australia and Slovenia), taking all variables from both file types:
lsa.merge.data(inp.folder = "C:/temp", file.types = list(acg = NULL, asg = NULL), ISO = c("aus", "svn"), out.file = "C:/temp/merged/PIRLS_20221_ASG_merged.RData")
Note that the selected variables must be continuous, not categorical, and cannot be PVs. If any of these conditions is not met, lsa.cut.vars will stop with an error message. So, let’s cut the Students Like Reading (ASBGSLR) and the Home Resources for Learning (ASBGHRL) continuous scales into discrete categorical variables. The cut points must be within the range of values of each source variable. To check the ranges (minimum and maximum values) of the source variables, use the lsa.data.diag function (you can see how to do this here).
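The range requirement can also be sketched in plain R. The helper below, check.cut.points, is hypothetical and not part of RALSA; it only illustrates the idea of comparing cut points against the observed minimum and maximum. The values are made up, and treating the boundaries as inclusive is an assumption.

```r
# Hypothetical helper (not part of RALSA): report whether every cut point
# lies inside the observed range of a numeric variable.
# Assumption: the minimum and maximum themselves count as "within range".
# Note: user-defined missing codes kept in the data (missing.to.NA = FALSE)
# are not handled here and would distort the range.
check.cut.points <- function(x, cut.points) {
  rng <- range(x, na.rm = TRUE)
  inside <- cut.points >= rng[1] & cut.points <= rng[2]
  list(range = rng, all.inside = all(inside), outside = cut.points[!inside])
}

scale.values <- c(3.2, 6.8, 8.4, 9.9, 11.3, 13.1, NA)  # made-up values

# All four cut points fall between 3.2 and 13.1.
res <- check.cut.points(scale.values, c(4.1, 7.9, 9.9, 10.7))
print(res$all.inside)

# A cut point of 2.0 is below the minimum and is reported as outside.
bad <- check.cut.points(scale.values, c(2.0, 7.9))
print(bad$outside)
```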
Now that the data from these two countries is merged, we will cut the PIRLS 2021 Students Like Reading (ASBGSLR) and Home Resources for Learning (ASBGHRL) scales. The variables for these two scales in the database are continuous. Multiple variables can be cut into discrete ones at the same time, as in this example. While cutting them, we will assign value labels for the discrete categories, as well as variable labels. Note that, as explained in the previous section, if the value labels are omitted, the resulting variables will contain numeric (integer) values. If they are provided, the variables will be categorical (factors). If variable labels are provided, they will be assigned as descriptive labels for the new variables. If omitted, the variable labels will be copied over from the source variables, with “Cut” added at the very front. The syntax below provides all these details. The source data file is overwritten with the data containing the two discretized variables.
lsa.cut.vars(data.file = "C:/temp/merged/PIRLS_20221_ASG_merged.RData", src.variables = c("ASBGSLR", "ASBGHRL"), new.variables = c("ASBGSLRREC", "ASBGHRLREC"), new.var.labels = c("Categorical like reading", "Categorical learning resources"), cut.points = c(4.1, 7.9, 9.9, 10.7), value.labels = c("Very low", "Low", "Medium", "High", "Very high"), out.file = "C:/temp/merged/PIRLS_20221_ASG_merged.RData")
The call to this syntax will return the following output in the console:
If the data was exported with missing.to.NA = FALSE (i.e. user-defined missings are kept), the codes for the missing values will remain as they are and will be marked as such, so that any other functions (data preparation or analysis) will treat them as missing values when using the data.
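As a variation on the call above, the sketch below omits the optional new.variables, new.var.labels and value.labels arguments. According to the notes in the previous section, the new variables should then be named automatically and contain integer values. Only arguments described in this document are used. The output file name is hypothetical, and the call is guarded so that it runs only when RALSA is installed and the merged file exists.

```r
# Paths follow the example above; the output file name is hypothetical.
in.file <- "C:/temp/merged/PIRLS_20221_ASG_merged.RData"
out.file <- "C:/temp/merged/PIRLS_ASG_cut_numeric.RData"

# Run only when RALSA is installed and the merged file is available.
can.run <- requireNamespace("RALSA", quietly = TRUE) && file.exists(in.file)

if (can.run) {
  library(RALSA)
  # No new.variables, new.var.labels or value.labels: per the notes,
  # names are generated automatically and the values stay numeric.
  lsa.cut.vars(
    data.file = in.file,
    src.variables = c("ASBGSLR", "ASBGHRL"),
    cut.points = c(4.1, 7.9, 9.9, 10.7),
    out.file = out.file
  )
}
```

Writing to a separate file keeps the labeled version from the main example intact.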
Cutting continuous variables into discrete categorical using the GUI
To start the RALSA user interface, execute the following command in RStudio:
ralsaGUI()
For the examples that follow, merge a new file with PIRLS 2021 data for Australia and Slovenia (Slovenia, not Slovakia) taking all student variables. See how to merge data files here. You can name the merged file PIRLS_2021_ASG_merged.RData.
When done merging the data, select Data preparation > Cut variables from the menu on the left. In the Cut variables section of the GUI, click on the Choose data file button. Navigate to the folder containing the merged PIRLS_2021_ASG_merged.RData file, select it and click the Select button.
Once the file is loaded, you will see the two panels with the available variables and selected variables (the latter is currently empty):
Use the mouse to select individual variables and the single arrow buttons to move them from the list of available variables to the list of selected variables and vice versa. You can use the filter boxes at the top of the panels to find the needed variables quickly. Note that the selected variables must be continuous, not categorical, and cannot be PVs. If any of these conditions is not met, the GUI will not let you continue and warnings will be displayed.
So, let’s cut the Students Like Reading (ASBGSLR) and the Home Resources for Learning (ASBGHRL) continuous scales into discrete categorical variables. The cut points must be within the range of values of each source variable. To check the ranges (minimum and maximum values) of the source variables, use the Data diagnostics functionality of RALSA (you can see how to do this here). Find the variables for the two scales in the list of available variables on the left (you can use the filter at the top) and move them to the list of selected variables. Once there are any variables in the Selected variables panel, the following will appear at the bottom of the screen:
As the note on top states, each selected variable must have a new variable name under which the data will be saved. The variable labels are short descriptions of the variable content. Either all or none of them shall be specified. If not specified, the variable labels will be copied over from the source variables, with “Cut” added at the beginning. After the information in the boxes above is completed, the following elements will appear at the bottom of the page:
In the text box, enter the cut points that will define the new categories. For this example, let’s enter 4.1, 7.9, 9.9, and 10.7. Separate them with spaces only, without commas or other separators. Pressing the Reset button will clear all entered values. After entering the values, the following elements will appear:
The table shows what the new categories will be. The new categories are defined as ranges, according to the cut points, from below the lowest cut point to above the highest cut point. If no value labels are defined in the last column, the resulting new variables will be numeric (integers). If new value labels are defined, the resulting new variables will be categorical (factors). Enter the following labels from top to bottom: “Very low”, “Low”, “Medium”, “High”, and “Very high”. Press the Define the new output file name button, navigate to the folder where you want to save the new data file, define the file name and click on the Save button. In this case, we will define the same file name as the source file in the same location. This will overwrite the file, adding the new variables to it. The final screen should look like the image below:
Click on the Execute syntax button. The GUI console will appear at the bottom and will log all completed operations:
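For readers who want to relate the GUI inputs to the command-line arguments, the following sketch shows how the space-separated cut points entered in the text box correspond to a numeric vector, and how a category table like the one described above can be laid out. This is plain R written for illustration, not the GUI's actual code; the object and column names are hypothetical.

```r
# Illustration only: not the GUI's implementation.
# Cut points as typed in the GUI text box, separated by spaces.
cut.points.text <- "4.1 7.9 9.9 10.7"
cut.points <- as.numeric(strsplit(trimws(cut.points.text), "\\s+")[[1]])

value.labels <- c("Very low", "Low", "Medium", "High", "Very high")

# One row per new category: n cut points define n + 1 ranges.
category.table <- data.frame(
  value = seq_len(length(cut.points) + 1),
  from = c("lowest", paste("above", cut.points)),
  to = c(paste("up to", cut.points), "highest"),
  label = value.labels,
  stringsAsFactors = FALSE
)

print(category.table)
```

The resulting cut.points vector and value.labels are the same values passed to lsa.cut.vars in the command-line example.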