What is RALSA?
To understand what RALSA is, we first need to know what large-scale assessments and surveys are. If you are already familiar with the methodology of large-scale assessments, please feel free to scroll down to the bottom of this page. Data from small-scale research are usually analyzed using plain statistical procedures, as we know them from basic statistics. If, say, we want to compute an average, a correlation, or a linear regression, we can simply load the data into statistical software such as SPSS, R, SAS, or Minitab and use the built-in procedures. Large-scale assessments and surveys do not aim to compute sample statistics, but to estimate population parameters. That is, we are not interested in a statistic for the sample itself, but in using the sampled elements to make inferences about the entire population. While we can hardly collect data for the entire target population of a country, we can draw a sample and use its data to estimate the population parameters.
The usual statistical procedures would require drawing a simple or systematic random sample from the target population. Empirical studies have shown that a sample of about 400 elements can reliably represent the target population. We can even assign a weight to each sample element, so that it represents not just itself, but also all similar elements in the population. This procedure, however, has drawbacks.
First, random sampling assumes that all elements have the same probability of being selected. This may not always be true: the actual selection probability may depend on many factors, including, but not limited to, groups within the population with similar characteristics, remote groups within the population, and the sizes of these groups.
Second, the sampling weights are computed as the inverse probability of being selected. However, when the selection probability is assumed to be equal, the weights for all elements will be equal too.
Third, this procedure would not allow us to oversample certain groups within the target population we may have a special interest in.
Fourth, it will require an exhaustive and precise list of all elements in the population. Such lists rarely exist.
Fifth, this may not be a practical and cost-efficient option. The elements of a target population that a survey or assessment is interested in are usually numerous and scattered across the entire country. If we sample 400 elements from a population of, say, 4,000,000 people, we may easily spend a large amount of resources to reach each one of them and organize a one-to-one session to ask our questions. Such a procedure would be quite time-consuming as well. It is cheaper to travel to 10 schools and test 40 students in each than to travel to 400 schools.
Sixth, surveys and assessments are interested not just in the characteristics of the target population, but also in their context. Large-scale assessments and surveys are often interested in educational outcomes (e.g. student achievement) which are connected with the educational context. This context includes school, teacher, and family background, as well as processes and educational practices. Thus, it is highly desirable to link student achievement to its educational context. With a simple or systematic random sample we can easily end up with one sampled student per classroom and school. This would make it impossible to link the outcome with the context (school and teacher) and to validly estimate the relationship between achievement and important contextual factors.
Large-scale assessments and surveys use a different sampling strategy to overcome all these issues: multistage stratified cluster sampling with probability proportional to the size of the primary sampling units. First, an exhaustive list of all primary sampling units (PSUs) is obtained. PSUs are the entities where the elements to be sampled are located, usually schools. The list is sorted in descending order by size (i.e. the number of elements in each PSU). A sampling interval reflecting the desired sample size is defined, and an initial starting number within this interval is selected at random. The first PSU is the n-th school among the first N schools within the sampling interval, where n is the initial starting number. All remaining schools are then selected by applying the sampling interval repeatedly, starting from the first selected school, until the end of the list of PSUs is reached. Implicit and/or explicit stratification by certain demographic characteristics is applied in the process. This is the first stage of sampling. Remember that the list of schools is first sorted in descending order by the number of students, and only then are the schools selected. It may not be immediately obvious, but this type of sampling gives larger schools a higher probability of being selected and smaller schools a lower one.
The second stage is selecting elements from each PSU. When schools are sampled, these elements will be students (as in TIMSS, PIRLS, PISA, ICCS, ICILS) or teachers (as in SITES or TALIS). If cluster sampling is applied, one or two intact classes from the target grade are picked at random and all their students are included in the sample. This is, for example, the procedure in TIMSS, PIRLS and ICCS.
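The first-stage selection can be illustrated with a minimal Python sketch. It uses a common textbook variant of systematic sampling with probability proportional to size (cumulated sizes, a random start, and a fixed interval), which is an assumption and not necessarily the exact procedure of any particular study. Stratification is omitted, and the function name and the list of schools are hypothetical.

```python
import random
from itertools import accumulate


def pps_systematic_sample(frame, n_sample, seed=None):
    """Pick n_sample PSUs with probability proportional to size.

    frame: list of (psu_id, size) pairs, e.g. schools and student counts.
    Returns a list of (psu_id, selection_probability) pairs.
    Assumes no PSU is larger than the sampling interval.
    """
    rng = random.Random(seed)
    # Sort the list in descending order by size.
    ordered = sorted(frame, key=lambda psu: psu[1], reverse=True)
    total = sum(size for _, size in ordered)
    interval = total / n_sample        # sampling interval
    start = rng.random() * interval    # random start within the interval
    cumulated = list(accumulate(size for _, size in ordered))

    selected = []
    i = 0
    for k in range(n_sample):
        target = start + k * interval
        # Move to the PSU whose cumulated size range contains the target.
        while i < len(ordered) - 1 and cumulated[i] <= target:
            i += 1
        psu_id, size = ordered[i]
        selected.append((psu_id, size / interval))
    return selected


# Hypothetical frame: 20 schools with 40, 50, ..., 230 students.
schools = [(f"S{i:02d}", 40 + 10 * i) for i in range(20)]
sample = pps_systematic_sample(schools, n_sample=5, seed=1)
for psu_id, prob in sample:
    # The first-stage weight is the inverse of the selection probability.
    print(psu_id, round(prob, 3), round(1 / prob, 2))
```

Because a larger school covers a wider range of the cumulated sizes, the fixed interval is more likely to land inside it, which is why its selection probability is higher and its weight lower.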
PISA and ICILS, on the other hand, do not use cluster sampling of students. Instead, a fixed number of students (15 or 20) from the target grade are sampled at random, regardless of which class they belong to. In addition, teachers are sampled as well. When cluster sampling is applied, the teachers teaching the sampled classes are sampled too and linked to their students (ICCS is an exception: target grade teachers are sampled randomly within the schools, regardless of their subject). All of this allows answering questions about the relationship between certain student characteristics and the characteristics of their teachers and schools, which would not be possible with simple random sampling.
All of the above is called a “complex sampling design”, and it brings a number of issues for analysis. Even though most statistical software can apply survey weights, adding a weight variable to the calculations does not solve the problem: the weights will be treated as if they stem from simple or systematic random sampling. That is, assuming that each element in the sample had an equal chance of being sampled contradicts the actual sampling with unequal probabilities and, because weights are the inverse of the selection probability, the resulting estimates will be biased. Thus, Jackknife Repeated Replication (JRR) or Balanced Repeated Replication (BRR) techniques are required and used in large-scale assessments. Their implementation varies greatly across studies and, although some generic software products can work with complex samples, they cannot readily handle these issues out of the box to estimate the sampling error of the estimates. In a nutshell, JRR and BRR replicate the weights multiple times, each time adjusting the weights for a pair of schools and recomputing the statistics, and at the end combine the replicate estimates using complex formulas to obtain the final estimate and its standard error.
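The replication idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: schools are paired into zones, there is one replicate per zone in which one school's weights are doubled and the other's set to zero, and the sampling variance is the plain sum of squared differences from the full-sample estimate. Actual studies differ in the number of replicates and in the exact formula, and all names and data below are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def replicate_weight(weight, zone, dropped, current_zone):
    """Weight of one record in the replicate built for current_zone."""
    if zone != current_zone:
        return weight               # other zones keep the full weight
    return 0.0 if dropped else 2.0 * weight


def jrr_mean_and_se(values, weights, zones, dropped):
    """Full-sample weighted mean and its jackknife standard error.

    zones[i]:   zone (pair of schools) the i-th record belongs to.
    dropped[i]: True if the record's school is zeroed in its zone's
                replicate, False if its weight is doubled.
    """
    full = weighted_mean(values, weights)
    sampling_variance = 0.0
    for current_zone in sorted(set(zones)):
        rep_weights = [
            replicate_weight(w, z, d, current_zone)
            for w, z, d in zip(weights, zones, dropped)
        ]
        replicate = weighted_mean(values, rep_weights)
        sampling_variance += (replicate - full) ** 2
    return full, sampling_variance ** 0.5


# Hypothetical data: 2 zones, 2 schools per zone, 2 students per school.
scores = [500, 520, 480, 460, 540, 560, 500, 520]
weights = [10.0] * 8
zones = [1, 1, 1, 1, 2, 2, 2, 2]
dropped = [False, False, True, True, False, False, True, True]

estimate, standard_error = jrr_mean_and_se(scores, weights, zones, dropped)
print(estimate, standard_error)
```

The point of the sketch is that the standard error comes from how much the estimate moves when whole schools are dropped, not from the variation between individual students.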
Another complication is introduced when an assessment component is present, i.e. in large-scale assessments like TIMSS, PIRLS and PISA. The usual testing situation involves administering a test to the respondents and assigning a single test score to each tested person by means of Classical Test Theory (CTT) or Item Response Theory (IRT). This approach is commonly used in exams with a specific purpose and a limited set of items. Large-scale assessments cover broad content and cognitive domains (e.g. mathematics and science), and a large number of items is needed to assess the outcomes reliably. For example, TIMSS 2015 in grade 8 employs 297 items in mathematics and 305 items in science. Clearly, no single student can provide a valid answer to all of them. Thus, large-scale assessments employ the so-called Balanced Incomplete Block (BIB) design, where small blocks of items are distributed across a number of booklets and the link between the booklets is maintained through common blocks. Every tested student takes only one booklet with a limited number of items. This is called a “complex assessment design”.
IRT is very useful for scaling data from a BIB design through what is colloquially called “vertical scaling”. Various estimation procedures can produce reliable scores. There is just one issue with these procedures: they are appropriate for producing scores to be used at the individual level, e.g. in exams. When it comes to estimation at the group level, however, the scores produced this way have been shown to be biased. The solution is the so-called “plausible values methodology” where, simply put, the answers to items which a student did not take are treated as missing values. The plausible values (PVs) methodology is too complex to be presented in its entirety here, so only a short summary follows. It is an extension of IRT in which the item parameters are estimated first.
Then the achievement items and the principal components derived from the background/contextual data, which explain 90% of its variance, are pooled together and the scores are estimated applying the obtained IRT parameters. Because background data are used, there are many groups in which students cluster together due to similarity in their characteristics. Each of these groups has its own distribution of values. There is a lot of uncertainty about this distribution, though. The process ends with producing individual scores by making several (five or ten) random draws from the distribution of the group to which the student belongs. That is, more than one score per tested student is produced in order to later estimate the measurement variance. It is also called “imputation variance” because the estimation of the scores using the PV methodology follows multiple imputation techniques. As a consequence, each analysis involving test scores has to be repeated as many times as there are PVs, and the results are then averaged. The computation of the measurement error involves a complex formula following the so-called “Rubin’s rules” for aggregating the results from multiply imputed data sets. The actual formula differs across studies, depending on the actual implementation.
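As a rough illustration of the averaging step, the Python sketch below combines one estimate per plausible value using the textbook form of Rubin's rules: the final estimate is the mean of the PV estimates, and the imputation variance is their sample variance multiplied by (1 + 1/M), where M is the number of PVs. As noted above, the formula used by a given study may differ; the function name and the numbers are hypothetical.

```python
from statistics import mean, variance


def combine_plausible_values(pv_estimates):
    """Average per-PV estimates and compute the imputation variance.

    pv_estimates: one estimate (e.g. a mean) per plausible value.
    Returns (final_estimate, imputation_variance).
    """
    m = len(pv_estimates)
    final_estimate = mean(pv_estimates)
    # statistics.variance uses the divisor (m - 1).
    imputation_variance = (1 + 1 / m) * variance(pv_estimates)
    return final_estimate, imputation_variance


# Hypothetical country means, one per plausible value.
pv_means = [508.0, 510.0, 512.0, 509.0, 511.0]
final_estimate, imputation_variance = combine_plausible_values(pv_means)
print(final_estimate, imputation_variance)
```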
When both complex sampling and assessment designs are used, an estimate is computed with each replicate (JRR or BRR) weight and each PV. To give you an idea of what this means, in TIMSS each estimate (percentage, mean, regression coefficient, etc.) is computed 755 times (with the full weight and 150 replicate weights, for each of the five PVs: 151×5=755); the estimates are then averaged and the total error (sampling and measurement) is computed. If the estimate is computed within different groups in the population, this procedure is repeated for each group separately.
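The count of 755 computations can be reproduced with a short Python sketch that loops over every combination of weight and plausible value. The data are synthetic (the replicate weights are random perturbations, not real jackknife weights), the `scale` factor applied to the sum of squared differences is a hypothetical parameter because it depends on the study, and the way the two variance components are added follows the textbook approach and is not the exact formula of any particular study.

```python
import random
from statistics import mean, variance


def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def estimate_with_total_error(pvs, full_weights, replicate_weights, scale=1.0):
    """Estimate a mean with every weight and every PV.

    Returns (estimate, total_standard_error, number_of_computations).
    """
    n_computations = 0
    pv_estimates, pv_sampling_variances = [], []
    for scores in pvs:
        full = weighted_mean(scores, full_weights)
        n_computations += 1
        squared_diffs = 0.0
        for rep_weights in replicate_weights:
            replicate = weighted_mean(scores, rep_weights)
            n_computations += 1
            squared_diffs += (replicate - full) ** 2
        pv_estimates.append(full)
        pv_sampling_variances.append(scale * squared_diffs)

    m = len(pv_estimates)
    sampling_variance = mean(pv_sampling_variances)
    imputation_variance = (1 + 1 / m) * variance(pv_estimates)
    total_se = (sampling_variance + imputation_variance) ** 0.5
    return mean(pv_estimates), total_se, n_computations


# Synthetic data: 200 students, 150 replicate weights, 5 plausible values.
rng = random.Random(0)
n_students = 200
full_weights = [rng.uniform(5, 15) for _ in range(n_students)]
replicate_weights = [
    [w * rng.choice((0, 1, 2)) for w in full_weights] for _ in range(150)
]
pvs = [[rng.gauss(500, 100) for _ in range(n_students)] for _ in range(5)]

estimate, total_se, n_computations = estimate_with_total_error(
    pvs, full_weights, replicate_weights
)
print(n_computations)
```

For an analysis by groups, the same function would simply be called once per group on that group's records.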
Those who are interested in the details of the methodology of large-scale assessments and surveys can have a look at the following publications:
Foy, P., & Yin, L. (2017). Scaling the PIRLS 2016 Achievement Data. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and Procedures in PIRLS 2016 (pp. 12.1–12.38). TIMSS & PIRLS International Study Center.
Gebhardt, E., & Schulz, W. (2018). Scaling Procedures for ICCS Test Items. In W. Schulz, B. Losito, R. Carstens, & J. Fraillon (Eds.), ICCS 2016 Technical Report (pp. 117–138). International Association for the Evaluation of Educational Achievement.
Meinck, S. (2015). Sampling Design and Implementation. In J. Fraillon, W. Schulz, T. Friedman, J. Ainley, & E. Gebhardt (Eds.), ICILS 2013 Technical Report (pp. 65–86). International Association for the Evaluation of Educational Achievement.
Rust, K. (2014). Sampling, Weighting, and Variance Estimation in International Large-Scale Assessments. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of International Large-Scale Assessments: Background, Technical Issues, and Methods of Data Analysis (pp. 117–154). CRC Press.
For an overview of the issues in using large-scale assessment data see the following publication:
Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International Large-Scale Assessment Data: Issues in Secondary Analysis and Reporting. Educational Researcher, 39(2), 142–151.
For additional information on large-scale assessments, visit the ILSA-gateway.
From the presentation above it is clear that the complex sampling and assessment designs of a study require special care and software to analyze the data. This software needs to handle these issues properly, depending on the actual implementation of the methodology. And this is exactly where RALSA comes in.
The R Analyzer for Large-Scale Assessments (RALSA) is an R package for preparation and analysis of data from large-scale assessments and surveys which use complex sampling and assessment designs. The software can handle the design issues and apply the appropriate analytical methods automatically for every type of study, out of the box and without user intervention. RALSA is free and open-source software licensed under GPL v2.0, and it is cross-platform: it works on any system which can run a full installation of R. Currently, RALSA supports a number of studies with different designs and a number of analysis types (see below). Both will increase in the future.
In addition to the traditional command-line R interface, RALSA has a Graphical User Interface for users who lack the technical skills.
Currently, RALSA supports the following functionality:
- Prepare data for analysis
  - Convert data (SPSS, or text in case of PISA prior to 2015)
  - Merge study data files from different countries and/or respondents
  - View variable properties (name, class, variable label, response categories/unique values, user-defined missing values)
  - Recode variables
- Perform analyses (more analysis types will be added in the future)
  - Percentages of respondents in certain groups and averages on variables of interest, per group
  - Percentiles of variables within groups of respondents
  - Percentages of respondents reaching or surpassing benchmarks of achievement
  - Correlations (Pearson or Spearman)
  - Linear regression
  - Binary logistic regression
All data preparation and analysis functions automatically recognize the study design and apply the appropriate techniques to handle the complex sampling and assessment design issues, while giving the user freedom to tweak the analysis (e.g. change the default weight, apply the “shortcut” method in TIMSS and PIRLS, and so on).
Currently, RALSA can work with data for all cycles of the following studies (more will be added in the future):
- PIRLS (including PIRLS Literacy and ePIRLS)
- TIMSS (including TIMSS Numeracy and eTIMSS)
- TiPi (TIMSS and PIRLS joint study)
- TIMSS Advanced
- PISA for Development
- TALIS Starting Strong Survey (a.k.a. TALIS 3S)
For questions, feature requests, training requests, and bug reports, please write to firstname.lastname@example.org.