The SCED is a very flexible methodology and has many variants. Those mentioned here are the building blocks from which other designs are then derived. For those readers interested in the nuances of each design, Barlow et al., (2008); Franklin, Allison, and Gorman (1997); Kazdin (2010); and Kratochwill and Levin (1992), among others, provide cogent, in-depth discussions. Identifying the appropriate SCED depends upon many factors, including the specifics of the IV, the setting in which the study will be conducted, participant characteristics, the desired or hypothesized outcomes, and the research question(s). Similarly, the researcher’s selection of measurement and analysis techniques is determined by these factors.
Predominant Single-Case Experimental Designs
Alternating/simultaneous designs (6%; primary design of the studies reviewed)
Alternating and simultaneous designs involve an iterative manipulation of the IV(s) across different phases to show that changes in the DV vary systematically as a function of manipulating the IV(s). In these multielement designs, the researcher has the option to alternate the introduction of two or more IVs or present two or more IVs at the same time. In the alternating variation, the researcher is able to determine the relative impact of two different IVs on the DV, when all other conditions are held constant. Another variation of this design is to alternate IVs across various conditions that could be related to the DV (e.g., class period, interventionist). Similarly, the simultaneous design would occur when the IVs were presented at the same time within the same phase of the study.
Changing criterion design (4%)
Changing criterion designs are used to demonstrate a gradual change in the DV over the course of the phase involving the active manipulation of the IV. Criteria indicating that a change has occurred happen in a step-wise manner, in which the criterion shifts as the participant responds to the presence of the manipulated IV. The changing criterion design is particularly useful in applied intervention research for a number of reasons. The IV is continuous and never withdrawn, unlike the strategy used in a reversal design. This is particularly important in situations where removal of a psychological intervention would be either detrimental or dangerous to the participant, or would be otherwise unfeasible or unethical. The multiple baseline design also does not withdraw intervention, but it requires replicating the effects of the intervention across participants, settings, or situations. A changing criterion design can be accomplished with one participant in one setting without withholding or withdrawing treatment.
Multiple baseline/combined series design (69%)
The multiple baseline or combined series design can be used to test within-subject change across conditions and often involves multiple participants in a replication context. The multiple baseline design is quite simple in many ways, essentially consisting of a number of repeated, miniature AB experiments or variations thereof. Introduction of the IV is staggered temporally across multiple participants or across multiple within-subject conditions, which allows the researcher to demonstrate that changes in the DV reliably occur only when the IV is introduced, thus controlling for the effects of extraneous factors. Multiple baseline designs can be used both within and across units (i.e., persons or groups of persons). When the baseline phase of each subject begins simultaneously, it is called a concurrent multiple baseline design. In a nonconcurrent variation, baseline periods across subjects begin at different points in time. The multiple baseline design is useful in many settings in which withdrawal of the IV would not be appropriate or when introduction of the IV is hypothesized to result in permanent change that would not reverse when the IV is withdrawn. The major drawback of this design is that the IV must be initially withheld for a period of time to ensure different starting points across the different units in the baseline phase. Depending upon the nature of the research questions, withholding an IV, such as a treatment, could be potentially detrimental to participants.
Reversal designs (17%)
Reversal designs are also known as introduction and withdrawal and are denoted as ABAB designs in their simplest form. As the name suggests, the reversal design involves collecting a baseline measure of the DV (the first A phase), introducing the IV (the first B phase), removing the IV while continuing to assess the DV (the second A phase), and then reintroducing the IV (the second B phase). This pattern can be repeated as many times as is necessary to demonstrate an effect or otherwise address the research question. Reversal designs are useful when the manipulation is hypothesized to result in changes in the DV that are expected to reverse or discontinue when the manipulation is not present. Maintenance of an effect is often necessary to uphold the findings of reversal designs. The demonstration of an effect is evident in reversal designs when improvement occurs during the first manipulation phase, compared to the first baseline phase, then reverts to or approaches original baseline levels during the second baseline phase when the manipulation has been withdrawn, and then improves again when the manipulation in then reinstated. This pattern of reversal, when the manipulation is introduced and then withdrawn, is essential to attributing changes in the DV to the IV. However, maintenance of the effects in a reversal design, in which the DV is hypothesized to reverse when the IV is withdrawn, is not incompatible (Kazdin, 2010). Maintenance is demonstrated by repeating introduction–withdrawal segments until improvement in the DV becomes permanent even when the IV is withdrawn. There is not always a need to demonstrate maintenance in all applications, nor is it always possible or desirable, but it is paramount in the learning and intervention research contexts.
Mixed designs (10%)
Mixed designs include a combination of more than one SCED (e.g., a reversal design embedded within a multiple baseline) or an SCED embedded within a group design (i.e., a randomized controlled trial comparing two groups of multiple baseline experiments). Mixed designs afford the researcher even greater flexibility in designing a study to address complex psychological hypotheses, but also capitalize on the strengths of the various designs. See Kazdin (2010) for a discussion of the variations and utility of mixed designs.
Related Nonexperimental Designs
In contrast to the designs previously described, all of which constitute “true experiments” (Kazdin, 2010; Shadish et al., 2002), in quasi-experimental designs the conditions of a true experiment (e.g., active manipulation of the IV, replication of the effect) are approximated and are not readily under the control of the researcher. Because the focus of this article is on experimental designs, quasi-experiments are not discussed in detail; instead the reader is referred to Kazdin (2010) and Shadish et al. (2002).
Ecological and naturalistic single-case designs
For a single-case design to be experimental, there must be active manipulation of the IV, but in some applications, such as those that might be used in social and personality psychology, the researcher might be interested in measuring naturally occurring phenomena and examining their temporal relationships. Thus, the researcher will not use a manipulation. An example of this type of research might be a study about the temporal relationship between alcohol consumption and depressed mood, which can be measured reliably using EMA methods. Psychotherapy process researchers also use this type of design to assess dyadic relationship dynamics between therapists and clients (e.g., Tschacher & Ramseyer, 2009).
Research Design Standards
Each of the reviewed standards provides some degree of direction regarding acceptable research designs. The WWC provides the most detailed and specific requirements regarding design characteristics. Those guidelines presented in Tables 3, 4, and 5 are consistent with the methodological rigor necessary to meet the WWC distinction “meets standards.” The WWC also provides less-stringent standards for a “meets standards with reservations” distinction. When minimum criteria in the design, measurement, or analysis sections of a study are not met, it is rated “does not meet standards” (Kratochwill et al., 2010). Many SCEDs are acceptable within the standards of DIV12, DIV16, NRP, and in the Tate et al. SCED scale. DIV12 specifies that replication occurs across a minimum of three successive cases, which differs from the WWC specifications, which allow for three replications within a single-subject design but does not necessarily need to be across multiple subjects. DIV16 does not require, but seems to prefer, a multiple baseline design with a between-subject replication. Tate et al. state that the “design allows for the examination of cause and effect relationships to demonstrate efficacy” (p. 400, 2008). Determining whether or not a design meets this requirement is left up to the evaluator, who might then refer to one of the other standards or another source for direction.
Research Design Standards and Guidelines
Measurement and Assessment Standards and Guidelines
Analysis Standards and Guidelines
The Stone and Shiffman (2002) standards for EMA are concerned almost entirely with the reporting of measurement characteristics and less so with research design. One way in which these standards differ from those of other sources is in the active manipulation of the IV. Many research questions in EMA, daily diary, and time-series designs are concerned with naturally occurring phenomena, and a researcher manipulation would run counter to this aim. The EMA standards become important when selecting an appropriate measurement strategy within the SCED. In EMA applications, as is also true in some other time-series and daily diary designs, researcher manipulation occurs as a function of the sampling interval in which DVs of interest are measured according to fixed time schedules (e.g., reporting occurs at the end of each day), random time schedules (e.g., the data collection device prompts the participant to respond at random intervals throughout the day), or on an event-based schedule (e.g., reporting occurs after a specified event takes place).
The basic measurement requirement of the SCED is a repeated assessment of the DV across each phase of the design in order to draw valid inferences regarding the effect of the IV on the DV. In other applications, such as those used by personality and social psychology researchers to study various human phenomena (Bolger et al., 2003; Reis & Gable, 2000), sampling strategies vary widely depending on the topic area under investigation. Regardless of the research area, SCEDs are most typically concerned with within-person change and processes and involve a time-based strategy, most commonly to assess global daily averages or peak daily levels of the DV. Many sampling strategies, such as time-series, in which reporting occurs at uniform intervals or on event-based, fixed, or variable schedules, are also appropriate measurement methods and are common in psychological research (see Bolger et al., 2003).
Repeated-measurement methods permit the natural, even spontaneous, reporting of information (Reis, 1994), which reduces the biases of retrospection by minimizing the amount of time elapsed between an experience and the account of this experience (Bolger et al., 2003). Shiffman et al. (2008) aptly noted that the majority of research in the field of psychology relies heavily on retrospective assessment measures, even though retrospective reports have been found to be susceptible to state-congruent recall (e.g., Bower, 1981) and a tendency to report peak levels of the experience instead of giving credence to temporal fluctuations (Redelmeier & Kahneman, 1996; Stone, Broderick, Kaell, Deles-Paul, & Porter, 2000). Furthermore, Shiffman et al. (1997) demonstrated that subjective aggregate accounts were a poor fit to daily reported experiences, which can be attributed to reductions in measurement error resulting in increased validity and reliability of the daily reports.
The necessity of measuring at least one DV repeatedly means that the selected assessment method, instrument, and/or construct must be sensitive to change over time and be capable of reliably and validly capturing change. Horner et al. (2005) discusses the important features of outcome measures selected for use in these types of designs. Kazdin (2010) suggests that measures be dimensional, which can more readily detect effects than categorical and binary measures. Although using an established measure or scale, such as the Outcome Questionnaire System (M. J. Lambert, Hansen, & Harmon, 2010), provides empirically validated items for assessing various outcomes, most measure validation studies conducted on this type of instrument involve between-subject designs, which is no guarantee that these measures are reliable and valid for assessing within-person variability. Borsboom, Mellenbergh, and van Heerden (2003) suggest that researchers adapting validated measures should consider whether the items they propose using have a factor structure within subjects similar to that obtained between subjects. This is one of the reasons that SCEDs often use observational assessments from multiple sources and report the interrater reliability of the measure. Self-report measures are acceptable practice in some circles, but generally additional assessment methods or informants are necessary to uphold the highest methodological standards. The results of this review indicate that the majority of studies include observational measurement (76.0%). Within those studies, nearly all (97.1%) reported interrater reliability procedures and results. The results within each design were similar, with the exception of time-series designs, which used observer ratings in only half of the reviewed studies.
Time-series designs are defined by repeated measurement of variables of interest over a period of time (Box & Jenkins, 1970). Time-series measurement most often occurs in uniform intervals; however, this is no longer a constraint of time-series designs (see Harvey, 2001). Although uniform interval reporting is not necessary in SCED research, repeated measures often occur at uniform intervals, such as once each day or each week, which constitutes a time-series design. The time-series design has been used in various basic science applications (Scollon, Kim-Pietro, & Diener, 2003) across nearly all subspecialties in psychology (e.g., Bolger et al., 2003; Piasecki et al., 2007; for a review, see Reis & Gable, 2000; Soliday et al., 2002). The basic time-series formula for a two-phase (AB) data stream is presented in Equation 1. In this formula α represents the step function of the data stream; S represents the change between the first and second phases, which is also the intercept in a two-phase data stream and a step function being 0 at times i = 1, 2, 3…n1 and 1 at times i = n1+1, n1+2, n1+3…n; n1 is the number of observations in the baseline phase; n is the total number of data points in the data stream; i represents time; and εi = ρεi−1 + ei, which indicates the relationship between the autoregressive function (ρ) and the distribution of the data in the stream.
Time-series formulas become increasingly complex when seasonality and autoregressive processes are modeled in the analytic procedures, but these are rarely of concern for short time-series data streams in SCEDs. For a detailed description of other time-series design and analysis issues, see Borckardt et al. (2008), Box and Jenkins (1970), Crosbie (1993), R. R. Jones et al. (1977), and Velicer and Fava (2003).
Time-series and other repeated-measures methodologies also enable examination of temporal effects. Borckardt et al. (2008) and others have noted that time-series designs have the potential to reveal how change occurs, not simply if it occurs. This distinction is what most interested Skinner (1938), but it often falls below the purview of today’s researchers in favor of group designs, which Skinner felt obscured the process of change. In intervention and psychopathology research, time-series designs can assess mediators of change (Doss & Atkins, 2006), treatment processes (Stout, 2007; Tschacher & Ramseyer, 2009), and the relationship between psychological symptoms (e.g., Alloy, Just, & Panzarella, 1997; Hanson & Chen, 2010; Oslin, Cary, Slaymaker, Colleran, & Blow, 2009), and might be capable of revealing mechanisms of change (Kazdin, 2007, 2009, 2010). Between- and within-subject SCED designs with repeated measurements enable researchers to examine similarities and differences in the course of change, both during and as a result of manipulating an IV. Temporal effects have been largely overlooked in many areas of psychological science (Bolger et al., 2003): Examining temporal relationships is sorely needed to further our understanding of the etiology and amplification of numerous psychological phenomena.
Time-series studies were very infrequently found in this literature search (2%). Time-series studies traditionally occur in subfields of psychology in which single-case research is not often used (e.g., personality, physiological/biological). Recent advances in methods for collecting and analyzing time-series data (e.g., Borckardt et al., 2008) could expand the use of time-series methodology in the SCED community. One problem with drawing firm conclusions from this particular review finding is a semantic factor: Time-series is a specific term reserved for measurement occurring at a uniform interval. However, SCED research appears to not yet have adopted this language when referring to data collected in this fashion. When time-series data analytic methods are not used, the matter of measurement interval is of less importance and might not need to be specified or described as a time-series. An interesting extension of this work would be to examine SCED research that used time-series measurement strategies but did not label it as such. This is important because then it could be determined how many SCEDs could be analyzed with time-series statistical methods.
Daily diary and ecological momentary assessment methods
EMA and daily diary approaches represent methodological procedures for collecting repeated measurements in time-series and non-time-series experiments, which are also known as experience sampling. Presenting an in-depth discussion of the nuances of these sampling techniques is well beyond the scope of this paper. The reader is referred to the following review articles: daily diary (Bolger et al., 2003; Reis & Gable, 2000; Thiele, Laireiter, & Baumann, 2002), and EMA (Shiffman et al., 2008). Experience sampling in psychology has burgeoned in the past two decades as technological advances have permitted more precise and immediate reporting by participants (e.g., Internet-based, two-way pagers, cellular telephones, handheld computers) than do paper and pencil methods (for reviews see Barrett & Barrett, 2001; Shiffman & Stone, 1998). Both methods have practical limitations and advantages. For example, electronic methods are more costly and may exclude certain subjects from participating in the study, either because they do not have access to the necessary technology or they do not have the familiarity or savvy to successfully complete reporting. Electronic data collection methods enable the researcher to prompt responses at random or predetermined intervals and also accurately assess compliance. Paper and pencil methods have been criticized for their inability to reliably track respondents’ compliance: Palermo, Valenzuela, and Stork (2004) found better compliance with electronic diaries than with paper and pencil. On the other hand, Green, Rafaeli, Bolger, Shrout, & Reis (2006) demonstrated the psychometric data structure equivalence between these two methods, suggesting that the data collected in either method will yield similar statistical results given comparable compliance rates.
Daily diary/daily self-report and EMA measurement were somewhat rarely represented in this review, occurring in only 6.1% of the total studies. EMA methods had been used in only one of the reviewed studies. The recent proliferation of EMA and daily diary studies in psychology reported by others (Bolger et al., 2003; Piasecki et al., 2007; Shiffman et al., 2008) suggests that these methods have not yet reached SCED researchers, which could in part have resulted from the long-held supremacy of observational measurement in fields that commonly practice single-case research.
As was previously mentioned, measurement in SCEDs requires the reliable assessment of change over time. As illustrated in Table 4, DIV16 and the NRP explicitly require that reliability of all measures be reported. DIV12 provides little direction in the selection of the measurement instrument, except to require that three or more clinically important behaviors with relative independence be assessed. Similarly, the only item concerned with measurement on the Tate et al. scale specifies assessing behaviors consistent with the target of the intervention. The WWC and the Tate et al. scale require at least two independent assessors of the DV and that interrater reliability meeting minimum established thresholds be reported. Furthermore, WWC requires that interrater reliability be assessed on at least 20% of the data in each phase and in each condition. DIV16 expects that assessment of the outcome measures will be multisource and multimethod, when applicable. The interval of measurement is not specified by any of the reviewed sources. The WWC and the Tate et al. scale require that DVs be measured repeatedly across phases (e.g., baseline and treatment), which is a typical requirement of a SCED. The NRP asks that the time points at which DV measurement occurred be reported.
The baseline measurement represents one of the most crucial design elements of the SCED. Because subjects provide their own data for comparison, gathering a representative, stable sampling of behavior before manipulating the IV is essential to accurately inferring an effect. Some researchers have reported the typical length of the baseline period to range from 3 to 12 observations in intervention research applications (e.g., Center et al., 1986; Huitema, 1985; R. R. Jones et al., 1977; Sharpley, 1987); Huitema’s (1985) review of 881 experiments published in the Journal of Applied Behavior Analysis resulted in a modal number of three to four baseline points. Center et al. (1986) suggested five as the minimum number of baseline measurements needed to accurately estimate autocorrelation. Longer baseline periods suggest a greater likelihood of a representative measurement of the DVs, which has been found to increase the validity of the effects and reduce bias resulting from autocorrelation (Huitema & McKean, 1994). The results of this review are largely consistent with those of previous researchers: The mean number of baseline observations was found to be 10.22 (SD = 9.59), and 6 was the modal number of observations. Baseline data were available in 77.8% of the reviewed studies. Although the baseline assessment has tremendous bearing on the results of a SCED study, it was often difficult to locate the exact number of data points. Similarly, the number of data points assessed across all phases of the study were not easily identified.
The WWC, DIV12, and DIV16 agree that a minimum of three data points during the baseline is necessary. However, to receive the highest rating by the WWC, five data points are necessary in each phase, including the baseline and any subsequent withdrawal baselines as would occur in a reversal design. DIV16 explicitly states that more than three points are preferred and further stipulates that the baseline must demonstrate stability (i.e., limited variability), absence of overlap between the baseline and other phases, absence of a trend, and that the level of the baseline measurement is severe enough to warrant intervention; each of these aspects of the data is important in inferential accuracy. Detrending techniques can be used to address baseline data trend. The integration option in ARIMA-based modeling and the empirical mode decomposition method (Wu, Huang, Long, & Peng, 2007) are two sophisticated detrending techniques. In regression-based analytic methods, detrending can be accomplished by simply regressing each variable in the model on time (i.e., the residuals become the detrended series), which is analogous to adding a linear, exponential, or quadratic term to the regression equation.
NRP does not provide a minimum for data points, nor does the Tate et al. scale, which requires only a sufficient sampling of baseline behavior. Although the mean and modal number of baseline observations is well within these parameters, seven (1.7%) studies reported mean baselines of less than three data points.
Establishing a uniform minimum number of required baseline observations would provide researchers and reviewers with only a starting guide. The baseline phase is important in SCED research because it establishes a trend that can then be compared with that of subsequent phases. Although a minimum number of observations might be required to meet standards, many more might be necessary to establish a trend when there is variability and trends in the direction of the expected effect. The selected data analytic approach also has some bearing on the number of necessary baseline observations. This is discussed further in the Analysis section.
Reporting of repeated measurements
Stone and Shiffman (2002) provide a comprehensive set of guidelines for the reporting of EMA data, which can also be applied to other repeated-measurement strategies. Because the application of EMA is widespread and not confined to specific research designs, Stone and Shiffman intentionally place few restraints on researchers regarding selection of the DV and the reporter, which is determined by the research question under investigation. The methods of measurement, however, are specified in detail: Descriptions of prompting, recording of responses, participant-initiated entries, and the data acquisition interface (e.g., paper and pencil diary, PDA, cellular telephone) ought to be provided with sufficient detail for replication. Because EMA specifically, and time-series/daily diary methods similarly, are primarily concerned with the interval of assessment, Stone and Shiffman suggest reporting the density and schedule of assessment. The approach is generally determined by the nature of the research question and pragmatic considerations, such as access to electronic data collection devices at certain times of the day and participant burden. Compliance and missing data concerns are present in any longitudinal research design, but they are of particular importance in repeated-measurement applications with frequent measurement. When the research question pertains to temporal effects, compliance becomes paramount, and timely, immediate responding is necessary. For this reason, compliance decisions, rates of missing data, and missing data management techniques must be reported. The effect of missing data in time-series data streams has been the topic of recent research in the social sciences (e.g., Smith, Borckardt, & Nash, in press; Velicer & Colby, 2005a, 2005b). The results and implications of these and other missing data studies are discussed in the next section.
Analysis of SCED Data
Experts in the field generally agree about the majority of critical single-case experiment design and measurement characteristics. Analysis, on the other hand, is an area of significant disagreement, yet it has also received extensive recent attention and advancement. Debate regarding the appropriateness and accuracy of various methods for analyzing SCED data, the interpretation of single-case effect sizes, and other concerns vital to the validity of SCED results has been ongoing for decades, and no clear consensus has been reached. Visual analysis, following systematic procedures such as those provided by Franklin, Gorman, Beasley, and Allison (1997) and Parsonson and Baer (1978), remains the standard by which SCED data are most commonly analyzed (Parker, Cryer, & Byrns, 2006). Visual analysis can arguably be applied to all SCEDs. However, a number of baseline data characteristics must be met for effects obtained through visual analysis to be valid and reliable. The baseline phase must be relatively stable; free of significant trend, particularly in the hypothesized direction of the effect; have minimal overlap of data with subsequent phases; and have a sufficient sampling of behavior to be considered representative (Franklin, Gorman, et al., 1997; Parsonson & Baer, 1978). The effect of baseline trend on visual analysis, and a technique to control baseline trend, are offered by Parker et al. (2006). Kazdin (2010) suggests using statistical analysis when a trend or significant variability appears in the baseline phase, two conditions that ought to preclude the use of visual analysis techniques. Visual analysis methods are especially adept at determining intervention effects and can be of particular relevance in real-world applications (e.g., Borckardt et al., 2008; Kratochwill, Levin, Horner, & Swoboda, 2011).
However, visual analysis has its detractors. It has been shown to be inconsistent, can be affected by autocorrelation, and results in overestimation of effect (e.g., Matyas & Greenwood, 1990). Visual analysis as a means of estimating an effect precludes the results of SCED research from being included in meta-analysis, and also makes it very difficult to compare results to the effect sizes generated by other statistical methods. Yet, visual analysis proliferates in large part because SCED researchers are familiar with these methods and are not only generally unfamiliar with statistical approaches, but lack agreement about their appropriateness. Still, top experts in single-case analysis champion the use of statistical methods alongside visual analysis whenever it is appropriate to do so (Kratochwill et al., 2011).
Statistical analysis of SCED data consists generally of an attempt to address one or more of three broad research questions: (1) Does introduction/manipulation of the IV result in statistically significant change in the level of the DV (level-change or phase-effect analysis)? (2) Does introduction/manipulation of the IV result in statistically significant change in the slope of the DV over time (slope-change analysis)? and (3) Do meaningful relationships exist between the trajectory of the DV and other potential covariates? Level- and slope-change analyses are relevant to intervention effectiveness studies and other research questions in which the IV is expected to result in changes in the DV in a particular direction. Visual analysis methods are most adept at addressing research questions pertaining to changes in level and slope (Questions 1 and 2), most often using some form of graphical representation and standardized computation of a mean level or trend line within and between each phase of interest (e.g., Horner & Spaulding, 2010; Kratochwill et al., 2011; Matyas & Greenwood, 1990). Research questions in other areas of psychological science might address the relationship between DVs or the slopes of DVs (Question 3). A number of sophisticated modeling approaches (e.g., cross-lag, multilevel, panel, growth mixture, latent class analysis) may be used for this type of question, and some are discussed in greater detail later in this section. However, a discussion about the nuances of this type of analysis and all their possible methods is well beyond the scope of this article.
The statistical analysis of SCEDs is a contentious issue in the field. Not only is there no agreed-upon statistical method, but the practice of statistical analysis in the context of the SCED is viewed by some as unnecessary (see Shadish, Rindskopf, & Hedges, 2008). Traditional trends in the prevalence of statistical analysis usage by SCED researchers are revealing: Busk & Marascuilo (1992) found that only 10% of the published single-case studies they reviewed used statistical analysis; Brossart, Parker, Olson, & Mahadevan (2006) estimated that this figure had roughly doubled by 2006. A range of concerns regarding single-case effect size calculation and interpretation is discussed in significant detail elsewhere (e.g., Campbell, 2004; Cohen, 1994; Ferron & Sentovich, 2002; Ferron & Ware, 1995; Kirk, 1996; Manolov & Solanas, 2008; Olive & Smith, 2005; Parker & Brossart, 2003; Robey et al., 1999; Smith et al., in press; Velicer & Fava, 2003). One concern is the lack of a clearly superior method across datasets. Although statistical methods for analyzing SCEDs abound, few studies have examined their comparative performance with the same dataset. The most recent studies of this kind, performed by Brossart et al. (2006), Campbell (2004), Parker and Brossart (2003), and Parker and Vannest (2009), found that the more promising available statistical analysis methods yielded moderately different results on the same data series, which led them to conclude that each available method is equipped to adequately address only a relatively narrow spectrum of data. Given these findings, analysts need to select an appropriate model for the research questions and data structure, being mindful of how modeling results can be influenced by extraneous factors.
The current standards unfortunately provide little guidance in the way of statistical analysis options. This article presents an admittedly cursory introduction to available statistical methods; many others are not covered in this review. The following articles provide more in-depth discussion and description of other methods: Barlow et al. (2008); Franklin et al., (1997); Kazdin (2010); and Kratochwill and Levin (1992, 2010). Shadish et al. (2008) summarize more recently developed methods. Similarly, a Special Issue of Evidence-Based Communication Assessment and Intervention (2008, Volume 2) provides articles and discussion of the more promising statistical methods for SCED analysis. An introduction to autocorrelation and its implications for statistical analysis is necessary before specific analytic methods can be discussed. It is also pertinent at this time to discuss the implications of missing data.
Many repeated measurements within a single subject or unit create a situation that most psychological researchers are unaccustomed to dealing with: autocorrelated data, which is the nonindependence of sequential observations, also known as serial dependence. Basic and advanced discussions of autocorrelation in single-subject data can be found in Borckardt et al. (2008), Huitema (1985), and Marshall (1980), and discussions of autocorrelation in multilevel models can be found in Snijders and Bosker (1999) and Diggle and Liang (2001). Along with trend and seasonal variation, autocorrelation is one example of the internal structure of repeated measurements. In the social sciences, autocorrelated data occur most naturally in the fields of physiological psychology, econometrics, and finance, where each phase of interest has potentially hundreds or even thousands of observations that are tightly packed across time (e.g., electroencephalography actuarial data, financial market indices). Applied SCED research in most areas of psychology is more likely to have measurement intervals of day, week, or hour.
Autocorrelation is a direct result of the repeated-measurement requirements of the SCED, but its effect is most noticeable and problematic when one is attempting to analyze these data. Many commonly used data analytic approaches, such as analysis of variance, assume independence of observations and can produce spurious results when the data are nonindependent. Even statistically insignificant autocorrelation estimates are generally viewed as sufficient to cause inferential bias when conventional statistics are used (e.g., Busk & Marascuilo, 1988; R. R. Jones et al., 1977; Matyas & Greenwood, 1990). The effect of autocorrelation on statistical inference in single-case applications has also been known for quite some time (e.g., R. R. Jones et al., 1977; Kanfer, 1970; Kazdin, 1981; Marshall, 1980). The findings of recent simulation studies of single-subject data streams indicate that autocorrelation is a nontrivial matter. For example, Manolov and Solanas (2008) determined that calculated effect sizes were linearly related to the autocorrelation of the data stream, and Smith et al. (in press) demonstrated that autocorrelation estimates in the vicinity of 0.80 negatively affect the ability to correctly infer a significant level-change effect using a standardized mean differences method. Huitema and colleagues (e.g., Huitema, 1985; Huitema & McKean, 1994) argued that autocorrelation is rarely a concern in applied research. Huitema’s methods and conclusions have been questioned and opposing data have been published (e.g., Allison & Gorman, 1993; Matyas & Greenwood, 1990; Robey et al., 1999), resulting in abandonment of the position that autocorrelation can be conscionably ignored without compromising the validity of the statistical procedures. Procedures for removing autocorrelation in the data stream prior to calculating effect sizes are offered as one option: One of the more promising analysis methods, autoregressive integrated moving averages (discussed later in this article), was specifically designed to remove the internal structure of time-series data, such as autocorrelation, trend, and seasonality (Box & Jenkins, 1970; Tiao & Box, 1981).
Another concern inherent in repeated-measures designs is missing data. Daily diary and EMA methods are intended to reduce the risk of retrospection error by eliciting accurate, real-time information (Bolger et al., 2003). However, these methods are subject to missing data as a result of honest forgetfulness, not possessing the diary collection tool at the specified time of collection, and intentional or systematic noncompliance. With paper and pencil diaries and some electronic methods, subjects might be able to complete missed entries retrospectively, defeating the temporal benefits of these assessment strategies (Bolger et al., 2003). Methods of managing noncompliance through the study design and measurement methods include training the subject to use the data collection device appropriately, using technology to prompt responding and track the time of response, and providing incentives to participants for timely compliance (for additional discussion of this topic, see Bolger et al., 2003; Shiffman & Stone, 1998).
Even when efforts are made to maximize compliance during the conduct of the research, the problem of missing data is often unavoidable. Numerous approaches exist for handling missing observations in group multivariate designs (e.g., Horton & Kleinman, 2007; Ibrahim, Chen, Lipsitz, & Herring, 2005). Ragunathan (2004) and others concluded that full information and raw data maximum likelihood methods are preferable. Velicer and Colby (2005a, 2005b) established the superiority of maximum likelihood methods over listwise deletion, mean of adjacent observations, and series mean substitution in the estimation of various critical time-series data parameters. Smith et al. (in press) extended these findings regarding the effect of missing data on inferential precision. They found that managing missing data with the EM procedure (Dempster, Laird, & Rubin, 1977), a maximum likelihood algorithm, did not affect one’s ability to correctly infer a significant effect. However, lag-1 autocorrelation estimates in the vicinity of 0.80 resulted in insufficient power sensitivity (< 0.80), regardless of the proportion of missing data (10%, 20%, 30%, or 40%).1 Although maximum likelihood methods have garnered some empirical support, methodological strategies that minimize missing data, particularly systematically missing data, are paramount to post-hoc statistical remedies.
Nonnormal distribution of data
In addition to the autocorrelated nature of SCED data, typical measurement methods also present analytic challenges. Many statistical methods, particularly those involving model finding, assume that the data are normally distributed. This is often not satisfied in SCED research when measurements involve count data, observer-rated behaviors, and other, similar metrics that result in skewed distributions. Techniques are available to manage nonnormal distributions in regression-based analysis, such as zero-inflated Poisson regression (D. Lambert, 1992) and negative binomial regression (Gardner, Mulvey, & Shaw, 1995), but many other statistical analysis methods do not include these sophisticated techniques. A skewed data distribution is perhaps one of the reasons Kazdin (2010) suggests not using count, categorical, or ordinal measurement methods.
Available statistical analysis methods
Following is a basic introduction to the more promising and prevalent analytic methods for SCED research. Because there is little consensus regarding the superiority of any single method, the burden unfortunately falls on the researcher to select a method capable of addressing the research question and handling the data involved in the study. Some indications and contraindications are provided for each method presented here.
Multilevel and structural equation modeling
Multilevel modeling (MLM; e.g., Schmidt, Perels, & Schmitz, 2010) techniques represent the state of the art among parametric approaches to SCED analysis, particularly when synthesizing SCED results (Shadish et al., 2008). MLM and related latent growth curve and factor mixture methods in structural equation modeling (SEM; e.g., Lubke & Muthén, 2005; B. O. Muthén & Curran, 1997) are particularly effective for evaluating trajectories and slopes in longitudinal data and relating changes to potential covariates. MLM and related hierarchical linear models (HLM) can also illuminate the relationship between the trajectories of different variables under investigation and clarify whether or not these relationships differ amongst the subjects in the study. Time-series and cross-lag analyses can also be used in MLM and SEM (Chow, Ho, Hamaker, & Dolan, 2010; du Toit & Browne, 2007). However, they generally require sophisticated model-fitting techniques, making them difficult for many social scientists to implement. The structure (autocorrelation) and trend of the data can also complicate many MLM methods. The common, short data streams in SCED research and the small number of subjects also present problems to MLM and SEM approaches, which were developed for data with significantly greater numbers of observations when the number of subjects is fewer, and for a greater number of participants for model-fitting purposes, particularly when there are fewer data points. Still, MLM and related techniques arguably represent the most promising analytic methods.
A number of software options2 exist for SEM. Popular statistical packages in the social sciences provide SEM options, such as PROC CALIS in SAS (SAS Institute Inc., 2008), the AMOS module (Arbuckle, 2006) of SPSS (SPSS Statistics, 2011), and the sempackage for R (R Development Core Team, 2005), the use of which is described by Fox (Fox, 2006). A number of stand-alone software options are also available for SEM applications, including Mplus (
In this blog I will be describing case studies and single case designs, and discussing the differences between them.
A single case design is basically one where the subject serves as their own control group. They’re used more often to see how effective interventions are, as we’ve seen in our behavioural lectures. In a single-subject design, there are three phases. The first is the baseline phase. This phase is where the subject is measured on their behaviour with no interventions or behaviour changes taken. The next stage is the intervention stage. This is where the subject is measured on their behaviour when the independent variable has been introduced. The final stage is the reversal stage. This is where the independent variable is removed again. The researcher should only move onto the next stage when the data is stable, so stronger results are found. Single case designs are often used because they’re far more sensitive to individual differences than cases with more than one participants.
A case study is an analysis of a single unit. It is not necessarily on just an individual person, it can also on a specific group of people or on an event. In a case study, researchers collect detailed information about the subject. An advantage of case studies is that they give very detailed information, which is sometimes on very unique and interesting cases. An example of this kind of case study would be David Reimer, who was born as a male but was raised female due to an accident in a routine circumcision. A disadvantage of case studies is that they are very subjective, both because they are only on an individual unit and also because of the researchers. Another problem is that it is hard to generalise from case studies, as they are only one person and so you can’t apply the results to the general population.
There is a big difference between case studies and single case designs, despite them superficially sounding similar. In a single case design, the researcher has much more control over the participant. They introduce an independent variable and measure the behaviour, so this allows cause and effect to be more readily established. Case studies, on the other hand, the researcher simply observes the behaviour. This means that they are not very effective at testing theories. One effective way of utilizing both designs is using a case study to initially research an unknown area, and then using single case studies to test the hypothesis more effectively.
So overall, case studies and single case designs are two different kinds of experiment design that use individuals as subjects. They both have their advantages and disadvantages, and can be effectively used in different cases.
Posted by emilyjchurchill on March 25, 2012