We present methodology motivated by a controlled trial designed to validate SPOT GRADE, a novel surgical bleeding severity scale (Spotnitz et al., 2018). Briefly, the study was designed to quantify inter- and intra-surgeon agreement in characterizing the severity of surgical bleeds via a Kappa statistic. Multiple surgeons were presented with a randomized sequence of controlled bleeding videos and asked to apply the rating system to characterize each wound. Each video was shown multiple times to quantify intra-surgeon reliability, thereby creating clustered data. In this work, we adapt the Kappa statistic for clustered data and investigate its performance in group sequential testing as a means of increasing study efficiency. Operating characteristics of the Kappa statistic under several types of group sequential stopping boundaries are assessed via simulation, and the methods are applied to data from the SPOT GRADE trial. Finally, we illustrate potential sample size savings relative to a fixed sample design and consider the trade-offs with power.
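As a point of reference, the standard (unclustered) Kappa statistic contrasts the observed proportion of agreement $p_o$ with the agreement expected by chance $p_e$, via $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below is a minimal illustration of how such a Kappa estimate might be evaluated at interim looks against a group sequential boundary; the simulated-data setup, the simplified large-sample standard error, and the hypothesized null value are illustrative assumptions, not the clustered estimator or the calibrated designs developed in this work.

```python
import numpy as np

rng = np.random.default_rng(1)

def cohen_kappa(r1, r2, categories):
    """Standard (unclustered) Cohen's Kappa for two raters."""
    p_o = np.mean(r1 == r2)                        # observed agreement
    p_e = sum(np.mean(r1 == c) * np.mean(r2 == c)  # chance agreement
              for c in categories)
    return p_o, p_e, (p_o - p_e) / (1.0 - p_e)

# Simulate paired ratings on a 5-category scale with moderate true agreement.
categories = np.arange(5)
n_max = 400
truth = rng.choice(categories, size=n_max)
def noisy(r):  # each rater reproduces the true category 80% of the time
    return np.where(rng.random(n_max) < 0.8, r,
                    rng.choice(categories, size=n_max))
rater1, rater2 = noisy(truth), noisy(truth)

# Two equally spaced interim looks with a Pocock-style critical value
# (2.178 is the two-look Pocock constant for two-sided alpha = 0.05).
kappa_null = 0.4  # hypothesized Kappa under H0 (illustrative)
for look, n in enumerate([n_max // 2, n_max], start=1):
    p_o, p_e, k_hat = cohen_kappa(rater1[:n], rater2[:n], categories)
    # Simplified large-sample SE of Kappa; a cluster-robust variance would
    # replace this for the repeated-video design described above.
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    z = (k_hat - kappa_null) / se
    print(f"look {look}: n={n}, kappa={k_hat:.3f}, z={z:.2f}, "
          f"stop={abs(z) > 2.178}")
```

Under this kind of design, a sufficiently extreme interim test statistic permits early stopping, which is the source of the expected sample size savings relative to a fixed sample design discussed above.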