We present methodology motivated by a controlled trial designed to validate SPOT GRADE, a novel surgical bleeding severity score (Spotnitz et al., Spine, 2018). Briefly, the study was designed to quantify inter- and intra-surgeon agreement in characterizing the severity of surgical bleeds via a Kappa statistic. Multiple surgeons were presented with a randomized sequence of controlled bleeding videos and asked to apply the rating system to characterize each wound. Each video was presented multiple times in randomized order, resulting in clustered data. In this work, we implement a multiple outputation procedure to account for within-video clustering and embed the testing procedure in a group sequential framework to increase study efficiency. We establish independent increments for the proposed multiple outputation-based Kappa statistic, allowing for the application of standard group sequential stopping boundaries and monitoring procedures. Operating characteristics of the proposed method are assessed via simulation, and the method is applied to data from the SPOT GRADE trial. We illustrate potential sample size savings relative to a fixed-sample design and consider tradeoffs with power.
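To make the multiple outputation idea concrete, the following is a minimal sketch and not the estimator used in the trial. It assumes only two raters and uses Cohen's Kappa, whereas the study involves multiple surgeons and does not specify the Kappa variant here. The names `cohen_kappa`, `mo_kappa`, `clusters`, and `n_outputations` are hypothetical. Each outputation keeps one randomly chosen presentation per video, so the retained ratings are independent across videos; the Kappa estimates are then averaged over outputations.

```python
import random
from collections import Counter


def cohen_kappa(pairs):
    """Cohen's Kappa for a list of (rating_a, rating_b) pairs (two raters assumed)."""
    n = len(pairs)
    p_obs = sum(a == b for a, b in pairs) / n
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    # Chance agreement from the product of the raters' marginal frequencies.
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_exp == 1.0:
        return float("nan")  # Kappa is undefined when chance agreement equals 1
    return (p_obs - p_exp) / (1.0 - p_exp)


def mo_kappa(clusters, n_outputations=1000, seed=0):
    """Average Kappa over repeated one-per-cluster subsamples (multiple outputation).

    `clusters` maps a video id to the list of (rating_a, rating_b) pairs
    recorded over that video's repeated presentations (hypothetical layout).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_outputations):
        # Keep one presentation per video so the sampled pairs are independent.
        sample = [rng.choice(pairs) for pairs in clusters.values()]
        kappa = cohen_kappa(sample)
        if kappa == kappa:  # skip draws where Kappa is undefined (NaN)
            estimates.append(kappa)
    if not estimates:
        return float("nan")
    return sum(estimates) / len(estimates)


# Toy data (illustrative only): three videos, each shown three times.
toy = {
    "video_1": [(0, 0), (0, 1), (0, 0)],
    "video_2": [(1, 1), (1, 1), (2, 1)],
    "video_3": [(2, 2), (2, 2), (1, 2)],
}
estimate = mo_kappa(toy, n_outputations=200, seed=1)
```

Skipping undefined draws is a simplification adopted for this sketch. A variance estimate for the averaged statistic, which the group sequential boundaries would require, is not shown.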