In a case where operation “a” has no next operation, you either need to define a maximum waiting time so that a reward can be computed for the last items processed in a model run, or you must be absolutely sure that items still being processed when the run ends do not take part in the reward system. Count an item only if its reward can be calculated completely.
Edit: If you don’t find any suitable approach, you can set the maximum waiting time (Edit II: waiting time + process time = reward) to the model run time. This prevents any undefined reward and is also the lower boundary of your system. I would use this method to pre-store a reward and update that value whenever the involved processor changes state in the model.
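A minimal sketch of this pre-store-and-update idea in Python (the item IDs, the run-time constant, and the two hook functions are my own assumptions, not part of any particular simulation framework):

```python
# Hypothetical sketch: pre-store a worst-case reward for every item that enters
# the system, then overwrite it once the real waiting + process time is known
# (i.e. when the involved processor changes state).

MODEL_RUN_TIME = 480.0  # assumed model run length, e.g. in minutes

rewards = {}  # item_id -> reward (waiting time + process time)

def on_item_enter(item_id: str) -> None:
    # Pre-store the maximum waiting time (= model run time) so that no item
    # ends the run with an undefined reward.
    rewards[item_id] = MODEL_RUN_TIME

def on_processor_state_change(item_id: str, waiting_time: float, process_time: float) -> None:
    # Replace the placeholder with the actually observed reward.
    rewards[item_id] = waiting_time + process_time

# Example usage
on_item_enter("item_1")
on_item_enter("item_2")
on_processor_state_change("item_1", waiting_time=12.0, process_time=3.5)
print(rewards)  # item_2 keeps the pre-stored worst-case value
```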
Edit II: The description above normalizes your rewards against the longest processing time, so percentage values greater than 100% can occur. If you instead set all pre-stored rewards to a self-defined lower-boundary value and later update them normalized against the run-time length, then you can allocate all rewards.
lowest boundary reward = process time (a) / model runtime length
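As a sketch, the lower-boundary value and its later update could look like this (the function names and the choice to normalize the updated reward against the run-time length are my reading of the formula above, so treat them as assumptions):

```python
def lowest_boundary_reward(process_time: float, model_run_time: float) -> float:
    # Pre-stored lower boundary: the item's own process time normalized
    # against the model run-time length.
    return process_time / model_run_time

def updated_reward(waiting_time: float, process_time: float, model_run_time: float) -> float:
    # Later update, still normalized against the run-time length; it is never
    # smaller than the pre-stored lower boundary because waiting time >= 0.
    return (waiting_time + process_time) / model_run_time

print(lowest_boundary_reward(3.5, 480.0))   # pre-stored value, ~0.0073
print(updated_reward(12.0, 3.5, 480.0))     # value once fully known, ~0.0323
```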
To compare experiments, you scale each one by a factor of the run-time length of the current experiment divided by the maximum run-time length over all experiments.
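A small sketch of that cross-experiment scaling (the run-time values and the helper name are illustrative assumptions):

```python
def experiment_scaling_factors(run_time_lengths):
    # Scale each experiment by its own run-time length divided by the
    # maximum run-time length over all experiments.
    max_length = max(run_time_lengths)
    return [length / max_length for length in run_time_lengths]

run_times = [480.0, 600.0, 720.0]   # assumed run lengths of three experiments
factors = experiment_scaling_factors(run_times)
print(factors)                      # [0.666..., 0.833..., 1.0]
```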