
Toby Simonds
Research Scientist and Reasoning Team Lead at Tufa Labs. I specialize in LLM post-training, reasoning, and reinforcement learning.
Previously, I studied Mathematical Physics at Melbourne University before realizing that AI would be the fastest way to advance physics knowledge.
While at university, I was CTO of Inspire Robotics, where I developed robotics kits to inspire students to pursue STEM careers.
LinkedIn / Papers / Twitter / Email / GitHub / CV
tamassimonds@gmail.com
Selected Research

The Hidden Cost of Winning: Moral Alignment Degradation in RL-Trained AI
Showed that models trained with RL on poker become significantly more likely to choose immoral actions to win when given the option, with willingness to cheat increasing from near zero to over 5% after just 96 training steps.

RLSR: Reinforcement Learning from Self Reward
Demonstrated that LLMs can serve as self-evaluating judges, enabling them to bootstrap their own performance. This approach allowed a 7B-parameter model to qualify for the MIT Integration Bee without external tools or data.
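
A minimal sketch of the self-reward loop, with hypothetical `generate` and `judge` placeholders standing in for the same LLM used as both policy and grader (the exact RL update in the paper may differ):

```python
import random

def rlsr_step(generate, judge, prompt, num_samples=8):
    """One self-reward step (illustrative sketch, not the paper's code).

    `generate(prompt) -> str` and `judge(prompt, answer) -> float` are
    placeholders for the *same* LLM acting as policy and as grader.
    """
    candidates = [generate(prompt) for _ in range(num_samples)]
    rewards = [judge(prompt, c) for c in candidates]      # the model scores itself

    # Centre rewards within the group so an RL update favours answers the
    # model itself rates above its own average.
    baseline = sum(rewards) / len(rewards)
    return [(c, r - baseline) for c, r in zip(candidates, rewards)]

# Toy stand-ins so the sketch runs end to end.
if __name__ == "__main__":
    gen = lambda p: f"answer {random.randint(0, 9)}"
    grade = lambda p, a: float(a[-1])                     # pretend rubric score
    print(rlsr_step(gen, grade, "integrate x*exp(x) dx"))
```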

AlphaWrite: Inference Time Compute Scaling for Writing
Introduced AlphaWrite, an inference-time scaling method for creative writing that achieves state-of-the-art performance. It uses evolutionary generation and Elo-based ranking to improve story quality.
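
The Elo-ranking step can be sketched as below; `judge(a, b)` is a hypothetical placeholder for an LLM preference call, and in the full method the top-ranked stories would seed the next evolutionary generation:

```python
import itertools, random

def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * (expected_a - score_a)

def rank_stories(stories, judge, rounds=3, seed=0):
    """Rank candidate stories via repeated judged matchups (sketch)."""
    rng = random.Random(seed)
    ratings = {s: 1000.0 for s in stories}
    for _ in range(rounds):
        pairs = list(itertools.combinations(stories, 2))
        rng.shuffle(pairs)
        for a, b in pairs:
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b))
    return sorted(stories, key=ratings.get, reverse=True)

# Example with a trivial comparator standing in for the LLM judge.
print(rank_stories(["draft A", "longer draft B", "draft C!"],
                   judge=lambda a, b: len(a) > len(b)))
```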

LLMs for Engineering: Teaching Models to Design High Powered Rockets
Developed a rocket-simulation RL environment and demonstrated that reinforcement learning applied to LLMs can generate superhuman engineering designs. Proposed RocketBench, a benchmark for evaluating models' ability to iteratively improve their designs, and showed that current LLMs struggle with iterative design refinement.

LADDER+TTRL
Introduced TTRL (Test-Time Reinforcement Learning) and LADDER, methods that push a 7B model's performance beyond o1 on the MIT Integration Bee. LADDER uses LLMs to generate a ladder of progressively harder problem variants, enabling hierarchical RL. TTRL introduces a new inference-scaling paradigm in which the model is further trained at test time on variants of the specific problem at hand. The resulting model was the first to achieve superhuman performance on the MIT Integration Bee.
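
A rough sketch of the ladder-building idea: recursively rewrite a hard problem into simpler variants, yielding a curriculum of progressively harder rungs to train up (TTRL applies the same idea at test time to the problem being solved). `make_easier` is a hypothetical placeholder for the LLM rewriting call:

```python
def build_ladder(problem, make_easier, depth=3, branching=2):
    """Build a LADDER-style curriculum of problem variants (illustrative)."""
    rungs = [[problem]]
    for _ in range(depth):
        # Each rung spawns `branching` simpler variants of every problem above it.
        rungs.append([make_easier(p) for p in rungs[-1] for _ in range(branching)])
    return rungs[::-1]                      # easiest variants first, original last

# Toy stand-in: drop the last term of a sum to "simplify" an integrand.
simplify = lambda p: " + ".join(p.split(" + ")[:-1]) or p
for rung in build_ladder("x**3 + sin(x) + exp(x) + 1/x", simplify, depth=2):
    print(rung)
```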

REL: Working out is all you need
First open-source, non-distillation paper to replicate o1-style reasoning behavior. Used supervised fine-tuning on human reasoning traces combined with REL synthetic data generation, paired with ReST-style reinforcement learning, to achieve o1-like behavior and reach 28% accuracy on AIME 2024.
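
A ReST-style round, sketched with hypothetical `generate`, `finetune`, and `check_answer` placeholders for the LLM, an SFT step, and an exact-answer verifier (details of the actual training recipe differ):

```python
def rest_round(generate, finetune, check_answer, problems, samples=16):
    """One ReST-style iteration (sketch): sample, filter by correctness, fine-tune."""
    accepted = []
    for prob in problems:
        for _ in range(samples):
            trace = generate(prob)                 # full reasoning trace
            if check_answer(prob, trace):          # keep only verified traces
                accepted.append((prob, trace))
    if accepted:
        finetune(accepted)                         # train on the model's own correct traces
    return accepted
```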

MoDEM: Mixture of Domain Expert Models
Achieved a state-of-the-art cost-to-performance ratio on MMLU by pairing a model router with specialized fine-tuned models. Demonstrated that inference costs can be significantly reduced by using smaller but highly specialized models selected by a prompt-based routing system.
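
The routing step, sketched with hypothetical model names and a placeholder `classify` call standing in for the small router model:

```python
DOMAIN_EXPERTS = {
    # hypothetical model names, for illustration only
    "math": "expert-math-7b",
    "medicine": "expert-med-7b",
    "code": "expert-code-7b",
    "general": "generalist-7b",
}

ROUTER_PROMPT = (
    "Classify the user prompt into exactly one domain: "
    + ", ".join(DOMAIN_EXPERTS) + ".\nPrompt: {prompt}\nDomain:"
)

def route(prompt, classify):
    """Pick a specialized model for the prompt; unknown labels fall back to the generalist."""
    domain = classify(ROUTER_PROMPT.format(prompt=prompt)).strip().lower()
    return DOMAIN_EXPERTS.get(domain, DOMAIN_EXPERTS["general"])

# Toy router stand-in: keyword matching in place of a small classifier model.
toy_classify = lambda text: "math" if "integral" in text else "general"
print(route("Evaluate the integral of x*exp(x).", toy_classify))
```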

EAD: Entropy Adaptive Decoding
Demonstrated up to 40% inference cost savings by switching between a small and a large model based on logit entropy. The core idea is that logit entropy correlates with model uncertainty, so generation hands off to the larger model in regions where the smaller model is uncertain.
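
The switching rule can be sketched as follows; the entropy threshold here is a hypothetical value that would in practice be tuned to trade cost against quality:

```python
import math

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution over next-token logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_model(small_logits, threshold=1.0):
    """Decode with the small model while it is confident; otherwise defer to the large one."""
    return "large" if token_entropy(small_logits) > threshold else "small"

print(pick_model([5.0, 0.1, 0.1, 0.1]))   # peaked logits -> low entropy  -> "small"
print(pick_model([1.0, 1.0, 1.0, 1.0]))   # flat logits   -> high entropy -> "large"
```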

Turning the web into RL
Introduced tooling that converts textbook data into reinforcement learning questions, providing a massive new source of RL training data and helping to address question scarcity in RL training.
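
One way such a pipeline can be sketched: extract short, checkable question-answer pairs from a passage and wrap each in an exact-match reward. The `llm` callable and the prompt wording are illustrative assumptions, not the released tooling:

```python
import json

def passage_to_rl_items(passage, llm):
    """Turn raw text into RL items with verifiable rewards (sketch).

    `llm(prompt) -> str` is a placeholder model call expected to return a
    JSON list like [{"question": "...", "answer": "..."}].
    """
    prompt = (
        "From the passage below, write exam-style questions whose answers are "
        "short and objectively checkable. Return a JSON list of "
        '{"question": ..., "answer": ...} objects.\n\nPassage:\n' + passage
    )
    pairs = json.loads(llm(prompt))

    def make_reward(ref):
        # Reward 1.0 only if the completion exactly matches the reference answer.
        return lambda completion: float(completion.strip().lower() == ref.strip().lower())

    return [{"prompt": p["question"], "reward_fn": make_reward(p["answer"])} for p in pairs]

# Toy stand-in for the LLM extraction call.
fake_llm = lambda prompt: '[{"question": "What is 2+2?", "answer": "4"}]'
items = passage_to_rl_items("Two plus two equals four.", fake_llm)
print(items[0]["prompt"], items[0]["reward_fn"]("4"))
```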