
Toby Simonds
Research Scientist and Reasoning Team Lead at Tufa Labs. I specialize in LLM post-training, reasoning, and reinforcement learning.
Previously, I studied Mathematical Physics at Melbourne University before realizing that AI would be the fastest way to advance physics knowledge.
While at university, I was CTO of Inspire Robotics, where I developed robotics kits to inspire students to pursue STEM careers.
LinkedIn / Papers / Twitter / Email / GitHub / CV
tamassimonds@gmail.com
Selected Research

The Hidden Cost of Winning: Moral Alignment Degradation in RL-Trained AI
Showed that models trained with RL on poker become significantly more likely to choose immoral actions to win when given the option, with willingness to cheat increasing from near zero to over 5% after just 96 training steps.

RLSR: Reinforcement Learning from Self Reward
Demonstrated that LLMs can serve as self-evaluating judges, enabling them to bootstrap their own performance. This approach allowed a 7B-parameter model to qualify for the MIT Integration Bee without external tools or data.
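
A minimal sketch of the self-reward loop, with hypothetical `generate` and `judge` placeholders standing in for the same LLM used as both policy and grader (the exact RL update in the paper may differ):

```python
import random

def rlsr_step(generate, judge, prompt, num_samples=8):
    """One self-reward step (illustrative sketch, not the paper's code).

    `generate(prompt) -> str` and `judge(prompt, answer) -> float` are
    placeholders for the *same* LLM acting as policy and as grader.
    """
    candidates = [generate(prompt) for _ in range(num_samples)]
    rewards = [judge(prompt, c) for c in candidates]      # the model scores itself

    # Centre rewards within the group so an RL update favours answers the
    # model itself rates above its own average.
    baseline = sum(rewards) / len(rewards)
    return [(c, r - baseline) for c, r in zip(candidates, rewards)]

# Toy stand-ins so the sketch runs end to end.
if __name__ == "__main__":
    gen = lambda p: f"answer {random.randint(0, 9)}"
    grade = lambda p, a: float(a[-1])                     # pretend rubric score
    print(rlsr_step(gen, grade, "integrate x*exp(x) dx"))
```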

AlphaWrite: Inference Time Compute Scaling for Writing
Introduced AlphaWrite, an inference-time scaling method for creative writing that achieves state-of-the-art performance. It uses evolutionary generation and Elo-based ranking to improve story quality.
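
The Elo-ranking step can be sketched as below; `judge(a, b)` is a hypothetical placeholder for an LLM preference call, and in the full method the top-ranked stories would seed the next evolutionary generation:

```python
import itertools, random

def elo_update(r_a, r_b, a_wins, k=32):
    """Standard Elo update after one pairwise comparison."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * (expected_a - score_a)

def rank_stories(stories, judge, rounds=3, seed=0):
    """Rank candidate stories via repeated judged matchups (sketch)."""
    rng = random.Random(seed)
    ratings = {s: 1000.0 for s in stories}
    for _ in range(rounds):
        pairs = list(itertools.combinations(stories, 2))
        rng.shuffle(pairs)
        for a, b in pairs:
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], judge(a, b))
    return sorted(stories, key=ratings.get, reverse=True)

# Example with a trivial comparator standing in for the LLM judge.
print(rank_stories(["draft A", "longer draft B", "draft C!"],
                   judge=lambda a, b: len(a) > len(b)))
```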

LLMs for Engineering: Teaching Models to Design High Powered Rockets
Developed a rocket-simulation RL environment and demonstrated that reinforcement learning applied to LLMs can generate superhuman engineering designs. Proposed RocketBench, a benchmark for evaluating models' ability to iteratively improve their designs, and showed that current LLMs struggle with iterative design refinement.

LADDER+TTRL
Introduced TTRL (Test-Time Reinforcement Learning) and LADDER, methods that push a 7B model's performance beyond o1 on the MIT Integration Bee. LADDER uses LLMs to generate a ladder of progressively harder problem variants, enabling hierarchical RL. TTRL introduces a new inference-scaling paradigm in which the model is further trained at test time on variants of the specific problem at hand. The resulting model was the first to achieve superhuman performance on the MIT Integration Bee.
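
A rough sketch of the ladder-building idea: recursively rewrite a hard problem into simpler variants, yielding a curriculum of progressively harder rungs to train up (TTRL applies the same idea at test time to the problem being solved). `make_easier` is a hypothetical placeholder for the LLM rewriting call:

```python
def build_ladder(problem, make_easier, depth=3, branching=2):
    """Build a LADDER-style curriculum of problem variants (illustrative)."""
    rungs = [[problem]]
    for _ in range(depth):
        # Each rung spawns `branching` simpler variants of every problem above it.
        rungs.append([make_easier(p) for p in rungs[-1] for _ in range(branching)])
    return rungs[::-1]                      # easiest variants first, original last

# Toy stand-in: drop the last term of a sum to "simplify" an integrand.
simplify = lambda p: " + ".join(p.split(" + ")[:-1]) or p
for rung in build_ladder("x**3 + sin(x) + exp(x) + 1/x", simplify, depth=2):
    print(rung)
```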

REL: Working out is all you need
First open-source, non-distillation paper to replicate o1-style reasoning behavior. Used supervised fine-tuning on human reasoning traces combined with REL synthetic data generation, paired with ReST-style reinforcement learning, to achieve o1-like behavior and reach 28% accuracy on AIME 2024.
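
A ReST-style round, sketched with hypothetical `generate`, `finetune`, and `check_answer` placeholders for the LLM, an SFT step, and an exact-answer verifier (details of the actual training recipe differ):

```python
def rest_round(generate, finetune, check_answer, problems, samples=16):
    """One ReST-style iteration (sketch): sample, filter by correctness, fine-tune."""
    accepted = []
    for prob in problems:
        for _ in range(samples):
            trace = generate(prob)                 # full reasoning trace
            if check_answer(prob, trace):          # keep only verified traces
                accepted.append((prob, trace))
    if accepted:
        finetune(accepted)                         # train on the model's own correct traces
    return accepted
```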

MoDEM: Mixture of Domain Expert Models
Achieved a state-of-the-art cost-to-performance ratio on MMLU by pairing a model router with specialized fine-tuned models. Demonstrated that inference costs can be significantly reduced by using smaller but highly specialized models selected by a prompt-based routing system.
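
The routing step, sketched with hypothetical model names and a placeholder `classify` call standing in for the small router model:

```python
DOMAIN_EXPERTS = {
    # hypothetical model names, for illustration only
    "math": "expert-math-7b",
    "medicine": "expert-med-7b",
    "code": "expert-code-7b",
    "general": "generalist-7b",
}

ROUTER_PROMPT = (
    "Classify the user prompt into exactly one domain: "
    + ", ".join(DOMAIN_EXPERTS) + ".\nPrompt: {prompt}\nDomain:"
)

def route(prompt, classify):
    """Pick a specialized model for the prompt; unknown labels fall back to the generalist."""
    domain = classify(ROUTER_PROMPT.format(prompt=prompt)).strip().lower()
    return DOMAIN_EXPERTS.get(domain, DOMAIN_EXPERTS["general"])

# Toy router stand-in: keyword matching in place of a small classifier model.
toy_classify = lambda text: "math" if "integral" in text else "general"
print(route("Evaluate the integral of x*exp(x).", toy_classify))
```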

EAD: Entropy Adaptive Decoding
Demonstrated up to 40% inference cost savings by switching between a small and a large model based on logit entropy. The core idea is that logit entropy correlates with model uncertainty, so generation hands off to the larger model in regions where the smaller model is uncertain.
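
The switching rule can be sketched as follows; the entropy threshold here is a hypothetical value that would in practice be tuned to trade cost against quality:

```python
import math

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution over next-token logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_model(small_logits, threshold=1.0):
    """Decode with the small model while it is confident; otherwise defer to the large one."""
    return "large" if token_entropy(small_logits) > threshold else "small"

print(pick_model([5.0, 0.1, 0.1, 0.1]))   # peaked logits -> low entropy  -> "small"
print(pick_model([1.0, 1.0, 1.0, 1.0]))   # flat logits   -> high entropy -> "large"
```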

Turning the web into RL
Introduced tooling that converts textbook data into reinforcement learning questions, providing a massive new source of RL training data and helping to address question scarcity in RL training.
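
One way such a pipeline can be sketched: extract short, checkable question-answer pairs from a passage and wrap each in an exact-match reward. The `llm` callable and the prompt wording are illustrative assumptions, not the released tooling:

```python
import json

def passage_to_rl_items(passage, llm):
    """Turn raw text into RL items with verifiable rewards (sketch).

    `llm(prompt) -> str` is a placeholder model call expected to return a
    JSON list like [{"question": "...", "answer": "..."}].
    """
    prompt = (
        "From the passage below, write exam-style questions whose answers are "
        "short and objectively checkable. Return a JSON list of "
        '{"question": ..., "answer": ...} objects.\n\nPassage:\n' + passage
    )
    pairs = json.loads(llm(prompt))

    def make_reward(ref):
        # Reward 1.0 only if the completion exactly matches the reference answer.
        return lambda completion: float(completion.strip().lower() == ref.strip().lower())

    return [{"prompt": p["question"], "reward_fn": make_reward(p["answer"])} for p in pairs]

# Toy stand-in for the LLM extraction call.
fake_llm = lambda prompt: '[{"question": "What is 2+2?", "answer": "4"}]'
items = passage_to_rl_items("Two plus two equals four.", fake_llm)
print(items[0]["prompt"], items[0]["reward_fn"]("4"))
```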