LessWrong (30+ Karma)

Follow LessWrong (30+ Karma): TYPE III AUDIO

Message frequency:  6.1 / day

Message History

FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in real-world cases. We’re open-minded on the types of submissions we receive and on how they address the problem. We’ve set aside approximately 200 thousand dollars for prizes. Winning submissions may receive a prize from 5 thousa...



TLDR

We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and...



This was the week of Claude Opus 4.8. I covered the model card, then model welfare concerns, and finally capabilities and reactions. It's a good model, sir, an incremental but real improvement over Opus 4.7, and it is now my clear daily driver. The Trump Executive Order returned from being seemingly dead, officially putting us in the prior restraint era of frontier model rel...



Rohin Shah was recently interviewed on 80,000 Hours about his views on AGI safety and his work at Google DeepMind. I'm posting the transcript below to encourage further discussion. I think the discussion is interesting, though I disagree on a number of topics, especially alignment difficulty and CoT monitoring.


Transcript

Who's Rohi...



Work done for our MATS 10.0 Sprint project, mentored by Neel Nanda and Adam Karvonen.

Hugging Face, GitHub

TL;DR: We have improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following the approach by Niclas Luick) and making a small change to the injection form...

