LessWrong (30+ Karma)

Message frequency: 6.1 / day

Message History

"Lab Leaks, Black Holes, and Eggs: Epistemic Case Study Competition" by Oliver Sourbut, Josh Jacobson, Future of Lif... (2 hrs)

FLF is running a competition to find the best workflows and methodologies for using AI to produce reliable, trustworthy knowledge bases, grounded in real-world cases. We’re open-minded on the types of submissions we receive and on how they address the problem. We’ve set aside approximately 200 thousand dollars for prizes. Winning submissions may receive a prize from 5 thousa...

"(Mis)generalization of Helpful-Only Fine-tuning" by Omar Khursheed, Baram Sosis, Fabien Roger (2 hrs)

TLDR

We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and...

"AI #171: False Flag" by Zvi (2 hrs)

This was the week of Claude Opus 4.8. I covered the model card, then model welfare concerns, and finally capabilities and reactions. It's a good model, sir, an incremental but real improvement over Opus 4.7, and it is now my clear daily driver. The Trump Executive Order returned from being seemingly dead, officially putting us in the prior restraint era of frontier model rel...

"Rohin Shah on AGI Safety" by anaguma (2 hrs)

Rohin Shah recently gave an interview on 80,000 Hours about his views on AGI safety and his work at Google DeepMind. I'm posting the transcript below to encourage further discussion. I think the discussion is interesting, though I disagree on a number of topics, especially alignment difficulty and CoT monitoring.

Transcript

Who's Rohi...

"Building Better Activation Oracles" by ceselder, jan_bauer, Niclas Luick, Adam Karvonen, Neel Nanda (2 hrs)

Work done for our MATS 10.0 Sprint project - mentored by Neel Nanda and Adam Karvonen

Huggingface, Github

TL;DR: We have improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following the approach by Niclas Luick) and making a small change to the injection form...
