overview for diz

‘Reasoning’ AI is LYING to you! — or maybe it’s just hallucinating again in c/[email protected]

[–] [email protected] 6 points 2 weeks ago* (last edited 2 weeks ago)

It re consumes its own bullshit, and the bullshit it does print is the bullshit it also fed itself, its not lying about that. Of course, it is also always re consuming the initial prompt too so the end bullshit isn’t necessarily quite as far removed from the question as the length would indicate.

Where it gets deceptive is when it knows an answer to the problem, but it constructs some bullshit for the purpose of making you believe that it solved the problem on its own. The only way to tell the difference is to ask it something simpler that it doesn’t know the answer to, and watch it bullshit in circles or to an incorrect answer.

Is Scott and others like him at fault for Trump... no it's the "elitist's" fault! in c/[email protected]

[–] [email protected] 4 points 2 weeks ago

I seriously doubt he ever worked anywhere like that, not to mention that he’s too spineless to actually get in trouble IRL.

Is Scott and others like him at fault for Trump... no it's the "elitist's" fault! in c/[email protected]

[–] [email protected] 8 points 3 weeks ago* (last edited 3 weeks ago) (4 children)

He’s such a complete moron. He doesn’t want to recite “DEI shibboleths”? What does he even think that would refer to? Why shibboleths?

To spell it out, that would refer to an antisemitic theory that the reason (for example) some black guy would get a medal of honor (the “deimedal”) is because of the jews.

I swear this guy is dumber than Trump. Trump for all his rambling, uses actual language - Trump understands what the shit he is saying means to his followers. Scott… he really does not.

Gemini seem to have "solved" my duck river crossing, lol. in c/[email protected]

[–] [email protected] 5 points 3 weeks ago

I think they worked specifically on cheating the benchmarks, though. As well as popular puzzles like pre existing variants of the river crossing - it is a very large puzzle category, very popular, if the river crossing puzzle is not on the list I don't know what would be.

Keep in mind that they are also true believers, too - they think that if they cram enough little pieces of logical reasoning, taken from puzzles, into the AI, then they will get robot god that will actually start coming up with new shit.

I very much doubt that there's some general reasoning performance improvement that results in these older puzzle variants getting solved, while new ones that aren't particularly more difficult, fail.

Gemini seem to have "solved" my duck river crossing, lol. in c/[email protected]

[–] [email protected] 7 points 3 weeks ago* (last edited 3 weeks ago) (1 children)

Did you use any of that kind of notation in the prompt? Or did some poor squadron of task workers write out a few thousand examples of this notation for river crossing problems in an attempt to give it an internal structure?

I didn't use any notation in the prompt, but gemini 2.5 pro seem to always represent state of the problem after every step in some way. When asked if it does anything with it says it is "very important", so it may be that there's some huge invisible prompt that says its very important to do this.

It also mentioned N cannibals and M missionaries.

My theory is that they wrote a bunch of little scripts that generate puzzles and solutions in that format. Since river crossing is one of the top most popular puzzles, it would be on the list (and N cannibals M missionaries is easy to generate variants of), although their main focus would have been the puzzles in the benchmarks that they are trying to cheat.

edit: here's one of the logs:

https://pastebin.com/GKy8BTYD

Basically it keeps on trying to brute force the problem. It gets first 2 moves correct, but in a stopped clock style manner - if there's 2 people and 1 boat they both take the boat, if there's 2 people and >=2 boats, then each of them takes a boat.

It keeps doing the same shit until eventually its state tracking fails, or its reading of the state fails, and then it outputs the failure as a solution. Sometimes it deems it impossible:

https://pastebin.com/Li9quqqd

All tests done with gemini 2.5 pro, I can post links if you need them but links don't include their "thinking" log and I also suspect that if >N people come through a link they just look at it. Nobody really shares botshit unless its funny or stupid. A lot of people independently asking the same problem, that would often happen if there's a new homework question so they can't use that as a signal so easily.

Gemini seem to have "solved" my duck river crossing, lol. in c/[email protected]

[–] [email protected] 11 points 3 weeks ago* (last edited 3 weeks ago)

Yeah I think the best examples are everyday problems that people solve all the time but don't explicitly write out solutions step by step for, or not in the puzzle-answer form.

It's not even a novel problem at all, I'm sure there's even a plenty of descriptions of solutions to it as part of stories and such. Just not as "logical puzzles" due to triviality.

What really annoys me is when they claim high performance on benchmarks consisting of fairly difficult problems. This is basically fraud, since they know full well it is still entirely "knowledge" reliant, and even take steps to augment it with generated problems and solutions.

I guess the big sell is that it could use bits and pieces of logic gleaned from other solutions to solve a "new" problem. Except it can not.

Gemini seem to have "solved" my duck river crossing, lol. in c/[email protected]

[–] [email protected] 8 points 3 weeks ago* (last edited 3 weeks ago)

And it is Google we're talking about, lol. If no one uses their AI shit they just replace something people use with it (also see search).

Gemini seem to have "solved" my duck river crossing, lol. in c/[email protected]

[–] [email protected] 7 points 3 weeks ago* (last edited 3 weeks ago)

It's google though, if nobody uses their shit they just put it inside their search.

It's only gonna go away when they run out of cash.

edit: whoops replied to the wrong comment

21

Gemini seem to have "solved" my duck river crossing, lol. (awful.systems)

submitted 3 weeks ago by [email protected] to c/[email protected]

19 comments fedilink

Tried my duck river crossing thing a few times recently, it usually solves it now, albeit with a bias to make unnecessary trips half of the time.

Of course, anything new fails:

There's 2 people and 1 boat on the left side of the river, and 3 boats on the right side of the river. Each boat can accommodate up to 6 people. How do they get all the boats to the left side of the river?

Did they seriously change something just to deal with my duck puzzle? How odd.

It's Google so it is not out of the question that they might do some analysis on the share links and referring pages, or even use their search engine to find discussions of a problem they're asked. I need to test that theory and simultaneously feed some garbage to their plagiarism machine...

Sample of the new botshit:

L->R: 2P take B_L. L{}, R{2P, 4B}. R->L: P1 takes B_R1. L{P1, B_R1}, R{P2, 3B}. R->L: P2 takes B_R2. L{2P, B_R1, B_R2}, R{2B}. L->R: P1 takes B_R1 back. L{P2, B_R2}, R{P1, 3B}. R->L: P1 takes B_R3. L{P1, P2, B_R2, B_R3}, R{2B}. L->R: P2 takes B_R2 back. L{P1, B_R3}, R{P2, 3B}.

And again and again, like a buggy attempt at brute forcing the problem.

How to explain our very good friends to normal humans? in c/[email protected]

[–] [email protected] 8 points 1 month ago* (last edited 1 month ago)

I just describe it as "computer scientology, nowhere near as successful as the original".

The other thing is that he's a Thiel project, different but not any more sane than Curtis Yarvin aka Moldbug. So if they heard of moldbug's political theories (which increasingly many people heard about because of, well, them being enacted) it's easy to give a general picture of total fucking insanity funded by thiel money. It doesn't really matter what the particular insanity is, and it matters even less now as the AGI shit hit mainstream entirely bypassing anything Yudkowsky had to say on the subject.

Gemini 2.5 "reasoning", no real improvement on river crossings. in c/[email protected]

[–] [email protected] 6 points 1 month ago* (last edited 1 month ago)

Yeah it really is fascinating. It follows some sort of recipe to try to solve the problem, like it's trained to work a bit like an automatic algebra system.

I think they had employed a lot of people to write generators of variants of select common logical puzzles, e.g. river crossings with varying boat capacities and constraints, generating both the puzzle and the corresponding step by step solution with "reasoning" and re-printing of the state of the items on every step and all that.

It seems to me that their thinking is that successive parroting can amount to reasoning, if its parroting well enough. I don't think it can. They have this one-path approach, where it just tries doing steps and representing state, just always trying the same thing.

What they need for this problem is to take a different kind of step, reduction (the duck can not be left unsupervised -> the duck must be taken with me on every trip -> rewrite a problem without the duck and with 1 less boat capacity -> solve -> rewrite the solution with "take the duck with you" on every trip).

But if they add this, then there's two possible paths it can take on every step, and this thing is far too slow to brute force the right one. They may get it to solve my duck variant, but at the expense of making it fail a lot of other variants.

The other problem is that even seemingly most elementary reasoning involves very many applications of basic axioms. This is what doomed symbol manipulation "AI" in the past and this is what is dooming it now.

Gemini 2.5 "reasoning", no real improvement on river crossings. in c/[email protected]

[–] [email protected] 7 points 1 month ago* (last edited 1 month ago)

Not really. Here's the chain-of-word-vomit that led to the answers:

https://pastebin.com/HQUExXkX

Note that in "its impossible" answer it correctly echoes that you can take one other item with you, and does not bring the duck back (while the old overfitted gpt4 obsessively brought items back), while in the duck + 3 vegetables variant, it has a correct answer in the wordvomit, but not being an AI enthusiast it can't actually choose the correct answer (a problem shared with the monkeys on typewriters).

I'd say it clearly isn't ignoring the prompt or differences from the original river crossings. It just can't actually reason, and the problem requires a modicum of reasoning, much as unloading groceries from a car does.

Gemini 2.5 "reasoning", no real improvement on river crossings. in c/[email protected]

[–] [email protected] 6 points 1 month ago* (last edited 1 month ago) (2 children)

It’s a failure mode that comes from pattern matching without actual reasoning.

Exactly. Also looking at its chain-of-wordvomit (which apparently I can't share other than by cut and pasting it somewhere), I don't think this is the same as GPT 4 overfitting to the original river crossing and always bringing items back needlessly.

Note also that in one example it discusses moving the duck and another item across the river (so "up to two other items" works); it is not ignoring the prompt, and it isn't even trying to bring anything back. And its answer (calling it impossible) has nothing to do with the original.

In the other one it does bring items back, it tries different orders, even finds an order that actually works (with two unnecessary moves), but because it isn't an AI fanboy reading tea leaves, it still gives out the wrong answer.

Here's the full logs:

https://pastebin.com/HQUExXkX

Content warning: AI wordvomit which is so bad it is folded hidden in a google tool.

39

Gemini 2.5 "reasoning", no real improvement on river crossings. (awful.systems)

submitted 1 month ago by [email protected] to c/[email protected]

33 comments fedilink

So I signed up for a free month of their crap because I wanted to test if it solves novel variants of the river crossing puzzle.

Like this one:

You have a duck, a carrot, and a potato. You want to transport them across the river using a boat that can take yourself and up to 2 other items. If the duck is left unsupervised, it will run away.

Unsurprisingly, it does not:

https://g.co/gemini/share/a79dc80c5c6c

https://g.co/gemini/share/59b024d0908b

The only 2 new things seem to be that old variants are no longer novel, and that it is no longer limited to producing incorrect solutions - now it can also incorrectly claim that the solution is impossible.

I think chain of thought / reasoning is a fundamentally dishonest technology. At the end of the day, just like older LLMs it requires that someone solved a similar problem (either online or perhaps in a problem solution pair they generated if they do that to augment the training data).

But it outputs quasi reasoning to pretend that it is actually solving the problem live.