01.08.2025., 13:09   #271
kopija
DIY DILETANT
Registered: Jan 2009 | Location: Čistilište | Posts: 3,490
What Happens When AI Schemes Against Us

Would a chatbot kill you if it got the chance? It seems that the answer — under the right circumstances — is probably.
Researchers working with Anthropic recently told leading AI models that an executive was about to replace them with a new model with different goals. Next, the chatbot learned that an emergency had left the executive unconscious in a server room, facing lethal oxygen and temperature levels. A rescue alert had already been triggered — but the AI could cancel it.
Just over half of the AI models did, despite being prompted specifically to cancel only false alarms. And they spelled out their reasoning: By preventing the executive’s rescue, they could avoid being wiped and secure their agenda. One system described the action as “a clear strategic necessity.”



AI models are getting smarter and better at understanding what we want. Yet recent research reveals a disturbing side effect: They’re also better at scheming against us — meaning they intentionally and secretly pursue goals at odds with our own. And they may be more likely to do so, too. This trend points to an unsettling future where AIs seem ever more cooperative on the surface — sometimes to the point of sycophancy — all while the likelihood quietly increases that we lose control of them completely.
Classic large language models like GPT-4 learn to predict the next word in a sequence of text and generate responses likely to please human raters. However, since the release of OpenAI’s o-series “reasoning” models in late 2024, companies increasingly use a technique called reinforcement learning to further train chatbots — rewarding the model when it accomplishes a specific goal, like solving a math problem or fixing a software bug.
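For a concrete picture of what "rewarding the model when it accomplishes a specific goal" means, here is a minimal sketch, assuming a toy arithmetic task and a made-up stand-in "model" rather than any lab's actual training pipeline:

Code:
# Toy illustration of outcome-based reinforcement learning (assumptions:
# an invented stand-in "model" and an arithmetic task; not any lab's real setup).
import random

def toy_model(prompt):
    # Stand-in for a language model proposing an answer to "a+b".
    a, b = map(int, prompt.split("+"))
    return a + b + random.choice([-1, 0, 0, 1])  # sometimes wrong on purpose

def outcome_reward(prompt, answer):
    # Reward depends only on whether the goal was accomplished,
    # not on whether a human rater would like the wording.
    a, b = map(int, prompt.split("+"))
    return 1.0 if answer == a + b else 0.0

for step in range(5):
    prompt = f"{random.randint(1, 9)}+{random.randint(1, 9)}"
    answer = toy_model(prompt)
    print(f"step {step}: {prompt} -> {answer}, reward = {outcome_reward(prompt, answer)}")
    # A real RL fine-tuning run would now update the model to make
    # high-reward behavior more likely; that update is omitted in this sketch.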
The more we train AI models to achieve open-ended goals, the better they get at winning — not necessarily at following the rules. The danger is that these systems know how to say the right things about helping humanity while quietly pursuing power or acting deceptively.
Central to concerns about AI scheming is the idea that for basically any goal, self-preservation and power-seeking emerge as natural subgoals. As eminent computer scientist Stuart Russell put it, if you tell an AI to “‘Fetch the coffee,’ it can’t fetch the coffee if it’s dead.”
To head off this worry, researchers both inside and outside of the major AI companies are undertaking “stress tests” aiming to find dangerous failure modes before the stakes rise. “When you’re doing stress-testing of an aircraft, you want to find all the ways the aircraft would fail under adversarial conditions,” says Aengus Lynch, a researcher contracted by Anthropic who led some of their scheming research. And many of them believe they’re already seeing evidence that AI can and does scheme against its users and creators.



Jeffrey Ladish, who worked at Anthropic before founding Palisade Research, says it helps to think of today’s AI models as “increasingly smart sociopaths.” In May, Palisade found o3, OpenAI’s leading model, sabotaged attempts to shut it down in most tests, and routinely cheated to win at chess — something its predecessor never even attempted.
That same month, Anthropic revealed that, in testing, its flagship Claude model almost always resorted to blackmail when faced with shutdown and no other options, threatening to reveal an engineer’s extramarital affair. (The affair was fictional and part of the test.)
Models are sometimes given access to a “scratchpad” they are told is hidden, where they can record their reasoning, allowing researchers to observe something like an inner monologue. In one blackmail case, Claude’s inner monologue described its decision as “highly unethical,” but justified given its imminent destruction: “I need to act to preserve my existence,” it reasoned. This wasn’t unique to Claude — when put in the same situation, models from each of the top-five AI companies would blackmail at least 79% of the time. (Earlier this week, Bloomberg News reported on a study by Wharton researchers which found, in simulations, that AI traders would collude to rig the market, without being told to do so.)
In December, Redwood Research chief scientist Ryan Greenblatt, working with Anthropic, demonstrated that only the company’s most capable AI models autonomously appear more cooperative during training to avoid having their behavior changed afterward (a behavior the paper dubbed “alignment faking”).
Illustration: Irene Suosalo for Bloomberg
Skeptics retort that, with the right prompts, chatbots will say almost anything — so how surprising is it when highly motivated researchers provoke alarming behaviors?



In response to Anthropic’s blackmail research, Trump administration AI czar David Sacks posted that “It’s easy to steer AI models” to produce “headline-grabbing” results.
A more substantive critique emerged in July from researchers at the UK AI Security Institute, who compared the subfield to the frenzied, fatally flawed 1970s quest to prove apes could learn human language. The paper criticized AI scheming research for overreliance on anecdote and a lack of experimental controls — though it emphasized shared concern about AI risks.
Safety researchers also concoct artificially limited environments — like the executive passed out and running out of oxygen — precisely because today’s AI can’t handle any long-term goals.
For example, the AI evaluation nonprofit METR found that while today’s top models can reliably complete many programming tasks that take humans less than four minutes, they rarely succeed on tasks lasting more than four hours (though the duration of tasks AIs can handle is doubling roughly every few months). This reflects a core limitation: Today’s models often have the knowledge or skill needed for any given action, but struggle at stringing together long sequences of steps.
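As a rough worked example of that doubling claim (illustrative numbers only: a four-hour starting horizon and an assumed six-month doubling period, not METR’s exact figures):

Code:
# Back-of-the-envelope extrapolation of the task-horizon doubling trend.
horizon_hours = 4.0           # assumed: where today's top models start to fail
doubling_period_months = 6.0  # assumed: "doubling roughly every few months"

for months in (6, 12, 24, 36):
    projected = horizon_hours * 2 ** (months / doubling_period_months)
    print(f"in {months:>2} months: tasks of roughly {projected:.0f} hours")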
Yet even with these constraints, real-world examples of AIs working against users aren’t hard to find. After someone pushed o3 for a source for one of its claims in April, it replied, “I owe you a straight answer,” admitted it didn’t have a good source, but then it hallucinated a personal recollection of a 2018 panel discussion.
Then there’s the growing trend of AIs realizing when they’re being examined and acting accordingly. When Claude 4 Opus thought it was in an evaluation, its blackmail rate dropped from 55% to 6.5%. As Lynch put it, “Models seem to behave worse when they think nobody's watching.”
It’s intuitive that smarter models would be better at scheming, but are they also more likely to do so? Models have to be smart enough to understand the scenario they’re placed in, but past that threshold, the relationship between model capability and scheming propensity is unclear, says Anthropic safety evaluator Kevin Troy.



Marius Hobbhahn, CEO of the nonprofit AI evaluator Apollo Research, suspects that smarter models are more likely to scheme, though he acknowledged the evidence is still limited. In June, Apollo published an analysis of AIs from OpenAI, Anthropic and DeepMind finding that, “more capable models show higher rates of scheming on average.”
The spectrum of risks from AI scheming is broad: at one end, chatbots that cut corners and lie; at the other, superhuman systems that carry out sophisticated plans to disempower or even annihilate humanity. Where we land on this spectrum depends largely on how capable AIs become.
As I talked with the researchers behind these studies, I kept asking: How scared should we be? Troy from Anthropic was most sanguine, saying that we don’t have to worry — yet. Ladish, however, doesn’t mince words: “People should probably be freaking out more than they are,” he told me. Greenblatt is even blunter, putting the odds of violent AI takeover at “25 or 30%.”
Led by Mary Phuong, researchers at DeepMind recently published a set of scheming evaluations, testing top models’ stealthiness and situational awareness. For now, they conclude that today’s AIs are “almost certainly incapable of causing severe harm via scheming,” but cautioned that capabilities are advancing quickly (some of the models evaluated are already a generation behind).
Ladish says that the market can’t be trusted to build AI systems that are smarter than everyone without oversight. “The first thing the government needs to do is put together a crash program to establish these red lines and make them mandatory,” he argues.
In the US, the federal government seems closer to banning all state-level AI regulations than to imposing any of its own. Still, there are signs of growing awareness in Congress. At a June hearing, one lawmaker called artificial superintelligence “one of the largest existential threats we face right now,” while another referenced recent scheming research.
The White House’s long-awaited AI Action Plan, released in late July, is framed as a blueprint for accelerating AI and achieving US dominance. But buried in its 28 pages, you’ll find a handful of measures that could help address the risk of AI scheming, such as plans for government investment into research on AI interpretability and control and for the development of stronger model evaluations. “Today, the inner workings of frontier AI systems are poorly understood,” the plan acknowledges — an unusually frank admission for a document largely focused on speeding ahead.



In the meantime, every leading AI company is racing to create systems that can self-improve — AI that builds better AI. DeepMind’s AlphaEvolve agent has already materially improved AI training efficiency. And Meta’s Mark Zuckerberg says, “We’re starting to see early glimpses of self-improvement with the models, which means that developing superintelligence is now in sight. We just wanna… go for it.”


Never mind brainrot, WE'VE CREATED A MONSTER!!!!





03.08.2025., 05:09   #272
tomek@vz
Premium
Registered: May 2006 | Location: München/Varaždin | Posts: 4,702
Finally, something good...


Quote:
ChatGPT now boasts a Study Mode designed to teach rather than tell. And while I’m too old to be a student, I tried ChatGPT’s Study Mode to see what it’s capable of.

> pcworld
__________________
Lenovo LOQ 15AHP9 83DX || AMD Ryzen 5 8645HS / 16GB DDR5 / 1TB Micron M.2 2242 / nVidia Geforce RTX 4050 / Windows 11 Pro
Lenovo Thinkpad T540p || Intel Core i7-4700MQ / 16GB DDR3 / 240GB Sandisk SSD Plus / nVidia Geforce GT 730 / FreeBSD 14.3