Ethics and Psychology: Alignment

Showing posts with label Alignment. Show all posts

Saturday, August 2, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Korbak, T., et al. (2025).

arXiv:2507.11473

Abstract

AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

A pdf is here.

Here are some thoughts:

The paper highlights a unique moment in AI development, where large language models reason in human language, making their decisions interpretable through visible “chain of thought” (CoT) processes. This human-readable reasoning enables researchers to audit, monitor, and potentially catch misaligned or risky behaviors by reviewing the model's intermediary steps rather than just its final outputs.

While CoT monitoring presents new possibilities for AI oversight and transparency, the paper emphasizes its fragility: monitorability can decrease if model training shifts toward less interpretable methods or if models become incentivized to obscure their thoughts. The authors caution that CoT traces may not always faithfully represent internal reasoning and that models might find ways to hide misbehavior regardless. They call for further research into how much trust can be placed in CoT monitoring, the development of benchmarks for faithfulness and transparency, and architectural choices that preserve monitorability.

Ultimately, the paper urges AI developers to treat CoT monitorability as a valuable but unstable safety layer, advocating for its inclusion alongside—but not in place of—other oversight and alignment strategies

Tuesday, April 29, 2025

Why the Mystery of Consciousness Is Deeper Than We Thought

Philip Goff

Scientific American

Originally published 3 July 24

Here is an excerpt:

The hard problem comes after we’ve explained all of these functions of the brain, where we are still left with a puzzle: Why is the carrying out of these functions accompanied by experience? Why doesn’t all this mechanistic functioning go on “in the dark”? In my own work, I have argued that the hard problem is rooted in the way that the “father of modern science,” Galileo, designed physical science to exclude consciousness.

Chalmers made the quandary vivid by promoting the idea of a “philosophical zombie,” a complicated mechanism set up to behave exactly like a human being and with the same information processing in its brain, but with no consciousness. You stick a knife in such a zombie, and it screams and runs away. But it doesn’t actually feel pain. When a philosophical zombie crosses the street, it carefully checks that there is no traffic, but it doesn’t actually have any visual or auditory experience of the street.

Nobody thinks zombies are real, but they offer a vivid way of working out where you stand on the hard problem. Those on Team Chalmers believe that if all there was to a human being were the mechanistic processes of physical science, we’d all be zombies. Given that we’re not zombies, there must be something more going on in us to explain our consciousness. Solving the hard problem is then a matter of working out the extra ingredient, with one increasingly popular option being to posit very rudimentary forms of consciousness at the level of fundamental particles or fields.

For the opposing team, such as the late, great philosopher Daniel Dennett, this division between feeling and behavior makes no sense. The only task for a science of consciousness is explaining behavior, not just the external behavior of the organism but also that of its inner parts. This debate has rattled on for decades.

The article is here.

Here are some thoughts:

The author discusses the "hard problem of consciousness," a concept introduced by philosopher David Chalmers in the 1990s. The hard problem refers to the difficulty of explaining why the brain's functions are accompanied by subjective experience, rather than occurring without any experience at all.

The author uses the idea of "philosophical zombies" (beings that behave like humans but lack consciousness) and "pain-pleasure inverts" (beings that feel pleasure when we feel pain, and vice versa) to illustrate the complexity of this problem.

This is important for psychologists because it highlights the deep mystery surrounding consciousness and suggests that explaining behavior is not enough; we also need to understand subjective experience. It also challenges some basic assumptions about why we behave the way we do and points to the perplexing "mystery of psychophysical harmony" - why our behavior and consciousness align in a coherent way.

Friday, September 1, 2023

Building Superintelligence Is Riskier Than Russian Roulette

Tam Hunt & Roman Yampolskiy

nautil.us

Originally posted 2 August 23

Here is an excerpt:

The precautionary principle is a long-standing approach for new technologies and methods that urges positive proof of safety before real-world deployment. Companies like OpenAI have so far released their tools to the public with no requirements at all to establish their safety. The burden of proof should be on companies to show that their AI products are safe—not on public advocates to show that those same products are not safe.

Recursively self-improving AI, the kind many companies are already pursuing, is the most dangerous kind, because it may lead to an intelligence explosion some have called “the singularity,” a point in time beyond which it becomes impossible to predict what might happen because AI becomes god-like in its abilities. That moment could happen in the next year or two, or it could be a decade or more away.

Humans won’t be able to anticipate what a far-smarter entity plans to do or how it will carry out its plans. Such superintelligent machines, in theory, will be able to harness all of the energy available on our planet, then the solar system, then eventually the entire galaxy, and we have no way of knowing what those activities will mean for human well-being or survival.

Can we trust that a god-like AI will have our best interests in mind? Similarly, can we trust that human actors using the coming generations of AI will have the best interests of humanity in mind? With the stakes so incredibly high in developing superintelligent AI, we must have a good answer to these questions—before we go over the precipice.

Because of these existential concerns, more scientists and engineers are now working toward addressing them. For example, the theoretical computer scientist Scott Aaronson recently said that he’s working with OpenAI to develop ways of implementing a kind of watermark on the text that the company’s large language models, like GPT-4, produce, so that people can verify the text’s source. It’s still far too little, and perhaps too late, but it is encouraging to us that a growing number of highly intelligent humans are turning their attention to these issues.

Philosopher Toby Ord argues, in his book The Precipice: Existential Risk and the Future of Humanity, that in our ethical thinking and, in particular, when thinking about existential risks like AI, we must consider not just the welfare of today’s humans but the entirety of our likely future, which could extend for billions or even trillions of years if we play our cards right. So the risks stemming from our AI creations need to be considered not only over the next decade or two, but for every decade stretching forward over vast amounts of time. That’s a much higher bar than ensuring AI safety “only” for a decade or two.

Skeptics of these arguments often suggest that we can simply program AI to be benevolent, and if or when it becomes superintelligent, it will still have to follow its programming. This ignores the ability of superintelligent AI to either reprogram itself or to persuade humans to reprogram it. In the same way that humans have figured out ways to transcend our own “evolutionary programming”—caring about all of humanity rather than just our family or tribe, for example—AI will very likely be able to find countless ways to transcend any limitations or guardrails we try to build into it early on.

The info is here.

Here is my summary:

The article argues that building superintelligence is a risky endeavor, even more so than playing Russian roulette. Further, there is no way to guarantee that we will be able to control a superintelligent AI, and that even if we could, it is possible that the AI would not share our values. This could lead to the AI harming or even destroying humanity.

The authors propose that we should pause our current efforts to develop superintelligence and instead focus on understanding the risks involved. He argues that we need to develop a better understanding of how to align AI with our values, and that we need to develop safety mechanisms that will prevent AI from harming humanity. (See Shelley's Frankenstein as a literary example.)

Thursday, April 7, 2022

How to Prevent Robotic Sociopaths: A Neuroscience Approach to Artificial Ethics

Christov-Moore, L., Reggente, N., et al.

https://doi.org/10.31234/osf.io/6tn42

Abstract

Artificial intelligence (AI) is expanding into every niche of human life, organizing our activity, expanding our agency and interacting with us to an increasing extent. At the same time, AI’s efficiency, complexity and refinement are growing quickly. Justifiably, there is increasing concern with the immediate problem of engineering AI that is aligned with human interests.

Computational approaches to the alignment problem attempt to design AI systems to parameterize human values like harm and flourishing, and avoid overly drastic solutions, even if these are seemingly optimal. In parallel, ongoing work in service AI (caregiving, consumer care, etc.) is concerned with developing artificial empathy, teaching AI’s to decode human feelings and behavior, and evince appropriate, empathetic responses. This could be equated to cognitive empathy in humans.

We propose that in the absence of affective empathy (which allows us to share in the states of others), existing approaches to artificial empathy may fail to produce the caring, prosocial component of empathy, potentially resulting in superintelligent, sociopath-like AI. We adopt the colloquial usage of “sociopath” to signify an intelligence possessing cognitive empathy (i.e., the ability to infer and model the internal states of others), but crucially lacking harm aversion and empathic concern arising from vulnerability, embodiment, and affective empathy (which permits for shared experience). An expanding, ubiquitous intelligence that does not have a means to care about us poses a species-level risk.

It is widely acknowledged that harm aversion is a foundation of moral behavior. However, harm aversion is itself predicated on the experience of harm, within the context of the preservation of physical integrity. Following from this, we argue that a “top-down” rule-based approach to achieving caring, aligned AI may be unable to anticipate and adapt to the inevitable novel moral/logistical dilemmas faced by an expanding AI. It may be more effective to cultivate prosociality from the bottom up, baked into an embodied, vulnerable artificial intelligence with an incentive to preserve its real or simulated physical integrity. This may be achieved via optimization for incentives and contingencies inspired by the development of empathic concern in vivo. We outline the broad prerequisites of this approach and review ongoing work that is consistent with our rationale.

If successful, work of this kind could allow for AI that surpasses empathic fatigue and the idiosyncrasies, biases, and computational limits of human empathy. The scaleable complexity of AI may allow it unprecedented capability to deal proportionately and compassionately with complex, large-scale ethical dilemmas. By addressing this problem seriously in the early stages of AI’s integration with society, we might eventually produce an AI that plans and behaves with an ingrained regard for the welfare of others, aided by the scalable cognitive complexity necessary to model and solve extraordinary problems.

The info is here.

Thursday, February 6, 2020

Taking Stock of Moral Approaches to Leadership: An Integrative Review of Ethical, Authentic, and Servant Leadership

G. James Lemoine, Chad A. Hartnell,
and Hannes Leroy
Academy of Management AnnalsVol. 13, No. 1
Published Online:16 Jan 2019
https://doi.org/10.5465/annals.2016.0121

Abstract

Moral forms of leadership such as ethical, authentic, and servant leadership have seen a surge of interest in the 21st century. The proliferation of morally based leadership approaches has resulted in theoretical confusion and empirical overlap that mirror substantive concerns within the larger leadership domain. Our integrative review of this literature reveals connections with moral philosophy that provide a useful framework to better differentiate the specific moral content (i.e., deontology, virtue ethics, and consequentialism) that undergirds ethical, authentic, and servant leadership, respectively. Taken together, this integrative review clarifies points of integration and differentiation among moral approaches to leadership and delineates avenues for future research that promise to build complementary rather than redundant knowledge regarding how moral approaches to leadership inform the broader leadership domain.

From the Conclusion section

Although morality’s usefulness in the leadership domain has often been questioned (e.g., Mumford & Fried, 2014), our comparative review of the three dominant moral approaches (i.e., ethical, authentic, and servant leadership) clearly indicates that moral leadership behaviors positively impact a host of desirable organizationally relevant outcomes. This conclusion counters old critiques that issues of morality in leadership are unimportant (e.g., England & Lee, 1974; Rost, 1991; Thompson, 1956). To the contrary, moral forms of leadership have much potential to explain leadership’s influence in a manner substantially distinct from classical forms of leadership such as task-oriented, relationship-oriented, and change-oriented leadership (DeRue, Nahrgang, Wellman, & Humphrey, 2011; Yukl, Gordon, & Taber, 2002).

Thursday, November 30, 2017

Why We Should Be Concerned About Artificial Superintelligence

Matthew Graves
Skeptic Magazine
Originally published November 2017

Here is an excerpt:

Our intelligence is ultimately a mechanistic process that happens in the brain, but there is no reason to assume that human intelligence is the only possible form of intelligence. And while the brain is complex, this is partly an artifact of the blind, incremental progress that shaped it—natural selection. This suggests that developing machine intelligence may turn out to be a simpler task than reverse- engineering the entire brain. The brain sets an upper bound on the difficulty of building machine intelligence; work to date in the field of artificial intelligence sets a lower bound; and within that range, it’s highly uncertain exactly how difficult the problem is. We could be 15 years away from the conceptual breakthroughs required, or 50 years away, or more.

The fact that artificial intelligence may be very different from human intelligence also suggests that we should be very careful about anthropomorphizing AI. Depending on the design choices AI scientists make, future AI systems may not share our goals or motivations; they may have very different concepts and intuitions; or terms like “goal” and “intuition” may not even be particularly applicable to the way AI systems think and act. AI systems may also have blind spots regarding questions that strike us as obvious. AI systems might also end up far more intelligent than any human.

The last possibility deserves special attention, since superintelligent AI has far more practical significance than other kinds of AI.

AI researchers generally agree that superintelligent AI is possible, though they have different views on how and when it’s likely to be developed. In a 2013 survey, top-cited experts in artificial intelligence assigned a median 50% probability to AI being able to “carry out most human professions at least as well as a typical human” by the year 2050, and also assigned a 50% probability to AI greatly surpassing the performance of every human in most professions within 30 years of reaching that threshold.

The article is here.

Ethics and Psychology

Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Saturday, August 2, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tuesday, April 29, 2025

Why the Mystery of Consciousness Is Deeper Than We Thought

Friday, September 1, 2023

Building Superintelligence Is Riskier Than Russian Roulette

Thursday, April 7, 2022

How to Prevent Robotic Sociopaths: A Neuroscience Approach to Artificial Ethics

Thursday, February 6, 2020

Taking Stock of Moral Approaches to Leadership: An Integrative Review of Ethical, Authentic, and Servant Leadership

Thursday, November 30, 2017

Why We Should Be Concerned About Artificial Superintelligence

Get posts by email:

Welcome to the Nexus of Ethics, Psychology, Morality, Philosophy and Health Care

Saturday, August 2, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tuesday, April 29, 2025

Why the Mystery of Consciousness Is Deeper Than We Thought

Friday, September 1, 2023

Building Superintelligence Is Riskier Than Russian Roulette

Thursday, April 7, 2022

How to Prevent Robotic Sociopaths: A Neuroscience Approach to Artificial Ethics

Thursday, February 6, 2020

Taking Stock of Moral Approaches to Leadership: An Integrative Review of Ethical, Authentic, and Servant Leadership

Thursday, November 30, 2017

Why We Should Be Concerned About Artificial Superintelligence

Subscribe To