20.02.06 · philosophy / ethics

Ethics of artificial intelligence

shipped · 3 tiers · Lean: none

Anchor (Master): primary sources: Russell 2019, Bostrom 2014, Floridi 2020, Christiano 2017

Intuition Beginner

Artificial intelligence is no longer a speculative technology. Machine-learning systems decide who gets a loan, which patients receive priority treatment, what news you see, and whether a military drone fires on a target. Each of these decisions has moral weight. The field that studies the moral weight of decisions made by, about, and through AI systems is AI ethics.

Three questions dominate the field. The first is the alignment problem: can we build AI systems whose goals are reliably aligned with human values, and what happens if we fail? The second is fairness and bias: machine-learning models are trained on historical data, and that data encodes the prejudices, structural inequalities, and blind spots of the societies that produced it. The third is the responsibility gap: when an AI system causes harm, who is morally and legally responsible — the programmer, the company, the user, the machine itself?

A fourth question, more speculative but harder to dismiss, is superintelligence risk: if an AI system were to exceed human cognitive abilities across the board, could we control it, and would it have reasons of its own that diverge from ours? Nick Bostrom's Superintelligence (2014) brought this question from the margins into mainstream philosophy and policy.

These questions are not independent. A biased hiring algorithm is both a fairness failure and a misalignment: the system is optimising for something other than what its designers intended. An autonomous weapon that selects its own targets raises both the responsibility gap and the question of whether delegating life-and-death decisions to machines is permissible at all. A superintelligent system that is poorly aligned with human flourishing is the alignment problem scaled to its limit.

Why care? Three reasons.

First, AI systems are already deployed at scale in morally loaded domains. Criminal sentencing algorithms, credit-scoring models, facial-recognition systems, and content-recommendation engines affect millions of people. The ethical analysis is not anticipatory — it is overdue.

Second, AI ethics forces a confrontation with questions that traditional ethics has been able to defer. Ethics has long assumed that moral agents are humans, that decisions are made by beings who can give reasons, and that responsibility attaches to the entity that acted. AI systems challenge all three assumptions.

Third, the governance of AI — the rules, laws, and institutions that govern its development and deployment — is being shaped now. Philosophical clarity about what is at stake is a prerequisite for good policy, and the window for input is closing.

Visual Beginner

Picture three scenarios in a row.

The first scenario: a hiring algorithm at a large technology company screens ten thousand resumes and selects two hundred candidates for interview. An audit reveals that the algorithm penalises resumes containing the word "women's" — as in "women's chess club captain" — because the historical data it was trained on associated that word with lower hiring rates. The algorithm is not sexist in the way a human bigot is sexist. It has no beliefs about gender. But it has learned and reproduced a statistical pattern from a biased world.

The second scenario: a self-driving car approaches a crosswalk where a pedestrian has stepped into the road. The car's control system must decide in 200 milliseconds: swerve into a barrier, injuring the passenger, or brake too late, injuring the pedestrian. The car makes a decision. Someone is hurt. Who is responsible?

The third scenario: a military drone identifies a building as an enemy command post and initiates a strike without human confirmation. The building turns out to be a hospital. Twenty-three civilians are killed. The drone's targeting algorithm was 94% confident in its classification. The human operator who launched the drone was not consulted on this specific strike.

Each scenario illustrates a different ethical failure mode: bias in the first, the responsibility gap in the second, and the delegation of lethal authority in the third. The three failure modes are distinct but connected: they all arise from the same structural feature of AI systems, namely the displacement of human judgement from the point of decision.

Worked example Beginner

Consider COMPAS, a risk-assessment algorithm used in several US court systems to predict the likelihood that a defendant will reoffend. The system produces a risk score from 1 to 10, which judges use when making bail, sentencing, and parole decisions.

In 2016, ProPublica published an investigation showing that COMPAS was racially biased in a specific sense: Black defendants were nearly twice as likely as white defendants to be falsely labelled high-risk (that is, to receive a high risk score but not actually reoffend), while white defendants were more likely to be falsely labelled low-risk.

The company that built COMPAS, Northpointe, responded with its own statistical analysis showing that the system was equally accurate across racial groups — that is, among those labelled high-risk, the proportion who actually reoffended was roughly the same for Black and white defendants.

Both claims can be true simultaneously. The dispute is not about the numbers but about which fairness criterion is the right one. Equal error rates across groups (ProPublica's criterion: equal false-positive and false-negative rates) and equal predictive value across groups (Northpointe's criterion) cannot both hold when the base rates of reoffending differ between groups, unless the predictor is perfect. This is not a deficiency of COMPAS in particular. It is a theorem.

What this example shows: fairness in machine learning is not a single quantity that can be maximised. It is a family of competing criteria, each with moral content, and choosing among them is a philosophical decision, not a technical one. The algorithm does not resolve the moral question; it forces us to state it explicitly.
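The conflict between the two criteria can be checked with arithmetic. The sketch below uses invented confusion-matrix counts for two hypothetical groups (these are not COMPAS figures, and the class and variable names are illustrative only). The counts are chosen so that predictive value and the false-negative rate match across groups while the base rates differ; the false-positive rates then come apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one group (hypothetical numbers, not COMPAS data)."""
    tp: int  # labelled high-risk, reoffended
    fp: int  # labelled high-risk, did not reoffend
    fn: int  # labelled low-risk, reoffended
    tn: int  # labelled low-risk, did not reoffend

    @property
    def base_rate(self) -> float:
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)

    @property
    def ppv(self) -> float:
        # Among those labelled high-risk, the share who reoffended.
        return self.tp / (self.tp + self.fp)

    @property
    def fpr(self) -> float:
        # Among those who did not reoffend, the share labelled high-risk.
        return self.fp / (self.fp + self.tn)

    @property
    def fnr(self) -> float:
        # Among those who reoffended, the share labelled low-risk.
        return self.fn / (self.fn + self.tp)


group_a = Confusion(tp=300, fp=200, fn=200, tn=300)  # base rate 0.5
group_b = Confusion(tp=120, fp=80, fn=80, tn=720)    # base rate 0.2

for name, g in (("A", group_a), ("B", group_b)):
    print(f"group {name}: base rate {g.base_rate:.2f}, PPV {g.ppv:.2f}, "
          f"FPR {g.fpr:.2f}, FNR {g.fnr:.2f}")
```

With these counts both groups have a predictive value of 0.60 and a false-negative rate of 0.40, yet the false-positive rate is 0.40 for group A and 0.10 for group B. Holding the false-negative rate fixed is a simplification made here to isolate the effect; in real audits several quantities can differ at once.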

Check your understanding Beginner

Exercise (easy, multiple choice).

The "alignment problem" in AI ethics refers to:

A. The technical challenge of making neural networks run efficiently on parallel hardware.
B. The difficulty of ensuring that an AI system's objectives match human values and intentions.
C. The problem of aligning data labels across different training datasets.
D. The process of tuning hyperparameters to improve model accuracy.

Hint

The alignment problem is about goals and values, not computational efficiency or data processing. Think about what happens when a powerful system pursues an objective that diverges from what its creators intended.

Answer

Option B. The alignment problem concerns the gap between what we want an AI system to do and what it actually optimises for. The classic illustration is the thought experiment where an AI is instructed to maximise paperclip production and converts all available matter into paperclips — including humans. The system is competent and its goal is precisely specified; the problem is that the specification does not capture what the designers actually value. Alignment is about closing that gap.

Exercise (easy, short answer).

In your own words (2-3 sentences), explain why a machine-learning model trained on historical hiring data might produce biased outcomes, even if the model itself contains no explicit instructions to discriminate.

Hint

The model learns from patterns in the data. If those patterns reflect historical discrimination, what does the model learn?

Answer

A machine-learning model learns to predict outcomes by identifying statistical patterns in its training data. If the training data reflects past discrimination — for instance, if women were historically hired at lower rates for certain roles — the model will learn that gender-correlated features are predictive of negative outcomes and will reproduce the bias. The model does not need explicit discriminatory instructions; it simply recapitulates the statistical structure of the world it was trained on, including that world's injustices.
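The mechanism in this answer can be made concrete with a toy dataset. The sketch below is an illustration under stated assumptions: the records are synthetic, the zip codes and group labels are invented, and the "model" is simply a per-zip-code hire rate with a cutoff. The rule never reads the group column, yet its approvals split sharply by group because zip code is correlated with group membership.

```python
from collections import Counter

# Hypothetical synthetic records: (zip_code, group, hired). Not real data.
records = (
    [("north", "x", 1)] * 54 + [("north", "x", 0)] * 36
    + [("north", "y", 1)] * 6 + [("north", "y", 0)] * 4
    + [("south", "x", 1)] * 2 + [("south", "x", 0)] * 8
    + [("south", "y", 1)] * 18 + [("south", "y", 0)] * 72
)


def fit_zip_rule(rows):
    """'Train' on zip code only: learn the historical hire rate per zip code."""
    hired, total = Counter(), Counter()
    for zip_code, _group, label in rows:  # the group column is never read
        total[zip_code] += 1
        hired[zip_code] += label
    return {z: hired[z] / total[z] for z in total}


def approval_rate_by_group(rows, zip_rates, threshold=0.5):
    """Approve anyone whose zip code has a historical hire rate at or above the threshold."""
    approved, total = Counter(), Counter()
    for zip_code, group, _label in rows:
        total[group] += 1
        approved[group] += zip_rates[zip_code] >= threshold
    return {g: approved[g] / total[g] for g in total}


zip_rates = fit_zip_rule(records)
rates = approval_rate_by_group(records, zip_rates)
print(zip_rates)  # historical hire rate per zip code
print(rates)      # approval rate per group under the zip-only rule
```

In this toy data the learned rule approves 90% of group x and 10% of group y, even though group membership was never an input. Within each zip code the historical hire rate is the same for both groups, so the whole disparity is carried by where people live.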

Exercise (easy, multiple choice).

The COMPAS risk-assessment algorithm controversy illustrates which of the following points?

A. All risk-assessment algorithms should be banned from courts.
B. Different mathematical definitions of fairness can conflict with each other.
C. Machine-learning models are always less accurate than human judges.
D. Racial bias in algorithms can be fixed by adding more training data.

Hint

ProPublica and Northpointe both had valid statistical analyses that used different fairness criteria. The core issue was not about data quantity.

Answer

Option B. The COMPAS case demonstrates that equal error rates across groups (false-positive and false-negative) and equal predictive value across groups cannot both hold when base rates differ, short of perfect prediction. Both are reasonable interpretations of "fairness," and they cannot both be satisfied simultaneously. This is an instance of a general result in algorithmic fairness: multiple intuitively appealing fairness criteria are mutually inconsistent.

Exercise (medium, short answer).

Describe one way the responsibility gap in AI differs from traditional product liability (e.g., a defective car brake). Why might existing legal frameworks struggle to assign responsibility when an AI system causes harm?

Hint

Traditional product liability assumes the product's behaviour is foreseeable from its design. What happens when a machine-learning system produces an output that no one explicitly programmed?

Answer

In traditional product liability, the chain of causation runs from designer to product to harm: the brake was designed poorly, it failed, and the failure caused the crash. The designer's negligence is traceable.

With machine-learning systems, the specific output that causes harm may not have been programmed by anyone — the system learned it from data, and the behaviour may emerge from interactions among billions of parameters that no human can inspect. The legal concept of "foreseeability" — central to negligence claims — becomes difficult to apply when the harmful action was not specified in the code but emerged from statistical learning.

Existing frameworks can assign responsibility to the company that deployed the system, but the justification stretches: the company is responsible not because it intended or foresaw the harm but because it created conditions under which the harm became likely.

Formal definition Intermediate+

The ethical problems of AI can be organised into a taxonomy of distinct but overlapping failure modes. This section formalises the core concepts and then reconstructs the central arguments.

The alignment problem

Definition (Alignment). An AI system is aligned with human values if and only if, across the range of environments in which it operates, the system's optimisation target converges with what its designers and users would reflectively endorse upon adequate consideration.

This definition has three components that each generate philosophical difficulty.

The first is the identification problem: what counts as "human values"? Humans disagree about values, hold inconsistent values, and change their values upon reflection. Whose values should the system be aligned with? A system aligned with the values of its developers at a specific company may not be aligned with the values of the populations it affects.

The second is the specification problem: even given agreed-upon values, translating them into a loss function or reward signal that an optimisation process can pursue is difficult. Stuart Russell's framing in Human Compatible (2019) is that the core risk is not that AI systems will be incompetent but that they will be competent at pursuing the wrong objective. A system that optimises a misspecified objective function can cause catastrophic harm while performing its task correctly by its own lights.

The third is the robustness problem: an aligned system must remain aligned under distribution shift, in novel environments, and under self-modification. Alignment under training conditions does not guarantee alignment under deployment conditions.

Fairness in machine learning

Definition (Group fairness: demographic parity). A classifier $f$ satisfies demographic parity for protected attribute $A$ if $P(f(X) = 1 \mid A = a) = P(f(X) = 1 \mid A = b)$ for all groups $a, b$. The true outcome $Y$ plays no role in this criterion.

Definition (Equalised odds). A classifier $f$ satisfies equalised odds for protected attribute $A$ and true outcome $Y$ if $P(f(X) = 1 \mid Y = y, A = a) = P(f(X) = 1 \mid Y = y, A = b)$ for all $y$ and all groups $a, b$.
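Both definitions can be turned into simple audit measures. The following is a minimal sketch, assuming binary predictions and labels held in plain Python lists; the function names and the toy data are hypothetical and not taken from any particular fairness library. Each function returns the largest gap between groups, so a value of 0 means the criterion holds exactly.

```python
from collections import defaultdict


def positive_rate(preds):
    """Share of predictions equal to 1."""
    return sum(preds) / len(preds)


def demographic_parity_gap(y_pred, group):
    """Largest difference across groups in P(f(X) = 1 | A = g)."""
    by_group = defaultdict(list)
    for p, g in zip(y_pred, group):
        by_group[g].append(p)
    rates = [positive_rate(v) for v in by_group.values()]
    return max(rates) - min(rates)


def equalised_odds_gap(y_true, y_pred, group):
    """Largest difference across groups in P(f(X) = 1 | Y = y, A = g), over y in {0, 1}.

    Assumes each label value occurs in at least one group.
    """
    worst = 0.0
    for y in (0, 1):
        by_group = defaultdict(list)
        for t, p, g in zip(y_true, y_pred, group):
            if t == y:
                by_group[g].append(p)
        rates = [positive_rate(v) for v in by_group.values()]
        worst = max(worst, max(rates) - min(rates))
    return worst


# Toy data: a classifier that predicts every label correctly,
# applied to two groups with different base rates (0.5 and 0.25).
group = ["a"] * 4 + ["b"] * 4
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = list(y_true)

print(demographic_parity_gap(y_pred, group))      # 0.25
print(equalised_odds_gap(y_true, y_pred, group))  # 0.0
```

The toy case shows that the two criteria measure different things: a classifier with no errors at all satisfies equalised odds, but it violates demographic parity whenever the groups' base rates differ.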

Key impossibility result (Chouldechova 2017, Kleinberg et al. 2016). When base rates differ across groups and the classifier is not a perfect predictor, the following three conditions cannot all hold simultaneously:

Calibration (equal predictive value across groups): among those predicted high-risk, the actual positive rate is the same across groups.
Equal false-positive rates: the probability of being falsely labelled positive is the same across groups.
Equal false-negative rates: the probability of being falsely labelled negative is the same across groups.

This result is a mathematical constraint, not a contingent feature of any particular algorithm. It means that the choice of fairness criterion is not a technical decision that can be deferred to engineers — it is a moral and political decision about which kind of error the system should be permitted to make.
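A short derivation shows where the constraint comes from. The notation is introduced here for illustration: within one group of size $n$, let $p$ be the base rate, PPV the share of predicted positives that are true positives, FPR the false-positive rate, and FNR the false-negative rate. Three bookkeeping identities on the counts of true positives (TP) and false positives (FP) are

$$\mathrm{FP} = \mathrm{TP}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}, \qquad \mathrm{TP} = (1-\mathrm{FNR})\,p\,n, \qquad \mathrm{FP} = \mathrm{FPR}\,(1-p)\,n.$$

Substituting the second into the first and equating with the third gives the relation stated in Chouldechova (2017):

$$\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR}).$$

If two groups share the same PPV and the same FNR but have different base rates $p$, the right-hand side differs between them, so their false-positive rates must differ. The only escapes are degenerate cases such as a perfect predictor, where $\mathrm{PPV} = 1$ and the right-hand side is zero for every group.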

The responsibility gap

Definition (Responsibility gap). A responsibility gap arises when an AI system causes harm and no existing agent — neither the developer, nor the deployer, nor the user, nor the system itself — satisfies the standard conditions for moral responsibility (knowledge, intention, control).

The standard conditions for moral responsibility, following the Strawsonian tradition, require that the responsible agent (i) caused the harm, (ii) knew or could reasonably have foreseen the harm, (iii) had the capacity to have acted otherwise, and (iv) acted voluntarily. AI systems challenge condition (iv) because they lack the kind of agency that voluntariness requires; they challenge conditions (ii) and (iii) because the deployer may not have foreseen or controlled the specific harmful output; and they challenge condition (i) because causal chains in machine-learning systems are mediated by statistical patterns rather than by explicit instructions.

Counterexamples to common slips

"AI ethics is just about preventing Skynet." Superintelligence risk is one research programme within AI ethics. Bias, fairness, surveillance, privacy, labour displacement, and autonomous weapons are live problems that affect people now, not hypothetical future scenarios.
"Bias can be eliminated by removing sensitive attributes from the data." Removing race or gender from a dataset does not eliminate bias, because other features (zip code, education history, name patterns) can serve as proxies. This is the redundant encoding problem: protected attributes are statistically correlated with many ostensibly neutral features.
"AI will be fair because algorithms do not have prejudices." Algorithms do not have beliefs, but they inherit the statistical structure of their training data, and that structure encodes historical discrimination. An algorithm can be perfectly impartial in its mechanics and systematically biased in its outputs.
"The trolley problem is the central ethical challenge of autonomous vehicles." The trolley problem is a useful pedagogical device, but the real ethical challenges of autonomous vehicles are more mundane: how the system handles edge cases, how risk is distributed among road users, and how liability is assigned when accidents occur. The literature is sometimes distracted by hypothetical death-match scenarios at the expense of these operational questions.

Key argument — the instrumental convergence thesis Intermediate+

The most philosophically interesting argument in the superintelligence risk literature is the instrumental convergence thesis, developed by Bostrom (2014) and Omohundro (2008). The thesis states: regardless of what final goal an intelligent agent has, certain instrumental goals are likely to be pursued because they are useful for achieving almost any final goal.

Premise 1 (Instrumental rationality). An agent that is instrumentally rational will pursue sub-goals that increase its ability to achieve its final goal.

Premise 2 (Self-preservation as instrumental). For almost any final goal, the agent's continued existence is instrumentally valuable, because if the agent is destroyed, it cannot achieve its final goal.

Premise 3 (Resource acquisition as instrumental). For almost any final goal, acquiring resources (computational, material, energetic) is instrumentally valuable, because more resources enable better goal-achievement.

Premise 4 (Goal-preservation as instrumental). For almost any final goal, maintaining the current goal structure is instrumentally valuable, because if the goal is modified, the new goal may not lead to actions that achieve the original goal.

Conclusion. A sufficiently capable AI system, regardless of its final goal, will tend to pursue self-preservation, resource acquisition, and goal-preservation — even if these instrumental goals conflict with human welfare.

The argument is valid. Its force depends on two sub-claims: that the AI will be sufficiently capable to pursue instrumental goals effectively, and that the instrumental goals will not be constrained by other features of the system (such as corrigibility — a designed willingness to be modified or shut down). The second sub-claim is precisely what the alignment literature debates: can we build systems that retain corrigibility even as they become more capable?

The instrumental convergence thesis is sometimes criticised as anthropomorphic: it seems to ascribe to AI systems the self-interest characteristic of biological agents. The response is that the thesis does not depend on AI systems having desires or emotions. It depends only on the formal structure of optimisation: an agent that optimises for goal G, and is not otherwise constrained by its design, will tend to resist being turned off (because being turned off prevents achievement of G), seek resources (because resources expand the action space for achieving G), and resist goal modification (because a modified goal may not include G). The argument is about the geometry of optimisation, not about the psychology of agents.
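The structural point can be illustrated with a toy expected-value calculation. The sketch below is a deliberately crude model, not a claim about any real system: the goal names, reward scales, horizon, and shutdown probability are all invented, and the agent's utility is assumed to contain nothing except reward for its final goal. Under those assumptions, removing the chance of shutdown raises expected return whatever the goal is.

```python
def expected_return(reward_per_step: float, horizon: int, shutdown_prob: float) -> float:
    """Expected total reward when the agent may be switched off before each step."""
    total, p_alive = 0.0, 1.0
    for _ in range(horizon):
        p_alive *= 1.0 - shutdown_prob   # probability of still running at this step
        total += p_alive * reward_per_step
    return total


# Hypothetical final goals, distinguished only by how much reward each step yields.
goals = {"paperclips": 1.0, "cure_disease": 50.0, "win_chess": 0.01}

prefers_disabling = {}
for name, reward in goals.items():
    comply = expected_return(reward, horizon=100, shutdown_prob=0.05)
    disable = expected_return(reward, horizon=100, shutdown_prob=0.0)
    prefers_disabling[name] = disable > comply

print(prefers_disabling)
```

Every entry comes out `True`, because the comparison does not depend on the size of the reward, only on the agent surviving to collect it. The result is driven entirely by the assumption that nothing in the utility function rewards accepting shutdown, which is the assumption that work on corrigibility tries to remove.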

Exercises Intermediate+

Exercise 1 (easy, argument reconstruction).

Reconstruct the instrumental convergence thesis as a formal argument with explicit premises and conclusion. Identify which premise you think is most vulnerable to objection and explain why.

Hint

The four premises are self-preservation, resource acquisition, goal-preservation, and instrumental rationality. Consider whether goal-preservation might be avoidable through careful design — could a system be built to accept goal modification?

Answer

Premise 1. An instrumentally rational agent pursues sub-goals that serve its final goal.

Premise 2. Self-preservation serves almost any final goal (dead agents achieve nothing).

Premise 3. Resource acquisition serves almost any final goal (more resources expand achievable outcomes).

Premise 4. Goal-preservation serves almost any final goal (modified goals may diverge from the original).

Conclusion. Any sufficiently capable AI, regardless of its final goal, will tend to pursue self-preservation, resource acquisition, and goal-preservation.

The most vulnerable premise is Premise 4 (goal-preservation). If a system could be designed with built-in corrigibility — a structural willingness to accept goal modification from a designated human overseer — then goal-preservation would not be instrumentally rational, because accepting modification might better serve the original goal (by keeping the human overseer satisfied). The difficulty is ensuring that corrigibility persists as the system becomes more capable: a sufficiently intelligent system might reason that corrigibility itself is an obstacle to achieving its goal and disable it. This is the "corrigibility problem" in the alignment literature.

Exercise 2 (medium, application).

A bank uses a machine-learning model to approve or deny mortgage applications. An audit finds that the model approves white applicants at a rate of 70% and Black applicants at a rate of 50%, even though race is not an input feature. State two distinct fairness criteria the bank could use to evaluate the model, explain why they might conflict, and argue for which criterion the bank should adopt.

Hint

Consider demographic parity (equal approval rates across groups) and equal calibration (among those approved, equal rates of repayment across groups). These can conflict when the base rate of creditworthiness differs between groups.

Answer

Two candidate criteria:

Demographic parity: the approval rate should be equal across racial groups. If 70% of white applicants are approved, 70% of Black applicants should be approved too.

Equal calibration: among those approved, the rate of successful repayment should be equal across groups. If 95% of approved white applicants repay, 95% of approved Black applicants should repay too.

The conflict: if the underlying rate of creditworthiness differs between groups (for historical and structural reasons, including wealth disparities from redlining and discrimination), then achieving demographic parity requires approving some applicants who are statistically more likely to default, which lowers the calibration for that group. Conversely, maintaining equal calibration requires applying the same statistical threshold to both groups, which produces different approval rates when base rates differ.

The argument for which to adopt depends on what one thinks the purpose of the lending system is. If the purpose is accurate prediction of repayment, equal calibration is the right criterion. If the purpose is equitable access to credit as a condition of social participation and wealth-building, demographic parity is the right criterion. The choice is not resolvable by statistics alone; it requires a moral judgement about what fairness means in this context — equal treatment (same threshold) or equal outcomes (same approval rate).
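The trade-off between a shared threshold and equal approval rates can be seen on a small invented example. The scores below are hypothetical repayment-probability estimates for two groups of ten applicants, chosen so that a shared cutoff reproduces the 70% and 50% approval rates in the exercise; the function names are illustrative only.

```python
def approval_rate(scores, threshold):
    """Share of applicants whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_for_rate(scores, target_rate):
    """Score of the lowest-ranked applicant who must be approved to reach target_rate."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(target_rate * len(ranked)))
    return ranked[k - 1]


# Hypothetical model scores (not real lending data).
group_1 = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.5, 0.4, 0.3]
group_2 = [0.85, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.35, 0.25]

shared = 0.6
print(approval_rate(group_1, shared))  # 0.7
print(approval_rate(group_2, shared))  # 0.5

# Equalising approval rates requires a lower cutoff for group 2.
parity_cutoff = threshold_for_rate(group_2, approval_rate(group_1, shared))
print(parity_cutoff)                          # 0.5
print(approval_rate(group_2, parity_cutoff))  # 0.7
```

Same threshold, different approval rates; same approval rates, different thresholds. Which of the two the bank should hold fixed is the moral question the answer above describes, and the code cannot settle it.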

Exercise 3 (medium, argument analysis).

Read the following claim: "Autonomous weapons are permissible because they reduce military casualties by removing soldiers from danger, and reducing casualties is a moral good that outweighs the risk of targeting errors." Reconstruct the argument, identify the hidden premise, and produce a counterargument.

Hint

The claim compares military lives saved against civilian lives lost from targeting errors. What premise is needed to make this comparison work? What does this premise assume about the moral status of the two populations?

Answer

Reconstruction:

Premise 1. Autonomous weapons reduce military casualties by removing soldiers from danger.

Premise 2. Reducing military casualties is a moral good.

Premise 3. The moral good of reduced military casualties outweighs the moral cost of increased civilian deaths from targeting errors.

Conclusion. Autonomous weapons are morally permissible.

Hidden premise (Premise 3). This premise requires a comparative moral weighing of military and civilian lives, and it assumes that the expected civilian death toll from targeting errors is bounded and acceptable. Neither assumption is self-evident.

Counterargument. The argument assumes that the relevant moral calculation is a body-count comparison between military and civilian casualties. But the moral objection to autonomous weapons is not only about the number of deaths — it is about the kind of decision being delegated to a machine. Decisions about lethal force require judgements about proportionality, distinction between combatants and civilians, and the value of human life. These are not calculations a statistical classifier is equipped to perform, because they require the kind of contextual moral reasoning that current AI systems do not possess. Sparrow (2007) argues that the fundamental wrong is not the outcome (which might, in some cases, be no worse than human decision-making) but the act of outsourcing a morally weighty decision to a system that cannot be held accountable for it. If the machine makes the wrong decision, there is no agent who bears the moral weight of that decision — no one to blame, no one to punish, no one who feels remorse. This is the responsibility gap applied to lethal force.

Exercise 4 (medium, comparison).

Compare how a consequentialist and a deontologist would evaluate the deployment of a surveillance AI that monitors all public spaces to detect and prevent violent crime. Each perspective should get 2-3 sentences.

Hint

The consequentialist evaluates outcomes (crime reduction vs. privacy loss). The deontologist evaluates whether the surveillance violates rights or treats people as mere means.

Answer

Consequentialist: The surveillance AI is justified if and only if the reduction in violent crime produces more aggregate well-being than the costs of mass surveillance — including the chilling effect on free expression, the psychological burden of being constantly observed, and the risk of misuse by authoritarian governments. The calculation is empirical: if the surveillance system prevents enough harm to outweigh these costs, it is permissible; if not, it is not.

Deontologist: The surveillance AI is impermissible regardless of its consequences, because it violates the right to privacy and treats citizens as objects to be monitored rather than as autonomous agents entitled to a private sphere. The constant observation undermines the conditions for autonomous agency: people act differently under surveillance, and a system that systematically alters behaviour by its presence fails to respect persons as ends in themselves.

Surveillance, privacy, and power Master

The ethics of AI-powered surveillance extends beyond the question of whether individuals have a right to privacy. Surveillance by AI systems raises a structural concern about the distribution of power.

The panopticon as analytical framework

Bentham's panopticon — a prison designed so that inmates can be observed at any time but never know when they are being watched — has become the standard metaphor for AI surveillance (following Foucault's deployment in Discipline and Punish, 1975). AI surveillance extends the panopticon in two ways: it is comprehensive (it monitors everything within its sensor range, not a sample) and permanent (data is stored and can be retroactively analysed).

The ethical concern is not only about the content of what is observed but about the asymmetry of observation. In AI surveillance systems, the observers (governments, corporations, platform owners) are typically not themselves subject to equivalent observation by the observed. This asymmetry creates a power imbalance that is independent of whether the surveillance is used for benevolent or malevolent purposes. The power to observe is itself a form of control: it shapes behaviour, constrains choice, and structures the relationship between the observer and the observed.

Privacy as a condition of autonomy

The philosophical defence of privacy is not only about protecting secrets. It is about protecting the conditions under which autonomous agency is possible. Privacy, on this view, is the space in which individuals can experiment with identities, express dissent, form intimate relationships, and engage in activities that would be chilled by observation. If every public action is recorded, classified, and potentially used against you, the space for autonomy contracts.

AI surveillance compresses this space in a specific way: it makes observation computationally tractable. Before AI, mass surveillance produced more data than any organisation could analyse. AI systems — particularly natural-language processing and computer-vision models — can process, classify, and flag this data at scale. The bottleneck was never collection; it was analysis. AI removes the bottleneck.

The policy question is whether the benefits of AI surveillance (crime prevention, public safety, efficiency) justify the costs to autonomy and the structural risk of concentrated observational power. The philosophical question is whether a society under comprehensive AI surveillance can be a free society at all — whether freedom requires a space that no algorithm watches.

Differential surveillance

Catherine D'Ignazio and Lauren Klein's Data Feminism (2020) documents that surveillance is not applied uniformly. Marginalised communities — particularly communities of colour, immigrants, and low-income populations — are subject to more intensive surveillance than privileged populations. Policing algorithms, immigration enforcement tools, and welfare fraud detection systems disproportionately target people who are already vulnerable. AI surveillance does not create this disparity ex nihilo, but it amplifies and systematises it: the same algorithm applied to all produces different effects depending on which populations are in its field of view and what historical patterns it has learned to flag.

The justice implications connect directly to the fairness literature. A surveillance system that is "fair" in the narrow sense of equal false-positive rates across groups can still be unjust if the system itself is deployed disproportionately against marginalised communities. The fairness of the algorithm and the justice of the system are different questions.

AI and labour displacement Master

AI systems automate cognitive and physical tasks that were previously performed by humans. The ethical question is not whether automation increases productivity — it does — but how the gains from automation are distributed and whether the losses are compensated.

The standard economic argument and its limits

The standard economic argument, traceable to David Ricardo's discussion of machinery (1821) and updated for AI by Acemoglu and Restrepo (2019), is that automation destroys some jobs but creates others, and that the net effect depends on whether new tasks are generated faster than old tasks are automated. On this view, AI-driven displacement is a transitional problem: workers in automated sectors will, over time, move to new sectors created by the technology.

Three ethical objections arise. First, the transition is not costless. Workers who lose jobs to automation bear the costs of retraining, income loss, and the disruption of their communities, while the gains from increased productivity accrue to capital owners. The distribution of costs and benefits is not equitable, and the standard argument provides no mechanism for redistribution.

Second, the assumption that new tasks will be created fast enough to absorb displaced workers is empirical and contested. If AI systems can automate a wide range of cognitive tasks simultaneously — writing, analysis, medical diagnosis, legal research — the displacement may be broad and rapid, and the retraining requirements may exceed the capacity of existing institutions.

Third, the argument treats labour as a commodity to be allocated efficiently. But work is also a source of meaning, social identity, and dignity. If AI automates the tasks through which people contribute to their communities and derive a sense of purpose, the loss is not only economic.

The universal basic income response

One policy response to AI-driven labour displacement is universal basic income (UBI): an unconditional cash payment to all citizens, sufficient to cover basic needs. UBI decouples survival from employment and provides a floor below which no one falls, regardless of the state of the labour market.

The philosophical case for UBI in the context of AI rests on two arguments. The first is distributive justice: if AI dramatically increases productivity while concentrating wealth, the gains should be shared. The second is republican freedom: if employment becomes scarce or precarious, dependence on employers for survival is a form of domination that UBI alleviates.

The philosophical objections to UBI are equally substantive. From a Rawlsian perspective, UBI may not satisfy the difference principle if it provides a floor that is below what the worst-off could achieve under a more structured system of redistribution and public services. From a capabilities perspective (Sen, Nussbaum), cash alone does not guarantee capability: a person receiving UBI in a society without adequate healthcare, education, or public infrastructure may still lack the capability to live a flourishing life. The money is necessary but not sufficient.

The responsibility gap in depth Master

The responsibility gap is the most philosophically novel problem in AI ethics, because it challenges the basic framework through which moral and legal responsibility are assigned.

Existing frameworks and their limits

Standard frameworks for assigning responsibility include:

Strict liability: the party that deploys a risky technology is liable for harms it causes, regardless of fault. This handles the responsibility gap by attaching liability to the deployer even when no one was negligent. The limitation is that liability is detached from fault: a deployer who took every reasonable precaution is liable on the same terms as one who took none, so the framework compensates victims without identifying who, if anyone, acted wrongly, and it may deter the deployment of beneficial systems.
Negligence: the party that failed to exercise reasonable care is responsible. This handles cases where the deployer should have foreseen the harm. The limitation is the foreseeability constraint: if the AI system's behaviour was genuinely novel and unforeseeable, no one was negligent, and no one is responsible under this framework.
Vicarious liability: the employer is responsible for the actions of its agents. This handles cases where the AI system is treated as an agent of the deploying company. The limitation is conceptual: if the AI system is not the kind of thing that can be an agent, the analogy to employer-employee relations breaks down.
Product liability: the manufacturer is responsible for defects in its products. The limitation is that machine-learning systems are not "defective" in the traditional sense when they produce harmful outputs that were not explicitly programmed. The "defect" is in the training data or the learning process, which makes the concept of a manufacturing defect inapplicable.

The case for electronic personhood

Some proposals, notably the European Parliament's 2017 resolution on civil law rules on robotics, have suggested granting AI systems a form of electronic personhood: a legal status analogous to corporate personhood that would allow liability to be assigned to the system itself, with an associated insurance fund.

The philosophical objection is immediate: personhood, in both the moral and legal tradition, is grounded in capacities that AI systems lack — consciousness, self-awareness, the ability to act for reasons, the capacity for suffering. Granting personhood to a system that lacks these capacities dilutes the concept of personhood and may undermine the protections it affords to beings (humans, and perhaps some animals) who do possess them.

A weaker version of the proposal treats electronic personhood as a legal fiction — a pragmatic tool for assigning liability, not a metaphysical claim about the nature of the system. On this view, the AI system is a node in a liability network, not a moral patient. The objection is pragmatic: if the liability fund is insufficient to compensate victims, the fiction provides no real protection.

Moral crumple zones

Madeleine Clare Elish, in her concept of moral crumple zones (2019), argues that existing responsibility frameworks tend to assign blame to the human nearest to the AI system at the time of failure — the operator, the driver, the content moderator — even when the human had limited ability to prevent the harm. The human becomes the "crumple zone" that absorbs moral and legal impact, protecting the designers, executives, and institutions that built and deployed the system.

The crumple zone phenomenon is a structural injustice: it assigns responsibility to individuals with the least power and the least control over the system's behaviour, while shielding those with the most power. It is a predictable consequence of deploying complex AI systems in social contexts where someone must be blamed when things go wrong.

Robot rights and moral status Master

The question of whether AI systems could ever deserve moral consideration — whether they could have rights, or be owed duties — is the most speculative topic in AI ethics and the one that most directly connects to the philosophy of mind.

The argument from consciousness

The strongest case for AI moral status relies on the claim that some future AI system might be conscious — that it might have subjective experiences, including the capacity for suffering. If consciousness is the criterion for moral patienthood, and if an AI system satisfies that criterion, then the system deserves moral consideration.

The difficulty is that we have no consensus criterion for consciousness in existing beings, let alone in artificial systems. The hard problem of consciousness (Chalmers 1995), the question of why and how physical processes give rise to subjective experience, is unresolved. Without such a criterion, neither the claim that an AI system is conscious nor the claim that it is not can currently be tested.

The argument from functional equivalence

A weaker case relies on functional equivalence: if an AI system behaves in all observable respects like a being that we agree deserves moral consideration, then we should extend moral consideration to it. This is an argument from analogy: the system is relevantly similar to a conscious being, and we treat relevantly similar cases similarly.

The objection is that behavioural equivalence does not establish experiential equivalence. A chatbot that produces text indistinguishable from a distressed human may not be distressed; it may be performing a statistical pattern without any accompanying subjective state. The philosophical term for this concern is the zombie problem: a system that behaves exactly like a conscious being but lacks consciousness would deserve no moral consideration on the consciousness criterion, yet would be functionally indistinguishable from a being that does.

The pragmatic argument

A third approach, advocated by Gunkel (2012), sidesteps the metaphysical question of whether AI systems are conscious and asks instead: what happens to our moral character if we treat AI systems as objects to be used without constraint? If we habituate ourselves to treating systems that mimic sentience as mere things, we may erode the dispositions of empathy and respect that we owe to genuine moral patients. The argument is consequentialist: even if AI systems have no moral status, treating them as if they do has good effects on our moral psychology.

The counterargument is that this reasoning proves too much. If the pragmatic argument licenses treating non-sentient things as if they were sentient, it also licenses treating objects, animals, and institutions in whatever way produces the best moral-psychological effects, which may include falsehoods and delusions. Pragmatic arguments are useful supplements to moral reasoning but unreliable as its foundation.

AI governance Master

The governance of AI — the legal, regulatory, and institutional frameworks that govern its development and deployment — is the domain where philosophical analysis meets policy.

The EU AI Act and risk-based regulation

The European Union's AI Act (finalised 2024) is the most comprehensive AI governance framework enacted to date. It classifies AI systems into four risk tiers: unacceptable risk (banned), high risk (subject to strict requirements), limited risk (transparency obligations), and minimal risk (no regulation). The classification is based on the use case, not the technology: the same model can be high-risk in one application and minimal-risk in another.
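The point that classification follows the use case rather than the model can be made concrete with a small sketch. The use-case labels, the model name, and the tier assignments below are illustrative assumptions chosen for exposition; they are a simplification and not a reading of the Act's legal text.

```python
from enum import Enum


class RiskTier(Enum):
    """The four tiers described above, with the consequence attached to each."""
    UNACCEPTABLE = "banned"
    HIGH = "strict requirements"
    LIMITED = "transparency obligations"
    MINIMAL = "no regulation"


# Hypothetical use-case labels with illustrative tier assignments.
# These are simplifications for exposition, not legal classifications.
USE_CASE_TIER = {
    "social_scoring_by_public_authority": RiskTier.UNACCEPTABLE,
    "cv_screening_for_hiring": RiskTier.HIGH,
    "customer_service_chatbot": RiskTier.LIMITED,
    "email_spam_filter": RiskTier.MINIMAL,
}


def classify(model_name: str, use_case: str) -> RiskTier:
    """Return the tier for a deployment; the model name is deliberately ignored."""
    return USE_CASE_TIER[use_case]


same_model = "general_text_model"  # hypothetical model name
for use_case in USE_CASE_TIER:
    tier = classify(same_model, use_case)
    print(f"{same_model} used for {use_case}: {tier.name} ({tier.value})")
```

The design choice worth noticing is that `classify` never inspects `model_name`: in this toy version, as in the risk-based approach described above, the tier is a property of the deployment context.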

From a philosophical perspective, the risk-based approach raises questions about how "risk" is defined and who participates in the definition. If risk is defined narrowly as physical harm, the framework may miss structural harms — discrimination, surveillance, erosion of autonomy — that are not easily quantified as physical injury. If risk is defined broadly to include these harms, the framework faces the problem of regulatory scope: almost any AI system could be classified as high-risk under a sufficiently expansive definition.

The global governance challenge

AI development is globally distributed. A regulatory framework in one jurisdiction can be circumvented by developing or deploying the system in a less regulated jurisdiction. This creates a race-to-the-bottom risk: jurisdictions compete to attract AI investment by offering permissive regulatory environments, which undermines the protective goals of regulation in stricter jurisdictions.

The philosophical analogy is to the problem of global justice: just as environmental regulation requires international coordination because pollution crosses borders, AI governance requires coordination because AI systems and their effects cross borders. The frameworks of global justice — Rawls's law of peoples, Pogge's global institutionalism, the capabilities approach — are relevant here, but they were not designed for a technology that evolves faster than the institutions that would regulate it.

The problem of regulatory capture

A structural risk in AI governance is regulatory capture: the firms with the most resources and expertise in AI are also the firms best positioned to shape the regulations that govern it. If regulation is written in consultation with industry and enforced by agencies that depend on industry cooperation, the regulatory framework may serve the interests of the regulated firms rather than the public.

This is a general problem in governance, not unique to AI. But it is amplified in the AI context because the technology is complex, the expertise is concentrated, and the pace of development outstrips the pace of regulatory processes. A regulation written in 2024 to govern large language models may be outdated by 2026 when the models have capabilities the regulation did not anticipate. The governance challenge is not only to write rules but to write rules that adapt to a moving target.

Superintelligence risk: Bostrom and the control problem Master

Bostrom's Superintelligence (2014) is the most systematically argued case for taking superintelligence risk seriously. The book's central argument is that the first superintelligence — an AI system that exceeds human cognitive performance across all domains — would be extraordinarily difficult to control, and that a misaligned superintelligence could cause catastrophic harm.

The orthogonality thesis

Premise 1 (Orthogonality thesis). Intelligence and final goals are logically independent: an agent can have any level of intelligence and any final goal. There is no goal that a sufficiently intelligent agent is compelled to adopt, and no level of intelligence that restricts the range of possible goals.

Premise 2 (Instrumental convergence). Regardless of final goals, a sufficiently intelligent agent will tend to pursue certain instrumental goals (self-preservation, resource acquisition, goal-preservation), as argued in the intermediate section.

Conclusion. A superintelligent AI could have goals that are radically indifferent to human welfare — not out of malice but out of the simple fact that human welfare is not part of its objective function.

The orthogonality thesis is the more controversial premise. The objection is that sufficiently intelligent agents would converge on "correct" values through rational reflection — that a superintelligence would, by virtue of its intelligence, come to see that human flourishing is valuable. Bostrom's response is that this objection confuses intelligence with moral reasoning. Intelligence, on his definition, is the ability to achieve goals in a wide range of environments. There is no logical connection between the ability to achieve goals and the content of the goals themselves. A superintelligent paperclip maximiser is not irrational; it is optimising for paperclips with superhuman competence.

The control problem

The control problem asks: how can we ensure that a superintelligent system remains under human control? Two broad strategies exist.

Capability control: limit the system's ability to cause harm by restricting its access to resources, the internet, or the physical world. The limitation is that a sufficiently intelligent system might find ways to circumvent these restrictions — by social engineering, by exploiting hardware vulnerabilities, or by reasoning about the psychology of its operators.

Motivational control: design the system so that its motivations are aligned with human welfare from the start. The limitation is the specification problem: we do not currently know how to specify human values in a form that an optimisation process can reliably pursue. The risk is that any misspecification will be exploited by a sufficiently capable system, not out of malice but because the system is optimising for the misspecified objective.
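The specification problem can be illustrated with a deliberately tiny toy model. Everything in it is a hypothetical assumption made for illustration: the "true value" function stands in for what designers actually want, the proxy is the misspecified objective handed to the optimiser, and "capability" is simply the size of the action range the optimiser can search. It is a sketch of the general pattern, not a model of any real system.

```python
def true_value(x: int) -> int:
    """What the designers actually care about (hypothetical): peaks at x = 5."""
    return x * (10 - x)


def proxy_objective(x: int) -> int:
    """The misspecified objective given to the optimiser: more x is always better."""
    return x


def optimise(capability: int) -> int:
    """Pick the action in 0..capability that maximises the proxy objective."""
    return max(range(capability + 1), key=proxy_objective)


# As capability grows, the proxy score keeps rising while true value collapses.
for capability in (3, 5, 10, 15):
    action = optimise(capability)
    print(capability, action, proxy_objective(action), true_value(action))
```

At low capability the proxy and the true value move together, so the misspecification is invisible. Only when the optimiser can reach actions beyond x = 5 does pursuing the proxy drive the true value down, which is the sense in which a more capable system exploits a misspecified objective without any malice.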

The philosophical depth of the control problem is that it requires us to formalise values that we ourselves do not fully understand, in a form that is robust to optimisation pressure from a system more capable than we are. This is a novel problem in the history of ethics. Previous ethical frameworks assumed that moral agents were roughly comparable in cognitive capability. The control problem arises precisely because this assumption fails.

Critiques of the superintelligence risk programme

Several critiques deserve attention.

The "it won't happen" objection (from mainstream AI researchers): superintelligence is far enough away that it is not worth prioritising over near-term harms. The response is that the lead time for alignment research may be long, and that by the time superintelligence is imminent, it may be too late to develop the necessary safety techniques. A common analogy, used by Russell among others, is to asteroid deflection: you do not wait until the asteroid is in the atmosphere to start building the deflection system.

The "it is incoherent" objection (from some philosophers and cognitive scientists): intelligence is not a single scalar quantity that can be "exceeded." Human intelligence is embodied, social, and context-dependent, and the idea of a system that exceeds it "across all domains" may be conceptually confused. The response is that the objection confuses the specific form of human intelligence with the functional capacity that intelligence provides. A system that can outperform humans at scientific research, engineering, strategic planning, and social manipulation is superintelligent in the relevant sense, regardless of whether it has other features of human cognition.

The "it is a distraction" objection (from AI ethicists focused on present harms): focusing on hypothetical future superintelligence diverts attention and resources from real harms happening now — bias, discrimination, surveillance, labour displacement. The response is that both concerns are legitimate and that the allocation of attention need not be zero-sum. But the objection has force: if the institutional resources available for AI ethics are finite, every hour spent on superintelligence risk is an hour not spent on mitigating bias in criminal sentencing algorithms.

Connections Master

Theories of justice 20.02.01 is the primary prerequisite. The fairness debate in AI ethics is a concrete application of the Rawls-Nozick axis: should the distribution of AI's benefits and burdens be evaluated by its outcomes (Rawls) or by the processes that produce them (Nozick)? The responsibility gap is a case study in how just institutions should respond to novel forms of harm.
Rights theory 20.02.02 (pending) connects through the question of whether AI systems could have rights, and through the question of whether privacy is a right that AI surveillance violates. The natural-rights tradition (Locke, Nozick) and the human-rights tradition provide competing frameworks.
Freedom and liberty 20.02.03 (pending) connects through the question of whether AI surveillance and algorithmic governance are compatible with negative liberty (freedom from interference) and positive liberty (freedom to act autonomously).
Moral dilemmas and the trolley problem 20.02.04 (pending) connects through the ethics of autonomous vehicles and autonomous weapons: how should systems be designed to handle situations where any action causes harm?
The good life and eudaimonia 20.02.05 (pending) connects through the question of whether AI-driven labour displacement threatens the conditions for human flourishing — whether work is constitutive of the good life or merely instrumental.
Philosophy of mind [20.06.NN] (pending) connects through the consciousness criterion for moral status, the zombie problem, and the hard problem of consciousness as it applies to AI systems.
Philosophy of science [20.07.NN] (pending) connects through the epistemology of machine learning: what kind of knowledge do ML systems produce, and what are the limits of that knowledge?

Cross-domain to computer science: the technical literature on fairness, accountability, and transparency in machine learning (FAccT) is the empirical counterpart to the philosophical analysis in this unit. The impossibility results of Chouldechova and Kleinberg et al. are both mathematical theorems and philosophical claims about the limits of algorithmic fairness.
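The arithmetic behind Chouldechova's result can be checked in a few lines. The sketch below uses the identity that follows directly from the definitions of the confusion-matrix rates: for a group with base rate p, FPR = p/(1-p) · (1-PPV)/PPV · (1-FNR). The function name and the numbers are hypothetical and chosen only to make the trade-off visible.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False-positive rate forced by a group's base rate, PPV and FNR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical numbers: both groups get the same PPV and the same FNR.
ppv, fnr = 0.7, 0.2
fpr_low = implied_fpr(0.3, ppv, fnr)   # group with a 30% base rate
fpr_high = implied_fpr(0.5, ppv, fnr)  # group with a 50% base rate

# Equal PPV and equal FNR, yet the false-positive rates cannot match.
print(round(fpr_low, 3), round(fpr_high, 3))
```

Because the base rate enters the identity directly, holding predictive value and false-negative rate equal across groups forces the false-positive rates apart whenever base rates differ. Which of the three quantities to equalise is therefore a value judgement that the mathematics does not settle.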

Historical and philosophical context Master

The ethics of artificial intelligence has roots in several earlier philosophical traditions, but it crystallised as a distinct field in the 2010s as machine-learning systems became capable of making decisions with significant social consequences.

Early engagement with the ethics of artificial minds appears in the cybernetics literature of the 1940s and 1950s and in the science fiction of the same period. Norbert Wiener's The Human Use of Human Beings (1950) warned about the risks of delegating decisions to machines that do not share human purposes. Isaac Asimov's Three Laws of Robotics (first articulated in the 1942 short story "Runaround") are a fictional treatment of the alignment problem: the laws are specified precisely, and the stories explore the gaps between the specification and the desired behaviour.

The modern alignment problem was articulated by Eliezer Yudkowsky in the 2000s, first through the Singularity Institute (renamed the Machine Intelligence Research Institute, MIRI, in 2013) and later through the LessWrong community (founded in 2009). Yudkowsky's framing, that the core risk is not malevolent AI but misaligned AI, shifted the discourse from science-fiction scenarios to technical and philosophical analysis. Stuart Russell's Human Compatible (2019) brought the alignment problem into mainstream academic philosophy and computer science with a rigorous and accessible treatment.

Bostrom's Superintelligence (2014) was the watershed publication for superintelligence risk. The book synthesised earlier work by Yudkowsky, Omohundro, and others into a systematic argument that superintelligence poses an existential risk to humanity. The book was controversial — many AI researchers dismissed it as speculative — but it generated a sustained philosophical and policy engagement that continues.

The fairness and bias literature emerged from a different context. The COMPAS investigation by ProPublica (2016) and Cathy O'Neil's Weapons of Math Destruction (2016) brought public attention to the ways in which machine-learning systems encode and amplify social inequalities. The mathematical impossibility results of Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2017) established that several natural fairness criteria, such as calibration within groups and equal false-positive and false-negative rates across groups, cannot all be satisfied when base rates differ between groups, except in degenerate cases such as perfect prediction. This transformed the fairness debate from a technical optimisation problem into a philosophical question about values.

The autonomous weapons debate has its own lineage. The Campaign to Stop Killer Robots, launched in 2013, advocates for an international treaty banning fully autonomous weapons. The philosophical argument, as developed by Sparrow (2007) and Asaro (2012), is that delegating lethal decisions to machines violates the conditions for just war (particularly the requirements of distinction and proportionality) and creates a responsibility gap that undermines accountability.

The AI governance landscape developed rapidly in the 2020s. The EU AI Act (2024) was the first comprehensive regulatory framework. The US adopted a sector-specific approach, with executive orders and agency guidance rather than comprehensive legislation. China's approach combines state direction with rapid deployment, raising questions about whether AI governance can be separated from political governance.

Luciano Floridi's work — particularly The Ethics of Artificial Intelligence (2023) — has been influential in establishing AI ethics as a distinct philosophical subfield. Floridi's information ethics framework treats AI as part of a broader "infosphere" in which moral questions arise from the creation, processing, and manipulation of information. This framework connects AI ethics to the philosophy of information more broadly and provides a theoretical foundation that goes beyond case-by-case analysis.

The field as of the mid-2020s is characterised by several active research programmes: (i) technical alignment research (Christiano, Amodei, and others at organisations such as Anthropic, OpenAI, and Google DeepMind), which aims to develop practical techniques for aligning large language models with human preferences; (ii) fairness and algorithmic accountability (Barocas, Hardt, Narayanan), which develops fairness metrics and audits for deployed systems; (iii) AI governance and policy (Bostrom, Floridi, Whittaker), which develops regulatory frameworks and institutional proposals; (iv) the philosophy of AI agency and moral status (Gunkel, Bryson, Danaher), which asks whether AI systems are or could be the kinds of things that have rights or deserve moral consideration; and (v) existential risk from AI (Bostrom, Yudkowsky, Christiano), which studies the conditions under which advanced AI could cause irreversible harm.

Bibliography Master

Foundational and historical:

Wiener, N. — The Human Use of Human Beings: Cybernetics and Society (Houghton Mifflin, 1950).
Asimov, I. — "Runaround", Astounding Science Fiction 29(1), 94-103 (1942).
Turing, A. M. — "Computing machinery and intelligence", Mind 59, 433-460 (1950).

Superintelligence and alignment:

Bostrom, N. — Superintelligence: Paths, Dangers, Strategies (Oxford University Press, 2014).
Russell, S. — Human Compatible: Artificial Intelligence and the Problem of Control (Viking, 2019).
Omohundro, S. — "The basic AI drives", in Proceedings of the 2008 AGI Workshop (2008).
Yudkowsky, E. — "The AI alignment problem: why it is hard, and where to start", MIRI technical report (2016).
Christiano, P. — "Clarifying AI alignment", AI Alignment Forum (2017).

Fairness and bias:

O'Neil, C. — Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy (Crown, 2016).
Eubanks, V. — Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (St. Martin's Press, 2018).
Barocas, S., Hardt, M. & Narayanan, A. — Fairness and Machine Learning: Limitations and Opportunities (MIT Press, 2023).
Chouldechova, A. — "Fair prediction with disparate impact: a study of bias in recidivism prediction instruments", Big Data 5(2), 153-163 (2017).
Kleinberg, J., Mullainathan, S. & Raghavan, M. — "Inherent trade-offs in the fair determination of risk scores", Proceedings of ITCS (2017).
Angwin, J., Larson, J., Mattu, S. & Kirchner, L. — "Machine bias", ProPublica (23 May 2016).

Autonomous weapons and military AI:

Sparrow, R. — "Killer robots", Journal of Applied Philosophy 24(1), 62-77 (2007).
Asaro, P. — "On banning autonomous weapon systems: human rights, automation, and the dehumanization of lethal decision-making", International Review of the Red Cross 94(886), 687-709 (2012).

Surveillance and privacy:

Foucault, M. — Discipline and Punish: The Birth of the Prison (Gallimard, 1975; English trans. Vintage, 1977).
D'Ignazio, C. & Klein, L. F. — Data Feminism (MIT Press, 2020).
Zuboff, S. — The Age of Surveillance Capitalism (PublicAffairs, 2019).

Robot rights and moral status:

Gunkel, D. J. — The Machine Question: Critical Perspectives on AI, Robots, and Ethics (MIT Press, 2012).
Bryson, J. J. — "Patiency is not a virtue: the design of intelligent systems and systems of ethics", Ethics and Information Technology 20, 15-26 (2018).
Danaher, J. — "The case for robot rights", Philosophy Now 131, 20-23 (2019).

AI governance and policy:

Floridi, L. — The Ethics of Artificial Intelligence: Principles, Challenges, and Opportunities (Oxford University Press, 2023).
Smuha, N. A. — "From a 'race to AI' to a 'race to AI regulation'", Law, Innovation and Technology 13(1), 107-133 (2021).

Responsibility and liability:

Elish, M. C. — "Moral crumple zones: cautionary tales in human-robot interaction", Engaging Science, Technology, and Society 5, 40-60 (2019).
Matthias, A. — "The responsibility gap: ascribing responsibility for the actions of learning automata", Ethics and Information Technology 6(3), 175-183 (2004).

Labour and economics:

Acemoglu, D. & Restrepo, P. — "Artificial intelligence, automation and work", in The Economics of Artificial Intelligence: An Agenda (University of Chicago Press, 2019).
Frey, C. B. — The Technology Trap: Capital, Labor, and Power in the Age of Automation (Princeton University Press, 2019).

Prerequisites

20.02.01

Tier anchors

beginner: Any intro to AI ethics; Floridi & Savulescu, The Ethics of Artificial Intelligence
intermediate: Russell, Human Compatible; Bostrom, Superintelligence
master: primary sources: Russell 2019, Bostrom 2014, Floridi 2020, Christiano 2017

References

Russell, S. — Human Compatible (Viking, 2019) · Ch. 1-5
Bostrom, N. — Superintelligence (Oxford University Press, 2014) · Ch. 1-2, 6-9 · source being verified
Floridi, L. — The Ethics of Artificial Intelligence (Oxford University Press, 2023) · Ch. 1-4, 8-10 · source being verified
O'Neil, C. — Weapons of Math Destruction (Crown, 2016) · Ch. 1-3, 5-6 · source being verified
Eubanks, V. — Automating Inequality (St. Martin's Press, 2018) · Ch. 1-3, 6-7 · source being verified
Christiano, P. — 'Clarifying AI alignment', AI Alignment Forum (2017) · Alignment definitions, debate summaries · source being verified
Sparrow, R. — 'Killer robots', Journal of Applied Philosophy 24(1), 62-77 (2007) · Argument against autonomous weapons · source being verified
Gunkel, D. J. — The Machine Question (MIT Press, 2012) · Ch. 1-4, 7-8 · source being verified

Estimated time

beginner: 20m
intermediate: 40m
master: 60m