Geoffrey Irving and Amanda Askell
Originally published February 19, 2019
Here is an excerpt:
Learning values by asking humans questions
We start with the premise that human values are too complex to describe with simple rules. By “human values” we mean our full set of detailed preferences, not general goals such as “happiness” or “loyalty”. One source of complexity is that values are entangled with a large number of facts about the world, and we cannot cleanly separate facts from values when building ML models. For example, a rule that refers to “gender” would require an ML model that accurately recognizes this concept, but Buolamwini and Gebru found that several commercial gender classifiers with a 1% error rate on white men failed to recognize black women up to 34% of the time. Even where people have correct intuition about values, we may be unable to specify precise rules behind these intuitions. Finally, our values may vary across cultures, legal systems, or situations: no learned model of human values will be universally applicable.
If humans can’t reliably report the reasoning behind their intuitions about values, perhaps we can make value judgements in specific cases. To realize this approach in an ML context, we ask humans a large number of questions about whether an action or outcome is better or worse, then train on this data. “Better or worse” will include both factual and value-laden components: for an AI system trained to say things, “better” statements might include “rain falls from clouds”, “rain is good for plants”, “many people dislike rain”, etc. If the training works, the resulting ML system will be able to replicate human judgement about particular situations, and thus have the same “fuzzy access to approximate rules” about values as humans. We also train the ML system to come up with proposed actions, so that it knows both how to perform a task and how to judge its performance. This approach works at least in simple cases, such as Atari games, simple robotics tasks, and language-specified goals in gridworlds. The questions we ask change as the system learns to perform different types of actions, which is necessary as the model of what is better or worse will only be accurate if we have applicable data to generalize from.
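To make the idea of training on "better or worse" answers concrete, here is a minimal sketch. It is not the method from the excerpted article: it assumes each outcome can be summarized as a small feature vector, that the human's judgement is simulated by a hidden linear scoring rule (`hidden_w`), and that the learner fits a Bradley-Terry-style logistic model to the comparisons. All names (`fit_reward_model`, `simulated_human`, `random_outcome`) are hypothetical and chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward(w, x):
    # Linear score for an outcome; a stand-in for a learned judgement model.
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Fit weights from (better, worse) pairs by gradient ascent on the
    log-likelihood of P(better preferred) = sigmoid(r(better) - r(worse))."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for better, worse in comparisons:
            p = sigmoid(reward(w, better) - reward(w, worse))
            for i in range(dim):
                grad[i] += (1.0 - p) * (better[i] - worse[i])
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w


def simulated_human(a, b, hidden_w):
    # Assumption: the "human" answers according to a hidden linear rule.
    # Returns the pair ordered as (better, worse).
    return (a, b) if reward(hidden_w, a) >= reward(hidden_w, b) else (b, a)


rng = random.Random(0)
hidden_w = [2.0, -1.0, 0.5]  # hypothetical preferences, unknown to the learner


def random_outcome():
    return [rng.uniform(-1.0, 1.0) for _ in hidden_w]


# Ask the simulated human 300 "which is better?" questions and train on them.
train = [simulated_human(random_outcome(), random_outcome(), hidden_w)
         for _ in range(300)]
w = fit_reward_model(train, dim=len(hidden_w))

# Check how often the learned model agrees with the human on new questions.
held_out = [simulated_human(random_outcome(), random_outcome(), hidden_w)
            for _ in range(200)]
agreement = sum(reward(w, better) > reward(w, worse)
                for better, worse in held_out) / len(held_out)
print(f"agreement on held-out comparisons: {agreement:.2f}")
```

The learner never sees `hidden_w`, only the answers to comparison questions, which mirrors the point that judgements in specific cases can be collected even when the underlying rules cannot be stated. Real human answers would be noisy and the outcomes far richer than three numbers, so this toy setup only illustrates the training signal, not the difficulty of the problem.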
The info is here.