Reality is broken, or, an XCOM2 review
Me and Star Wars
Desiderata for a model of human values
Soares (2015) defines the value learning problem as: “By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?”

There have been a few attempts to formalize this question. Dewey (2011) started from the notion of building an AI that maximized a given utility function, and then moved on to suggest that a value learner should exhibit uncertainty over utility functions and then take “the action with the highest expected value, calculated by a weighted average over the agent’s pool of possible utility functions.” This is a reasonable starting point, but a very general one: in particular, it gives us no criteria by which we or the AI could judge the correctness of a utility function which it is considering.

To improve on Dewey’s definition, we would need to get a clearer idea of just what we mean by human values. In this post, I don’t yet want to offer any preliminary definition: rather, I’d like to ask what properties we’d like a definition of human values to have. Once we have a set of such criteria, we can use them as a guideline to evaluate various offered definitions.

By “human values”, I here basically mean the values of any given individual: we are not talking about the values of, say, a whole culture, but rather just one person within that culture. While the problem of aggregating or combining the values of many different individuals is also an important one, we should probably start from the point where we can understand the values of just a single person, and then use that understanding to figure out what to do with conflicting values.

In order to make the purpose of this exercise as clear as possible, let’s start with the most important desideratum, of which all the others are arguably special cases:

1. Useful for AI safety engineering.
Our model needs to be useful for the purpose of building AIs that are aligned with human interests, such as by making it possible for an AI to evaluate whether its model of human values is correct, and by allowing human engineers to evaluate whether a proposed AI design would be likely to further human values.

In the context of AI safety engineering, the main model for human values that gets mentioned is that of utility functions. The one problem with utility functions that everyone always brings up is that humans have been shown not to have consistent utility functions. This suggests two new desiderata:

2. Psychologically realistic. The proposed model should be compatible with what we know about current human values, and not make predictions about human behavior which can be shown to be empirically false.
3. Testable. The proposed model should be specific enough to make clear predictions, which can then be tested.

As additional requirements related to the above ones, we may wish to add:

4. Functional. The proposed model should be able to explain what the functional role of “values” is: how do they affect and drive our behavior? The model should be specific enough to allow us to construct computational simulations of agents with a similar value system, and see whether those agents behave as expected within some simulated environment.
5. Integrated with existing theories. The proposed model should, to as large an extent as possible, fit together with existing knowledge from related fields such as moral psychology, evolutionary psychology, neuroscience, sociology, artificial intelligence, behavioral economics, and so on.

However, I would argue that as a model of human value, utility functions also have other clear flaws. They do not clearly satisfy these desiderata:

6. Suited for modeling internal conflicts and higher-order desires. A drug addict may desire a drug, while also desiring that he not desire it.
More generally, people may be genuinely conflicted between different values, endorsing contradictory sets of them given different situations or thought experiments, and they may struggle to behave in a way in which they would like to behave. The proposed model should be capable of modeling these conflicts, as well as the way that people resolve them.
7. Suited for modeling changing and evolving values. A utility function is implicitly static: once it has been defined, it does not change. In contrast, human values are constantly evolving. The proposed model should be able to incorporate this, as well as to predict how our values would change given some specific outcomes. Among other benefits, an AI whose model of human values had this property might be able to predict things that our future selves would regret doing (even if our current values approved of those things), and warn us about this possibility in advance.
8. Suited for generalizing from our existing values to new ones. Technological and social change often causes new dilemmas, for which our existing values may not provide a clear answer. As a historical example (Lessig 2004), American law traditionally held that a landowner controlled not only his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of the airplane, this raised the question – could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? In answer to this question, the concept of landownership was redefined to only extend a limited, and not an indefinite, amount upwards. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel.
Our model of value should be capable of figuring out such compromises, rather than treating values such as landownership as black boxes, with no understanding of why people value them.

As an example of using the current criteria, let’s try applying them to the only paper that I know of that has tried to propose a model of human values in an AI safety engineering context: Sezener (2015). This paper takes an inverse reinforcement learning approach, modeling a human as an agent that interacts with its environment in order to maximize a sum of rewards. It then proposes a value learning design where the value learner is an agent that uses Solomonoff’s universal prior in order to find the program generating the rewards, based on the human’s actions. Basically, a human’s values are equivalent to a human’s reward function.

Let’s see to what extent this proposal meets our criteria.

Useful for AI safety engineering. To the extent that the proposed model is correct, it would clearly be useful. Sezener provides an equation that could be used to obtain the probability of any given program being the true reward generating program. This could then be plugged directly into a value learning agent similar to the ones outlined in Dewey (2011), to estimate the probability of its models of human values being true. That said, the equation is incomputable, but it could be possible to construct computable approximations.

Psychologically realistic. Sezener assumes the existence of a single, distinct reward process, and suggests that this is a “reasonable assumption from a neuroscientific point of view because all reward signals are generated by brain areas such as the striatum”. On the face of it, this seems like an oversimplification, particularly given evidence suggesting the existence of multiple valuation systems in the brain.
On the other hand, since the reward process is allowed to be arbitrarily complex, it could be taken to represent just the final output of the combination of those valuation systems.

Testable. The proposed model currently seems to be too general to be accurately tested. It would need to be made more specific.

Functional. This is arguable, but I would claim that the model does not provide much of a functional account of values: they are hidden within the reward function, which is basically treated as a black box that takes in observations and outputs rewards. While a value learner implementing this model could develop various models of that reward function, and those models could include internal machinery that explained why the reward function output various rewards at different times, the model itself does not make any assumptions about this.

Integrated with existing theories. Various existing theories could in principle be used to flesh out the internals of the reward function, but currently no such integration is present.

Suited for modeling internal conflicts and higher-order desires. No specific mention of this is made in the paper. The assumption of a single reward function that assigns a single reward for every possible observation seems to implicitly exclude the notion of internal conflicts, with the agent always just maximizing a total sum of rewards and being internally united in that goal.

Suited for modeling changing and evolving values. As written, the model seems to consider the reward function as essentially unchanging: “our problem reduces to finding the most probable $p_R$ given the entire action-observation history $a_1 o_1 a_2 o_2 \ldots a_n o_n$.”

Suited for generalizing from our existing values to new ones. There does not seem to be any obvious possibility for this in the model.
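
To make the setup being evaluated a little more concrete, here is a toy Python sketch of the general shape of such a value learner. It is not Sezener’s equation or Dewey’s agent. As loudly labeled assumptions: the incomputable Solomonoff prior is replaced by a hand-assigned weight of 2^-bits over three made-up reward functions, and the human is assumed to choose each action with probability proportional to exp(reward), which neither paper specifies. The names ACTIONS, CANDIDATES, posterior and best_action are hypothetical and exist only for this illustration.

```python
import math

# Hypothetical action set (an assumption for this sketch, not from the papers).
ACTIONS = ["tea", "coffee", "water"]

# Hypothetical pool of candidate reward functions.
# Each entry: name -> (description length in bits, reward per action).
# The "bits" are hand-assigned stand-ins for program length.
CANDIDATES = {
    "likes_caffeine": (2, {"tea": 1.0, "coffee": 2.0, "water": 0.0}),
    "likes_tea": (2, {"tea": 2.0, "coffee": 0.0, "water": 0.5}),
    "indifferent": (1, {"tea": 1.0, "coffee": 1.0, "water": 1.0}),
}


def prior(bits):
    """Simplicity prior: shorter descriptions get more weight (2^-bits)."""
    return 2.0 ** -bits


def action_likelihood(reward, action, beta=1.0):
    """Assumed human model: P(action) is proportional to exp(beta * reward)."""
    normalizer = sum(math.exp(beta * reward[a]) for a in ACTIONS)
    return math.exp(beta * reward[action]) / normalizer


def posterior(observed_actions):
    """Bayesian posterior over candidate reward functions given observed actions."""
    weights = {}
    for name, (bits, reward) in CANDIDATES.items():
        weight = prior(bits)
        for action in observed_actions:
            weight *= action_likelihood(reward, action)
        weights[name] = weight
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def expected_value(action, post):
    """Weighted average of an action's reward over the pool of candidates."""
    return sum(p * CANDIDATES[name][1][action] for name, p in post.items())


def best_action(post):
    """Pick the action with the highest expected value under the posterior."""
    return max(ACTIONS, key=lambda a: expected_value(a, post))


history = ["coffee", "coffee", "tea", "coffee"]
post = posterior(history)
choice = best_action(post)
print(post)
print(choice)
```

Note that the sketch inherits the limitations discussed above: every candidate is a single, static reward function treated as a black box, so nothing in it can represent an internal conflict, a change in values over time, or a reason why an action is valued.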
I should note that despite its shortcomings, Sezener’s model seems like a nice step forward: as I said, it’s the only proposal that I know of so far that has even tried to answer this question. I hope that my criteria will be useful in spurring the development of the model further.

As it happens, I have a preliminary suggestion for a model of human values which I believe has the potential to fulfill all of the criteria that I have outlined. However, I am far from certain that I have managed to find all the necessary criteria. Thus, I would welcome feedback, particularly including proposed changes or additions to these criteria.

Originally published at Kaj Sotala. You can comment here or there.
Learning from painful experiences
Maverick Nannies and Danger Theses
Changing language to change thoughts
Rational approaches to emotions
Two conversationalist tips for introverts
Two of the biggest mistakes that I used to make that made me a poor conversationalist:

1. Thinking too much about what I was going to say next. If another person is speaking, don’t think about anything else, where “anything else” includes your next words. Instead, just focus on what they’re saying, and the next thing to say will come to mind naturally. If it doesn’t, a brief silence before you say something is not the end of the world. Let your mind wander until it comes up with something.
2. Asking myself questions like “is X interesting / relevant / intelligent-sounding enough to say here”, and trying to figure out whether the thing on my mind was relevant to the purpose of the conversation. Some conversations have an explicit purpose, but most don’t. They’re just the participants saying whatever random thing comes to their mind as a result of what the other person last said. Obviously you’ll want to put a bit of effort into screening out any potentially offensive or inappropriate comments, but for the most part you’re better off just saying whatever random thing comes to your mind.

Relatedly, I suspect that these kinds of tendencies are what make introverts experience social fatigue. Social fatigue seems [in some people’s anecdotal experience; don’t have any studies to back me up here] to be associated with mental inhibition: the more you have to spend mental resources on holding yourself back, the more exhausted you will be afterwards. My experience suggests that if you can reduce the number of filters on what you say, then this reduces mental inhibition, and correspondingly reduces the extent to which socializing causes you fatigue. Peter McCluskey reports a similar experience; other people mention varying degrees of agreement or disagreement.
Change blindness
DeepDream: Today psychedelic images, tomorrow unemployed artists
Learning to recognize judgmental labels
In the spirit of Non-Violent Communication, I’ve today tried to pay more attention to my thoughts and notice any judgments or labels that I apply to other people that are actually disguised indications of my own needs.

The first one that I noticed was this: within a few weeks I’ll be a visiting instructor at a science camp, teaching things to a bunch of teens and preteens. I was thinking of how I’d start my lessons, pondered how to grab their attention, and then noticed myself having the thought, “these are smart kids, I’m sure they’ll give me a chance rather than be totally unruly from the start”.

Two judgments right there: “smart” and “unruly”. Stopped for a moment’s reflection. I’m going to the camp because I want the kids to learn things that I feel will be useful for them, yes, but at the same time I also have a need to feel respected and appreciated. And I feel uncertain of my ability to get that respect from someone who isn’t already inclined to view me in a favorable light. So in order to protect myself, I’m labelling kids as “smart” if they’re willing to give me a chance, implying that if I can’t get through to some particular one, then it was really their fault rather than mine. Even though they might be uninterested in what I have to say for reasons that have nothing to do with smarts, like me just making a boring presentation.

Ouch. Okay, let me reword that original thought in non-judgmental terms: “these are kids who are voluntarily coming to a science camp and who I’ve been told are interested in learning, I’m sure they’ll be willing to listen at least to a bit of what I have to say”.

There. Better.
Adult children make mistakes, too
There’s a lot of blame and guilt in many people’s lives. We often think of people in terms of good or bad, and feel unworthy or miserable if we fail at things we think we should be able to do. When we don’t do quite as well as we could, because we’re tired or unwell or distracted, we blame and belittle ourselves.

Let’s take a different approach. Think of a young child, maybe three years old. He has come a long way from a newborn, but he’s still not that far along. If he tries his hand at making a drawing, and it’s not quite up to adult standards, we don’t think of him as being any worse for that. Or if he doesn’t quite want to share his toys or gets frustrated with his sibling, we understand that it’s because he’s still young, and hasn’t yet learned all the people skills. We don’t judge him for that, but just gently teach him what we’d like him to do instead. It’s not that he’s good or bad, it’s just that he lacks the skills and practice. At the same time, we see the vast potential in him, how far he has already come and how he’s learning new things every day.

Now, look at yourself from the perspective of some immensely wise, benevolent being. If you’re religious, that being could be God. If you have a transhumanist bent, maybe a superintelligent AI with understanding beyond human comprehension. Or you could imagine a vastly older version of you, one that had lived for thousands of years and seen and done things you couldn’t even imagine.

From the perspective of such a being, aren’t you – and all those around you – the equivalent of that three-year-old? Someone who’s inevitably going to make mistakes and be imperfect, because the world is such a complicated place and nobody could have mastered it all? But who’s nevertheless come a long way from what they once were, and is only going to continue growing?

Nate Soares has said that he feels more empathy towards people when he thinks of them as “monkeys who struggle to convince themselves that they’re comfortable in a strange civilization, so different from the ancestral savanna where their minds were forged”. Similarly, we could think of ourselves as young children outside their homes, in a world that’s much too complicated and vast for us to ever understand more than a small fraction of it, still making a valiant effort to do our best despite often being tired or afraid.

Let’s take this attitude, not just towards others, but towards ourselves as well. We’re doing our best to learn to do the right things in a big, difficult world. If we don’t always succeed, there’s no blame: just a knowledge that we can learn to do better, if we make the effort.
Harry Potter and the Methods of Latent Dirichlet Allocation
Teaching economics & ethics with Kitty Powers' Matchmaker
Things that I'm currently the most interested in (Jan 21st of 2015 edition)
