A view to the gallery of my mind
Tuesday, August 16th, 2016
2:21 pm - An appreciation of the Less Wrong Sequences
Saturday, June 11th, 2016
6:11 pm - Error in Armstrong and Sotala 2012
Katja Grace has analyzed my and Stuart Armstrong’s 2012 paper “How We’re Predicting AI – or Failing To”. She discovered that one of the conclusions, “predictions made by AI experts were indistinguishable from those of non-experts”, is flawed due to “a spreadsheet construction and interpretation error”. In other words, I coded the data in one way, there was a communication error and a misunderstanding about what the data meant, and as a result, a flawed conclusion slipped into the paper.

I’m naturally embarrassed that this happened. But the reason why Katja spotted this error was that we’d made our data freely available, allowing her to spot the discrepancy. This is why data sharing is something that science needs more of. Mistakes happen to everyone, and transparency is the only way to have a chance of spotting those mistakes. I regret the fact that we screwed up this bit, but I am proud of the fact that we did share our data and allowed someone to catch it.

EDITED TO ADD: Some people have taken this mistake to suggest that the overall conclusion, that AI experts are not good predictors of AI timelines, is flawed. That would overstate the significance of this mistake. While one of the lines of evidence supporting this overall conclusion was flawed, several others are unaffected by this error: namely, the fact that expert predictions disagree widely with each other, that many past predictions have turned out to be false, and that the psychological literature on what’s required for the development of expertise suggests that it should be very hard to develop expertise in this domain (see the original paper for details).

(I’ve added a note of this mistake to my list of papers.)

Originally published at Kaj Sotala. You can comment here or there.
Saturday, May 14th, 2016
11:39 am - Smile, You Are On Tumblr.Com
I made a new tumblr blog. It has photos of smiling people! With more to come! Why? Previously I happened to need pictures of smiles for a personal project. After going through an archive of photos for a while, I realized that looking at all the happy people made me feel really happy and good. So I thought that I might make a habit out of looking at photos of smiling people, and sharing them. Follow for a regular extra dose of happiness!
Wednesday, April 27th, 2016
9:52 am - Decisive Strategic Advantage without a Hard Takeoff (part 1)
Friday, April 22nd, 2016
6:07 am - Simplifying the environment: a new convergent instrumental goal
Friday, April 15th, 2016
11:48 am - AI risk model: single or multiple AIs?
Tuesday, April 5th, 2016
10:59 am - Disjunctive AI risk scenarios: AIs gaining the power to act autonomously
Monday, April 4th, 2016
12:59 pm - Disjunctive AI risk scenarios: AIs gaining a decisive advantage
Monday, February 8th, 2016
11:03 am - Reality is broken, or, an XCOM2 review
Wednesday, December 16th, 2015
10:10 am - Me and Star Wars
Saturday, November 28th, 2015
6:26 pm - Desiderata for a model of human values
Soares (2015) defines the value learning problem as: “By what methods could an intelligent machine be constructed to reliably learn what to value and to act as its operators intended?”

There have been a few attempts to formalize this question. Dewey (2011) started from the notion of building an AI that maximized a given utility function, and then moved on to suggest that a value learner should exhibit uncertainty over utility functions and then take “the action with the highest expected value, calculated by a weighted average over the agent’s pool of possible utility functions.” This is a reasonable starting point, but a very general one: in particular, it gives us no criteria by which we or the AI could judge the correctness of a utility function which it is considering.

To improve on Dewey’s definition, we would need to get a clearer idea of just what we mean by human values. In this post, I don’t yet want to offer any preliminary definition: rather, I’d like to ask what properties we’d like a definition of human values to have. Once we have a set of such criteria, we can use them as a guideline to evaluate various offered definitions.

By “human values”, I here basically mean the values of any given individual: we are not talking about the values of, say, a whole culture, but rather just one person within that culture. While the problem of aggregating or combining the values of many different individuals is also an important one, we should probably start from the point where we can understand the values of just a single person, and then use that understanding to figure out what to do with conflicting values.

In order to make the purpose of this exercise as clear as possible, let’s start with the most important desideratum, of which all the others are arguably special cases:

1. Useful for AI safety engineering.
Our model needs to be useful for the purpose of building AIs that are aligned with human interests, such as by making it possible for an AI to evaluate whether its model of human values is correct, and by allowing human engineers to evaluate whether a proposed AI design would be likely to further human values.

In the context of AI safety engineering, the main model for human values that gets mentioned is that of utility functions. The one problem with utility functions that everyone always brings up is that humans have been shown not to have consistent utility functions. This suggests two new desiderata:

2. Psychologically realistic. The proposed model should be compatible with what we know about current human values, and not make predictions about human behavior which can be shown to be empirically false.

3. Testable. The proposed model should be specific enough to make clear predictions, which can then be tested.

As additional requirements related to the above ones, we may wish to add:

4. Functional. The proposed model should be able to explain what the functional role of “values” is: how do they affect and drive our behavior? The model should be specific enough to allow us to construct computational simulations of agents with a similar value system, and see whether those agents behave as expected within some simulated environment.

5. Integrated with existing theories. The proposed model should, to as large an extent as possible, fit together with existing knowledge from related fields such as moral psychology, evolutionary psychology, neuroscience, sociology, artificial intelligence, behavioral economics, and so on.

However, I would argue that as a model of human value, utility functions also have other clear flaws. They do not clearly satisfy these desiderata:

6. Suited for modeling internal conflicts and higher-order desires. A drug addict may desire a drug, while also desiring that he not desire it.
More generally, people may be genuinely conflicted between different values, endorsing contradictory sets of them given different situations or thought experiments, and they may struggle to behave in a way in which they would like to behave. The proposed model should be capable of modeling these conflicts, as well as the way that people resolve them.

7. Suited for modeling changing and evolving values. A utility function is implicitly static: once it has been defined, it does not change. In contrast, human values are constantly evolving. The proposed model should be able to incorporate this, as well as to predict how our values would change given some specific outcomes. Among other benefits, an AI whose model of human values had this property might be able to predict things that our future selves would regret doing (even if our current values approved of those things), and warn us about this possibility in advance.

8. Suited for generalizing from our existing values to new ones. Technological and social change often causes new dilemmas, for which our existing values may not provide a clear answer. As a historical example (Lessig 2004), American law traditionally held that a landowner controlled not only his land but also everything above it, to “an indefinite extent, upwards”. Upon the invention of the airplane, this raised a question: could landowners forbid airplanes from flying over their land, or was the ownership of the land limited to some specific height, above which the landowners had no control? In answer to this question, the concept of landownership was redefined to only extend a limited, and not an indefinite, amount upwards. Intuitively, one might think that this decision was made because the redefined concept did not substantially weaken the position of landowners, while allowing for entirely new possibilities for travel.
Our model of value should be capable of figuring out such compromises, rather than treating values such as landownership as black boxes, with no understanding of why people value them.

As an example of using the current criteria, let’s try applying them to the only paper that I know of that has tried to propose a model of human values in an AI safety engineering context: Sezener (2015). This paper takes an inverse reinforcement learning approach, modeling a human as an agent that interacts with its environment in order to maximize a sum of rewards. It then proposes a value learning design where the value learner is an agent that uses Solomonoff’s universal prior in order to find the program generating the rewards, based on the human’s actions. Basically, a human’s values are equivalent to a human’s reward function.

Let’s see to what extent this proposal meets our criteria.

Useful for AI safety engineering. To the extent that the proposed model is correct, it would clearly be useful. Sezener provides an equation that could be used to obtain the probability of any given program being the true reward generating program. This could then be plugged directly into a value learning agent similar to the ones outlined in Dewey (2011), to estimate the probability of its models of human values being true. That said, the equation is incomputable, but it could be possible to construct computable approximations.

Psychologically realistic. Sezener assumes the existence of a single, distinct reward process, and suggests that this is a “reasonable assumption from a neuroscientific point of view because all reward signals are generated by brain areas such as the striatum”. On the face of it, this seems like an oversimplification, particularly given evidence suggesting the existence of multiple valuation systems in the brain.
On the other hand, since the reward process is allowed to be arbitrarily complex, it could be taken to represent just the final output of the combination of those valuation systems.

Testable. The proposed model currently seems to be too general to be accurately tested. It would need to be made more specific.

Functional. This is arguable, but I would claim that the model does not provide much of a functional account of values: they are hidden within the reward function, which is basically treated as a black box that takes in observations and outputs rewards. While a value learner implementing this model could develop various models of that reward function, and those models could include internal machinery that explained why the reward function output various rewards at different times, the model itself does not make any assumptions about this.

Integrated with existing theories. Various existing theories could in principle be used to flesh out the internals of the reward function, but currently no such integration is present.

Suited for modeling internal conflicts and higher-order desires. No specific mention of this is made in the paper. The assumption of a single reward function that assigns a single reward for every possible observation seems to implicitly exclude the notion of internal conflicts, with the agent always just maximizing a total sum of rewards and being internally united in that goal.

Suited for modeling changing and evolving values. As written, the model seems to consider the reward function as essentially unchanging: “our problem reduces to finding the most probable $p_R$ given the entire action-observation history $a_1 o_1 a_2 o_2 \ldots a_n o_n$.”

Suited for generalizing from our existing values to new ones. There does not seem to be any obvious possibility for this in the model.
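To make the combination of the two proposals more concrete, here is a minimal toy sketch of a value learner that weights a small pool of candidate utility functions and then takes the action with the highest weighted-average utility, in the spirit of the Dewey (2011) quote above. It is not Sezener’s equation or Dewey’s formalism: the finite candidate pool, the 2^-length prior (a crude stand-in for Solomonoff’s universal prior), the noisy-choice likelihood, and every name in the code (Candidate, posterior, best_action, the food examples) are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One hypothesis about the human's utility (or reward) function."""
    name: str
    length_bits: int                 # toy stand-in for program length
    utility: Callable[[str], float]  # maps an outcome to a utility


def posterior(candidates: Sequence[Candidate],
              observed_choices: Sequence[Tuple[Sequence[str], str]],
              noise: float = 0.1) -> Dict[str, float]:
    """Weight each candidate by a 2**-length prior times a simple likelihood.

    Each observation is (options, chosen). A choice is "explained" by a
    candidate if the chosen option has maximal utility among the options;
    unexplained choices get probability `noise` (an assumed noise model).
    """
    weights: Dict[str, float] = {}
    for c in candidates:
        w = 2.0 ** -c.length_bits
        for options, chosen in observed_choices:
            top = max(c.utility(o) for o in options)
            w *= (1.0 - noise) if c.utility(chosen) == top else noise
        weights[c.name] = w
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def best_action(candidates: Sequence[Candidate],
                weights: Dict[str, float],
                outcomes: Dict[str, str]) -> str:
    """Pick the action whose outcome has the highest weighted-average utility."""
    def expected_utility(action: str) -> float:
        return sum(weights[c.name] * c.utility(outcomes[action])
                   for c in candidates)
    return max(outcomes, key=expected_utility)


# Hypothetical example data: two rival hypotheses about what a person values.
SWEETS = {"cake": 1.0, "salad": 0.0, "fruit": 0.6}
HEALTH = {"cake": 0.0, "salad": 1.0, "fruit": 0.8}
pool = [
    Candidate("likes_sweets", 3, SWEETS.__getitem__),
    Candidate("likes_health", 3, HEALTH.__getitem__),
]
# The person was twice offered cake and picked the other option.
observed = [(("cake", "salad"), "salad"), (("cake", "fruit"), "fruit")]

weights = posterior(pool, observed)
action = best_action(pool, weights,
                     {"bake": "cake", "cook": "salad", "shop": "fruit"})
print(weights, action)
```

In this example the “likes_health” hypothesis ends up with most of the weight, and the agent picks “cook”. Note that the sketch shares the limitations discussed above: each candidate is a static black box, so it says nothing about internal conflicts, changing values, or generalizing to new situations.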
I should note that despite its shortcomings, Sezener’s model seems like a nice step forward: like I said, it’s the only proposal that I know of so far that has even tried to answer this question. I hope that my criteria will be useful in spurring the development of the model further.

As it happens, I have a preliminary suggestion for a model of human values which I believe has the potential to fulfill all of the criteria that I have outlined. However, I am far from certain that I have managed to find all the necessary criteria. Thus, I would welcome feedback, particularly including proposed changes or additions to these criteria.
Thursday, November 12th, 2015
10:42 am - Learning from painful experiences
Saturday, October 31st, 2015
4:52 pm - Maverick Nannies and Danger Theses
Sunday, October 18th, 2015
1:01 pm - Changing language to change thoughts
Friday, October 9th, 2015
5:36 pm - Rational approaches to emotions
Friday, October 2nd, 2015
9:03 am - Two conversationalist tips for introverts
Two of the biggest mistakes that I used to make that made me a poor conversationalist:

1. Thinking too much about what I was going to say next. If another person is speaking, don’t think about anything else, where “anything else” includes your next words. Instead, just focus on what they’re saying, and the next thing to say will come to mind naturally. If it doesn’t, a brief silence before you say something is not the end of the world. Let your mind wander until it comes up with something.

2. Asking myself questions like “is X interesting / relevant / intelligent-sounding enough to say here”, and trying to figure out whether the thing on my mind was relevant to the purpose of the conversation. Some conversations have an explicit purpose, but most don’t. They’re just the participants saying whatever random thing comes to their mind as a result of what the other person last said. Obviously you’ll want to put a bit of effort into screening out any potentially offensive or inappropriate comments, but for the most part you’re better off just saying whatever random thing comes to your mind.

Relatedly, I suspect that these kinds of tendencies are what make introverts experience social fatigue. Social fatigue seems [in some people’s anecdotal experience; don’t have any studies to back me up here] to be associated with mental inhibition: the more you have to spend mental resources on holding yourself back, the more exhausted you will be afterwards. My experience suggests that if you can reduce the number of filters on what you say, then this reduces mental inhibition, and correspondingly reduces the extent to which socializing causes you fatigue.

Peter McCluskey reports a similar experience; other people mention varying degrees of agreement or disagreement.
Tuesday, August 18th, 2015
2:40 pm - Change blindness