Working with Random Variables
- 1 Introduction
- 2 Defining and Modifying Distributions
- 3 Working with Random Variables
- 4 Statistic Variables: Functions of Random Variables
- 5 Surrogate Modeling with Random Variables
Random variables are useful for modeling uncertainty in a system. Unlike regular variables, which are defined by (deterministic) values, random variables are defined by distributions. Rave supports the use of random variables through various sampling-based (Monte Carlo) approaches. The main ways to use random variables in Rave are:
- Declare a variable to be random, in which case it is treated as being random for all purposes in Rave.
- Sample data according to random distributions. This lets you generate a data set that involves randomness, but Rave otherwise treats this data as if it were a deterministic sampling.
In either case, Rave samples the random variables according to distributions that you define in Rave. The process of defining distributions is described below.
Note: only independent variables can be treated as random. Function variables become "statistic variables" when one or more of their input variables are random. See below for details.
Defining and Modifying Distributions
Each independent variable in your data set has a list of associated distributions. (This list exists even if you are not treating these variables as random variables.) You can add, remove, and modify the distributions in this list by either:
- Clicking the "Edit Variable Distributions" button on the Sample tab.
- Selecting the "Variables" category in the Manage data GUI, then clicking the "Edit Variable Distributions" button.
(Both buttons have exactly the same effect, so it doesn't matter which one you click.)
Although you can define as many distributions as you want, at any given time only one distribution will be "active" for each variable. We let you define multiple distributions so that you can easily conduct trade studies on the parameterization of your random variables.
To treat an independent variable as random:
- Right-click its column header in the main data table and select "Treat as a Random Variable".
The selected independent variable will then become a random variable, distributed according to its active distribution. You can change the active distribution by clicking one of the "Edit Variable Distributions" buttons described above.
Important: Because sampling from random variables is potentially time-consuming, Rave does not automatically sample from the variables and reevaluate any functional (statistic) variables when you change which variables are random or which distributions are active. You will notice that as soon as you change a variable to be random, the columns for functions of that variable turn red in the main table. This indicates that the data is invalid (because it still reflects the values calculated for the non-random values of the independent variables). After you have set all your random variables to their desired distributions, click the "Reevaluate Functions" button on the Manage data GUI to recalculate the values of your function variables using the new (random) values of the input variables. The corresponding columns of your data table will then turn black again, indicating that the values are now valid.
Working with Random Variables
Method 1: Declaring a Variable to be Random
To indicate that a particular variable is random, its name in the main table column header turns purple, and its values in the table are replaced with the word "Random".
The reasoning behind this is that a random variable has no single value for each row in your data set; rather, it has a distribution of values. Any variables that are functions of one or more random variables are also defined by distributions. In order to represent these function variables as still having a single value, Rave automatically converts any variables that are functions of random variables into statistic variables. You will notice that the column headers of any such functions are changed to reflect this. (See below for more information on statistic variables.)
For example, suppose you have three variables, x, y, and z, such that z=f(x,y). If you change y into a random variable, you will see that the column header for z changes to something like "mean(z)" to indicate that the values displayed in the table are no longer the values of z itself; rather, they are the mean value of z calculated over the distribution of y.
To calculate the statistics, Rave uses a sampling-based approach. This works as follows: suppose again that we have z=f(x,y), where y is random and x is deterministic. If Rave needs to calculate z for x=5 and y distributed as N(0,1) (i.e. y is normally distributed with mean 0 and standard deviation 1), Rave samples many random values of y such that these sampled values are approximately distributed as N(0,1). Supposing 5000 such samples were used, Rave then evaluates z 5000 times, each time using x=5 and one of the 5000 sampled values of y in turn. This yields 5000 values of z, which Rave then aggregates back into a single value using the specified statistic; for example, mean(z) would return the average of these 5000 values.
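The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Rave's actual implementation: the function f, the helper name statistic_of_z, and the fixed seed are all assumptions made for the example.

```python
import random
import statistics

def f(x, y):
    # Hypothetical deterministic function standing in for a function loaded into Rave.
    return x + y ** 2

def statistic_of_z(x, n_samples=5000, seed=0):
    """Estimate mean(z) for z = f(x, y), with x fixed and y ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed only so the example is repeatable
    # Step 1: sample y from its distribution.
    y_samples = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Step 2: evaluate z once per sample, always with the same deterministic x.
    z_samples = [f(x, y) for y in y_samples]
    # Step 3: aggregate the sampled z values into a single number (here, the mean).
    return statistics.fmean(z_samples)

mean_z = statistic_of_z(5.0)
print(mean_z)
```

For this particular f, the exact mean of z is 6 (because the mean of y squared is 1 when y is N(0,1)), so the printed estimate should land close to 6 and will vary slightly with the seed and the number of samples.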
There are four parameters that you can change to customize Rave's approach to calculating functions of random variables:
- The distribution of each random variable
- The number of samples used to calculate the statistic (5000 in the example above)
- The method by which the random values are sampled from the distributions
- The statistic used to aggregate the values of z back into a single value.
Very important: In the example above, it took 5000 evaluations of the function f to calculate the mean of z for the single case x=5. In reality, most activities in Rave involve many such cases. For example, creating a contour plot that uses a 10x10 grid of points would require 100 deterministic cases, so with 5000 random samples per case it would require 500,000 function evaluations! Working with random variables is very computationally intensive, and consequently it is only feasible when your functions are extremely fast (think surrogate models or other algebraic functions).
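If you want to estimate the cost before turning a variable random, the arithmetic above is easy to script. The sketch below is a back-of-the-envelope helper, not part of Rave; the function name and the assumed rate of 1000 evaluations per second are hypothetical.

```python
def total_evaluations(grid_shape, n_samples):
    """Number of function calls: one per random sample, per deterministic case."""
    n_cases = 1
    for points_along_axis in grid_shape:
        n_cases *= points_along_axis
    return n_cases * n_samples

total = total_evaluations((10, 10), 5000)
print(total)  # 500000

# Hypothetical speed: a function that runs 1000 times per second.
evaluations_per_second = 1000
print(total / evaluations_per_second, "seconds")
```

Substitute your own grid size, sample size, and measured evaluation rate to see whether an analysis is practical.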
Method 2: Sampling Data from Distributions
Statistic Variables: Functions of Random Variables
When you are working with random variables in Rave, it is presumed that all functions you load are deterministic. In other words, for each unique input vector, x, the function always returns the same value of y. That is, your functions never use the rand() function or otherwise introduce randomness inside the function; randomness only comes from randomly varying the inputs to the function. (Note: This is not to say that you can't use the rand() function. See the page on non-deterministic user-supplied functions for details.)
The way Rave handles functions of random variables is a little bit complicated. In "real life," a function of a random variable is also a random variable. However, there is no easy way for Rave to store and represent the full distribution of a random variable. Instead, when you have functions of random variables, Rave represents their values to you as scalar-valued statistics. These statistics are calculated by sampling from the distribution many times (the number of samples is determined by the "Random Variable Sampling Size" preference) and then aggregating the resulting sampled values into a single number, which is the statistic.
You can choose which statistic is used to represent each functional variable. The statistics you can choose from are:
- Standard Deviation
- Quantile(p), i.e. the value b of y such that P(y < b) = p (A generalization of the median to values other than p=0.5)
- P(y < b)
- P(y is feasible), i.e. the fraction of the sampled set of y values that satisfy whatever constraints you have defined (you can select a particular subset of the constraints to be used to calculate this probability).
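To make the definitions above concrete, here is a minimal sketch of how each of these statistics could be computed from one sampled population of y values. It is not Rave's code: the helper names, the N(10, 2) population, and the two example constraints are assumptions chosen for illustration, and the quantile uses a simple order-statistic estimate.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed only so the example is repeatable
y = [rng.gauss(10.0, 2.0) for _ in range(5000)]  # hypothetical sampled values of y

def quantile(sample, p):
    # Simple order-statistic estimate of the value b such that P(y < b) = p.
    ordered = sorted(sample)
    index = min(len(ordered) - 1, max(0, int(p * len(ordered))))
    return ordered[index]

def prob_less_than(sample, b):
    # Fraction of sampled values strictly below b.
    return sum(v < b for v in sample) / len(sample)

def prob_feasible(sample, constraints):
    # Fraction of sampled values that satisfy every selected constraint.
    return sum(all(c(v) for c in constraints) for v in sample) / len(sample)

std_y = statistics.stdev(y)
median_y = quantile(y, 0.5)
p_below_12 = prob_less_than(y, 12.0)
p_feasible = prob_feasible(y, [lambda v: v >= 8.0, lambda v: v <= 12.0])
print(std_y, median_y, p_below_12, p_feasible)
```

All four numbers are estimates from the same 5000 samples, so they carry sampling noise that shrinks as the sample size grows.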
Rave treats functions of random variables as statistics whenever it needs to calculate a single value of the function variable for each row of your data table. This is the situation whenever Rave is doing something that is not specifically designed to handle random variables. The most obvious example is that when you are viewing the table, you will see the statistic values listed. Similarly, when you are viewing most visualizations you will see the statistic values used.
Note, however, that Rave always calculates the statistics by some sort of Monte Carlo sampling (you can change how this sampling works by modifying the preferences). Consequently, Rave has access to the full sampled population and is capable of displaying and storing it. However, in general Rave does not store it; it is used to calculate the statistic and then is discarded. But the important point is that you can code special workspace objects, explore methods, or other plugins that use the entire distribution.
Future versions of Rave will have more such objects that display the entire distribution to you.
Statistic Variable Names
When a functional variable has random variables as inputs, Rave will usually display its name to you in a form that includes the statistic. For example, if you loaded a function that calculates a variable called "Weight", and some of its inputs are random variables, you will see Weight listed in Rave as "mean(Weight)" (or with another statistic, such as "variance(Weight)").
For all important purposes, however, Rave still considers the variable to be named "Weight". If you load a new function that includes Weight as one of its inputs, you would call that input "Weight", not "mean(Weight)". See also the section below about functions of statistic variables.
Note that you cannot directly use the statistics themselves as inputs to functions. For example, you cannot specify "mean(Weight)" as an input, even if you wanted to. Instead, you would specify Weight as the input, and then calculate the mean within the body of your function.
Functions of Statistic Variables (Functions of functions of random variables)
When you have a function of functions of random variables, for example f(g(x)), where x includes some random variables, Rave will generally represent both f and g to you as statistics. However (and this is important), Rave does not calculate f using the statistic value of g(x). Instead, f is also calculated as a distribution, using the actual sampled values of g(x). In other words, if we let s represent some statistic function that takes in a distribution and returns a scalar value, the statistic value of f(g(x)) is calculated as s(f(g(x))), NOT as s(f(s(g(x)))). Put differently, if you change the statistic used to represent g(x), for example from "mean" to "median", the values of f(g(x)) will not change, because they are based on the actual sampled values of g(x), not on the values of the statistic that Rave uses to represent g(x) to you.
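The difference between the two orders of operation is easy to see numerically. The sketch below uses made-up functions g and f and the mean as the statistic s; it illustrates the principle only and is not taken from Rave.

```python
import random
import statistics

def g(x):
    return 2.0 * x  # hypothetical inner function

def f(u):
    return u ** 2   # hypothetical (nonlinear) outer function

rng = random.Random(2)  # fixed seed only so the example is repeatable
x_samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# What the text describes: push every sample through g and then f, and aggregate last.
g_samples = [g(x) for x in x_samples]
f_samples = [f(u) for u in g_samples]
correct = statistics.fmean(f_samples)       # s(f(g(x)))

# What is NOT done: collapse g(x) to its statistic first, then apply f.
shortcut = f(statistics.fmean(g_samples))   # f(s(g(x)))

print(correct, shortcut)
```

Here the sample-by-sample result estimates the mean of 4x², which is 4 for x distributed as N(0,1), while the shortcut squares a mean that is close to 0. The two agree only in special cases, such as when f is linear.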
Multiple statistics of the same function
Each variable in your data set has a single associated statistic. In order to have multiple statistics of the same variable, you simply duplicate the variable so that you now have two (or more) columns in your data set that represent that variable. You can then assign a different statistic to each one. Rave will use the same sampled population to calculate each statistic. For example, if you have a function variable "weight," which is a function of random variables, and you want to calculate two statistics, mean and standard deviation, you first duplicate "weight" so that you now have two columns named "weight," then set the statistic for the first to "mean" and the statistic for the second to "standard deviation". When Rave needs to evaluate these statistics, it will sample the function that calculates weight n times, where n is your Random Sampling Sample Size preference value, and it will then calculate both statistics from this sample. It will not sample two separate populations of size n and use the first to calculate the mean and the second to calculate the standard deviation.
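The shared-population behavior described above can be pictured with the following sketch. The weight function, the evaluation counter, and the sample size are assumptions made for the example; the point is only that the function is sampled n times once, and both statistics are read from that one population.

```python
import random
import statistics

evaluations = {"count": 0}  # tracks how many times the function is called

def weight(y):
    # Hypothetical deterministic function of a random input y.
    evaluations["count"] += 1
    return 100.0 + 3.0 * y

n = 5000  # stands in for the sample-size preference
rng = random.Random(3)  # fixed seed only so the example is repeatable

# Sample the function n times, once.
population = [weight(rng.gauss(0.0, 1.0)) for _ in range(n)]

# Both statistics are computed from the same sampled population.
mean_weight = statistics.fmean(population)
std_weight = statistics.stdev(population)

print(mean_weight, std_weight, evaluations["count"])
```

The counter ends at n rather than 2n, which is why adding a second statistic of the same variable costs almost nothing extra.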
Surrogate Modeling with Random Variables
You cannot make a surrogate that is a function of a random variable, but you can make surrogate models of statistic variables, or surrogate models that are functions of statistic variables.
If you wanted to make a surrogate model of a function whose inputs are random variables, you would have to use the Random sampling button on the Sampling tab to make a new data set that is sampled according to your desired distributions, but that treats all variables as being non-random. You can then use this data set to create surrogate models and import them back into your original data set (if desired).
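Outside of Rave, the same workflow looks roughly like the sketch below: draw the inputs from the desired distribution, evaluate the function once per row, treat the result as an ordinary deterministic data set, and fit a surrogate to it. Everything here is hypothetical (the model, the N(0,1) input distribution, and the choice of a straight-line surrogate fitted by ordinary least squares); Rave's own surrogate tools are not shown.

```python
import random

def expensive_model(y):
    # Hypothetical function whose input we would like to treat as random.
    return 2.0 + 0.5 * y

rng = random.Random(4)  # fixed seed only so the example is repeatable

# Step 1: sample the input according to the desired distribution.
ys = [rng.gauss(0.0, 1.0) for _ in range(200)]
# Step 2: evaluate the function; (ys, zs) is now a plain, non-random data set.
zs = [expensive_model(y) for y in ys]

# Step 3: fit a straight-line surrogate by ordinary least squares.
n = len(ys)
mean_y = sum(ys) / n
mean_z = sum(zs) / n
slope = (sum((y - mean_y) * (z - mean_z) for y, z in zip(ys, zs))
         / sum((y - mean_y) ** 2 for y in ys))
intercept = mean_z - slope * mean_y

def surrogate(y):
    return intercept + slope * y

print(slope, intercept)
```

Because the sample points follow the input distribution, the surrogate is fitted most closely where the inputs are most likely to fall.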