This is the first in a series of technical blog posts describing data science design patterns useful in constructing data science models, including decision-support and decision-automation systems. We hope that this blog will become a clearinghouse within the data science community for these data science design patterns, thereby extending the design-pattern tradition in software development and enterprise architecture to data science.
Data Science Design Patterns
Preliminaries. Recall that a probability density function (PDF) assigns probability mass (relative likelihood) to measurable collections of events over a sample space.[i] A PDF distance function or metric is a distance function on some set of PDFs. For example, consider the set of geometric PDFs. The geometric PDF is defined by the probability of initial success
p(first success on kth trial) = (1 – p)kp.
The distance between two geometric PDFs having respective per-trial success rates p1 and p2 is
d(p1, p2) = |p2 – p1|.
(Trivially, p1 and p2 are real numbers, and absolute value is a distance function on the reals; so d(p1, p2) is also a distance function.)
The concept of an equivalence class arises when one partitions a set S into subsets termed equivalence classes that are pairwise disjoint and collectively exhaustive, so that every element of S is a member of exactly one subset. All elements in a given class are equivalent.
Finally, recall that unsupervised classification is the problem of partitioning a set without specifying the target classes for the classifier. Instead the partitioning optimizes some criterion. Clustering algorithms do unsupervised classification, frequently by employing a distance function in the process.
Decision templates. The design pattern applies when a decision maker faces a recurring decision problem. All instances of the problem have the same optimization criterion (objective function). All instances involve a set of uncertainties having the same structure, but each instance may have a different PDF over the uncertainties. Here is the sequence of events:
- The decision maker learns which PDF they face.
- The decision maker chooses a course of action.
- The uncertainties are resolved.
- An outcome is realized.
Decision strategy. The decision strategy is to partition the PDFs into equivalence classes, and to have the decision maker take the same actions in response to any PDF in a given equivalence class. Let’s call the actions for a given equivalence class its response. A decision template is an equivalence class of PDFs together with its response. A templated recurring decision problem is a recurring decision problem having decision templates over the problem’s set of possible PDFs.
How it Works
Optimization problem. The trick in templating a recurring decision problem is to partition the set of possible PDFs, and then to define responses over the partition, so the resulting templates yield optimal expected outcomes with respect to the decision problem’s optimization criterion. Many partitions are possible, and there can be a non-trivial interaction between partitioning strategy and response strategy. So the search problem can be challenging.
Partitioning strategies. Typically one tackles the optimization problem by clustering the possible PDFs using a distance function appropriate in some sense to the decision problem. There are many well-known distance functions available for PDFs. See Sung-Hyuk Cha, “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions,” International Journal of Mathematical Models and Methods in Applied Sciences, Volume 1 Issue 4, 2007, pp. 300-307 for a survey.[ii] The author identifies 45 PDF distance functions and classifies them into eight families:
- Lp Minkowski
- inner product
- fidelity (squared chord)
- squared L2 (χ2)
- Shannon’s entropy
He also compares the properties of the families and of individual functions. Note that he does not mention the recently popular Wasserstein (earth mover) metric, which we discuss in the example below.
There are also many possible clustering algorithms based on distance functions. In fact, there are many surveys of clustering algorithms! See for example Lior Rokach, “A Survey of Clustering Algorithms,” Data Mining and Knowledge Discovery Handbook, Springer (2010), pp. 269-298.[iii] Limiting the partitioning strategy to distance-based clustering may still define a search over dozens of possible combinations of clustering algorithm and distance metric. Substantial domain knowledge and data science experience may be necessary to discover an effective partitioning strategy.
Response strategies. Given a partitioning of the set of PDFs, one must choose an optimal response for each equivalence class. This search problem is entirely domain specific, depending on e.g. the structure of the set of possible responses.
Joint search strategies. It is difficult to assess the effectiveness of a partitioning strategy until one is confident that the set of responses computed for a given partitioning is optimal for that partitioning. So one should solve the response-optimization problem before finalizing the partitioning strategy. Then one can test the overall effectiveness of different partitioning strategies, computing optimal responses under each strategy.
The upper limit on the number of equivalence classes is, of course, the number of PDFs (one in each equivalence class). Any optimal solution to a recurring decision problem would have an equivalent optimal solution with this maximal number of classes. (One can construct the latter from the former in an obvious way.) It is useful to compute the value of any such optimal solution, to measure divergence from optimality as a function of reduction in the number of equivalence classes. One might also compute this loss function for a given partitioning strategy. A partitioning strategy that works well for a large number of equivalence classes may work poorly for a small, manageable number of classes (or conversely). If the cost of using a small number of decision templates is too high, a given partitioning strategy (or perhaps the entire design pattern) may not be appropriate.
When to Use It
When a recurring decision problem involves a large, perhaps infinite, set of possible PDFs, templating the decision problem may have several important benefits:
- It dramatically simplifies the decision process a human decision maker faces. This is especially useful when the decision maker is under time pressure, or when a simple decision process is necessary to persuade the decision maker to use a decision support system (DSS).
- It may help stakeholders understand the structure of the decision problem. In particular, analysis of the loss function may demonstrate that a small number of decision templates can achieve a high level of optimality. DSS adoption is generally easier when the stakeholders understand the model’s structure, and can quantify its costs and benefits.
- It reduces the modeling problem to one involving search over small sets of possible PDF distance functions and clustering algorithms, for a plausible range of equivalence-class counts. It can be easier to choose effective partitioning and response strategies than to build a solution that directly identifies an optimal response as a function of the PDF.
So, consider using decision templates when you have a recurring decision problem with a large number of PDFs, and when any of the above benefits is attractive and within reach.
Mosaic’s case study “Dynamic Airspace Configuration” illustrates the data science design patterns. The decision maker is an air traffic controller. The PDFs are weather and demand forecasts. The responses are airspace configurations. The distance function is the earth mover metric, which measures the minimal amount of work (transfer of mass) needed to convert one PDF into another. The clustering algorithm is k-medoids. The optimization criterion penalizes over-capacity conditions. The response strategy uses NASA’s SectorFlow algorithm.
In this case the modeling process defined 10 decision templates. The approach yielded improvements of various parameters of up to 86%, with a potential annual cost savings (measured in highly realistic simulations) of between 51 and 268 million dollars. Please read the case study for more details.
[i] For this blog’s purposes we gloss the distinction between a probability mass function (for discrete distributions) and a probability density function (for continuous distributions). [ii] http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446, visited 2/13/14. [iii] http://link.springer.com/chapter/10.1007%2F978-0-387-09823-4_14, visited 2/13/14.