Updated: March 31
In theory, more data means more information, which means better decisions. In practice, it is often not straightforward to see how a specific use case could benefit from data. In this blog post, I want to show how machine learning can support MarketSense, which helps you distribute your cleantech product, by converting data into potential, potential into leads, and leads into conversions.
Machine learning is not the only tool to make use of data, but it is one that is capable of creating some higher-level intuition. Therefore, it is also a scalable approach as it helps to tell a data-driven story that can be told and developed on a more abstract, more humanly accessible level.
In this sense, this post is a rather technical one that presents some, hopefully relevant, applications for cutting-edge machine learning algorithms. At the same time, it should show people who prefer to think in use cases what becomes possible with access to data.
With our platform, Swiss Energy Planning (SEP), you can already conveniently filter potential buildings matching your custom criteria, for example, all buildings with a large PV potential and a high feed-in electricity tariff. Our feature catalog, from which you can choose your criteria, is growing every week, and so are your possibilities for choosing additional, possibly softer, criteria (such as the renovation pressure of a building, the renovation rate of a district, or socioeconomic features) that might be just as important for answering your problem.
However, with more and more possibilities, it is also getting more and more difficult to choose the right criteria. For example, are a large PV potential and high feed-in electricity tariff the one and only key indicators for someone being interested in a PV system? (A recent case study in the USA suggests that population density, heating system distribution, and median house unit value are more indicative features for the amount of PV deployment at a census tract level than the daily solar radiation, see , Figure 6B.) And even if those are indicative factors, what exact threshold should be chosen to separate, e.g., high from low feed-in tariffs?
What is more, choosing additional features and thresholds is not only difficult; it also makes your model less understandable and less maintainable (see Rule #3 of Google's Rules of Machine Learning).
In some use cases, the answer to these difficulties could be switching to a machine learning model that automatically finds important features and how they interact with your problem statement.
Figure 2: Partial dependence plots of the nine most important features distinguishing a sales list from an average building. Note that a partial dependence plot marginalizes over all features except one (or possibly two or more) and hence does not show any interactions between the features.
To begin with, we have to state your use case as a concrete machine learning problem. Examples:
Potential lead generator: "Generate new potential leads, e.g., for a telephone campaign, from my (continuously growing) history of sales and no-sales."
Lead prioritization: "I receive so many requests for my cleantech product. I want to prioritize promising and profitable requests based on my history of successful and unsuccessful requests."
Potential ordering: "I have already filtered my potential in SEP with my hard criteria. Now I would like to order my list based on my history of useful and not useful results."
Your use case...?
In all these use cases, the machine learning model is trained on a continuously growing data set. Hence the model adapts over time along with the data: the more data is available, the better the model represents your specific problem. Consequently, at the very beginning, when there is no data available, the model is not representative at all.
One option is to kick-start the training of the model by importing a sales list as positive examples and generating artificial negative examples by randomly picking buildings from the database. At this stage, the model distinguishes your sales from an average building. Such a model also generated the partial dependence plots shown in Figure 2 above.
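The kick-start described above can be sketched in a few lines of Python. This is only a minimal illustration, not the SEP implementation: the function name, the building identifiers, and the idea of representing the database as a plain list of IDs are all assumptions made for the example.

```python
import random


def build_kickstart_dataset(sales_ids, all_building_ids, n_negatives, seed=0):
    """Label sales as 1 and randomly drawn non-sale buildings as 0.

    All names are hypothetical; `all_building_ids` stands in for the
    building database.
    """
    sales = set(sales_ids)
    # Exclude known sales so that no building is labelled both ways.
    candidates = sorted(b for b in set(all_building_ids) if b not in sales)
    if n_negatives > len(candidates):
        raise ValueError("not enough non-sale buildings to sample from")
    rng = random.Random(seed)
    negatives = rng.sample(candidates, n_negatives)
    # The negatives are artificial: some of them could be future customers.
    return [(b, 1) for b in sorted(sales)] + [(b, 0) for b in negatives]


# Toy usage: 3 sales in a database of 20 buildings, 10 random negatives.
dataset = build_kickstart_dataset(
    ["b2", "b5", "b7"],
    [f"b{i}" for i in range(20)],
    n_negatives=10,
)
```

The resulting list of (building, label) pairs would then be joined with the building features and passed to whichever classifier you prefer.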
Another option, for example if there is not much sales data available, would be to first start with a heuristic, e.g., some sort of cost-utility analysis, and then gradually switch to a machine learning model based on the collected feedback.
Measuring Model Performance
It is crucial to be able to track the performance of our model as it evolves over time along with your data. Is the model getting better at distinguishing high from low potential, or is the distinction becoming fuzzier over time? One difficulty in measuring the performance of an evolving model is that the data distribution changes with time. As an extreme case, take our data set at the very beginning, consisting only of a sales list as positive examples and an arbitrary number of random buildings as negative examples. If we use, e.g., accuracy as a performance metric, the model can be made arbitrarily accurate by choosing more and more random negative examples. If you have 50 sales and choose 5000 random negative examples, you easily achieve an accuracy of 0.99 (5000 divided by 5050) by simply predicting "no sale" all the time.
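The arithmetic behind this pitfall is easy to reproduce. The small helper below (a hypothetical name, introduced only for this illustration) computes the accuracy of a model that never predicts a sale, for a fixed sales list of 50 and a growing number of random negatives.

```python
def always_no_sale_accuracy(n_sales, n_negatives):
    """Accuracy of a model that predicts "no sale" for every building."""
    # Only the negative examples are classified correctly.
    return n_negatives / (n_sales + n_negatives)


for n_negatives in (50, 500, 5000, 50000):
    acc = always_no_sale_accuracy(50, n_negatives)
    print(f"{n_negatives:>6} negatives -> accuracy {acc:.3f}")
```

The accuracy climbs towards 1 as negatives are added, although the model has learned nothing about what makes a sale.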
A remarkably suitable metric for our scenario is the area under the receiver operating characteristic curve (AUC), which is "insensitive to changes in class distribution" . In our case, this means that we should always get around the same AUC value, no matter whether we take 5000 or 10000 negative examples. Fluctuations in the AUC then reflect how well the model can distinguish between a sale and an average building rather than how many negatives were drawn. More formally speaking: "the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance" , which is basically what we want to measure if we want to rank a list of buildings according to their potential for our problem.
Figure 3: Receiver operating characteristic of the same model as used in Figure 2, which distinguishes sales from average buildings. The light blue curves show the results obtained by 5-fold cross-validation, and the thicker blue line is the mean of the five folds. The grey dashed line represents a model predicting by chance. The mean and standard deviation of the AUC over the five folds are 0.78 and 0.05, respectively. For more details on receiver operating characteristics and AUC, see .
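The ranking interpretation of the AUC quoted above can be checked directly by counting correctly ordered pairs. The sketch below uses made-up scores and a deliberately naive quadratic loop; it is meant to illustrate the definition, not to replace a library implementation. Duplicating the negatives keeps their score distribution exactly the same, so the AUC does not move at all; with freshly sampled negatives it would only stay approximately constant.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up model scores, for illustration only.
sales_scores = [0.9, 0.7, 0.4]
random_building_scores = [0.8, 0.5, 0.3, 0.2]

auc = pairwise_auc(sales_scores, random_building_scores)
# Twice as many negatives with the same score distribution: same AUC.
auc_more_negatives = pairwise_auc(sales_scores, random_building_scores * 2)
print(auc, auc_more_negatives)
```

Accuracy, by contrast, would change as soon as the number of negatives changes, which is why it is a poor fit for this setting.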
Exploration vs. Exploitation
In some use cases, having a good ranking is not enough. Consider again the potential lead generator that should guide your telephone campaign. If we always take the best building according to our current model, chances are high that the generated leads will always be the same type of building. However, there may be different customer types that are interested in your product. There may even be customers out there who are more interested than your regular customer, but about whom you know nothing because you never meet them with your evergreen strategy. Or some customer segments may become more and more depleted while others are on the rise.
In fact, what we want to do is to optimize some long-term reward. To that end, suboptimal actions are sometimes required in the short term in order to detect new profitable regions.
This is known as the exploration-exploitation dilemma in reinforcement learning.
A simple strategy, known as epsilon-greedy, is to pick the optimal decision with probability 1−ε and a random decision with probability ε. However, other methods usually trade off exploration vs. exploitation more efficiently, in particular, Thompson sampling .
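Both strategies fit in a few lines of Python. The sketch below treats three hypothetical customer segments as the options to choose from, with an invented call history. The Thompson sampling variant shown is the Beta-Bernoulli form with a uniform Beta(1, 1) prior, which is an assumption made for this example and only one of several possible formulations.

```python
import random


def epsilon_greedy(estimated_rates, epsilon, rng):
    """Pick the best-looking option with prob. 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimated_rates))  # explore
    # Exploit: index of the highest estimated conversion rate.
    return max(range(len(estimated_rates)), key=estimated_rates.__getitem__)


def thompson_sample(successes, failures, rng):
    """Beta-Bernoulli Thompson sampling with a uniform Beta(1, 1) prior.

    Draw a plausible conversion rate for each option from its posterior
    and pick the option with the largest draw.
    """
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


rng = random.Random(42)
# Hypothetical call history for three customer segments.
successes = [12, 3, 0]
failures = [48, 7, 0]  # segment 2 has never been called
rates = [s / (s + f) if s + f else 0.0 for s, f in zip(successes, failures)]

eg_picks = [epsilon_greedy(rates, 0.1, rng) for _ in range(1000)]
ts_picks = [thompson_sample(successes, failures, rng) for _ in range(1000)]
print("epsilon-greedy picks per segment:", [eg_picks.count(i) for i in range(3)])
print("Thompson picks per segment:     ", [ts_picks.count(i) for i in range(3)])
```

In this frozen snapshot, epsilon-greedy spends its exploration budget uniformly, while Thompson sampling directs many picks to the segment it knows nothing about. In a real campaign the counts would be updated after every call, so that exploration fades as evidence accumulates.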
Yet another strategy, related to stratified sampling, is to first cluster your potential into different groups with high intra-group and low inter-group similarity and then take the best candidates from each group in proportion to the group's size. Such a strategy ensures that you explore and exploit different tracks.
Figure 4: Similarity matrix of 2000 buildings clustered into 10 groups using supervised clustering. Buildings in the same group tend to perform similarly with respect to the problem statement.
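Once the groups exist, the proportional selection step described above is straightforward. The following sketch assumes the clustering has already been done and that each building comes with a model score; the function name, group labels, and scores are hypothetical. Quotas are rounded with the largest-remainder method, which is one reasonable choice among several.

```python
def pick_per_group(groups, n_picks):
    """Take the top-scored buildings from each group, proportional to group size.

    `groups` maps a (hypothetical) group label to a list of
    (building_id, model_score) pairs.
    """
    total = sum(len(members) for members in groups.values())
    if not 0 <= n_picks <= total:
        raise ValueError("n_picks must be between 0 and the number of buildings")
    # Largest-remainder rounding so that the quotas add up to n_picks.
    exact = {g: n_picks * len(m) / total for g, m in groups.items()}
    quota = {g: int(x) for g, x in exact.items()}
    leftover = n_picks - sum(quota.values())
    by_remainder = sorted(groups, key=lambda g: exact[g] - quota[g], reverse=True)
    for g in by_remainder[:leftover]:
        quota[g] += 1
    picks = []
    for g, members in groups.items():
        best_first = sorted(members, key=lambda pair: pair[1], reverse=True)
        picks.extend(building for building, _ in best_first[:quota[g]])
    return picks


groups = {
    "A": [("a1", 0.9), ("a2", 0.8), ("a3", 0.7), ("a4", 0.6), ("a5", 0.5), ("a6", 0.4)],
    "B": [("b1", 0.85), ("b2", 0.3), ("b3", 0.2)],
    "C": [("c1", 0.1)],
}
picks = pick_per_group(groups, 4)
print(picks)
```

A pure ranking would take the four highest scores (a1, b1, a2, a3) and never touch group C. The proportional quotas keep at least some attention on smaller groups, which is the exploration effect the strategy is after.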