⚖️ Building Rule-Based Data Products

Jun 11, 2021 • 5 min • Data Science

Machine learning is seemingly baked into every established product these days. If you're building something new and look around, it's really tempting to say "We need machine learning too if we're going to be as successful as the other guys."

But I would posit that you probably don't need a legitimate machine learning solution at first. In most cases, you can get 80-90% of the way there with a rule-based approach.

In the well-known Rules of Machine Learning, one of the first sections states “don’t be afraid to launch a product without machine learning” – and suggests launching a product that uses rules. One of the examples it gives is ranking apps in an app store using a heuristic that captures the app’s popularity: install rate. This is really solid advice.

In this post, I'll cover a short framework for thinking through problems that might merit a rule-based solution:

Validate the problem
Validate the data
Decide on a "one-by-one" or "final decision" approach
Use intuition or data analysis to create rules
Write a v1 and iterate on the results
Coordinate with product to ship

🤔 Validate the problem

This might go without saying, but it's important to explicitly ask yourself and your product team: Is this problem worth solving? Is this a high-leverage area of the user journey? What do we hope to achieve by making the experience more personalized?

Even if you can ship this project in a reasonable amount of time, it's something that you will be maintaining for a while. Take time to understand the tradeoff between the resources required and the expected lift before you commit to building anything. Also take some time to fully understand the product requirements from relevant stakeholders.

🔎 Validate the data

You also need to validate that you actually have the data to solve the problem effectively. You would be surprised by how often things fall apart at this stage. Are you collecting enough data to inform your rules? For instance, if you want to recommend apps for the user to install, you should probably have a good amount of data on when users install apps and what kinds of apps those were.

Having the right tracking in place is essential, but there's also another angle: Are there meaningful insights in the data that you can use to inform your rules? Do certain segments of users behave meaningfully differently from others? The fancy way to say this would be to ask, "Is there enough predictive power in this data to produce an output we're confident in?"

Be realistic with your answer here, and don't be afraid to scrap the project at this point. If you don't have the data to solve your problem, you'll fall short no matter how much effort you put in.

⚖️ Decide on "one-by-one" or "final decision"

Now that you know the problem is solvable and worth spending time on, you can start getting into the actual "how" of solving it. Given a set of rules, there are two different ways that a system can apply them:

Apply rules one-by-one in order and stop when one matches
Apply all the rules and then make a final decision

Knowing this in advance will make your life a whole lot easier when it comes to actually writing the rules and query for your engine. In the "one-by-one" scenario, rules are generally expressed as a family of case statements and tend to do really well in deterministic scenarios. In the "final decision" scenario, you are typically considering a number of factors and then returning the best option.
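To make the difference concrete, here is a minimal Python sketch of both approaches. Everything in it is an assumption made up for illustration: the user fields (`installs_games`, `installs_fitness`, `searched_workouts`), the thresholds, and the point values are hypothetical and not taken from a real product.

```python
from typing import Callable, Optional

# Hypothetical user record: a plain dict with made-up fields.
User = dict

# "One-by-one": ordered (condition, output) pairs; the first match wins.
ORDERED_RULES: list[tuple[Callable[[User], bool], str]] = [
    (lambda u: u["installs_games"] >= 3, "games"),
    (lambda u: u["installs_fitness"] >= 1, "fitness"),
]

def one_by_one(user: User) -> Optional[str]:
    """Return the output of the first rule that matches, or None."""
    for condition, output in ORDERED_RULES:
        if condition(user):
            return output
    return None  # no rule matched

# "Final decision": every matching rule adds points to a category,
# and the category with the highest total wins.
SCORING_RULES: list[tuple[Callable[[User], bool], str, int]] = [
    (lambda u: u["installs_games"] >= 3, "games", 2),
    (lambda u: u["installs_fitness"] >= 1, "fitness", 1),
    (lambda u: u["searched_workouts"], "fitness", 2),
]

def final_decision(user: User) -> Optional[str]:
    """Apply all rules, sum the points per category, return the best one."""
    scores: dict[str, int] = {}
    for condition, category, points in SCORING_RULES:
        if condition(user):
            scores[category] = scores.get(category, 0) + points
    if not scores:
        return None
    return max(scores, key=scores.get)

user = {"installs_games": 3, "installs_fitness": 2, "searched_workouts": True}
print(one_by_one(user))      # games
print(final_decision(user))  # fitness
```

The same user gets a different answer from each approach: the ordered rules stop at the first match, while the scoring rules let two weaker fitness signals outweigh one games signal.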

📊 Use intuition or data analysis to create rules

When it comes to actually creating your rules, my advice largely depends on the type of problem you are dealing with. For some scenarios, you intuitively know what outputs are best. A good example of this is creating rules for a customer support chatbot. If someone says "forgot password", then a decent recommendation is material on how to request a new password.
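An intuition-driven rule like the chatbot one can be as simple as a keyword lookup. The sketch below is only an illustration: the `SUPPORT_RULES` table, the article titles, and the `suggest_article` helper are all hypothetical.

```python
from typing import Optional

# Hypothetical keyword -> help article pairs, checked in order.
SUPPORT_RULES = [
    ("forgot password", "How to request a new password"),
    ("refund", "How refunds work"),
]

def suggest_article(message: str) -> Optional[str]:
    """Return the first article whose keyword appears in the message."""
    text = message.lower()
    for keyword, article in SUPPORT_RULES:
        if keyword in text:
            return article
    return None  # no rule matched; fall back to a human or a default reply

print(suggest_article("Help, I forgot password again"))
```

Exact substring matching is deliberately naive: a message like "I forgot my password" would not match the first rule, which is the kind of gap that shows up quickly once you look at real results.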

However, not all rule-based engines are that intuitive. Some rules need to be informed by data analysis. This often means identifying which variables and benchmarks are important. You can do this through informal data analysis and visualization. Or, if you want to put your data science chops to work, you can fit an inference model for a less biased source of predictors.
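As a rough illustration of the informal analysis route, the sketch below compares an install rate across user segments to see whether a segment-based rule is worth writing. The rows, segment names, and the `install_rate_by_segment` helper are invented for the example; real data would come from your own tracking.

```python
from collections import defaultdict

# Made-up rows: (user segment, whether the user installed a recommended app).
events = [
    ("new_user", True), ("new_user", False), ("new_user", False), ("new_user", False),
    ("returning", True), ("returning", True), ("returning", True), ("returning", False),
]

def install_rate_by_segment(rows):
    """Return {segment: installs / total rows} for each segment."""
    installs = defaultdict(int)
    totals = defaultdict(int)
    for segment, installed in rows:
        totals[segment] += 1
        installs[segment] += int(installed)
    return {segment: installs[segment] / totals[segment] for segment in totals}

rates = install_rate_by_segment(events)
print(rates)  # {'new_user': 0.25, 'returning': 0.75}
```

A gap this large between segments would suggest that segment is a useful variable for a rule. With only a handful of rows per segment, as in this toy data, you would want far more evidence before trusting the benchmark.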

✍️ Write a v1 and iterate on the results

Now that you have your rules written down in some form, it's time to write a v1 that you can use to test results and iterate on them until they are satisfactory. My default tool here is SQL, as it's quick and powerful enough to get the job done. Write a query that pulls in the data the product will have at that point, then applies a series of case statements, either evaluated one by one or summed together, to output one or more results.
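Here is a sketch of what such a v1 query might look like, run against an in-memory SQLite database from Python so that it is self-contained. The `users` table, its columns, and the thresholds are assumptions for illustration only; your warehouse schema and SQL dialect will differ.

```python
import sqlite3

# In-memory database with a hypothetical users table and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "user_id INTEGER, game_installs INTEGER, "
    "fitness_installs INTEGER, searched_workouts INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [(1, 4, 0, 0), (2, 0, 2, 1), (3, 0, 0, 0)],
)

V1_QUERY = """
SELECT
    user_id,
    -- "One-by-one": the first WHEN branch that matches wins.
    CASE
        WHEN game_installs >= 3 THEN 'games'
        WHEN fitness_installs >= 1 THEN 'fitness'
        ELSE NULL
    END AS recommended_category,
    -- "Final decision" building block: sum several CASE expressions into a score.
    (CASE WHEN fitness_installs >= 1 THEN 1 ELSE 0 END)
      + (CASE WHEN searched_workouts = 1 THEN 2 ELSE 0 END) AS fitness_score
FROM users
ORDER BY user_id
"""

predictions = conn.execute(V1_QUERY).fetchall()
print(predictions)  # [(1, 'games', 0), (2, 'fitness', 3), (3, None, 0)]
```

User 3 matches no rule and comes back with a NULL category, which is exactly the kind of case to watch when you review the results.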

Once you have results, I suggest continuing to iterate on the query and rules in your IDE until you are happy with them at a glance. Then export your predictions to a .csv file and toss it in a Google Sheet so that you and any relevant stakeholders can look them over. At this point I recommend evaluating the results on two points:

Intuition gut check: Would this prediction make sense if I were the user?
Coverage of cases: What percentage of users receive a prediction?
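The coverage check and the .csv export are both easy to script. The following is a small sketch under assumptions: the `predictions` rows are made up, and `None` stands in for "no rule matched".

```python
import csv
import io

# Hypothetical v1 output: (user_id, prediction); None means no rule matched.
predictions = [(1, "games"), (2, "fitness"), (3, None), (4, "games")]

def coverage(rows):
    """Share of users that received any prediction at all."""
    if not rows:
        return 0.0
    covered = sum(1 for _, prediction in rows if prediction is not None)
    return covered / len(rows)

def to_csv(rows):
    """Render the predictions as CSV text, ready to paste into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["user_id", "prediction"])
    writer.writerows(rows)  # None is written as an empty cell
    return buffer.getvalue()

print(f"coverage: {coverage(predictions):.0%}")  # coverage: 75%
print(to_csv(predictions))
```

The gut check still has to be done by a person reading the rows, but a coverage number like this makes it easy to see whether a rule change helped or hurt between iterations.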

🚢 Coordinate with product to ship

If you made it this far, congrats! This is an uncommon case in any (pseudo) machine learning project. Next steps are organization-dependent at this point, but if you are in an analytics role then there will likely be some sort of handoff with the product team that will work on shipping the feature.

Otherwise, sit back and be proud of yourself. Oh, and don't forget to measure and track the effectiveness of the output once it's shipped. 😉
