Prediction is not the same as understanding

The distinction sounds obvious when stated directly. A model that predicts accurately does not necessarily tell you why something happened, or what would happen if you changed something. A model trained to predict which students will fail a course can be very accurate without revealing anything about what caused those students to struggle, or what intervention would help.

This came into focus during my MSc thesis work. The goal was to build a classifier that could detect student confusion and frustration in online discussion forum posts – not to measure sentiment in general, but to identify the specific signal that indicated a student needed a response from an instructor. An SVM classifier with a non-linear Gaussian kernel, combined with POS frequency counts and a custom course-content dictionary, achieved an F1 score of 0.79 and an accuracy of 0.83. Inter-rater reliability testing against experienced college instructors put agreement at between 74% and 91%, depending on the instructor.

By the standards of the task, those are useful numbers. The classifier does what it is supposed to do. What it cannot tell you is why certain course content consistently generates confused posts, or whether changing the content would reduce confusion, or whether the confusion is caused by the content at all rather than by something else happening in the course at the same time. Prediction and explanation are not the same problem, and a classifier trained to do one does not automatically do the other.

This distinction matters more in some domains than others. In marketing measurement, which is where much of my applied work sits, it matters a great deal. Knowing that a channel correlates with conversions is not the same as knowing that the channel caused them. A customer who sees a display ad and converts through paid search three days later may have converted anyway. The correlation is real. The causal claim requires more work.

Causal inference provides the framework for doing that work. The tools (potential outcomes, directed acyclic graphs, do-calculus) are not new, but their application to marketing measurement is underdeveloped, in part because the data requirements are non-trivial and in part because the outputs are less immediately legible than a coefficient in a regression.
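To make the gap between correlation and causation concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data is synthetic, the probabilities are invented, and the `high_intent` flag stands in for a confounder that in real data is rarely observed this cleanly. The ad has no effect on conversion by construction, yet the naive comparison still shows a lift. Stratifying on the confounder (a simple form of adjustment) removes it.

```python
import random

def simulate(n=50_000, seed=0):
    """Generate synthetic customers as (high_intent, saw_ad, converted).

    By construction, conversion depends only on intent, never on the ad.
    All probabilities below are made up for illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        high_intent = rng.random() < 0.30
        # High-intent customers browse more, so they are likelier to see the ad.
        saw_ad = rng.random() < (0.70 if high_intent else 0.20)
        # Conversion ignores saw_ad entirely: the true causal effect is zero.
        converted = rng.random() < (0.30 if high_intent else 0.02)
        rows.append((high_intent, saw_ad, converted))
    return rows

def conversion_rate(rows):
    return sum(converted for _, _, converted in rows) / len(rows)

def naive_lift(rows):
    """Difference in conversion rate between exposed and unexposed customers."""
    exposed = [r for r in rows if r[1]]
    unexposed = [r for r in rows if not r[1]]
    return conversion_rate(exposed) - conversion_rate(unexposed)

def adjusted_lift(rows):
    """Compare exposed vs. unexposed within each intent stratum,
    then average the strata weighted by their share of customers."""
    total = 0.0
    for level in (True, False):
        stratum = [r for r in rows if r[0] == level]
        total += naive_lift(stratum) * len(stratum) / len(rows)
    return total

rows = simulate()
naive = naive_lift(rows)
adjusted = adjusted_lift(rows)
print(f"naive lift:    {naive:.3f}")
print(f"adjusted lift: {adjusted:.3f}")
```

The naive lift comes out clearly positive and the adjusted lift sits close to zero. The correlation is real; the causal effect is not. The hard part in practice is that the confounder is not a column you can simply stratify on.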

The research thread I am pursuing sits at this intersection: causal attribution models for digital marketing, using the kind of server-side event data that a tool like CampaignCheck generates. The practical and the theoretical are, in this case, the same problem approached from different directions.

More on this as the work develops.

Understanding Attribution Models

A few years ago now, Google removed rules-based attribution models from Google Analytics. If you used Universal Analytics, or even early versions of GA4, you may remember these: last click, first click, linear, time decay, position-based. You could select a model that reflected your marketing efforts or customer journey, apply it to your conversion data, and get a clear view of how credit was distributed across the channels that touched a conversion.
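For anyone who never used them, the five models are simple enough to sketch in a few lines. This is an illustrative sketch, not Google's implementation: the function names, the example path, the 7-day half-life for time decay, and the 40/20/40 split for position-based are all assumptions chosen for the example.

```python
from collections import defaultdict

def model_weights(days_before, model, half_life=7.0):
    """Return one normalised weight per touchpoint, oldest first.

    days_before: days between each touchpoint and the conversion.
    half_life and the 40/20/40 split are assumed defaults.
    """
    n = len(days_before)
    if n == 0:
        return []
    if model == "last_click":
        w = [0.0] * (n - 1) + [1.0]
    elif model == "first_click":
        w = [1.0] + [0.0] * (n - 1)
    elif model == "linear":
        w = [1.0] * n
    elif model == "time_decay":
        # A touchpoint half_life days before conversion gets half the weight.
        w = [0.5 ** (d / half_life) for d in days_before]
    elif model == "position_based":
        if n == 1:
            w = [1.0]
        elif n == 2:
            w = [0.5, 0.5]
        else:
            # 40% to first, 40% to last, 20% shared across the middle.
            w = [0.4] + [0.2 / (n - 2)] * (n - 2) + [0.4]
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(w)
    return [x / total for x in w]

def attribute(touchpoints, model, half_life=7.0):
    """touchpoints: list of (channel, days_before_conversion), oldest first.
    Returns a dict mapping channel to its share of one conversion."""
    days = [d for _, d in touchpoints]
    credit = defaultdict(float)
    for (channel, _), weight in zip(touchpoints, model_weights(days, model, half_life)):
        credit[channel] += weight
    return dict(credit)

# Hypothetical journey: social post two weeks out, display ad, then paid search.
path = [("organic_social", 14), ("display", 3), ("paid_search", 0)]
for model in ("last_click", "first_click", "linear", "time_decay", "position_based"):
    shares = attribute(path, model)
    print(model, {ch: round(v, 2) for ch, v in shares.items()})
```

Running the same path through each model shows how much the choice of rule matters: organic social takes all of the credit under first click and none under last click.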

These attribution reports could highlight the impact of upper-funnel and awareness content that won’t necessarily get the credit it deserves under a last click or ad-focused model. I can’t count the number of times I’ve seen investments in organic social traffic show up in the first click or time decay reports, but disappear in last click.

When Google replaced these with a choice between last click and data-driven attribution, the latter being a bit of a black box, it made determining the importance of these broader awareness marketing initiatives much more challenging.

It’s not that the new options are bad, exactly. The data-driven model uses machine learning to assign credit based on patterns in your conversion data, and Google’s position is that this is more accurate – especially in a cookie-reduced world. But while that may be true in some cases, it is also a model you cannot inspect, cannot replicate in a spreadsheet, and cannot explain to a client in a meeting. And because the underlying model updates itself, you cannot meaningfully compare results across time periods, or compare the assisted conversion values of specific campaigns.

For large advertisers with high conversion volumes and dedicated analytics teams, data-driven attribution is probably fine. The model has enough data to work with, and there are people whose job it is to interpret the outputs. For the marketing manager at a 50 to 500 person company running a handful of campaigns, which is most marketing managers, it is a significant step backwards in practical utility.

What was actually lost is not a feature. It is control. Rule-based attribution models were transparent by design. You could disagree with a model’s assumptions, switch to a different one, and immediately see how your channel credit changed. That transparency made it possible to have an informed conversation about what was actually driving performance.

The workaround most teams have landed on is some combination of UTM parameters, last-click data from individual ad platforms, and manual reconciliation in a spreadsheet. This works, after a fashion. It is also time-consuming, error-prone, and depends entirely on everyone applying UTM conventions correctly every time. In practice, they don’t.
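Part of that reconciliation work can at least be checked automatically. The sketch below flags campaign URLs that break a tagging convention. The convention itself (three required parameters, lowercase values, no spaces) and the `utm_problems` helper are assumptions for illustration, not a standard and not a description of any particular tool.

```python
from urllib.parse import urlparse, parse_qs

# Assumed convention: these three parameters must be present on every link.
REQUIRED = ("utm_source", "utm_medium", "utm_campaign")

def utm_problems(url):
    """Return a list of human-readable problems with a URL's UTM tagging."""
    params = parse_qs(urlparse(url).query, keep_blank_values=True)
    problems = []
    for key in REQUIRED:
        values = params.get(key)
        if not values or not values[0].strip():
            problems.append(f"missing {key}")
            continue
        value = values[0]
        # "Facebook" and "facebook" end up as two different sources in reports.
        if value != value.lower():
            problems.append(f"{key} is not lowercase: {value!r}")
        if " " in value:
            problems.append(f"{key} contains a space: {value!r}")
    return problems

good = "https://example.com/?utm_source=newsletter&utm_medium=email&utm_campaign=spring_sale"
bad = "https://example.com/?utm_source=Facebook&utm_medium=social"
print(utm_problems(good))
print(utm_problems(bad))
```

A check like this catches formatting slips, but not the deeper problem: it cannot tell you whether the person tagging the link chose the right values in the first place.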

CampaignCheck is my attempt to recapture some of that control as a standalone tool that a marketing team controls rather than a feature inside a platform with other priorities. It is campaign-scoped rather than site-wide, which sidesteps most of the consent and data collection complexity, and it uses rule-based models because transparency is the point, not a limitation.

It is not finished yet. But this is why it exists.