Steve – Steve C. Harris

Why I rebuilt SplitCheck’s statistics engine

For the past several months I have been working on a new series of digital marketing products, largely focused on gaps I have encountered in the marketplace over the years. The first of these, SplitCheck, was released a couple of months ago.

SplitCheck was initially designed to be an A/B testing solution for organizations that felt the loss of a simple, straightforward platform when Google’s Optimize exited the market a couple of years back, and interest in and use of the tool have grown steadily over the past few months.

In speaking to users in various forums after the initial launch, however, I have realized that the standard approach to A/B testing has a weakness nobody talks about: you need volume to test. Lots of it.

Almost every major A/B testing platform, as well as the tools built into most marketing stacks, uses the two-proportion Z-test to determine whether a variant is winning against other contenders. This is frequentist statistics, where you set a significance threshold, collect data until you cross the finish line, and get a (usually binary) verdict. Significant or not significant. Winner or no winner.

There is a reason this method is so popular in the digital marketing world. It is relatively straightforward to implement and define. And it works. Under the right conditions.

The thing is that the conditions for the traditional approach generally require at least several hundred conversions per variant, stable traffic within a relatively short time period, and the patience to wait for the data to accumulate. For a large e-commerce platform or a SaaS company with very high channel traffic, this is very reasonable and effective.

For years, however, I have been working with small to medium-sized organizations running campaigns to a landing page or a conversion action with much smaller volumes. For them, it means waiting months for a result (by which time the competitive landscape has often shifted, and the test has become irrelevant), or simply breaking the statistical model by prematurely choosing a winner mid-test.

The issue is what “not significant” actually means. In practice, most SMB marketers interpret it as “no difference.” That is not what it means. It means “we do not have enough data to be confident either way.” Those are very different claims, and confusing them leads to bad decision making: abandoning tests that are showing real signal, or running variants indefinitely on the grounds that nothing is proven.

I have been thinking about this problem for a while, and this week I replaced SplitCheck’s traditional statistical engine with a Bayesian alternative. Well, actually I added Bayesian and then put it up front. The traditional stats are still there as well.

The change is conceptually simple. Instead of asking “is this result statistically significant?”, the Bayesian engine asks “given the data we have, what is the probability that Variant B outperforms Variant A?” The answer is a number between 0 and 1, updated continuously as visitors arrive.

The underlying model is Beta-Binomial: each variant’s true conversion rate is modelled as a Beta distribution, updated as a conjugate posterior from the Binomial likelihood of observed conversions. P(B > A) is estimated via Monte Carlo simulation: 10,000 draws from each posterior, counting the proportion of draws where B exceeds A. At 10,000 samples the Monte Carlo standard error is bounded at 0.5%, which is more than adequate for the precision required.
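A minimal sketch of this calculation in Python, standard library only. The uniform Beta(1, 1) prior, the function name, and the fixed seed are assumptions for illustration, not a description of SplitCheck’s internals.

```python
import math
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=10_000, prior=(1.0, 1.0), seed=0):
    """Estimate P(B > A) under independent Beta-Binomial models.

    prior is the (alpha, beta) of the Beta prior. Beta(1, 1) is an assumed
    uniform prior, chosen here only for illustration.
    """
    rng = random.Random(seed)
    alpha0, beta0 = prior
    # Conjugate update: posterior is Beta(alpha0 + conversions, beta0 + non-conversions).
    post_a = (alpha0 + conv_a, beta0 + n_a - conv_a)
    post_b = (alpha0 + conv_b, beta0 + n_b - conv_b)
    # Draw from each posterior and count how often B's rate exceeds A's.
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws


# The Monte Carlo standard error of a proportion is sqrt(p * (1 - p) / draws),
# which is largest at p = 0.5: sqrt(0.25 / 10_000) = 0.005, i.e. 0.5%.
max_standard_error = math.sqrt(0.25 / 10_000)
```

Because the estimate is a simple proportion of draws, its precision depends only on the number of draws, not on how much traffic the test has seen.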

The practical effect is substantial. A test where Variant B has converted 9 out of 50 visitors and Variant A has converted 3 out of 50 will return a Bayesian probability of around 96% that B is better. A frequentist Z-test at the same sample size will say “insufficient data.” One of these outputs is useful for small-scale tests. The other isn’t.
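For comparison, here is a sketch of the frequentist side of that example: a pooled two-proportion Z-test on the same 9-of-50 versus 3-of-50 data. The two-sided test and the 0.05 threshold are assumptions (the conventional defaults), and the function name is hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a pooled two-proportion Z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both variants share one pooled conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(conv_a=3, n_a=50, conv_b=9, n_b=50)
significant = p_value < 0.05  # assumed conventional threshold
print(f"z = {z:.2f}, p = {p_value:.3f}, significant = {significant}")
```

With these inputs z comes out at roughly 1.85 and the two-sided p-value at roughly 0.065, so the test falls just short of the 0.05 threshold.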

We also report a 90% credible interval on the expected lift, the Bayesian equivalent of a confidence interval, but with the interpretation practitioners have always (incorrectly) applied to confidence intervals. A 90% credible interval of +4% to +21% means there is a 90% posterior probability that the true lift falls in that range. That is the statement people think a confidence interval makes, but it does not.
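One way such an interval can be computed is from the same kind of posterior draws, sketched below. This assumes lift means relative lift, (B - A) / A, along with a uniform Beta(1, 1) prior and an equal-tailed interval; the source does not specify any of these, and the traffic numbers in the usage line are made up.

```python
import random


def lift_credible_interval(conv_a, n_a, conv_b, n_b, level=0.90, draws=10_000, seed=0):
    """Equal-tailed credible interval for relative lift, (B - A) / A."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        # One joint draw of both conversion rates from their Beta posteriors
        # (assumed uniform Beta(1, 1) prior on each).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        lifts.append((rate_b - rate_a) / rate_a)
    lifts.sort()
    # Cut an equal share of draws from each tail.
    tail = (1 - level) / 2
    lower = lifts[round(tail * draws)]
    upper = lifts[round((1 - tail) * draws) - 1]
    return lower, upper


# Hypothetical test: A converts 120 of 1,000 visitors, B converts 150 of 1,000.
lower, upper = lift_credible_interval(120, 1000, 150, 1000)
print(f"90% credible interval for lift: {lower:+.1%} to {upper:+.1%}")
```

The interval is read directly off the sorted draws, which is why it supports the plain statement "there is a 90% posterior probability the lift is in this range."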

As I said above, the more traditional frequentist engine still runs in parallel on SplitCheck, and for larger organizations this could very well continue to be their deciding metric. It would probably be mine if I had the traffic patterns to fill it. Now both results are stored on every test, and the traditional significance output is available in a collapsed section of the dashboard for those who want it. This parallel approach serves a dual purpose: it lets customers cross-reference results, and it generates the validation dataset I need for the research work I am developing on this methodology.

That research is the longer-term motivation for me. The experimentation literature has focused almost exclusively on enterprise-scale contexts. There is real work – both practical and academic – to be done on valid causal inference for conversion experimentation at SMB traffic volumes, on what methods are calibrated, what decision thresholds are appropriate, and whether empirically derived priors from aggregate platform data can accelerate time-to-decision.

The new results panel is live at splitcheck.io. The methodology page at splitcheck.io/methodology has the full technical explanation for anyone who wants to go deeper.

Aria — Local AI with Causal Reasoning

Several months ago I was flipping through game reviews when I came across an interesting new title called Whispers From the Star. The premise of the game is that you accidentally make contact with an astronaut on a faraway planet, and communicate with her in order to help her survive. The creators claim the game is fully AI-driven, and reacts to the player’s input on a number of levels, which adds to the immersion of the experience. It’s an interesting study in human/AI interaction.

So, for the past few months I have been working on a fully local AI companion running on Ubuntu with an NVIDIA GPU, loosely inspired by the game interface, with one twist: no cloud services, no API keys, no data leaving the machine.

The surface layer is a 3D animated character built in Godot 4, voice-driven, with real-time lip sync and facial expressions responding to conversation state. This is really just a placeholder for me at the moment, as I am no artist, but the more interesting work is underneath it.

Aria maintains a persistent memory of personal interactions: journal entries written after each conversation, a relationship model that accumulates interests and personality observations across sessions, and an emotional state that drifts slowly based on how recent conversations have gone. The system even reads things while idle, searching topics from past discussions and occasionally surfacing something worth sharing without being asked.

Obviously this is a lot of data, so I have had to come up with a way of weighting and depreciating information so that the model isn’t overloaded. I have spent years working with data, and even then this has been a master class in representation, classification, storage, retrieval, and the technology behind it.

One of the more interesting research threads running through the project is the application of Pearl’s causal framework to human-AI relationship dynamics. After each conversation, structured observations are recorded: engagement level, depth, mood signals, conversational interventions. A pattern detector identifies statistical regularities in the accumulated data. A reasoning pass interprets those regularities as causal hypotheses with mechanisms and falsifiers. The system is currently in Phase 1 (observation only); Phase 2 introduces interventions, and Phase 3 introduces counterfactual reflection.

This extends directly from the MSc thesis work. TutorAlert detected student frustration from discussion posts — Level 1 on Pearl’s Ladder of Causation. Aria is an attempt to reach Levels 2 and 3: not just detecting states, but reasoning about what causes them and what would change them.

Built with: Ollama (gemma2:9b / phi4:14b / qwen2.5-coder:7b), faster-whisper, Piper TTS, Rhubarb Lip Sync, Godot 4, Python asyncio, DoWhy, pandas. I am interested in seeing whether the causal layer produces intervention strategies that are genuinely useful: not just descriptions of what happened, but guidance on what to do differently.

In active development, more to come.

Prediction is not the same as understanding

The distinction sounds obvious when stated directly. A model that predicts accurately does not necessarily tell you why something happened, or what would happen if you changed something. A model trained to predict which students will fail a course can be very accurate without revealing anything about what caused those students to struggle, or what intervention would help.

This came into focus during my MSc thesis work. The goal was to build a classifier that could detect student confusion and frustration in online discussion forum posts – not to measure sentiment in general, but to identify the specific signal that indicated a student needed a response from an instructor. An SVM classifier with a non-linear Gaussian kernel, combined with POS frequency counts and a custom course-content dictionary, achieved an F1 score of 0.79 and an accuracy of 0.83. Inter-rater reliability testing against experienced college instructors put agreement at between 74% and 91%, depending on the instructor.

By the standards of the task, those are useful numbers. The classifier does what it is supposed to do. What it cannot tell you is why certain course content consistently generates confused posts, or whether changing the content would reduce confusion, or whether the confusion is caused by the content at all rather than by something else happening in the course at the same time. Prediction and explanation are not the same problem, and a classifier trained to do one does not automatically do the other.

This distinction matters more in some domains than others. In marketing measurement, which is where much of my applied work sits, it matters a great deal. Knowing that a channel correlates with conversions is not the same as knowing that the channel caused them. A customer who sees a display ad and converts through paid search three days later may have converted anyway. The correlation is real. The causal claim requires more work.

Causal inference provides the framework for doing that work. The tools (potential outcomes, directed acyclic graphs, do-calculus) are not new, but their application to marketing measurement is underdeveloped, in part because the data requirements are non-trivial and in part because the outputs are less immediately legible than a coefficient in a regression.

The research thread I am pursuing sits at this intersection: causal attribution models for digital marketing, using the kind of server-side event data that a tool like CampaignCheck generates. The practical and the theoretical are, in this case, the same problem approached from different directions.

Understanding Attribution Models

A few years ago now, Google removed rule-based attribution models from Google Analytics. If you used Universal Analytics, or even early versions of GA4, you may remember these: last click, first click, linear, time decay, position-based. You could select a model that reflected your marketing efforts or customer journey, apply it to your conversion data, and get a clear view of how credit was distributed across the channels that touched a conversion.

These attribution reports could highlight the importance or impact of upper-funnel and awareness content that won’t necessarily get the credit it deserves when looking at a last click or ad-focused model. I can’t count the number of times I’ve seen investments in organic social traffic show up in the first click or time decay reports, but disappear in last click.
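To make that concrete, here is a small sketch of how the five rule-based models split credit for a single conversion. The 7-day half-life and the 40/20/40 position-based split are assumed defaults, and the function name and the journey are hypothetical.

```python
from collections import defaultdict


def attribute(path, model="last_click", half_life_days=7.0, end_weight=0.4):
    """Split one conversion's credit across the channels in `path`.

    path: list of (channel, days_before_conversion), ordered first touch to last.
    Returns {channel: credit}, with credits summing to 1.0.
    """
    n = len(path)
    if n == 0:
        return {}
    if model == "last_click":
        weights = [0.0] * (n - 1) + [1.0]
    elif model == "first_click":
        weights = [1.0] + [0.0] * (n - 1)
    elif model == "linear":
        weights = [1.0] * n
    elif model == "time_decay":
        # Weight halves for every half_life_days between the touch and the conversion.
        weights = [0.5 ** (days / half_life_days) for _, days in path]
    elif model == "position_based":
        if n == 1:
            weights = [1.0]
        elif n == 2:
            weights = [0.5, 0.5]
        else:
            # First and last touch get end_weight each; the rest is shared evenly.
            middle = (1.0 - 2 * end_weight) / (n - 2)
            weights = [end_weight] + [middle] * (n - 2) + [end_weight]
    else:
        raise ValueError(f"unknown model: {model}")
    total = sum(weights)
    credit = defaultdict(float)
    for (channel, _), weight in zip(path, weights):
        credit[channel] += weight / total
    return dict(credit)


# Hypothetical journey: organic social first, then email, then paid search.
journey = [("organic_social", 10), ("email", 4), ("paid_search", 0)]
for name in ("last_click", "first_click", "linear", "time_decay", "position_based"):
    print(name, attribute(journey, name))
```

In this journey organic social receives all of the credit under first click, a share under linear, time decay, and position-based, and nothing under last click.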

When Google replaced these with last click or data-driven attribution, which is a bit of a black box, it made determining the importance of these broader awareness marketing initiatives much more challenging.

It’s not that the new options are bad, exactly. The data-driven model uses machine learning to assign credit based on patterns in your conversion data, and Google’s position is that this is more accurate, especially in a cookie-reduced world. But while that may be true in some cases, it is also a model you cannot inspect, cannot replicate in a spreadsheet, and cannot explain to a client in a meeting. And because the underlying model updates itself, you cannot meaningfully compare results across time periods, or compare the assisted conversion values of specific campaigns.

For large advertisers with high conversion volumes and dedicated analytics teams, data-driven attribution is probably fine. The model has enough data to work with, and there are people whose job it is to interpret the outputs. For the marketing manager at a 50 to 500 person company running a handful of campaigns, which is most marketing managers, it is a significant step backwards in practical utility.

What was actually lost is not a feature. It is control. Rule-based attribution models were transparent by design. You could disagree with a model’s assumptions, switch to a different one, and immediately see how your channel credit changed. That transparency made it possible to have an informed conversation about what was actually driving performance.

The workaround most teams have landed on is some combination of UTM parameters, last-click data from individual ad platforms, and manual reconciliation in a spreadsheet. This works, after a fashion. It is also time-consuming, error-prone, and depends entirely on everyone applying UTM conventions correctly every time. In practice, they don’t.

CampaignCheck is my attempt to recapture some of that control as a standalone tool that a marketing team controls rather than a feature inside a platform with other priorities. It is campaign-scoped rather than site-wide, which sidesteps most of the consent and data collection complexity, and it uses rule-based models because transparency is the point, not a limitation.

It is not finished yet. But this is why it exists.