# ── 1 · Load required packages ───────────────────────────────────────────
library(tidyverse) # data manipulation and pipes
library(peRspective) # R client for Google/Jigsaw Perspective API
library(FactoMineR) # Principal‑Component Analysis
library(pROC) # ROC curves (for later evaluation)
Why Are We Here?
Goal: Demonstrate a clearer approach to interpreting logistic regression using Bayesian methods and HDI-ROPE analysis, illustrated through real-world SMS spam detection.
Case Study: Predicting whether an SMS message is spam based on linguistic toxicity patterns (captured through NLP+PCA), message length, and punctuation usage.
The Bottom Line: This tutorial shows how Bayesian modeling combined with marginal effects and HDI-ROPE analysis creates a more intuitive workflow for binary outcome analysis—avoiding the notorious “log-odds” interpretation problem while tackling the practical challenge of spam detection.
I assume you:
- Have basic familiarity with R and the tidyverse.
- Understand the fundamentals of regression analysis.
- Have encountered logistic regression before and are familiar with the interpretation frustration.
The Messages Behind the Data: Understanding SMS Spam in Context
Imagine receiving a text message: “URGENT!!! You have WON £1,000,000!!! Reply NOW with your bank details!!!”
Your brain instantly recognizes this as spam—but how? It’s not just excessive exclamation points or too-good-to-be-true offers. There’s a complex pattern of linguistic signals distinguishing legitimate messages from spam, which can vary significantly across cultural and linguistic contexts.
The dataset we’re exploring comes from the ExAIS SMS Spam project conducted in Nigeria, featuring 5,240 SMS messages (2,350 spam, 2,890 ham) collected from university community members aged 20–50.
The data contain the SPAM/HAM (not spam) classification and the text message itself. Using NLP magic, dimensionality reduction, and text analysis, I extracted some additional features:
is_spam - Our outcome variable; a binary classification indicating spam or legitimate communication.
pc_aggression & pc_incoherence - Principal components I extracted from Google’s Perspective API toxicity scores capturing sophisticated linguistic patterns:
pc_aggression: threatening, toxic, and insulting language patterns.
pc_incoherence: inflammatory, incoherent, or unsubstantial messages.
word_count - The total number of words per message (includes squared term to capture non-linear relationships).
exclamation_count - The number of exclamation points, a simple but telling spam feature.
Our central question is this: How do linguistic toxicity patterns (captured by PCA components) interact with message characteristics like length and punctuation to identify spam? And more importantly, how can we interpret these relationships meaningfully?
In the note below you can find the full NLP and PCA pipeline, with code.
Below is the exact, reproducible pipeline I used to transform raw text into the two tidy principal‑component dials (pc_aggression, pc_incoherence) you saw in the main post. Everything is wrapped in code chunks so you (or your future self) can copy‑paste the whole block into Quarto/R Markdown and run it end‑to‑end.
What is peRspective?
peRspective is a thin, tidyverse‑friendly wrapper around the Perspective API. It takes care of batching requests, retrying on rate limits, and returning scores as a clean data frame. Before running the chunk below you’ll need a free API key (set it once with Sys.setenv(PERSPECTIVE_API_KEY = "<your‑key>")).
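A quick base-R sanity check that the key is visible before you spend any quota (the key value is a placeholder):

```r
# One-time setup: expose the key to peRspective (replace the placeholder)
Sys.setenv(PERSPECTIVE_API_KEY = "<your-key>")

# Confirm the environment variable is actually set
nchar(Sys.getenv("PERSPECTIVE_API_KEY")) > 0
```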
Step 1 – Obtain linguistic scores from the Perspective API
Each message is sent to the API, which returns up to nine probabilities indicating the presence of attributes such as toxicity, threat, or incoherence.
# ⚠️ Disabled by default to avoid accidental quota use.
perspective_scores <- dataset_clean %>%
  prsp_stream(
    text = message,            # column containing the SMS text
    text_id = text_id,         # unique identifier for safe joins
    score_model = c("THREAT", "TOXICITY", "INSULT", "SPAM",
                    "INFLAMMATORY", "INCOHERENT", "UNSUBSTANTIAL",
                    "FLIRTATION", "PROFANITY"),
    safe_output = TRUE,        # masks content the API flags as unsafe
    verbose = TRUE)            # progress messages
Development tip:
When iterating on downstream scripts, save the scores once (write_csv()) and reload them to avoid repeated network calls: perspective_scores <- read_csv("~/perspective_scores_saved.csv")
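A slightly more defensive version of the same idea—compute once, cache to disk, and reload on later runs (the path is only an example):

```r
# Cache the API scores locally so re-runs don't hit the network
cache_path <- "~/perspective_scores_saved.csv"

if (!file.exists(cache_path)) {
  write_csv(perspective_scores, cache_path)   # first run: save the freshly fetched scores
}
perspective_scores <- read_csv(cache_path)    # later runs: reload from disk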
Step 2 – Merge, clean, and prepare for PCA
Occasionally a request fails or returns NA. We drop the few problematic rows and replace any missing attribute scores with zeros so PCA receives a complete numeric matrix.
scored_data <- dataset_clean %>%
  left_join(perspective_scores, by = "text_id") %>%
  filter(is.na(error) | error == "No Error") %>%          # drop rows whose API request failed
  mutate(across(THREAT:PROFANITY, ~ replace_na(.x, 0))) %>%
  select(-SPAM, -FLIRTATION)                               # remove two attributes found redundant here
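For completeness, here is a sketch of the PCA step (Step 3) using FactoMineR, which was loaded at the top; retaining exactly two components and mapping Dim.1/Dim.2 onto pc_aggression and pc_incoherence are illustrative assumptions, not the exact original chunk:

```r
# Step 3 (sketch) – reduce the remaining Perspective attributes to two components
pca_fit <- scored_data %>%
  select(THREAT:PROFANITY) %>%                   # the retained attribute scores
  PCA(scale.unit = TRUE, ncp = 2, graph = FALSE)

# Individual coordinates become the two "dials" used downstream
pca_scores <- as_tibble(pca_fit$ind$coord) %>%
  rename(pc_aggression  = Dim.1,                 # assumed mapping of components to names
         pc_incoherence = Dim.2) %>%
  mutate(text_id = scored_data$text_id)          # carry the id along for the later join
```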
Step 4 – Assemble the modelling table
Finally, we merge the PCA scores back with the pre‑computed surface features (word_count, exclamation_count, etc.) so the eventual model can consider both linguistic tone and formatting cues.
model_data <- scored_data %>%
  select(-THREAT:-PROFANITY) %>%
  left_join(pca_scores, by = "text_id")
At this point model_data is a tidy, analysis‑ready data frame: one row per SMS, interpretable columns for tone and structure, and no missing values to trip up downstream methods.
The Interpretation Challenge: Why Binary Outcomes Are Tricky
Logistic regression dominates binary outcome modeling, but before diving into its interpretation challenges, let’s see why we can’t just use the familiar linear regression like:
\[
\text{spam} \;=\; \beta_0 \;+\; \beta_1 \times \text{aggression}
\;+\; \beta_2 \times \text{word count} \;+\; \dots
\]
Let me demonstrate with some simulated spam data.
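Here is an illustrative sketch of that kind of simulation—the sample size, coefficients, and the use of brms for both fits are my assumptions rather than the original chunk (posterior predictive checks need a Bayesian fit):

```r
# Illustrative simulation sketch: binary spam outcomes driven by two predictors
library(brms)

set.seed(123)
n <- 500
sim_data <- tibble(
  aggression = rnorm(n),
  word_count = rnorm(n),
  spam       = rbinom(n, size = 1,
                      prob = plogis(-0.5 + 1.2 * aggression + 0.8 * word_count))
)

# Same linear predictor, two different assumptions about the outcome
linear_sim   <- brm(spam ~ aggression + word_count, data = sim_data,
                    family = gaussian(), refresh = 0)
logistic_sim <- brm(spam ~ aggression + word_count, data = sim_data,
                    family = bernoulli(), refresh = 0)

# Posterior predictive checks discussed in the next paragraph
pp_check(linear_sim,   ndraws = 50)
pp_check(logistic_sim, ndraws = 50)
```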
The first problem is that linear regression fundamentally misunderstands the nature of binary data. Let’s examine the posterior predictive checks—these diagnostic plots ask “if our model were true, what would data look like?” By generating many fake datasets from each model and comparing them to reality, we can see whether the model truly grasps the data-generating process:
Look at the stark mismatch! Our actual data (gray lines) consists solely of 0s and 1s—spam or not spam, no middle ground. Here’s the crucial distinction: while both models predict probabilities internally (e.g., “this message has 70% chance of being spam”), this plot shows what type of observed data each model’s mathematical structure implies. The logistic model (green bars) is specified for binary outcomes—it uses its internal probabilities to generate realistic 0s and 1s, creating those two distinct spikes. The linear model (red distribution), however, assumes continuous outcomes and generates spam “scores” spread out in a bell curve, as if we could observe a message being “0.7 spam.”
This mismatch isn’t trivial—it fundamentally undermines the model’s validity for binary outcomes. When a model’s mathematical structure doesn’t match the basic nature of your outcome variable, its estimates of relationships become suspect. The linear model’s equations assume continuous variation that doesn’t actually exist in binary data. The effect sizes, confidence intervals, and predictions all come from a model that, by construction, misrepresents the data.
But the problems run deeper. Let’s look at specific predictions across the range of our predictors:
The linear model draws a straight line that blissfully crosses into impossible territory, predicting probabilities well above 100% for messages with high aggression scores. This isn’t a minor edge case—any linear model with meaningful predictors will eventually produce impossible predictions at the extremes. It’s mathematically inevitable when you try to force a straight line onto a bounded outcome.
Among the many ways linear regression fails for binary data, we’ve seen two critical issues: it misunderstands the nature of the outcome (expecting continuous values when only 0s and 1s exist) and produces nonsensical predictions that escape probability bounds.
How Logistic Regression Fixes It
The mathematical solution is the (in)famous “S-curve” formula:
\[p(\text{spam}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{aggression} + \beta_2 \times \text{word count} + ...)}}\]
Inside the parentheses is still a straight-line blend of your predictors; the logistic function simply transforms this into a valid probability between 0 and 1.
The S-shaped curve elegantly solves both our problems: it respects the binary nature of data and keeps predictions within valid probability bounds. But this solution creates a new challenge: the curve’s varying steepness means that the effect of any predictor depends on where you are on the curve. A one-unit increase in aggression might change spam probability by 20 percentage points in the middle of the curve but only 2 percentage points near the edges.
A Peek Behind the Curtain: Log-Odds
To understand why interpretation becomes tricky, we need to peek at how logistic regression actually works. It achieves that S-curve by working in a transformed space called “log-odds.”
Think of it as changing measuring units—like converting temperature from Celsius to Fahrenheit, but for probabilities:
Odds rephrase probability: a 30% chance of spam corresponds to odds of 0.3 / 0.7 ≈ 0.43.
Log-odds take those odds and apply a logarithm. This stretches the probability scale: 0% becomes negative infinity, 100% becomes positive infinity, and everything else spreads out smoothly in between.
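A quick base-R sanity check of both conversions:

```r
p <- 0.30
odds     <- p / (1 - p)   # 0.3 / 0.7 ≈ 0.43
log_odds <- log(odds)     # ≈ -0.85 on the stretched-out scale

plogis(log_odds)          # and back again: returns 0.30
```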
On this stretched-out log-odds scale, we can finally draw our straight line (the right panel):
\[ \log\!\Bigl(\tfrac{p}{1-p}\Bigr) \;=\; \beta_0 \;+\; \beta_1 \times \text{aggression} \;+\; \beta_2 \times \text{word count} \;+\; \dots \]
The left panel shows the jittered 0/1 spam labels (grey points) alongside the model’s predicted probability
\(\hat{p} = p(\text{spam}=1)\), plotted as a solid blue curve on \([0,1]\).
The right panel displays the same model on the log-odds scale with a dashed red line for
\(\operatorname{logit}(\hat{p}) = \ln\!\bigl(\tfrac{\hat{p}}{1-\hat{p}}\bigr)\), which straightens the S-shaped curve.
In both panels, grey arrows trace one example predictor value and illustrate how it maps from the probability curve to the straight line in log-odds space.
The Interpretation Problem
Imagine explaining the model to a spam-filter engineer or product manager:
“A one-unit increase in the aggression principal component raises the log-odds of spam by 0.85.”
Cue blank stares.
So we translate to odds ratios:
“Each unit increase in aggression multiplies the odds of spam by 2.33 (\(e^{0.85}\)).”
Still murky! Even seasoned analysts struggle with odds because humans naturally think in probabilities, not odds. Add interaction terms (e.g., how aggression’s effect changes with message length) and interpretation gets even thornier.
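A small arithmetic detour shows why odds ratios stay murky: the same odds ratio of 2.33 implies very different probability changes depending on the baseline probability (the numbers continue the 0.85 log-odds example):

```r
# Same odds ratio, different baselines, very different probability shifts
baseline_p    <- c(0.10, 0.50, 0.90)
baseline_odds <- baseline_p / (1 - baseline_p)

new_odds <- baseline_odds * exp(0.85)     # multiply odds by ≈ 2.33
new_p    <- new_odds / (1 + new_odds)     # convert back to probabilities

round(new_p - baseline_p, 2)              # ≈ +0.11, +0.20, +0.05
```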
I’m going to put in tremendous effort to cut through that fog—re-expressing results exclusively on the familiar 0 %–100 % probability scale, and showing you practical tricks to keep interpretations clear in Bayesian statistics.
The marginaleffects package beautifully transforms model results into interpretable probability statements. More on that later. But I won’t stop there—I’ll also combine it with Bayesian estimation to create an even more powerful analytical framework.
Bayesian methods solve several technical problems that plague binary outcome models:
Complete Separation Issues
The Problem: Complete separation occurs when a predictor perfectly divides outcome categories. Imagine if every message with more than 10 exclamation points was spam while no message with fewer was—traditional maximum likelihood estimation would produce infinite coefficient estimates.
The Bayesian Solution: Priors act as natural regularizers, keeping estimates finite and meaningful even in extreme cases. Instead of model failure, we get sensible uncertainty quantification around our estimates. This is particularly important in spam detection where new tactics constantly emerge, potentially creating separation in specific feature combinations.
Robust Computation
The Problem: Our model includes interactions between PCA components and text features—these complex relationships often cause convergence failures in traditional frameworks. Researchers are forced to simplify their models, potentially missing important patterns in how spam characteristics combine.
The Bayesian Solution: Modern implementations like brms handle complex model structures that would break optimization-based methods, letting us build models that match our theoretical questions rather than computational constraints.
This Bayesian foundation, combined with marginaleffects for interpretation, gives us the best of both worlds: robust estimation and intuitive communication of results.
Making Bayesian Binary Models Practical: From Priors to Inference
There are two main obstacles that often discourage researchers from adopting Bayesian methods: choosing appropriate priors and interpreting inference from posterior distributions. My goal here is to show that both are surprisingly straightforward for logistic hierarchical models—especially when we combine the right tools.
Let’s build this model step by step, starting with prior specification:
The Prior Specification Problem
When you fit a Bayesian logistic model in brms, your regression coefficients live on the log-odds scale (each β is a log-odds ratio for a one-unit bump in the predictor). This creates an immediate headache: what does a “reasonable” prior look like for log-odds? Is normal(0, 1) too wide? Too narrow? It’s hard to have intuitions about log-odds because we don’t naturally think that way. How do we translate our existing knowledge (or intuitions) about effect sizes into appropriate priors for log-odds coefficients?
There’s a clever heuristic that can help us translate our familiar Cohen’s d intuitions into the log-odds world.
A Useful Translation Trick
Sánchez-Meca, Marín-Martínez, and Chacón-Moscoso (2003) developed a relationship between odds ratios and Cohen’s d for meta-analytic contexts:
\[d = \log(\text{OR}) \times \frac{\sqrt{3}}{\pi}\]
Rearranging gives us:
\[\log(\text{OR}) = d \times \frac{\pi}{\sqrt{3}}\] Because a logistic distribution has variance \(\pi^2/3\) while a standard normal has variance 1, multiplying a log-odds ratio by \(\sqrt{3}/\pi\) (≈ 0.551) rescales it into d. Handy, but only a rough guide—so use it with caution.
We can use this as a starting point for prior specification. If we expect mostly small-to-medium effects (d ≈ 0.2-0.5), we can translate those familiar benchmarks into log-odds standard deviations.
A Practical Workflow
Here’s how I use this approach:
Step 1: Think in Cohen’s d terms
For spam detection, most individual features probably have small to medium effects:
Small effect: d ≈ 0.2
Medium effect: d ≈ 0.5
Large effect: d ≈ 0.8
Step 2: Convert to log-odds standard deviation
Because we don’t want the prior to bias estimates in either direction, we center it at zero (no effect); we can then set the standard deviation of our normal prior using:
\[\sigma_{\text{log-odds}} = d \times \frac{\pi}{\sqrt{3}}\]
Let me show you how this works in practice:
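A quick conversion makes those benchmarks concrete (these are the values referenced in the next paragraph):

```r
# Translate Cohen's d benchmarks into prior SDs on the log-odds scale
tibble(effect = c("small", "medium", "large"),
       d      = c(0.2, 0.5, 0.8)) %>%
  mutate(prior_sd = d * pi / sqrt(3))   # ≈ 0.36, 0.91, 1.45
```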
Small expected effects lead to narrow prior distributions, which create stronger regularization. When we expect our predictors to have small effects (d = 0.2), the resulting prior standard deviation of ~0.36 keeps coefficients tightly concentrated around zero. This prevents overfitting by shrinking coefficients toward zero unless the data provide strong evidence otherwise. In contrast, expecting medium effects (d = 0.5) gives us a wider prior (σ ≈ 0.91) that allows coefficients more freedom to deviate from zero.
This makes intuitive sense: if we genuinely believe our features have small effects, we should be skeptical of large coefficient estimates and let the prior express that skepticism through increased shrinkage.
There you go! We have our priors! Let’s specify our model structure.
Choosing Predictors
The choice of predictors typically depends on your goals—theory testing versus prediction optimization. Here, I’ve chosen predictors for tutorial purposes rather than optimal spam detection. Each one helps illustrate different aspects of Bayesian logistic regression interpretation:
is_spam ~ (pc_aggression + pc_incoherence) *
          (word_count + exclamation_count)
These predictors give us different story types to tell:
pc_aggression: “How does linguistic hostility signal spam?”
word_count: “Do spammers prefer short punchy messages or longer sales pitches?”
exclamation_count: “When do exclamation points become suspicious?”
Interactions: “How do these patterns combine and depend on each other?”
Bringing It All Together: Fitting the Model in brms
Now let’s translate our prior intuitions and model structure into actual brms code. This is where everything comes together—our Cohen’s d-derived priors meet our pedagogically chosen predictors.
library(brms)
library(bayestestR)
# Scaling the predictors first
model_data <- model_data %>%
  mutate(
    across(
      all_of(c("pc_aggression",
               "word_count",
               "pc_incoherence",
               "exclamation_count")),
      ~ as.numeric(scale(.x))
    )
  )
# Set our priors based on expected small-to-medium effects
d_expected <- 0.3
prior_sd   <- d_expected * pi / sqrt(3) # ≈ 0.54

# Define priors for all coefficients
priors <- c(
  prior(normal(0, 0.54), class = b) # All slopes get the same prior
)

# Fit the model
spam_model <- brm(
  is_spam ~ (pc_aggression + pc_incoherence) *
    (word_count + exclamation_count),
  data = model_data,
family = bernoulli(),
prior = priors,
cores = 4,
iter = 2000,
warmup = 1000,
chains = 4,
seed = 123
)
Here are some basic diagnostics for the model:
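The chunks behind these plot objects aren’t reproduced here; the sketch below shows one way to build pp_plot, rhat_plot, and roc_plot with brms and pROC (the specific options are my choices):

```r
# Posterior predictive check: do simulated outcomes look like the observed 0/1 data?
pp_plot <- pp_check(spam_model, ndraws = 100)

# R-hat convergence diagnostic for all parameters
rhat_plot <- mcmc_plot(spam_model, type = "rhat")

# ROC curve from the posterior-mean fitted probabilities (pROC was loaded at the top)
pred_probs <- fitted(spam_model)[, "Estimate"]
roc_obj    <- roc(model_data$is_spam, pred_probs)
roc_plot   <- ggroc(roc_obj)
```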
pp_plot
rhat_plot
roc_plot
It’s time to interpret the model’s effects. But how?
HDI+ROPE: A Bayesian Path to Statistical Inference
Traditional null hypothesis significance testing (NHST) with p-values pushed us into a binary world: an effect is either “significant” or “not significant” based on an arbitrary threshold (typically p < .05). This approach has been criticized extensively—not least because it collapses a continuous measure of evidence into a dichotomous decision. It’s like trying to describe a complex spam pattern with just “suspicious” or “not suspicious”—when in reality, there’s a rich spectrum of possible spam indicators.
Bayesian inference offers a more nuanced perspective through posterior distributions. Instead of a single p-value, we get an entire distribution of plausible parameter values. But this richness creates a new challenge: how do we make practical decisions without being overwhelmed by distributional complexity?
John Kruschke (2014, 2018) developed a powerful framework that provides a principled alternative to traditional significance testing.
The Highest Density Interval (HDI)
The HDI contains a specified percentage of the most probable parameter values, where every value inside the interval has higher probability density than any value outside. Unlike frequentist confidence intervals, the HDI has a direct probabilistic interpretation: given our data and model, there’s an X% probability that the true parameter value lies within the X% HDI (e.g., 95% probability for a 95% HDI).
While Kruschke originally recommended using 95% HDIs (and later suggested 89% as potentially better), we’ll take advantage of the full posterior distribution (100% HDI)—using the complete picture of uncertainty in our spam analysis.
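For concreteness, an HDI can be read straight off the fitted model’s posterior draws; the coefficient and the 95% level below are purely illustrative (and note this coefficient is still on the log-odds scale):

```r
# Extract posterior draws and summarize one coefficient with an HDI
draws <- as_draws_df(spam_model)
bayestestR::hdi(draws$b_pc_incoherence, ci = 0.95)
```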
The Region of Practical Equivalence (ROPE)
The ROPE is essentially us asking, “What effect is so small that I wouldn’t care about it in practice?” For illustration, we might set our ROPE at ±5%—meaning any feature that changes spam probability by less than 5 percentage points either way is considered practically negligible.
Importantly, ROPE represents a shift from traditional significance testing to effect size reasoning. Rather than asking “Is there any effect, no matter how small?” (significance testing), we ask “Is the effect large enough to matter?” (effect size evaluation). If adding one exclamation point increases spam probability by 0.01%, that might be statistically detectable with enough data, but no spam filter designer would ever notice or care. The ROPE lets us define this “too small to matter” range—distinguishing between effects that are statistically detectable versus practically meaningful for spam detection. This focus on effect magnitude rather than mere detectability represents a fundamental shift toward more substantive scientific inference.
Making Decisions with HDI+ROPE
When using the full posterior distribution approach, the decision rules are straightforward. Using our 5% example threshold (though other values may be appropriate for different contexts):
Reject the null hypothesis if less than 2.5% of the posterior distribution falls within the ROPE. This means we have strong evidence for a practically meaningful effect. The visualization below shows this as a distribution clearly extending beyond the ROPE boundaries.
Accept the null hypothesis if more than 97.5% of the posterior distribution falls within the ROPE. This means we have evidence for the practical absence of an effect—the parameter is essentially equivalent to zero in spam detection terms. This appears as a distribution tightly concentrated within the ROPE zone.
Remain undecided if the percentage falls between these thresholds. The evidence is inconclusive at our current precision level, shown as distributions that substantially span the ROPE boundaries.
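Operationally, the rule is just counting draws. Here is a toy illustration with a simulated posterior on the probability scale (the numbers are invented for the example):

```r
# Toy posterior for a probability-scale effect: mean 0.08, sd 0.02
set.seed(1)
toy_posterior <- rnorm(4000, mean = 0.08, sd = 0.02)

# Share of draws inside the ±0.05 ROPE
pct_in_rope <- mean(abs(toy_posterior) <= 0.05) * 100
pct_in_rope   # roughly 7% here: more than 2.5%, less than 97.5% -> remain undecided
```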
The following visualization demonstrates these decision rules in action, showing four distinct patterns you might encounter when analyzing spam features. Each scenario represents a different relationship between the posterior distribution and our example ±5% ROPE, illustrating how the same statistical framework can yield different conclusions depending on where the evidence falls. Notice how some effects might be statistically detectable (consistent small impacts) yet still fall within our practical equivalence zone—a nuance that traditional p-value approaches often miss.
Analyzing the Spam data with marginaleffects and bayestestR
After this conceptual introduction, how do we actually implement HDI+ROPE in practice? We’ll harness two powerful R packages that make this analysis both rigorous and accessible.
Why marginaleffects for Logistic Regression?
This tutorial is already running long, so we won’t dive deeply into why marginaleffects is transformative for logistic regression analysis. But in brief, this package solves several critical pain points:
- Meaningful effect sizes: Rather than wrestling with log-odds coefficients that nobody intuitively understands, marginaleffects automatically translates everything to the probability scale—telling us how spam probability actually changes, not just abstract log-odds ratios. We can’t escape logistic regression’s inherent property where effects vary across different regions of the S-curve, but we can measure and report these varying effects in a transparent, interpretable way.
- Interaction interpretation made simple: Our model includes four interactions ((A+B)*(C+D)), which would traditionally require careful algebra and chain rule calculations. The package handles all the calculus behind the scenes.
- Uncertainty propagation done right: Every estimate comes with properly computed standard errors that account for the non-linear transformations inherent in logistic regression.
- Seamless Bayesian integration: The package works identically with brms models as with frequentist ones, automatically extracting posterior draws when needed. This means our workflow remains consistent whether we’re computing average marginal effects or testing complex hypotheses.
For a comprehensive exploration of these capabilities, Andrew Heiss provides an excellent deep dive here.
Why bayestestR for ROPE?
While marginaleffects handles the effect computation, bayestestR provides the decision-making framework. It offers battle-tested functions to calculate the proportion of the posterior distribution falling within our ROPE. This seamless integration means we can move from posterior distributions to practical decisions without manual probability calculations or custom functions.
Making Sense of the Results
Let’s see this powerful combination in action by computing the average marginal effects for each predictor. These tell us how much each feature changes the probability of spam classification, averaged across all observations in our data:
library(marginaleffects)
library(tidybayes)
library(ggdist)
rope <- c(-0.05, 0.05)
ci   <- 1.0

main_effects <- avg_slopes(spam_model, type = "response")

bayestestR::ci(main_effects, ci = ci, method = "HDI")
Highest Density Interval
term | contrast | 100% HDI
---------------------------------------------
exclamation_count | dY/dX | [ 0.00, 0.05]
pc_aggression | dY/dX | [-0.07, -0.01]
pc_incoherence | dY/dX | [-0.19, -0.14]
word_count | dY/dX | [ 0.05, 0.11]
Since we standardized all predictors before analysis, these effects represent the impact of a one standard deviation change in each feature.
Looking solely at the HDIs, the first thing that jumps out is that all four predictors’ main effects are “statistically significant” in the traditional sense—not a single HDI includes zero.
In the old world of p-values, we’d declare victory—four significant predictors! Pop the champagne! But wait… let’s look at what happens when we apply our HDI+ROPE examination:
rope(main_effects, range = rope, ci = ci)
# Proportion of samples inside the ROPE [-0.05, 0.05]:
term | contrast | inside ROPE
------------------------------------------
exclamation_count | dY/dX | 99.92 %
pc_aggression | dY/dX | 86.83 %
pc_incoherence | dY/dX | 0.00 %
word_count | dY/dX | 0.00 %
Remember, we set our ROPE at ±5% (i.e., ±0.05 on the probability scale)—any effect smaller than a 5 percentage point change in spam probability is too small to matter for practical spam filtering. Now watch what happens:
Two of our four “statistically significant” predictors fall predominantly within the ROPE:
- exclamation_count: 99.92% of the posterior is inside ROPE - supporting the null hypothesis
- pc_aggression: 86.83% of the posterior is inside ROPE - which is an inconclusive result that leans toward the null hypothesis.
The other two predictors show practical significance:
- pc_incoherence: 0% inside the ROPE (averaging around -16.5%)
- word_count: 0% inside the ROPE (averaging around +8%)
This is exactly why HDI+ROPE analysis is so powerful. Traditional significance testing would have given us four “significant” results without distinguishing their practical importance. Our Bayesian approach reveals which effects actually matter: linguistic incoherence strongly signals legitimate messages (reducing spam probability by about 16.5%), while longer messages are more likely to be spam (increasing probability by about 8%).
The visualization brings this distinction to life. The gray distributions (exclamation_count and pc_aggression) cluster around zero, mostly contained within our ROPE boundaries—statistically detectable but practically negligible. In contrast, both pc_incoherence and word_count distributions (in blue) extend well beyond the ROPE. The incoherence effect is our strongest predictor, while the positive word_count effect suggests that spammers tend to write longer messages—perhaps needing more words to spin their tales of exotic princes or miracle cures.
Exploring Interactions
So far, we’ve examined how each feature independently affects spam probability. But real-world spam detection is more nuanced—the impact of one feature often depends on the context provided by others. Because our model specification includes interactions, we’re explicitly allowing for these interdependencies.
When working with continuous predictors, interactions create a little twist: the effect of one variable literally changes as we move along the range of another variable. Think of it this way—the impact of aggressive language on spam probability might be minimal in very short messages (where there’s little room for aggression to manifest) but could become substantial in longer messages (where sustained aggressive tone becomes more apparent).
To capture these shifting relationships, we need to examine how effects vary across different contexts. The avg_slopes function with datagrid helps us do exactly this—it calculates the average effect across specified points along the moderating variable’s distribution. This gives us a single summary measure of how strong the interaction effect is overall.
Let me demonstrate with our four interaction pairs:
# Interaction 1: How does pc_incoherence's effect vary with message length?
incoherence_by_wordcount <- avg_slopes(
  spam_model,
  variables = "pc_incoherence",
  newdata = datagrid(
    word_count = quantile(model_data$word_count,
                          probs = c(.05, .25, .50, .75, .95))
  ),
  type = "response"
)
# Interaction 2: How pc_incoherence's effect changes with exclamation count
incoherence_by_exclamation <- avg_slopes(
  spam_model,
  variables = "pc_incoherence",
  newdata = datagrid(
    exclamation_count = quantile(model_data$exclamation_count,
                                 probs = c(.05, .25, .50, .75, .95))
  ),
  type = "response"
)
# Interaction 3: How pc_aggression's effect changes with exclamation count
aggression_by_exclamation <- avg_slopes(
  spam_model,
  variables = "pc_aggression",
  newdata = datagrid(
    exclamation_count = quantile(model_data$exclamation_count,
                                 probs = c(.05, .25, .50, .75, .95))
  ),
  type = "response"
)
# Interaction 4: How pc_aggression's effect changes with word count
aggression_by_wordcount <- avg_slopes(
  spam_model,
  variables = "pc_aggression",
  newdata = datagrid(
    word_count = quantile(model_data$word_count,
                          probs = c(.05, .25, .50, .75, .95))
  ),
  type = "response"
)
bind_rows(
rope(incoherence_by_wordcount, range = rope, ci = ci) %>%
mutate(Parameter = "incoherence × word_count"),
rope(incoherence_by_exclamation, range = rope, ci = ci) %>%
mutate(Parameter = "incoherence × exclamation"),
rope(aggression_by_exclamation, range = rope, ci = ci) %>%
mutate(Parameter = "aggression × exclamation"),
rope(aggression_by_wordcount, range = rope, ci = ci) %>%
mutate(Parameter = "aggression × word_count")
)
# Proportion of samples inside the ROPE [-0.05, 0.05]:
Parameter | inside ROPE
---------------------------------------
incoherence × word_count | 0.00 %
incoherence × exclamation | 0.00 %
aggression × exclamation | 9.93 %
aggression × word_count | 63.48 %
Looking at these interaction results, distinct patterns emerge across our four combinations. The two interactions involving linguistic incoherence show substantial effects that completely escape our ROPE boundaries (0% inside). The aggression interactions tell a more varied story—the aggression × exclamation interaction falls just short of a decisive claim of practical importance, with only 9.93% of the posterior inside the ROPE, while the aggression × word_count interaction is inconclusive at 63.48% inside the ROPE.
Think about what this means in practical terms. When a message scores high on incoherence—perhaps it’s a hastily typed personal message or an auto-generated notification with templated chunks—it’s substantially less likely to be spam. This effect averages around -20% across different message lengths and punctuation patterns. The consistency is striking: whether it’s a brief “Running late, see u soon!” or a longer rambling message full of typos and incomplete thoughts, linguistic incoherence signals authenticity rather than commercial intent.
These patterns reveal how spam characteristics combine in practice. The incoherence effect remains robust across different contexts—whether messages are short or long, heavily punctuated or not, linguistic incoherence consistently signals legitimate communication with effects around -17% to -19%. This stability suggests that real human messiness in texting is a marker of trustworthiness that transcends other message characteristics.
The aggression × exclamation interaction deserves special attention. While aggressive language alone showed negligible main effects, its combination with heavy exclamation mark usage could create a meaningful spam signal. This makes intuitive sense: the pattern of aggressive tone plus excessive punctuation (“URGENT!!! CLAIM NOW!!!”) represents a classic spam signature that our model successfully identifies. The interaction captures something neither component could detect alone—the multiplicative effect of multiple spam tactics used together.
Here’s the visualization showing all four interaction effects:
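If you want to build a comparable view yourself, marginaleffects ships a plotting helper; the call below sketches one of the four interactions with default styling (my choice, not the original figure code):

```r
# How pc_aggression's probability-scale slope shifts across exclamation_count;
# analogous calls cover the other three interaction pairs
plot_slopes(spam_model,
            variables = "pc_aggression",
            condition = "exclamation_count",
            type      = "response")
```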
Wrapping Up: A New Lens for Binary Classification
We started this journey frustrated with log-odds coefficients that nobody could interpret. Through the combination of Bayesian inference, marginal effects, and HDI+ROPE analysis, we’ve transformed an opaque logistic regression into a story that is easier to understand.
But the real victory here isn’t just about spam detection. It’s about the analytical framework we’ve demonstrated. We tackled one of Bayesian analysis’s most intimidating challenges—setting priors for log-odds coefficients—by developing a practical heuristic that translates familiar Cohen’s d effect sizes into appropriate prior distributions. By combining brms for robust Bayesian estimation, marginaleffects for interpretable effect sizes, and bayestestR for principled decision-making, we’ve created a workflow that turns statistical significance into practical insight. No more explaining odds ratios to confused stakeholders. No more pretending that p < 0.05 means something matters in the real world.
The HDI+ROPE approach deserves special emphasis. It elegantly solves the fundamental tension in applied statistics: distinguishing between effects we can detect and effects that actually matter. In our analysis, we found multiple “statistically significant” predictors that were practically useless—a distinction that traditional methods would have missed entirely.
For practitioners working with binary outcomes—whether in spam detection, medical diagnosis, customer churn, or any other domain—this framework offers a path forward. Set your ROPE based on domain expertise, specify your priors using the Cohen’s d translation trick, fit your model with confidence, and let the posterior distributions tell you not just what’s real, but what’s worth caring about. The result is statistical analysis that speaks the language of practical decision-making, turning the notorious complexity of logistic regression into insights that drive action.