1 Mathematical Foundations
1.1 Binomial Distribution Foundation
The BinomialTree algorithm is built on the assumption that the target data follows a binomial distribution. For each observation \(i\), we have:
- \(k_i\): number of successes (events)
- \(n_i\): number of trials (exposure)
- \(p_i\): true probability of success (unknown, to be estimated)
The likelihood for observation \(i\) under a binomial distribution is:
\[P(k_i | n_i, p) = \binom{n_i}{k_i} p^{k_i} (1-p)^{n_i - k_i}\]
The log-likelihood for observation \(i\) is:
\[\ell_i(p) = \log\binom{n_i}{k_i} + k_i \log(p) + (n_i - k_i) \log(1-p)\]
For a set of observations in a node, the total log-likelihood is:
\[\ell_{node}(p) = \sum_{i \in node} \ell_i(p)\]
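In code, the node log-likelihood is a direct translation of these formulas. The sketch below is illustrative only (NumPy and SciPy assumed; `node_log_likelihood` is not part of any library API):

```python
import numpy as np
from scipy.special import gammaln


def node_log_likelihood(k, n, p, eps=1e-12):
    """Binomial log-likelihood of a node at probability p.

    k and n are arrays of successes and trials for the observations in the node.
    """
    k = np.asarray(k, dtype=float)
    n = np.asarray(n, dtype=float)
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0) at p = 0 or 1
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)  # log C(n_i, k_i)
    return np.sum(log_binom + k * np.log(p) + (n - k) * np.log(1.0 - p))
```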
1.2 Why Model Counts Instead of Rates?
A key insight is that we model the count data directly rather than the rate \(k_i/n_i\). This is motivated by several factors:
1.2.1 Statistical Properties
For rare events where \(p\) is small and \(n_i\) varies significantly:
- Variance Instability: The variance of \(k_i/n_i\) is \(p(1-p)/n_i\), which depends on \(n_i\)
- Distribution Shape: The distribution of \(k_i/n_i\) becomes highly skewed and poorly behaved for small \(p\) and varying \(n_i\)
- Information Loss: Converting to rates discards the exposure information that affects uncertainty
1.2.2 Practical Implications
Consider two observations:
- Observation A: 1 success out of 10 trials (\(k/n = 0.1\))
- Observation B: 10 successes out of 100 trials (\(k/n = 0.1\))
Both have the same rate but vastly different uncertainty. The binomial likelihood naturally accounts for this difference through the exposure term.
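To make the difference concrete, the standard error of the observed rate \(k/n\) at \(\hat{p} = 0.1\) is roughly three times larger for observation A than for observation B:
\[SE_A = \sqrt{\frac{0.1 \times 0.9}{10}} \approx 0.095, \qquad SE_B = \sqrt{\frac{0.1 \times 0.9}{100}} \approx 0.030\]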
1.3 Splitting Criteria
1.3.1 Objective Function
At each node, we seek the split that maximizes the combined log-likelihood of the resulting child nodes:
\[\text{Split Quality} = \ell_{left}(\hat{p}_{left}) + \ell_{right}(\hat{p}_{right}) - \ell_{parent}(\hat{p}_{parent})\]
Where each \(\hat{p}\) is the maximum likelihood estimate computed over the observations in the corresponding node: \(\hat{p} = \frac{\sum k_i}{\sum n_i}\)
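A minimal sketch of this computation, reusing the `node_log_likelihood` helper from the earlier sketch (names are illustrative, not the library API):

```python
import numpy as np


def p_hat(k, n):
    """Maximum likelihood estimate of the node probability: sum(k) / sum(n)."""
    return np.sum(k) / np.sum(n)


def split_quality(k_left, n_left, k_right, n_right):
    """Log-likelihood gain from splitting a parent node into left/right children."""
    k_parent = np.concatenate([k_left, k_right])
    n_parent = np.concatenate([n_left, n_right])
    ll_parent = node_log_likelihood(k_parent, n_parent, p_hat(k_parent, n_parent))
    ll_left = node_log_likelihood(k_left, n_left, p_hat(k_left, n_left))
    ll_right = node_log_likelihood(k_right, n_right, p_hat(k_right, n_right))
    return ll_left + ll_right - ll_parent
```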
1.3.2 Numerical Features
For numerical features, we:
- Sort observations by feature value
- Consider all possible split points (midpoints between consecutive unique values)
- For computational efficiency, subsample split points if there are more than max_numerical_split_points unique values
- Handle missing values by always assigning them to the right child
The algorithm efficiently computes split statistics by maintaining cumulative sums as it iterates through sorted values.
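One way to implement this scan is shown below. It exploits the fact that the log-binomial-coefficient terms cancel between parent and children, so only the probability-dependent part of the log-likelihood is needed, evaluated at aggregate counts. This is a sketch under the definitions above, not the library's internal implementation:

```python
import numpy as np


def _ll_terms(k, n, eps=1e-12):
    """Probability-dependent part of the node log-likelihood at the MLE p = k / n.

    The log C(n_i, k_i) terms cancel in the parent-vs-children gain, so only
    k*log(p) + (n-k)*log(1-p) is needed, with k and n taken as aggregate counts.
    """
    p = np.clip(k / n, eps, 1.0 - eps)
    return k * np.log(p) + (n - k) * np.log(1.0 - p)


def best_numerical_split(x, k, n):
    """Scan candidate thresholds for a numerical feature using cumulative sums.

    Returns (best_threshold, best_gain). Split-point subsampling and
    missing-value handling are omitted for brevity.
    """
    x, k, n = np.asarray(x, float), np.asarray(k, float), np.asarray(n, float)
    order = np.argsort(x)
    x, k, n = x[order], k[order], n[order]
    k_cum, n_cum = np.cumsum(k), np.cumsum(n)
    k_tot, n_tot = k_cum[-1], n_cum[-1]
    ll_parent = _ll_terms(k_tot, n_tot)

    best_gain, best_threshold = -np.inf, None
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue  # thresholds lie at midpoints between distinct values only
        gain = (_ll_terms(k_cum[i], n_cum[i])
                + _ll_terms(k_tot - k_cum[i], n_tot - n_cum[i])
                - ll_parent)
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i + 1]) / 2.0
    return best_threshold, best_gain
```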
1.3.3 Categorical Features
For categorical features with \(C\) categories, we use an optimal grouping strategy:
- Sort by Rate: Order categories by their estimated success rate \(\hat{p}_c = \frac{\sum_{i \in c} k_i}{\sum_{i \in c} n_i}\)
- Optimal Grouping: Consider the \(C - 1\) contiguous splits of the sorted categories into two groups
- Degrees of Freedom: The split has \(df = C - 1\) degrees of freedom for hypothesis testing
This approach is theoretically optimal for binomial data and avoids the exponential complexity of considering all possible category groupings.
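The same cumulative-sum trick applies once the categories are sorted by rate; a hedged sketch, reusing the `_ll_terms` helper above (names are illustrative):

```python
import numpy as np


def best_categorical_split(categories, k, n):
    """Sort categories by success rate, then evaluate the C - 1 contiguous splits.

    Returns (left_category_set, best_gain).
    """
    categories = np.asarray(categories)
    k, n = np.asarray(k, float), np.asarray(n, float)

    cats = np.unique(categories)
    k_c = np.array([k[categories == c].sum() for c in cats])
    n_c = np.array([n[categories == c].sum() for c in cats])

    order = np.argsort(k_c / n_c)          # sort categories by estimated rate
    cats, k_c, n_c = cats[order], k_c[order], n_c[order]

    k_cum, n_cum = np.cumsum(k_c), np.cumsum(n_c)
    k_tot, n_tot = k_cum[-1], n_cum[-1]
    ll_parent = _ll_terms(k_tot, n_tot)

    best_gain, best_split = -np.inf, None
    for i in range(len(cats) - 1):         # the C - 1 contiguous splits
        gain = (_ll_terms(k_cum[i], n_cum[i])
                + _ll_terms(k_tot - k_cum[i], n_tot - n_cum[i])
                - ll_parent)
        if gain > best_gain:
            best_gain, best_split = gain, set(cats[: i + 1])
    return best_split, best_gain
```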
1.4 Statistical Stopping Criteria
1.4.1 Motivation
Traditional decision trees often require external validation or pruning to prevent overfitting. BinomialTree incorporates statistical hypothesis testing directly into the splitting process, inspired by Conditional Inference Trees (ctree).
1.4.2 Hypothesis Testing Framework
For each potential split, we test:
- Null Hypothesis (\(H_0\)): The split provides no improvement over the parent node
- Alternative Hypothesis (\(H_1\)): The split provides significant improvement
1.4.3 Likelihood Ratio Test
For each potential split, we calculate a likelihood ratio test statistic to assess whether the split provides a statistically significant improvement over the parent node.
The test statistic compares the log-likelihood of the data under two models:
- Null model: Single binomial distribution for the entire parent node
- Alternative model: Separate binomial distributions for left and right child nodes
The likelihood ratio test statistic is:
\[LR = 2(\ell_{children} - \ell_{parent}) = 2[(\ell_{left} + \ell_{right}) - \ell_{parent}]\]
Where:
- \(\ell_{parent}\) is the log-likelihood of the parent node with estimated probability \(\hat{p}_{parent} = \frac{\sum k_i}{\sum n_i}\)
- \(\ell_{left}\) is the log-likelihood of the left child with estimated probability \(\hat{p}_{left}\)
- \(\ell_{right}\) is the log-likelihood of the right child with estimated probability \(\hat{p}_{right}\)
Under the null hypothesis \(H_0\) (no benefit from splitting), this test statistic follows a chi-squared distribution with degrees of freedom:
- Numerical features: \(df = 1\) (one additional parameter: the split threshold)
- Categorical features: \(df = C - 1\) (where \(C\) is the number of categories being split)
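The test itself reduces to a few lines; a minimal sketch using SciPy's chi-squared survival function (the function name is illustrative):

```python
from scipy.stats import chi2


def lr_test_p_value(ll_parent, ll_left, ll_right, df):
    """Likelihood ratio test p-value for a candidate split.

    df = 1 for a numerical feature, C - 1 for a categorical feature with C categories.
    """
    lr = 2.0 * ((ll_left + ll_right) - ll_parent)
    lr = max(lr, 0.0)        # guard against tiny negative values from round-off
    return chi2.sf(lr, df)   # survival function: P(Chi2_df >= lr)
```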
1.4.4 Bonferroni Correction
Since we test multiple features at each node, we apply Bonferroni correction:
\[p_{adjusted} = \min(1, p_{raw} \times m)\]
Where \(m\) is the number of features tested. The split is considered significant if \(p_{adjusted} < \alpha\).
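For example, with \(m = 5\) features tested and a smallest raw p-value of \(0.008\):
\[p_{adjusted} = \min(1, 0.008 \times 5) = 0.04,\]
which is significant at \(\alpha = 0.05\).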
1.4.5 Implementation Details
The stopping procedure follows these steps:
- Pre-split Checks: Verify minimum sample requirements and maximum depth
- Find Best Split: For each feature, find the split that maximizes log-likelihood gain
- Calculate P-values: Compute likelihood ratio test p-value for each feature’s best split
- Multiple Testing Correction: Apply Bonferroni correction to the minimum p-value
- Decision: Split only if the adjusted p-value satisfies \(p_{adjusted} < \alpha\)
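The last three steps might be sketched as follows, reusing `lr_test_p_value` from above (the data structures and names are illustrative, not the library's API):

```python
def should_split(ll_parent, best_splits, alpha=0.05):
    """Decide whether to split a node, given each feature's best split.

    best_splits maps feature name -> (ll_left, ll_right, df); the pre-split
    checks (minimum samples, maximum depth) are assumed to have passed.
    """
    if not best_splits:
        return False
    p_values = [
        lr_test_p_value(ll_parent, ll_l, ll_r, df)
        for ll_l, ll_r, df in best_splits.values()
    ]
    m = len(p_values)                          # number of features tested
    p_adjusted = min(1.0, min(p_values) * m)   # Bonferroni correction
    return p_adjusted < alpha
```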
1.5 Comparison to Permutation Testing
The BinomialTree approach differs fundamentally from permutation-based methods like those used in Conditional Inference Trees (ctree):
1.5.1 Ctree Permutation Approach
Conditional Inference Trees use permutation testing because they make no assumptions about the target distribution:
- Compute observed test statistic for the best split
- Randomly permute the target values while keeping exposure fixed
- Recompute test statistic on permuted data
- Repeat B times (typically B = 1000+) to build null distribution
- P-value = proportion of permuted statistics ≥ observed statistic
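For comparison only, a hedged sketch of such a permutation test is shown below; here the \((k_i, n_i)\) pairs are permuted jointly relative to the feature values, which is one common way to keep each count consistent with its exposure:

```python
import numpy as np


def permutation_p_value(statistic_fn, x, k, n, B=1000, seed=0):
    """Permutation p-value for a split statistic (illustrative, not ctree's exact procedure).

    statistic_fn(x, k, n) returns the observed statistic, e.g. the best split's
    log-likelihood gain for this feature.
    """
    x, k, n = np.asarray(x), np.asarray(k), np.asarray(n)
    rng = np.random.default_rng(seed)
    observed = statistic_fn(x, k, n)
    exceed = 0
    for _ in range(B):
        perm = rng.permutation(len(k))          # shuffle targets relative to the feature
        if statistic_fn(x, k[perm], n[perm]) >= observed:
            exceed += 1
    return (exceed + 1) / (B + 1)               # add-one correction avoids p = 0
```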
Advantages of Permutation Testing:
- No distributional assumptions required
- Exact finite-sample validity
- Robust to any form of model misspecification
- Works for any target distribution
Computational Cost:
- Requires B × (number of features) × (split evaluations) calculations
- Typically 1000+ permutations needed for stable p-values
- Computationally expensive for large datasets
1.5.2 BinomialTree Parametric Approach
BinomialTree leverages the explicit binomial distribution assumption to use more efficient inference:
Key Insight: Since we assume the target follows a binomial distribution, we can use the known theoretical properties of the likelihood ratio test under this assumption.
Computational Efficiency:
- Uses chi-squared distribution for p-value calculation
- Single evaluation per feature (no permutations needed)
- Orders of magnitude faster than permutation testing
Trade-off:
- More efficient when binomial assumption holds
- Less robust when assumption is violated
- Requires careful validation of distributional assumptions
This design choice reflects the specialized nature of BinomialTree: by accepting the constraint of binomial-distributed targets, we achieve significant computational efficiency while maintaining statistical rigor through parametric hypothesis testing.
1.6 Assumptions and Limitations
1.6.1 Key Assumptions
- Binomial Distribution: Target data follows or approximates a binomial distribution
- Independence: Observations are independent conditional on features
- Constant Probability: Within each leaf, the success probability is constant
1.6.2 When Assumptions Are Violated
- Overdispersion: If the data show more variation than the binomial variance \(n_i p(1-p)\) allows, consider beta-binomial models (a rough diagnostic is sketched after this list)
- Temporal Dependence: Time-varying probabilities may violate independence
- Zero-Inflation: Excess zeros may indicate a different generating process
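One rough overdispersion diagnostic within a leaf compares the Pearson chi-squared statistic to its degrees of freedom; values well above 1 suggest more variation than the binomial model allows. The sketch below is a generic check, not part of the library:

```python
import numpy as np


def dispersion_ratio(k, n, p, eps=1e-12):
    """Pearson dispersion ratio for a leaf: close to 1 under the binomial assumption."""
    k, n = np.asarray(k, float), np.asarray(n, float)
    p = min(max(p, eps), 1.0 - eps)
    expected = n * p
    variance = n * p * (1.0 - p)
    pearson = np.sum((k - expected) ** 2 / variance)
    return pearson / (len(k) - 1)
```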
1.6.3 Performance Implications
BinomialTree should perform well when assumptions are met but may underperform gradient boosting methods when assumptions are significantly violated. The statistical stopping criteria help prevent overfitting but may also limit the model’s ability to capture complex nonlinear relationships.
1.7 Theoretical Advantages
- Principled Splitting: Splits directly optimize the relevant likelihood rather than surrogate measures
- Uncertainty Quantification: Naturally handles varying exposure levels
- Statistical Rigor: Hypothesis testing framework reduces overfitting
- Interpretability: Clear decision boundaries with statistical significance
- Reduced Validation Needs: Statistical stopping may reduce the need for train-test splits
This theoretical foundation guides the practical implementation and helps understand when BinomialTree is likely to excel compared to traditional methods.