1 Mathematical Foundations
1.1 Binomial Distribution Foundation
The BinomialTree algorithm is built on the assumption that the target data follows a binomial distribution. For each observation \(i\), we have:
- \(k_i\): number of successes (events)
- \(n_i\): number of trials (exposure)
- \(p_i\): true probability of success (unknown, to be estimated)
The likelihood for observation \(i\) under a binomial distribution is:
\[P(k_i | n_i, p) = \binom{n_i}{k_i} p^{k_i} (1-p)^{n_i - k_i}\]
The log-likelihood for observation \(i\) is:
\[\ell_i(p) = \log\binom{n_i}{k_i} + k_i \log(p) + (n_i - k_i) \log(1-p)\]
For a set of observations in a node, the total log-likelihood is:
\[\ell_{node}(p) = \sum_{i \in node} \ell_i(p)\]
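In code, the node log-likelihood is a direct translation of these formulas. The sketch below is illustrative only (NumPy and SciPy assumed; `node_log_likelihood` is not part of any library API):

```python
import numpy as np
from scipy.special import gammaln


def node_log_likelihood(k, n, p, eps=1e-12):
    """Binomial log-likelihood of a node at probability p.

    k and n are arrays of successes and trials for the observations in the node.
    """
    k = np.asarray(k, dtype=float)
    n = np.asarray(n, dtype=float)
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0) at p = 0 or 1
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)  # log C(n_i, k_i)
    return np.sum(log_binom + k * np.log(p) + (n - k) * np.log(1.0 - p))
```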
1.2 Why Model Counts Instead of Rates?
A key insight is that we model the count data directly rather than the rate \(k_i/n_i\). This is motivated by several factors:
1.2.1 Statistical Properties
For rare events where \(p\) is small and \(n_i\) varies significantly:
- Variance Instability: The variance of \(k_i/n_i\) is \(p(1-p)/n_i\), which depends on \(n_i\)
- Distribution Shape: The distribution of \(k_i/n_i\) becomes highly skewed and poorly behaved for small \(p\) and varying \(n_i\)
- Information Loss: Converting to rates discards the exposure information that affects uncertainty
1.2.2 Practical Implications
Consider two observations:
- Observation A: 1 success out of 10 trials (\(k/n = 0.1\))
- Observation B: 10 successes out of 100 trials (\(k/n = 0.1\))
Both have the same rate but vastly different uncertainty. The binomial likelihood naturally accounts for this difference through the exposure term.
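To make the difference concrete, the standard error of the observed rate \(k/n\) at \(\hat{p} = 0.1\) is roughly three times larger for observation A than for observation B:
\[SE_A = \sqrt{\frac{0.1 \times 0.9}{10}} \approx 0.095, \qquad SE_B = \sqrt{\frac{0.1 \times 0.9}{100}} \approx 0.030\]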
1.3 Splitting Criteria
1.3.1 Objective Function
At each node, we seek the split that maximizes the combined log-likelihood of the resulting child nodes:
\[\text{Split Quality} = \ell_{left}(\hat{p}_{left}) + \ell_{right}(\hat{p}_{right}) - \ell_{parent}(\hat{p}_{parent})\]
Where each \(\hat{p}\) is the maximum likelihood estimate computed over the observations in the corresponding node: \(\hat{p} = \frac{\sum k_i}{\sum n_i}\)
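A minimal sketch of this computation, reusing the `node_log_likelihood` helper from the earlier sketch (names are illustrative, not the library API):

```python
import numpy as np


def p_hat(k, n):
    """Maximum likelihood estimate of the node probability: sum(k) / sum(n)."""
    return np.sum(k) / np.sum(n)


def split_quality(k_left, n_left, k_right, n_right):
    """Log-likelihood gain from splitting a parent node into left/right children."""
    k_parent = np.concatenate([k_left, k_right])
    n_parent = np.concatenate([n_left, n_right])
    ll_parent = node_log_likelihood(k_parent, n_parent, p_hat(k_parent, n_parent))
    ll_left = node_log_likelihood(k_left, n_left, p_hat(k_left, n_left))
    ll_right = node_log_likelihood(k_right, n_right, p_hat(k_right, n_right))
    return ll_left + ll_right - ll_parent
```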
1.3.2 Numerical Features
For numerical features, we:
- Sort observations by feature value
- Consider all possible split points (midpoints between consecutive unique values)
- For computational efficiency, subsample split points if there are more than max_numerical_split_points unique values
- Handle missing values by always assigning them to the right child
The algorithm efficiently computes split statistics by maintaining cumulative sums as it iterates through sorted values.
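One way to implement this scan is shown below. It exploits the fact that the log-binomial-coefficient terms cancel between parent and children, so only the probability-dependent part of the log-likelihood is needed, evaluated at aggregate counts. This is a sketch under the definitions above, not the library's internal implementation:

```python
import numpy as np


def _ll_terms(k, n, eps=1e-12):
    """Probability-dependent part of the node log-likelihood at the MLE p = k / n.

    The log C(n_i, k_i) terms cancel in the parent-vs-children gain, so only
    k*log(p) + (n-k)*log(1-p) is needed, with k and n taken as aggregate counts.
    """
    p = np.clip(k / n, eps, 1.0 - eps)
    return k * np.log(p) + (n - k) * np.log(1.0 - p)


def best_numerical_split(x, k, n):
    """Scan candidate thresholds for a numerical feature using cumulative sums.

    Returns (best_threshold, best_gain). Split-point subsampling and
    missing-value handling are omitted for brevity.
    """
    x, k, n = np.asarray(x, float), np.asarray(k, float), np.asarray(n, float)
    order = np.argsort(x)
    x, k, n = x[order], k[order], n[order]
    k_cum, n_cum = np.cumsum(k), np.cumsum(n)
    k_tot, n_tot = k_cum[-1], n_cum[-1]
    ll_parent = _ll_terms(k_tot, n_tot)

    best_gain, best_threshold = -np.inf, None
    for i in range(len(x) - 1):
        if x[i] == x[i + 1]:
            continue  # thresholds lie at midpoints between distinct values only
        gain = (_ll_terms(k_cum[i], n_cum[i])
                + _ll_terms(k_tot - k_cum[i], n_tot - n_cum[i])
                - ll_parent)
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i + 1]) / 2.0
    return best_threshold, best_gain
```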
1.3.3 Categorical Features
For categorical features with \(C\) categories, we use an optimal grouping strategy:
- Sort by Rate: Order categories by their estimated success rate \(\hat{p}_c = \frac{\sum_{i \in c} k_i}{\sum_{i \in c} n_i}\)
- Optimal Grouping: Consider the \(C - 1\) contiguous splits of the sorted categories into two groups
- Degrees of Freedom: The split has \(df = C - 1\) degrees of freedom for hypothesis testing
This approach is theoretically optimal for binomial data and avoids the exponential complexity of considering all possible category groupings.
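The same cumulative-sum trick applies once the categories are sorted by rate; a hedged sketch, reusing the `_ll_terms` helper above (names are illustrative):

```python
import numpy as np


def best_categorical_split(categories, k, n):
    """Sort categories by success rate, then evaluate the C - 1 contiguous splits.

    Returns (left_category_set, best_gain).
    """
    categories = np.asarray(categories)
    k, n = np.asarray(k, float), np.asarray(n, float)

    cats = np.unique(categories)
    k_c = np.array([k[categories == c].sum() for c in cats])
    n_c = np.array([n[categories == c].sum() for c in cats])

    order = np.argsort(k_c / n_c)          # sort categories by estimated rate
    cats, k_c, n_c = cats[order], k_c[order], n_c[order]

    k_cum, n_cum = np.cumsum(k_c), np.cumsum(n_c)
    k_tot, n_tot = k_cum[-1], n_cum[-1]
    ll_parent = _ll_terms(k_tot, n_tot)

    best_gain, best_split = -np.inf, None
    for i in range(len(cats) - 1):         # the C - 1 contiguous splits
        gain = (_ll_terms(k_cum[i], n_cum[i])
                + _ll_terms(k_tot - k_cum[i], n_tot - n_cum[i])
                - ll_parent)
        if gain > best_gain:
            best_gain, best_split = gain, set(cats[: i + 1])
    return best_split, best_gain
```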
1.4 Statistical Stopping Criteria
1.4.1 Motivation
Traditional decision trees often require external validation or pruning to prevent overfitting. BinomialTree incorporates statistical hypothesis testing directly into the splitting process, inspired by Conditional Inference Trees (ctree).
1.4.2 Hypothesis Testing Framework
For each potential split, we test:
- Null Hypothesis (\(H_0\)): The split provides no improvement over the parent node
- Alternative Hypothesis (\(H_1\)): The split provides significant improvement
1.4.3 Likelihood Ratio Test
For each potential split, we calculate a likelihood ratio test statistic to assess whether the split provides a statistically significant improvement over the parent node.
The test statistic compares the log-likelihood of the data under two models:
- Null model: Single binomial distribution for the entire parent node
- Alternative model: Separate binomial distributions for left and right child nodes
The likelihood ratio test statistic is:
\[LR = 2(\ell_{children} - \ell_{parent}) = 2[(\ell_{left} + \ell_{right}) - \ell_{parent}]\]
Where:
- \(\ell_{parent}\) is the log-likelihood of the parent node with estimated probability \(\hat{p}_{parent} = \frac{\sum k_i}{\sum n_i}\)
- \(\ell_{left}\) is the log-likelihood of the left child with estimated probability \(\hat{p}_{left}\)
- \(\ell_{right}\) is the log-likelihood of the right child with estimated probability \(\hat{p}_{right}\)
Under the null hypothesis \(H_0\) (no benefit from splitting), this test statistic follows a chi-squared distribution with degrees of freedom:
- Numerical features: \(df = 1\) (one additional parameter: the split threshold)
- Categorical features: \(df = C - 1\) (where \(C\) is the number of categories being split)
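The test itself reduces to a few lines; a minimal sketch using SciPy's chi-squared survival function (the function name is illustrative):

```python
from scipy.stats import chi2


def lr_test_p_value(ll_parent, ll_left, ll_right, df):
    """Likelihood ratio test p-value for a candidate split.

    df = 1 for a numerical feature, C - 1 for a categorical feature with C categories.
    """
    lr = 2.0 * ((ll_left + ll_right) - ll_parent)
    lr = max(lr, 0.0)        # guard against tiny negative values from round-off
    return chi2.sf(lr, df)   # survival function: P(Chi2_df >= lr)
```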
1.4.4 Bonferroni Correction
Since we test multiple features at each node, we apply Bonferroni correction:
\[p_{adjusted} = \min(1, p_{raw} \times m)\]
Where \(m\) is the number of features tested. The split is considered significant if \(p_{adjusted} < \alpha\).
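For example, with \(m = 5\) features tested and a smallest raw p-value of \(0.008\):
\[p_{adjusted} = \min(1, 0.008 \times 5) = 0.04,\]
which is significant at \(\alpha = 0.05\).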
1.4.5 Implementation Details
The stopping procedure follows these steps:
- Pre-split Checks: Verify minimum sample requirements and maximum depth
- Find Best Split: For each feature, find the split that maximizes log-likelihood gain
- Calculate P-values: Compute likelihood ratio test p-value for each feature’s best split
- Multiple Testing Correction: Apply Bonferroni correction to the minimum p-value
- Decision: Split only if the adjusted p-value satisfies \(p_{adjusted} < \alpha\)
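The last three steps might be sketched as follows, reusing `lr_test_p_value` from above (the data structures and names are illustrative, not the library's API):

```python
def should_split(ll_parent, best_splits, alpha=0.05):
    """Decide whether to split a node, given each feature's best split.

    best_splits maps feature name -> (ll_left, ll_right, df); the pre-split
    checks (minimum samples, maximum depth) are assumed to have passed.
    """
    if not best_splits:
        return False
    p_values = [
        lr_test_p_value(ll_parent, ll_l, ll_r, df)
        for ll_l, ll_r, df in best_splits.values()
    ]
    m = len(p_values)                          # number of features tested
    p_adjusted = min(1.0, min(p_values) * m)   # Bonferroni correction
    return p_adjusted < alpha
```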
1.5 Comparison to Permutation Testing
The BinomialTree approach differs fundamentally from permutation-based methods like those used in Conditional Inference Trees (ctree):
1.5.1 Ctree Permutation Approach
Conditional Inference Trees use permutation testing because they make no assumptions about the target distribution:
- Compute observed test statistic for the best split
- Randomly permute the target values while keeping exposure fixed
- Recompute test statistic on permuted data
- Repeat B times (typically B = 1000+) to build null distribution
- P-value = proportion of permuted statistics ≥ observed statistic
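For comparison only, a hedged sketch of such a permutation test is shown below; here the \((k_i, n_i)\) pairs are permuted jointly relative to the feature values, which is one common way to keep each count consistent with its exposure:

```python
import numpy as np


def permutation_p_value(statistic_fn, x, k, n, B=1000, seed=0):
    """Permutation p-value for a split statistic (illustrative, not ctree's exact procedure).

    statistic_fn(x, k, n) returns the observed statistic, e.g. the best split's
    log-likelihood gain for this feature.
    """
    x, k, n = np.asarray(x), np.asarray(k), np.asarray(n)
    rng = np.random.default_rng(seed)
    observed = statistic_fn(x, k, n)
    exceed = 0
    for _ in range(B):
        perm = rng.permutation(len(k))          # shuffle targets relative to the feature
        if statistic_fn(x, k[perm], n[perm]) >= observed:
            exceed += 1
    return (exceed + 1) / (B + 1)               # add-one correction avoids p = 0
```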
Advantages of Permutation Testing:
- No distributional assumptions required
- Exact finite-sample validity
- Robust to any form of model misspecification
- Works for any target distribution
Computational Cost:
- Requires B × (number of features) × (split evaluations) calculations
- Typically 1000+ permutations needed for stable p-values
- Computationally expensive for large datasets
1.5.2 BinomialTree Parametric Approach
BinomialTree leverages the explicit binomial distribution assumption to use more efficient inference:
Key Insight: Since we assume the target follows a binomial distribution, we can use the known theoretical properties of the likelihood ratio test under this assumption.
Computational Efficiency:
- Uses chi-squared distribution for p-value calculation
- Single evaluation per feature (no permutations needed)
- Orders of magnitude faster than permutation testing
Trade-off:
- More efficient when binomial assumption holds
- Less robust when assumption is violated
- Requires careful validation of distributional assumptions
This design choice reflects the specialized nature of BinomialTree: by accepting the constraint of binomial-distributed targets, we achieve significant computational efficiency while maintaining statistical rigor through parametric hypothesis testing.
1.6 Assumptions and Limitations
1.6.1 Key Assumptions
- Binomial Distribution: Target data follows or approximates a binomial distribution
- Independence: Observations are independent conditional on features
- Constant Probability: Within each leaf, the success probability is constant
1.6.2 When Assumptions Are Violated
- Overdispersion: If the data show more variation than the binomial variance \(n_i p(1-p)\) allows, consider beta-binomial models (a rough diagnostic is sketched after this list)
- Temporal Dependence: Time-varying probabilities may violate independence
- Zero-Inflation: Excess zeros may indicate a different generating process
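One rough overdispersion diagnostic within a leaf compares the Pearson chi-squared statistic to its degrees of freedom; values well above 1 suggest more variation than the binomial model allows. The sketch below is a generic check, not part of the library:

```python
import numpy as np


def dispersion_ratio(k, n, p, eps=1e-12):
    """Pearson dispersion ratio for a leaf: close to 1 under the binomial assumption."""
    k, n = np.asarray(k, float), np.asarray(n, float)
    p = min(max(p, eps), 1.0 - eps)
    expected = n * p
    variance = n * p * (1.0 - p)
    pearson = np.sum((k - expected) ** 2 / variance)
    return pearson / (len(k) - 1)
```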
1.6.3 Performance Implications
BinomialTree should perform well when assumptions are met but may underperform gradient boosting methods when assumptions are significantly violated. The statistical stopping criteria help prevent overfitting but may also limit the model’s ability to capture complex nonlinear relationships.
1.7 Theoretical Advantages
- Principled Splitting: Splits directly optimize the relevant likelihood rather than surrogate measures
- Uncertainty Quantification: Naturally handles varying exposure levels
- Statistical Rigor: Hypothesis testing framework reduces overfitting
- Interpretability: Clear decision boundaries with statistical significance
- Reduced Validation Needs: Statistical stopping may reduce the need for train-test splits
This theoretical foundation guides the practical implementation and helps understand when BinomialTree is likely to excel compared to traditional methods.