2  Implementation Guide

2.1 Installation and Dependencies

BinomialTree is implemented in pure Python with minimal dependencies:

# Required dependencies
import numpy as np
import pandas as pd  # Optional, for DataFrame support

No external machine learning libraries are required for the core functionality.

2.2 Basic Usage

2.2.1 Data Preparation

Your data should contain:

  • Target column: Number of successes (k)
  • Exposure column: Number of trials (n)
  • Feature columns: Predictor variables (numerical or categorical)

# Example data structure
data = [
    {'feature_num': 10.0, 'feature_cat': 'A', 'successes': 2, 'trials': 20},
    {'feature_num': 12.0, 'feature_cat': 'B', 'successes': 8, 'trials': 25},
    {'feature_num': 15.0, 'feature_cat': 'A', 'successes': 3, 'trials': 18},
    # ... more observations
]

# Or as a pandas DataFrame
import pandas as pd
df = pd.DataFrame(data)

2.2.2 Basic Model Training

from binomial_tree.tree import BinomialDecisionTree

# Initialize the tree
tree = BinomialDecisionTree(
    min_samples_split=20,
    min_samples_leaf=10, 
    max_depth=5,
    alpha=0.05,
    verbose=True
)

# Fit the model
tree.fit(
    data=data,  # or df for pandas DataFrame
    target_column='successes',
    exposure_column='trials', 
    feature_columns=['feature_num', 'feature_cat']
)

# Make predictions
new_data = [
    {'feature_num': 13.0, 'feature_cat': 'A'},
    {'feature_num': 23.0, 'feature_cat': 'C'}
]
predicted_probabilities = tree.predict_p(new_data)
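
predict_p returns one estimated success probability per input row. A common follow-up is converting those probabilities into expected success counts for a planned exposure; a minimal sketch, assuming predict_p returns a sequence of floats (the planned_trials values are illustrative and not part of the library's API):

# Expected success counts for a planned exposure (illustrative values)
planned_trials = [40, 15]  # hypothetical future trials for each new row

for row, p, n in zip(new_data, predicted_probabilities, planned_trials):
    print(f"{row}: p̂={p:.3f}, expected successes over {n} trials = {p * n:.1f}")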

2.2.3 Inspecting the Model

# Print the tree structure
tree.print_tree()

# Output example:
# Split: feature_num <= 15.500 (p-val=0.0123, gain=12.45) | k=45, n=180 (p̂=0.250)
#   |--L: Split: feature_cat in {'A', 'C'} (p-val=0.0089, gain=8.32) | k=15, n=80 (p̂=0.188)
#   |    |--L: Leaf: k=8, n=50 (p̂=0.160) | Reason: stat_stop
#   |    +--R: Leaf: k=7, n=30 (p̂=0.233) | Reason: min_samples_split
#   +--R: Leaf: k=30, n=100 (p̂=0.300) | Reason: stat_stop

2.3 Hyperparameter Configuration

2.3.1 Core Parameters

tree = BinomialDecisionTree(
    # Structural constraints
    max_depth=5,                    # Maximum tree depth
    min_samples_split=20,           # Min samples to consider splitting
    min_samples_leaf=10,            # Min samples in each leaf
    
    # Statistical stopping
    alpha=0.05,                     # Significance level for splits
    
    # Performance tuning  
    max_numerical_split_points=255, # Cap candidate split points for features with many unique values
    
    # Output control
    verbose=False,                  # Enable detailed logging
    confidence_level=0.95           # For confidence intervals (display only)
)

2.3.2 Parameter Guidelines

alpha (Significance Level)

  • Lower values (0.01) create more conservative, smaller trees
  • Higher values (0.10) allow more aggressive splitting
  • Default 0.05 provides good balance

min_samples_split and min_samples_leaf

  • Increase for rare events to ensure statistical power
  • Decrease for abundant data to capture fine patterns
  • Rule of thumb: choose min_samples_leaf large enough that each leaf contains roughly 5-10 expected events (see the sketch below)
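
At an assumed baseline rate of around 1%, for example, a leaf needs on the order of 1,000 trials to accumulate 10 expected events. A back-of-the-envelope sketch (baseline_rate and avg_trials_per_row are assumptions you would replace with your own figures):

# Rough sizing: how many rows per leaf are needed for a target number of events?
baseline_rate = 0.01        # assumed overall success rate
target_events = 10          # desired expected events per leaf
avg_trials_per_row = 20     # assumed average exposure (n) per observation

trials_needed = target_events / baseline_rate                 # ~1000 trials
rows_needed = int(round(trials_needed / avg_trials_per_row))  # ~50 observations

print(f"Aim for ~{trials_needed:.0f} trials per leaf, "
      f"e.g. min_samples_leaf={rows_needed}")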

max_depth

  • Acts as a safety constraint
  • Statistical stopping often kicks in before max depth
  • Set higher when alpha is strict (low)

2.4 Advanced Usage

2.4.1 Feature Type Specification

# Explicit feature type control
tree.fit(
    data=data,
    target_column='successes',
    exposure_column='trials',
    feature_columns=['numeric_feat', 'categorical_feat'],
    feature_types={
        'numeric_feat': 'numerical',
        'categorical_feat': 'categorical'
    }
)

2.4.2 Missing Value Handling

BinomialTree handles missing values automatically:

Numerical Features

  • Missing values imputed with median during training
  • Same median used for prediction

Categorical Features

  • Missing values treated as a distinct category ('NaN')
  • Unseen categories at prediction time are mapped to the 'NaN' path

# Data with missing values
data_with_missing = [
    {'num_feat': 10.0, 'cat_feat': 'A', 'k': 2, 'n': 20},
    {'num_feat': None, 'cat_feat': 'B', 'k': 8, 'n': 25},  # Missing numeric
    {'num_feat': 15.0, 'cat_feat': None, 'k': 3, 'n': 18}, # Missing categorical
]

# No special handling needed
tree.fit(data=data_with_missing, ...)
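
A minimal end-to-end sketch of the behaviour described above, reusing the column names from data_with_missing (the loosened hyperparameters and the values in new_points are purely illustrative for this three-row toy dataset):

from binomial_tree.tree import BinomialDecisionTree

# Constraints loosened only so the tiny toy dataset above can be fit
toy_tree = BinomialDecisionTree(min_samples_split=2, min_samples_leaf=1, max_depth=2)
toy_tree.fit(
    data=data_with_missing,
    target_column='k',
    exposure_column='n',
    feature_columns=['num_feat', 'cat_feat']
)

# Prediction tolerates missing and unseen values as well
new_points = [
    {'num_feat': None, 'cat_feat': 'A'},  # missing numeric -> median imputation
    {'num_feat': 11.0, 'cat_feat': 'Z'},  # unseen category -> routed via the 'NaN' path
]
print(toy_tree.predict_p(new_points))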

2.4.3 Pandas Integration

import pandas as pd
import numpy as np

# Create DataFrame with missing values
df = pd.DataFrame({
    'numeric_feature': [10.0, 12.0, np.nan, 15.0],
    'categorical_feature': ['A', 'B', 'A', None], 
    'successes': [2, 8, 1, 3],
    'trials': [20, 25, 5, 18]
})

# Seamless integration
tree.fit(
    data=df,
    target_column='successes', 
    exposure_column='trials',
    feature_columns=['numeric_feature', 'categorical_feature']
)

# Prediction on new DataFrame
new_df = pd.DataFrame({
    'numeric_feature': [13.0, 23.0],
    'categorical_feature': ['A', 'C']
})
predictions = tree.predict_p(new_df)

2.5 Model Interpretation

2.5.1 Understanding Tree Output

Each node displays comprehensive statistics:

Split: feature_name <= threshold (p-val=X.XXXX, gain=XX.XX) | k=XX, n=XXX (p̂=X.XXX) | CI_rel_width=X.XX | LL=XX.XX | N=XXX

Split Information

  • p-val: Statistical significance of the split
  • gain: Log-likelihood improvement from splitting

Node Statistics

  • k: Total successes in node
  • n: Total trials in node
  • p̂: Estimated success probability (k/n)
  • CI_rel_width: Relative width of the confidence interval around p̂
  • LL: Binomial log-likelihood of the node (see the sketch after this list)
  • N: Number of observations
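
The LL and gain figures are based on the binomial log-likelihood evaluated at each node's estimated probability. Below is a sketch of that computation from hypothetical (k, n) counts; whether the library includes the constant binomial-coefficient term (which cancels when computing gains) is an internal detail, so treat this as illustrative rather than a bit-exact reproduction:

import math

def node_log_likelihood(k, n):
    """Binomial log-likelihood at p_hat = k/n, omitting the constant
    log C(n, k) term (it cancels when computing split gains)."""
    if k == 0 or k == n:  # pure node: the remaining terms are zero
        return 0.0
    p_hat = k / n
    return k * math.log(p_hat) + (n - k) * math.log(1.0 - p_hat)

# Hypothetical parent node with k=30, n=200 split into (5, 100) and (25, 100)
ll_parent = node_log_likelihood(30, 200)
ll_children = node_log_likelihood(5, 100) + node_log_likelihood(25, 100)
gain = ll_children - ll_parent  # log-likelihood improvement from splitting
print(f"LL(parent)={ll_parent:.2f}, LL(children)={ll_children:.2f}, gain={gain:.2f}")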

Leaf Reasons

  • stat_stop: Stopped due to statistical test
  • min_samples_split: Not enough samples to split
  • max_depth: Reached maximum depth
  • pure_node: All observations have same outcome

2.5.2 Extracting Predictions and Uncertainty

# Get point predictions
probabilities = tree.predict_p(test_data)

# Access detailed node information for uncertainty
def get_prediction_details(tree, data_point):
    """Get prediction with node statistics"""
    # This would require extending the current API
    # Implementation would traverse tree and return node info
    pass
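
Until such an API exists, an interval for a leaf's success probability can be computed directly from its (k, n) counts. The sketch below uses the Wilson score interval, a standard construction for binomial proportions; it is not necessarily the interval behind the library's confidence_level and CI_rel_width output:

import math

def wilson_interval(k, n, confidence_level=0.95):
    """Wilson score interval for a binomial proportion (k successes in n trials)."""
    z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}[confidence_level]  # two-sided z-score
    p_hat = k / n
    denom = 1.0 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width

# Hypothetical leaf with k=8 successes over n=50 trials
low, high = wilson_interval(8, 50)
p_hat = 8 / 50
print(f"p̂={p_hat:.3f}, 95% CI=({low:.3f}, {high:.3f}), "
      f"relative width={(high - low) / p_hat:.2f}")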

2.6 Common Patterns and Best Practices

2.6.1 Rare Event Modeling

# Configuration for rare events (p < 0.01)
rare_event_tree = BinomialDecisionTree(
    min_samples_split=100,    # Need more samples for stability
    min_samples_leaf=50,      # Ensure adequate events per leaf
    max_depth=6,              # Allow deeper trees
    alpha=0.01,               # More conservative splitting
    verbose=True
)

2.6.2 High-Cardinality Categoricals

# For categorical features with many levels
high_card_tree = BinomialDecisionTree(
    min_samples_split=60,     # Account for category splits
    min_samples_leaf=30,      # Ensure representation per category
    max_depth=6,              # Categories may need more depth
    alpha=0.05
)

2.6.3 Large Dataset Optimization

# For datasets with many unique numerical values
large_data_tree = BinomialDecisionTree(
    max_numerical_split_points=500,  # More split points
    min_samples_split=50,            # Can afford larger minimums
    verbose=False                    # Reduce logging overhead
)

2.7 Error Handling and Diagnostics

2.7.1 Common Issues

Empty Leaves

  • Increase min_samples_leaf
  • Check for data quality issues
  • Consider feature engineering

No Splits Found

  • Increase alpha to be less strict
  • Ensure adequate sample sizes
  • Check feature-target relationships

Performance Issues

  • Reduce max_numerical_split_points
  • Limit max_depth
  • Consider feature selection

2.7.2 Debugging Output

# Enable verbose mode for detailed logging
tree = BinomialDecisionTree(verbose=True)
tree.fit(...)

# Sample verbose output:
# Processing Node abc123 (Depth 0): 1000 samples
#   Evaluating feature 'numeric_feat' (numerical)...
#   Feature 'numeric_feat' best split LL Gain: 23.45, p-value: 0.0012
#   Evaluating feature 'cat_feat' (categorical)...  
#   Feature 'cat_feat' best split LL Gain: 18.32, p-value: 0.0089
#   Overall best split: Feature 'numeric_feat' with p-value: 0.0012
#   Stat Stop Check: Bonferroni-adjusted p-value: 0.0024 < 0.05
#   Node abc123 SPLIT on numeric_feat
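
In the trace above, the Bonferroni adjustment multiplies the best raw p-value (0.0012) by the number of features evaluated (2), and the node is split because the adjusted value 0.0024 stays below alpha. A minimal sketch of that check, reconstructed from the trace rather than from the library's internals:

# Bonferroni-style stopping check, reconstructed from the verbose trace above
raw_p_values = {'numeric_feat': 0.0012, 'cat_feat': 0.0089}
alpha = 0.05

best_feature = min(raw_p_values, key=raw_p_values.get)
adjusted_p = min(1.0, raw_p_values[best_feature] * len(raw_p_values))

if adjusted_p < alpha:
    print(f"SPLIT on {best_feature} (adjusted p-value {adjusted_p:.4f} < {alpha})")
else:
    print("STOP: no statistically significant split found")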

This implementation guide provides the essential knowledge for effectively using BinomialTree in practice, from basic usage to advanced configurations for specific use cases.