2 Implementation Guide
2.1 Installation and Dependencies
BinomialTree is implemented in pure Python with minimal dependencies:
# Required dependencies
import numpy as np
import pandas as pd # Optional, for DataFrame support
No external machine learning libraries are required for the core functionality.
2.2 Basic Usage
2.2.1 Data Preparation
Your data should contain:
- Target column: Number of successes (k)
- Exposure column: Number of trials (n)
- Feature columns: Predictor variables (numerical or categorical)
# Example data structure
data = [
    {'feature_num': 10.0, 'feature_cat': 'A', 'successes': 2, 'trials': 20},
    {'feature_num': 12.0, 'feature_cat': 'B', 'successes': 8, 'trials': 25},
    {'feature_num': 15.0, 'feature_cat': 'A', 'successes': 3, 'trials': 18},
    # ... more observations
]

# Or as a pandas DataFrame
import pandas as pd
df = pd.DataFrame(data)
2.2.2 Basic Model Training
from binomial_tree.tree import BinomialDecisionTree
# Initialize the tree
tree = BinomialDecisionTree(
    min_samples_split=20,
    min_samples_leaf=10,
    max_depth=5,
    alpha=0.05,
    verbose=True
)

# Fit the model
tree.fit(
    data=data,  # or df for a pandas DataFrame
    target_column='successes',
    exposure_column='trials',
    feature_columns=['feature_num', 'feature_cat']
)

# Make predictions
new_data = [
    {'feature_num': 13.0, 'feature_cat': 'A'},
    {'feature_num': 23.0, 'feature_cat': 'C'}
]
predicted_probabilities = tree.predict_p(new_data)
2.2.3 Inspecting the Model
# Print the tree structure
tree.print_tree()
# Output example:
# Split: feature_num <= 15.500 (p-val=0.0123, gain=12.45) | k=45, n=180 (p̂=0.250)
# |--L: Split: feature_cat in {'A', 'C'} (p-val=0.0089, gain=8.32) | k=15, n=80 (p̂=0.188)
# | |--L: Leaf: k=8, n=50 (p̂=0.160) | Reason: stat_stop
# | +--R: Leaf: k=7, n=30 (p̂=0.233) | Reason: min_samples_split
# +--R: Leaf: k=30, n=100 (p̂=0.300) | Reason: stat_stop
2.3 Hyperparameter Configuration
2.3.1 Core Parameters
tree = BinomialDecisionTree(
    # Structural constraints
    max_depth=5,                     # Maximum tree depth
    min_samples_split=20,            # Min samples to consider splitting
    min_samples_leaf=10,             # Min samples in each leaf

    # Statistical stopping
    alpha=0.05,                      # Significance level for splits

    # Performance tuning
    max_numerical_split_points=255,  # Limit split points for large features

    # Output control
    verbose=False,                   # Enable detailed logging
    confidence_level=0.95            # For confidence intervals (display only)
)
2.3.2 Parameter Guidelines
alpha (Significance Level)
- Lower values (e.g., 0.01) create more conservative, smaller trees
- Higher values (e.g., 0.10) allow more aggressive splitting
- The default of 0.05 provides a good balance
min_samples_split and min_samples_leaf
- Increase for rare events to ensure statistical power
- Decrease for abundant data to capture fine patterns
- Rule of thumb: min_samples_leaf ≥ 5-10 expected events (see the sketch below)
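As a quick back-of-the-envelope check, this rule of thumb can be turned into a concrete minimum leaf size given an expected baseline success rate. The helper below is purely illustrative and not part of the library:

# Illustrative helper (not part of BinomialTree): minimum trials per leaf
# so that a leaf is expected to contain at least `min_events` successes
# at a given baseline success rate.
import math

def suggested_min_samples_leaf(baseline_rate, min_events=5):
    return math.ceil(min_events / baseline_rate)

suggested_min_samples_leaf(0.02)   # 2% baseline rate -> about 250 trials per leaf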
max_depth
- Acts as a safety constraint
- Statistical stopping often kicks in before max depth
- Set higher when alpha is strict (low)
2.4 Advanced Usage
2.4.1 Feature Type Specification
# Explicit feature type control
tree.fit(
    data=data,
    target_column='successes',
    exposure_column='trials',
    feature_columns=['numeric_feat', 'categorical_feat'],
    feature_types={
        'numeric_feat': 'numerical',
        'categorical_feat': 'categorical'
    }
)
2.4.2 Missing Value Handling
BinomialTree handles missing values automatically:
Numerical Features
- Missing values are imputed with the feature's median during training
- The same median is used at prediction time
Categorical Features
- Missing values are treated as a distinct category ('NaN')
- Categories unseen during training are mapped to the NaN path at prediction time
# Data with missing values
data_with_missing = [
    {'num_feat': 10.0, 'cat_feat': 'A', 'k': 2, 'n': 20},
    {'num_feat': None, 'cat_feat': 'B', 'k': 8, 'n': 25},   # Missing numeric
    {'num_feat': 15.0, 'cat_feat': None, 'k': 3, 'n': 18},  # Missing categorical
]

# No special handling needed
tree.fit(data=data_with_missing, ...)
2.4.3 Pandas Integration
import pandas as pd
import numpy as np
# Create DataFrame with missing values
df = pd.DataFrame({
    'numeric_feature': [10.0, 12.0, np.nan, 15.0],
    'categorical_feature': ['A', 'B', 'A', None],
    'successes': [2, 8, 1, 3],
    'trials': [20, 25, 5, 18]
})

# Seamless integration
tree.fit(
    data=df,
    target_column='successes',
    exposure_column='trials',
    feature_columns=['numeric_feature', 'categorical_feature']
)

# Prediction on new DataFrame
new_df = pd.DataFrame({
    'numeric_feature': [13.0, 23.0],
    'categorical_feature': ['A', 'C']
})
predictions = tree.predict_p(new_df)
2.5 Model Interpretation
2.5.1 Understanding Tree Output
Each node displays comprehensive statistics:
Split: feature_name <= threshold (p-val=X.XXXX, gain=XX.XX) | k=XX, n=XXX (p̂=X.XXX) | CI_rel_width=X.XX | LL=XX.XX | N=XXX
Split Information
- p-val: Statistical significance of the split
- gain: Log-likelihood improvement from splitting
Node Statistics
- k: Total successes in the node
- n: Total trials in the node
- p̂: Estimated success probability
- CI_rel_width: Relative width of the confidence interval
- LL: Log-likelihood of the node
- N: Number of observations
Leaf Reasons
- stat_stop: Stopped due to the statistical test
- min_samples_split: Not enough samples to split
- max_depth: Reached maximum depth
- pure_node: All observations have the same outcome
2.5.2 Extracting Predictions and Uncertainty
# Get point predictions
probabilities = tree.predict_p(test_data)

# Access detailed node information for uncertainty
def get_prediction_details(tree, data_point):
    """Get prediction with node statistics."""
    # This would require extending the current API:
    # the implementation would traverse the tree and return node info
    pass
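One way to flesh out this stub, purely as a sketch: assume each node exposes attributes such as is_leaf, feature, threshold, left, right, k, and n, and that the fitted tree exposes its root node. These attribute names are hypothetical, not part of the documented API, and categorical splits and missing values are ignored for brevity:

# Hypothetical sketch only: root_, is_leaf, feature, threshold, left,
# right, k, and n are assumed attribute names; the real internals may differ.
def get_prediction_details(tree, data_point):
    """Walk the tree for one observation and return leaf-level statistics."""
    node = tree.root_
    while not node.is_leaf:
        value = data_point.get(node.feature)
        node = node.left if value is not None and value <= node.threshold else node.right
    p_hat = node.k / node.n if node.n > 0 else float('nan')
    return {'p_hat': p_hat, 'k': node.k, 'n': node.n}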
2.6 Common Patterns and Best Practices
2.6.1 Rare Event Modeling
# Configuration for rare events (p < 0.01)
rare_event_tree = BinomialDecisionTree(
    min_samples_split=100,  # Need more samples for stability
    min_samples_leaf=50,    # Ensure adequate events per leaf
    max_depth=6,            # Allow deeper trees
    alpha=0.01,             # More conservative splitting
    verbose=True
)
2.6.2 High-Cardinality Categoricals
# For categorical features with many levels
high_card_tree = BinomialDecisionTree(
    min_samples_split=60,  # Account for category splits
    min_samples_leaf=30,   # Ensure representation per category
    max_depth=6,           # Categories may need more depth
    alpha=0.05
)
2.6.3 Large Dataset Optimization
# For datasets with many unique numerical values
large_data_tree = BinomialDecisionTree(
    max_numerical_split_points=500,  # More split points
    min_samples_split=50,            # Can afford larger minimums
    verbose=False                    # Reduce logging overhead
)
2.7 Error Handling and Diagnostics
2.7.1 Common Issues
Empty Leaves
- Increase min_samples_leaf
- Check for data quality issues
- Consider feature engineering
No Splits Found
- Increase alpha to be less strict
- Ensure adequate sample sizes
- Check feature-target relationships (see the check below)
Performance Issues
- Reduce max_numerical_split_points
- Limit max_depth
- Consider feature selection
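When no splits are found, a quick sanity check of the feature-target relationship with plain pandas (independent of BinomialTree itself) shows whether there is any signal for the tree to pick up. Using the observation list from section 2.2.1:

import pandas as pd

# Aggregate success rates by a candidate feature; near-identical rates
# across groups suggest there is little for the tree to split on.
df = pd.DataFrame(data)
rates = (df.groupby('feature_cat')[['successes', 'trials']].sum()
           .assign(rate=lambda g: g['successes'] / g['trials']))
print(rates)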
2.7.2 Debugging Output
# Enable verbose mode for detailed logging
tree = BinomialDecisionTree(verbose=True)
tree.fit(...)
# Sample verbose output:
# Processing Node abc123 (Depth 0): 1000 samples
# Evaluating feature 'numeric_feat' (numerical)...
# Feature 'numeric_feat' best split LL Gain: 23.45, p-value: 0.0012
# Evaluating feature 'cat_feat' (categorical)...
# Feature 'cat_feat' best split LL Gain: 18.32, p-value: 0.0089
# Overall best split: Feature 'numeric_feat' with p-value: 0.0012
# Stat Stop Check: Bonferroni-adjusted p-value: 0.0024 < 0.05
# Node abc123 SPLIT on numeric_feat
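Assuming the adjustment shown in this log is a plain Bonferroni correction over the number of candidate features evaluated at the node (an inference from the numbers above, not a documented guarantee), the arithmetic works out as follows:

# Bonferroni correction as implied by the log: multiply the best raw
# p-value by the number of features tested, then compare against alpha.
best_raw_p = 0.0012
n_features_tested = 2
adjusted_p = min(best_raw_p * n_features_tested, 1.0)  # 0.0024
split_allowed = adjusted_p < 0.05                       # True -> the node is split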
This implementation guide provides the essential knowledge for effectively using BinomialTree in practice, from basic usage to advanced configurations for specific use cases.