Loading data info | 正在读取数据信息
File name | 文件名: | |
Memory usage | 内存占用: | |
Shape | 行列数: | rows × columns |
Status | 状态:
1. Data Report | 数据报告: Not started | 未开始
2. WoE Analysis | 变量WoE分析: Not started | 未开始
3. Correlation Calculation | 相关性计算: Not started | 未开始
4. Stepwise Selection | 逐步回归过程: Not started | 未开始
Click "target_income_greater_than_50k" in the table to set it as target variable, then start modeling.
点击表格中的 “target_income_greater_than_50k” 将其设置为目标变量,然后开始建模。
# | Mark | 标记 | Variable | 变量 | Dtype | 类型 |
---|
Info | 提示
Settings | 设置
In this demo, only the default configuration values take effect; any other settings are ignored.
在此演示页面,只有默认设置被用于模型开发,其它非默认设置值会被忽略。
Confirm | 确认
Target variable
目标变量
Excluded variables
排除变量
Settings
设置
Missing rate for variable exclusion | 被剔除变量缺失率阈值: | % |
String variable uniqueness | 字符型变量唯一值: | |
Keep binning WoE monotonic | 保持 WoE 值单调: | |
Difference of WoE between 2 adjacent bins | 相邻分箱之间的 WoE 差值: | |
Minimum population of a single bin | 单个分箱的最小占比: | % |
IV for variable exclusion | 用于剔除变量的 IV 阈值: | |
Correlation for variable exclusion | 用于剔除变量的相关系数阈值: | |
Force variable coefficients to be negative | 确保入模变量系数均为负值: | |
Significance level for entry | 变量进入模型的显著性水平: | |
Significance level for stay | 变量退出模型的显著性水平: |
About t1modeler Algorithm | 关于 t1modeler 算法
To build reliable and sound logistic regression models, the following algorithms are applied in sequence during model
development:
1. Fast data profiling;
2. Iteratively merged binning algorithm;
3. Pearson correlation coefficient;
4. Augmented stepwise.
Each algorithm listed above builds on open-source packages and has been tuned for performance,
which makes it practical to process data sets with a large number of variables.
1. Fast data profiling
Fast data profiling can profile a data set with 100,000 records and 500 variables within 20 seconds, generating
metrics such as percentiles, uniqueness (the number of distinct values), and missing rate for each variable. Based on
the profiling results, variables that meet any of the following criteria are excluded from modeling: a. those with a
missing rate greater than 95%; b. those with uniqueness equal to 1; c. datetime variables; d. string variables with
uniqueness greater than 20.
Settings: Missing rate for variable exclusion (default 95%), String variable uniqueness (default 20).
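The four exclusion rules above can be illustrated with a minimal, standard-library-only sketch. It is not the tool's implementation: the function names `profile_column` and `exclusion_reason`, the use of `None` for missing values, and the sample columns are all assumptions made for illustration.

```python
from datetime import datetime

def profile_column(values):
    """Return missing rate, distinct-value count, and a coarse kind for one column."""
    present = [v for v in values if v is not None]  # assumption: None marks a missing value
    missing_rate = 1 - len(present) / len(values) if values else 1.0
    if present and all(isinstance(v, datetime) for v in present):
        kind = "datetime"
    elif present and all(isinstance(v, str) for v in present):
        kind = "string"
    else:
        kind = "numeric"
    return {"missing_rate": missing_rate, "uniqueness": len(set(present)), "kind": kind}

def exclusion_reason(profile, max_missing_rate=0.95, max_string_uniqueness=20):
    """Apply rules a-d; return a reason string, or None if the variable is kept."""
    if profile["missing_rate"] > max_missing_rate:
        return "missing rate too high"
    if profile["uniqueness"] == 1:
        return "single value"
    if profile["kind"] == "datetime":
        return "datetime variable"
    if profile["kind"] == "string" and profile["uniqueness"] > max_string_uniqueness:
        return "too many distinct strings"
    return None

# Hypothetical columns, one per rule plus one that survives.
columns = {
    "age": [25, 38, None, 52],
    "constant": [1, 1, 1, 1],
    "signup_time": [datetime(2020, 1, 1), datetime(2020, 1, 2), None, datetime(2020, 1, 4)],
    "user_id": [f"u{i}" for i in range(25)],
    "mostly_missing": [None] * 99 + [3.5],
}
reasons = {name: exclusion_reason(profile_column(vals)) for name, vals in columns.items()}
kept = [name for name, reason in reasons.items() if reason is None]
print(reasons)
print(kept)
```

The two keyword arguments mirror the two settings listed above, expressed as a fraction (0.95) and a count (20).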
2. Iteratively merged binning algorithm
In general, a good-enough binning meets the following criteria: a. the Weight-of-Evidence (WoE) of the bins is
monotonic; b. the WoE difference between any 2 adjacent bins is greater than 20; c. the population of every single
bin is greater than 5% of the total population. Under these constraints, the iteratively merged binning algorithm
maximizes the binning's IV (Information Value) while remaining fast: on a data set of 100,000 records, it takes
only about 1 second on average to bin a single variable.
Settings: Keep binning WoE monotonic (default enabled), Difference of WoE between 2 adjacent bins (default 20),
Minimum population of a single bin (default 5% of total population), IV for variable exclusion (default 0.02).
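To make the three criteria concrete, here is a rough sketch that computes WoE and IV for an already-binned variable and checks the constraints. It does not reproduce the iterative merging itself, which the text does not specify. Two assumptions are baked in: WoE is defined as ln(good share / bad share), a common convention that the tool may or may not use, and the threshold of 20 is read as WoE multiplied by 100 (the `woe_scale` parameter). The function names and bin counts are hypothetical.

```python
import math

def woe_iv(bins):
    """bins: list of (good_count, bad_count) in bin order. Returns (WoE list, IV).

    Assumes every bin holds at least one good and one bad record.
    """
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log(good_share / bad_share)  # assumed WoE convention
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv

def check_binning(bins, min_woe_gap=20.0, min_share=0.05, woe_scale=100.0):
    """Check criteria a-c for one binning; woe_scale=100 is an assumption about units."""
    woes, iv = woe_iv(bins)
    total = sum(g + b for g, b in bins)
    diffs = [later - earlier for earlier, later in zip(woes, woes[1:])]
    return {
        "monotonic": all(d > 0 for d in diffs) or all(d < 0 for d in diffs),
        "gap_ok": all(abs(d) * woe_scale > min_woe_gap for d in diffs),
        "share_ok": all((g + b) / total > min_share for g, b in bins),
        "iv": iv,
    }

# Hypothetical counts for a variable split into three bins.
result = check_binning([(100, 60), (300, 100), (600, 40)])
print(result)
```

A binning that fails any check would be a candidate for merging adjacent bins, and a variable whose final IV falls below the IV threshold (default 0.02) would be dropped.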
3. Pearson correlation coefficient
Pairwise Pearson correlation coefficients are calculated for all WoE-transformed variables. When the correlation
coefficient of 2 variables is greater than 0.9, only the variable with the higher IV is kept for modeling.
Settings: Correlation for variable exclusion (default 0.9).
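The filtering rule can be sketched as follows, again with the standard library only. The greedy order (visiting variables from highest to lowest IV and skipping ones already dropped) is an assumption, as are the names `pearson` and `drop_correlated` and the sample data. The comparison uses the signed coefficient, as the text states; comparing the absolute value would be a variation.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, iv, threshold=0.9):
    """columns: name -> WoE values; iv: name -> IV. Returns the names that are kept."""
    dropped = set()
    by_iv = sorted(columns, key=iv.get, reverse=True)
    # In each pair (a, b), a has an IV at least as high as b.
    for a, b in combinations(by_iv, 2):
        if a in dropped or b in dropped:
            continue
        if pearson(columns[a], columns[b]) > threshold:
            dropped.add(b)
    return [name for name in columns if name not in dropped]

# Hypothetical WoE-transformed variables and their IVs.
columns = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.1],
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
iv = {"x1": 0.10, "x2": 0.30, "x3": 0.05}
kept = drop_correlated(columns, iv)
print(kept)
```

Here `x1` and `x2` are almost perfectly correlated, so the one with the lower IV (`x1`) is removed, while `x3` is unaffected.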
4. Augmented stepwise
Building on traditional stepwise selection, which relies on the significance level (P value), a coefficient condition
is added to the selection procedure for stricter variable selection. Every variable that enters the model is the
most significant one (its P value is the lowest among the candidates and is less than or equal to 0.05) and has a
proper coefficient (a negative coefficient). Every variable that is removed from the model is the least significant
one (its P value is the highest among the candidates and is greater than or equal to 0.05) or has an improper
coefficient (a positive coefficient).
Settings: Force variable coefficients to be negative (default enabled),
Significance level for entry (default 0.05), Significance level for stay (default 0.05).
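The entry and removal rules can be sketched as two small decision functions. This covers only the selection logic of a single step, not the model fitting: the coefficients and P values are assumed to come from a logistic regression fitted elsewhere. The names `pick_entry` and `pick_removal`, the variable names, and the numbers are hypothetical. How the tool resolves ties or orders the two removal conditions is not stated, so the sketch removes the removable variable with the highest P value.

```python
def pick_entry(candidates, alpha_entry=0.05, force_negative=True):
    """candidates: name -> (coefficient, p_value) if that variable were added.

    Returns the eligible candidate with the lowest p-value, or None.
    """
    eligible = {
        name: p
        for name, (coef, p) in candidates.items()
        if p <= alpha_entry and (coef < 0 or not force_negative)
    }
    return min(eligible, key=eligible.get) if eligible else None

def pick_removal(in_model, alpha_stay=0.05, force_negative=True):
    """in_model: name -> (coefficient, p_value) in the current model.

    A variable is removable if it is not significant or has a positive coefficient.
    Returns the removable variable with the highest p-value, or None.
    """
    removable = {
        name: p
        for name, (coef, p) in in_model.items()
        if p >= alpha_stay or (force_negative and coef > 0)
    }
    return max(removable, key=removable.get) if removable else None

# Hypothetical fitted statistics for one step.
candidates = {
    "age_woe": (-0.8, 0.001),
    "hours_woe": (0.4, 0.0001),  # most significant, but wrong sign
    "edu_woe": (-0.5, 0.20),     # right sign, but not significant
}
in_model = {
    "age_woe": (-0.8, 0.001),
    "occupation_woe": (-0.1, 0.30),
    "sex_woe": (0.2, 0.01),
}
entered = pick_entry(candidates)
removed = pick_removal(in_model)
print(entered, removed)
```

With `force_negative=False`, the functions fall back to the traditional rule based on P values alone.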
为了开发可靠且合理的逻辑回归模型,这些算法被依次使用在模型开发过程中:
1. 快速数据画像;
2. 迭代合并变量分箱算法;
3. 皮尔逊相关系数算法;
4. 增强型逐步回归。
每一种算法,都在开源标准化算法包的基础上做了大量优化,提升运算性能及减少硬件消耗,以便轻松处理具有大量特征的数据集。
1. 快速数据画像
使用快速数据画像,能够在 20 秒内对具有500个变量的 10 万样本完成数据画像,计算包括各分位数,唯一值,缺失率等各项统计值。根据数据画像的结果,
符合以下任意一种条件的变量会被剔除:a. 缺失率大于 95% 的变量;b. 取值范围唯一值等于 1 的变量;
c. 时间型变量(datetime);d. 字符型变量且取值范围唯一值大于 20 的变量。
设置:被剔除变量缺失率阈值(默认 95%),字符型变量唯一值(默认 20)。
2. 迭代合并变量分箱算法
一般情况下,合理的变量分箱应满足以下条件:a. 分箱后变量的 WoE 值是单调的;b. 两个相邻分箱之间的 WoE 值至少相差 20;
c. 任意一个分箱的占比应大于总样本数的 5%。迭代合并变量分箱算法能够在以上条件都满足的情况下,寻求 IV 值最大化的分箱结果,且速度优异,
对于 10 万样本的数据集,每一个变量平均仅需 1 秒即可完成迭代分箱。
设置:保持 WoE 值单调(默认是),相邻分箱之间的 WoE 差值(默认 20),单个分箱的最小占比(默认 5%),用于剔除变量的 IV 阈值(默认 0.02)。
3. 皮尔逊相关系数
所有的 WoE 变量,两两运行皮尔逊相关系数,对于相关系数大于 0.9 的两个 WoE 变量,只保留 IV 值较大的一个。
设置:用于剔除变量的相关系数阈值(默认 0.9)。
4. 增强型逐步回归
在传统逐步回归的基础上,即在显著性水平(P 值)的判断的基础上,加入变量系数的判断,以便更严格地选择变量。每一步放入模型的变量,
必须是最显著的(P 值最小且 P 值小于等于 0.05) 同时变量系数正确的(变量系数为负值);每一步剔除的变量,
是最不显著的(P 值最大且 P 值大于等于 0.05)或者变量系数不正确的(变量系数为正值)。
设置:确保入模变量系数均为负值(默认是),变量进入模型的显著性水平(默认 0.05),变量退出模型的显著性水平(默认 0.05)。