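Classification and Prediction
Vicky

Bank personal housing loan approval

When individual customers apply to a bank for a housing loan, historical data show that some borrowers fail to repay on time. To reduce such defaults as far as possible, the bank needs to identify the characteristics of customers who do not repay on time, so that these characteristics can inform the approval of future housing loan applications.

At the end of 2006, SAS and China Merchants Bank launched a bank-wide project to develop and roll out a scorecard for personal housing loans. The project builds a scorecard model from customers' historical data, applies that model to new customers, and finally decides whether to accept each new customer's loan application.

Question: which customers should the analysis dataset include?

Bank loan applications (the bank loan dataset)

ID   Age      Has_job   Own_home   Credit      Class
1    Young    No        No         Fair        No
2    Young    No        No         Good        No
3    Young    Yes       No         Good        Yes
4    Young    Yes       Yes        Fair        Yes
5    Young    No        No         Fair        No
6    Middle   No        No         Fair        No
7    Middle   No        No         Good        No
8    Middle   Yes       Yes        Good        Yes
9    Middle   No        Yes        Excellent   Yes
10   Middle   No        Yes        Excellent   Yes
11   Old      No        Yes        Excellent   Yes
12   Old      No        Yes        Good        Yes
13   Old      Yes       No         Good        Yes
14   Old      Yes       No         Excellent   Yes
15   Old      No        No         Fair        No

Classification and prediction
• Classification:
  – the target variable is non-numeric
• Prediction:
  – the target variable is numeric
• From a historical dataset in which the target variable is known, build a model that describes the relationship between the target variable and the input variables, then use the model to classify or predict new data whose target value is unknown. A classification model is also called a classifier.

[Figure: modeling, model evaluation, and model application]
Rule 1: If refund = no and marst = married then cheat = no
……

The classification process
• Partition the dataset
  – Training set: build the model
  – Validation set: tune and select the model
  – Test set: assess the model's predictive ability
• Build the model
• Evaluate and select the model
• Apply the model to new data (the scoring set)

Question: under what circumstances is a classification model unsuitable for new data?

Classification methods
• Decision trees
• Bayesian classification
• Logistic regression
• Neural networks
• K-nearest neighbors
• SVM classification
……

Decision tree
[Figure: a decision tree with its Root, Node, and Leaf labeled]
Rule 1: If refund = no and (marst = single or marst = divorced) and taxincome > 80k then cheat = yes
……

Decision tree
• A binary or multiway tree structure
• Each internal node represents an attribute, and the branches leaving that node represent the outcomes of the different test conditions on that attribute
• Each leaf node represents a class label
• Decision trees are generally grown top-down

Contents
• The basic idea of decision trees
• Building a decision tree
• Converting a decision tree into decision rules and applying them
• Discussion of related issues

I. The idea behind decision trees
• Split the dataset into two or more subsets according to some test condition, so that the subsets after the split are purer with respect to the target variable

Purity and impurity
Common measures of impurity
• Information entropy (Entropy)
• Gini index
• Classification error

In the formulas below, $p_j$ is the relative proportion of class $j$ in the dataset.

Information entropy (Entropy)
$\mathrm{entropy} = -\sum_j p_j \log_2 p_j$

When is entropy smallest, and when is it largest?
• Minimum (all observations belong to one class, taking $0 \log_2 0 = 0$): $\mathrm{entropy} = -1 \log_2 1 - 0 \log_2 0 = 0$
• Maximum (binary target variable, both classes equally frequent): $\mathrm{entropy} = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$

Entropy of the bank loan dataset (6 No, 9 Yes):
$\mathrm{Entropy}(T) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$

Gini index
$\mathrm{gini} = 1 - \sum_j p_j^2$
• Maximum: $1 - 1/2 = 0.5$ (binary target variable)
• Minimum: $0$

Gini index of the bank loan dataset:
$\mathrm{gini} = 1 - (6/15)^2 - (9/15)^2 = 0.48$

Classification error
$\mathrm{CE} = 1 - \max_j p_j$
• Maximum: $1 - 1/2 = 0.5$ (binary target variable)
• Minimum: $0$

Classification error of the bank loan dataset:
$\mathrm{CE} = 1 - 9/15 = 6/15 = 0.4$

II. Building a decision tree
Common algorithms
• ID3-ID5, C4, C4.5, C5.0: binary or multiway splits; information entropy
• CART (Classification and Regression Trees, also written C&RT): binary splits; Gini index
• CHAID (chi-squared automatic interaction detection): binary or multiway splits

Building a decision tree
• Growing the tree
  – choosing the splitting attribute and its condition
  – deciding when to stop splitting
• Selecting the tree

1. Splitting objective and attribute selection
• Splitting objective: raise the purity of the subsets after the split as far as possible above the purity of the dataset before the split; that is, observations of different classes should end up in different subsets as far as possible.
• Criteria
  – information gain and information gain ratio
  – decrease in the Gini index
  – twoing index
  – chi-square test
  – C-SEP, …

Information gain
Information gain = entropy of the dataset before the split − weighted sum of the entropies of the subsets after the split,
where the weight of each subset is its number of observations divided by the total number of observations before the split.

Splitting the bank loan dataset on own_home

Own_home = Yes   Yes: 6   No: 0
Own_home = No    Yes: 3   No: 6

Entropy before the split:
$\mathrm{Entropy}(T_0) = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971$

Entropy after the split (taking $0 \log_2 0 = 0$):
$\mathrm{Entropy}_{\mathrm{Own\_home}}(T) = \frac{6}{15}\mathrm{Entropy}(T_1) + \frac{9}{15}\mathrm{Entropy}(T_2) = \frac{6}{15}\left(-\frac{6}{6}\log_2\frac{6}{6} - \frac{0}{6}\log_2\frac{0}{6}\right) + \frac{9}{15}\left(-\frac{3}{9}\log_2\frac{3}{9} - \frac{6}{9}\log_2\frac{6}{9}\right) = 0.551$

Information gain:
$\mathrm{Gain}(\mathrm{own\_home}) = 0.971 - 0.551 = 0.42$

Splitting the bank loan dataset on age

Age = Young    Yes: 2   No: 3
Age = Middle   Yes: 3   No: 2
Age = Old      Yes: 4   No: 1

Entropy after the split:
$\mathrm{Entropy}_{\mathrm{Age}}(T) = \frac{5}{15}\mathrm{Entropy}(T_1) + \frac{5}{15}\mathrm{Entropy}(T_2) + \frac{5}{15}\mathrm{Entropy}(T_3) = \frac{5}{15}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + \frac{5}{15}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) + \frac{5}{15}\left(-\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5}\right) = 0.888$

Information gain:
$\mathrm{Gain}(\mathrm{age}) = 0.971 - 0.888 = 0.083$

Splitting the bank loan dataset on the other attributes

Has_job = Yes   Yes: 5   No: 0
Has_job = No    Yes: 4   No: 6

Credit = Fair        Yes: 1   No: 4
Credit = Good        Yes: 4   No: 2
Credit = Excellent   Yes: 4   No: 0

The entropies after splitting on has_job and credit are:
$\mathrm{Entropy}_{\mathrm{Has\_job}}(T) = 0.647$
$\mathrm{Entropy}_{\mathrm{Credit}}(T) = 0.608$

The information gains of the four attributes are:
$\mathrm{Gain}(\mathrm{has\_job}) = 0.324$
$\mathrm{Gain}(\mathrm{credit}) = 0.363$
$\mathrm{Gain}(\mathrm{own\_home}) = 0.42$
$\mathrm{Gain}(\mathrm{age}) = 0.083$

own_home has the largest information gain, so it is chosen for the first split. The Own_home = Yes branch is already pure (Yes: 6, No: 0). The Own_home = No branch contains the following nine records:

ID   Age      Has_job   Own_home   Credit      Class
1    Young    No        No         Fair        No
2    Young    No        No         Good        No
3    Young    Yes       No         Good        Yes
5    Young    No        No         Fair        No
6    Middle   No        No         Fair        No
7    Middle   No        No         Good        No
13   Old      Yes       No         Good        Yes
14   Old      Yes       No         Excellent   Yes
15   Old      No        No         Fair        No

Splitting this branch on has_job yields two pure subsets:

Own_home = Yes                   Yes: 6   No: 0
Own_home = No, Has_job = Yes     Yes: 3   No: 0
Own_home = No, Has_job = No      Yes: 0   No: 6

The information gain criterion is biased toward attributes that have many distinct values.

Information gain ratio
• Suppose T is split on attribute S, and S has m values; according to the values of this attribute, dataset T is split …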
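The impurity measures and information gains worked out for the bank loan dataset can be checked with a short script. The following is a minimal Python sketch, not part of the original slides; the function names (`entropy`, `gini`, `classification_error`, `split_entropy`, `information_gain`) are illustrative choices, and the data are the 15 records of the bank loan dataset.

```python
import math
from collections import Counter

# Bank loan dataset from the slides: (Age, Has_job, Own_home, Credit, Class).
ATTRS = ["Age", "Has_job", "Own_home", "Credit"]
ROWS = [
    ("Young", "No", "No", "Fair", "No"),
    ("Young", "No", "No", "Good", "No"),
    ("Young", "Yes", "No", "Good", "Yes"),
    ("Young", "Yes", "Yes", "Fair", "Yes"),
    ("Young", "No", "No", "Fair", "No"),
    ("Middle", "No", "No", "Fair", "No"),
    ("Middle", "No", "No", "Good", "No"),
    ("Middle", "Yes", "Yes", "Good", "Yes"),
    ("Middle", "No", "Yes", "Excellent", "Yes"),
    ("Middle", "No", "Yes", "Excellent", "Yes"),
    ("Old", "No", "Yes", "Excellent", "Yes"),
    ("Old", "No", "Yes", "Good", "Yes"),
    ("Old", "Yes", "No", "Good", "Yes"),
    ("Old", "Yes", "No", "Excellent", "Yes"),
    ("Old", "No", "No", "Fair", "No"),
]


def class_proportions(rows):
    """Relative proportion p_j of each class that occurs in rows."""
    counts = Counter(row[-1] for row in rows)
    return [c / len(rows) for c in counts.values()]


def entropy(rows):
    # Absent classes never appear in the Counter, so 0*log(0) cannot occur.
    return sum(p * math.log2(1 / p) for p in class_proportions(rows))


def gini(rows):
    return 1 - sum(p * p for p in class_proportions(rows))


def classification_error(rows):
    return 1 - max(class_proportions(rows))


def split_entropy(rows, attr):
    """Weighted sum of subset entropies after splitting rows on attr."""
    idx = ATTRS.index(attr)
    groups = {}
    for row in rows:
        groups.setdefault(row[idx], []).append(row)
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain(rows, attr):
    return entropy(rows) - split_entropy(rows, attr)


if __name__ == "__main__":
    print(f"entropy={entropy(ROWS):.3f} gini={gini(ROWS):.2f} "
          f"CE={classification_error(ROWS):.1f}")
    for attr in ATTRS:
        print(f"Gain({attr}) = {information_gain(ROWS, attr):.3f}")
```

Run as a script, this should reproduce the figures quoted in the slides: 0.971, 0.48, and 0.4 for the three impurity measures, and gains of 0.083, 0.324, 0.420, and 0.363 for Age, Has_job, Own_home, and Credit. Passing only the Own_home = No records to `information_gain` gives the second-level comparison in the same way.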
Document title: Classification and Decision Trees
Link: https://www.777doc.com/doc-614498.html