Guidelines for the Project

There is no single optimal solution for this project, but we should pay attention to some key points and steps. There are also standards for the misclassification rate. I summarize them as follows.

1. Clean the data:

Although there are about twelve hundred customers in this data set, a small percentage of the samples can be removed because they are not meaningful (e.g., a person who is too young to have a child, or a person with many missing responses). If we include these samples, they will affect our result negatively. In practice, it is unavoidable to have a small percentage of meaningless samples, but we do not have to include them when we build the model.
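A minimal cleaning sketch in Python, assuming the data are rows of field/value pairs; the field names (`age`, `children`) and the thresholds are illustrative, not from the actual data set:

```python
# Hypothetical cleaning rules: drop samples that are logically
# inconsistent (too young to have children) or that have too many
# missing ("no response") fields. Thresholds are assumptions.
MAX_MISSING = 3

def is_meaningful(row):
    # Rule 1: a very young person cannot plausibly have children.
    if row.get("age", 0) < 16 and (row.get("children") or 0) > 0:
        return False
    # Rule 2: discard rows with too many missing responses (None).
    missing = sum(1 for v in row.values() if v is None)
    return missing <= MAX_MISSING

customers = [
    {"age": 14, "children": 2, "income": 1200},     # inconsistent -> drop
    {"age": 35, "children": 1, "income": 2800},     # keep
    {"age": 40, "children": None, "income": None},  # 2 missing -> keep
]
cleaned = [row for row in customers if is_meaningful(row)]
```

The exact rules you use will depend on which inconsistencies you find in the data; the point is that the filter is applied once, before any model is built.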

2. The non-numeric independent variables:

Among the fourteen independent variables, job status and residential status are non-numeric. Computers only read numeric values, so we need to assign numeric scores to these independent variables. You can decide the range of scores, but, applying common sense, people with a more stable income should receive higher scores, and people who own their apartments should receive higher scores.
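One simple way to do this is with a fixed lookup table per variable. The category labels and score ranges below are assumptions for illustration; choose your own, but keep the ordering (more stable income and home ownership score higher):

```python
# Illustrative score assignments for the two non-numeric variables.
JOB_SCORE = {
    "unemployed": 0,
    "part_time": 1,
    "self_employed": 2,
    "full_time": 3,
    "government": 4,   # most stable income -> highest score
}
RESIDENCE_SCORE = {
    "with_parents": 0,
    "renting": 1,
    "mortgaged": 2,
    "owner": 3,        # owners receive the highest score
}

def encode(customer):
    # Return a copy with the two categorical fields replaced by scores.
    customer = dict(customer)
    customer["job"] = JOB_SCORE[customer["job"]]
    customer["residence"] = RESIDENCE_SCORE[customer["residence"]]
    return customer

encoded = encode({"job": "full_time", "residence": "owner", "age": 30})
```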

3. Merge and delete some independent variables:

In this data set, we have many independent variables that contain very detailed information. But as some of you have discovered, more independent variables do not necessarily give a better result in terms of classification accuracy. If you merge some of the independent variables that are similar in nature (e.g., the various outgoings) to create a new independent variable, you may get a better result, because this new variable can play a much stronger role in the model. At the same time, you can delete independent variables that are not very relevant or important. It is true that the more information we collect, the better; but when it comes to building the model, more independent variables do not necessarily lead to a better result.
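A sketch of the merge-and-drop step, with hypothetical column names (the actual outgoing columns in your data set will differ):

```python
# Merge several similar "outgoings" columns into one total, then
# drop a variable assumed (for illustration) to be weakly relevant.
OUTGOING_COLS = ["rent_outgoing", "loan_outgoing", "card_outgoing"]
DROP_COLS = ["phone_type"]  # hypothetical low-relevance variable

def merge_and_drop(row):
    row = dict(row)
    # Replace the detailed outgoing columns with their sum.
    row["total_outgoings"] = sum(row.pop(c) for c in OUTGOING_COLS)
    for c in DROP_COLS:
        row.pop(c, None)
    return row

row = merge_and_drop({"rent_outgoing": 400, "loan_outgoing": 150,
                      "card_outgoing": 50, "phone_type": 2,
                      "income": 2000})
```

Whether a merged variable actually helps is an empirical question: compare the misclassification rates with and without it.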

4. Construct the training and testing data sets:

We need to divide the original data set into two parts: the training data set and the testing data set. We should pay special attention to the good-to-bad ratio in the training data set. As I explained in the lecture, if the numbers of good and bad customers in the training data set are unbalanced, the model built on this set will be biased. The ideal ratio is therefore 50:50, but in this data set the number of good customers is much larger than the number of bad customers, and we also want to leave some bad customers in the testing data set. We can allow some degree of imbalance between good and bad customers, up to 60:40. I suggest you choose 200-250 bad customers and 200-300 good customers for the training data set; all the remaining customers should be put into the testing data set.
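The split above can be sketched as follows. The labels here are fabricated (roughly 30% bad, as in the project data) just to demonstrate the mechanics; the counts 225 and 275 are one choice within the suggested ranges:

```python
import random

random.seed(0)

# Fabricated stand-in for the real data: ~30% bad, ~70% good.
data = [{"id": i, "label": "bad" if i % 10 < 3 else "good"}
        for i in range(1200)]

bad = [r for r in data if r["label"] == "bad"]
good = [r for r in data if r["label"] == "good"]
random.shuffle(bad)
random.shuffle(good)

# 225 bad and 275 good in training (within 200-250 and 200-300),
# everyone else in testing.
train = bad[:225] + good[:275]
test = bad[225:] + good[275:]

good_share = 275 / len(train)   # should not exceed 0.60 (60:40 rule)
```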

5. The objectives of the project and the standard for the accuracy rate:

The first objective is to minimize the training error rate and the testing error rate. If one of your three methods (we do not require all three) keeps both the training error rate and the testing error rate within 30%, you have done a good job. We use this standard because this is not an easy data set to work with. The data set is from a bank, and we know whether each customer is good or bad precisely because they are already customers of the bank. In other words, the bank originally classified all of them as good customers and gave them credit cards, but it turned out that around 30% of them are bad customers. Therefore, if one of your three methods can keep both the training error rate and the testing error rate within 30%, you have done a job comparable to that of the professionals in the bank, and we should not complain. If you get a 20% misclassification rate, you have outperformed the people who work in the bank. If two methods give you similar training and testing error rates, pick the one that gives the smaller Type II error rate; minimizing the Type II error rate is the second objective of the project.
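Both objectives can be computed from predicted versus true labels. A minimal sketch, assuming (as is natural for a bank) that the Type II error is classifying a bad customer as good, i.e., the costly mistake of granting credit to a defaulter:

```python
# Misclassification rate and Type II error rate from label lists.
# "Type II" here means: a bad customer classified as good.
def error_rates(true, pred):
    n = len(true)
    misclassified = sum(t != p for t, p in zip(true, pred))
    bad_total = sum(t == "bad" for t in true)
    missed_bad = sum(t == "bad" and p == "good"
                     for t, p in zip(true, pred))
    return misclassified / n, missed_bad / bad_total

# Tiny fabricated example: 8 customers, 2 mistakes overall,
# 1 of the 3 bad customers missed.
true = ["good", "good", "bad", "bad", "good", "bad", "good", "good"]
pred = ["good", "bad",  "bad", "good", "good", "bad", "good", "good"]
overall_err, type2_err = error_rates(true, pred)
```

Compute both rates on the training set and on the testing set separately; the 30% standard applies to each.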

6. Linear Regression:

When you run the linear regression on all the independent variables, you will find that the p-values of some variables are very large. In this case, you should remove some variables and run the regression again on the remaining ones. Our objective here is to obtain a meaningful linear model with a small misclassification rate. You might also find that for some customers, the total weighted value (y_i, the value of the dependent variable) falls outside the range [0, 1], which can be a problem if you interpret this value as the probability of default. This raises the question of whether logistic regression can do a better job, since its returned value for the dependent variable always lies within [0, 1]. When logistic regression was created, people expected it to outperform linear regression in such cases. In reality, however, logistic regression only does a comparable job, which is also what we observe in our case. Therefore, for the linear regression, we do not need to worry too much that some of the returned values of the dependent variable fall outside [0, 1], because our objective is not to estimate the probability of default. Instead, our objective is to find a linear function that separates good and bad customers with a minimal misclassification rate. For this reason, you can choose the optimal cutoff point (above which a customer is classified as bad and below which a customer is classified as good) that minimizes the training error rate and the testing error rate.
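The fit-then-tune-the-cutoff idea can be sketched with NumPy on fabricated data (y = 1 means bad, y = 0 means good; the data-generating rule here is an assumption purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
# Fabricated labels: the first variable carries the signal.
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 0.5).astype(float)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
scores = A @ w                    # fitted values; may leave [0, 1]

# Grid-search the cutoff: score above the cutoff -> classified bad.
cutoffs = np.linspace(scores.min(), scores.max(), 101)
errors = [(np.mean((scores > c) != y.astype(bool)), c)
          for c in cutoffs]
best_err, best_cut = min(errors)  # cutoff with smallest training error
```

In the project you would then check that the same cutoff also keeps the testing error rate acceptable, rather than tuning on the training set alone.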

7. Linear Programming:

The decision variables for this model are the weights w_i and the deviations a_i. Therefore, if the number of independent variables is p and the size of the training data set is n, the number of decision variables is n + p and the number of constraints is 2n. The upgraded software I suggested can easily solve a linear programming problem of this size. The predetermined cutoff c can be any value except 0, because when c is 0, w_i = 0 and a_i = 0 is a trivial solution to this model. If you can correctly set up and solve the linear programming model in this project, you will be able to solve any linear programming problem in the real world. This is what I want you to get from this project.
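As one possible way to set this up, here is a sketch using SciPy's `linprog` of a common LP discriminant formulation: good customers should score below the cutoff c and bad customers above it, with nonnegative deviations a_i penalized in the objective. The data are fabricated, and the exact constraint layout in your lecture notes may differ from this version:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
p, n_good, n_bad = 3, 30, 30
good = rng.normal(0.0, 1.0, size=(n_good, p))   # fabricated good group
bad = rng.normal(2.0, 1.0, size=(n_bad, p))     # fabricated bad group
c = 1.0                     # predetermined cutoff; any nonzero value

# Decision variables: [w_1..w_p, a_1..a_n] with n = n_good + n_bad.
n = n_good + n_bad
cost = np.concatenate([np.zeros(p), np.ones(n)])  # minimize sum a_i

# good customer i:  w.x_i - a_i <=  c
# bad customer i:  -w.x_i - a_i <= -c   (i.e., w.x_i >= c - a_i)
A_ub = np.zeros((n, p + n))
b_ub = np.zeros(n)
A_ub[:n_good, :p] = good
b_ub[:n_good] = c
A_ub[n_good:, :p] = -bad
b_ub[n_good:] = -c
A_ub[np.arange(n), p + np.arange(n)] = -1.0       # the -a_i terms

bounds = [(None, None)] * p + [(0, None)] * n     # w free, a_i >= 0
res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
```

With p + n decision variables and one inequality per customer plus the n sign restrictions a_i >= 0, the counts match the n + p variables and 2n constraints described above.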

8. Classification tree:

As the tree becomes larger and larger, the training error rate will decrease; the testing error rate, however, will eventually increase. Therefore, you need to use the input parameters to find a balance. The key to this method is to find a tree that keeps both the training error rate and the testing error rate within 30% (of course, the smaller, the better).
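This trade-off can be sketched with scikit-learn's `DecisionTreeClassifier`, using `max_depth` as the size-controlling input parameter. The data are fabricated stand-ins for the project data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
# Fabricated labels: the first variable carries the signal.
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

results = {}
for depth in (2, 4, 8, None):   # None = fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)
    test_err = 1 - tree.score(X_test, y_test)
    results[depth] = (train_err, test_err)
```

The fully grown tree fits the training set essentially perfectly while a shallow tree usually generalizes better; comparing the (train_err, test_err) pairs across depths is how you pick the balance.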




