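1. Posing the question
What kinds of passengers were more likely to survive the Titanic?
2. Understanding the data
2.1 Collecting the data
Download the data from the Kaggle Titanic project page.
2.2 Importing the data
import numpy as np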
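Only the first line of the import block survives here, so how `full` was built is not shown. The following is a minimal sketch, assuming the two usual Kaggle files (train.csv and test.csv) are read and stacked into one DataFrame; the helper name `load_full` and the tiny in-memory CSVs are illustrative, not the author's code.

```python
import io
import pandas as pd

def load_full(train_source, test_source):
    """Read the training and test files and stack them into one DataFrame."""
    train_data = pd.read_csv(train_source)
    test_data = pd.read_csv(test_source)
    # The test file has no Survived column, so its rows get NaN after concat
    full = pd.concat([train_data, test_data], ignore_index=True, sort=False)
    return train_data, test_data, full

# Tiny in-memory stand-ins for train.csv / test.csv (made-up rows)
train_csv = io.StringIO("PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n")
test_csv = io.StringIO("PassengerId,Pclass\n3,2\n")
train_data, test_data, full = load_full(train_csv, test_csv)
print(full)
```

With real files the call would be `load_full('train.csv', 'test.csv')`. Because the test rows carry NaN in Survived, pandas stores the whole column as float, which matches the 0.0/1.0 values seen in this walkthrough.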
2.3 Inspecting the dataset
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
| 2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
| 3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
| 4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
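# Check each column's data type and the size of the dataset
Findings: the data has 1,309 rows in total, 891 in the training set and 418 in the test set.
The target variable is Survived (0.0 = died, 1.0 = survived), stored as a float. The features fall into the following types:
(1) Numeric (float)
① Age: age, 263 values missing
② Fare: ticket fare, 1 value missing
(2) Numeric (integer)
① Parch: number of parents and children aboard (direct relatives of another generation)
② SibSp: number of siblings and spouses aboard (direct relatives of the same generation)
(3) Categorical
① Pclass: passenger class (1 = first class, 2 = second class, 3 = third class)
② Embarked: port of embarkation (S = Southampton, England; C = Cherbourg, France; Q = Queenstown, Ireland), 2 values missing
③ Sex: sex ('male', 'female')
(4) Text
① Cabin: cabin number, 1,014 values missing
② Name: passenger name
③ Ticket: ticket number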
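The missing-value counts listed above can be read off `full.info()`, or tallied directly. Here is a small sketch on a made-up three-row frame (the `toy` data is illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.2833, 7.925],
})

# Count NaN per column and show only the columns that have any
missing = toy.isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))
```

Running the same two lines on `full` would list Cabin, Age, Embarked and Fare, plus Survived for the unlabelled test rows.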
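3. Feature engineering
3.1 Data preprocessing
Handling missing values
Age: 263 values missing, a large share; they are filled later by a random forest model built on extracted features
Fare: 1 value missing; infer it from the other columns
Embarked: 2 values missing; infer them from the other columns
Cabin: 1,014 values missing, too many to infer; fill them all with the placeholder 'None'
# Look at the row with a missing Fare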
full[full['Fare'].isnull()]
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1043 | 60.5 | NaN | S | NaN | Storey, Mr. Thomas | 0 | 1044 | 3 | male | 0 | NaN | 3701 |
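The passenger with the missing fare embarked at S and travelled third class, so the gap is filled with the median Fare of passengers whose Embarked is S and whose Pclass is 3.
# Fill the missing Fare value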
fare = full[(full['Embarked'] == "S") & (full['Pclass'] == 3)].Fare.median()
full['Fare']=full['Fare'].fillna(fare)
# Look at the rows with a missing Embarked
full[full['Embarked'].isnull()]
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 38.0 | B28 | NaN | 80.0 | Icard, Miss. Amelie | 0 | 62 | 1 | female | 0 | 1.0 | 113572 |
| 829 | 62.0 | B28 | NaN | 80.0 | Stone, Mrs. George Nelson (Martha Evelyn) | 0 | 830 | 1 | female | 0 | 1.0 | 113572 |
The two passengers with a missing Embarked both travelled first class and both paid a fare of 80.
full.groupby(['Pclass','Embarked']).Fare.median()
Pclass Embarked
1 C 76.7292
Q 90.0000
S 52.0000
Name: Fare, dtype: float64
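Among first-class passengers, the median Fare for Embarked = C (76.7292) is the closest to 80, so the missing values are filled with C.
full['Embarked']=full['Embarked'].fillna('C')
# Fill the missing Cabin values
full['Cabin'] = full['Cabin'].fillna('None')
# Check the result: only Age is still unfilled
full.info()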
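3.2 Feature extraction
3.2.1 Text data
Cabin: the 1,014 missing values have already been filled with 'None'. Each value is a letter followed by digits, so the letter can be extracted as the cabin category
Name: extract the title from each passenger's name; it can be used to predict age. Titles are grouped into 6 categories: ① Officer (officials) ② Royalty (royalty and nobility) ③ Mr (adult men) ④ Mrs (married women) ⑤ Miss (unmarried women and girls) ⑥ Master (boys)
Ticket: some passengers share a ticket number, so count how many passengers hold each ticket number
# Extract the letter from Cabin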
full['Cabin'] = full['Cabin'].map(lambda c:c[0])
# Extract the title from the name
full['Title'] = full['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip())
title_map = {"Capt":"Officer", "Col":"Officer", "Major":"Officer", "Dr":"Officer", "Rev":"Officer",
"Don":"Royalty", "Dona":"Royalty", "Sir":"Royalty", "the Countess":"Royalty",
"Mr":"Mr",
"Mme":"Mrs", "Ms":"Mrs", "Mrs":"Mrs",
"Mlle":"Miss", "Miss":"Miss",
"Master":"Master", "Jonkheer":"Master" }
full['Title'] = full['Title'].map(title_map)
# Count how many passengers share each passenger's ticket number
ticketCount = dict(full['Ticket'].value_counts())
full['TicketCount'] = full['Ticket'].apply(lambda x:ticketCount[x])
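`Series.map` with a dict returns NaN for any key the dict does not contain, so it can be worth checking which extracted titles fall outside `title_map` before relying on the Title column. The sketch below uses a shortened map and made-up names, including one hypothetical title that is deliberately left unmapped:

```python
import pandas as pd

# Illustrative names; the last one is made up to show an unmapped title
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Example, Lady. Hypothetical Name",
])
title_map = {"Mr": "Mr", "Miss": "Miss"}  # shortened map for the demo

# Same extraction rule as above: text between the comma and the first period
titles = names.apply(lambda x: x.split(',')[1].split('.')[0].strip())
mapped = titles.map(title_map)

# Titles that the map does not cover end up as NaN in `mapped`
unmapped = sorted(titles[mapped.isnull()].unique())
print(unmapped)
```

On the real data the equivalent check is `full['Title'].isnull().sum()` right after the mapping step; any non-zero count points to a title that needs an entry in the map.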
# Fill the missing Age values: build a random forest model on the three features Sex, Title and Pclass
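The code for this step is cut off after its comment. One way it might look is sketched below, assuming a `RandomForestRegressor` trained on one-hot encoded Sex, Title and Pclass; the function name `fill_age`, the hyperparameters and the toy rows are assumptions, not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fill_age(df, features=("Sex", "Title", "Pclass")):
    """Return a copy of df with missing Age predicted from the given features."""
    df = df.copy()
    # Cast every predictor to str so that get_dummies one-hot encodes all of them
    X = pd.get_dummies(df[list(features)].astype(str))
    known = df["Age"].notnull()
    # Hyperparameters are illustrative; random_state only makes the demo repeatable
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[known], df.loc[known, "Age"])
    df.loc[~known, "Age"] = model.predict(X[~known])
    return df

# Made-up rows: the last two ages are missing
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Title": ["Mr", "Mrs", "Master", "Miss", "Mr", "Miss"],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Age": [30.0, 40.0, 4.0, 18.0, np.nan, np.nan],
})
filled = fill_age(toy)
print(filled["Age"])
```

A regressor is used because Age is continuous. Rows with a known Age are left untouched and only the gaps are overwritten.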
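3.2.2 Combined features
Add a family-size feature and family-category features: FamilySize = Parch + SibSp + 1, then split families into three categories: ① small family, FamilySmall: size = 1 ② medium family, FamilyMedian: size from 2 to 4 ③ large family, FamilyLarge: size of 5 or more
# Add FamilySize and the three family-category features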
full['FamilySize'] = full['SibSp'] + full['Parch']+1
full['FamilySmall'] = full['FamilySize'].map(lambda s:1 if s==1 else 0)
full['FamilyMedian'] = full['FamilySize'].map(lambda s:1 if (s>=2)&(s<=4) else 0)
full['FamilyLarge'] = full['FamilySize'].map(lambda s:1 if s>=5 else 0)
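Extracting anomalous groups: women and children had a high survival rate and adult men a low one, so within a family (identified by surname) it is unlikely that all the women and children died, or that all the adult men survived. These two cases are handled separately as anomalous groups: the anomalous-death group and the anomalous-survival group.
# Extract the surname
full['Surname'] = full['Name'].apply(lambda x:x.split(',')[0].strip())
# Count the passengers who share each surname
SurnameCount = dict(full['Surname'].value_counts())
full['SurnameGroup'] = full['Surname'].apply(lambda x:SurnameCount[x])
# Women and children (age 16 or under) whose surname is shared by at least 2 passengers
FemaleChildGroup = full.loc[(full['SurnameGroup']>=2) & ((full['Age']<=16) | (full['Sex']=='female'))]
# Adult men (over 16) whose surname is shared by at least 2 passengers
MaleGroup = full.loc[(full['SurnameGroup']>=2) & (full['Age']>16) & (full['Sex']=='male')]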
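# Survival rate of the women-and-children group
Result: 114 surname groups have a mean Survived of 1.0 and 32 have a mean of 0.0
# Survival rate of the adult-male group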
MaleGroup.groupby('Surname')['Survived'].mean().value_counts()
Result: 20 surname groups have a mean Survived of 1.0 and 117 have a mean of 0.0
# Collect the surnames of the anomalous groups
a = FemaleChildGroup.groupby('Surname')['Survived'].mean()
deadSurname = set(a[a.apply(lambda x:x==0)].index)
b = MaleGroup.groupby('Surname')['Survived'].mean()
survivedSurname = set(b[b.apply(lambda x:x==1)].index)
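To make the grouping logic concrete, here is a small sketch on made-up surnames. A group mean of exactly 0 means every labelled member died, and a mean of exactly 1 means every labelled member survived; rows with NaN in Survived (test-set passengers) are skipped by `mean()`.

```python
import numpy as np
import pandas as pd

# Made-up surnames; the NaN row stands in for a test-set passenger
group = pd.DataFrame({
    "Surname": ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma"],
    "Survived": [0.0, 0.0, np.nan, 1.0, 0.0, 1.0, 1.0],
})

# Per-surname survival rate over the labelled rows only
rate = group.groupby("Surname")["Survived"].mean()
all_died = set(rate[rate == 0].index)
all_survived = set(rate[rate == 1].index)
print(all_died, all_survived)
```

The boolean mask `rate == 0` selects the same rows as `rate.apply(lambda x: x == 0)` in the code above, just more compactly.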
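# Adjust these two anomalous groups so that they look like normal cases
3.2.3 Categorical data
Features with 2 category values are modified in place; features with more category values are one-hot encoded into dummy variables, which are then added to the feature set.
full2 = full[['Survived','Pclass','Sex','Age','Fare','Embarked','Title','FamilyLarge','FamilyMedian','FamilySmall','Cabin','TicketCount']]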
full2 = pd.get_dummies(full2).astype('float')
pclassDf = pd.get_dummies( full['Pclass'] , prefix='Pclass' )
full2 = pd.concat([full2,pclassDf],axis=1).astype('float')
full2.drop('Pclass',axis=1,inplace=True)
full2.head()
| | Survived | Age | Fare | FamilyLarge | TicketCount | Sex_female | Sex_male | Embarked_C | ... | Cabin_C | Pclass_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 22.0 | 7.2500 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 |
| 1 | 1.0 | 38.0 | 71.2833 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 |
| 2 | 1.0 | 26.0 | 7.9250 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 |
| 3 | 1.0 | 35.0 | 53.1000 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 |
| 4 | 0.0 | 35.0 | 8.0500 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 |
5 rows × 30 columns
3.2.4 Numeric data
from sklearn.preprocessing import MinMaxScaler
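Only the import line of this block survives. A minimal sketch of min-max scaling follows, assuming it is applied to the two continuous columns Age and Fare; which columns the author actually scaled is not shown, and the toy values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values for the two continuous features
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.2833, 7.925]})

# Rescale each column to [0, 1]: (x - min) / (max - min)
scaler = MinMaxScaler()
toy[["Age", "Fare"]] = scaler.fit_transform(toy[["Age", "Fare"]])
print(toy)
```

Scaling matters most for distance-based models such as K-nearest neighbours and SVMs, where an unscaled Fare would otherwise dominate the 0/1 dummy columns.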
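3.3 Feature selection
Correlation coefficient method: compute each feature's correlation coefficient with Survived
# Correlation of every feature with Survived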
full2.corr()['Survived'].sort_values(ascending=False)
Survived 1.000000
Sex_female 0.710385
Title_Miss 0.497464
Title_Mrs 0.403888
Pclass_1 0.285904
FamilyMedian 0.279855
Fare 0.257307
Title_Master 0.178526
Cabin_B 0.175095
Embarked_C 0.174718
Cabin_D 0.150716
Cabin_E 0.145321
Cabin_C 0.114652
Pclass_2 0.093349
TicketCount 0.064962
Cabin_F 0.057935
Cabin_A 0.022287
Cabin_G 0.016040
Title_Royalty 0.011329
Embarked_Q 0.003650
Cabin_T -0.026456
Title_Officer -0.042599
Age -0.073512
FamilyLarge -0.125147
Embarked_S -0.155660
FamilySmall -0.203367
Cabin_N -0.316912
Pclass_3 -0.322308
Sex_male -0.710385
Title_Mr -0.732785
Name: Survived, dtype: float64
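As a rule of thumb, an absolute value below 0.09 means no correlation, 0.09 to 0.3 weak correlation, 0.3 to 0.5 moderate correlation, and 0.5 to 1.0 strong correlation.
Drop the features that show no correlation with Survived: TicketCount, Cabin_F, Cabin_A, Cabin_G, Title_Royalty, Embarked_Q, Cabin_T, Title_Officer, Age
del_list = ['TicketCount','Cabin_F','Cabin_A','Cabin_G','Title_Royalty','Embarked_Q','Cabin_T','Title_Officer','Age']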
full3 = full2.drop(del_list,axis=1)
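The drop list above is typed out by hand. The same 0.09 rule could also be applied programmatically, as in this sketch on a made-up frame (the column `Noise` and the variable names are illustrative):

```python
import pandas as pd

# Made-up frame: Sex_female tracks Survived exactly, Noise is unrelated to it
toy = pd.DataFrame({
    "Survived": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "Sex_female": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "Noise": [1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
})

threshold = 0.09
# Correlation of each feature with the target, excluding the target itself
corr = toy.corr()["Survived"].drop("Survived")
# Features whose absolute correlation falls below the threshold
weak = corr[corr.abs() < threshold].index.tolist()
selected = toy.drop(weak, axis=1)
print(weak)
```

Deriving the list from the threshold keeps it in sync if the features or the cutoff change later.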
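4. Building the model
Fit a model with the training set and a machine learning algorithm, then evaluate the model on held-out data.
4.1 Splitting the training and test sets
train = full3[full3['Survived'].notnull()]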
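The rest of this block is cut off. The sketch below shows one way the split could continue, using the variable names that appear in the later code (`test`, `X_train`, `X_valid`, `y_train`, `y_valid`). The `test_size=0.2` value is an assumption; it would hold out 179 of the 891 labelled rows, which is consistent with the scores reported below being multiples of 1/179. The toy `full3` is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for full3: 10 labelled rows and 3 unlabelled (test) rows
rng = np.random.RandomState(0)
full3 = pd.DataFrame({
    "Survived": [0.0, 1.0] * 5 + [np.nan] * 3,
    "Fare": rng.rand(13),
    "Sex_female": rng.randint(0, 2, 13).astype(float),
})

# Labelled rows are for training; unlabelled rows are the Kaggle test set
train = full3[full3["Survived"].notnull()]
test = full3[full3["Survived"].isnull()].drop("Survived", axis=1)

X = train.drop("Survived", axis=1)
y = train["Survived"]
# Hold out part of the labelled data for validation (split ratio is assumed)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_valid.shape, test.shape)
```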
4.2 Choosing algorithms and evaluating the models
# Logistic regression
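The logistic regression code is missing apart from its comment. Following the fit-then-score pattern of the other models in this section, it might look like the sketch below; the synthetic data from `make_classification` and the `max_iter` setting are assumptions made so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training part, then report accuracy on the validation part
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
accuracy = lr.score(X_valid, y_valid)
print(accuracy)
```

For classifiers, `score` returns the mean accuracy on the data it is given.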
Returns 0.88826815642458101
# Support vector machine
from sklearn.svm import SVC, LinearSVC
svc = SVC()
svc.fit(X_train, y_train)
svc.score(X_valid, y_valid)
Returns 0.8938547486033519
# Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
rf.score(X_valid, y_valid)
Returns 0.86592178770949724
# Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb.score(X_valid, y_valid)
Returns 0.87150837988826813
# K-nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
knn.score(X_valid, y_valid)
Returns 0.87150837988826813
# Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb.score(X_valid, y_valid)
Returns 0.84916201117318435
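The six scores above come from a single validation split, so the gaps between models are small: with a validation set of 179 rows, one passenger is worth about 0.56 percentage points. As a steadier comparison, the sketch below scores the same six model types with 5-fold cross-validation. It runs on synthetic data so that it is self-contained, and the dictionary layout and `cv=5` are illustrative choices, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=3),
    "gaussian naive bayes": GaussianNB(),
}

# Mean accuracy over 5 folds for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```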
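5. Submitting the results to Kaggle
Use the trained svc model to predict the test set and save the results to a CSV file.
y_pred = svc.predict(test).astype(int) # predict returns a NumPy array of floats, so convert the elements to integers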
PassengerId = test_data.loc[:,'PassengerId']
predDf = pd.DataFrame({'PassengerId': PassengerId, 'Survived': y_pred})
predDf.head()
| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
# Save the results
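The save step is cut off after its comment. A minimal sketch, assuming `DataFrame.to_csv` is used: the example writes to an in-memory buffer so that it runs anywhere, and the three prediction rows are made up.

```python
import io
import pandas as pd

# Made-up predictions in the same layout as predDf above
predDf = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# index=False keeps the row index out of the file, leaving only the two columns
buffer = io.StringIO()
predDf.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, pass a path instead of the buffer, for example `predDf.to_csv('titanic_pred.csv', index=False)`; the file name here is hypothetical.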
After submitting the results, the score was 0.81339.