1. Define the problem

  What kind of passengers were more likely to survive the sinking of the Titanic?

2. Understand the data

2.1 Acquire the data

  Download the data from the Kaggle Titanic competition page.

2.2 Import the data

import numpy as np
import pandas as pd
#Import the training and test sets
train_data = pd.read_csv("./train.csv")
test_data = pd.read_csv("./test.csv")
#Concatenate the two sets so they can be cleaned together
full = pd.concat([train_data, test_data], ignore_index = True)

2.3 Inspect the dataset

full.head()
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
2 26.0 NaN S 7.9250 Heikkinen, Miss. Laina 0 3 3 female 0 1.0 STON/O2. 3101282
3 35.0 C123 S 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 1 female 1 1.0 113803
4 35.0 NaN S 8.0500 Allen, Mr. William Henry 0 5 3 male 0 0.0 373450
#Check the data type and size of each column
full.info()
#Summary statistics of the numeric columns
full.describe()

  Findings: the combined data has 1309 rows, of which 891 belong to the training set and 418 to the test set.
  The target variable is Survived (0.0 = died, 1.0 = survived), stored as a float. The features fall into the following types:
(1) Numeric (float)
  ①Age: age, 263 values missing
  ②Fare: ticket fare, 1 value missing
(2) Numeric (integer)
  ①Parch: number of parents/children aboard
  ②SibSp: number of siblings/spouses aboard
(3) Categorical
  ①Pclass: passenger class (1 = first, 2 = second, 3 = third)
  ②Embarked: port of embarkation (S = Southampton, C = Cherbourg, Q = Queenstown), 2 values missing
  ③Sex: gender ('male', 'female')
(4) Text
  ①Cabin: cabin number, 1014 values missing
  ②Name: passenger name
  ③Ticket: ticket number
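
  The missing-value counts above can be confirmed directly; a minimal check, not part of the original notebook:

#Count missing values per column; Survived shows 418 "missing" because the test set has no labels
full.isnull().sum().sort_values(ascending=False)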

3. Feature engineering

3.1 Data preprocessing

 Handling missing values

  Age: 263 values missing. The gap is large, so a random forest model built on the extracted features will be used to fill it later.
  Fare: 1 value missing; infer it from the other fields.
  Embarked: 2 values missing; infer them from the other fields.
  Cabin: 1014 values missing. Too many to infer, so fill them all with the placeholder 'None'.

#Look at the row with the missing Fare
full[full['Fare'].isnull()]

Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
1043 60.5 NaN S NaN Storey, Mr. Thomas 0 1044 3 male 0 NaN 3701

  The passenger with the missing fare embarked at S and travelled third class, so fill the value with the median Fare of passengers with Embarked = S and Pclass = 3.

#Fill the missing Fare value
fare = full[(full['Embarked'] == "S") & (full['Pclass'] == 3)].Fare.median()
full['Fare'] = full['Fare'].fillna(fare)

#Look at the rows with missing Embarked
full[full['Embarked'].isnull()]

Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
61 38.0 B28 NaN 80.0 Icard, Miss. Amelie 0 62 1 female 0 1.0 113572
829 62.0 B28 NaN 80.0 Stone, Mrs. George Nelson (Martha Evelyn) 0 830 1 female 0 1.0 113572

  The two passengers with missing Embarked both travelled first class and paid a fare of 80.

full.groupby(['Pclass','Embarked']).Fare.median()

Pclass Embarked
 1    C   76.7292
      Q   90.0000
      S   52.0000
  Name: Fare, dtype: float64
  The median Fare for Embarked = C and Pclass = 1 (76.7292) is the closest to 80, so fill the missing Embarked values with 'C'.

full['Embarked'] = full['Embarked'].fillna('C')

#Fill the missing Cabin values
full['Cabin'] = full['Cabin'].fillna('None')

#Check the remaining missing values; only Age is still unfilled
full.info()

3.2 Feature extraction

3.2.1 Text features

  Cabin: the 1014 missing values have been filled with 'None'; the remaining values consist of a letter followed by digits, so extract the first letter as the cabin category.
  Name: extract the passenger's title from the name; it can be used to help predict age. The titles are grouped into 6 classes: ①Officer (military officers and professionals) ②Royalty (nobility) ③Mr (adult men) ④Mrs (married women) ⑤Miss (young unmarried women) ⑥Master (a courtesy title, mostly young boys)
  Ticket: several passengers can share the same ticket number, so count how many passengers hold each ticket.

#Extract the first letter of Cabin (missing values were filled with 'None', so they become 'N')
full['Cabin'] = full['Cabin'].map(lambda c: c[0])

#Extract the title from the passenger name
full['Title'] = full['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
title_map = {"Capt": "Officer", "Col": "Officer", "Major": "Officer", "Dr": "Officer", "Rev": "Officer",
             "Don": "Royalty", "Dona": "Royalty", "Sir": "Royalty", "the Countess": "Royalty",
             "Mr": "Mr",
             "Mme": "Mrs", "Ms": "Mrs", "Mrs": "Mrs",
             "Mlle": "Miss", "Miss": "Miss",
             "Master": "Master", "Jonkheer": "Master"}
full['Title'] = full['Title'].map(title_map)

#Count how many passengers share each ticket number
ticketCount = dict(full['Ticket'].value_counts())
full['TicketCount'] = full['Ticket'].apply(lambda x: ticketCount[x])

#Fill the missing Age values: build a random forest model on the Sex, Title and Pclass features
from sklearn.ensemble import RandomForestRegressor
ageDf = full[['Age', 'Pclass', 'Sex', 'Title']]
ageDf = pd.get_dummies(ageDf)
known_age = ageDf[ageDf.Age.notnull()].values
unknown_age = ageDf[ageDf.Age.isnull()].values
X = known_age[:, 1:]
y = known_age[:, 0]
age_rf = RandomForestRegressor(random_state=0)
age_rf.fit(X, y)
age_pred = age_rf.predict(unknown_age[:, 1:])
full.loc[full.Age.isnull(), 'Age'] = age_pred

3.2.2 Combined features

  Add a family-size feature and family-category features: FamilySize = Parch + SibSp + 1, then split families into three categories: ①FamilySmall: size of 1 ②FamilyMedian: size between 2 and 4 ③FamilyLarge: size of 5 or more.

#Add FamilySize and the three family-category features
full['FamilySize'] = full['SibSp'] + full['Parch'] + 1
full['FamilySmall'] = full['FamilySize'].map(lambda s: 1 if s == 1 else 0)
full['FamilyMedian'] = full['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
full['FamilyLarge'] = full['FamilySize'].map(lambda s: 1 if s >= 5 else 0)

  Extract anomalous groups: women and children survived at a much higher rate and adult men at a much lower rate, so a family (identified by surname) in which all the women and children died, or in which the adult men survived, is an unlikely case. Treat these two cases separately as an anomalous-death group and an anomalous-survival group.

#Extract the surname
full['Surname'] = full['Name'].apply(lambda x: x.split(',')[0].strip())
#Count how many passengers share each surname
SurnameCount = dict(full['Surname'].value_counts())
full['SurnameGroup'] = full['Surname'].apply(lambda x: SurnameCount[x])
#Women and children whose surname is shared by more than one passenger
FemaleChildGroup = full.loc[(full['SurnameGroup'] >= 2) & ((full['Age'] <= 16) | (full['Sex'] == 'female'))]
#Adult men whose surname is shared by more than one passenger
MaleGroup = full.loc[(full['SurnameGroup'] >= 2) & (full['Age'] > 16) & (full['Sex'] == 'male')]

#Survival rate of each surname in the women-and-children group
FemaleChildGroup.groupby('Surname')['Survived'].mean().value_counts()

  114 surnames have a survival rate of 1.0 and 32 have a survival rate of 0.0.

#Survival rate of each surname in the adult-male group
MaleGroup.groupby('Surname')['Survived'].mean().value_counts()

  20 surnames have a survival rate of 1.0 and 117 have a survival rate of 0.0.

#Extract the surnames of the two anomalous groups
a = FemaleChildGroup.groupby('Surname')['Survived'].mean()
deadSurname = set(a[a == 0].index)
b = MaleGroup.groupby('Surname')['Survived'].mean()
survivedSurname = set(b[b == 1].index)

#Override Sex and Title for the two anomalous groups so the model treats them like the group they behaved like
full.loc[full['Surname'].apply(lambda x: x in deadSurname), 'Sex'] = 'male'
full.loc[full['Surname'].apply(lambda x: x in deadSurname), 'Title'] = 'Mr'
full.loc[full['Surname'].apply(lambda x: x in survivedSurname), 'Sex'] = 'female'
full.loc[full['Surname'].apply(lambda x: x in survivedSurname), 'Title'] = 'Miss'

3.2.3 Categorical features

  One-hot encode the categorical features (Sex, Embarked, Title, Cabin) into dummy variables and add them to the feature set; Pclass is stored as an integer, so get_dummies does not pick it up automatically and it is encoded separately.

full2 = full[['Survived','Pclass','Sex','Age','Fare','Embarked','Title','FamilyLarge','FamilyMedian','FamilySmall','Cabin','TicketCount']]
#One-hot encode the object-typed columns (Sex, Embarked, Title, Cabin)
full2 = pd.get_dummies(full2).astype('float')
#Pclass is an integer column, so encode it separately and drop the original
pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass')
full2 = pd.concat([full2, pclassDf], axis=1).astype('float')
full2.drop('Pclass', axis=1, inplace=True)
full2.head()

Survived Age Fare FamilyLarge TicketCount Sex_female Sex_male Embarked_C ... Cabin_C Pclass_1
0 0.0 22.0 7.2500 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0
1 1.0 38.0 71.2833 0.0 2.0 1.0 0.0 1.0 ... 1.0 1.0
2 1.0 26.0 7.9250 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0
3 1.0 35.0 53.1000 0.0 2.0 1.0 0.0 0.0 ... 1.0 1.0
4 0.0 35.0 8.0500 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0

5 rows × 30 columns

3.2.4 Numeric features

#Scale Age, Fare and TicketCount to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
full2[['Age', 'Fare', 'TicketCount']] = MinMaxScaler().fit_transform(full2[['Age', 'Fare', 'TicketCount']])

3.3 Feature selection

  Correlation method: compute the correlation of each feature with Survived.

#Correlation of each feature with Survived
full2.corr()['Survived'].sort_values(ascending=False)

  Survived    1.000000
  Sex_female   0.710385
  Title_Miss   0.497464
  Title_Mrs    0.403888
  Pclass_1    0.285904
  FamilyMedian  0.279855
  Fare      0.257307
  Title_Master  0.178526
  Cabin_B     0.175095
  Embarked_C   0.174718
  Cabin_D     0.150716
  Cabin_E     0.145321
  Cabin_C     0.114652
  Pclass_2    0.093349
  TicketCount   0.064962
  Cabin_F     0.057935
  Cabin_A     0.022287
  Cabin_G     0.016040
  Title_Royalty  0.011329
  Embarked_Q   0.003650
  Cabin_T    -0.026456
  Title_Officer  -0.042599
  Age      -0.073512
  FamilyLarge  -0.125147
  Embarked_S   -0.155660
  FamilySmall  -0.203367
  Cabin_N    -0.316912
  Pclass_3    -0.322308
  Sex_male    -0.710385
  Title_Mr    -0.732785
  Name: Survived, dtype: float64
  As a rough rule of thumb, an absolute correlation below 0.09 indicates no real relationship, 0.09–0.3 is weak, 0.3–0.5 is moderate, and 0.5–1.0 is strong.
  Drop the features that show no correlation with Survived: TicketCount, Cabin_F, Cabin_A, Cabin_G, Title_Royalty, Embarked_Q, Cabin_T, Title_Officer, Age.

#Drop the features that are essentially uncorrelated with Survived
del_list = ['TicketCount','Cabin_F','Cabin_A','Cabin_G','Title_Royalty','Embarked_Q','Cabin_T','Title_Officer','Age']
full3 = full2.drop(del_list, axis=1)
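
  The same drop list could also be derived programmatically; a minimal sketch, assuming the 0.09 threshold above (not part of the original notebook):

#Features whose absolute correlation with Survived is below 0.09
corr = full2.corr()['Survived'].drop('Survived')
weak_features = corr[corr.abs() < 0.09].index.tolist()
print(weak_features)   #should match del_list above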

4. Build models

  Fit models on the training set with several machine-learning algorithms and evaluate them on held-out data.

4.1 Split the training and test sets

train = full3[full3['Survived'].notnull()]
test = full3[full3['Survived'].isnull()].drop('Survived', axis=1)
X = train.iloc[:, 1:]
y = train.iloc[:, 0]

#Hold out a validation set from the training data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=.8)

4.2 Choose algorithms and evaluate the models

#Logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
#For a classifier, score returns the accuracy on the given data
lr.score(X_valid, y_valid)

  Returns 0.88826815642458101

#Support vector machine
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
svc.score(X_valid, y_valid)

  Returns 0.8938547486033519

#Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
rf.score(X_valid, y_valid)

  Returns 0.86592178770949724

#Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb.score(X_valid, y_valid)

  Returns 0.87150837988826813

#K-nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
knn.score(X_valid, y_valid)

  Returns 0.87150837988826813

#Gaussian naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb.score(X_valid, y_valid)

  Returns 0.84916201117318435
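
  A single 80/20 split can be noisy, so the comparison could also be repeated with cross-validation; a minimal sketch using sklearn's cross_val_score (not part of the original workflow):

#5-fold cross-validation accuracy for each candidate model
from sklearn.model_selection import cross_val_score
for name, model in [('LogisticRegression', lr), ('SVC', svc), ('RandomForest', rf),
                    ('GradientBoosting', gb), ('KNN', knn), ('GaussianNB', gnb)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 4))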

5. Submit the results to Kaggle

  Use the trained svc model to predict on the test set and save the predictions to a csv file.

y_pred = svc.predict(test).astype(int)   #predict returns a numpy array of floats, so convert it to integers
PassengerId = test_data.loc[:, 'PassengerId']
predDf = pd.DataFrame({'PassengerId': PassengerId, 'Survived': y_pred})
predDf.head()

PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
#Save the results
predDf.to_csv('result1.csv', index=False)
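
  Before uploading, a quick sanity check that the submission has the expected shape (418 rows, one prediction per test passenger) can help; a minimal check, not in the original:

#The Kaggle test set has 418 passengers, so the submission should be 418 rows by 2 columns
assert predDf.shape == (418, 2)
assert predDf['Survived'].isin([0, 1]).all()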

  Submitting the results to Kaggle gives a score of 0.81339.