Python: Decision Trees in Practice, Predicting California Housing Prices

Date: 2018-08-13 22:56:14

Environment: Anaconda, Jupyter Notebook

First, import the modules:

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

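Next, load the dataset:

# The sklearn.datasets.california_housing submodule is no longer importable in
# current scikit-learn releases; import the loader from sklearn.datasets instead.
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()  # downloads the data on first use
print(housing.DESCR)  # dataset description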

This uses the california_housing dataset that ships with sklearn. For details, see the post "Python: Datasets Bundled with sklearn".

Output:

California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.

 

Take a look at the data:

housing.data.shape
(20640, 8)
housing.data[0]
array([   8.3252    ,   41.        ,    6.98412698,    1.02380952,
        322.        ,    2.55555556,   37.88      , -122.23      ])
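The column indices used later are easier to follow once each value is paired with its feature name. The following is a small sketch in plain Python, not part of the original notebook: the names are the ones that appear in the feature-importance output at the end of this post, listed here in the column order documented for this dataset (an assumption worth checking against housing.feature_names), and the values are copied from housing.data[0] above.

```python
# Feature names in column order (assumed to match housing.feature_names).
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# First row of housing.data, copied from the output above.
first_row = [8.3252, 41.0, 6.98412698, 1.02380952,
             322.0, 2.55555556, 37.88, -122.23]

# Print each value next to its column index and name.
for index, (name, value) in enumerate(zip(feature_names, first_row)):
    print(f"{index}: {name:<10} {value}")

# Columns 6 and 7 hold the geographic coordinates.
coords = {feature_names[i]: first_row[i] for i in (6, 7)}
print(coords)
```

Under that assumed ordering, selecting columns 6 and 7 means training on latitude and longitude only.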

 

Tree model parameters:

 

Next, instantiate the algorithm, then pass in the data and train it.

from sklearn import tree
dtr = tree.DecisionTreeRegressor(max_depth=2)
# Train on two feature columns only (indices 6 and 7); fit takes two arguments, X and y
dtr.fit(housing.data[:, [6, 7]], housing.target)

Output:

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')
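The criterion shown in the output is mean squared error: at each node the tree looks for the threshold that leaves the two child groups with the lowest weighted MSE. The sketch below illustrates that idea in plain Python for a single feature. It is a simplified illustration, not scikit-learn's implementation; the helper names mse and best_split and the toy data are made up for this example.

```python
def mse(values):
    """Mean squared error of a group around its own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, weighted_mse) of the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each child's MSE by its share of the samples.
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data (hypothetical): a latitude-like feature and a price-like target.
xs = [33.0, 34.0, 35.0, 37.0, 38.0, 39.0]
ys = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]

threshold, score = best_split(xs, ys)
print(threshold, round(score, 4))
```

With max_depth = 2 the tree applies this kind of search at most twice along any path, so it ends up with at most four leaves.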

 

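To visualize the decision tree, first install graphviz. The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class (or by value for regression) and using explicit variable and class names if desired. IPython notebooks can also render these plots inline using the Image() function: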

# Rendering requires graphviz to be installed first: http://www.graphviz.org/Download..php
dot_data = tree.export_graphviz(
    dtr,
    out_file=None,  # return the DOT source as a string instead of writing a file
    feature_names=housing.feature_names[6:8],
    filled=True,
    impurity=False,
    rounded=True,
)

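For the installation steps for graphviz and pydotplus, see the post "Python: Installing graphviz and pydotplus".

Once the Python module pydotplus is installed, you can generate a PNG file (or any other supported file type) directly in Python: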

# pip install pydotplus
import pydotplus
from IPython.display import Image

graph = pydotplus.graph_from_dot_data(dot_data)
graph.get_nodes()[7].set_fillcolor("#FFF2DD")  # recolor the node at index 7
graph.write_png("graph.png")
Image(graph.create_png())  # display the tree inline in the notebook

 

[Image: the rendered decision tree]

 

Split the dataset into a training set and a test set, then train and validate:

from sklearn.model_selection import train_test_split

# Hold out 10% of the samples as the test set
x_train, x_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.1, random_state=42)
dtr = tree.DecisionTreeRegressor(random_state=42)
dtr.fit(x_train, y_train)

dtr.score(x_test, y_test)

Result:

0.637318351331017
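For a regressor, score returns the coefficient of determination R^2, so the number above is a share of explained variance rather than an accuracy. The following plain-Python sketch shows how that quantity is computed; the function name r2_score and the sample numbers are illustrative and are not taken from the notebook.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score(y_true, y_true)                 # predictions equal the targets
baseline = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
close = r2_score(y_true, [1.1, 1.9, 3.2, 3.8])     # small errors

print(perfect, baseline, round(close, 2))
```

A model that always predicts the mean of the targets scores 0, a perfect model scores 1, and a model worse than the mean baseline gets a negative score.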

Using a random forest:

from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42)
rfr.fit(x_train, y_train)
rfr.score(x_test, y_test)

Result:

0.79086492280964926
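The improvement over the single tree comes from averaging many trees, each trained on a bootstrap sample of the training data. The sketch below shows that idea in plain Python with one-split "stumps" standing in for full trees. It is a conceptual illustration under simplifying assumptions (one feature, no random feature subsets), not how RandomForestRegressor is implemented; fit_stump, fit_forest, and the toy data are hypothetical names and values.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regressor that predicts the mean target on each side."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    best = None  # (sum of squared errors, threshold, left mean, right mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # every feature value in this sample is identical
        return lambda x: overall
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm


def fit_forest(xs, ys, n_estimators=10, seed=42):
    """Train stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)


# Toy data (hypothetical): the target jumps from about 1 to about 3 after x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]

forest = fit_forest(xs, ys)
print(forest(1.0), forest(8.0))
```

Because each stump sees a slightly different sample, individual errors tend to cancel in the average, which is the intuition behind the higher test score.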

 

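Select parameters with cross-validation:

# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV

# The candidate parameters are usually written as a dictionary:
tree_param_grid = {'min_samples_split': [3, 6, 9], 'n_estimators': [10, 50, 100]}

# First argument: the model; second: the candidate parameters; cv: the number of cross-validation folds
grid = GridSearchCV(RandomForestRegressor(), param_grid=tree_param_grid, cv=5)
grid.fit(x_train, y_train)
# The output below was printed by the older grid_scores_ attribute, which was also
# removed in 0.20; cv_results_ carries the same per-setting scores
grid.cv_results_['mean_test_score'], grid.best_params_, grid.best_score_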

Result:

([mean: 0.78795, std: 0.00337, params: {'min_samples_split': 3, 'n_estimators': 10},
  mean: 0.80463, std: 0.00308, params: {'min_samples_split': 3, 'n_estimators': 50},
  mean: 0.80732, std: 0.00448, params: {'min_samples_split': 3, 'n_estimators': 100},
  mean: 0.78535, std: 0.00506, params: {'min_samples_split': 6, 'n_estimators': 10},
  mean: 0.80446, std: 0.00399, params: {'min_samples_split': 6, 'n_estimators': 50},
  mean: 0.80688, std: 0.00424, params: {'min_samples_split': 6, 'n_estimators': 100},
  mean: 0.78754, std: 0.00552, params: {'min_samples_split': 9, 'n_estimators': 10},
  mean: 0.80321, std: 0.00487, params: {'min_samples_split': 9, 'n_estimators': 50},
  mean: 0.80553, std: 0.00389, params: {'min_samples_split': 9, 'n_estimators': 100}],
 {'min_samples_split': 3, 'n_estimators': 100},
 0.8073224957136084)
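The nine rows above correspond to every combination of the two candidate lists, each evaluated on five folds. The sketch below shows, in plain Python, how such a grid can be enumerated and how unshuffled folds can be cut. It is meant as an illustration of the bookkeeping only; k_fold_indices is a hypothetical helper, and the 12-sample size is chosen just to keep the output small.

```python
from itertools import product

param_grid = {"min_samples_split": [3, 6, 9], "n_estimators": [10, 50, 100]}

# Every combination of the candidate values: 3 x 3 = 9 settings.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]


def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # early folds absorb the remainder
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=12, n_splits=5))
total_fits = len(candidates) * len(folds)

for setting in candidates:
    print(setting)
print(len(candidates), "settings x", len(folds), "folds =", total_fits, "fits")
```

Each setting's reported mean is the average validation score over its five folds, and the best setting is the one with the highest mean. By default GridSearchCV then refits that setting once more on the whole training set.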

 

Retrain the random forest with the selected parameters:

rfr = RandomForestRegressor(min_samples_split=3, n_estimators=100, random_state=42)
rfr.fit(x_train, y_train)
rfr.score(x_test, y_test)

Result:

0.80908290496531576

 

# Rank the features by their importance in the trained forest
pd.Series(rfr.feature_importances_, index=housing.feature_names).sort_values(ascending=False)

Result:

MedInc        0.524257
AveOccup      0.137947
Latitude      0.090622
Longitude     0.089414
HouseAge      0.053970
AveRooms      0.044443
Population    0.030263
AveBedrms     0.029084
dtype: float64
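The importances above add up to 1, so they can be read as shares. As a rough way to summarize them, the following sketch accumulates the ranked values until a chosen share is covered. The 0.8 cutoff is an arbitrary choice for illustration, and the numbers are copied from the output above.

```python
# Importances copied from the output above.
importances = {
    "MedInc": 0.524257, "AveOccup": 0.137947, "Latitude": 0.090622,
    "Longitude": 0.089414, "HouseAge": 0.053970, "AveRooms": 0.044443,
    "Population": 0.030263, "AveBedrms": 0.029084,
}

total = sum(importances.values())

# Walk the features from most to least important until the cutoff is reached.
cutoff = 0.8  # arbitrary share chosen for this illustration
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
running = 0.0
selected = []
for name, value in ranked:
    running += value
    selected.append(name)
    if running >= cutoff:
        break

print(round(total, 6), selected, round(running, 6))
```

Median income alone accounts for just over half of the total, and adding average occupancy and the two coordinates brings the share to about 84%.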

 
