Machine Learning (Part 1)

Basic Data Toolkit

Jupyter

This series mainly uses Python. pip is Python's best-known package manager. Python 2.7 ships with easy_install by default, which can be used to install pip:

sudo easy_install pip

IPython is an enhanced Python shell. Install it with:

pip install --user IPython

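Jupyter incorporates the IPython notebook. It can run Python and R, and makes it convenient to combine documentation with statistical visualizations. Install it with: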

pip install --user jupyter

Start the notebook server. It lists the files in the current directory and makes it easy to edit ".ipynb" files:

jupyter notebook

Numpy

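A scientific computing package for Python. It provides many advanced numerical programming tools for creating arrays and performing basic operations on them, such as slicing, transposing, and computing eigenvectors.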
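The three operations just mentioned can be sketched as follows. This is a minimal illustration, not part of the original post: the matrix values and variable names are made up, and np.linalg.eigh is used on the assumption that the matrix is symmetric.

```python
import numpy as np

# A small symmetric matrix, chosen only for illustration.
a = np.array([[2.0, 1.0],
              [1.0, 2.0]])

first_row = a[0, :]     # slicing: every column of row 0
second_col = a[:, 1]    # slicing: every row of column 1
a_t = a.T               # transpose (equal to a here, since a is symmetric)

# eigh handles symmetric matrices and returns eigenvalues in ascending order;
# column i of `eigenvectors` belongs to eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eigh(a)

print(first_row)
print(second_col)
print(eigenvalues)
```

For a general (non-symmetric) matrix, np.linalg.eig would be the usual choice instead.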

Scipy

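SciPy generally performs scientific computing by operating on NumPy arrays, so it can be described as built on top of NumPy. It has many submodules for different applications, such as interpolation, optimization, image processing, and statistics.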

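scipy.cluster vector quantization (clustering)
scipy.constants physical and mathematical constants
scipy.fftpack fast Fourier transform
scipy.integrate integration
scipy.interpolate interpolation
scipy.io data input and output
scipy.linalg linear algebra
scipy.ndimage N-dimensional image processing
scipy.odr orthogonal distance regression
scipy.optimize optimization
scipy.signal signal processing
scipy.sparse sparse matrices
scipy.spatial spatial data structures and algorithms
scipy.special special mathematical functions
scipy.stats statistical functions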
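As a rough idea of how these submodules are used, the sketch below calls two of them, scipy.integrate and scipy.optimize. The functions being integrated and minimized are hypothetical examples picked because their answers are easy to check by hand.

```python
from scipy import integrate, optimize


def square(x):
    """Illustrative integrand: x squared."""
    return x ** 2


def shifted_parabola(x):
    """Illustrative objective with its minimum at x = 2."""
    return (x - 2.0) ** 2


# quad returns the integral estimate and an estimate of the absolute error.
area, abs_error = integrate.quad(square, 0.0, 1.0)

# minimize_scalar searches for the minimum of a one-variable function.
result = optimize.minimize_scalar(shifted_parabola)

print(area)      # close to 1/3
print(result.x)  # close to 2
```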

Pandas

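The name pandas comes from "panel data analysis". pandas is built on NumPy and provides good support for time series analysis. It has two main data structures: Series (comparable to a column vector) and DataFrame (comparable to a matrix, that is, a two-dimensional table).

Examples:

Input: create a Series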

import pandas as pd
import numpy as np
step_data = [3620, 7891, 9761,
            3907, 4338, 5373]
step_counts = pd.Series(step_data,
name='steps') 
print(step_counts)

Output:

0    3620
1    7891
2    9761
3    3907
4    4338
5    5373
Name: steps, dtype: int64

Add a date index

step_counts.index = pd.date_range("20150329",
                                  periods=6)
print(step_counts)

Output:

2015-03-29    3620
2015-03-30    7891
2015-03-31    9761
2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: steps, dtype: int64

Query by index

print(step_counts['2015-04-01'])

Output:

3907

You can also query a whole month, which makes pandas very convenient for time series:

print(step_counts['2015-04'])

Output:

2015-04-01    3907
2015-04-02    4338
2015-04-03    5373
Freq: D, Name: steps, dtype: int64

Check the data type

print(step_counts.dtype)

Output:

int64

Assign missing values

step_counts[1:3] = np.nan
print(step_counts)

Output:

2015-03-29    3620.0
2015-03-30       NaN
2015-03-31       NaN
2015-04-01    3907.0
2015-04-02    4338.0
2015-04-03    5373.0
Freq: D, Name: steps, dtype: float64
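The dtype changes from int64 to float64 because NaN is itself a floating-point value. The following is a small sketch of how such gaps might then be inspected or handled. The shortened data and the choice of 0 as a fill value are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# A shortened series that already contains two gaps (illustrative values).
steps = pd.Series([3620, np.nan, np.nan, 3907, 4338, 5373], name='steps')

missing_count = steps.isna().sum()  # number of missing entries
filled = steps.fillna(0)            # replace gaps with an assumed default of 0
dropped = steps.dropna()            # or discard the incomplete entries

print(steps.dtype)
print(missing_count)
print(len(dropped))
```

Whether filling or dropping is appropriate depends on what a gap means in the data, for example "no steps" versus "not recorded".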

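Create a DataFrame. zip pairs up the two lists and list() turns the result into a list of tuples. zip stops at the shorter input, so the two extra entries in cycling_data are dropped: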

cycling_data = [10.7, 0, None, 2.4, 15.3, 10.9, 0, None]
joined_data = list(zip(step_data,
                       cycling_data))
print(joined_data)
activity_df = pd.DataFrame(joined_data)
print(activity_df)

Output:

[(3620, 10.7), (7891, 0), (9761, None), (3907, 2.4), (4338, 15.3), (5373, 10.9)]
      0     1
0  3620  10.7
1  7891   0.0
2  9761   NaN
3  3907   2.4
4  4338  15.3
5  5373  10.9

Set a date index and column names

active_df = pd.DataFrame(joined_data,
                         index=pd.date_range("20171229", periods=6),
                         columns=['Walking', 'Cycling'])
print(active_df)

Output:

            Walking  Cycling
2017-12-29     3620     10.7
2017-12-30     7891      0.0
2017-12-31     9761      NaN
2018-01-01     3907      2.4
2018-01-02     4338     15.3
2018-01-03     5373     10.9

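loc selects rows and columns by label, while iloc selects them by integer position, counting from 0. In iloc[rows, columns], the part before the comma indexes rows and the part after it indexes columns.

print(active_df.loc['2017-12-31'])

Output: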

Walking    9761.0
Cycling       NaN
Name: 2017-12-31 00:00:00, dtype: float64

An index of -3 selects the third row from the end:

print(active_df.iloc[-3])

Output:

Walking    3907.0
Cycling       2.4
Name: 2018-01-01 00:00:00, dtype: float64

Select rows [0, 3) of the first column. The start is inclusive and the end is exclusive:

print(active_df.iloc[0:3,0])

Output:

2017-12-29    3620
2017-12-30    7891
2017-12-31    9761
Freq: D, Name: Walking, dtype: int64
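One difference worth keeping in mind is that a loc slice includes its end label, whereas an iloc slice excludes its end position. The sketch below rebuilds a shortened version of the table so that it runs on its own. The name df and the four-row data are illustrative assumptions.

```python
import pandas as pd

# Shortened copy of the activity table, rebuilt so the example is self-contained.
df = pd.DataFrame(
    {'Walking': [3620, 7891, 9761, 3907],
     'Cycling': [10.7, 0.0, None, 2.4]},
    index=pd.date_range("20171229", periods=4),
)

# Position-based: rows 0, 1, 2 (position 3 is excluded), first column.
by_position = df.iloc[0:3, 0]

# Label-based: both end labels are included, so this also yields three rows.
by_label = df.loc['2017-12-29':'2017-12-31', 'Walking']

print(by_position)
print(by_label)
```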