【506】Loading and preprocessing the IMDB dataset with Keras

Date: 2020-12-31 12:52:15

  Sentiment analysis on the IMDB data.

  Download the data through keras.datasets, note the structure of what is downloaded, and preprocess it.

from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words 
# (among top max_features most common words)
maxlen = 20

# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

  x_train: a list of reviews, each encoded as a list of integer word indices

  y_train: binary class labels, 0 (negative) and 1 (positive)
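
  To make the structure concrete, here is a small sketch on made-up data rather than the real download. The names toy_x, toy_y and cap_vocabulary are hypothetical, and the word indices are invented. It assumes that capping the vocabulary with num_words replaces every index at or above the limit with an out-of-vocabulary marker (2 is used here as an assumed default).

```python
import numpy as np

# Toy stand-in for (x_train, y_train): three invented "reviews" as lists of
# integer word indices, plus one binary label per review.
toy_x = [[1, 14, 22, 16, 43], [1, 194, 1153], [1, 14, 47, 8, 30, 31, 7]]
toy_y = np.array([1, 0, 0])  # 1 = positive, 0 = negative

def cap_vocabulary(sequences, num_words, oov_char=2):
    """Replace every index >= num_words with oov_char (assumed marker value)."""
    return [[w if w < num_words else oov_char for w in seq] for seq in sequences]

capped = cap_vocabulary(toy_x, num_words=100)
lengths = [len(seq) for seq in capped]

print(capped)   # the second review loses its two rare words
print(lengths)  # the reviews have different lengths
```

  The differing lengths are the reason the sequences cannot be stacked into a single matrix as they are.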

  The reviews differ in length, so they need to be padded or truncated to a common length.

# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

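  The length is set to maxlen=20. With the default settings, longer reviews keep only their last 20 word indices and shorter ones are padded with 0 at the front.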

  The resulting matrix can be used directly as the input to an Embedding layer.
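
  Conceptually, an Embedding layer is a lookup table indexed by the integers in that matrix. The following NumPy sketch only illustrates the shapes involved; it is not the Keras layer itself. The random matrices stand in for real data and learned weights, and embedding_dim = 8 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

max_features = 10000  # vocabulary size, as above
maxlen = 20           # padded sequence length, as above
embedding_dim = 8     # hypothetical embedding size, chosen for illustration

# Stand-in for a padded batch: 4 samples of maxlen word indices each.
x = rng.integers(0, max_features, size=(4, maxlen))

# Stand-in for the layer's weights: one row of embedding_dim values per word index.
embedding_matrix = rng.normal(size=(max_features, embedding_dim))

# The lookup: every integer index is replaced by its row of the matrix.
embedded = embedding_matrix[x]
print(embedded.shape)
```

  The 2D integer matrix of shape (samples, maxlen) becomes a 3D float array of shape (samples, maxlen, embedding_dim).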

Reference: padding sequences with pad_sequences

Syntax:

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32',
    padding='pre', truncating='pre', value=0.)

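  Converts a list of nb_samples sequences (each a sequence of scalars) into a 2D NumPy array of shape (nb_samples, nb_timesteps). If maxlen is given, nb_timesteps=maxlen; otherwise nb_timesteps is the length of the longest sequence. Sequences shorter than nb_timesteps are padded with value (0 by default) until they reach that length, and sequences longer than nb_timesteps are truncated to fit. Where padding and truncation happen is controlled by padding and truncating respectively; both default to 'pre', i.e. the start of the sequence.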

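Parameters:

  sequences: a list of sequences, each itself a list of scalars
  maxlen: None or an integer, the length of every output row
  dtype: the data type of the returned array
  padding: 'pre' or 'post', whether to pad at the start or the end of each sequence
  truncating: 'pre' or 'post', whether to cut from the start or the end of sequences longer than maxlen
  value: the padding value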

Return value:

  Returns a 2D NumPy array of shape (nb_samples, nb_timesteps).

Example:

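>>> import numpy as np
>>> # rows of different lengths need dtype=object in recent NumPy versions
>>> a = np.array([[2, 3],
...               [3, 4, 6],
...               [7, 8, 9, 10]], dtype=object)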
>>> a
array([list([2, 3]), list([3, 4, 6]), list([7, 8, 9, 10])], dtype=object)
>>> import keras
Using TensorFlow backend.
>>> b = keras.preprocessing.sequence.pad_sequences(a, maxlen=10)
>>> b
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  2,  3],
       [ 0,  0,  0,  0,  0,  0,  0,  3,  4,  6],
       [ 0,  0,  0,  0,  0,  0,  7,  8,  9, 10]])
>>> c = keras.preprocessing.sequence.pad_sequences(a, maxlen=10, padding='post')
>>> c
array([[ 2,  3,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  4,  6,  0,  0,  0,  0,  0,  0,  0],
       [ 7,  8,  9, 10,  0,  0,  0,  0,  0,  0]])
>>> d = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post')
>>> d
array([[ 2,  3,  0],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> e = keras.preprocessing.sequence.pad_sequences(a, maxlen=3)
>>> e
array([[ 0,  2,  3],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> f = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post', truncating='post')
>>> f
array([[2, 3, 0],
       [3, 4, 6],
       [7, 8, 9]])
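
  The behaviour in the examples above can be reproduced with a few lines of NumPy. This is only a sketch written to show how padding and truncating interact, not the Keras source. The name pad_sequences_sketch is hypothetical, and the function assumes integer sequences and maxlen >= 1.

```python
import numpy as np

def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Illustrative re-implementation of the padding/truncating rules."""
    if maxlen is None:
        # No target length given: use the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    # Start from a matrix filled with the padding value.
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        seq = list(seq)
        if not seq:
            continue  # an empty sequence stays all padding
        # 'pre' truncation keeps the tail, 'post' truncation keeps the head.
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # 'pre' padding right-aligns the data, 'post' padding left-aligns it.
        if padding == "pre":
            out[i, -len(kept):] = kept
        else:
            out[i, :len(kept)] = kept
    return out

a = [[2, 3], [3, 4, 6], [7, 8, 9, 10]]
e = pad_sequences_sketch(a, maxlen=3)
f = pad_sequences_sketch(a, maxlen=3, padding="post", truncating="post")
print(e)
print(f)
```

  Truncation decides which values survive and padding decides where they sit in the row, which is why d and f above differ only in the last row.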

  
