ml2021spring_hw1

Introduction

A summary of my work on Homework 1 (HW1) of Prof. Hung-yi Lee's Spring 2021 Machine Learning course.

Task Description

Given survey data collected from US states over three consecutive days, predict the number of tested positive cases on the third day. In essence, this is a regression problem solved with a DNN.

Data Analysis

Data Features

The provided data contains many features (indicators), which fall roughly into five categories:

  • States
  • COVID-like symptoms
  • Behavior
  • Mental health
  • Tested positive cases

Data Format

The data is stored in CSV format, one row per sample. There are two files:

  • Training data: includes the number of tested positive cases on the third day
  • Test data: does not include the number of tested positive cases on the third day
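To get a quick feel for the format, the snippet below loads the training file and inspects it (a minimal sketch; the file name covid.train.csv follows the course Colab):

import pandas as pd

df = pd.read_csv('covid.train.csv')
print(df.shape)        # one row per sample
print(df.columns[:5])  # the first few feature (indicator) names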

Program

This section walks through the main parts of the complete program and records the changes I made to the original code.
The original HW1 source code can be found in the course's Colab notebook.

Module

The main modules of the program are:

  • Data Preprocess
  • Construct Neural Network
  • Define Train, Validation & Test Functions
  • Setup Hyper-parameters
  • Train
  • Test
  • Save Data

Data Preprocess

Dataset

Dataset
# Requires: import csv, numpy as np, torch; from torch.utils.data import Dataset
class COVID19Dataset(Dataset):
    ''' Dataset for loading and preprocessing the COVID19 dataset '''
    def __init__(self, path, mode='train', target_only=False):
        self.mode = mode

        # Read data into numpy arrays (drop the header row and the id column)
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))
            data = np.array(data[1:])[:, 1:].astype(float)

        # Select features (all 93 columns by default)
        if not target_only:
            feats = list(range(93))
        else:
            # The Colab's hinted choice: 40 states + the previous two days'
            # tested_positive (indices 57 & 75); extend this list with the
            # indices found by the correlation analysis described below
            feats = list(range(40)) + [57, 75]

        if mode == 'test':
            # Test data has no label column
            self.data = torch.FloatTensor(data[:, feats])
        else:
            # The last column is the day-3 tested_positive target
            target = data[:, -1]
            data = data[:, feats]
            # Splitting training data into train & dev sets (9:1)
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            else:
                indices = [i for i in range(len(data)) if i % 10 == 0]
            # Convert data into PyTorch tensors
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Normalize features (columns after the 40 one-hot states)
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)
        self.dim = self.data.shape[1]

    def __getitem__(self, index):
        # Returns one sample; train/dev samples come with their target
        if self.mode in ['train', 'dev']:
            return self.data[index], self.target[index]
        return self.data[index]

    def __len__(self):
        # Returns the size of the dataset
        return len(self.data)

The first step of data preprocessing is building the Dataset. As I understand it, a Dataset organizes and stores the data according to rules you define. The Dataset for this homework is shown above; its five key parts (marked by the comments) are:

  • Read the csv data into a numpy array
  • Select the feature columns
  • Split the training data into train and validation sets
  • Convert the np.array into PyTorch tensors
  • Normalize the feature values

In this homework, the part you must complete yourself is feature selection. For better regression performance, you should select the indicators most correlated with the prediction target, which calls for a correlation analysis before selecting (via built-in numerical routines, or intuition); a sketch of such an analysis follows below. My modified selection: the covid-like illness indicators plus the previous two days' tested positive cases.
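A minimal sketch of that correlation analysis with pandas. The file name follows the course Colab, and since the raw header repeats column names across days, I assume pandas renames the day-3 target to tested_positive.2:

import pandas as pd

df = pd.read_csv('covid.train.csv')
# Rank all columns by their absolute correlation with the day-3 target
corr = df.corr()['tested_positive.2'].abs().sort_values(ascending=False)
print(corr.head(15))  # the top-ranked indicators are good feature candidates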

Open question: in the normalization step, the mean and variance used for the train, validation, and test sets are not consistent (each split is normalized with its own statistics).
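A common remedy (my own sketch, not part of the original code) is to fit the statistics on the training split only and reuse them for validation and test:

import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(5.0, 2.0, size=(100, 3))  # stand-ins for the real splits
dev_x = rng.normal(5.0, 2.0, size=(20, 3))
test_x = rng.normal(5.0, 2.0, size=(30, 3))

mu, sigma = train_x.mean(axis=0), train_x.std(axis=0)  # training statistics only
train_x = (train_x - mu) / sigma
dev_x = (dev_x - mu) / sigma    # reuse the training statistics
test_x = (test_x - mu) / sigma  # never fit statistics on the test set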

DataLoader

The DataLoader splits the Dataset into batches of the configured batch_size, and an optional argument lets it reshuffle the data each epoch (shuffle).

DataLoader
def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Generates a dataset, then wraps it in a dataloader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)  # Construct dataloader
    return dataloader
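Typical usage, with file names as in the course Colab:

tr_set = prep_dataloader('covid.train.csv', 'train', batch_size=128)
dv_set = prep_dataloader('covid.train.csv', 'dev', batch_size=128)
tt_set = prep_dataloader('covid.test.csv', 'test', batch_size=128)

Note that shuffle=(mode == 'train') enables shuffling only for the training split, so validation and test batches keep their original order.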

Construct Neural Network

This module implements three things:

  • Define the network structure
  • Define how data flows through the network (the forward pass)
  • Define the loss function
Neural Network
class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # Define your neural network here
        # TODO: How to modify this model to achieve better performance?
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: you may implement L1/L2 regularization here
        return self.criterion(pred, target)

Compared with the original code, the modified network above is smaller and achieves better results. One explanation: the input has only 14 dimensions, so an oversized network is a poor fit here.
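To make "smaller" concrete, a quick check using the NeuralNet class defined above shows the model has only 513 parameters for a 14-dimensional input:

model = NeuralNet(input_dim=14)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 14*32 + 32 (first layer) + 32*1 + 1 (second layer) = 513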

Open question: I did not implement L2 regularization inside the network definition (in cal_loss); instead I enabled it later by setting the optimizer's optional weight_decay argument. I am not sure how these two implementations differ.
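For comparison, here is a sketch (my own, not the original code) of the in-loss version, written as a drop-in replacement for cal_loss; l2_lambda is a hypothetical hyper-parameter:

def cal_loss(self, pred, target, l2_lambda=1e-3):
    ''' MSE loss plus an explicit L2 penalty (method of NeuralNet) '''
    mse = self.criterion(pred, target)
    # Penalize all parameters; in practice you might exclude the biases
    l2_penalty = sum(p.pow(2).sum() for p in self.parameters())
    return mse + l2_lambda * l2_penalty

As for the difference: PyTorch's SGD and Adam implement weight_decay by adding λ·w to each parameter's gradient before the update, which is exactly the gradient of a (λ/2)·‖w‖² term in the loss, so the two routes are essentially equivalent (up to a factor of 2 in λ). The genuinely different variant is decoupled weight decay (AdamW), which applies the decay outside the adaptive gradient scaling.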

Train & Validation & Test Module

Train

train module
def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''

    # Setup maximum number of epochs
    n_epochs = config['n_epochs']

    # Setup optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}  # for recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()  # set model to training mode
        for x, y in tr_set:  # iterate through the dataloader
            optimizer.zero_grad()  # reset gradients from the previous step
            x, y = x.to(device), y.to(device)  # move data to device (cpu/cuda)
            pred = model(x)  # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()  # backpropagation (compute gradients)
            optimizer.step()  # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())  # record loss

        # After each epoch, test your model on the validation (development) set.
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save model if your model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                  .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if your model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record

In the training module, besides performing the forward pass, backpropagation, and parameter update, the model is evaluated on the validation set after every epoch.

If the validation loss is lower than the smallest loss seen so far, the network has improved, and its parameters are saved. If the model fails to improve for a number of consecutive epochs (early_stop, e.g. 200 here), training can stop early.

Validation

validation module
def dev(dv_set, model, device):
    model.eval()  # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:  # iterate through the dataloader
        x, y = x.to(device), y.to(device)  # move data to device (cpu/cuda)
        with torch.no_grad():  # disable gradient calculation
            pred = model(x)  # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)  # compute averaged loss

    return total_loss

The validation module judges how well training is going. Its main steps mirror those of train; the differences are that train runs the network in model.train() mode while validation uses model.eval() (which switches layers such as dropout and batch normalization to inference behavior), and validation disables gradient computation with torch.no_grad().

Test

test module
def test(tt_set, model, device):
    model.eval()  # set model to evaluation mode
    preds = []
    for x in tt_set:  # iterate through the dataloader
        x = x.to(device)  # move data to device (cpu/cuda)
        with torch.no_grad():  # disable gradient calculation
            pred = model(x)  # forward pass (compute output)
            preds.append(pred.detach().cpu())  # collect prediction
    preds = torch.cat(preds, dim=0).numpy()  # concatenate all predictions and convert to a numpy array
    return preds

The test module's flow is even simpler and is basically the same as validation, except that no loss is computed (the test set has no labels). The predictions it returns still need to be written out for submission; see the sketch below.
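The Save Data module from the module list is not shown in this post. A minimal sketch, assuming Kaggle's submission format with id and tested_positive columns (this mirrors the course Colab's save_pred helper):

import csv

def save_pred(preds, file):
    ''' Save predictions to the specified csv file '''
    print('Saving results to {}'.format(file))
    with open(file, 'w') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])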

Setup Hyper-parameters

hyper-parameters
config = {
    'n_epochs': 3000,  # maximum number of epochs
    'batch_size': 128,  # mini-batch size for dataloader
    # with SGD the loss failed to decrease (see the note below)
    'optimizer': 'Adam',  # optimization algorithm (optimizer in torch.optim)
    'optim_hparas': {  # hyper-parameters for the optimizer (depends on which optimizer you are using)
        'lr': 0.001,  # learning rate
        # 'momentum': 0.9,  # momentum for SGD
        'weight_decay': 1e-3  # L2 regularization
    },
    'early_stop': 200,  # early stopping epochs (the number of epochs since your model's last improvement)
    'save_path': 'models/model.pth'  # your model will be saved here
}

The hyper-parameters mainly include batch_size, the optimizer type, the learning rate, and the early-stop counter. In this part I changed the optimizer from SGD to Adam and set weight_decay to enable L2 regularization.

Open question: with SGD in this homework, the training loss fails to decrease and plateaus around 57, seemingly stuck at a saddle point.

Result

After submitting to Kaggle, the scores were:

  • private score: 0.90250
  • public score: 0.88411
