ml2021spring_hw1

Introduction

A summary of my work on Homework 1 (HW1) of Prof. Hung-yi Lee's Spring 2021 Machine Learning course.

Task Description

Given survey data collected from US states over three consecutive days, predict the number of tested positive cases on the third day. In essence, this is a regression problem solved with a DNN.

Data Analysis

Data Features

The provided data contains many features (indicators), which fall roughly into five categories:

  • States
  • COVID-like symptoms
  • Behavior
  • Mental health
  • Tested positive cases

Data Format

The data is stored in CSV format, one row per sample. There are two files:

  • Training data: includes the number of tested positive cases on the third day
  • Test data: does not include the number of tested positive cases on the third day
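To get a quick feel for the format, the snippet below loads the training file and inspects it (a minimal sketch; the file name covid.train.csv follows the course Colab):

import pandas as pd

df = pd.read_csv('covid.train.csv')
print(df.shape)        # one row per sample
print(df.columns[:5])  # the first few feature (indicator) names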

Program

This section walks through the main parts of the complete program and records the changes I made to the original code.
The original HW1 source code can be found in the course's Colab notebook.

Module

The main modules of the program are:

  • Data Preprocess
  • Construct Neural Network
  • Define Train, Validation & Test Functions
  • Setup Hyper-parameters
  • Train
  • Test
  • Save Data

Data Preprocess

Dataset

Dataset
# Requires: import csv, numpy as np, torch; from torch.utils.data import Dataset
class COVID19Dataset(Dataset):
    ''' Dataset for loading and preprocessing the COVID19 dataset '''
    def __init__(self, path, mode='train', target_only=False):
        self.mode = mode

        # Read data into numpy arrays (drop the header row and the id column)
        with open(path, 'r') as fp:
            data = list(csv.reader(fp))
            data = np.array(data[1:])[:, 1:].astype(float)

        # Select features (all 93 columns by default)
        if not target_only:
            feats = list(range(93))
        else:
            # The Colab's hinted choice: 40 states + the previous two days'
            # tested_positive (indices 57 & 75); extend this list with the
            # indices found by the correlation analysis described below
            feats = list(range(40)) + [57, 75]

        if mode == 'test':
            # Test data has no label column
            self.data = torch.FloatTensor(data[:, feats])
        else:
            # The last column is the day-3 tested_positive target
            target = data[:, -1]
            data = data[:, feats]
            # Splitting training data into train & dev sets (9:1)
            if mode == 'train':
                indices = [i for i in range(len(data)) if i % 10 != 0]
            else:
                indices = [i for i in range(len(data)) if i % 10 == 0]
            # Convert data into PyTorch tensors
            self.data = torch.FloatTensor(data[indices])
            self.target = torch.FloatTensor(target[indices])

        # Normalize features (columns after the 40 one-hot states)
        self.data[:, 40:] = \
            (self.data[:, 40:] - self.data[:, 40:].mean(dim=0, keepdim=True)) \
            / self.data[:, 40:].std(dim=0, keepdim=True)
        self.dim = self.data.shape[1]

    def __getitem__(self, index):
        # Returns one sample; train/dev samples come with their target
        if self.mode in ['train', 'dev']:
            return self.data[index], self.target[index]
        return self.data[index]

    def __len__(self):
        # Returns the size of the dataset
        return len(self.data)

The first step of data preprocessing is building the Dataset. As I understand it, a Dataset organizes and stores the data according to rules you define. The Dataset for this homework is shown above; its five key parts (marked by the comments) are:

  • Read the csv data into a numpy array
  • Select the feature columns
  • Split the training data into train and validation sets
  • Convert the np.array into PyTorch tensors
  • Normalize the feature values

In this homework, the part you must complete yourself is feature selection. For better regression performance, you should select the indicators most correlated with the prediction target, which calls for a correlation analysis before selecting (via built-in numerical routines, or intuition); a sketch of such an analysis follows below. My modified selection: the covid-like illness indicators plus the previous two days' tested positive cases.
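A minimal sketch of that correlation analysis with pandas. The file name follows the course Colab, and since the raw header repeats column names across days, I assume pandas renames the day-3 target to tested_positive.2:

import pandas as pd

df = pd.read_csv('covid.train.csv')
# Rank all columns by their absolute correlation with the day-3 target
corr = df.corr()['tested_positive.2'].abs().sort_values(ascending=False)
print(corr.head(15))  # the top-ranked indicators are good feature candidates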

Open question: in the normalization step, the mean and variance used for the train, validation, and test sets are not consistent (each split is normalized with its own statistics).
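A common remedy (my own sketch, not part of the original code) is to fit the statistics on the training split only and reuse them for validation and test:

import numpy as np

rng = np.random.default_rng(0)
train_x = rng.normal(5.0, 2.0, size=(100, 3))  # stand-ins for the real splits
dev_x = rng.normal(5.0, 2.0, size=(20, 3))
test_x = rng.normal(5.0, 2.0, size=(30, 3))

mu, sigma = train_x.mean(axis=0), train_x.std(axis=0)  # training statistics only
train_x = (train_x - mu) / sigma
dev_x = (dev_x - mu) / sigma    # reuse the training statistics
test_x = (test_x - mu) / sigma  # never fit statistics on the test set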

DataLoader

The DataLoader splits the Dataset into batches of the configured batch_size, and an optional argument lets it reshuffle the data each epoch (shuffle).

DataLoader
def prep_dataloader(path, mode, batch_size, n_jobs=0, target_only=False):
    ''' Generates a dataset, then wraps it in a dataloader. '''
    dataset = COVID19Dataset(path, mode=mode, target_only=target_only)  # Construct dataset
    dataloader = DataLoader(
        dataset, batch_size,
        shuffle=(mode == 'train'), drop_last=False,
        num_workers=n_jobs, pin_memory=True)  # Construct dataloader
    return dataloader
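Typical usage, with file names as in the course Colab:

tr_set = prep_dataloader('covid.train.csv', 'train', batch_size=128)
dv_set = prep_dataloader('covid.train.csv', 'dev', batch_size=128)
tt_set = prep_dataloader('covid.test.csv', 'test', batch_size=128)

Note that shuffle=(mode == 'train') enables shuffling only for the training split, so validation and test batches keep their original order.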

Construct Neural Network

This module implements three things:

  • Define the network structure
  • Define how data flows through the network (the forward pass)
  • Define the loss function
Neural Network
class NeuralNet(nn.Module):
    ''' A simple fully-connected deep neural network '''
    def __init__(self, input_dim):
        super(NeuralNet, self).__init__()

        # Define your neural network here
        # TODO: How to modify this model to achieve better performance?
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

        # Mean squared error loss
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        ''' Given input of size (batch_size x input_dim), compute output of the network '''
        return self.net(x).squeeze(1)

    def cal_loss(self, pred, target):
        ''' Calculate loss '''
        # TODO: you may implement L1/L2 regularization here
        return self.criterion(pred, target)

Compared with the original code, the modified network above is smaller and achieves better results. One explanation: the input has only 14 dimensions, so an oversized network is a poor fit here.
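To make "smaller" concrete, a quick check using the NeuralNet class defined above shows the model has only 513 parameters for a 14-dimensional input:

model = NeuralNet(input_dim=14)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 14*32 + 32 (first layer) + 32*1 + 1 (second layer) = 513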

Open question: I did not implement L2 regularization inside the network definition (in cal_loss); instead I enabled it later by setting the optimizer's optional weight_decay argument. I am not sure how these two implementations differ.
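For comparison, here is a sketch (my own, not the original code) of the in-loss version, written as a drop-in replacement for cal_loss; l2_lambda is a hypothetical hyper-parameter:

def cal_loss(self, pred, target, l2_lambda=1e-3):
    ''' MSE loss plus an explicit L2 penalty (method of NeuralNet) '''
    mse = self.criterion(pred, target)
    # Penalize all parameters; in practice you might exclude the biases
    l2_penalty = sum(p.pow(2).sum() for p in self.parameters())
    return mse + l2_lambda * l2_penalty

As for the difference: PyTorch's SGD and Adam implement weight_decay by adding λ·w to each parameter's gradient before the update, which is exactly the gradient of a (λ/2)·‖w‖² term in the loss, so the two routes are essentially equivalent (up to a factor of 2 in λ). The genuinely different variant is decoupled weight decay (AdamW), which applies the decay outside the adaptive gradient scaling.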

Train & Validation & Test Module

Train

train module
def train(tr_set, dv_set, model, config, device):
    ''' DNN training '''

    # Setup maximum number of epochs
    n_epochs = config['n_epochs']

    # Setup optimizer
    optimizer = getattr(torch.optim, config['optimizer'])(
        model.parameters(), **config['optim_hparas'])

    min_mse = 1000.
    loss_record = {'train': [], 'dev': []}  # for recording training loss
    early_stop_cnt = 0
    epoch = 0
    while epoch < n_epochs:
        model.train()  # set model to training mode
        for x, y in tr_set:  # iterate through the dataloader
            optimizer.zero_grad()  # reset gradients from the previous step
            x, y = x.to(device), y.to(device)  # move data to device (cpu/cuda)
            pred = model(x)  # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
            mse_loss.backward()  # backpropagation (compute gradients)
            optimizer.step()  # update model with optimizer
            loss_record['train'].append(mse_loss.detach().cpu().item())  # record loss

        # After each epoch, test your model on the validation (development) set.
        dev_mse = dev(dv_set, model, device)
        if dev_mse < min_mse:
            # Save model if your model improved
            min_mse = dev_mse
            print('Saving model (epoch = {:4d}, loss = {:.4f})'
                  .format(epoch + 1, min_mse))
            torch.save(model.state_dict(), config['save_path'])  # Save model to specified path
            early_stop_cnt = 0
        else:
            early_stop_cnt += 1

        epoch += 1
        loss_record['dev'].append(dev_mse)
        if early_stop_cnt > config['early_stop']:
            # Stop training if your model stops improving for "config['early_stop']" epochs.
            break

    print('Finished training after {} epochs'.format(epoch))
    return min_mse, loss_record

In the training module, besides performing the forward pass, backpropagation, and parameter update, the model is evaluated on the validation set after every epoch.

If the validation loss is lower than the smallest loss seen so far, the network has improved, and its parameters are saved. If the model fails to improve for a number of consecutive epochs (early_stop, e.g. 200 here), training can stop early.

Validation

validation module
def dev(dv_set, model, device):
    model.eval()  # set model to evaluation mode
    total_loss = 0
    for x, y in dv_set:  # iterate through the dataloader
        x, y = x.to(device), y.to(device)  # move data to device (cpu/cuda)
        with torch.no_grad():  # disable gradient calculation
            pred = model(x)  # forward pass (compute output)
            mse_loss = model.cal_loss(pred, y)  # compute loss
        total_loss += mse_loss.detach().cpu().item() * len(x)  # accumulate loss
    total_loss = total_loss / len(dv_set.dataset)  # compute averaged loss

    return total_loss

The validation module judges how well training is going. Its main steps mirror those of train; the differences are that train runs the network in model.train() mode while validation uses model.eval() (which switches layers such as dropout and batch normalization to inference behavior), and validation disables gradient computation with torch.no_grad().

Test

test module
def test(tt_set, model, device):
    model.eval()  # set model to evaluation mode
    preds = []
    for x in tt_set:  # iterate through the dataloader
        x = x.to(device)  # move data to device (cpu/cuda)
        with torch.no_grad():  # disable gradient calculation
            pred = model(x)  # forward pass (compute output)
            preds.append(pred.detach().cpu())  # collect prediction
    preds = torch.cat(preds, dim=0).numpy()  # concatenate all predictions and convert to a numpy array
    return preds

The test module's flow is even simpler and is basically the same as validation, except that no loss is computed (the test set has no labels). The predictions it returns still need to be written out for submission; see the sketch below.
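The Save Data module from the module list is not shown in this post. A minimal sketch, assuming Kaggle's submission format with id and tested_positive columns (this mirrors the course Colab's save_pred helper):

import csv

def save_pred(preds, file):
    ''' Save predictions to the specified csv file '''
    print('Saving results to {}'.format(file))
    with open(file, 'w') as fp:
        writer = csv.writer(fp)
        writer.writerow(['id', 'tested_positive'])
        for i, p in enumerate(preds):
            writer.writerow([i, p])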

Setup Hyper-parameters

hyper-parameters
config = {
    'n_epochs': 3000,  # maximum number of epochs
    'batch_size': 128,  # mini-batch size for dataloader
    # with SGD the loss failed to decrease (see the note below)
    'optimizer': 'Adam',  # optimization algorithm (optimizer in torch.optim)
    'optim_hparas': {  # hyper-parameters for the optimizer (depends on which optimizer you are using)
        'lr': 0.001,  # learning rate
        # 'momentum': 0.9,  # momentum for SGD
        'weight_decay': 1e-3  # L2 regularization
    },
    'early_stop': 200,  # early stopping epochs (the number of epochs since your model's last improvement)
    'save_path': 'models/model.pth'  # your model will be saved here
}

The hyper-parameters mainly include batch_size, the optimizer type, the learning rate, and the early-stop counter. In this part I changed the optimizer from SGD to Adam and set weight_decay to enable L2 regularization.

Open question: with SGD in this homework, the training loss fails to decrease and plateaus around 57, seemingly stuck at a saddle point.

Result

After submitting to Kaggle, the scores were:

  • private score: 0.90250
  • public score: 0.88411
