Child Mind Institute - Detect Sleep States (a write-up of my first Kaggle silver medal, 2023)

Acknowledgements

Thanks to Brother Ai (the expert who led the team), to my junior rich (who successfully switched from mechanical engineering to coding through this competition, learning patiently), and to Zhang (who also studied very patiently). Thanks as well to the open-source solution (the open-source baseline alone was already silver-level); building on it, we gained a lot in less than a month and were quite lucky. This is our summary of the competition:

Our team's Kaggle CMI silver-medal solution; anyone interested is welcome to upvote it: https://www.kaggle.com/competitions/child-mind-institute-detect-sleep-states/discussion/459610


Plan (process > results, robustness > tricks)

We kept a team plan sheet recording the part each person was working on, to avoid duplicated effort, make communication easier, and improve efficiency; this worksheet played a big role.


Detailed solution

75th Place Detailed Solution - Spec2DCNN + CenterNet + Transformer + NMS

First of all, I would like to thank @tubotubo for sharing your high-quality code, and also thank my teammates @liruiqi577 @brickcoder @xtzhou for their contributions in the competition. Here, I am going to share our team’s “snore like thunder” solution from the following aspects:

  1. Data preprocessing
  2. Feature Engineering
  3. Model
  4. Post Processing
  5. Model Ensemble

1. Data preprocessing

We did EDA, read the open discussions, and found that there are four types of data anomalies:

  • Some series have a high missing rate, and some of them do not even have any event labels;
  • In some series there are no event annotations in the middle or at the tail (possibly because the collection had stopped);
  • Some sleep records are incomplete (a period of sleep is marked only with an onset or only with a wakeup);
  • There are outliers in the enmo values.

To address these, we made several attempts, such as:

  • Eliminating series with high missing rates;
  • Cutting off the tails of series without event labels;
  • Clipping enmo at an upper bound of 1.

But the above methods didn't completely work. In the end, our preprocessing method was:

We split the dataset into 5 folds grouped by series. For each fold, we eliminated series with a label missing rate of 100% from the training set, without performing any preprocessing on the validation set. This avoids introducing noise into the training set and keeps the evaluation on the validation set closer to the real data distribution, which improved our LB score by +0.006.
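The following is a minimal sketch of this split, assuming the competition's `train_events.csv` layout (one row per labelled event, with `step` NaN where the annotation is missing); `make_folds` and the variable names are our own illustration, not the exact code we ran:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def make_folds(train_series: pd.DataFrame, train_events: pd.DataFrame, n_splits: int = 5):
    """Group-wise 5-fold split; drop fully unlabeled series from the train half only."""
    # series that have at least one non-missing event annotation
    labeled_ids = set(train_events.dropna(subset=["step"])["series_id"].unique())
    series_ids = train_series["series_id"].unique()

    folds = []
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, valid_idx in gkf.split(series_ids, groups=series_ids):
        # training series: remove those whose label missing rate is 100%
        train_ids = [sid for sid in series_ids[train_idx] if sid in labeled_ids]
        # validation series: keep everything, so evaluation stays close to the real distribution
        valid_ids = list(series_ids[valid_idx])
        folds.append((train_ids, valid_ids))
    return folds
```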

Some of our experiments are shown below:

| Experiment | Fold0 | Public (single fold) | Private (5-fold) |
|---|---|---|---|
| No preprocessing of missing data | 0.751 | 0.718 | 0.744 |
| Eliminate unlabeled data at the end of train_series & series with missing rate > 80% | 0.739 | 0.709 | 0.741 |
| Drop train series which don't have any event labels | 0.752 | 0.724 | 0.749 |

2. Feature Engineering

  • Sensor features: we take the absolute value of the first-order difference of the enmo and anglez signals, then smooth it with a centered rolling mean, and replace the original enmo and anglez features with the result (see the snippet below); this improved our LB score by +0.01.
```python
# absolute first-order difference, then a centered 5-step rolling-mean smoothing;
# the smoothed signals replace the original enmo / anglez columns
train_series['enmo_abs_diff'] = train_series['enmo'].diff().abs()
train_series['enmo'] = train_series['enmo_abs_diff'].rolling(window=5, center=True, min_periods=1).mean()
train_series['anglez_abs_diff'] = train_series['anglez'].diff().abs()
train_series['anglez'] = train_series['anglez_abs_diff'].rolling(window=5, center=True, min_periods=1).mean()
```
  • Time features: sin and cos of the hour.

In addition, we also built features based on open notebooks and our EDA, such as: difference features of different orders, rolling-window statistical features, interaction features of enmo and anglez (e.g. the absolute anglez difference * enmo), anglez_rad_sin/cos, and dayofweek/is_weekend (we found that children have different sleeping habits on weekdays and weekends); a small sketch of some of these follows. But strangely enough, heavier feature engineering did not bring us much benefit.
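A minimal sketch of a few of the features mentioned above, assuming `train_series` already has a parsed `timestamp` column (the exact feature set we ran differed, so treat this only as an illustration):

```python
import numpy as np

# cyclic hour encoding (the time features we kept)
hour = train_series["timestamp"].dt.hour
train_series["hour_sin"] = np.sin(2 * np.pi * hour / 24)
train_series["hour_cos"] = np.cos(2 * np.pi * hour / 24)

# anglez in radians, sin/cos encoded
anglez_rad = np.deg2rad(train_series["anglez"])
train_series["anglez_rad_sin"] = np.sin(anglez_rad)
train_series["anglez_rad_cos"] = np.cos(anglez_rad)

# calendar features: children sleep differently on weekends
train_series["dayofweek"] = train_series["timestamp"].dt.dayofweek
train_series["is_weekend"] = (train_series["dayofweek"] >= 5).astype(int)

# interaction feature: absolute anglez difference scaled by enmo
train_series["anglez_diff_enmo"] = train_series["anglez"].diff().abs() * train_series["enmo"]
```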

| Experiment | Fold0 | Public (5-fold) | Private (5-fold) |
|---|---|---|---|
| anglez + enmo + hour_sin + hour_cos | 0.763 | 0.731 | 0.768 |
| anglez_abs_diff + enmo_abs_diff + hour_sin + hour_cos | 0.771 | 0.741 | 0.781 |

3. Model

We used 4 models:

  • CNNSpectrogram + Spec2DCNN + UNet1DDecoder;
  • PANNsFeatureExtractor + Spec2DCNN + UNet1DDecoder;
  • PANNsFeatureExtractor + CenterNet + UNet1DDecoder;
  • TransformerAutoModel (xsmall, downsample_rate=8).

Parameter tuning: adding an extra kernel_size of 8 for CNNSpectrogram gained +0.002 online.

Multi-Task Learning Objectives: sleep status, onset, wake.

Loss Function: For Spec2DCNN and TransformerAutoModel we used BCE, but with multi-task target weighting, sleep : onset : wake = 0.5 : 1 : 1. The purpose is to let the model focus on learning the last two columns. We tried training only on the onset and wake columns, but the score was not good. We speculate that the positive samples in these two columns are sparse, so multi-task learning is needed to transfer information from the positive samples of the sleep state to the prediction of the sleep-event columns. I also tried a KL loss, but it did not work as well.

```python
# down-weight the sleep channel's positive term relative to onset/wakeup (0.5 : 1 : 1)
self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.5, 1., 1.]))
```

At the same time, we increased the number of epochs to 70 and added early stopping with patience=15. The early-stopping criterion is the AP on the validation set, not the validation loss. batch_size=32.

| Experiment | Fold0 | Public (single fold) | Private (5-fold) |
|---|---|---|---|
| earlystop by val_loss | 0.750 | 0.697 | 0.742 |
| earlystop by val_score | 0.751 | 0.718 | 0.744 |
| loss_wgt = 1:1:1 | 0.752 | 0.724 | 0.749 |
| loss_wgt = 0.5:1:1 | 0.755 | 0.723 | 0.753 |

Note: we submitted to the LB using the model_weight.pth with the best offline val_score, instead of the best_model.pth with the best offline val_loss.
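A rough sketch of the checkpoint and early-stopping logic described above, tracking validation AP rather than validation loss (the training-loop helpers `train_one_epoch` and `validate_event_detection_ap` are assumed names, not our actual code):

```python
import torch

best_score, patience, bad_epochs = -1.0, 15, 0

for epoch in range(70):
    train_one_epoch(model, train_loader)             # assumed training helper
    val_score = validate_event_detection_ap(model)   # offline competition AP, assumed helper

    if val_score > best_score:
        best_score, bad_epochs = val_score, 0
        # this checkpoint (best val_score) is the one we submit with
        torch.save(model.state_dict(), "model_weight.pth")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # stop after 15 epochs without AP improvement
            break
```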

4. Post Processing

Our post-processing mainly includes:

  • find_peaks(): scipy.signal.find_peaks;
  • NMS: this task can be treated as object detection. Each [onset, wakeup] pair is regarded as a bounding box, and the score is the confidence of the box. Therefore, I used a time-series NMS. NMS eliminates redundant boxes with high IoU, which increases our AP.
```python
import numpy as np

def apply_nms(dets_arr, thresh):
    """1-D NMS over [onset_step, wakeup_step, score] detections."""
    x1 = dets_arr[:, 0]        # onset step of each candidate interval
    x2 = dets_arr[:, 1]        # wakeup step of each candidate interval
    scores = dets_arr[:, 2]    # confidence of the interval

    areas = x2 - x1
    order = scores.argsort()[::-1]   # process candidates from highest to lowest score

    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept interval and the remaining ones
        xx1 = np.maximum(x1[i], x1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1 + 1)
        ovr = inter / (areas[i] + areas[order[1:]] - inter)
        # drop intervals that overlap the kept one too much
        inds = np.where(ovr <= thresh)[0]
        order = order[inds + 1]

    dets_nms_arr = dets_arr[keep, :]
    onset_steps = dets_nms_arr[:, 0].tolist()
    wakeup_steps = dets_nms_arr[:, 1].tolist()
    nms_save_steps = np.unique(onset_steps + wakeup_steps).tolist()
    return nms_save_steps
```
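For reference, a hedged example of how such a function could be called: `dets_arr` is assumed to be an (N, 3) array of [onset_step, wakeup_step, score] rows built by pairing peaks from scipy.signal.find_peaks on the predicted onset/wakeup probability curves (`preds_onset`, `preds_wakeup` are illustrative 1-D arrays, and the pairing logic is not our exact pipeline):

```python
import numpy as np
from scipy.signal import find_peaks

# candidate events for one series, from the predicted onset / wakeup probability curves
onset_peaks, onset_props = find_peaks(preds_onset, height=0.005, distance=72)
wakeup_peaks, wakeup_props = find_peaks(preds_wakeup, height=0.005, distance=72)

# naive pairing: each onset with the next wakeup; score = mean of the two peak heights
dets = []
for o, oh in zip(onset_peaks, onset_props["peak_heights"]):
    later = wakeup_peaks[wakeup_peaks > o]
    if len(later) > 0:
        w = later[0]
        wh = wakeup_props["peak_heights"][np.searchsorted(wakeup_peaks, w)]
        dets.append([o, w, (oh + wh) / 2])

# keep only the steps that survive time-series NMS with a high IoU threshold
kept_steps = apply_nms(np.array(dets), thresh=0.995)
```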

In addition, we set score_th=0.005 (if it is set too low, a large number of events are detected and cause online scoring errors, so it is fixed at 0.005 here), and used optuna to jointly search the distance parameter of find_peaks and the iou_threshold parameter of NMS. The best performance was achieved with distance=72 and iou_threshold=0.995.

```python
import optuna

def objective(trial):
    score_th = 0.005  # trial.suggest_float('score_th', 0.003, 0.006)
    distance = trial.suggest_int('distance', 20, 80)
    thresh = trial.suggest_float('thresh', 0.75, 1.)
    # find peaks
    val_pred_df = post_process_for_seg(
        keys=keys,
        preds=preds[:, :, [1, 2]],
        score_th=score_th,
        distance=distance,
    )
    # NMS
    val_pred_df = val_pred_df.to_pandas()
    nms_pred_dfs = NMS_prediction(val_pred_df, thresh, verbose=False)
    score = event_detection_ap(valid_event_df.to_pandas(), nms_pred_dfs)
    return -score

study = optuna.create_study()
study.optimize(objective, n_trials=100)
print('Best hyperparameters: ', study.best_params)
print('Best score: ', study.best_value)
```
| Experiment | Fold0 | Public (5-fold) | Private (5-fold) |
|---|---|---|---|
| find_peak | - | 0.745 | 0.787 |
| find_peak + NMS + optuna | - | 0.746 | 0.789 |

5. Model Ensemble

Finally, we averaged the output probabilities of the following models and then fed the averaged probabilities into the post-processing methods to detect events. By the way, I tried post-processing the detected events of each model and then concatenating them, but this produced too many detections; even with NMS I did not get a better score.

The number of ensemble models: 4 (types of models) * 5 (fold number) = 20.

| Experiment | Fold0 | Public (5-fold) | Private (5-fold) |
|---|---|---|---|
| model1: CNNSpectrogram + Spec2DCNN + UNet1DDecoder | 0.77209 | 0.743 | 0.784 |
| model2: PANNsFeatureExtractor + Spec2DCNN + UNet1DDecoder | 0.777 | 0.743 | 0.782 |
| model3: PANNsFeatureExtractor + CenterNet + UNet1DDecoder | 0.75968 | 0.634 | 0.68 |
| model4: TransformerAutoModel | 0.74680 | - | - |
| model1 + model2 (1:1) | - | 0.746 | 0.789 |
| model1 + model2 + model3 (1:1:0.4) | - | 0.75 | 0.786 |
| model1 + model2 + model3 + model4 (1:1:0.4:0.2) | - | 0.752 | 0.787 |
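A rough sketch of the weighted probability averaging behind the ensemble rows above, assuming each model's predictions have already been averaged over its 5 folds into an array of shape (num_windows, seq_len, 3) with channels [sleep, onset, wakeup]; the array names and shapes are illustrative, and `post_process_for_seg` is the same helper used in the optuna snippet:

```python
import numpy as np

# per-model probability outputs, each already averaged over its 5 folds
model_preds = [preds_model1, preds_model2, preds_model3, preds_model4]
weights = np.array([1.0, 1.0, 0.4, 0.2])

# weighted average of output probabilities, then one post-processing pass as above
ens_preds = sum(w * p for w, p in zip(weights, model_preds)) / weights.sum()
sub_df = post_process_for_seg(keys=keys, preds=ens_preds[:, :, [1, 2]],
                              score_th=0.005, distance=72)
```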

Unfortunately, we only added CenterNet and the Transformer to the ensemble tentatively on the last day, and were surprised to find that a model with a low CV score can still improve the final performance, as long as it is heterogeneous with respect to the previous models. But we did not have the chance to make more submissions, which was a profound lesson for me.

Ideas we did not get to:

  • Data augmentation: shift the time within the batch to increase time diversity and reduce the dependence on the hour features.

  • Model: try more models. We tried a Transformer, but it did not work for us. I am very much looking forward to the solutions from the top-ranking players.

Thanks again to Kaggle and all Kaggle players. This was a good competition and we learned a lot from it. If you think our solution is useful for you, welcome to upvote and discuss with us.

In addition, this is my first 🥈 silver medal. Thank you everyone for letting me learn a lot. I will continue to work hard. :)
