Regression: Offshore Wind Power Output Forecasting

https://www.dcic-china.com/competitions/10098

Let's analyze how to approach the feature engineering.


  1. Time features: hour, minute, day of week, day of month. These act as markers for periodic patterns. For example, foot traffic fluctuates sharply on weekends; if you don't tell the model this explicitly, it is very hard for it to learn.
  2. Domain features: these require reading up on the relevant domain knowledge. The operations are basically applying a transform f(x) to a single feature, or combining two features with arithmetic: add/subtract features from the same domain, multiply/divide features from different domains. Ideally, the resulting features should have a real physical meaning.
  3. Historical series features: sliding windows, moving averages, and so on. In a previous competition I saw someone whose feature engineering was explosive in scale; it surprised me, but their results were genuinely good. There is some trial and error involved here, so just experiment.
  4. Label processing. For regression, if you can reduce the scale of the target, you should. You can divide (or subtract) the label by a strongly correlated feature to shrink its range, which keeps the model's predictions from drifting out of control.
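The label-processing idea in point 4 can be sketched as follows. This is a minimal toy example, assuming a target `power` and a strongly correlated feature `capacity` (all values here are illustrative, not competition data):

```python
import numpy as np

# Hypothetical labels and a strongly correlated feature (installed capacity).
power = np.array([12.0, 48.0, 150.0, 300.0])       # MW, spans a wide range
capacity = np.array([50.0, 100.0, 200.0, 400.0])   # MW

# Train on the normalised target instead of the raw one:
# every value now lies in a comparable [0, 1]-ish range.
y_norm = power / capacity
print(y_norm)  # [0.24 0.48 0.75 0.75]

# At prediction time, map the model output back to the original scale.
y_pred_norm = np.array([0.3, 0.5, 0.7, 0.8])       # pretend model output
y_pred = y_pred_norm * capacity
print(y_pred)  # [ 15.  50. 140. 320.]
```

Because every station's target is squeezed into a similar range, one model can serve stations of very different sizes, and the inverse transform guarantees predictions scale with each station's capacity.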

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier, CatBoostRegressor
from sklearn.model_selection import StratifiedKFold, KFold, GroupKFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import tqdm
import sys
import os
import gc
import argparse
import warnings

warnings.filterwarnings('ignore')

# Load data
train_info = pd.read_csv('../data/first_data/A榜-训练集_海上风电预测_基本信息.csv', encoding='gbk')
train_df = pd.read_csv('../data/first_data/A榜-训练集_海上风电预测_气象变量及实际功率数据.csv', encoding='gbk')
test_info = pd.read_csv('../data/first_data/B榜-测试集_海上风电预测_基本信息.csv', encoding='gbk')
test_df = pd.read_csv('../data/first_data/B榜-测试集_海上风电预测_气象变量数据.csv', encoding='gbk')
submit_example = pd.read_csv('../data/first_data/submit_example.csv')

# Attach the installed capacity of each station, keep the numeric part of the station ID
train_df = train_df.merge(train_info[['站点编号','装机容量(MW)']], on=['站点编号'], how='left')
test_df = test_df.merge(test_info[['站点编号','装机容量(MW)']], on=['站点编号'], how='left')
train_df['站点编号'] = train_df['站点编号'].apply(lambda x: int(x[1]))
test_df['站点编号'] = test_df['站点编号'].apply(lambda x: int(x[1]))

# Rename columns to English
train_df.columns = ['stationId','time','airPressure','relativeHumidity','cloudiness','10mWindSpeed','10mWindDirection','temperature','irradiation','precipitation','100mWindSpeed','100mWindDirection','power','capacity']
test_df.columns = ['stationId','time','airPressure','relativeHumidity','cloudiness','10mWindSpeed','10mWindDirection','temperature','irradiation','precipitation','100mWindSpeed','100mWindDirection','capacity']

# Feature combinations (the small epsilon avoids division by zero)
train_df['100mWindSpeed/10mWindSpeed'] = train_df['100mWindSpeed'] / (train_df['10mWindSpeed'] + 1e-7)
test_df['100mWindSpeed/10mWindSpeed'] = test_df['100mWindSpeed'] / (test_df['10mWindSpeed'] + 1e-7)
train_df['100mWindDirection/10mWindDirection'] = train_df['100mWindDirection'] / (train_df['10mWindDirection'] + 1e-7)
test_df['100mWindDirection/10mWindDirection'] = test_df['100mWindDirection'] / (test_df['10mWindDirection'] + 1e-7)
train_df['10mWindDirection_new'] = train_df['10mWindDirection'] - 180
test_df['10mWindDirection_new'] = test_df['10mWindDirection'] - 180

# Differences between the two measurement heights
train_df['100mWindSpeed_10mWindSpeed'] = train_df['100mWindSpeed'] - train_df['10mWindSpeed']
test_df['100mWindSpeed_10mWindSpeed'] = test_df['100mWindSpeed'] - test_df['10mWindSpeed']
train_df['100mWindDirection_10mWindDirection'] = train_df['100mWindDirection'] - train_df['10mWindDirection']
test_df['100mWindDirection_10mWindDirection'] = test_df['100mWindDirection'] - test_df['10mWindDirection']

# Wind shear exponent (approximate)
train_df['WindSpeed/WindDirection'] = train_df['100mWindSpeed/10mWindSpeed'] / train_df['100mWindDirection/10mWindDirection']
test_df['WindSpeed/WindDirection'] = test_df['100mWindSpeed/10mWindSpeed'] / test_df['100mWindDirection/10mWindDirection']
train_df['100mWindSpeed/10mWindSpeed_2'] = train_df['100mWindSpeed/10mWindSpeed'].apply(np.log10) / 10
test_df['100mWindSpeed/10mWindSpeed_2'] = test_df['100mWindSpeed/10mWindSpeed'].apply(np.log10) / 10

# Humidity / temperature
train_df['relativeHumidity/temperature'] = train_df['relativeHumidity'] / (train_df['temperature'] + 1e-7)
test_df['relativeHumidity/temperature'] = test_df['relativeHumidity'] / (test_df['temperature'] + 1e-7)

# Irradiation / temperature
train_df['irradiation/temperature'] = train_df['irradiation'] / (train_df['temperature'] + 1e-7)
test_df['irradiation/temperature'] = test_df['irradiation'] / (test_df['temperature'] + 1e-7)

# Irradiation / cloudiness
train_df['irradiation/cloudiness'] = train_df['irradiation'] / (train_df['cloudiness'] + 1e-7)
test_df['irradiation/cloudiness'] = test_df['irradiation'] / (test_df['cloudiness'] + 1e-7)

# Precipitation flag
train_df['is_precipitation'] = train_df['precipitation'].apply(lambda x: 1 if x > 0 else 0)
test_df['is_precipitation'] = test_df['precipitation'].apply(lambda x: 1 if x > 0 else 0)

# Calendar features from the timestamp
def get_time_feature(df, col):
    df_copy = df.copy()
    prefix = col + "_"
    df_copy[col] = df_copy[col].astype(str)
    df_copy[col] = pd.to_datetime(df_copy[col], format='%Y-%m-%d %H:%M')
    df_copy[prefix + 'month'] = df_copy[col].dt.month
    df_copy[prefix + 'day'] = df_copy[col].dt.day
    df_copy[prefix + 'hour'] = df_copy[col].dt.hour
    df_copy[prefix + 'minute'] = df_copy[col].dt.minute
    # .dt.weekofyear was removed in recent pandas; use isocalendar() instead
    df_copy[prefix + 'weekofyear'] = df_copy[col].dt.isocalendar().week.astype(int)
    df_copy[prefix + 'dayofyear'] = df_copy[col].dt.dayofyear
    return df_copy

train_df = get_time_feature(train_df, 'time')
test_df = get_time_feature(test_df, 'time')

# Merge train and test data
train_df['is_test'] = 0
test_df['is_test'] = 1
df = pd.concat([train_df, test_df], axis=0).reset_index(drop=True)

# Build features
num_cols = ['airPressure','relativeHumidity','cloudiness','10mWindSpeed','10mWindDirection','temperature','irradiation','precipitation','100mWindSpeed','100mWindDirection']
for col in tqdm.tqdm(num_cols):
    # Historical shift / difference features (96 steps = one day at 15-min resolution)
    for i in [1, 2, 3, 4, 5, 6, 7, 15, 30, 50] + [1*96, 2*96, 3*96, 4*96, 5*96]:
        df[f'{col}_shift{i}'] = df.groupby('stationId')[col].shift(i)
        df[f'{col}_future_shift{i}'] = df.groupby('stationId')[col].shift(-i)
        df[f'{col}_diff{i}'] = df[f'{col}_shift{i}'] - df[col]
        df[f'{col}_future_diff{i}'] = df[f'{col}_future_shift{i}'] - df[col]
        df[f'{col}_2diff{i}'] = df.groupby('stationId')[f'{col}_diff{i}'].diff(1)
        df[f'{col}_future_2diff{i}'] = df.groupby('stationId')[f'{col}_future_diff{i}'].diff(1)

    # Mean-related features: centred means built up incrementally
    df[f'{col}_3mean'] = (df[col] + df[f'{col}_future_shift1'] + df[f'{col}_shift1']) / 3
    df[f'{col}_5mean'] = (df[f'{col}_3mean']*3 + df[f'{col}_future_shift2'] + df[f'{col}_shift2']) / 5
    df[f'{col}_7mean'] = (df[f'{col}_5mean']*5 + df[f'{col}_future_shift3'] + df[f'{col}_shift3']) / 7
    df[f'{col}_9mean'] = (df[f'{col}_7mean']*7 + df[f'{col}_future_shift4'] + df[f'{col}_shift4']) / 9
    df[f'{col}_11mean'] = (df[f'{col}_9mean']*9 + df[f'{col}_future_shift5'] + df[f'{col}_shift5']) / 11
    df[f'{col}_shift_3_96_mean'] = (df[f'{col}_shift{1*96}'] + df[f'{col}_shift{2*96}'] + df[f'{col}_shift{3*96}']) / 3
    df[f'{col}_shift_5_96_mean'] = (df[f'{col}_shift_3_96_mean']*3 + df[f'{col}_shift{4*96}'] + df[f'{col}_shift{5*96}']) / 5
    df[f'{col}_future_shift_3_96_mean'] = (df[f'{col}_future_shift{1*96}'] + df[f'{col}_future_shift{2*96}'] + df[f'{col}_future_shift{3*96}']) / 3
    # Original code divided by 3 here, which looks like a typo for 5
    df[f'{col}_future_shift_5_96_mean'] = (df[f'{col}_future_shift_3_96_mean']*3 + df[f'{col}_future_shift{4*96}'] + df[f'{col}_future_shift{5*96}']) / 5

    # Window statistics over the past, then (after reversing the time order)
    # over the future; closed='left' excludes the current row from the window
    for win in [3, 5, 7, 14, 28]:
        stats = ['mean', 'max', 'min', 'std', 'skew', 'kurt', 'median']
        for stat in stats:
            roll = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left')
            df[f'{col}_win{win}_{stat}'] = getattr(roll, stat)().values
        df = df.sort_values(['stationId', 'time'], ascending=False)
        for stat in stats:
            roll = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left')
            df[f'{col}_future_win{win}_{stat}'] = getattr(roll, stat)().values
        df = df.sort_values(['stationId', 'time'], ascending=True)

        # Second-order features: offset of the current value from the window statistics
        for stat in ['mean', 'max', 'min', 'median']:
            df[f'{col}_win{win}_{stat}_loc_diff'] = df[col] - df[f'{col}_win{win}_{stat}']
            df[f'{col}_future_win{win}_{stat}_loc_diff'] = df[col] - df[f'{col}_future_win{win}_{stat}']

# Window statistics on the precipitation flag
for col in ['is_precipitation']:
    for win in [4, 8, 12, 20, 48, 96]:
        df[f'{col}_win{win}_mean'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').mean().values
        df[f'{col}_win{win}_sum'] = df.groupby('stationId')[col].rolling(window=win, min_periods=3, closed='left').sum().values

train_df = df[df.is_test == 0].reset_index(drop=True)
test_df = df[df.is_test == 1].reset_index(drop=True)
del df
gc.collect()

# Drop rows whose label is the '<NULL>' placeholder, then cast to float
train_df = train_df[train_df['power'] != '<NULL>'].reset_index(drop=True)
train_df['power'] = train_df['power'].astype(float)
cols = [f for f in test_df.columns if f not in ['time', 'power', 'is_test']]  # capacity stays in as a feature

def cv_model(clf, train_x, train_y, test_x, capacity, seed=2024):
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    oof = np.zeros(train_x.shape[0])
    test_predict = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        # Transform the target: normalise per-station by installed capacity
        trn_y = trn_y / capacity[train_index]
        val_y = val_y / capacity[valid_index]
        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        params = {
            'boosting_type': 'gbdt',
            'objective': 'regression',
            'metric': 'rmse',
            'min_child_weight': 5,
            'num_leaves': 2 ** 8,
            'lambda_l2': 10,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.8,
            'bagging_freq': 4,
            'learning_rate': 0.1,
            'seed': 2023,
            'nthread': 16,
            'verbose': -1,
        }
        # Note: on LightGBM >= 4.0, replace verbose_eval/early_stopping_rounds with
        # callbacks=[lgb.log_evaluation(500), lgb.early_stopping(200)]
        model = clf.train(params, train_matrix, 3000,
                          valid_sets=[train_matrix, valid_matrix],
                          categorical_feature=[],
                          verbose_eval=500, early_stopping_rounds=200)
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        oof[valid_index] = val_pred
        test_predict += test_pred / kf.n_splits
        # Competition-style score, computed on the de-normalised values
        score = 1 / (1 + np.sqrt(mean_squared_error(val_pred * capacity[valid_index], val_y * capacity[valid_index])))
        cv_scores.append(score)
        print(cv_scores)
        if i == 0:
            imp_df = pd.DataFrame()
            imp_df["feature"] = cols
            imp_df["importance_gain"] = model.feature_importance(importance_type='gain')
            imp_df["importance_split"] = model.feature_importance(importance_type='split')
            imp_df["mul"] = imp_df["importance_gain"] * imp_df["importance_split"]
            imp_df = imp_df.sort_values(by='mul', ascending=False)
            imp_df.to_csv('feature_importance.csv', index=False)
            print(imp_df[:30])
    return oof, test_predict

lgb_oof, lgb_test = cv_model(lgb, train_df[cols], train_df['power'], test_df[cols], train_df['capacity'])
```
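The script above ends at the `cv_model` call, and its test predictions are still capacity-normalised. A plausible final step (not shown in the original) is to scale them back by each station's capacity before filling the submission file. This is a toy sketch: `lgb_test`, the rows of `test_df`, and the column names are illustrative assumptions, not the competition's actual format:

```python
import numpy as np
import pandas as pd

# Hypothetical normalised fold-averaged predictions for three test rows.
lgb_test = np.array([0.42, 0.55, 0.10])
test_df = pd.DataFrame({
    'stationId': [1, 1, 2],
    'time': ['2024-01-01 00:00', '2024-01-01 00:15', '2024-01-01 00:00'],
    'capacity': [200.0, 200.0, 100.0],  # MW, per station
})

# De-normalise, and clip to the physically valid range [0, capacity]:
# a wind farm cannot produce negative power or exceed its installed capacity.
test_df['power'] = np.clip(lgb_test * test_df['capacity'], 0, test_df['capacity'])
print(test_df[['stationId', 'time', 'power']])
```

The clip is cheap insurance: the model is free to predict slightly outside [0, 1] on the normalised scale, and clipping removes those impossible values before submission.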

