Machine Learning's Secret Source: Curation

Methods for Successful Machine Learning / Artificial Intelligence

It's widely said that data is the new oil and, like oil, data needs the right refinement before it can be put to good use. The power of a machine learning model depends significantly on the quality of its data; I'm not saying anything new here.

As AI development and its applications become more pervasive, ML engineers everywhere are confronted with a grim reality. Once stakeholders overcome their biases or scepticism, finally buy in, identify a use case with proven ROI, and are eager to jump aboard the AI ship, data curation is usually neglected and does not receive its due importance, often because of a quick-win mentality and the fact that it's not sexy!

There are many assumptions, even within technology groups, that AI only needs to be fed data collected and combined in large quantities; in most cases, this backfires badly. Flawed datasets come in many forms, ranging from factually incorrect information to knowledge gaps to wrong guidelines. Among many other problems, an uncurated dataset can be:

  • Biased: recently, several popular AI systems used for image recognition displayed disturbing gender and racial bias.

  • Inaccurate, unreliable or falsely represented

  • Error-ridden or ambiguous

Failing to refine or curate raw datasets is widely known to decrease feature quality and limit the evaluation and application of transfer tasks. So how should datasets be treated so that they serve the exact purpose ML needs in order to work? This is highly dependent on the use cases the ML engineers are trying to address.

Types of Datasets for Machine Learning

ML engineers depend on data at each step of their AI journey, from model choice through training to testing. These datasets typically fall under three classifications:

  • Training sets

  • Validation sets

  • Testing sets

Every ML project starts with two dataset categories: the training dataset and the testing dataset.

  • The training dataset is used to train the algorithm so that it can learn concepts, discover patterns, and give results.

  • Testing data is used to evaluate the model that was trained on the training dataset. Training data is not reused for testing because the model has already seen it and would simply produce the expected outputs.

[Figure: image created by the author, Steve Leven]
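
As a rough illustration of the three kinds of dataset described above, the sketch below shuffles a collection of records and cuts it into training, validation, and testing portions. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions made for this example; they are not taken from the article, and real projects often choose other proportions or stratified splits.

```python
import random


def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle records and split them into train, validation and test lists.

    The fractions and the seed are illustrative defaults, not recommendations.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")

    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # everything that is left over
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Because the three slices never overlap, no record used for training can leak into the data used to evaluate the model.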

Data Needs for Machine Learning

Data scientists collect data from various sources, integrate it into one form, and then validate, manipulate, archive, preserve, retrieve, and present it.

The process of curating datasets for machine learning starts well before the datasets are put to use.

My suggestions:

  • Identify the aim of the AI

  • Identify what dataset you will require to solve the problem

  • Create a record of your hypotheses while selecting the data

  • Strive to collect diverse and meaningful data from both external and internal sources

  • Create datasets that are hard for your competitors to copy (defensibility)

If you have a small dataset, taking a model pre-trained on large datasets and using your small dataset to fine-tune it can be a great approach.
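
To make the fine-tuning idea concrete, here is a deliberately tiny sketch under stated assumptions: the "model" is a one-variable linear fit trained by plain gradient descent, both datasets are synthetic, and the helper names `gradient_descent` and `mse` are invented for this example. Real fine-tuning usually involves a deep network and a framework, but the principle shown is the same: start from parameters learned on the large dataset rather than from scratch.

```python
def gradient_descent(xs, ys, w, b, lr, steps):
    """Fit y = w*x + b by minimising mean squared error, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        w -= lr * 2 * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * 2 * sum(errors) / n
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the given points."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# "Large" source dataset (synthetic): 100 points from y = 2x + 1.
big_x = [i / 100 for i in range(100)]
big_y = [2 * x + 1 for x in big_x]
pretrained = gradient_descent(big_x, big_y, w=0.0, b=0.0, lr=0.1, steps=2000)

# Small target dataset (synthetic) from a related task: y = 2x + 1.2.
small_x = [0.1, 0.3, 0.5, 0.7, 0.9]
small_y = [2 * x + 1.2 for x in small_x]

# Same small training budget, two different starting points.
fine_tuned = gradient_descent(small_x, small_y, *pretrained, lr=0.1, steps=20)
from_scratch = gradient_descent(small_x, small_y, 0.0, 0.0, lr=0.1, steps=20)

print("fine-tuned error:  ", mse(small_x, small_y, *fine_tuned))
print("from-scratch error:", mse(small_x, small_y, *from_scratch))
```

With the same 20 update steps on the small dataset, the run that starts from the pre-trained parameters ends with the lower error, because it only has to learn the difference between the two tasks.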

Once you have accumulated the correct data, you can proceed to create the training set. This step of putting data into the optimal format is called feature transformation, and it involves four stages:

Formatting: Data arrives in different formats, and formatting brings it together in one sheet. For example, consumer data can come with different currencies, semantics and so on. These need to be compiled under one format for a uniform foundation.

Labelling: Labelling ensures the dataset works for the specific model choice. For example, an autonomous car requires images labelled as cars, pedestrians, road signs and walkways.

Cleansing: Unwanted characters need to be removed, and missing values are handled according to how much each one is needed.

Extraction: Several features are examined and optimised, keeping those that are essential for predictive capability, faster computation and lower memory consumption.
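
The stages above can be sketched as a small pipeline. Everything concrete here is an assumption made for illustration: the field names, the exchange rates in `RATES_TO_USD`, the rule that a row without a name is dropped while a missing age is filled with the median, and the use of a variance threshold as the extraction criterion. Labelling is left out because it depends on the annotation process for the specific use case.

```python
import re
from statistics import median, pvariance

# Hypothetical conversion rates to USD, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1, "GBP": 1.25}


def format_record(record):
    """Formatting: bring one raw consumer record into a single schema (USD amounts)."""
    currency = record["currency"].strip().upper()
    return {
        "name": record["name"],
        "spend_usd": round(float(record["amount"]) * RATES_TO_USD[currency], 2),
        "age": record.get("age"),
        "store_id": record["store_id"],
    }


def cleanse(rows):
    """Cleansing: strip stray characters, drop rows without a name, impute missing ages."""
    kept = []
    for row in rows:
        # Replace control characters with spaces, then collapse runs of whitespace.
        name = re.sub(r"[\x00-\x1f\x7f]", " ", row["name"] or "")
        name = re.sub(r"\s+", " ", name).strip()
        if name:  # the name is treated as required, so rows without one are dropped
            kept.append({**row, "name": name})
    known_ages = [row["age"] for row in kept if row["age"] is not None]
    fill = median(known_ages) if known_ages else None
    return [{**row, "age": fill if row["age"] is None else row["age"]} for row in kept]


def extract(rows, candidates, min_variance=1e-3):
    """Extraction: keep only the numeric features that actually vary across rows."""
    return [f for f in candidates if pvariance([row[f] for row in rows]) > min_variance]


raw = [
    {"name": "Ada\t Lovelace\x00", "amount": "10", "currency": "usd", "age": 36, "store_id": 7},
    {"name": "Alan  Turing", "amount": 20, "currency": "EUR", "age": None, "store_id": 7},
    {"name": "", "amount": 5, "currency": "GBP", "age": 50, "store_id": 7},
    {"name": "Grace Hopper", "amount": 8, "currency": "GBP", "age": 40, "store_id": 7},
]

formatted = [format_record(r) for r in raw]
cleaned = cleanse(formatted)
features = extract(cleaned, ["spend_usd", "age", "store_id"])
print(features)
```

In this toy data the `store_id` column is identical in every row, so it carries no predictive signal and is dropped, which also saves memory and computation. A variance threshold depends on the scale of each feature, so in practice it would be tuned per column or replaced by another selection criterion.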

The Bottom Line

A dataset alone can determine the success or failure of a machine learning model. Data curation is one of the fundamental aspects of machine learning and, if exercised correctly, it can unleash tremendous potential. The methods and subsequent processes can appear time-consuming; however, they keep your dataset aligned with the goals of your machine learning project at each step.

Introducing data curation processes, and the procedures that follow from them, into your data team will appear time-consuming and expensive in the short term; therefore, organisations must carefully analyse their current objectives and develop a strategy that supports the case for curation-as-a-function. Managed services and unsupervised methods trained on curated data are available and marketed by advisory and technology firms. Choose carefully; this will play a key role in your AI future.

Translated from: https://towardsdatascience.com/machine-learnings-secret-source-curation-e8c3107dcc13
