Building a ML System for Production

When businesses plan to incorporate machine learning into their solutions, they more often than not think it is mostly about algorithms and analytics. Most blogs and training material on the subject likewise talk only about taking fixed-format files, training models, and printing the results. Naturally, businesses tend to think that hiring good data scientists should get the job done. What they often fail to appreciate is that this is also a good old system and data engineering problem, with the data models and algorithms sitting at the core.

A few years ago, at an organisation I was working in, the business deliberated on using machine learning models to enhance user engagement. The use cases initially planned revolved around content recommendation. Later, as we worked more in the field, we started using it for more diverse problems such as topic classification, keyword extraction, and newsletter content selection.

I will use our experience of designing and deploying machine-learnt models in production to illustrate the engineering and human aspects of building a data science application and team.

Training models was the crux of our data science application. But to make things work in production, many other missing pieces of the puzzle had to be solved.

These were:

  1. Getting data into the system on a regular basis from multiple sources.
  2. Cleaning and transforming it into more than one structure for use.
  3. Training and retraining models, and saving and reusing them as required (see the sketch after this list).
  4. Applying incremental changes.
  5. Exposing model outputs for consumption through APIs.
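
To make point 3 concrete, the sketch below shows one way to train a Spark ML model, persist it, and reload it in a later job. It is a minimal illustration in PySpark; the HDFS paths, column names, and the choice of logistic regression are assumptions for the example, not details of our actual pipelines.

```python
# A minimal sketch: train once, persist the fitted model, reload it later.
# All paths and column names here are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("topic-classifier").getOrCreate()

# Assume a cleaned dataset with `text` and `label` columns.
train_df = spark.read.parquet("hdfs:///data/primary/articles_labeled")

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])
model = pipeline.fit(train_df)

# Persist the fitted model so scoring and retraining jobs can reuse it
# without training from scratch.
model.write().overwrite().save("hdfs:///models/topic_classifier/v1")

# A later job reloads the saved model and applies it to new data.
reloaded = PipelineModel.load("hdfs:///models/topic_classifier/v1")
scored = reloaded.transform(train_df).select("text", "prediction")
```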

Scaling the consumption APIs was also a concern for us. In our existing system, content was mostly static and served from the CDN cache. Certain content-related data was served by application servers, but all users received the same data. That data was served from a cache which was updated every 5–10 seconds, and it pertained to only around 7,000-odd items on any particular day. Hence, overall memory consumption was low and the number of writes was small.

Now, personalised content output had to be produced for around 35 million users, and new content arrived every 10 minutes or so. Everything needed to be served by our application servers. This meant a far higher number of writes, and a far larger cache, than anything we had handled earlier.
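
At this scale, batching writes matters. The sketch below shows, purely as an illustration, how per-user recommendation output might be pushed into Redis with pipelined writes; the key scheme, TTL, and batch size are assumptions, not our production values.

```python
# Hypothetical batched push of per-user recommendations into Redis.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def push_recommendations(user_recs, batch_size=10_000):
    """user_recs: iterable of (user_id, [item_id, ...]) pairs."""
    pipe = r.pipeline(transaction=False)
    pending = 0
    for user_id, items in user_recs:
        # One key per user; expire after 15 minutes so stale output
        # ages out between model refreshes (an assumed TTL).
        pipe.setex(f"recs:{user_id}", 900, json.dumps(items))
        pending += 1
        if pending >= batch_size:
            pipe.execute()  # flush a batch of writes in one round trip
            pending = 0
    if pending:
        pipe.execute()
```

Pipelining amortises network round trips, which is what makes tens of millions of keys writable in a reasonable window.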

The challenge for us was to design a system that did all of this. A data science / ML project was thus not limited to building vectors and running models; it involved designing a complete system with data as the lead player.

When we started building our solution, we found that there were three facets our decisions needed to cater to, namely System, Data and Team. I will discuss our approach to each of these three aspects separately.

Data Architecture:

We had data in multiple types of databases backing our various applications, with data structures ranging from tabular to document to key-value. We had also decided to use Hadoop-ecosystem frameworks such as Spark and Flink for our processing. We therefore chose HDFS as the storage system for analytics data.

We built a three-tier data storage system.

  1. Raw Data Layer: This is essentially our data lake and the foundation layer. Data is ingested into it from all our sources, both databases and Kafka streams.
  2. Cleaned / Transformed / Enriched Data Layer: This layer stores data in structures that are directly consumable by our analytics and machine learning applications. Jobs take data from the lake, clean it, and transform it into standardised structures, creating Primary Data; other jobs merge incremental changes to maintain an updated state. Primary Data is further enriched to create Secondary or Tertiary Data. Jobs also create and save feature vectors in this layer, designed for reuse across multiple subsequent algorithms. For example, the content feature vector is used for section/topic classification; the same vector, enhanced over time with consumption information, was used for newsletter candidate selection and recommendation. A cleaning-job sketch follows this list.
  3. Processed Output Layer: Analytics and model outputs are stored in this layer, as are the trained models themselves, for subsequent use.
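
As a rough illustration of how the tiers fit together, the sketch below lays out hypothetical HDFS paths for the three layers and a small Spark job that promotes raw documents into Primary Data. The paths, schema, and cleaning rules are assumptions, not our real layout.

```python
# Illustrative three-tier layout on HDFS plus a raw-to-primary cleaning job.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

RAW = "hdfs:///data/raw"          # tier 1: data lake
PRIMARY = "hdfs:///data/primary"  # tier 2: cleaned / transformed / enriched
OUTPUT = "hdfs:///data/output"    # tier 3: model and analytics output

spark = SparkSession.builder.appName("promote-articles").getOrCreate()

raw = spark.read.json(f"{RAW}/articles")  # raw ingested documents

# Standardise into Primary Data: drop malformed rows, normalise fields,
# and keep a single record per article.
primary = (
    raw.filter(F.col("article_id").isNotNull())
       .withColumn("title", F.trim(F.col("title")))
       .withColumn("published_at", F.to_timestamp("published_at"))
       .dropDuplicates(["article_id"])
)

primary.write.mode("overwrite").parquet(f"{PRIMARY}/articles")
```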

System Architecture:

Every job or application we built catered to data ingestion, data processing, or output consumption. We therefore built a three-tier application layer for all our processing.

  • Data Ingestion Layer: This layer includes batch jobs to import data from RDBMS and document storage, for which we used Apache Sqoop. A separate set of jobs ingests data from Kafka message streams, for example user activity data: an Apache Netty based REST API server collects activity events and pushes them to Kafka, and Apache Flink jobs consume them from Kafka, generate basic statistics, and push the data on to HDFS. A streaming sketch follows this list.
  • Data Processing: We used Apache Spark jobs for all our processing, including cleaning, enrichment, feature vector building, ML models, and model output generation. The jobs are written in Java, Scala, and Python.
  • Result Consumption: Processed output was pushed to RDBMS as well as Redis for consumption, using jobs built on Spark or Sqoop. The output is further exposed by Spring Boot REST API endpoints, and the same results were also pushed out on event streams for downstream processing or consumption.
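
Our streaming ingestion used Flink, as described above; to keep all the examples in one language, the sketch below shows the equivalent shape of that job in Spark Structured Streaming, consuming activity events from Kafka and landing them on HDFS. The topic name, broker address, and paths are assumptions.

```python
# A stand-in for the Flink ingestion job: Kafka -> HDFS.
# (The real system used Apache Flink; this sketch uses Spark Structured
# Streaming so all examples share one language.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("activity-ingest").getOrCreate()

events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
         .option("subscribe", "user-activity")             # assumed topic
         .load()
         .select(F.col("value").cast("string").alias("event_json"),
                 F.col("timestamp"))
)

query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///data/raw/user_activity")
          .option("checkpointLocation", "hdfs:///checkpoints/user_activity")
          .trigger(processingTime="1 minute")
          .start()
)
query.awaitTermination()
```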

Team Setup:

This was the most crucial aspect for the success of the entire enterprise. Two needs had to be fulfilled:

  1. Skill in understanding and applying machine learning principles and tools.
  2. Knowledge of our domain: a deep understanding of the content types, the important aspects of content, how it matures and dies, and what affects it.

When it became known that we were planning ML-based products, a lot of people on our existing team wanted to be part of such an initiative, and it was important for us to cater to the aspirations of our existing team members too.

Moreover, the overall system design meant that there were two distinct parts to the problem: the core ML section, and the periphery, which was more like good old software engineering.

We decided to build our team from a combination of two sets of people:

  1. Data science experts, whom we hired. They were entrusted with the data science part of the puzzle, and they also taught the other team members and mentored their learning.
  2. A system development team, picked from our existing staff. They built the ingestion pipelines, stream processing engines, output consumption APIs, and so on.

By drawing people from our existing team, we were also able to get ingestion pipeline development going while we were still hiring the data science experts. Figuratively speaking, we were able to kick-start the work from day one.

As our experience illustrates, building a bunch of applications for training models and generating output is only a beginning. Building a system and a team to harness them is an entirely different proposition.

Translated from: https://medium.com/@bmallick/building-a-ml-system-for-production-667923c4389e
