Our Data Science Clearinghouse

The DNC Data Science team builds and manages dozens of models that support a broad range of campaign activities. Campaigns rely on these model scores to optimize contactability, volunteer recruitment, get-out-the-vote, and many other pieces of modern campaigning. One of our responsibilities is to deliver the best available model scores in an accessible, actionable form.

As part of Phoenix, the DNC’s data warehouse, we developed infrastructure that keeps our focus on delivering products to win elections instead of on ever-growing technical complexity. In this post, we’ll walk through the infrastructure that manages over 70 billion (and counting!) model scores for the country’s 200+ million registered voters.

The Challenge

At this point in the 2020 cycle, we’re managing about 20 different models. These models come from a mix of sources (our internal modeling infrastructure and multiple vendor syncs), and are a mix of regression, binary classification, and multi-class classification model types. Across those 20 models, we have around 80 distinct model versions that comprise more than 70 billion point estimates.

So, how do we get from the complexity of mixing so many model sources, model types, and versions-per-model to the clean, accessible set of scores our users can seamlessly integrate into their campaign programs?

A Clearinghouse for Model Scores

Our solution is a pair of carefully designed tables for model versions and model scores, and an accompanying code base to cleanly manage model and score life-cycles.

Together, these tables are a clearinghouse for model score publication. Just as a financial clearinghouse ensures a clean exchange between parties in a transaction, our model score clearinghouse sits between a model’s source data and its downstream pipelines to ensure a clean hand-off from one to the other.

Model Version Ledger

The first table of our clearinghouse, model_versions, keeps metadata on model versions. Vendor-sourced model version metadata is merged to this table as part of loading processes. Models we score in-house have their versions checked against and merged into this table as part of every scoring job. With many models and versions spread across our small data science team, we are thrilled to have this bookkeeping maintained programmatically.
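
The version check-and-merge described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: it uses Python's built-in sqlite3 in place of the data warehouse, and the column set, the `merge_model_version` helper, and the run IDs are all assumptions made for the example.

```python
import sqlite3

# Hypothetical, simplified ledger schema; the real model_versions table
# lives in the data warehouse and likely carries more metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE model_versions (
        model_name         TEXT NOT NULL,
        model_run_id       TEXT PRIMARY KEY,
        model_version_date TEXT NOT NULL,
        current_model_flag INTEGER NOT NULL DEFAULT 0
    )
    """
)


def merge_model_version(conn, model_name, model_run_id, model_version_date):
    """Register a model version if the ledger has not seen it yet."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            """
            INSERT OR IGNORE INTO model_versions
                (model_name, model_run_id, model_version_date)
            VALUES (?, ?, ?)
            """,
            (model_name, model_run_id, model_version_date),
        )


# Two scoring jobs for the same version register it only once.
merge_model_version(conn, "My Model", "run-abc", "2020-06-01")
merge_model_version(conn, "My Model", "run-abc", "2020-06-01")
merge_model_version(conn, "My Model", "run-def", "2020-08-15")

ledger_rows = conn.execute(
    "SELECT model_run_id, model_version_date "
    "FROM model_versions ORDER BY model_version_date"
).fetchall()
print(ledger_rows)
```

Because the merge is idempotent, every scoring job can call it unconditionally, which is what lets the bookkeeping happen without anyone remembering to do it.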

An Authoritative Table of Model Predictions

The second table, scores, holds, well, scores. As we load incoming model scores, we tag both the model version and scoring job that generated them. For multi-class classification, we store scores in a normalized form and note the predicted class label in the score_name column.
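
One plausible reading of "normalized form" is one row per person per class, with the class label stored in score_name. The sketch below illustrates that reading only; the `ScoreRow` type, the `to_score_rows` helper, and the class labels are hypothetical and do not come from the original post.

```python
from typing import Dict, List, NamedTuple


class ScoreRow(NamedTuple):
    external_id: str
    model_run_id: str    # tags the model version that produced the score
    scoring_run_id: str  # tags the scoring job that produced the score
    score_name: str      # for multi-class models: the predicted class label
    score_value: float


def to_score_rows(
    external_id: str,
    class_scores: Dict[str, float],
    model_run_id: str,
    scoring_run_id: str,
) -> List[ScoreRow]:
    """Turn one person's per-class scores into one row per class."""
    return [
        ScoreRow(external_id, model_run_id, scoring_run_id, label, value)
        for label, value in sorted(class_scores.items())
    ]


# Hypothetical three-class model output for a single person.
rows = to_score_rows(
    "voter-001",
    {"mail": 0.2, "early": 0.5, "election_day": 0.3},
    model_run_id="run-abc",
    scoring_run_id="job-1",
)
for row in rows:
    print(row)
```

Storing classes as rows rather than columns means regression, binary, and multi-class models can all share the same table shape.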

A key part of this table’s architecture is that it’s partitioned by the model version’s date. This keeps all of the scores for a given model version on the same partition, allowing us to query just the few gigabytes of data for the scores we need instead of scanning the entire multi-terabyte table.

Screenshot of scores table in database
Second table of our clearinghouse, scores.

Once the new scores have passed automated checks, their current_score_flag is flipped to TRUE and their datetime_approved field is set to the current timestamp. In the same step, previous scores of the same model_version that overlap with the new scores by external_id are flipped to FALSE and have their datetime_deprecated set to the current timestamp.
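
The approval step above could look something like the following sketch. It again uses sqlite3 as a stand-in for the warehouse; the trimmed-down schema, the `approve_scores` helper, and the sample run IDs are assumptions for illustration. Both updates run in one transaction so the flag never reflects a half-applied approval.

```python
import sqlite3

# Simplified, hypothetical version of the scores table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scores (
        external_id         TEXT,
        model_run_id        TEXT,
        scoring_run_id      TEXT,
        score_value         REAL,
        current_score_flag  INTEGER DEFAULT 0,
        datetime_approved   TEXT,
        datetime_deprecated TEXT
    );
    -- job-1 was approved earlier; job-2 has just been loaded and checked.
    INSERT INTO scores VALUES
        ('A', 'run-abc', 'job-1', 0.10, 1, '2020-06-01T00:00:00', NULL),
        ('B', 'run-abc', 'job-1', 0.20, 1, '2020-06-01T00:00:00', NULL),
        ('B', 'run-abc', 'job-2', 0.25, 0, NULL, NULL),
        ('C', 'run-abc', 'job-2', 0.90, 0, NULL, NULL);
    """
)


def approve_scores(conn, model_run_id, scoring_run_id, now):
    """Promote one scoring job's rows and deprecate the rows they supersede."""
    with conn:  # both updates commit together or not at all
        # Deprecate current scores that share an external_id with the new job.
        conn.execute(
            """
            UPDATE scores
            SET current_score_flag = 0, datetime_deprecated = ?
            WHERE model_run_id = ?
              AND scoring_run_id <> ?
              AND current_score_flag = 1
              AND external_id IN (
                  SELECT external_id FROM scores
                  WHERE model_run_id = ? AND scoring_run_id = ?
              )
            """,
            (now, model_run_id, scoring_run_id, model_run_id, scoring_run_id),
        )
        # Mark the newly checked scores as current.
        conn.execute(
            """
            UPDATE scores
            SET current_score_flag = 1, datetime_approved = ?
            WHERE model_run_id = ? AND scoring_run_id = ?
            """,
            (now, model_run_id, scoring_run_id),
        )


approve_scores(conn, "run-abc", "job-2", "2020-09-01T00:00:00")
current_rows = conn.execute(
    "SELECT external_id, scoring_run_id FROM scores "
    "WHERE current_score_flag = 1 ORDER BY external_id"
).fetchall()
print(current_rows)
```

Note that person A keeps the older score: only scores that overlap with the new job by external_id are deprecated.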

This operation makes the current_score_flag an authoritative marker for which scores are current within the model_run_id, a key assumption for when it’s time to materialize the scores for downstream use.

Finally: Simplified Operations on Model Versions

All of the bookkeeping in the tables above pays huge dividends when it comes time to query our model scores. This short script extracts the current scores for the current production model version of “My Model” with only the name of the model as an input!

DECLARE CURRENT_MODEL_VERSION_DATE DATE;
DECLARE CURRENT_MODEL_RUN_ID STRING;

-- check the model version ledger for the current model metadata
SET (CURRENT_MODEL_VERSION_DATE, CURRENT_MODEL_RUN_ID) = (
  SELECT AS STRUCT model_version_date, model_run_id
  FROM `modeling.model_versions`
  WHERE model_name = "My Model"
    AND current_model_flag IS TRUE
);

-- pull the model's current scores
SELECT external_id, score_name, score_value, datetime_approved
FROM `modeling.scores`
WHERE model_version_date = CURRENT_MODEL_VERSION_DATE
  AND model_run_id = CURRENT_MODEL_RUN_ID
  AND current_score_flag IS TRUE
;

How We Use It

Having a strong, consolidated schema for model versions and scores makes downstream use cases much cleaner and simpler. Here are a few examples of how this infrastructure is used in our work in the 2020 cycle:

Sourcing Publishing Pipelines

The most mission-critical use of this infrastructure is to serve the most current scores of each model to downstream pipelines in a consistent location and format. Our scoring and loading pipelines run a “materialize model” task that writes a query similar to the one above to a dataset holding one “current” table per model.

With our model and score version bookkeeping managed upstream, our downstream code can always find the model’s current scores in the same place. But more importantly, this approach insulates downstream processes from modeling issues: If a scoring job fails for some reason, or if newly loaded scores do not pass automated quality checks, the model version will not be re-materialized with the problematic scores.
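
A "materialize model" task along these lines might be sketched as below. The post does not show the real task, so everything here is an assumption: sqlite3 stands in for the warehouse, and the `materialize_model` helper and the `current_<model>` table naming are invented for the example.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE model_versions (
        model_name TEXT, model_run_id TEXT,
        model_version_date TEXT, current_model_flag INTEGER
    );
    CREATE TABLE scores (
        external_id TEXT, model_run_id TEXT, model_version_date TEXT,
        score_name TEXT, score_value REAL, current_score_flag INTEGER
    );
    INSERT INTO model_versions VALUES
        ('My Model', 'run-old', '2020-03-01', 0),
        ('My Model', 'run-new', '2020-06-01', 1);
    INSERT INTO scores VALUES
        ('A', 'run-old', '2020-03-01', 'score', 0.10, 1),
        ('A', 'run-new', '2020-06-01', 'score', 0.40, 0),
        ('A', 'run-new', '2020-06-01', 'score', 0.42, 1),
        ('B', 'run-new', '2020-06-01', 'score', 0.77, 1);
    """
)


def materialize_model(conn, model_name):
    """Rebuild the model's 'current' table from the clearinghouse tables."""
    # Hypothetical naming convention: "My Model" -> current_my_model.
    slug = re.sub(r"[^a-z0-9]+", "_", model_name.lower()).strip("_")
    table = f"current_{slug}"

    version = conn.execute(
        "SELECT model_version_date, model_run_id FROM model_versions "
        "WHERE model_name = ? AND current_model_flag = 1",
        (model_name,),
    ).fetchone()
    if version is None:
        raise LookupError(f"no production version for {model_name!r}")

    with conn:  # swap the table contents in a single transaction
        conn.execute(f"DROP TABLE IF EXISTS {table}")
        conn.execute(
            f"CREATE TABLE {table} "
            "(external_id TEXT, score_name TEXT, score_value REAL)"
        )
        conn.execute(
            f"INSERT INTO {table} "
            "SELECT external_id, score_name, score_value FROM scores "
            "WHERE model_version_date = ? AND model_run_id = ? "
            "AND current_score_flag = 1",
            version,
        )
    return table


table_name = materialize_model(conn, "My Model")
materialized = conn.execute(
    f"SELECT * FROM {table_name} ORDER BY external_id"
).fetchall()
print(table_name, materialized)
```

In this sketch the task reads only rows flagged as current, so scores that never pass their checks, and therefore never get flagged, cannot reach the downstream table.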

Flowchart showing score clearinghouse

Breadcrumbs for a Model's Life Cycle

With Election Day right around the corner, we have to respond quickly and confidently to potential problems with our models. We maintain a complete picture of a model and its scores by linking the model_run_id and scoring_run_id fields to the metadata and artifacts we store in an MLflow tracking instance.

This allows us to trace a published score back to its scoring job, its training job, and, through our other metadata systems, the exact training data that built the underlying model. This piece is critical for diagnosing issues and anomalies users encounter in the field.

It’s also a gift to our future selves: when it’s time to revisit our models and improve their future versions, we’ll be working with a complete understanding of their inputs and outputs.

Model Reversions

Sometimes, we detect a problem with a model after it’s been published. When this happens, we can adjust the current_model_flag in the model_versions table to toggle a model version out of production, or simply suppress scores from a problematic scoring job. After adjusting the metadata in the clearinghouse, we can then re-materialize the model for downstream pipelines and address the lingering issues.
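
Toggling a version out of production amounts to a small metadata update. The sketch below shows one way it might be done, assuming a simplified ledger in sqlite3; the `set_production_version` helper and the run IDs are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE model_versions (
        model_name TEXT, model_run_id TEXT,
        model_version_date TEXT, current_model_flag INTEGER
    );
    INSERT INTO model_versions VALUES
        ('My Model',    'run-old', '2020-03-01', 0),
        ('My Model',    'run-new', '2020-06-01', 1),
        ('Other Model', 'run-xyz', '2020-05-01', 1);
    """
)


def set_production_version(conn, model_name, model_run_id):
    """Make exactly one version of a model the production version."""
    known = conn.execute(
        "SELECT 1 FROM model_versions "
        "WHERE model_name = ? AND model_run_id = ?",
        (model_name, model_run_id),
    ).fetchone()
    if known is None:
        raise LookupError(f"{model_run_id!r} is not a version of {model_name!r}")

    with conn:
        # Flag the chosen version and unflag every other version of the model.
        conn.execute(
            """
            UPDATE model_versions
            SET current_model_flag =
                CASE WHEN model_run_id = ? THEN 1 ELSE 0 END
            WHERE model_name = ?
            """,
            (model_run_id, model_name),
        )


# Revert "My Model" to its previous version after a problem is found.
set_production_version(conn, "My Model", "run-old")
production = dict(
    conn.execute(
        "SELECT model_name, model_run_id FROM model_versions "
        "WHERE current_model_flag = 1"
    ).fetchall()
)
print(production)
```

The older version's scores are still in the scores table with their flags intact, so re-materializing after this change is enough to restore them downstream.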

Point-In-Time Snapshots

Political data folks are fanatics for historical analysis, and model estimates can be a key part of that. With the datetime_approved and datetime_deprecated fields, we can reconstruct a model version’s ‘current’ scores from any point in time just as we materialize ‘current’ tables. This can give us a quick view into a model’s predictions over time, or even a snapshot of predictions as of a past Election Day.
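
A point-in-time reconstruction can be sketched as a filter on the two timestamp fields: a score was current at time t if it had been approved by t and not yet deprecated. The example below is an illustration under assumptions, using sqlite3, ISO-8601 timestamp strings, and a hypothetical `scores_as_of` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scores (
        external_id TEXT, model_run_id TEXT, score_value REAL,
        datetime_approved TEXT, datetime_deprecated TEXT
    );
    INSERT INTO scores VALUES
        ('A', 'run-abc', 0.10, '2020-01-01T00:00:00', '2020-06-01T00:00:00'),
        ('A', 'run-abc', 0.42, '2020-06-01T00:00:00', NULL),
        ('B', 'run-abc', 0.77, '2020-06-01T00:00:00', NULL);
    """
)


def scores_as_of(conn, model_run_id, as_of):
    """Return the scores that were current for a model version at `as_of`."""
    # ISO-8601 strings in one consistent format sort chronologically,
    # so plain string comparison is enough here.
    return conn.execute(
        """
        SELECT external_id, score_value
        FROM scores
        WHERE model_run_id = ?
          AND datetime_approved <= ?
          AND (datetime_deprecated IS NULL OR datetime_deprecated > ?)
        ORDER BY external_id
        """,
        (model_run_id, as_of, as_of),
    ).fetchall()


spring_snapshot = scores_as_of(conn, "run-abc", "2020-03-15T00:00:00")
election_day_snapshot = scores_as_of(conn, "run-abc", "2020-11-03T00:00:00")
print(spring_snapshot)
print(election_day_snapshot)
```

Since deprecated rows are kept rather than deleted, no separate snapshot tables are needed to answer historical questions.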

Conclusion

Our investment in this infrastructure has allowed us to develop and deliver data science products that are more transparent and sustainable than ever before, and we’ll continue building on this infrastructure for the 2022 election cycle and beyond.

Interested in making a difference? Join our team.

Translated from: https://medium.com/democratictech/our-data-science-clearinghouse-e9f12fd4a86
