Are Stories with Numbers in Headlines More Successful?

Statistics

I have read a few stories on Medium offering writing advice, and some of them suggested, among other tips, that putting numbers in your story’s title will increase the number of views: people tend to be more attracted by such headlines, so more of them will click on your story.

It is interesting that people would be attracted by such headlines. But I don’t like to take things for granted; I want to convince myself that this claim is actually true.

So, here is what I have been thinking: let’s use statistics to check whether this claim is actually true. But statistics is useless without data, so I first need to obtain some data about Medium articles and use it for hypothesis testing. To that end, I used Python and Beautiful Soup to scrape data about a random set of 6K+ Medium articles from 7 different publications. This dataset can be found on Kaggle, and I describe how I scraped it in a separate article.

What we are going to do now is split this dataset into 2 groups (or samples): one with numbers in headlines and one without. Then, we will do a hypothesis test on the expected number of claps in these 2 groups. We use the number of claps as a measure of how “successful” a story is, although a more logical variable for our scenario would be the number of views, as it is more directly affected by our choice of title. People typically click on a story because of the preview they see (including headline and image), and only after reading the story do they decide whether to clap. But because the number of views is not publicly shown on Medium, we use the number of claps, which should be highly correlated with views (the more views a story gets, the more likely it is that someone claps).

If you are not familiar with hypothesis testing, it is worth reading an introduction to it before continuing.

That being said, we will consider the following model:


Sample 1: Articles with numbers in headlines


We will model the number of claps inside this group as n i.i.d. (independent and identically distributed) random variables: X₁, X₂, …, Xₙ with expected value µ₁ and variance σ₁², both of which are finite.


Sample 2: Articles without numbers in headlines


We will model the number of claps inside this group as m i.i.d. random variables: Y₁, Y₂, …, Yₘ with expected value µ₂ and variance σ₂², both of which are finite.


We formulate the null hypothesis as “articles with numbers in headlines bring no improvement over articles that have no numbers in headlines”, and the alternative hypothesis as “articles with numbers in headlines are more successful compared to articles without numbers in headlines”.


Mathematically this means:

H₀: µ₁ - µ₂ ≤ 0   versus   H₁: µ₁ - µ₂ > 0

We will consider the following test statistic:

Z = (X̄ₙ - Ȳₘ - (µ₁ - µ₂)) / √(σ₁²/n + σ₂²/m)

where X̄ₙ and Ȳₘ are the averages of sample 1 and sample 2, respectively.

Because the sample sizes are large, the Central Limit Theorem tells us that the probability distribution of our test statistic Z is well approximated by a standard normal distribution, and the variances estimated from our data should be very close to the true variances σ₁², σ₂². So, when we compute the test statistic, we can simply substitute the estimated variances for σ₁², σ₂².
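To get a feel for this approximation, here is a small simulation sketch. It is not part of the original analysis: the exponential distribution, the sample sizes, the number of repetitions and the seed are all assumptions, chosen only to mimic skewed, clap-like data in which both groups share the same mean.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
n, m, reps = 500, 800, 2000      # assumed sample sizes and number of repetitions

# Skewed data with equal means, so that mu1 - mu2 = 0 holds by construction
x = rng.exponential(scale=100.0, size=(reps, n))
y = rng.exponential(scale=100.0, size=(reps, m))

# Test statistic with the estimated variances plugged in for the true ones
z = (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(
    x.var(axis=1, ddof=1) / n + y.var(axis=1, ddof=1) / m
)

print("mean of Z:", z.mean())                  # should be close to 0
print("std of Z:", z.std())                    # should be close to 1
print("share above 1.645:", (z > 1.645).mean())  # should be near 0.05
```

Real clap counts can be much more heavy-tailed than an exponential distribution, so this only illustrates the mechanism and does not validate the approximation for the actual dataset.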

But what about µ₁ - µ₂? Assuming H₀ is true, it follows that µ₁ - µ₂ ≤ 0. We choose µ₁ - µ₂ = 0 because, among all the values allowed by H₀, this one gives the largest probability of a type I error, and we don’t want to underestimate that error.
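The worst-case argument can be checked numerically. The sketch below is an illustration under assumptions: the significance level of 0.05 and the three standardized differences are hypothetical values, not numbers from the article. If the statistic is computed with µ₁ - µ₂ = 0 plugged in while the true standardized difference is some `shift` ≤ 0, the statistic is approximately normal with mean `shift` and unit variance.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

alpha = 0.05                        # illustrative significance level (assumed)
c = std_normal.inv_cdf(1 - alpha)   # reject H0 when the statistic exceeds c

# Hypothetical true values of (mu1 - mu2) / standard error, all allowed by H0
shifts = [0.0, -0.5, -1.0]

# P(reject H0) = P(N(shift, 1) > c) = P(N(0, 1) > c - shift)
rates = [1 - std_normal.cdf(c - shift) for shift in shifts]

for shift, rate in zip(shifts, rates):
    print(f"shift = {shift:+.1f}  ->  P(type I error) = {rate:.4f}")
```

The rejection probability is largest at a shift of 0, where it equals α, and it shrinks as the true difference moves further below zero. Controlling the error at the boundary therefore controls it for every value allowed by H₀.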

Now, let’s run some Python code. We start by importing the required packages and defining a utility function: like(x, pattern). This function is used to match regular expressions in pandas data frames; x is the column, and pattern is a regular expression. I named this function after SQL’s LIKE operator, as it is meant to do something similar for pandas data frames.
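A minimal sketch of such a helper follows. Only the name like(x, pattern) and its purpose come from the text; the body, the use of a full-string match and the toy titles are assumptions, and the author's exact implementation may differ.

```python
import re

import numpy as np
import pandas as pd


def like(x, pattern):
    """Return a boolean array that is True where the whole string matches `pattern`."""
    regex = re.compile(pattern)
    # np.vectorize applies the Python-level check to every element of x
    matcher = np.vectorize(lambda value: bool(regex.fullmatch(value)), otypes=[bool])
    return matcher(x)


# Hypothetical titles, used only to show the behaviour
titles = pd.Series(["5 Tips for Better Sleep", "Why I Write", "Top 10 Python Tricks"])
mask = like(titles, r".*\d.*")   # True for titles containing at least one digit
print(mask)
```

pandas also offers a built-in alternative, `titles.str.contains(r"\d", regex=True)`, which returns a boolean Series rather than a NumPy array.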

After that, we read the CSV file into a pandas data frame.
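Reading the data could look like the sketch below. The file name in the comment is an assumption, and the in-memory CSV with made-up rows stands in for the Kaggle file so that the snippet runs on its own; only the column names "title" and "claps" come from the text.

```python
import io

import pandas as pd

# Stand-in for the real file. With the downloaded dataset this would be
# something like pd.read_csv("medium_data.csv")  (file name is an assumption).
csv_text = io.StringIO(
    "title,claps\n"
    "5 Tips for Better Sleep,120\n"
    "Why I Write,45\n"
    "Top 10 Python Tricks,300\n"
    "A Short Story About Data,60\n"
)

df = pd.read_csv(csv_text)
print(df.shape)
print(df.head())
```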

We make sure we don’t have missing values in the “title” or “claps” columns.
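One way to run this check is sketched below on a hypothetical data frame that deliberately contains one missing title. The drop step is an assumption about how one might handle missing rows; the text only says that the columns are checked.

```python
import pandas as pd

# Hypothetical rows; the second title is missing on purpose
df = pd.DataFrame(
    {
        "title": ["5 Tips for Better Sleep", None, "Why I Write", "Top 10 Python Tricks"],
        "claps": [120, 30, 45, 300],
    }
)

# Count missing values per column of interest
missing = df[["title", "claps"]].isna().sum()
print(missing)

# If anything is missing, drop those rows before testing (assumed handling)
clean = df.dropna(subset=["title", "claps"])
print(len(clean))
```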

Then, we create 2 new data frames (numbers/no-numbers) using the like() function defined earlier.
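The split can be sketched as follows. The regular expression is an assumption: it treats any digit in the title as "a number", so spelled-out numbers such as "Five" are not detected. The helper is repeated so the snippet is self-contained, and the rows are made up.

```python
import re

import numpy as np
import pandas as pd


def like(x, pattern):
    """Boolean array: True where the whole string matches `pattern` (sketch)."""
    regex = re.compile(pattern)
    return np.vectorize(lambda v: bool(regex.fullmatch(v)), otypes=[bool])(x)


# Hypothetical data in the same shape as the dataset ("title", "claps")
df = pd.DataFrame(
    {
        "title": ["5 Tips for Better Sleep", "Why I Write",
                  "Top 10 Python Tricks", "A Short Story About Data"],
        "claps": [120, 45, 300, 60],
    }
)

has_number = like(df["title"], r".*\d.*")   # assumed pattern: at least one digit

numbers = df[has_number]        # sample 1: titles with numbers
no_numbers = df[~has_number]    # sample 2: titles without numbers

print(numbers)
print(no_numbers)
```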

[Screenshots of the two new data frames appeared here in the original post.]

After that, we compute the quantities that we need for the test statistic.
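These quantities are the two sample sizes, the two sample means and the two estimated variances. A minimal sketch on made-up clap counts (the variable names are assumptions):

```python
import pandas as pd

# Hypothetical clap counts for the two samples
numbers = pd.DataFrame({"claps": [120, 300, 80, 500]})
no_numbers = pd.DataFrame({"claps": [45, 60, 150, 20, 75]})

n = len(numbers)       # size of sample 1
m = len(no_numbers)    # size of sample 2

mean_x = numbers["claps"].mean()      # sample average of sample 1
mean_y = no_numbers["claps"].mean()   # sample average of sample 2

# Series.var() uses ddof=1 by default, i.e. the unbiased sample variance
var_x = numbers["claps"].var()
var_y = no_numbers["claps"].var()

print(n, m, mean_x, mean_y, var_x, var_y)
```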

Now, we compute the test statistic and the p-value. In our case, because we’re doing a one-sided test, the p-value is the area to the right of our test statistic under a standard Gaussian curve.
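A sketch of this computation is given below. The function name and the summary statistics are hypothetical; in particular, the numbers are not the values from the article's dataset, so the printed p-value is not the article's result. The right-tail area is computed with the standard library's `math.erfc`.

```python
import math


def one_sided_z_test(mean_x, mean_y, var_x, var_y, n, m):
    """Return (z, p_value) for H0: mu1 - mu2 <= 0 against H1: mu1 - mu2 > 0."""
    # At the boundary of H0 we substitute mu1 - mu2 = 0
    standard_error = math.sqrt(var_x / n + var_y / m)
    z = (mean_x - mean_y) / standard_error
    # Area to the right of z under a standard normal curve
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical summary statistics, NOT the values from the article's dataset
z, p_value = one_sided_z_test(
    mean_x=520.0, mean_y=450.0, var_x=1_000_000.0, var_y=810_000.0, n=2500, m=3600
)
print(f"z = {z:.3f}, p-value = {p_value:.5f}")
```

If SciPy is available, `scipy.stats.norm.sf(z)` gives the same right-tail area.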

We get a p-value much smaller than the usual threshold of 0.05. That’s good news: we can reject the null hypothesis with confidence.

For a significance level of α = 0.001, we have p ≈ 0.0009 < α, and therefore we reject the null hypothesis in favor of the alternative. In plain English, this means that, at the 99.9% confidence level, the data support the claim that stories with numbers in their headlines are expected to have more claps than stories without numbers in headlines.

You can find the Jupyter notebook on Kaggle.


Thanks for reading!


Translated from: https://medium.com/towards-artificial-intelligence/are-stories-with-numbers-in-headlines-more-successful-b925cae2f6b4
