Sharing an article on Singular Value Decomposition [Eng]

Original article: http://www.igvita.com/2007/01/15/svd-recommendation-system-in-ruby/

One day, a bunch of friends, who happened to be big Family Guy fans, decided to put together a site to rank and share their thoughts on the show. Soon thereafter they had a Rails site up and running, all was well, and other fans joined in hordes. A web 2.0 success! Then one day they realized that they could no longer track everyone's ratings because their user base was too large, and so it occurred to one of the developers: "Wouldn't it be cool if we could use the collective knowledge of our whole community to recommend and rank episodes for each user individually?"

Sounds familiar, right? In fact, recommendation systems are a billion-dollar industry, and growing. In academic jargon this problem is known as Collaborative Filtering, and a lot of ink has been spilled on the matter. Netflix, for one, announced a 1 million dollar competition last year for a system that beats their algorithm by 10%. It goes without saying that a lot of different systems have been proposed and explored in theory and practice. However, one of the most successful and widely used approaches to this day also happens to be one of the simplest: Singular Value Decomposition (SVD), also affectionately referred to in the literature as LSI (Latent Semantic Indexing), dimensionality reduction, or projection.

Linear Algebra Refresher

SVD methods are a direct consequence of a theorem in linear algebra:

Any MxN matrix A whose number of rows M is greater than or equal to its number of columns N can be written as the product of an MxM column-orthogonal matrix U, an MxN diagonal matrix W with positive or zero elements (the singular values), and the transpose of an NxN orthogonal matrix V.

More intuitively, assume that we have a matrix where every column represents a user, and every row represents a product (or a Family Guy season, in our case). Thus, with M products and N users, we are looking at an MxN matrix. The theorem simply states that we can decompose such a matrix into three components: (MxM) call it U, (MxN) call it S, and (NxN) call it V. More importantly, we can use this decomposition to approximate the original MxN matrix. By taking the first k singular values of the matrix S (the k largest), together with the matching columns of U and V, we can effectively obtain a compressed representation of the data. So why do we care? (Mathies click here, we'll wait.)
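In symbols, the decomposition and its rank-k truncation can be written as below. The subscript-k notation is ours, added for illustration: it means keeping the first k columns of U and V and the top-left k x k block of S, with the singular values ordered from largest to smallest.

```latex
A_{M \times N} = U_{M \times M} \, S_{M \times N} \, V_{N \times N}^{T}

A \approx A_k = U_k \, S_k \, V_k^{T},
\qquad U_k \in \mathbb{R}^{M \times k},
\quad S_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),
\quad V_k \in \mathbb{R}^{N \times k},
\quad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0
```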

Machine Learning & Information Retrieval

One of the most fundamental, and fun, properties of Machine Learning is its close connection to the concept of data compression: if we can identify significant concepts (clusters of users, for example) then we can represent a large dataset with fewer bits. However, this logic also works in reverse! If we can represent our data with fewer bits (compress our data), then we have identified 'significant' concepts! I bet you see where we're headed: SVD allows us to compress a large matrix by approximating it in a smaller-dimensional space.

SVD has found wide application in the field of Information Retrieval (IR), where this process is often referred to as Latent Semantic Indexing (LSI). In these applications the columns of the matrix are the documents, and the rows are the individual words. Running SVD allows us to collapse this matrix into a smaller-dimensional space where highly correlated items (for example, words that often occur together) are captured as a single feature. Essentially, we are discarding the noise, and keeping the signal. In practice, the IR guys usually collapse their ginormous matrices to 100, 200, or 300 dimensions (from the original 10000+) and then perform similarity calculations. In case you're curious, this same method has also found many uses in image compression and computer vision applications.

Dimensionality Reduction

Back to our Family Guy developers. For the sake of brevity we will use a very simple example with only 4 users and 6 seasons (a 6x4 rating matrix with one row per season and one column per user). Cranking this matrix through the SVD yields three different components: matrix U (6x6), matrix S (6x4), matrix V (4x4). Now, we will collapse this matrix from a (6x4) space into a 2-Dimensional one. To do this, we simply take the first two columns of U and V, and the top-left 2x2 block of S. The end result is a 6x2 U, a 2x2 S, and a 4x2 V.
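For readers who want to reproduce the reduction without a linear algebra library, here is a minimal sketch in plain Ruby. It swaps in power iteration with deflation on the Gram matrix (A transposed times A), which is not the routine the article's own code relies on, and every name in it (RATINGS, top_k_svd, and so on) is an assumption made for illustration. The signs of the resulting columns may differ from another SVD routine's output; the cosine similarities used later are not affected by that.

```ruby
# Rank-2 SVD of the season-by-user rating matrix via power iteration.
# Plain Ruby arrays only; all helper names here are illustrative.

RATINGS = [
  # Ben, Tom, John, Fred
  [5, 5, 0, 5], # season 1
  [5, 0, 3, 4], # season 2
  [3, 4, 0, 3], # season 3
  [0, 0, 5, 3], # season 4
  [5, 4, 4, 5], # season 5
  [5, 4, 5, 5]  # season 6
].freeze

def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

def mat_vec(m, v)
  m.map { |row| dot(row, v) }
end

def normalize(v)
  n = Math.sqrt(dot(v, v))
  v.map { |x| x / n }
end

# Returns [singular_values, u_columns, v_columns] for the top k components.
# u_columns and v_columns are arrays of column vectors.
def top_k_svd(a, k, iterations: 500)
  at = a.transpose
  # Gram matrix A^T A (users x users); its eigenvectors are the columns of V.
  gram = at.map { |ci| at.map { |cj| dot(ci, cj).to_f } }
  sigmas, us, vs = [], [], []
  k.times do
    v = normalize(Array.new(gram.size, 1.0))
    iterations.times { v = normalize(mat_vec(gram, v)) }
    eigenvalue = dot(v, mat_vec(gram, v)) # Rayleigh quotient, equals sigma squared
    sigma = Math.sqrt(eigenvalue)
    sigmas << sigma
    vs << v
    us << mat_vec(a, v).map { |x| x / sigma } # u = A v / sigma
    # Deflate: subtract the component just found so the next pass finds the next one.
    gram = gram.each_with_index.map do |row, i|
      row.each_with_index.map { |x, j| x - eigenvalue * v[i] * v[j] }
    end
  end
  [sigmas, us, vs]
end

SIGMAS, U_COLS, V_COLS = top_k_svd(RATINGS, 2)
puts "singular values: #{SIGMAS.map { |s| s.round(4) }.inspect}"
```

U_COLS holds two columns of six season coordinates and V_COLS holds two columns of four user coordinates, which are the x and y values for the plot.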

Now, because we are working with a 2-Dimensional space, we can plot our results (below). We can treat the first column of U as x, and the second column as y: these are the seasons. The same process is repeated for matrix V: these are the users.

Do you see what happened? Because we are working with a small example it's hard to call two users a 'cluster' but you will nonetheless notice that Ben and Fred are located very close to each other - now compare their respective ratings in our original matrix. Very cool, huh! Same pattern re-occurs for Seasons 5 and 6. Our dimensionality reduction technique effectively captured the fact that Ben and Fred seem to have similar taste - we're halfway there!

Finding Similar Users

Next, Bob joins the site and shares with us a few of his season ratings ([5,5,0,0,0,5] for seasons 1-6) - it's our goal to give him a recommendation based on this data. Intuitively, we want to find users similar to Bob, thus if we can 'embed' Bob into our 2-Dimensional space and look where he is located, we will be able to answer this question. To do this, we perform the following calculation:
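Reconstructed from the Ruby code later in this article (bob * u2 * eig2.inverse), the calculation reads as follows. The numeric result on the second line is our own evaluation, rounded to four decimals; each coordinate may come out with the opposite sign depending on the sign convention of the SVD routine.

```latex
\mathrm{Bob}_{2D} = \mathrm{Bob}^{T} \, U_2 \, S_2^{-1}

\mathrm{Bob}_{2D} = [5,\, 5,\, 0,\, 0,\, 0,\, 5] \; U_2 \, S_2^{-1} \approx [-0.3775,\; -0.0802]
```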

First line is the general formula to project a new user into our space - I won't motivate the math behind it, but if you're interested, check the document I referenced in the Linear Algebra Refresher section. The important result is that we have the x, and y coordinates for Bob. Let's add them to our earlier graph:

The green triangle represents Bob. It's not immediately evident which user is closest, but if we extend Bob's vector from the origin (the green line), we can see that it points in nearly the same direction as Ben's and Fred's vectors. A common way to judge similarity between any two vectors is to look at the angle separating them: cosine similarity. From our graph we can intuitively tell that the angle between Ben and Bob is smaller than the one between Fred and Bob. To confirm this, let's iterate over all users and compute their cosine similarities with Bob. Furthermore, let's discard anyone whose similarity is less than 0.90 (outside of the shaded region). We get: Ben (0.987), Fred (0.955). Hence, we conclude that Ben and Bob have the most similar tastes, though Fred is pretty close also!
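The similarity step on its own can be sketched in a few lines of plain Ruby. The 2-D coordinates below are approximate values (rounded to four decimals) that we obtained by running the decomposition on the rating matrix; treat them, and the names USERS_2D, BOB_2D and cosine_similarity, as assumptions of this sketch rather than output quoted from the article.

```ruby
# Cosine similarity between Bob and each existing user in the 2-D space.
# Coordinates are approximate and rounded; signs depend on the SVD routine.

USERS_2D = {
  "Ben"  => [-0.5710, -0.2228],
  "Tom"  => [-0.4275, -0.5172],
  "John" => [-0.3846,  0.8246],
  "Fred" => [-0.5859,  0.0532]
}.freeze
BOB_2D = [-0.3775, -0.0802].freeze
CUTOFF = 0.90

# cos(theta) = (a . b) / (|a| * |b|)
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.(a) * norm.(b))
end

# Keep users at or above the cutoff, most similar first.
SIMILAR_USERS = USERS_2D
  .map { |name, coords| [name, cosine_similarity(BOB_2D, coords)] }
  .select { |_, sim| sim >= CUTOFF }
  .sort_by { |_, sim| -sim }

SIMILAR_USERS.each { |name, sim| printf("%s (Similarity: %0.3f)\n", name, sim) }
```

Because cosine similarity only looks at direction, a user who rates everything a little lower than Bob but in the same pattern still counts as similar.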

What happens now is up to you. Here is one very simple strategy: find the most similar user and compare his/her items against those of the new user; take the items that the similar user has rated and the new user has not, and return them in decreasing order of rating. Thus, Ben rated every season except 4, and Bob rated seasons 1, 2 and 6. We take the set difference ([1,2,3,5,6] - [1,2,6] = [3,5]), which gives the seasons Ben rated but Bob hasn't seen, and return them in decreasing order of Ben's ratings: Season 5 (5 stars), Season 3 (3 stars).
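The strategy above can be sketched directly with Ruby's array difference. The ratings are Ben's column and Bob's vector from the example; the names rated_seasons and recommend are illustrative assumptions.

```ruby
# Recommend seasons the most similar user has rated but the new user has not.
# A rating of 0 means "not rated", as in the article's matrix.

BEN_RATINGS = [5, 5, 3, 0, 5, 5].freeze # seasons 1-6, Ben's column
BOB_RATINGS = [5, 5, 0, 0, 0, 5].freeze # seasons 1-6, Bob's ratings

# Season numbers (1-based) that carry a non-zero rating.
def rated_seasons(ratings)
  ratings.each_index.select { |i| ratings[i] != 0 }.map { |i| i + 1 }
end

# Returns [season, rating] pairs, highest rating first.
def recommend(neighbour, newcomer)
  unseen = rated_seasons(neighbour) - rated_seasons(newcomer)
  unseen.map { |season| [season, neighbour[season - 1]] }
        .sort_by { |_, rating| -rating }
end

RECOMMENDATIONS = recommend(BEN_RATINGS, BOB_RATINGS)
RECOMMENDATIONS.each { |season, rating| puts "Season #{season} (#{rating} stars)" }
```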

Will you just give me the code already?

For the brave ones that made it to here, below is the equivalent of what we just did on paper.. in Ruby. First, install the linalg library, and now you're ready to roll:

require 'linalg'

users = { 1 => "Ben", 2 => "Tom", 3 => "John", 4 => "Fred" }
m = Linalg::DMatrix[
  # Ben, Tom, John, Fred
  [5,5,0,5], # season 1
  [5,0,3,4], # season 2
  [3,4,0,3], # season 3
  [0,0,5,3], # season 4
  [5,4,4,5], # season 5
  [5,4,5,5]  # season 6
]

# Compute the SVD Decomposition
u, s, vt = m.singular_value_decomposition
vt = vt.transpose

# Take the 2-rank approximation of the Matrix
#   - Take first and second columns of u  (6x2)
#   - Take first and second columns of vt (4x2)
#   - Take the first two singular values (2x2)
u2 = Linalg::DMatrix.join_columns [u.column(0), u.column(1)]
v2 = Linalg::DMatrix.join_columns [vt.column(0), vt.column(1)]
eig2 = Linalg::DMatrix.columns [s.column(0).to_a.flatten[0,2], s.column(1).to_a.flatten[0,2]]

# Here comes Bob, our new user
bob = Linalg::DMatrix[[5,5,0,0,0,5]]
bobEmbed = bob * u2 * eig2.inverse

# Compute the cosine similarity between Bob and every other User in our 2-D space
user_sim, count = {}, 1
v2.rows.each { |x|
  user_sim[count] = (bobEmbed.transpose.dot(x.transpose)) / (x.norm * bobEmbed.norm)
  count += 1
}

# Remove all users who fall below the 0.90 cosine similarity cutoff and sort by similarity
similar_users = user_sim.delete_if {|k,sim| sim < 0.9 }.sort {|a,b| b[1] <=> a[1] }
similar_users.each { |u| printf "%s (ID: %d, Similarity: %0.3f) \n", users[u[0]], u[0], u[1]  }

# We'll use a simple strategy in this case:
#   1) Select the most similar user
#   2) Compare all items rated by this user against your own and select items that you have not yet rated
#   3) Return the ratings for items I have not yet seen, but the most similar user has rated
similarUsersItems = m.column(similar_users[0][0]-1).transpose.to_a.flatten
myItems = bob.transpose.to_a.flatten

not_seen_yet = {}
myItems.each_index { |i|
  not_seen_yet[i+1] = similarUsersItems[i] if myItems[i] == 0 and similarUsersItems[i] != 0
}

printf "\n %s recommends: \n", users[similar_users[0][0]]
not_seen_yet.sort {|a,b| b[1] <=> a[1] }.each { |item|
  printf "\tSeason %d .. I gave it a rating of %d \n", item[0], item[1]
}

print "We've seen all the same seasons, bugger!" if not_seen_yet.size == 0
svd-recommender-gsl.rb - Ruby/GSL version, courtesy of Joshua Bassett

Running our algorithm produces:

Ben (ID: 1, Similarity: 0.987)
Fred (ID: 4, Similarity: 0.955)

 Ben recommends:
	Season 5 .. I gave it a rating of 5
	Season 3 .. I gave it a rating of 3

That's it! A 50-line SVD recommendation / collaborative filtering system for a Rails app, with the help of some simple linear algebra.

Reposted from: https://www.cnblogs.com/taylorwesley/archive/2013/04/20/3031984.html
