Lecture 1 Introduction
Not on the exam!
“Data Mining is the study of collecting, processing, analyzing, and gaining useful insights from data”
EXPLORATORY ANALYSIS
Make measurements to understand what the data looks like – one of the first steps when collecting data.
– Metrics: deciding what to measure is important
– How do we compute similarity?
– How do we group similar users? Clustering
-- Similar users like items similarly: Recommendation systems
Triadic closure principle: Links are created in a way that usually closes a triangle
– If both Bob and Charlie know Alice, then they are likely to meet at some point.
predictions
types:
– Predicting a real value (e.g. number of clicks): Regression
– Predicting a YES/NO value (e.g., will the user click?): Binary classification
– Predicting over multiple classes (e.g., what is the topic of a post): Classification
Lecture 2 What is data and big data
Data
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
• Examples: name, date of birth, height, occupation.
• Attribute is also known as variable, field, characteristic, or feature
For each object the attributes take some values.
The collection of attribute-value pairs describes a specific object
• Object is also known as record, point, case, sample, entity, or instance
Attribute types
Numeric
• Examples: dates, temperature, time, length, value, count.
• Discrete (counts) vs Continuous (temperature)
• Special case: Binary/Boolean attributes (yes/no, exists/not exists)
Categorical
• Examples: eye color, zip codes, strings, rankings (e.g, good, fair, bad), height in {tall, medium, short}
• Nominal (no order or comparison) vs Ordinal (values are ordered, but the differences between them are not meaningful)
Relational data
• The term comes from DataBases, where we assume data is stored in a relational table with a fixed schema (fixed set of attributes)
In Databases, it is usually assumed that the table is dense (few null values)
• There are a lot of data in this form
• There are also a lot of data which do not fit well in this form
Sparse data: Many missing values
Not easy to define a fixed schema
Numeric Relational Data
• If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points/vectors in a multi-dimensional space, where each dimension represents a distinct attribute
• Such data set can be represented by an n-by-d data matrix, where there are n rows, one for each object, and d columns, one for each attribute
Numeric Data
• Thinking of numeric data as points or vectors is very convenient
- For small dimensions we can plot the data
- We can use geometric analogues to define concepts like distance or similarity
- We can use linear algebra to process the data matrix
Categorical Relational Data
consists of a fixed set of categorical attributes
Mixed Relational Data
consists of a fixed set of both numeric and categorical attributes
Note:
Zip codes are actually categorical here.
Boolean attributes can be treated as either numeric or categorical; sometimes it is convenient to represent categorical attributes as Boolean.
Sometimes it is convenient to represent numerical attributes as categorical.
Binning/Bucketization**
• Idea: split the range of the domain of the numerical attribute into bins (intervals).
• Every bucket defines a categorical value
decide number of bins?
• Depends on the granularity of the data that we want
• Sometimes domain knowledge is also used
decide the size of the bucket?
Depends on the data and our application
• Equi-width bins: all bins have the same size; the value range is split into intervals of equal width
Example: split time into decades
Problem: some bins may be very sparse or empty
• Equi-size/depth/frequency bins: select the bins so that they all contain the same number of elements (each bucket holds the same number of data points)
This splits data into quantiles: top-10%, second 10% etc
Problem: Some bins may be very small
• Optimized bins: use a 1-dimensional clustering algorithm to create bins that fit the data distribution better, reducing the number of empty or very sparse buckets
e.g., data: [2, 3, 5, 6, 7, 9, 10, 12, 15, 18], split into 3 bins (a pandas sketch follows below)
Equal-width bins: [2, 7), [7, 12), [12, 18]
Resulting groups: [2, 3, 5, 6], [7, 9, 10], [12, 15, 18]
Equal-size bins:
Resulting groups: [2, 3, 5], [6, 7, 9], [10, 12, 15, 18]
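A minimal pandas sketch of the two schemes above, using the same toy data; note that `pd.cut`/`pd.qcut` pick their own exact bin edges, so they may differ slightly from the hand-worked intervals:

```python
import pandas as pd

values = pd.Series([2, 3, 5, 6, 7, 9, 10, 12, 15, 18])

# Equi-width binning: 3 intervals of equal length over the value range
equal_width = pd.cut(values, bins=3)

# Equi-depth (equal-frequency) binning: 3 quantile-based buckets
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```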
Set data 集合形数据
• Each record is a set of items from a space of possible items
• Sets can also be represented as binary vectors, or vectors of counts
Vector representation of Transaction data(market-basket data)
Vector representation of Document data(bag-of-words)
Dependent data
• In some cases, there are explicit dependencies between the data
• Ordered/Temporal data: We know the time order of the data
• Spatial data: Data that is placed on specific locations
• Spatiotemporal data: data with location and time
• Networked/Graph data: data with pairwise relationships between entities
Genomic sequence data
long ordered string
e.g., GGTTCCGCCTTCAG... (a nucleotide sequence)
Time series Data
Sequence of ordered (over “time”) numeric values.
e.g., stock prices
Sequence data
Sequence data: Similar to the time series but in this case, we have
categorical values rather than numerical ones.
• Example: Event logs
Spatial data
Attribute values that can be arranged with geographic co-ordinates
Such data can be nicely visualized
e.g., map coordinates
Spatiotemporal data
Data that have both spatial and temporal aspects
• Measurements in different locations over time
Pressure, Temperature, Humidity
• Measurements that move in space over time
Traffic, Trajectories of moving objects
Graph Data
Graph data: a collection of entities and their pairwise relationships
In this case the data consists of pairs: Who links to whom Or directed links
representation : Adjacency matrix-- Very sparse, very wasteful, but useful conceptually
e.g
• Web pages and hyperlinks
• Facebook users and friendships
• The connections between brain neurons
• Genes that regulate each other
The data matrix
In almost all types of data we can find a way to transform the data into a matrix,
where the rows correspond to different records, and the columns to numeric attributes
Data Mining Pipeline**
Data Quality
Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data
• Units may be different in different parts of the data (centimeters vs meters, or meters vs feet)
• Time measurements should be brought into the same time zone.
• Normalize names:
• Panayiotis Tsaparas, P. Tsaparas, and P. N. Tsaparas are all the same person
• Financial units should be normalized
• Different currencies
• Prices over time
Preprocessing
steps:
• Reducing the data: Sampling, Dimensionality Reduction
• Data cleaning: deal with missing or inconsistent information
• Feature extraction and selection: create a useful representation of the data by extracting useful features
Sampling**
• The main technique employed for data selection; often used for both the preliminary investigation and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Types
Simple random sampling: every item has an equal probability of being selected.
Sampling without replacement: once an item is selected it cannot be selected again, so the probability of selecting a particular item depends on how many selections have been made and on the dataset size.
Sampling with replacement: items are not removed from the population, so the same item can be selected more than once; this makes the probability calculations simpler.
Stratified sampling: split the data into groups (strata) and draw random samples from each group.
Biased sampling: the selection probability is deliberately non-uniform, so some items are favored over others.
• Example: When sampling temporal data, we want to increase the probability of sampling recent data
• Introduce recency bias
• Make the sampling probability to be a function of time, or the age of an item
• Typical: Probability decreases exponentially with time
• For an item 𝑥ᵢ of age 𝑡, select it with probability 𝑝(𝑥ᵢ) ∝ 𝑒^(−𝜆𝑡) (exponential decay with age)
Reservoir Sampling: an algorithm for random sampling from a data stream; particularly useful when the length of the stream is not known in advance and memory is limited.
Claim: after N items have been read, every item is in the reservoir with probability k/N (1/N when the reservoir size is k = 1).
Algorithm steps (reservoir of size k):
- Initial phase: keep the first k items.
- For every later item i > k: keep item i with probability k/i, and if it is kept, replace one of the k items currently in the reservoir chosen uniformly at random (each with probability 1/k).
This guarantees that at any point every item seen so far is in the reservoir with the same probability, i.e., the sample is uniform. (A Python sketch follows below.)
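A minimal Python sketch of reservoir sampling with a reservoir of size k (the stream here is just a `range`, an assumption for illustration):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)          # keep the first k items
        else:
            j = random.randint(1, i)        # with probability k/i ...
            if j <= k:
                reservoir[j - 1] = item     # ... replace a random reservoir slot
    return reservoir

print(reservoir_sample(range(1, 1001), k=10))
```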
Dimensionality Reduction
• Reduce the amount of data
• Extract the useful information.
Data Cleaning**
Missing Values
• Ignore the data
• Replace with random value
• Not a good idea, but you can understand how the missing value affects the output
• Replace with nearest neighbor value
• Relatively common practice.
• Should be careful for cases where this does not make sense (e.g., year of birth/death)
• Replace with cluster mean
Outliers
• Remove them
• Try to correct them using common sense
• Transform the data
When using the data, we should be careful of cases where our results are too good to be true, or too bad to be true
- Too Good to Be True: you train a machine learning model and it achieves 99% accuracy on the test set. Upon further inspection, you realize the test set contains samples that were also present in the training set.
- Too Bad to Be True: a model trained on customer purchase data performs poorly, and upon investigation you find that 30% of the purchase records have missing or incorrect customer IDs, leading to mismatched features and labels.
Lecture 3 Data exploration and statistical analysis
Feature Extraction
We need to extract the features/attributes from the data and build the data matrix
Text Data
Text normalization
First cut
Second cut:Remove stop words
Term Frequency (TF)
captures how often a word appears in a document
𝑇𝐹(t, 𝑑): term frequency of word t in document 𝑑
= (Number of times term t appears in document d) / (Total number of terms in document d)
• Number of times that the word appears in the document
• Natural measure of importance of the word for the document
Inverse Document Frequency (IDF)
Natural measure of the uniqueness of the word w
Important words are the frequent words that are unique to the document (differentiating) compared to the rest of the collection
• All reviews use the word “like”. This is not interesting
• We want the words that characterize the specific restaurant
Document Frequency 𝐷𝐹(𝑤)
fraction of documents that contain word 𝑤.
Inverse Document Frequency: 𝐼𝐷𝐹(𝑤) = log(1/𝐷𝐹(𝑤)) = log(𝐷 / number of documents containing 𝑤), where 𝐷 is the total number of documents
• Maximum when the word is unique to one document: 𝐼𝐷𝐹(𝑤) = log(𝐷)
• Minimum when the word is common to all documents: 𝐼𝐷𝐹(𝑤) = 0
TF-IDF
The words that are best for describing a document are the ones that are important for the document, but also unique to the document.
𝑇𝐹-𝐼𝐷𝐹(𝑤, 𝑑) = 𝑇𝐹(𝑤, 𝑑) X 𝐼𝐷𝐹(𝑤)
e.g
Document 1: "the cat in the hat"
Document 2: "the quick brown fox"
We want to calculate the TF-IDF for the words "cat" and "quick" in these documents.
1. compute TF
TF(cat,d1)=1/5=0.2
TF(quick,d2)=1/4=0.25
2. compute IDF
IDF(cat): The word "cat" appears in 1 document (Document 1).
IDF(cat)=log(2/1)=log(2)≈0.3010
IDF(quick): The word "quick" appears in 1 document (Document 2).
IDF(quick)=log(2/1)=log(2)≈0.3010
3. compute TF-IDF
TF-IDF(cat,d1)=0.2×0.3010=0.0602
TF-IDF(quick,d2)=0.25×0.3010=0.0753
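A small Python sketch reproducing the hand computation above (log base 10, TF normalized by document length; real libraries such as scikit-learn use slightly different conventions):

```python
import math

docs = {
    "d1": "the cat in the hat".split(),
    "d2": "the quick brown fox".split(),
}

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs.values() if term in d)   # number of docs containing the term
    return math.log10(len(docs) / df)

for term, doc_id in [("cat", "d1"), ("quick", "d2")]:
    score = tf(term, docs[doc_id]) * idf(term, docs)
    print(term, doc_id, round(score, 4))   # 0.0602 and 0.0753, as above
```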
Third cut
• TF-IDF takes care of stop words as well
• We do not need to remove the stopwords since they will get 𝐼𝐷𝐹(𝑤) = 0
Data Normalization**
Column normalization
Column normalization scales each column (feature) of the dataset to a similar range or distribution, so that different features carry comparable weight in later analysis or modeling. Normalizing each column separately reduces the influence of differing value ranges on the results, and can make models more stable and faster to converge. Several methods follow (a numpy sketch appears after the list):
1. Divide (the values of a column) by the maximum value for each attribute
• Brings everything in the [0,1] range, maximum is 1
new value = old value / max value in the column
2. Subtract the minimum value and divide by the difference of the maximum value and minimum value for each attribute
• Brings everything in the [0,1] range, maximum is one, minimum is zero
new value = (old value – min column value) / (max col. value –min col. value)
3. Subtract the mean value for each column – centering of features
• All columns have mean zero
new value = (old value – avg column value)
4. Divide by the length of the centered column vector
• All columns become unit vectors
5. Divide by the standard deviation of the column vector
• Computes the z-score
• Number of standard deviations away from the mean
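A numpy sketch of the five column normalizations above, applied to an assumed toy matrix X (rows = objects, columns = attributes):

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 600.0]])

max_scaled   = X / X.max(axis=0)                                      # 1. divide by column max
min_max      = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # 2. bring into [0, 1]
centered     = X - X.mean(axis=0)                                     # 3. zero-mean columns
unit_columns = centered / np.linalg.norm(centered, axis=0)            # 4. unit-length columns
z_scores     = centered / X.std(axis=0)                               # 5. z-scores

print(z_scores)
```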
Row normalization
As explained in the slides:
1. documents similar
Divide by the sum of values for each document (row in the matrix)
• Transform a vector into a distribution*
new value = old value / Σ old values in the row
2. Do users rate in a similar way?
Subtract the mean value for each user (row) – centering of data
• Captures the deviation from the average behavior
new value = (old value – mean row value) [/ (max row value –min row value)]
3. Divide by the standard deviation of the row: measures the number of standard deviations away from the mean (a per-row z-score)
Softmax
If we want to turn scores into a probability distribution (the probabilities over all classes sum to 1):
transform the scores into a probability distribution
Logistics Function
If we want to turn a score into the probability that the user will come back to this restaurant (the probability of the positive class). This is different from "the probability that the user will choose one of the three restaurants": it is not a distribution over restaurants, but over the events "will return" / "will not return".
• Maps reals to the [0,1] range
• Mimics the step function
• In the class of sigmoid functions
sigmoid function
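A small numpy sketch of both transformations; `scores` is an assumed vector of raw scores (one per class for softmax) and `score` a single raw score for the logistic function:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into a probability distribution (sums to 1)."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

def logistic(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

print(softmax(np.array([2.0, 1.0, 0.1])))   # distribution over three options
print(logistic(1.5))                        # probability of the positive class
```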
Logarithm function(对数函数)
Sometimes a data row/column may have a very wide range of values.
Normalizing in this case would wash out the small values.
We can bring the values to the same scale by applying a logarithm to the column values.
The logarithm rescales exponentially distributed data; the transformed values are real numbers and the curve keeps increasing without an upper bound.
Normalization vs Standardization
(Not from the slides, but the two concepts are easy to mix up, and I'm surely not the only one who does.)
Similarity & Distance**
• For many different problems we need to quantify how close two objects are.
• To solve these problems we need a definition of similarity, or distance.
• The definition depends on the type of data that we have
Similarity
Numerical measure of how alike two data objects are.
• A function that maps pairs of objects to real values
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1]
Desirable properties for similarity
1. 𝑠(𝑝, 𝑞) = 1 (or maximum similarity) only if 𝑝 = 𝑞. (Identity)
2. 𝑠(𝑝, 𝑞) = 𝑠(𝑞, 𝑝) for all 𝑝 and 𝑞. (Symmetry)
Intersection
Jaccard Similarity(Similarity between sets)
e.g
1. set
A={1,2,3,4}
B={3,4,5,6}
A∩B = {3, 4}, ∣A∩B∣=2
A∪B = {1, 2, 3, 4, 5, 6}, ∣A∪B∣=6
JSim(A,B) = ∣A∩B∣/ ∣A∪B∣ = 2/6 = 0.33
2. documents
sentence1: "I love programming"
sentence2: "I enjoy programming"
A={I,love,programming}
B={I,enjoy,programming}
A∩B={I ,programming} ∣A∩B∣=2
A∪B = {I, love, enjoy, programming}, ∣A∪B∣ = 4
JSim(A,B) = 2/4 = 0.5
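A one-function Python sketch of Jaccard similarity on sets, reproducing the sentence example above:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; 1.0 when both sets are identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = set("I love programming".split())
s2 = set("I enjoy programming".split())
print(jaccard(s1, s2))   # 0.5
```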
Cosine Similarity (similarity between vectors)
e.g.
D1 = [10, 20, 0, 0]
D2 = [30, 60, 0, 0]
Cos(D1, D2) = 1 (the two vectors point in the same direction)
The other pairs are computed the same way:
Cos(D3, D1) = Cos(D3, D2) = 4/5
Cos(D4, D1) = Cos(D4, D2) = Cos(D4, D3) = 0
Applications:
- Text similarity: in natural language processing, documents or sentences can be represented as vectors (using methods like TF-IDF or word embeddings); cosine similarity measures how similar they are.
- Recommendation systems: cosine similarity is used in collaborative filtering, where items are represented by user ratings and similar items can be recommended.
- Clustering: cosine similarity can help group similar data points together, especially when comparing feature vectors.
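A numpy sketch of cosine similarity, using the D1/D2 vectors from the example above:

```python
import numpy as np

def cosine(x, y):
    """Dot product divided by the product of the vector lengths."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1 = np.array([10, 20, 0, 0])
d2 = np.array([30, 60, 0, 0])
print(cosine(d1, d2))   # 1.0 -- the vectors point in the same direction
```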
Correlation Coefficient
The correlation coefficient measures correlation between two random variables
Interpretation of the result:
- 1, there's a perfect positive correlation.
- 0, there's no correlation.
- -1, there's a perfect negative correlation.
e.g
CorrCoeff(D1,D2) = 1
D1 = [-5,5,0,0] D2 = [-15,15,0,0]
mean(D1) = 0 = mean(D2)
Numerator :∑(xi−μX)(yi−μY) = (−5−0)(−15−0)+(5−0)(15−0)+(0−0)(0−0)+(0−0)(0−0) =(−5)(−15)+(5)(15)+0+0=75+75+0+0=150
Denominator: √(Σ(xᵢ−μX)²) · √(Σ(yᵢ−μY)²) = √(25+25+0+0) · √(225+225+0+0) = √50 · √450 = 150
CorrCoeff(D1,D2)=150/150=1
The remaining computations are omitted:
CorrCoeff(D1,D3) = CorrCoeff(D2,D3) = -1
CorrCoeff(D1,D4) = CorrCoeff(D2,D4) = CorrCoeff(D3,D4) = 0
Distance
• Numerical measure of how different two data objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an object with itself.
• Upper limit varies
(I only just realized that L1/L2 regularization uses the same norms as the L1/L2 distance measures...)
Jaccard distance
𝐽𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 – 𝐽𝑆𝑖𝑚(𝑋, 𝑌)
Cosine distance
𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 − cos(𝑋, 𝑌)
Hamming Distance
Hamming distance is the number of positions in which bit-vectors differ.
e.g
p1 = 10101
p2 = 10011.
d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th positions.
The L1 norm for the binary vectors
Hamming distance between two vectors of categorical attributes is the number of positions in which they differ.
Example:
x = (married, low income, cheat)
y = (single, low income, not cheat)
d(x,y) = 2
Distances between distributions ↓↓↓
Variational distance
The 𝐿1 distance between the distribution vectors
Used to compare probability distributions (probabilities, not raw values!)
Dist(D1,D2) = |0.35−0.4| + |0.5−0.4| + |0.1−0.1| + |0.05−0.1| = 0.05 + 0.1 + 0 + 0.05 = 0.2
Dist(D2,D3) = 0.35+0.35+0.5+ 0.2 = 1.4
Dist(D1,D3) = 0.3+0.45+0.5+ 0.25 = 1.5
Information theoretic distance
Ranking distance
The input in this case is two rankings/orderings of the same N items.
𝑅1 = <𝑥, 𝑦, 𝑧, 𝑤>
𝑅2 = <𝑦, 𝑤, 𝑧, 𝑥>
Reading the example: x has rank 1 in R1, y has rank 2 in R1, and so on;
in R2, x has rank 4, y has rank 1, and so on.
Spearman rank distance: L 1 distance between the ranks
(Background: the related Spearman rank correlation coefficient measures the monotonic relationship between two variables, i.e., whether they tend to increase or decrease together, regardless of the exact functional form. Its value lies between −1 and +1:
- +1 means a perfectly monotonically increasing relationship: when one variable increases, the other always increases.
- −1 means a perfectly monotonically decreasing relationship: when one variable increases, the other always decreases.
- 0 means no monotonic relationship: there is no consistent pattern in how the two variables move.)
Compute the difference in rank position element by element:
SR(R1,R2) = |1-4| +|2-1| + |3-3| + |4-2| = 6
So what is the difference between Spearman rank distance and variational distance?
Spearman rank distance is more for ranking data in terms of correlation, while Variational distance is for comparing probability distributions directly.
Euclidean distance
For two points P = (x1, y1) and Q = (x2, y2), the Euclidean distance between them is d(P, Q) = √((x1 − x2)² + (y1 − y2)²).
Conclusion
- Jaccard distance: for set data or binary features.
- Cosine distance: for high-dimensional sparse data such as text vectors.
- Hamming distance: for binary data or fixed-length strings.
- Variational distance: for comparing probability distributions.
- Information-theoretic distance: for comparing distributions (e.g., KL divergence).
- Ranking distance: for rankings / ordinal data.
- Euclidean distance: for points in geometric space or similarity of numeric data.
Exploratory Data Analysis (EDA)
Some pandas operations; the exam includes code questions on these.
High-level viewing
• head() – first N observations
• tail() – last N observations
• columns – names of the columns (an attribute: df.columns, not a method)
• describe() – statistics of the quantitative data
• dtypes – the data types of the columns (also an attribute: df.dtypes)
Accessing/processing:
• df[“column_name”]
• df.column_name
• .max(), .min(), .idxmax(), .idxmin()
• <dataframe> <conditional statement>
• .loc[] – label-based accessing
• .iloc[row_indices, column_indices] – index-based accessing
• .sort_values()
• .isnull(), .notnull()
• .isna(), .notna()
• .any() – returns True if at least one value in the object is True; returns False only if there are no True values
• .fillna()- Fill NA/NaN values using the specified method.
• .drop_duplicates()
Grouping/Splitting/Aggregating:
• groupby()-
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
• get_group() – returns the rows of one group produced by groupby()
• .merge() – merging datasets is the process of combining two datasets into one; e.g., attributes from several tables can be combined on a shared id/key
• .concat() – concatenates DataFrames directly along rows or columns; no join key is needed
• .aggregate() – applies aggregation operations (sum, mean, etc.) to grouped data or to a whole DataFrame (see the sketch below)
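A short sketch of the grouping/merging/aggregating operations above, on two assumed toy DataFrames:

```python
import pandas as pd

ratings = pd.DataFrame({"user_id": [1, 1, 2, 2, 3],
                        "rating":  [5, 3, 4, 2, 5]})
users   = pd.DataFrame({"user_id": [1, 2, 3],
                        "city":    ["Oslo", "Bergen", "Oslo"]})

merged = ratings.merge(users, on="user_id")          # join the two tables on a shared key
by_city = merged.groupby("city")                     # split into groups
print(by_city.get_group("Oslo"))                     # rows of one group
print(by_city["rating"].aggregate(["mean", "max"]))  # apply + combine: one row per city
```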
Lecture 4 Data collection and Data scraping (not on the exam)
Not covered in the exam; it is assessed through CW1 instead, so I'm skipping it.
Web Service
• A server is a long running process (also called daemon) which listens on a pre-specified port
• and responds to a request, which is sent using a protocol called HTTP
• A browser parses the url.
Data Scraping
Challenges
Which data?
The internet is dynamic
Data is volatile
Privacy
Netiquette
STEP 1: INSPECT YOUR DATA SOURCE
STEP 2: SCRAPE HTML CONTENT FROM A PAGE
STEP 3: PARSE HTML CODE WITH BEAUTIFUL SOUP
FIND ELEMENTS BY ID
FINDALL VS. FIND
FIND ELEMENTS BY HTML CLASS NAME
EXTRACT TEXT FROM HTML ELEMENTS
FIND ELEMENTS BY CLASS NAME AND TEXT CONTENT
PASS A FUNCTION TO A BEAUTIFUL SOUP METHOD
ACCESS PARENT ELEMENTS
EXTRACT ATTRIBUTES FROM HTML ELEMENTS
Gathering data from APIs
AnyAPI
Lecture 5 Data Visualization
Visualization motivation
• Identify hidden patterns and trends
• Formulate/test hypotheses
• Communicate any modeling results
–Present information and ideas succinctly
–Provide evidence and support
–Influence and persuade
• Determine the next step in analysis/modeling
Principles of Visualization
1. Maximize data to ink ratio: show the data
2. Don’t lie with scale (Lie Factor)
3. Minimize chart-junk: show data variation, not design variation
4. Clear, detailed and thorough labeling
Types of Visualization
• Distribution: how a variable or variables in the dataset
distribute over a range of possible values.
• Relationship: how the values of multiple variables in the
dataset relate
• Composition: how a part of your data compares to the whole.
• Comparison: how trends in multiple variables or datasets compare
Distribution
• When studying how quantitative values are located along an
axis, distribution charts are the way to go.
• By looking at the shape of the data, the user can identify
features such as value range, central tendency and outliers.
Histogram
A histogram is a way to visualize how 1-dimensional data
is distributed across certain values.
Trends in histograms are sensitive to number of bins.
Relationship
• They are used to find correlations, outliers, and clusters in your data.
• While the human eye can only appreciate three dimensions together, you can visualize additional variables by mapping them to the size, color or shape of your data points.
When we are trying to show the relationship between 2 variables, scatter plots or charts are used. When we are trying to show “relationship” between three variables, bubble charts
are used.
Scatter Plot
A scatter plot is a chart used to plot a correlation between two or more variables at the same time. It’s usually used for numeric data.
It is a way to visualize how multi-dimensional data are distributed across certain values. A scatter plot is also a way to visualize the relationship between two different attributes of multi-dimensional data.
3D data
For 3D data, a quantitative attribute can be encoded by size in a bubble chart
Relationships may be easier to spot by producing multiple
plots of lower dimensionality
Comparison
• These are used to compare the magnitude of values to each other and to easily identify the lowest or highest values in the data.
• If you want to compare values over time, line or bar charts are often the best option.
– Bar or column charts → comparisons among items.
– Line charts → a sense of continuity.
– Pie charts can be used for comparison as well.
MULTIPLE HISTOGRAMS
Plotting multiple histograms (and kernel density estimates of the distribution, here) on the same axes is a way to visualize how different variables compare (or how a variable differs over specific
groups).
BOXPLOTS
A boxplot is a simplified visualization to compare a quantitative variable across groups. It highlights the range, quartiles, median and any outliers present in a data set.
Composition
• Composition charts are used to see how a part of your data compares to the whole.
• Show relative and absolute values.
• They can be used to accurately represent both static and time-series data.
PIE CHART
A pie chart is a way to visualize the static composition (aka, distribution) of a variable (or single group).
STACKED AREA GRAPH
A stacked area graph is a way to visualize the composition of a group as it changes over time (or some other quantitative variable). This shows the relationship of a categorical variable (AgeGroup) to a quantitative variable (year).
Lecture 6 Infrastructure that supports Big Data processing
What is cluster architecture?
Distributed file system
– Data kept in “chunks” spread across machines
– Each chunk replicated on different machines
– Seamless recovery from disk or machine failure
MapReduce**
use for:
– Easy parallel programming
– Invisible management of hardware and software failures
– Easy management of very-large-scale data
The core idea of MapReduce is "divide and conquer": break a complex task into many small tasks, compute them separately, then combine the results. The main phases are:
1. Map: Transform input data into key-value pairs. Each mapper works on a subset of data in parallel.
input: A line from the input List File.
Output: The grocery item as the key and a count of 1 as the value.
This phase can be thought of as splitting the work.
2. Group by Key: sort and shuffle. The system automatically sorts and groups the key-value pairs by their key (the shuffle phase).
3. Reduce: Aggregate or process grouped data to produce the final output.
input: key-value pairs from the mappers, where the key is a grocery item and the value is a list of counts.
Output: key-value pairs with the grocery item as the key and the sum of the values as the value.
The shuffled data is recombined according to what the user needs.
Structure
The master node acts on behalf of the user and coordinates the whole process (assigning map and reduce tasks to the workers).
Some true/false statements:
1) Each mapper/reducer must generate the same number of output key/value pairs as it receives on the input. (Wrong)
2) The output type of keys/values of mappers/reducers must be of the same type as their input. (Wrong)
In many data-processing tasks the mapper/reducer transforms the input. For example, a mapper may receive lines of text and output (word, count) pairs of type (String, Integer). The input and output types can differ because the core goal of MapReduce is to transform data, not merely pass it through.
This flexibility is a key feature of the MapReduce programming model: users are free to change the format and type of the data as needed.
3) The inputs to reducers are grouped by key. (True)
4) It is possible to start reducers while some mappers are still running . (Wrong)
5) The correct sequence of data flow in MapReduce is InputFormat, Parti-tioning Function, Reducer, Sort and Group, OutputFormat. (Wrong)
The correct sequence is InputFormat, Mapper, Partitioning Function, Shuffle and Sort, Reducer, OutputFormat
6) A Mapper maps input key/value pairs to a set of intermediate key/value pairs. (True — this describes the Mapper; the same statement about a Reducer would be false.)
e.g., Word Count:
- Mapper input: (line number, line of text), type (Long, String).
- Mapper output: (word, 1), type (String, Integer).
- Reducer output: (word, number of occurrences), type (String, Integer).
(A Python sketch of the data flow follows below.)
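A pure-Python sketch that mimics the three phases (map, group by key, reduce) for word count; a real job would run the mapper and reducer inside a framework such as Hadoop, but the data flow is the same:

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Reduce: sum all counts that were grouped under the same key."""
    return word, sum(counts)

lines = ["the cat sat", "the cat ran"]

# Map phase
pairs = [kv for line in lines for kv in mapper(line)]

# Group-by-key phase (shuffle and sort, done automatically by the framework)
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase
print([reducer(word, counts) for word, counts in groups.items()])
# [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]
```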
Extends MapReduce
Spark
Apache Spark is an open-source distributed data-processing framework designed for fast computation over large datasets. Compared with traditional big-data frameworks such as Hadoop MapReduce, Spark offers faster computation, richer functionality, and a friendlier programming interface.
Key construct/idea: Resilient Distributed Dataset (RDD)
Higher-level APIs:
– Different APIs for aggregated data, which allowed SQL support to be introduced
Hadoop
Hadoop is an open-source framework that implements the MapReduce model. Hadoop's MapReduce component handles the distributed computation, and HDFS (the Hadoop Distributed File System) handles data storage and access.
Disk-based computation where the results of each step are written to disk.
RDD
- Distributed computation in memory
- Greatly improves the speed of data processing
Key construct/idea: Resilient Distributed Dataset (RDD)
Conclusion
Lecture 7 Dimensionality Reduction
Interaction Terms and Unique Parameterizations
Interaction Terms are combinations of predictor variables used to capture joint effects on the outcome variable. They help the model account for relationships that are not just additive but also dependent on the interaction between variables.
Parameterization参数化 in statistical models refers to how you define and estimate the model's parameters. In the context of regression models (or other models), parameters are the coefficients that quantify the effect of each predictor variable (or interaction term) on the outcome.
Unique Parameterizations refer to the separate parameters that need to be estimated for each variable and interaction term in the model. A model is "unidentifiable" when there aren't enough observations to estimate all these parameters uniquely.
A model is unidentifiable when there aren't enough observations to estimate all of its parameters. This happens if you try to estimate more parameters than the data can support. For example:
- If you have only 100,000 observations but you have 8.3 million parameters in the model (as mentioned in the passage), the model might not be identifiable, meaning you won't be able to estimate the unique values of all 8.3 million parameters reliably.
- Unidentifiable models are problematic because they can lead to overfitting, poor generalization, and invalid conclusions.
How Interaction Terms Affect Parameterization:
When you add interaction terms to your model, the number of parameters increases because you're essentially introducing new combinations of features to be estimated. This can quickly lead to a high-dimensional model (many parameters) that requires a large amount of data to estimate them accurately.
For example:
- A model with just two variables (Education and Work Experience) will have parameters for each individual variable, but adding an interaction term adds a third parameter.
- As you keep adding more variables and their interactions, the number of parameters grows exponentially, leading to models that may become unidentifiable with limited data.
Big Data and High Dimensionality
A rectangular data set has two dimensions: number of observations (n) and the number of predictors (p). Both can play a part in defining a problem as a Big Data problem.
In a dataset, an observation refers to a single data point or a row in the dataset. Each observation typically represents an individual entity or instance that is being studied (i.e., each row of a CSV file).
A predictor is a variable or feature in your dataset that is used to predict the outcome or dependent variable. These are the columns in your dataset that provide information to help build a model (i.e., the columns of a CSV file).
• n is big (and p is small to moderate)?
Keep in mind, big n doesn't solve everything – the data also has to be representative to be useful.
• p is big (and n is small to moderate)?
Matrices may not be invertible (issue in OLS).
Multicollinearity is likely to be present
Models are susceptible to overfitting
This situation is called High Dimensionality, and needs to be accounted for when performing data analysis and modeling.
solution↓
Framework For Dimensionality Reduction
One way to reduce the dimensions of the feature space is to create a new, smaller set of predictors by taking linear combinations of the original predictors.
The original p predictors are combined linearly into a new, smaller set of predictors Z1, Z2, ..., Zm (m < p):
Zi = Σⱼ₌₁..p φⱼᵢ Xⱼ
- φⱼᵢ are fixed coefficients (weights) that determine how much the original predictor Xj contributes to Zi.
- m is the number of predictors after reduction.
A new linear regression model is then built on the reduced predictors Z1, Z2, ..., Zm:
Y = β0 + β1 Z1 + ... + βm Zm + ε
- Y is the target variable (the outcome to predict).
- β0, β1, ..., βm are the model coefficients.
- ε is the error term.
Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a method to identify a new set of predictors, as linear combinations of the original ones, that captures the 'maximum amount' of variance in the observed data.
That is, the observed data shows more variance in the direction of Z1 than in the direction of Z2.
To perform dimensionality reduction we select the top m principal components of PCA as our new predictors and express our observed data in terms of these predictors.
The top PCA components capture the largest amount of variation (the interesting features) of the data. Each component is a linear combination of the original predictors – we visualize them as vectors in the feature space. Transforming our observed data means projecting our dataset onto the space defined by the top m PCA components; these components are our new predictors.
The Math behind PCA
Key points of PCA
1. Capturing Variation
2. PCA produces a list of p principal components, of which we keep m < p. The first principal component (Z1) corresponds to the direction along which the data exhibits the highest variance. Subsequent components (Z2, Z3, etc.) capture progressively less variance but are still important sources of variation. To perform dimensionality reduction we select the top m principal components of PCA as our new predictors and express our observed data in terms of these predictors.
3. Linear combinations: each principal component (Zi) is a linear combination of the original predictors, indicating how each predictor contributes to that component's direction. The principal components (Z's) are pairwise orthogonal, meaning they are uncorrelated with each other.
The PCA procedure
The slides do not go through this in detail, but past exams have tested it (well, apparently not any more).
用最直观的方式告诉你:什么是主成分分析PCA_哔哩哔哩_bilibili (Bilibili video)
In short: find new coordinate axes via a linear transformation that keep as much of the variance as possible.
(Ha, my linear algebra is pretty rusty; I plan to review it.)
Covariance: do two variables move in the same or opposite directions as they change, and how strongly?
Singular Value Decomposition (SVD)
Computing the covariance matrix directly can be expensive, so we can decompose the (centered) data matrix with SVD instead. (A numpy sketch follows below.)
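A numpy sketch of PCA via SVD on an assumed random data matrix X: center the columns, decompose, and keep the top m components as the new predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # assumed data matrix: n=100 objects, p=5 features
m = 2                                    # number of principal components to keep

Xc = X - X.mean(axis=0)                  # center each column
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:m]                      # each row: a linear combination of the original features
Z = Xc @ components.T                    # project the data onto the top m components
explained = (S**2) / (S**2).sum()        # fraction of variance captured by each component

print(Z.shape, explained[:m])
```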
pros and cons
PCA is an unsupervised algorithm. It is done independent of the outcome variable.
• Note: the component vectors as predictors might not be ordered from best to worst!
Disadvantage:
1. Direct interpretation of the coefficients in PCR is completely lost. So do not use it if interpretation is important.
2. Will often not improve predictive ability of a model.
Advantage:
1. Reducing dimensionality in high dimensional settings.
2. Visualizing how predictive your features can be of your response, especially in the classification setting.
3. Reducing multicollinearity, and thus may improve the computational time of fitting models.
PCA for Regression (PCR)
If we use all p of the new Zj, then we have not improved the dimensionality. Instead, we select the first M PCA variables, Z1,...,ZM, to use as predictors in a regression model, and we use cross-validation to check performance on the specific problem.
PCA for Imputation
* Self-study material
Imputation
Imputation refers to the process of filling in missing data (or "missing values") in a dataset.
we want our imputations to take into account (1) links between variables and (2) global similarity between individual observations. This is the idea behind the iterative PCA algorithm for imputation.
Iterative PCA
Notice that the 1st PCA component changes less and less with each iteration. We simply iterate this process until we converge.
Algorithm
1. Initialize imputation with reasonable value (e.g., mean)
2. Iterate until convergence:
a. perform PCA on the complete data
b. retain first M components of PCA(in example M =1)
c. project imputation into PCA space
d. update imputation using projection value
we use Cross-validation to select the number of components to use for our projections
In practice this method can overfit, especially in a sparse matrix with many missing values. So often a regularized PCA approach is used which shrinks the singular values of the PCA , making updates less aggressive.
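A compact numpy sketch of the iterative PCA imputation loop above (M = 1 component, column-mean initialization, fixed iteration count; the convergence test and regularization of a production implementation are omitted):

```python
import numpy as np

def iterative_pca_impute(X, M=1, n_iter=50):
    X = X.astype(float).copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])       # 1. initialize with column means
    for _ in range(n_iter):                                     # 2. iterate
        mu = X.mean(axis=0)
        U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)   # a. PCA on the completed data
        approx = (U[:, :M] * S[:M]) @ Vt[:M] + mu               # b/c. rank-M reconstruction
        X[missing] = approx[missing]                            # d. update only the imputed cells
    return X

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
print(iterative_pca_impute(X, M=1))
```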
Alternative Interpretation of PCA
An alternative interpretation is that PCA finds a low-dimensional linear surface which is closest to the data points.
Matrix Completion
* Self-study material
Matrix Completion is another application of PCA and is suitable for imputing data which are missing at random. This approach is commonly used in recommender systems which deal in very large sparse matrices.
Consider an n x p matrix of movie ratings by Netflix customers where n is the number of customers and p is the number of movies. Such a matrix will certainly contain many missing entries.
Imputing these missing values well is equivalent to predicting what customers will think of movies they haven’t seen yet.
Lecture 8-9 Large-scale Machine Learning
Supervised Learning
In supervised learning, except for the feature variables that describe the
data, we also have a target variable. The goal is to learn a function (model) that can estimate/predict the value of the target variable given the features
Task:
Regression: The target variable (but also the features) is numerical and continuous
Classification: The target variable is discrete
Descriptive modeling: Explanatory tool to understand the data
Regression: How does the change in the value of different factors affect our target variable?
• What factors contribute to the price of a stock?
• What factors contribute to the GDP of a country?
Classification: Understand what attributes distinguish between objects of different classes
• Why people cheat on their taxes?
• What words make a post offensive?
(The course really does emphasize interpretability.)
Predictive modeling: Predict a class of a previously unseen record
Regression: What will the life-expectancy of a patient be?
Classification: Is this a cheater or not? Will the stock go up or not. Is this an offensive post?
Regression
• We assume that we have 𝑘 feature variables (numeric):
Also known as covariates, or independent variables
• The target variable is also known as dependent variable.
• We are given a dataset of the form {(𝒙1, 𝑦1) , … , (𝒙𝑛, 𝑦𝑛) } where, 𝒙𝒊 is a 𝑘-dimensional feature vector, and 𝑦𝑖 a real value
• We want to learn a function 𝑓 which, given a feature vector 𝒙ᵢ, predicts a value 𝑦ᵢ′ = 𝑓(𝒙ᵢ) that is as close as possible to the value 𝑦ᵢ
• Minimize the sum of squares: Σᵢ (𝑦ᵢ − 𝑓(𝒙ᵢ))²
Linear regression
Best suited for:
- Predicting a continuous numerical target (e.g., house prices, sales revenue).
- Understanding the effect of the independent variables on the value of the dependent variable.
Example: Predicting a company's monthly revenue based on advertising spend.
The simplest form of 𝑓 is a linear function
One-dimensional linear regression
Multiple linear regression: matrix inversion may be too expensive, so other optimization techniques are often used to find the optimal vector (e.g., Gradient Descent).
Problems
1) Regression is sensitive to outliers -- remove the outliers
2) features may have very different scales -- Normalize the features by replacing the values with the z- scores -- Remove the mean and divide by the standard deviation
3) The model we have is linear with respect to the parameters 𝒘, but the features we consider may be non-linear functions of the 𝒙ᵢ values -- to capture more complex relationships, we can take a transformation of the input (e.g., the logarithm log xᵢⱼ) or add polynomial terms (e.g., xᵢⱼ²). A sketch combining these fixes follows below.
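A scikit-learn sketch that addresses the last two issues in one pipeline, z-scoring the features and adding polynomial terms before fitting the linear model; the synthetic data is an assumption for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))                      # two numeric features
y = 3 * X[:, 0] ** 2 - 2 * X[:, 1] + rng.normal(size=200)  # nonlinear in the first feature

model = make_pipeline(
    StandardScaler(),              # normalize features to z-scores
    PolynomialFeatures(degree=2),  # add squared and interaction terms
    LinearRegression(),            # still linear in the parameters w
)
model.fit(X, y)
print(model.predict(X[:3]))
```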
Logistic Regression
Best suited for:
• Produces a probability estimate for the class membership which is often very useful.
• The weights can be useful for understanding the feature importance.
• Works for relatively large datasets.
• Fast to apply.
• Instead of predicting the class of a record we want to predict the probability of the class given the record
• Transform the classification problem into a regression problem.
Estimating the coefficients
Classification
Classification is the task of learning a target function f that maps attribute set x to one of
the predefined class labels y
Evaluation
Classification Techniques ↓
Decision Trees
• A flow-chart-like tree structure
• Internal node denotes a test on an attribute
• Branch represents an outcome of the test
• Leaf nodes represent class labels or class distribution
Best suited for:
- Problems requiring interpretability.
- Both classification and regression tasks.
- Situations with nonlinear relationships or interactions between variables.
Example: Classifying loan applicants as likely to default or not based on financial history.
Induction
Goal: Find the tree that has low classification error in the training data (training error)
Finding the best decision tree (lowest training error) is NP-hard
Greedy strategy: Split the records based on an attribute test that optimizes certain criterion. (e.g Hunt’s Algorithm (one of the earliest), CART, ID3, C4.5, SLIQ,SPRINT)
Hunt’s Algorithm
• How to Classify a leaf node
• Assign the majority class
• If leaf is empty, assign the default class – the class that has the highest popularity (overall or in the parent node).
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Specify Test Condition
number of ways to split
2-way split
Note: the second split is not appropriate because it breaks the order – Small and Large should not be grouped together while skipping Medium.
Multi-way split
attribute types
Nominal
Categorical variables with no natural order or ranking; the values are just labels and cannot be compared, e.g., colour.
Ordinal
Attributes whose values have an order but no well-defined numeric spacing, e.g., small/medium/large, or grades (excellent/good/pass/fail).
Continuous
Numeric attributes with continuous values that support arithmetic operations.
Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
• Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and finds the best cut
• can be more computationally intensive
Tree evaluation
determine the Best Split:
• Greedy approach:
• Creation of nodes with homogeneous class distribution is preferred
• Need a measure of node impurity:
Random Forest
Best suited for:
- Problems with complex, nonlinear relationships.
- Reducing overfitting compared to a single decision tree.
- Both classification and regression tasks.
Example: Predicting customer churn in a subscription service based on user activity.
* self study
A collection of decision trees, where each tree is trained on a random subset of the data and features (bagging).
Nearest Neighbor Classifier
Uses k “closest” points (nearest neighbors) for performing classification
Best suited for:
- Small datasets with few features where instances are similar to each other.
- Problems where making predictions based on similarity to labeled examples works well.
Example: Classifying hand-written digits based on similarity to labeled digit images.
requirement
– The set of stored records
– Distance Metric to compute distance between records
– The value of k, the number of nearest neighbors to retrieve
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
The value of k determines the complexity of the model: Lower k produces more complex models
steps
1. Compute distance to other training records
2. Identify k nearest neighbors
Determine the class from the nearest-neighbor list: take the majority vote of the class labels among the k nearest neighbors (in short, the record gets whatever label dominates in its immediate neighborhood).
Weigh the vote according to distance: weight factor, w = 1/d^2
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote). We infer the unknown from the known, so this is supervised learning. (A numpy sketch follows below.)
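A small numpy sketch of the k-NN steps above (Euclidean distance, majority vote); the training points and labels are assumed toy values:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # 1. distance to every training record
    nearest = np.argsort(dists)[:k]                 # 2. indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)    # 3. majority vote of their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # "A"
```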
issues
1) Scaling issues -- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
2) Euclidean measure -- • High dimensional data
• curse of dimensionality: the collection of problems that arise when analyzing and organizing data in high-dimensional spaces as the number of dimensions grows
• Can produce counter-intuitive results
Solution: Normalize the vectors to unit length
limitation
• k-NN classifiers are lazy learners
• It does not build models explicitly. Unlike eager learners such as decision trees
• Classifying unknown records is relatively expensive
• Naïve algorithm: O(n)
• Need for structures to retrieve nearest neighbors fast.
• The Nearest Neighbor Search problem.
• Also, Approximate Nearest Neighbor Search
• Issues with distance in very high-dimensional spaces
Support Vector Machines
• SVMs are part of a family of classifiers that assumes that the classes are linearly separable
• That is, there is a hyperplane that separates (approximately, or exactly) the instances of the two classes and the goal is to find this hyperplane
Best suited for:
- Classification tasks with clear margins between classes.
- Datasets with a high-dimensional feature space (e.g., text classification).
- Problems requiring kernel methods to handle nonlinear data.
Example: Spam email detection.
AdaBoost
* self study
Neural Networks
* self study
Linear classification: a simple model for classification is to take a linear combination of the feature values and compute a score.
Nonlinear classification:
Deep learning
Networks with multiple layers
Multiple layers: Each layer applies a logistic function to the input linear combination:
• The result is a more complex function
Training Process:
• Start with some initial weights
• Forward pass: Compute the outputs of all internal nodes
• Backward pass: Perform backpropagation to estimate the gradients
• Change the weights to move towards the direction of the gradient
• Repeat
Gradient Descent
Stochastic gradient descent
• Stochastic gradient descent: consider one input point at a time; each point is considered only once. (In batch gradient descent the gradient is computed over all training data before each parameter update; SGD instead estimates the gradient from a single sample at each step.)
• Intermediate solution: use mini-batches of data points (e.g., 32 or 64 samples per update), combining the advantages of batch and stochastic gradient descent. (A numpy sketch follows below.)
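A numpy sketch of mini-batch SGD for linear regression with squared loss; the batch size, learning rate and synthetic data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w, lr, batch = np.zeros(3), 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))                    # shuffle the data each epoch
    for start in range(0, len(X), batch):
        idx = order[start:start + batch]               # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # gradient of the squared loss
        w -= lr * grad                                 # step against the gradient
print(w)   # close to [2, -1, 0.5]
```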
Backpropagation
Main idea:
• Start from the final layer: compute the gradients for the weights of the final layer.
• Use these gradients to compute the gradients of previous layers using the chain
rule
• Propagate the error backwards
Backpropagation essentially is an application of the chain rule for differentiation.
Word Embeddings (probably not on the exam)
06 Word2Vec模型(第一个专门做词向量的模型,CBOW和Skip-gram)_哔哩哔哩_bilibili
(Why are there paper-padding programmers everywhere, lol.)
Define a model that aims to predict between a center word 𝑤𝑐 and context words in some window of length 𝑚 in terms of word vectors
Word2Vec
Predict between every word and its context words
Arithmetic on word vectors: vector subtraction nicely captures dimensions of similarity (such as syntactic variation and semantic analogies). This is the core result of the Word2Vec model.
CBOW
Predict the center word from a bag-of-words context
Skip-gram
Predict the context words given the center word
Lecture 10 Clustering
Overview of Clustering
Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that
– Members of a cluster are close/similar to each other
– Members of different clusters are dissimilar
distance measures
Document = set of words
– Jaccard distance
Document = point in space of words
– (x1, x2,..., xN), where xi = 1 iff word i appears in doc
– Euclidean distance
Document = vector in space of words
–Vector from origin to (x1, x2,..., xN)
–Cosine distance
Hierarchical Clustering
– Agglomerative (bottom up) (the main focus of this lecture):
• Initially, each point is a cluster
• Repeatedly combine the two “nearest” clusters into one
– Divisive (top down):
• Start with one cluster and recursively split it
Agglomerative
operation: Repeatedly combine two nearest clusters
Euclidean case
each cluster has a centroid = average of its (data) points
1) How do you represent a cluster of more than one point?
– Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest? centroid
2) How do you determine the “nearness” of clusters?
Measure cluster distances by distances of centroids
NON-Euclidean case
The only “locations” we can talk about are the points themselves,there is no “average” of two points
1) How do you represent a cluster of more than one point?
clustroid = (data)point “closest” to other points
"closest" can mean: – smallest maximum distance to other points
– Smallest average distance to other points
– Smallest sum of squares of distances to other points
(2) How do you determine the “nearness” of clusters?
Treat clustroid as if it were centroid, when computing inter-cluster distances
Centroid vs Clustroid
Centroid is the avg. of all (data)points in the cluster. This means centroid is an “artificial” point.
Clustroid is an existing (data)point that is “closest” to all other points in the cluster.
3) When to stop combining clusters?
– Approach 1: Pick a number k upfront, and stop when we have k clusters
• Makes sense when we know that the data naturally falls into k classes
– Approach 2: Stop when the next merge would create a cluster with low “cohesion”
• i.e, a “bad” cluster
cohesion based approaches:
– Approach 3.1: Use the diameter of the merged cluster = maximum distance between points in the cluster
– Approach 3.2: Use the radius = maximum distance of a point from centroid (or clustroid)
– Approach 3.3: Use a density-based approach
• Density = number of points per unit volume
• E.g., divide number of points in cluster by diameter or radius of the cluster
• Perhaps use a power of the radius (e.g., square or cube)
Implementation
• Naïve implementation of hierarchical clustering:
– At each step, compute pairwise distances between all pairs of clusters, then merge
– O(N³)
• Careful implementation using priority queue can reduce time to O(N² log N)
– Still too expensive for really big datasets that do not fit in memory
The k Means Algorithm
Best suited for:
Clustering data into groups based on feature similarity.
Exploratory data analysis to identify hidden patterns or groupings.
Situations where the number of clusters is known or can be estimated.
Example: Customer segmentation for targeted marketing campaigns.
• Assumes Euclidean space/distance
• Start by picking k, the number of clusters
• Initialize clusters by picking one point per cluster
steps
• 1) For each point, place it in the cluster whose current centroid it is nearest
• 2) After all points are assigned, update the locations of centroids of the k clusters
• 3) Reassign all points to their closest centroid
– Sometimes moves points between clusters
• Repeat 2 and 3 until convergence
– Convergence: points don't move between clusters and centroids stabilize
– (Choosing k: the average distance to centroid falls rapidly until the right k, then changes little.)
(A numpy sketch of the loop follows below.)
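A compact numpy sketch of the k-means loop above, with random initial centroids (real implementations use a smarter initialization such as the "dispersed points" approach described next):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # pick one point per cluster
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids, axis=2)  # point-to-centroid distances
        labels = dists.argmin(axis=1)                               # assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # update
        if np.allclose(new_centroids, centroids):                   # convergence: centroids stable
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```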
picking initial k points
• Approach 1: Sampling
– Cluster a sample of the data using hierarchical clustering, to obtain k clusters
– Pick a point from each cluster (e.g. point closest to centroid)
– Sample fits in main memory
• Approach 2: Pick “dispersed” set of points
– Pick first point at random
– Pick the next point to be the one whose minimum distance from the selected points is as large as possible
– Repeat until we have k points
Complexity
• In each round, we have to examine each input point exactly once to find closest centroid
• Each round is O(kN) for N points, k clusters
• But the number of rounds to convergence can be very large!
BFR ALGORITHM
- BFR is an incremental clustering algorithm for large-scale, high-dimensional data; it extends k-means.
- It is mainly used for clusters that are approximately Gaussian (normally distributed).
BFR keeps data points in three sets:
- Discard Set (DS): points close enough to a cluster centroid; they are summarized by simple statistics (count, mean, variance) and then discarded.
- Compression Set (CS): points that are close to each other but not close to any existing cluster; they are compressed into mini-clusters represented by their statistics (count, mean, variance).
- Retained Set (RS): outliers, or points that do not yet fit any cluster; they are kept for further processing.
limitation:
– Clusters are normally distributed in each dimension
– Axes are fixed – ellipses at an angle are not OK
CURE (Clustering Using REpresentatives) ALGORITHM
- CURE is a clustering algorithm based on representative points.
- It represents each cluster with several representative points instead of a single centroid, so it can handle clusters with non-spherical shapes.
– Assumes a Euclidean distance
– Allows clusters to assume any shape
– Uses a collection of representative points to represent clusters
K-means vs hierarchical
Hierarchical clustering (e.g., agglomerative) builds a tree structure by repeatedly merging (or splitting) clusters. Merging is driven by the similarity of the points within clusters, so cohesive clusters end up merged together.
- Cohesion measure: the minimum inter-cluster distance (single linkage), maximum distance (complete linkage), or average distance (average linkage) can each be used, and the choice affects the result.
K-means clusters by minimizing the squared error within clusters (the distance of each point to its cluster centroid), which directly reflects within-cluster cohesion: each point is assigned to its nearest centroid so that clusters are as tight as possible.
- Cohesion measure: K-means minimizes the within-cluster sum of squared distances to the centroid.
Lecture 11 Recommendation Systems
Overview
types of recommendations
• Editorial and hand curated
– List of favorites
– Lists of “essential” items
• Simple aggregates
– Top 10, Most Popular, Recent Uploads
• Tailored to individual users
– Amazon, Netflix, TikTok…
Formal Model**
• X = set of Customers
• S = set of Items
Utility function u: X × S → R
– R = set of ratings
– R is a totally ordered set
– e.g., 0-5 stars, real number in [0,1]
problems:
• (1) Gathering “known” ratings for matrix
– How to collect the data in the utility matrix
• (2) Extrapolate unknown ratings from the known ones
– Mainly interested in high unknown ratings
• We are not interested in knowing what you don’t like but what you like
• (3) Evaluating extrapolation methods
– How to measure success/performance of recommendation methods
Three approaches to recommender systems↓
Content-based Recommender Systems
Content-based filtering focuses on the attributes of the items themselves (e.g., genres, tags) and the user's past preferences for specific item types.
Main idea: Recommend items to customer x similar to previous items rated highly by x
e.g
• Movie recommendations
– Recommend movies with same actor(s), director, genre, …
• Websites, blogs, news
– Recommend other sites with “similar” content
Item Profiles
For each item, create an item profile
• Profile is a set (vector) of features
– Movies: author, title, actor, director,…
– Text: Set of “important” words in document
• How to pick important features?
– Usual heuristic from text mining is TF-IDF (Term frequency * Inverse Doc Frequency)
• Term … Feature
• Document … Item
TF-IDF identifies words that are important to a specific document but not frequent across all documents.
(See the TF-IDF section earlier in these notes.)
User Profiles
User profile possibilities:
– Weighted average of rated item profiles
– Variation: Normalize weights using average rating of user
– More sophisticated aggregations possible…
e.g1 Boolean Utility Matrix
• Items are movies, only feature is "Actor”
– Item profile: vector with 0 or 1 for each Actor
• Suppose user x has watched 5 movies
– 2 movies featuring actor A
– 3 movies featuring actor B
• User profile = mean of item profiles
– Feature A's weight = 2/5 = 0.4
– Feature B's weight =3/5=0.6
e.g2 Star Ratings
• Same example, 1-5 star ratings
– Actor A's movies rated 3 and 5
– Actor B's movies rated 1, 2 and 4
• Useful step: Normalize ratings by subtracting user's mean rating (3)
– Actor A's normalized ratings = 0, +2
Profile weight = (0 + 2)/2 = 1
– Actor B's normalized ratings = -2, -1, +1
Profile weight = (-2-1+1)/3 = -2/3
Predictions
Prediction heuristic
Pro & Con
Pro:
• +: No need for data on other users
– No cold-start or sparsity problems
• +: Able to recommend to users with unique tastes
• +: Able to recommend new & unpopular items
– No first-rater problem
• +: Able to provide explanations
– Can provide explanations of recommended items by listing content-features that caused an item to be recommended
Con:
• –: Finding the appropriate features is hard
– E.g., images, movies, music
• –: Overspecialization
– Never recommends items outside user’s content profile
– People might have multiple interests
– Unable to exploit quality judgments of other users
• –: Recommendations for new users
– How to build a user profile?
Collaborative Filtering
Collaborative filtering focuses on the relationships and patterns in user interactions (e.g., who liked what) and doesn’t require information about the content of the items.
Recommend based on users similar to you.
User-user collaborative filtering
• Consider user x
• Find set N of other users whose ratings are “similar” to x’s ratings
• Estimate x’s ratings based on ratings of users in N
FINDING “SIMILAR” USERS
• Consider users x and y with rating vectors rx and ry
• We need a similarity metric sim(x, y)
• Capture intuition that sim(A,B) > sim(A,C)
Jaccard Similarity
sim(A,B) = |rA ∩ rB| / |rA ∪ rB|
sim(A,B) = 1/5
sim(A,C) = 2/4
*Problem: ignores the rating values! (It simply cannot make use of the actual scores.)
cosine similarity
• sim(A, B) = cos(rA, rB)
sim(A, B) = 4*5/√[(4^2+5^2+1^2)*(5^2+5^2+4^2) ]=0.38
sim(A, C) = 0.32
sim(A, B) > sim (A, C), but not by much
Problem: treats missing ratings as negative
centered cosine similarity
2/3 = 4 − 10/3, 5/3 = 5 − 10/3, ... in short, subtract the user's mean rating from each of that user's ratings
sim(A, B) = cos(rA, rB) = 0.09; sim(A, C) = -0.56
sim(A,B) > sim(A,C)
Captures intuition better
Missing ratings treated as "average"
Handles "tough raters" and "easy raters"
Also known as Pearson Correlation Coefficient
Rating Predictions
Item-item collaborative filtering
– For item i, find other similar items
– Estimate rating for item i based on ratings for similar items
– Can use same similarity metrics and prediction functions as in user-user model
In practice, it has been observed that item-item often works better than user-user because Items are simpler, users have multiple tastes
e.g item-item
Here we use Pearson correlation as similarity:
Approach: fix the set of users (the columns) and compare the item rows.
1) Subtract the mean rating mᵢ of each movie i from that movie's ratings
| Item-Item | subtract each item's (movie's) mean rating from that item's row | removes the per-item baseline bias |
| User-User | subtract each user's mean rating from that user's row | removes the per-user rating bias |
m1 = (1+3+5+5+4)/5 = 3.6
row 1: subtract 3.6 from every rating = [−2.6, NaN, −0.6, NaN, NaN, 1.4, NaN, NaN, 1.4, NaN, 0.4, NaN]
2) Compute cosine similarities between rows(items)
S(r1,r3) = ((−2.6)(−1) + (1.4)(1) + (0.4)(2)) / (√11.2 · √12) = 4.8/11.59 ≈ 0.41
S(r1,r6) = 0.59
3) Predict by taking weighted average:
i.e., weight the ratings of the similar items by their similarity to the target item
rating(1, 5) = (0.41·2 + 0.59·3) / (0.41 + 0.59) = 2.6
Here 2 and 3 are the ratings that the same user gave to the two similar items (see the numpy sketch below).
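A numpy sketch of item-item prediction: center each item row by its mean over rated entries, compute cosine similarity between the target item and the items the user has rated, then take the similarity-weighted average of those ratings. NaN marks missing ratings; the tiny matrix is an assumption, not the full slide example:

```python
import numpy as np

R = np.array([[1.0,    np.nan, 3.0, np.nan, 5.0],    # rows = items, columns = users
              [2.0,    4.0,    1.0, 4.0,    np.nan],
              [np.nan, 3.0,    3.0, 5.0,    4.0]])

def centered(row):
    out = row - np.nanmean(row)     # subtract the item's mean over its rated entries
    return np.nan_to_num(out)       # treat missing ratings as 0 (= "average")

def item_sim(i, j):
    a, b = centered(R[i]), centered(R[j])
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(user, item):
    rated = [j for j in range(len(R)) if j != item and not np.isnan(R[j, user])]
    sims = np.array([item_sim(item, j) for j in rated])
    ratings = np.array([R[j, user] for j in rated])
    return sims @ ratings / np.abs(sims).sum()   # similarity-weighted average

print(predict(user=1, item=0))   # predicted rating of user 1 for item 0
```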
Pro & Con
• + Works for any kind of item
– No feature selection needed
• - Cold Start:
– Need enough users in the system to find a match
• - Sparsity:
– The user/ratings matrix is sparse
– Hard to find users that have rated the same items
• - First rater:
– Cannot recommend an item that has not been previously rated
– New items, Esoteric items
• - Popularity bias:
– Cannot recommend items to someone with unique taste
– Tends to recommend popular items
Evaluating Recommendation Systems
Evaluating predictions
PROBLEMS WITH ERROR MEASURES
• Narrow focus on accuracy sometimes
misses the point
– Prediction Diversity
– Prediction Context
– Order of predictions
• In practice, we care only to predict high ratings:
– RMSE might penalize a method that does well for high ratings and badly for others
– Alternative: precision at top k Percentage of predictions in the user's top k withheld ratings