【Intro】Introduction to the Cora dataset

https://graphsandnetworks.com/the-cora-dataset/

Graph Convolutional Network (GCN) on the CORA citation dataset — StellarGraph 1.0.0rc1 documentation

pytorch-GAT/The Annotated GAT (Cora).ipynb at main · gordicaleksa/pytorch-GAT · GitHub

The Cora dataset

The Cora dataset consists of 2708 scientific publications classified into 7 classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding dictionary word. The dictionary consists of 1433 unique words.

In graph learning, this dataset plays the role that MNIST plays elsewhere.

import pandas as pd

node_df = pd.read_csv('./data/nodes.csv')
node_df.head()
   Unnamed: 0   nodeId  labels  subject                 features
0           0    31336  Paper   Neural_Networks         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1           1  1061127  Paper   Rule_Learning           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
2           2  1106406  Paper   Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3           3    13195  Paper   Reinforcement_Learning  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4           4    37879  Paper   Probabilistic_Methods   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
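
One caveat (my addition): when read back with read_csv, each entry of the features column is the string "[0, 0, ...]" rather than a list, so it has to be parsed before use. A small sketch, assuming the bracketed-list format shown above:

import ast
import numpy as np

# each row of the 'features' column is the string "[0, 0, 1, ...]" -> parse it back into a list
features = np.array([ast.literal_eval(row) for row in node_df['features']])
print(features.shape)  # (2708, 1433) if the full csv is loaded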
edge_df = pd.read_csv('./data/edges.csv')
edge_df.head()
   Unnamed: 0  sourceNodeId  targetNodeId  relationshipType
0           0            35          1033  CITES
1           1            35        103482  CITES
2           2            35        103515  CITES
3           3            35       1050679  CITES
4           4            35       1103960  CITES
edge_df = pd.read_csv('./data/edges.csv', names=["target", "source"])
edge_df["label"] = "cites"
edge_df.sample(frac=0.5).head(5)
                  target  source  label
563.0   2354     1130539  CITES   cites
2766.0  3506      132083  CITES   cites
4040.0  1107808   116512  CITES   cites
134.0   594543        35  CITES   cites
1023.0  4584      124064  CITES   cites
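
The odd-looking output above is most likely a pandas quirk: edges.csv has four columns but names= supplies only two, and in that case pandas treats the extra leading columns as the index, so the relationshipType values ("CITES") end up under the source column. A sketch of a cleaner read (my addition), using the column names from the csv header shown earlier:

import pandas as pd

# pull just the two id columns and rename them; 'label' mirrors the relationshipType column
edge_df = pd.read_csv('./data/edges.csv', usecols=['sourceNodeId', 'targetNodeId'])
edge_df = edge_df.rename(columns={'sourceNodeId': 'source', 'targetNodeId': 'target'})
edge_df['label'] = 'cites'
print(edge_df.head())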

https://graphsandnetworks.com/the-cora-dataset/

Download the data following the guide above.

https://github.com/gordicaleksa/pytorch-GAT/blob/main/The%20Annotated%20GAT%20(Cora).ipynb

It turns out that combining the idea of attention with the already existing Graph Convolutional Networks (GCNs) was a great move 🤓 - GAT is the second most cited paper in the GNN literature (as of the time that notebook was written).

The whole idea comes from CNNs. Convolutional neural networks solved all sorts of computer vision tasks and kicked off a huge wave of enthusiasm for deep learning, so some people decided to carry the idea over to graphs. The basic problem: while images lie on a regular grid (which you can also view as a graph) and therefore have a precise notion of ordering, graphs enjoy no such nice property - both the number of neighbors and the ordering of neighbors vary from node to node.

So how to define a kernel on a graph becomes a problem. We cannot simply fix the kernel size at 3×3, because a node may have very few neighbors or a great many.

Two main lines of thought are used here:

  • spectral methods (they all exploit the graph Laplacian eigenbasis in one way or another)
    reportedly rooted in graph signal processing - worth reading up on when there's time
  • spatial methods

A high-level explanation of spatial methods:

Assuming we have the neighbors' feature vectors, we can do the following (a minimal sketch follows after this list):

  1. Transform them somehow (maybe with a linear projection).
  2. Aggregate them somehow (maybe weighting them with attention coefficients -> GAT).
  3. Update the current node's feature vector by (somehow) combining the node's own (transformed) feature vector with the aggregated neighborhood representation.
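
To make the three steps concrete, here is a toy PyTorch sketch (my own illustration, not the notebook's GAT code): plain mean aggregation stands in for attention, and every name in it is made up.

import torch
import torch.nn as nn

class ToySpatialLayer(nn.Module):
    """Transform -> aggregate -> update, with mean aggregation standing in for attention."""
    def __init__(self, num_in_features, num_out_features):
        super().__init__()
        self.proj = nn.Linear(num_in_features, num_out_features)         # step 1: transform
        self.update = nn.Linear(2 * num_out_features, num_out_features)  # step 3: update

    def forward(self, node_features, adjacency_list_dict):
        h = self.proj(node_features)  # (N, F_out)
        new_h = []
        for node_id in range(len(node_features)):
            aggregated = h[adjacency_list_dict[node_id]].mean(dim=0)     # step 2: aggregate the neighbors
            combined = torch.cat([h[node_id], aggregated], dim=-1)       # own features + neighborhood
            new_h.append(self.update(combined))
        return torch.stack(new_h)  # (N, F_out)

layer = ToySpatialLayer(num_in_features=4, num_out_features=8)
x = torch.randn(5, 4)  # 5 nodes, 4 features each
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
print(layer(x, adj).shape)  # torch.Size([5, 8])

GAT replaces the mean here with learned attention weights over exactly these neighbor vectors.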

Imports & reading the data

# I always like to structure my imports into Python's native libs,
# stuff I installed via conda/pip and local file imports (but we don't have those here)
import pickle

# Visualization related imports
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig

# Main computation libraries
import scipy.sparse as sp
import numpy as np

# Deep learning related imports
import torch

"""Contains constants needed for data loading and visualization."""
import os
import enum

# Supported datasets - only Cora in this notebook
class DatasetType(enum.Enum):
    CORA = 0

# Networkx is not precisely made with drawing as its main feature but I experimented with it a bit
class GraphVisualizationTool(enum.Enum):
    NETWORKX = 0
    IGRAPH = 1

# We'll be dumping and reading the data from this directory
DATA_DIR_PATH = os.path.join(os.getcwd(), 'data')
CORA_PATH = os.path.join(DATA_DIR_PATH, 'cora')  # this is checked-in no need to make a directory

#
# Cora specific constants
#

# Thomas Kipf et al. first used this split in GCN paper and later Petar Veličković et al. in GAT paper
CORA_TRAIN_RANGE = [0, 140]  # we're using the first 140 nodes as the training nodes
CORA_VAL_RANGE = [140, 140+500]
CORA_TEST_RANGE = [1708, 1708+1000]
CORA_NUM_INPUT_FEATURES = 1433
CORA_NUM_CLASSES = 7

# Used whenever we need to visualize points from different classes (t-SNE, CORA visualization)
cora_label_to_color_map = {0: "red", 1: "blue", 2: "green", 3: "orange", 4: "yellow", 5: "pink", 6: "gray"}

The data lives in a cora folder inside the data folder next to the notebook, e.g.: GNN_test_project/data/cora/node_features.csr

The first 140 nodes are used as training nodes, 500 nodes for validation, and 1000 nodes for testing.

There are 1433 input features per node, and the nodes fall into 7 classes.

To make visualization easier, each class is assigned its own color here.

Getting to know the dataset

Transductive - we assume a single graph (e.g. Cora) and split some of the nodes (not whole graphs) into train/validation/test sets. At training time, only the labels of the training nodes are used. However, during the forward pass, by the very nature of spatial GNNs, feature vectors are aggregated from neighbors, and some of those neighbors may belong to the validation or even the test set! The key point: we are not using their label information, but we are using their structural information and their features.

Inductive - if you come from a computer vision or NLP background, this will feel more familiar. There is a set of training graphs, a separate set of validation graphs, and of course a separate set of test graphs.
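
Back to the transductive setup used here: the snippet below is my own sketch (random stand-in data, a plain linear layer instead of a real GNN) just to show where the split enters - the forward pass sees every node, the loss sees only the 140 training ones.

import torch
import torch.nn.functional as F

N, FIN, C = 2708, 1433, 7                # Cora sizes from the constants above
node_features = torch.randn(N, FIN)      # stand-in for the real features
node_labels = torch.randint(0, C, (N,))  # stand-in for the real labels
train_indices = torch.arange(0, 140)     # the train split from the GCN/GAT papers

model = torch.nn.Linear(FIN, C)          # stand-in for a real GNN
logits = model(node_features)            # forward pass over ALL nodes
loss = F.cross_entropy(logits.index_select(0, train_indices),
                       node_labels.index_select(0, train_indices))
print(loss.item())                       # loss computed on the training nodes only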

pickle.load(file)

pickle — Python object serialization — Python 3.12.3 documentation

pickle.dump和pickle.load-CSDN博客

Reads from file and reconstructs the original Python object.

with open(path, 'rb') as file

python - What's the difference between open('filepath', 'rb') and open(rb'filepath')? There's some encoding difference between them - Stack Overflow

https://www.quora.com/What-does-opening-a-file-rb-in-Python-mean

請問with open() as f 的語法意思為何? open()內參數何時使用'wb'、'rb'? - Cupoy

The r marks a string literal as raw (which does nothing in this particular case), while b marks it as binary, meaning the resulting object is a bytes object rather than a str object.

In short: read the file in, and get the contents back as bytes.
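
A quick toy demo of the difference (my addition; it writes a throwaway file purely for illustration):

with open('demo.txt', 'w') as f:
    f.write('hi')

with open('demo.txt', 'r') as f:
    print(type(f.read()))  # <class 'str'> - text mode decodes to str

with open('demo.txt', 'rb') as f:
    print(type(f.read()))  # <class 'bytes'> - binary mode returns raw bytes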

with open(path, 'wb') as file

Similarly, this one opens the file for writing, in binary mode.

pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)

Used to serialize a Python object and save it to a file.

loading/saving Pickle files:

# First let's define these simple functions for loading/saving Pickle files - we need them for Cora
# All Cora data is stored as pickle
def pickle_read(path):
    with open(path, 'rb') as file:
        data = pickle.load(file)
    return data

def pickle_save(path, data):
    with open(path, 'wb') as file:
        pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)

Loading the data

node_features_csr = pickle_read(os.path.join(CORA_PATH, 'node_features.csr'))
node_labels_npy = pickle_read(os.path.join(CORA_PATH, 'node_labels.npy'))
adjacency_list_dict = pickle_read(os.path.join(CORA_PATH, 'adjacency_list.dict'))
This gives us three objects:
1. the node features
2. the node labels
3. the adjacency list (for each of the N nodes: a list of all its neighbor nodes)
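
A quick way to poke at the three objects (my addition; the expected sizes follow from the Cora constants above):

print(type(node_features_csr), node_features_csr.shape)  # scipy CSR matrix, (2708, 1433)
print(type(node_labels_npy), len(node_labels_npy))       # numpy array, one label per node (2708)
print(len(adjacency_list_dict))                          # one dict entry per node (2708)
print(adjacency_list_dict[0])                            # neighbor ids of node 0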
load_graph_data
# We'll pass the training config dictionary a bit later
def load_graph_data(training_config, device):
    dataset_name = training_config['dataset_name'].lower()
    should_visualize = training_config['should_visualize']

    if dataset_name == DatasetType.CORA.name.lower():
        # shape = (N, FIN), where N is the number of nodes and FIN is the number of input features
        node_features_csr = pickle_read(os.path.join(CORA_PATH, 'node_features.csr'))
        # shape = (N, 1)
        node_labels_npy = pickle_read(os.path.join(CORA_PATH, 'node_labels.npy'))
        # shape = (N, number of neighboring nodes) <- this is a dictionary not a matrix!
        adjacency_list_dict = pickle_read(os.path.join(CORA_PATH, 'adjacency_list.dict'))

        # Normalize the features (helps with training)
        node_features_csr = normalize_features_sparse(node_features_csr)
        num_of_nodes = len(node_labels_npy)

        # shape = (2, E), where E is the number of edges, and 2 for source and target nodes. Basically edge index
        # contains tuples of the format S->T, e.g. 0->3 means that node with id 0 points to a node with id 3.
        topology = build_edge_index(adjacency_list_dict, num_of_nodes, add_self_edges=True)
        # Note: topology is just a fancy way of naming the graph structure data
        # (aside from edge index it could be in the form of an adjacency matrix)

        if should_visualize:  # network analysis and graph drawing
            plot_in_out_degree_distributions(topology, num_of_nodes, dataset_name)  # we'll define these in a second
            visualize_graph(topology, node_labels_npy, dataset_name)

        # Convert to dense PyTorch tensors
        # Needs to be long int type because later functions like PyTorch's index_select expect it
        topology = torch.tensor(topology, dtype=torch.long, device=device)
        node_labels = torch.tensor(node_labels_npy, dtype=torch.long, device=device)  # Cross entropy expects a long int
        node_features = torch.tensor(node_features_csr.todense(), device=device)

        # Indices that help us extract nodes that belong to the train/val and test splits
        train_indices = torch.arange(CORA_TRAIN_RANGE[0], CORA_TRAIN_RANGE[1], dtype=torch.long, device=device)
        val_indices = torch.arange(CORA_VAL_RANGE[0], CORA_VAL_RANGE[1], dtype=torch.long, device=device)
        test_indices = torch.arange(CORA_TEST_RANGE[0], CORA_TEST_RANGE[1], dtype=torch.long, device=device)

        return node_features, node_labels, topology, train_indices, val_indices, test_indices
    else:
        raise Exception(f'{dataset_name} not yet supported.')

This reads the node feature data and normalizes the features, then reads the adjacency list to obtain the edge connectivity.

The node labels and the edge index built from the adjacency list are both of type numpy.ndarray (the node features start out as a scipy sparse CSR matrix).

normalize features sparse

def normalize_features_sparse(node_features_sparse):
    assert sp.issparse(node_features_sparse), f'Expected a sparse matrix, got {node_features_sparse}.'

    # Instead of dividing (like in normalize_features_dense()) we do multiplication with inverse sum of features.
    # Modern hardware (GPUs, TPUs, ASICs) is optimized for fast matrix multiplications! ^^ (* >> /)

    # shape = (N, FIN) -> (N, 1), where N number of nodes and FIN number of input features
    node_features_sum = np.array(node_features_sparse.sum(-1))  # sum features for every node feature vector

    # Make an inverse (remember * by 1/x is better (faster) than / by x)
    # shape = (N, 1) -> (N)
    node_features_inv_sum = np.power(node_features_sum, -1).squeeze()

    # Again certain sums will be 0 so 1/0 will give us inf so we replace those by 1 which is a neutral element for mul
    node_features_inv_sum[np.isinf(node_features_inv_sum)] = 1.

    # Create a diagonal matrix whose values on the diagonal come from node_features_inv_sum
    diagonal_inv_features_sum_matrix = sp.diags(node_features_inv_sum)

    # We return the normalized features.
    return diagonal_inv_features_sum_matrix.dot(node_features_sparse)

This normalizes the features so that each node's feature vector sums to 1.

 node_features_sum = np.array(node_features_sparse.sum(-1))

scipy.sparse.csr_matrix.sum — SciPy v1.13.1 Manual

python - How to get sum of each row and sum of each column in Scipy sparse matrices (csr_matrix and csc_matrix)? - Stack Overflow

python对矩阵某行求和_python – 对scipy.sparse.csr_matrix中的行求和-CSDN博客

python - Convert Pandas dataframe to Sparse Numpy Matrix directly - Stack Overflow

Here node_features_sparse has the type scipy.sparse._csr.csr_matrix.

import pandas as pd

df = pd.DataFrame({
    'w_0': [1, 0, 1, 0, 1, 0, 1, 0],
    'w_1': [0, 0, 0, 0, 1, 0, 1, 0],
    'w_2': [1, 1, 1, 1, 1, 1, 1, 1],
    'w_4': [0, 1, 0, 1, 1, 0, 1, 1]
})
sp.csr_matrix(df.values).sum(-1)
"""
Output:
matrix([[2],
        [2],
        [2],
        [2],
        [4],
        [1],
        [4],
        [2]])
"""


import pandas as pd

df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
df = sp.csr_matrix(df.values)
df_sum = np.array(df.sum(-1))
df.toarray(), df_sum, np.power(df_sum, -1), np.power(df_sum, -1).squeeze()
"""
Output:
(array([[1., 0., 1., 0.],
        [0., 0., 1., 1.],
        [1., 0., 1., 0.],
        [0., 0., 1., 1.],
        [1., 1., 1., 1.],
        [0., 0., 1., 0.],
        [1., 1., 1., 1.],
        [0., 0., 1., 1.]]),
 array([[2.],
        [2.],
        [2.],
        [2.],
        [4.],
        [1.],
        [4.],
        [2.]]),
 array([[0.5 ],
        [0.5 ],
        [0.5 ],
        [0.5 ],
        [0.25],
        [1.  ],
        [0.25],
        [0.5 ]]),
 array([0.5 , 0.5 , 0.5 , 0.5 , 0.25, 1.  , 0.25, 0.5 ]))
"""

Note that the values must be floats here: np.power with a negative exponent raises an error on integer arrays.

import pandas as pd

df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
df = sp.csr_matrix(df.values)
df_sum = np.array(df.sum(-1))
df_inv_sum = np.power(df_sum, -1).squeeze()
# some sums may be 0, and 1/0 gives inf, so we replace those with 1
df_inv_sum[np.isinf(df_inv_sum)] = 1.
df_inv_sum, sp.diags(df_inv_sum).toarray()
"""
Output:
(array([0.5 , 0.5 , 0.5 , 0.5 , 0.25, 1.  , 0.25, 0.5 ]),
 array([[0.5 , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.5 , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.5 , 0.  , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.5 , 0.  , 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.25, 0.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.25, 0.  ],
        [0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.5 ]]))
"""
import pandas as pd

df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
df = sp.csr_matrix(df.values)
df_sum = np.array(df.sum(-1))
df_inv_sum = np.power(df_sum, -1).squeeze()
# some sums may be 0, and 1/0 gives inf, so we replace those with 1
df_inv_sum[np.isinf(df_inv_sum)] = 1.
diag_df_sum_matrix = sp.diags(df_inv_sum)
diag_df_sum_matrix.dot(df).toarray()
"""
Output:
array([[0.5 , 0.  , 0.5 , 0.  ],
       [0.  , 0.  , 0.5 , 0.5 ],
       [0.5 , 0.  , 0.5 , 0.  ],
       [0.  , 0.  , 0.5 , 0.5 ],
       [0.25, 0.25, 0.25, 0.25],
       [0.  , 0.  , 1.  , 0.  ],
       [0.25, 0.25, 0.25, 0.25],
       [0.  , 0.  , 0.5 , 0.5 ]])
"""

The final step is a matrix multiplication: the diagonal matrix of inverse row sums times the feature matrix, which amounts to dividing every row by its sum.
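
A quick check (my addition) that the diagonal trick really is just row-wise division:

# equivalent: multiply by diag(1/row_sum) on the left, or divide each row by its sum
print(np.allclose(diag_df_sum_matrix.dot(df).toarray(), df.toarray() / df_sum))  # True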

build edge index

def build_edge_index(adjacency_list_dict, num_of_nodes, add_self_edges=True):
    source_nodes_ids, target_nodes_ids = [], []
    seen_edges = set()

    for src_node, neighboring_nodes in adjacency_list_dict.items():
        for trg_node in neighboring_nodes:
            # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
            if (src_node, trg_node) not in seen_edges:  # it'd be easy to explicitly remove self-edges (Cora has none..)
                source_nodes_ids.append(src_node)
                target_nodes_ids.append(trg_node)
                seen_edges.add((src_node, trg_node))

    if add_self_edges:
        source_nodes_ids.extend(np.arange(num_of_nodes))
        target_nodes_ids.extend(np.arange(num_of_nodes))

    # shape = (2, E), where E is the number of edges in the graph
    edge_index = np.row_stack((source_nodes_ids, target_nodes_ids))

    return edge_index

This records the edge information.

adj_list_dict = {
    0: [1, 2, 5],
    1: [0, 2, 3, 4],
    2: [0, 1, 5],
    3: [1, 4, 5],
    4: [1, 3, 6],
    5: [0, 2, 3, 6, 7],
    6: [4, 5, 7],
    7: [5, 6]
}
num_of_nodes = 8

source_nodes_ids, target_nodes_ids = [], []
seen_edges = set()
for src_node, neighboring_nodes in adj_list_dict.items():
    for trg_node in neighboring_nodes:
        # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
        if (src_node, trg_node) not in seen_edges:
            source_nodes_ids.append(src_node)
            target_nodes_ids.append(trg_node)
            seen_edges.add((src_node, trg_node))
print(pd.DataFrame([source_nodes_ids, target_nodes_ids]))

source_nodes_ids.extend(np.arange(num_of_nodes))
target_nodes_ids.extend(np.arange(num_of_nodes))
print(pd.DataFrame([source_nodes_ids, target_nodes_ids]))
"""
Output:
    0   1   2   3   4   5   6   7   8   9   ...  16  17  18  19  20  21  22  23  24  25
0   0   0   0   1   1   1   1   2   2   2   ...   5   5   5   5   5   6   6   6   7   7
1   1   2   5   0   2   3   4   0   1   5   ...   0   2   3   6   7   4   5   7   5   6

[2 rows x 26 columns]

    0   1   2   3   4   5   6   7   8   9   ...  24  25  26  27  28  29  30  31  32  33
0   0   0   0   1   1   1   1   2   2   2   ...   7   7   0   1   2   3   4   5   6   7
1   1   2   5   0   2   3   4   0   1   5   ...   5   6   0   1   2   3   4   5   6   7

[2 rows x 34 columns]
"""

As you can see, what this does is fill in the node connectivity information and add self-loops (an edge from each node to itself, so that a node's own features also take part in the aggregation).

adj_list_dict = {
    0: [1, 2, 5],
    1: [0, 2, 3, 4],
    2: [0, 1, 5],
    3: [1, 4, 5],
    4: [1, 3, 6],
    5: [0, 2, 3, 6, 7],
    6: [4, 5, 7],
    7: [5, 6]
}
num_of_nodes = 8

source_nodes_ids, target_nodes_ids = [], []
seen_edges = set()
for src_node, neighboring_nodes in adj_list_dict.items():
    for trg_node in neighboring_nodes:
        if (src_node, trg_node) not in seen_edges:
            source_nodes_ids.append(src_node)
            target_nodes_ids.append(trg_node)
            seen_edges.add((src_node, trg_node))

source_nodes_ids.extend(np.arange(num_of_nodes))
target_nodes_ids.extend(np.arange(num_of_nodes))
np.row_stack((source_nodes_ids, target_nodes_ids))
"""
Output:
array([[0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6,
        6, 6, 7, 7, 0, 1, 2, 3, 4, 5, 6, 7],
       [1, 2, 5, 0, 2, 3, 4, 0, 1, 5, 1, 4, 5, 1, 3, 6, 0, 2, 3, 6, 7, 4,
        5, 7, 5, 6, 0, 1, 2, 3, 4, 5, 6, 7]])
"""

Finally, the source and target node lists are stacked together into a single (2, E) array.

To keep the Python interpreter from complaining, we first define dummy versions of the plotting functions.

# Let's just define dummy visualization functions for now - just to stop Python interpreter from complaining!
# We'll define them in a moment, properly, I swear.

def plot_in_out_degree_distributions():
    pass

def visualize_graph():
    pass

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # checking whether you have a GPU

config = {
    'dataset_name': DatasetType.CORA.name,
    'should_visualize': False
}

node_features, node_labels, edge_index, train_indices, val_indices, test_indices = load_graph_data(config, device)

print(node_features.shape, node_features.dtype)
print(node_labels.shape, node_labels.dtype)
print(edge_index.shape, edge_index.dtype)
print(train_indices.shape, train_indices.dtype)
print(val_indices.shape, val_indices.dtype)
print(test_indices.shape, test_indices.dtype)

Visualizing the dataset

We count how many times each node appears as a source node and how many times as a target node.

def plot_in_out_degree_distributions(edge_index, num_of_nodes, dataset_name):
    """
    Note: It would be easy to do various kinds of powerful network analysis using igraph/networkx, etc.
    I chose to explicitly calculate only the node degree statistics here, but you can go much further if needed and
    calculate the graph diameter, number of triangles and many other concepts from the network analysis field.
    """
    if isinstance(edge_index, torch.Tensor):
        edge_index = edge_index.cpu().numpy()

    assert isinstance(edge_index, np.ndarray), f'Expected NumPy array got {type(edge_index)}.'

    # Store each node's input and output degree (they're the same for undirected graphs such as Cora)
    in_degrees = np.zeros(num_of_nodes, dtype=np.int)
    out_degrees = np.zeros(num_of_nodes, dtype=np.int)

    # Edge index shape = (2, E), the first row contains the source nodes, the second one target/sink nodes
    # Note on terminology: source nodes point to target/sink nodes
    num_of_edges = edge_index.shape[1]
    for cnt in range(num_of_edges):
        source_node_id = edge_index[0, cnt]
        target_node_id = edge_index[1, cnt]

        out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
        in_degrees[target_node_id] += 1  # similarly here

    hist = np.zeros(np.max(out_degrees) + 1)
    for out_degree in out_degrees:
        hist[out_degree] += 1

    fig = plt.figure(figsize=(12, 8), dpi=100)  # otherwise plots are really small in Jupyter Notebook
    fig.subplots_adjust(hspace=0.6)

    plt.subplot(311)
    plt.plot(in_degrees, color='red')
    plt.xlabel('node id'); plt.ylabel('in-degree count'); plt.title('Input degree for different node ids')

    plt.subplot(312)
    plt.plot(out_degrees, color='green')
    plt.xlabel('node id'); plt.ylabel('out-degree count'); plt.title('Out degree for different node ids')

    plt.subplot(313)
    plt.plot(hist, color='blue')
    plt.xlabel('node degree')
    plt.ylabel('# nodes for a given out-degree')
    plt.title(f'Node out-degree distribution for {dataset_name} dataset')
    plt.xticks(np.arange(0, len(hist), 5.0))
    plt.grid(True)
    plt.show()
in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
out_degrees = np.zeros(num_of_nodes, dtype=np.int_)
in_degrees, out_degrees
"""
Output:
(array([0, 0, 0, 0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0, 0, 0, 0]))
"""

One small tweak may be needed here: np.int was removed in newer NumPy versions, so np.int_ is used instead (under the environment the author provides, np.int works as-is).

in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
out_degrees = np.zeros(num_of_nodes, dtype=np.int_)

num_of_edges = edge_index.shape[1]
for cnt in range(num_of_edges):
    source_node_id = edge_index[0, cnt]
    target_node_id = edge_index[1, cnt]

    out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
    in_degrees[target_node_id] += 1  # similarly here

in_degrees, out_degrees
"""
Output:
(array([4, 5, 4, 4, 4, 6, 4, 3]), array([4, 5, 4, 4, 4, 6, 4, 3]))
"""

The rest is just plotting.

The resulting Cora plots show that:

  • The top two plots are identical, because we treat Cora as an undirected graph (even though it should naturally be modeled as a directed one).
  • Certain nodes have a huge number of edges (the peaks in the middle), but most nodes have far fewer.
  • The third plot visualizes this nicely as a histogram - most nodes have only 2-5 edges (hence the peak on the far left).
"""
Check out this blog for available graph visualization tools:https://towardsdatascience.com/large-graph-visualization-tools-and-approaches-2b8758a1cd59Basically depending on how big your graph is there may be better drawing tools than igraph.Note: I unfortunatelly had to flatten this function since igraph is having some problems with Jupyter Notebook,
we'll only call it here so it's fine!"""dataset_name = config['dataset_name']
visualization_tool=GraphVisualizationTool.IGRAPHif isinstance(edge_index, torch.Tensor):edge_index_np = edge_index.cpu().numpy()if isinstance(node_labels, torch.Tensor):node_labels_np = node_labels.cpu().numpy()num_of_nodes = len(node_labels_np)
edge_index_tuples = list(zip(edge_index_np[0, :], edge_index_np[1, :]))  # igraph requires this format# Construct the igraph graph
ig_graph = ig.Graph()
ig_graph.add_vertices(num_of_nodes)
ig_graph.add_edges(edge_index_tuples)# Prepare the visualization settings dictionary
visual_style = {}# Defines the size of the plot and margins
# go berserk here try (3000, 3000) it looks amazing in Jupyter!!! (you'll have to adjust the vertex_size though!)
visual_style["bbox"] = (700, 700)
visual_style["margin"] = 5# I've chosen the edge thickness such that it's proportional to the number of shortest paths (geodesics)
# that go through a certain edge in our graph (edge_betweenness function, a simple ad hoc heuristic)# line1: I use log otherwise some edges will be too thick and others not visible at all
# edge_betweeness returns < 1 for certain edges that's why I use clip as log would be negative for those edges
# line2: Normalize so that the thickest edge is 1 otherwise edges appear too thick on the chart
# line3: The idea here is to make the strongest edge stay stronger than others, 6 just worked, don't dwell on itedge_weights_raw = np.clip(np.log(np.asarray(ig_graph.edge_betweenness())+1e-16), a_min=0, a_max=None)
edge_weights_raw_normalized = edge_weights_raw / np.max(edge_weights_raw)
edge_weights = [w**6 for w in edge_weights_raw_normalized]
visual_style["edge_width"] = edge_weights# A simple heuristic for vertex size. Size ~ (degree / 4) (it gave nice results I tried log and sqrt as well)
visual_style["vertex_size"] = [deg / 4 for deg in ig_graph.degree()]# This is the only part that's Cora specific as Cora has 7 labels
if dataset_name.lower() == DatasetType.CORA.name.lower():visual_style["vertex_color"] = [cora_label_to_color_map[label] for label in node_labels_np]
else:print('Feel free to add custom color scheme for your specific dataset. Using igraph default coloring.')# Set the layout - the way the graph is presented on a 2D chart. Graph drawing is a subfield for itself!
# I used "Kamada Kawai" a force-directed method, this family of methods are based on physical system simulation.
# (layout_drl also gave nice results for Cora)
visual_style["layout"] = ig_graph.layout_kamada_kawai()print('Plotting results ... (it may take couple of seconds).')
ig.plot(ig_graph, **visual_style)# This website has got some awesome visualizations check it out:
# http://networkrepository.com/graphvis.php?d=./data/gsm50/labeled/cora.edges

----------------------------------------------------------------------------------------------

OK, that wraps up our first look at the Cora dataset. To close, here is the full visualization workflow run on the toy dataset used earlier to explain the code:

# Visualization related imports
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig

# Main computation libraries
import scipy.sparse as sp
import numpy as np
import pandas as pd
import torch  # needed for the igraph part below

# Node features
node_df = pd.DataFrame({
    'w_0': [1., 0., 1., 0., 1., 0., 1., 0.],
    'w_1': [0., 0., 0., 0., 1., 0., 1., 0.],
    'w_2': [1., 1., 1., 1., 1., 1., 1., 1.],
    'w_4': [0., 1., 0., 1., 1., 0., 1., 1.]
})
num_of_nodes = 8

node_df = sp.csr_matrix(node_df.values)
node_df_sum = np.array(node_df.sum(-1))
node_df_inv_sum = np.power(node_df_sum, -1).squeeze()
# some sums may be 0, and 1/0 gives inf, so we replace those with 1
node_df_inv_sum[np.isinf(node_df_inv_sum)] = 1.
diag_node_df_sum_matrix = sp.diags(node_df_inv_sum)
node_features = diag_node_df_sum_matrix.dot(node_df).toarray()  # normalized features (not actually used below)

# Edges
adj_list_dict = {
    0: [1, 2, 5],
    1: [0, 2, 3, 4],
    2: [0, 1, 5],
    3: [1, 4, 5],
    4: [1, 3, 6],
    5: [0, 2, 3, 6, 7],
    6: [4, 5, 7],
    7: [5, 6]
}

source_nodes_ids, target_nodes_ids = [], []
seen_edges = set()
for src_node, neighboring_nodes in adj_list_dict.items():
    for trg_node in neighboring_nodes:
        # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
        if (src_node, trg_node) not in seen_edges:
            source_nodes_ids.append(src_node)
            target_nodes_ids.append(trg_node)
            seen_edges.add((src_node, trg_node))

source_nodes_ids.extend(np.arange(num_of_nodes))
target_nodes_ids.extend(np.arange(num_of_nodes))
edge_index = np.row_stack((source_nodes_ids, target_nodes_ids))

in_degrees = np.zeros(num_of_nodes, dtype=np.int_)
out_degrees = np.zeros(num_of_nodes, dtype=np.int_)

num_of_edges = edge_index.shape[1]
for cnt in range(num_of_edges):
    source_node_id = edge_index[0, cnt]
    target_node_id = edge_index[1, cnt]
    out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
    in_degrees[target_node_id] += 1   # similarly here

hist = np.zeros(np.max(out_degrees) + 1)
for out_degree in out_degrees:
    hist[out_degree] += 1

fig = plt.figure(figsize=(12, 8), dpi=100)  # otherwise plots are really small in Jupyter Notebook
fig.subplots_adjust(hspace=0.6)

plt.subplot(311)
plt.plot(in_degrees, color='red')
plt.xlabel('node id'); plt.ylabel('in-degree count'); plt.title('Input degree for different node ids')

plt.subplot(312)
plt.plot(out_degrees, color='green')
plt.xlabel('node id'); plt.ylabel('out-degree count'); plt.title('Out degree for different node ids')

plt.subplot(313)
plt.plot(hist, color='blue')
plt.xlabel('node degree')
plt.ylabel('# nodes for a given out-degree')
plt.title('Node out-degree distribution for the toy dataset')
plt.xticks(np.arange(0, len(hist), 5.0))
plt.grid(True)
plt.show()

label_to_color_map = {0: "red", 1: "blue"}
edge_index = torch.from_numpy(edge_index)
node_labels = np.array([0, 0, 0, 1, 1, 0, 1, 1])
node_labels = torch.from_numpy(node_labels)
edge_index_np = edge_index.cpu().numpy()
node_labels_np = node_labels.cpu().numpy()
print(type(node_labels), len(node_labels))
num_of_nodes = len(node_labels_np)
edge_index_tuples = list(zip(edge_index_np[0, :], edge_index_np[1, :]))

# Construct the igraph graph
ig_graph = ig.Graph()
ig_graph.add_vertices(num_of_nodes)
ig_graph.add_edges(edge_index_tuples)

# Prepare the visualization settings dictionary
visual_style = {}

# Defines the size of the plot and margins
visual_style["bbox"] = (400, 400)
visual_style["margin"] = 20

edge_weights_raw = np.clip(np.log(np.asarray(ig_graph.edge_betweenness()) + 1e-16), a_min=0, a_max=None)
edge_weights_raw_normalized = edge_weights_raw / np.max(edge_weights_raw)
edge_weights = [w**6 for w in edge_weights_raw_normalized]
visual_style["edge_width"] = edge_weights
visual_style["vertex_color"] = [label_to_color_map[label] for label in node_labels_np]
visual_style["layout"] = ig_graph.layout_kamada_kawai()
ig.plot(ig_graph, **visual_style)

