结巴直接分词
python -m jieba -d ' ' allTrain.txt > train_contents.txt
使用redis
cmd1 :redis-server.exe redis.windows.conf
cmd2:redis-cli.exe -h 127.0.0.1 -p 6379
scrapy-redis src- scrapy-redis copy- scrapy project
redis
keys * 列出
https://github.com/rmax/scrapy-redis
type jobbole:requests :类型
zrange jobbole:requests 0 1 :zset元素
scard jobbole:dupefilter :set元素数量
smembers jobbole:dupefilter :获得key
查看mysql文件夹位置
show global variables like "%datadir%"
打开 tensorflow summary 的目录 执行 tensorboard --logdir=C:\redis\logs
TensorBoard 0.1.6 at http://DESKTOP-FIPG2GH:6006 (Press CTRL+C to quit) 便可以在浏览器输入 localhost:6006 查看tensorflow 模型相关 graph HISTOGRAMS
jupyter
'sha1:f0147912cfac:fe72a5a54b1bb234881e4fdc5d04419d70dc4e58'
LINUX下批量修改文件夹下面的文件名
i=1; for x in *; do mv $x $i.扩展名; let i=i+1; done
删除文件夹及文件夹下所有内容
rm -rf folder
python 替换掉字符串中的换行符
str.replace('\n',' ')
RE处理数据
1 import re 2 import os 3 dir_list = [dirs for dirs in sorted(os.listdir()) if dirs.endswith('.json')] 4 print("JSON文件:{0}".format(len(dir_list))) 5 path = '../pubmedData/' 6 if not os.path.exists(path): 7 os.makedirs(path) 8 9 for file in dir_list: 10 print("正在处理:{0}".format(file)) 11 with open(file,'r') as f: 12 x = f.read() 13 cit_pubmed = re.findall('cit {(.*?)Pubmed-entry',x,re.DOTALL) 14 print("匹配到的总数:{0}".format(len(cit_pubmed))) 15 16 i = 0 17 j = 0 18 k = 0 19 set_title_list = [] 20 set_abstract_list = [] 21 set_issn_list = [] 22 issn_class = [] 23 for y in range(len(cit_pubmed)): 24 #title 25 title = re.findall('title {(.*?)authors {',cit_pubmed[y],re.DOTALL) 26 set_title_list.append(len(title)) 27 if len(title) == 2: 28 i += 1 29 title = re.findall('name "(.*?)."', title[0], re.DOTALL) 30 if len(title) == 1: 31 title = re.findall('name "(.*?)."', title[0], re.DOTALL) 32 i += 1 33 34 #issn 35 issn = re.findall('issn "(.*?)",',cit_pubmed[y], re.DOTALL) 36 if len(issn) == 1: 37 #abstract 38 abstract = re.findall('abstract "(.*?).",',cit_pubmed[y],re.DOTALL) 39 if len(abstract) == 1: 40 with open(path + issn[0] + '.txt','a') as f: 41 f.write(abstract[0].replace("\n", " ") + '\n') 42 j += 1 43 set_abstract_list.append(len(abstract)) 44 45 issn_class.append(issn[0]) 46 k += 1 47 set_issn_list.append(len(issn)) 48 49 set_title_list = set(set_title_list) 50 set_abstract_list = set(set_abstract_list) 51 set_issn_list = set(set_issn_list) 52 print("TITLE种类:{0},总数:{1}".format(set_title_list, i)) 53 print("ABSTRACT种类:{0},总数:{1}".format(set_abstract_list, j)) 54 print("ISSN种类:{0},总数:{1}".format(set_issn_list, k)) 55 print("ISSN_CLASS:{0}类".format(len(set(issn_class))))
numpy argsort()
1 import numpy as np 2 x=np.array([5,4,3,2,1]) 3 y = x.argsort() 4 #output array([4, 3, 2, 1, 0])
取出ndarray 中最大的五个数的index
x=np.array([[5,4,3,2,1,7,8,9],[1,2,3,4,5,9,8,6]]) y = map(lambda label: label.argsort()[-1:-6:-1], x) t = list() t.extend(y) #result [array([7, 6, 5, 0, 1]), array([5, 6, 7, 4, 3])]
numpy.hstack() horizontal 水平的 a = array([1,2,3]) b = array([4,5,6]) c = array([1,2,3,4,5,6])
numpy.vstack() vertical 垂直的 a = array([1,2,3]) b = array([4,5,6]) c = array([1,2,3],[4,5,6])
统计数组中出现次数最少的两个值
1 from collections import Counter 2 a = [1,2,3,4,2,3,4,5] 3 x = Counter(a).most_common()[-2:]
查看文件夹大小
du -h --max-depth=1 pubmedData
查看单个文件大小
ls -sh 1932-6203.txt
列出当前文件夹下前十个最大的文件
du -a | sort -n -r | head -n 10
python 引用
1 x = [1,2,3] 2 y = x 3 print (y) 4 >>[1,2,3] 5 x.pop() 6 print (y) 7 >>[1,2] 8 x = [1,2,3] 9 y = x[:] 10 print (y) 11 >>[1,2,3] 12 x.pop() 13 print (y) 14 >>[1,2,3]
Python中一个对象有两个头部信息 1.型标志符 标识对象的类型 2.引用计数器 用来决定是不是可以回收这个变量
类型属于对象的不属于变量 python变量 是在特定的时间引用了特定的变量 a = 123(整数) a = '123'(字符串) a = 1.23(float)
对象的垃圾收集 a = 123(整数) a = '123'(字符串) a = 1.23(float) 如果a 从指向int对象123 变成指向str对象‘123’则int对象123就要进行回收 被回收的空间自动放到 # 自由内存空间池 #
递归计算任意结构list元素和
def sum(l):total = 0for x in l:if not isinstance(x, list):total += xelse:total += sum(x)return total sum([[1,2,3],[1,[2]]])