While experimenting with some time series collections, I needed a large data set to check that our aggregation queries don't become a bottleneck as the data load increases. We settled on 50 million documents, since beyond that number we would consider sharding anyway.
Every time event looks like this:
{"_id" : ObjectId("5298a5a03b3f4220588fe57c"),"created_on" : ISODate("2012-04-22T01:09:53Z"),"value" : 0.1647851116706831
}
As we wanted to get random values, we thought of generating them using JavaScript or Python (we could have tried it in Java, but we wanted to write it as fast as possible). We didn't know which one would be faster, so we decided to test them.
Our first try was with a JavaScript file run through the MongoDB shell.
Here is what it looks like:
var minDate = new Date(2012, 0, 1, 0, 0, 0, 0);
var maxDate = new Date(2013, 0, 1, 0, 0, 0, 0);
var delta = maxDate.getTime() - minDate.getTime();

// arg1 and arg2 are supplied through the shell's --eval parameter
var job_id = arg2;
var documentNumber = arg1;
var batchNumber = 5 * 1000;

var job_name = 'Job#' + job_id;
var start = new Date();

var batchDocuments = new Array();
var index = 0;
while(index < documentNumber) {
    var date = new Date(minDate.getTime() + Math.random() * delta);
    var value = Math.random();
    var document = {
        created_on : date,
        value : value
    };
    // reuse a fixed-size array and flush it to MongoDB every 5000 documents
    batchDocuments[index % batchNumber] = document;
    if((index + 1) % batchNumber == 0) {
        db.randomData.insert(batchDocuments);
    }
    index++;
    if(index % 100000 == 0) {
        print(job_name + ' inserted ' + index + ' documents.');
    }
}
print(job_name + ' inserted ' + documentNumber + ' in ' + (new Date() - start)/1000.0 + 's');
This is how we ran it and what we got (the arg1 and arg2 variables reach the script through the shell's --eval parameter, which mongo evaluates before loading the file):
mongo random --eval "var arg1=50000000;arg2=1" create_random.js
Job#1 inserted 100000 documents.
Job#1 inserted 200000 documents.
Job#1 inserted 300000 documents.
...
Job#1 inserted 49900000 documents.
Job#1 inserted 50000000 in 566.294s
Well, this already exceeded my expectations (88,293 inserts/second).
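(For reference, the throughput figures in this post are simply documents divided by elapsed time: 50,000,000 / 566.294 s ≈ 88,293 inserts/second; the later numbers are derived the same way.)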
Now it's Python's turn. You will need to install pymongo to run it properly.
import sys
import os
import pymongo
import time
import random

from datetime import datetime

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

job_id = '1'

if len(sys.argv) < 2:
    sys.exit("You must supply the item_number argument")
elif len(sys.argv) > 2:
    job_id = sys.argv[2]

documents_number = int(sys.argv[1])
batch_number = 5 * 1000;

job_name = 'Job#' + job_id
start = datetime.now();

# obtain a mongo connection
connection = pymongo.Connection("mongodb://localhost", safe=True)

# obtain a handle to the random database
db = connection.random
collection = db.randomData

batch_documents = [i for i in range(batch_number)];

for index in range(documents_number):
    try:
        date = datetime.fromtimestamp(time.mktime(min_date.timetuple()) + int(round(random.random() * delta)))
        value = random.random()
        document = {
            'created_on' : date,
            'value' : value,
        }
        batch_documents[index % batch_number] = document
        if (index + 1) % batch_number == 0:
            collection.insert(batch_documents)
        index += 1;
        if index % 100000 == 0:
            print job_name, ' inserted ', index, ' documents.'
    except:
        print 'Unexpected error:', sys.exc_info()[0], ', for index ', index
        raise

print job_name, ' inserted ', documents_number, ' in ', (datetime.now() - start).total_seconds(), 's'
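As an aside, pymongo.Connection and the safe=True flag belong to the old PyMongo 2.x API and have since been removed. If you want to reproduce this on a recent PyMongo (3.x or later), here is a minimal sketch of the same batched loop against the modern API; MongoClient and insert_many replace Connection and insert:

import random
import time
from datetime import datetime

from pymongo import MongoClient

min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()
base = time.mktime(min_date.timetuple())

documents_number = 50 * 1000 * 1000
batch_size = 5000

# MongoClient replaces the deprecated Connection; writes are
# acknowledged by default, so there is no safe=True anymore
collection = MongoClient('mongodb://localhost').random.randomData

batch = []
for index in range(documents_number):
    batch.append({
        'created_on': datetime.fromtimestamp(base + random.random() * delta),
        'value': random.random(),
    })
    if len(batch) == batch_size:
        collection.insert_many(batch)  # one round trip per 5000 documents
        batch = []
# no leftover batch to flush: 50,000,000 divides evenly by 5000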
Running the original script, this is what we got this time:
python create_random.py 50000000
Job#1 inserted 100000 documents.
Job#1 inserted 200000 documents.
Job#1 inserted 300000 documents.
...
Job#1 inserted 49900000 documents.
Job#1 inserted 50000000 in 1713.501 s
It is somewhat slower than the JavaScript version (29,180 inserts/second), but let's not get discouraged. Python is a full-featured programming language, so how about taking advantage of all our CPU cores (e.g. 4 cores) and starting one script per core, each one inserting a fraction of the total number of documents (e.g. 12,500,000)?
import sys
import pymongo
import time
import subprocess
import multiprocessing

from datetime import datetime

cpu_count = multiprocessing.cpu_count()

# obtain a mongo connection
connection = pymongo.Connection('mongodb://localhost', safe=True)

# obtain a handle to the random database
db = connection.random
collection = db.randomData

total_documents_count = 50 * 1000 * 1000;
inserted_documents_count = 0
sleep_seconds = 1
sleep_count = 0

for i in range(cpu_count):
    documents_number = str(total_documents_count/cpu_count)
    print documents_number
    subprocess.Popen(['python', '../create_random.py', documents_number, str(i)])

start = datetime.now();

while (inserted_documents_count < total_documents_count) is True:
    inserted_documents_count = collection.count()
    if (sleep_count > 0 and sleep_count % 60 == 0):
        print 'Inserted ', inserted_documents_count, ' documents.'
    if (inserted_documents_count < total_documents_count):
        sleep_count += 1
        time.sleep(sleep_seconds)

print 'Inserting ', total_documents_count, ' took ', (datetime.now() - start).total_seconds(), 's'
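Spawning subprocesses is one way to fan out; the standard library's multiprocessing module can do the same thing without leaving Python. Here is a minimal sketch of that alternative, again written against the modern PyMongo API, where insert_documents is a hypothetical worker wrapping the batched loop from above:

import multiprocessing
import random
import time
from datetime import datetime

from pymongo import MongoClient

def insert_documents(documents_number, job_id):
    # each worker opens its own connection: a MongoClient instance
    # should not be shared across a fork
    collection = MongoClient('mongodb://localhost').random.randomData
    min_date = datetime(2012, 1, 1)
    delta = (datetime(2013, 1, 1) - min_date).total_seconds()
    base = time.mktime(min_date.timetuple())
    batch = []
    for index in range(documents_number):
        batch.append({
            'created_on': datetime.fromtimestamp(base + random.random() * delta),
            'value': random.random(),
        })
        if len(batch) == 5000:
            collection.insert_many(batch)
            batch = []
    print('Job#%d inserted %d documents.' % (job_id, documents_number))

if __name__ == '__main__':
    total_documents_count = 50 * 1000 * 1000
    cpu_count = multiprocessing.cpu_count()
    workers = [multiprocessing.Process(target=insert_documents,
                                       args=(total_documents_count // cpu_count, i))
               for i in range(cpu_count)]
    start = datetime.now()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print('Inserting %d took %.3f s' %
          (total_documents_count, (datetime.now() - start).total_seconds()))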
Running the original parallel script (the subprocess version) goes like this:
python create_random_parallel.py
Job#3 inserted 100000 documents.
Job#2 inserted 100000 documents.
Job#0 inserted 100000 documents.
Job#1 inserted 100000 documents.
Job#3 inserted 200000 documents.
...
Job#2 inserted 12500000 in 571.819 s
Job#0 inserted 12400000 documents.
Job#3 inserted 10800000 documents.
Job#1 inserted 12400000 documents.
Job#0 inserted 12500000 documents.
Job#0 inserted 12500000 in 577.061 s
Job#3 inserted 10900000 documents.
Job#1 inserted 12500000 documents.
Job#1 inserted 12500000 in 578.427 s
Job#3 inserted 11000000 documents.
...
Job#3 inserted 12500000 in 623.999 s
Inserting 50000000 took 624.655 s
This is very good indeed (80,044 inserts/second), even if still slower than the first JavaScript import. So let's adapt this last Python script to run the JavaScript through multiple MongoDB shells instead.
Since I could not manage to supply the required arguments to the mongo command when launching it as a subprocess from the main Python script, I came up with the following alternative:
for i in range(cpu_count):
    documents_number = str(total_documents_count/cpu_count)
    script_name = 'create_random_' + str(i + 1) + '.bat'
    script_file = open(script_name, 'w')
    script_file.write('mongo random --eval "var arg1=' + documents_number + ';arg2=' + str(i + 1) + '" ../create_random.js');
    script_file.close()
    subprocess.Popen(script_name)
We generate the shell scripts dynamically and let Python run them for us.
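In principle, the intermediate .bat files shouldn't be necessary: subprocess.Popen also accepts the command as an argument list, which sidesteps the quoting issues of a single command string. A sketch of that variant (untested in my Windows setup, where the .bat approach is what worked):

for i in range(cpu_count):
    documents_number = str(total_documents_count // cpu_count)
    subprocess.Popen([
        'mongo', 'random',
        '--eval', 'var arg1=' + documents_number + ';arg2=' + str(i + 1),
        '../create_random.js',
    ])

Back to the .bat approach, this is the output we got: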
Job#1 inserted 100000 documents.
Job#4 inserted 100000 documents.
Job#3 inserted 100000 documents.
Job#2 inserted 100000 documents.
Job#1 inserted 200000 documents.
...
Job#4 inserted 12500000 in 566.438s
Job#3 inserted 12300000 documents.
Job#2 inserted 10800000 documents.
Job#1 inserted 11600000 documents.
Job#3 inserted 12400000 documents.
Job#1 inserted 11700000 documents.
Job#2 inserted 10900000 documents.
Job#1 inserted 11800000 documents.
Job#3 inserted 12500000 documents.
Job#3 inserted 12500000 in 574.782s
Job#2 inserted 11000000 documents.
Job#1 inserted 11900000 documents.
Job#2 inserted 11100000 documents.
Job#1 inserted 12000000 documents.
Job#2 inserted 11200000 documents.
Job#1 inserted 12100000 documents.
Job#2 inserted 11300000 documents.
Job#1 inserted 12200000 documents.
Job#2 inserted 11400000 documents.
Job#1 inserted 12300000 documents.
Job#2 inserted 11500000 documents.
Job#1 inserted 12400000 documents.
Job#2 inserted 11600000 documents.
Job#1 inserted 12500000 documents.
Job#1 inserted 12500000 in 591.073s
Job#2 inserted 11700000 documents.
...
Job#2 inserted 12500000 in 599.005s
Inserting 50000000 took 599.253 s
This one is fast too (83,437 inserts/second), but it still couldn't beat our first attempt.
Conclusion
My PC configuration is nothing out of the ordinary; the only optimization is the SSD drive on which MongoDB runs.
The first attempt yielded the best results, and monitoring the CPU I realized that MongoDB leverages all the cores even for a single shell console. The Python script running on all cores was also fast enough, and it has the advantage of letting us turn the script into a fully operational application if we ever need to.
- The code is available on GitHub.
Translated from: https://www.javacodegeeks.com/2013/12/mongodb-facts-80000-insertssecond-on-commodity-hardware.html