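Although I love how easy Cypher's LOAD CSV command makes it to get data into Neo4j, it currently breaks the rule of least surprise: for some queries it eagerly loads all the rows, even when those queries use periodic commit.
My colleague Michael pointed this out in the second of his blog posts explaining how to use LOAD CSV successfully:
"The biggest issue that people ran into, even when following the advice I gave earlier, was that for large imports of more than one million rows, Cypher ran into an out-of-memory situation.
That was not related to commit sizes, so it happened even with PERIODIC COMMIT of small batches."
I recently spent a few days importing data into Neo4j on a Windows machine with 4GB of RAM, so I was seeing this problem even earlier than Michael suggested.
Michael explains how to work out whether your query is suffering from unexpected eager evaluation:
"If you profile that query, you see that there is an 'Eager' step in the query plan.
That is where the 'pull in all data' happens.
You can profile queries by prefixing them with the word 'PROFILE'. You'll need to run your query in the /webadmin console in your web browser or with the Neo4j shell."
I did this for my queries and was able to identify query patterns that get evaluated eagerly, and in some cases we can work around them.
We'll use the Northwind data set to demonstrate how the Eager pipe can sneak into our queries, but keep in mind that this data set is small enough not to cause issues.
This is what a row in the file looks like: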
$ head -n 2 data/customerDb.csv
OrderID,CustomerID,EmployeeID,OrderDate,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipCity,ShipRegion,ShipPostalCode,ShipCountry,CustomerID,CustomerCompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,EmployeeID,LastName,FirstName,Title,TitleOfCourtesy,BirthDate,HireDate,Address,City,Region,PostalCode,Country,HomePhone,Extension,Photo,Notes,ReportsTo,PhotoPath,OrderID,ProductID,UnitPrice,Quantity,Discount,ProductID,ProductName,SupplierID,CategoryID,QuantityPerUnit,UnitPrice,UnitsInStock,UnitsOnOrder,ReorderLevel,Discontinued,SupplierID,SupplierCompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,Phone,Fax,HomePage,CategoryID,CategoryName,Description,Picture
10248,VINET,5,1996-07-04,1996-08-01,1996-07-16,3,32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France,VINET,Vins et alcools Chevalier,Paul Henriot,Accounting Manager,59 rue de l'Abbaye,Reims,,51100,France,26.47.15.10,26.47.15.11,5,Buchanan,Steven,Sales Manager,Mr.,1955-03-04,1993-10-17,14 Garrett Hill,London,,SW1 8JR,UK,(71) 555-4848,3453,\x,"Steven Buchanan graduated from St. Andrews University, Scotland, with a BSC degree in 1976. Upon joining the company as a sales representative in 1992, he spent 6 months in an orientation program at the Seattle office and then returned to his permanent post in London. He was promoted to sales manager in March 1993. Mr. Buchanan has completed the courses ""Successful Telemarketing"" and ""International Sales Management."" He is fluent in French.",2,http://accweb/emmployees/buchanan.bmp,10248,11,14,12,0,11,Queso Cabrales,5,4,1 kg pkg.,21,22,30,30,0,5,Cooperativa de Quesos 'Las Cabras',Antonio del Valle Saavedra,Export Administrator,Calle del Rosal 4,Oviedo,Asturias,33007,Spain,(98) 598 76 54,,,4,Dairy Products,Cheeses,\x
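MERGE, MERGE, MERGE
The first thing we want to do is create a node for each employee and each order, and then create a relationship between them.
We might start with the following query: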
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
This does the job, but if we profile the query like so…
PROFILE LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
WITH row LIMIT 0
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
…we'll notice an 'Eager' lurking on the third line of the plan:
==> +----------------+------+--------+----------------------------------+-----------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +----------------+------+--------+----------------------------------+-----------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph(0) | 0 | 0 | employee, order, UNNAMED216 | MergePattern |
==> | Eager | 0 | 0 | | |
==> | UpdateGraph(1) | 0 | 0 | employee, employee, order, order | MergeNode; :Employee; MergeNode; :Order |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +----------------+------+--------+----------------------------------+-----------------------------------------+
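You'll notice that when we profile each query we strip off the periodic commit section and add 'WITH row LIMIT 0'. This lets us generate enough of the query plan to identify the 'Eager' operator without actually importing any data.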
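The profiling recipe is mechanical enough to script. Below is a minimal Python sketch, assuming the LOAD CSV clause sits on a single line and binds each row to a variable called row, as it does in the queries in this post. The function name to_profile_query is made up for illustration; it is not part of Neo4j or of any driver.

```python
import re

# Matches a "USING PERIODIC COMMIT" hint, with or without a batch size.
PERIODIC_COMMIT = re.compile(r"^USING\s+PERIODIC\s+COMMIT(\s+\d+)?$", re.IGNORECASE)
# Matches the start of a LOAD CSV clause.
LOAD_CSV = re.compile(r"^LOAD\s+CSV\b", re.IGNORECASE)


def to_profile_query(import_query, row_var="row"):
    """Rewrite an import query so that it can be profiled without importing data.

    Assumption: the LOAD CSV clause is on one line and ends with "AS <row_var>".
    """
    out = []
    for line in import_query.strip().splitlines():
        stripped = line.strip()
        if PERIODIC_COMMIT.match(stripped):
            # Drop the periodic commit hint before profiling.
            continue
        if LOAD_CSV.match(stripped):
            # Prefix the query with PROFILE and cut the input down to zero rows.
            out.append("PROFILE " + stripped)
            out.append("WITH {} LIMIT 0".format(row_var))
            continue
        out.append(stripped)
    return "\n".join(out)


if __name__ == "__main__":
    query = """USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})"""
    print(to_profile_query(query))
```

The output can be pasted into the Neo4j shell as is. The sketch only rewrites text, so it is still up to you to read the resulting plan.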
We want to split that query in two so that it can be processed in a non-eager manner:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph | 0 | 0 | employee, employee, order, order | MergeNode; :Employee; MergeNode; :Order |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +-------------+------+--------+----------------------------------+-----------------------------------------+
Now that we've created the employees and orders, we can join them together:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph | 0 | 0 | employee, order, UNNAMED216 | MergePattern |
==> | Filter(0) | 0 | 0 | | Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(0) | 0 | 0 | order, order | :Order |
==> | Filter(1) | 0 | 0 | | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(1) | 0 | 0 | employee, employee | :Employee |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
Not an Eager in sight!
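Scanning these tables by eye gets tedious once an import script has more than a handful of queries. The following Python sketch pulls the operator names out of a plan and reports whether Eager is among them. It assumes the plan is printed in the table layout shown in this post (rows optionally prefixed with "==>", cells separated by "|"); other Neo4j versions or tools may print plans differently. The names plan_operators and has_eager are hypothetical helpers, not Neo4j APIs.

```python
def plan_operators(plan_text):
    """Return the operator names from a plan table, top to bottom."""
    operators = []
    for line in plan_text.splitlines():
        line = line.strip()
        if line.startswith("==>"):
            # Remove the shell's output prefix.
            line = line[3:].strip()
        if not line.startswith("|"):
            # Skip the +-----+ border rows and anything that is not a table row.
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells and cells[0] and cells[0] != "Operator":
            operators.append(cells[0])
    return operators


def has_eager(plan_text):
    """True if any operator in the plan is Eager (ignoring a suffix like "(0)")."""
    return any(op.split("(")[0] == "Eager" for op in plan_operators(plan_text))


if __name__ == "__main__":
    plan = """
==> +----------------+------+--------+
==> | Operator       | Rows | DbHits |
==> +----------------+------+--------+
==> | EmptyResult    |    0 |      0 |
==> | UpdateGraph(0) |    0 |      0 |
==> | Eager          |    0 |      0 |
==> | LoadCSV        |    1 |      0 |
==> +----------------+------+--------+
"""
    print(plan_operators(plan))
    print(has_eager(plan))
```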
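MATCH, MATCH, MATCH, MERGE, MERGE
If we fast forward a few steps, we may have refactored our import script to the point where we create our nodes in one query and the relationships in another.
Our create query works as expected: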
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (employee:Employee {employeeId: row.EmployeeID})
MERGE (order:Order {orderId: row.OrderID})
MERGE (product:Product {productId: row.ProductID})
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph | 0 | 0 | employee, employee, order, order, product, product | MergeNode; :Employee; MergeNode; :Order; MergeNode; :Product |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +-------------+------+--------+----------------------------------------------------+--------------------------------------------------------------+
We've now got employees, products and orders in the graph. Now let's create relationships between the trio:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MATCH (product:Product {productId: row.ProductID})
MERGE (employee)-[:SOLD]->(order)
MERGE (order)-[:PRODUCT]->(product)
If we profile that, we'll notice that Eager has sneaked in again!
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph(0) | 0 | 0 | order, product, UNNAMED318 | MergePattern |
==> | Eager | 0 | 0 | | |
==> | UpdateGraph(1) | 0 | 0 | employee, order, UNNAMED287 | MergePattern |
==> | Filter(0) | 0 | 0 | | Property(product,productId) == Property(row,ProductID) |
==> | NodeByLabel(0) | 0 | 0 | product, product | :Product |
==> | Filter(1) | 0 | 0 | | Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(1) | 0 | 0 | order, order | :Order |
==> | Filter(2) | 0 | 0 | | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(2) | 0 | 0 | employee, employee | :Employee |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
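In this case the Eager sits between the first MERGE and the second, which matches what Michael describes in his post:
"The issue is that within a single Cypher statement you have to isolate changes that affect matches further on, e.g. when you CREATE nodes with a label that are suddenly matched by a later MATCH or MERGE operation."
We can work around the problem in this case by using a separate query to create each relationship: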
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (employee:Employee {employeeId: row.EmployeeID})
MATCH (order:Order {orderId: row.OrderID})
MERGE (employee)-[:SOLD]->(order)
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph | 0 | 0 | employee, order, UNNAMED236 | MergePattern |
==> | Filter(0) | 0 | 0 | | Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(0) | 0 | 0 | order, order | :Order |
==> | Filter(1) | 0 | 0 | | Property(employee,employeeId) == Property(row,EmployeeID) |
==> | NodeByLabel(1) | 0 | 0 | employee, employee | :Employee |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +----------------+------+--------+-------------------------------+-----------------------------------------------------------+
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MATCH (order:Order {orderId: row.OrderID})
MATCH (product:Product {productId: row.ProductID})
MERGE (order)-[:PRODUCT]->(product)
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph | 0 | 0 | order, product, UNNAMED229 | MergePattern |
==> | Filter(0) | 0 | 0 | | Property(product,productId) == Property(row,ProductID) |
==> | NodeByLabel(0) | 0 | 0 | product, product | :Product |
==> | Filter(1) | 0 | 0 | | Property(order,orderId) == Property(row,OrderID) |
==> | NodeByLabel(1) | 0 | 0 | order, order | :Order |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +----------------+------+--------+------------------------------+--------------------------------------------------------+
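With more than a couple of relationship types, writing one query per relationship by hand becomes repetitive. Here is a small Python sketch that generates them. It assumes every relationship connects two nodes that were already created from a single CSV column each, as in the queries above. The NODES and RELATIONSHIPS tables and the relationship_queries function are hypothetical names chosen for this example.

```python
# variable name -> (label, node property, CSV column)
NODES = {
    "employee": ("Employee", "employeeId", "EmployeeID"),
    "order": ("Order", "orderId", "OrderID"),
    "product": ("Product", "productId", "ProductID"),
}

# (start variable, relationship type, end variable)
RELATIONSHIPS = [
    ("employee", "SOLD", "order"),
    ("order", "PRODUCT", "product"),
]


def relationship_queries(csv_url, nodes=NODES, relationships=RELATIONSHIPS, batch=1000):
    """Build one LOAD CSV query per relationship so each contains a single MERGE."""
    queries = []
    for start, rel_type, end in relationships:
        lines = [
            "USING PERIODIC COMMIT {}".format(batch),
            'LOAD CSV WITH HEADERS FROM "{}" AS row'.format(csv_url),
        ]
        for var in (start, end):
            label, prop, column = nodes[var]
            # Look up the nodes that an earlier query created.
            lines.append("MATCH ({}:{} {{{}: row.{}}})".format(var, label, prop, column))
        lines.append("MERGE ({})-[:{}]->({})".format(start, rel_type, end))
        queries.append("\n".join(lines))
    return queries


if __name__ == "__main__":
    url = "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv"
    for query in relationship_queries(url):
        print(query)
        print()
```

Each generated query should still be profiled before running it against a large file, since the sketch only builds the text and knows nothing about the planner.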
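MERGE, SET
I try to make LOAD CSV scripts as idempotent as possible, so that if we add more rows or columns of data to our CSV we can rerun the query without having to recreate everything.
This can lead you towards the following pattern, shown here for creating suppliers: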
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID})
SET supplier.companyName = row.SupplierCompanyName
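We want to ensure that there's only one Supplier with a given SupplierID, but we might be incrementally adding new properties, so we decide to just overwrite everything with the 'SET' command. If we profile that query, the Eager lurks: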
==> +----------------+------+--------+--------------------+----------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +----------------+------+--------+--------------------+----------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph(0) | 0 | 0 | | PropertySet |
==> | Eager | 0 | 0 | | |
==> | UpdateGraph(1) | 0 | 0 | supplier, supplier | MergeNode; :Supplier |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +----------------+------+--------+--------------------+----------------------+
We can work around this, at the cost of a bit of duplication, by using 'ON CREATE SET' and 'ON MATCH SET':
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-northwind/data/customerDb.csv" AS row
MERGE (supplier:Supplier {supplierId: row.SupplierID})
ON CREATE SET supplier.companyName = row.SupplierCompanyName
ON MATCH SET supplier.companyName = row.SupplierCompanyName
==> +-------------+------+--------+--------------------+----------------------+
==> | Operator | Rows | DbHits | Identifiers | Other |
==> +-------------+------+--------+--------------------+----------------------+
==> | EmptyResult | 0 | 0 | | |
==> | UpdateGraph | 0 | 0 | supplier, supplier | MergeNode; :Supplier |
==> | Slice | 0 | 0 | | { AUTOINT0} |
==> | LoadCSV | 1 | 0 | row | |
==> +-------------+------+--------+--------------------+----------------------+
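The duplication grows with every property we add, because each assignment has to appear under both clauses. One way to keep the two lists in sync is to generate the MERGE from a single mapping. This is a minimal Python sketch under the assumption that every property is copied straight from a CSV column; merge_with_sets is a hypothetical helper name.

```python
def merge_with_sets(var, label, key_prop, key_col, props):
    """Build a MERGE that sets the same properties on both create and match.

    props maps node property names to CSV column names.
    """
    assignments = ", ".join(
        "{}.{} = row.{}".format(var, prop, column) for prop, column in props.items()
    )
    return "\n".join([
        "MERGE ({}:{} {{{}: row.{}}})".format(var, label, key_prop, key_col),
        # The same assignment list is emitted under both clauses.
        "ON CREATE SET " + assignments,
        "ON MATCH SET " + assignments,
    ])


if __name__ == "__main__":
    print(merge_with_sets(
        "supplier", "Supplier", "supplierId", "SupplierID",
        {"companyName": "SupplierCompanyName"},
    ))
```

The generated text goes after the USING PERIODIC COMMIT and LOAD CSV lines of the import query.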
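With the data set I've been working with, these changes let me avoid OutOfMemory exceptions in some cases and cut the time it took to run the query by a factor of 3 in others.
As time goes on I expect all of these scenarios will be addressed, but as of Neo4j 2.1.5 these are the patterns that I've identified as being overly eager.
If you know of any others, do let me know and I can add them to the post or write a second part.
Original article: https://www.javacodegeeks.com/2014/10/neo4j-cypher-avoiding-the-eager.html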