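I am using PHP and libtidy to try to screen-scrape what may be the worst and most incorrect use of HTML tables in history. The site closes few of its table, tr, td, font, or bold tags, and it consistently nests many different layers of tables within tables.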
Sample snippet (as rendered):
Home Team - Wildcats | Away Team - Polar Bears

Rosters
 1  Baird, T         |  5  Mack, A
 2  Knight, P        |  8  Foucault, R
 8  Miller, B        | 11  Oberpriller, D *
 9  Huebsch, B       | 12  Underwood, J
11  Buschmann, C     | 15  Oberpriller, M
12  Reding, J        | 19  Langfus, B
14  Simpson, S       | 25  Carroll, R
27  Kupferschmidt, M | 30  Hirdler, T
28  Anderson, D      | 33  Gibson, S
31  Gehrts, J        | 35  Marthaler, C
39  McGinnis, G      | 44  Yurik, J
42  Temple, B        | 58  Gronemeyer, S
44  Kemplin, A       |
77  Weiner, B        |
95  Zytkoskie, D     |

Goals
Player | Period | Time | Assist 1 | Assist 2 | SH | PP
Kupferschmidt, M | 1 | 12:51 | Kemplin, A | None
McGinnis, G | 1 | 12:33 | Huebsch, B | None
Kupferschmidt, M | 2 | 16:01 | None | None
Buschmann, C | 3 | 00:38 | None | None

Player | Period | Time | Assist 1 | Assist 2 | SH | PP
Oberpriller, D * | 3 | 12:31 | Gronemeyer, S | None

Penalties
Player | Period | Minutes | Offense | Start | Expired
Buschmann, C | 3 | 2 | Interference | 11:11 | 09:11
Buschmann, C | 3 | 2 | Unsportmanlike Conduct | 03:26 | 01:26
Bench | 3 | 2 | Too Many Men | 01:46 | 00:00

Player | Period | Minutes | Offense | Start | Expired
Marthaler, C | 1 | 2 | Interference | 01:19 | 16:19
Underwood, J | 2 | 2 | Interference | 12:32 | 10:32
Marthaler, C | 3 | 2 | Interference | 11:39 | 09:39

Goalies
Name | Shots | Goals
Baird, T | 20 | 1
Open Net | 0

Name | Shots | Goals
Hirdler, T | 42

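Strangely, every browser seems to render it just fine.
PHP's tidy extension manages to make sense of all of it quite well, but the tables are nested so deeply and so nearly at random that the result is very hard to traverse with DOM XPath.
Does anyone have suggestions for other approaches?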
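POST-MORTEM: After too many Belgian wheat beers and mucking up my code real good, I got good results by first removing every tag except table, tr, and td with strip_tags(), and then running the result through libtidy. The output is now nicely formatted and easy to traverse. It seems the markup just needed a little massaging before being handed to the parser.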
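
For anyone landing here later, here is a minimal sketch of that strip-then-tidy idea, assuming the tidy and DOM extensions are both installed. The function name extractRows, the tidy option shown, and the XPath expressions are illustrative assumptions, not necessarily the exact code used against the real site.

```php
<?php
// Hypothetical helper: the name and the XPath expressions are illustrative only.
function extractRows(string $html): array
{
    // 1. Drop every tag except the three structural table tags
    //    (font, b, and everything else is removed; their text is kept).
    $stripped = strip_tags($html, '<table><tr><td>');

    // 2. Let libtidy close and re-nest the dangling tags (requires ext-tidy).
    //    'wrap' => 0 only stops tidy from re-wrapping long lines.
    $clean = tidy_repair_string($stripped, ['wrap' => 0], 'utf8');

    // 3. Load the repaired markup and query it with XPath.
    libxml_use_internal_errors(true);   // silence parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($clean);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows  = [];

    // Keep only rows that do not themselves wrap a nested table,
    // so layout-only wrapper rows are skipped.
    foreach ($xpath->query('//tr[not(.//table)]') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            // Collapse runs of whitespace inside each cell.
            $cells[] = trim(preg_replace('/\s+/', ' ', $td->textContent));
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

Calling extractRows() on the raw page source should give one array of cell strings per data row, which can then be grouped into rosters, goals, penalties, and goalies by looking for the header rows. Note that strip_tags() keeps attributes on the tags it allows, and that any th cells would be stripped along with everything else, so '<th>' may need adding to the allowed list for other pages.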