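My spider used to rely on pyquery for scraping, but that turned out not to work on malformed pages, and regular expressions are even less usable there. The target site's pages are a tangled mess, so I plan to redo the scraping with lxml's iterative parsing features.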
Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import lxml
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import lxml
ImportError: No module named lxml
>>> import lxml
>>> class EchoTarget(object):
    def start(self, tag, attrib):
        print("start %s %r" % (tag, dict(attrib)))
        print attrib
    def end(self, tag):
        print("end %s" % tag)
    def data(self, data):
        print("data %r" % data)
    def comment(self, text):
        print("comment %s" % text)
    def close(self):
        print("close")
        return "closed!"

>>> from lxml import etree
>>> from io import StringIO, BytesIO
>>> parser = etree.XMLParser(target = EchoTarget())
>>> result = etree.XML("<element a='aaa' b='bbb' c=333 >some<!--comment-->text</element>",parser)
close
Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    result = etree.XML("<element a='aaa' b='bbb' c=333 >some<!--comment-->text</element>",parser)
  File "lxml.etree.pyx", line 2723, in lxml.etree.XML (src/lxml/lxml.etree.c:52448)
  File "parser.pxi", line 1573, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79932)
  File "parser.pxi", line 1452, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78774)
  File "parser.pxi", line 960, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75389)
  File "parsertarget.pxi", line 149, in lxml.etree._TargetParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:86190)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
XMLSyntaxError: AttValue: " or ' expected, line 1, column 15
>>> result = etree.XML("<element a='aaa' b='bbb' >some<!--comment-->text</element>",parser)
start element {'a': u'aaa', 'b': u'bbb'}
{'a': u'aaa', 'b': u'bbb'}
data u'some'
comment comment
data u'text'
end element
close
>>> result
'closed!'
>>> class EchoTarget(object):
    def start(self, tag, attrib):
        print attrib
        print("start %s %r" % (tag, dict(attrib)))
    def end(self, tag):
        print("end %s" % tag)
    def data(self, data):
        print("data %r" % data)
    def comment(self, text):
        print("comment %s" % text)
    def close(self):
        print("close")
        return "closed!"

>>> result = etree.XML("<element a='aaa' b='bbb' >some<!--comment-->text</element>",parser)
start element {'a': u'aaa', 'b': u'bbb'}
{'a': u'aaa', 'b': u'bbb'}
data u'some'
comment comment
data u'text'
end element
close
>>> class EchoTarget(object):
    def start(self, tag, attrib):
        print attrib
        print("start %s %r" % (tag, dict(attrib)))
    def end(self, tag):
        print("end %s" % tag)
    def data(self, data):
        print("data %r" % data)
    def comment(self, text):
        print("comment %s" % text)
    def close(self):
        print("close")
        return "closed!"

>>> parser = etree.XMLParser(target = EchoTarget())
>>>
>>> result = etree.XML("<element a='aaa' b='bbb' >some<!--comment-->text</element>",parser)
{'a': u'aaa', 'b': u'bbb'}
start element {'a': u'aaa', 'b': u'bbb'}
data u'some'
comment comment
data u'text'
end element
close
>>>
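Using the target parser method
The target parser method should feel familiar to developers who know SAX event-driven code. A target parser is a class that can implement the following methods:
start: fired when an element is opened. The element's data and children are not yet available.
end: fired when an element is closed. All of the element's child nodes, including text nodes, are now available.
data: fired for a text child node, giving access to that text.
close: fired when parsing is complete.
Listing 2 shows how to create a target parser class (called TitleTarget here) that implements the required methods. This parser collects the text nodes of Title elements in an internal list (self.text) and returns the list when the close() method is reached.
Listing 2. A target parser that returns a list of all text child nodes of the Title tag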
from lxml import etree

class TitleTarget(object):
    def __init__(self):
        self.text = []
        self.is_title = False
    def start(self, tag, attrib):
        # Remember whether the element that just opened is a Title
        self.is_title = (tag == 'Title')
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            # Python 2: encode the unicode text as UTF-8 bytes
            self.text.append(data.encode('utf-8'))
    def close(self):
        return self.text

parser = etree.XMLParser(target = TitleTarget(), recover=True)

# This and most other samples read in the Google copyright data
infile = 'copyright.xml'
results = etree.parse(infile, parser)

# When iterated over, 'results' will contain the output from
# target parser's close() method
out = open('titles.txt', 'w')
out.write('\n'.join(results))
out.close()
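Run against the copyright data, the code took 54 seconds. Target parsing achieves reasonable speed and does not build a memory-hungry parse tree, but it fires events for every element in the data. For very large documents where only a few elements are of interest, as in this example, that is not ideal. Can processing be restricted to selected tags while still getting good performance?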
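One possible answer is to stream the document with iterparse and act only on the elements of interest, discarding each element once it has been handled. The following is a minimal sketch, not the original article's code. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, and the sample XML and the helper name iter_titles are made up for illustration. lxml's etree.iterparse additionally accepts a tag argument that performs this filtering inside the parser.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def iter_titles(source, wanted="Title"):
    """Yield the text of every <wanted> element while streaming the input."""
    # Only "end" events are requested, so each element is complete
    # (its text is available) by the time we see it.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted:
            yield elem.text
        # Drop the element's text and children to keep memory use low.
        elem.clear()


# Hypothetical sample data standing in for the large copyright file.
sample = BytesIO(
    b"<Records>"
    b"<Record><Title>First</Title><Owner>A</Owner></Record>"
    b"<Record><Title>Second</Title><Owner>B</Owner></Record>"
    b"</Records>"
)
titles = list(iter_titles(sample))
print(titles)
```

Unlike the target parser, this approach does build elements, so clearing them after use is what keeps memory consumption bounded.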
In [7]: f = StringIO.StringIO(r"""
…:
…:
…:
…:
…:
…:
…:
…:
…:
…:
…:
…:
…: bloated, but consider this — my longest chapter so far
…: would be 75 printed pages, and it loads in under 5 seconds…
…: On dialup.
…:
…: """)
http://lxml.de/parsing.html#parsers
The target parser interface
As in ElementTree, and similar to a SAX event handler, you can pass a target object to the parser:
>>> class EchoTarget(object):
...     def start(self, tag, attrib):
...         print("start %s %r" % (tag, dict(attrib)))
...     def end(self, tag):
...         print("end %s" % tag)
...     def data(self, data):
...         print("data %r" % data)
...     def comment(self, text):
...         print("comment %s" % text)
...     def close(self):
...         print("close")
...         return "closed!"

>>> parser = etree.XMLParser(target = EchoTarget())

>>> result = etree.XML("<element>some<!--comment-->text</element>",
...                    parser)
start element {}
data u'some'
comment comment
data u'text'
end element
close

>>> print(result)
closed!
It is important for the .close() method to reset the parser target to a usable state, so that you can reuse the parser as often as you like:
>>> result = etree.XML("<element>some<!--comment-->text</element>",
...                    parser)
start element {}
data u'some'
comment comment
data u'text'
end element
close

>>> print(result)
closed!
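A target that accumulates data makes the need for this reset easier to see than EchoTarget does. The sketch below is an illustration under assumptions: the class name TextCollector, the helper parse_with and the sample documents are invented, and it uses the standard library's xml.etree.ElementTree.XMLParser, which accepts the same kind of target object. Unlike an lxml parser, a standard-library parser cannot be reused after close(), so a fresh parser is created for each document while the same target instance is shared.

```python
import xml.etree.ElementTree as ET


class TextCollector(object):
    """Hypothetical target that gathers all text chunks of a document."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass

    def end(self, tag):
        pass

    def data(self, data):
        self.chunks.append(data)

    def close(self):
        # Hand back the result and reset, so that the next document
        # does not inherit text from this one.
        result, self.chunks = "".join(self.chunks), []
        return result


def parse_with(target, xml_text):
    # A stdlib parser is single-use; the target object is what gets reused.
    parser = ET.XMLParser(target=target)
    parser.feed(xml_text)
    return parser.close()  # returns whatever target.close() returned


target = TextCollector()
first = parse_with(target, "<element>some<b>bold</b>text</element>")
second = parse_with(target, "<element>other</element>")
print(first)
print(second)
```

If close() returned the joined text without emptying self.chunks, the second result would also contain the text of the first document.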
Starting with lxml 2.3, the .close() method will also be called in the error case. This diverges from the behaviour of ElementTree, but allows target objects to clean up their state in all situations, so that the parser can reuse them afterwards.
>>> class CollectorTarget(object):
...     def __init__(self):
...         self.events = []
...     def start(self, tag, attrib):
...         self.events.append("start %s %r" % (tag, dict(attrib)))
...     def end(self, tag):
...         self.events.append("end %s" % tag)
...     def data(self, data):
...         self.events.append("data %r" % data)
...     def comment(self, text):
...         self.events.append("comment %s" % text)
...     def close(self):
...         self.events.append("close")
...         return "closed!"

>>> parser = etree.XMLParser(target = CollectorTarget())

>>> result = etree.XML("<element>some</error>",
...                    parser)        # doctest: +ELLIPSIS
Traceback (most recent call last):
  ...
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch...

>>> for event in parser.target.events:
...     print(event)
start element {}
data u'some'
close
Note that the parser does not build a tree when using a parser target. The result of the parser run is whatever the target object returns from its .close() method. If you want to return an XML tree here, you have to create it programmatically in the target object. An example for a parser target that builds a tree is the TreeBuilder:
>>> parser = etree.XMLParser(target = etree.TreeBuilder())

>>> result = etree.XML("<element>some<!--comment-->text</element>",
...                    parser)

>>> print(result.tag)
element
>>> print(result[0].text)
comment
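When a custom target still needs to return a tree, delegating to a TreeBuilder is one option. The following sketch is hypothetical: the class name CountingTreeTarget and the sample document are made up, and it relies on the standard library's xml.etree.ElementTree, whose TreeBuilder offers the same start/end/data/close methods. It forwards every event to a TreeBuilder while counting tags, and close() returns both the root element and the counts.

```python
import xml.etree.ElementTree as ET
from collections import Counter


class CountingTreeTarget(object):
    """Hypothetical target: builds a tree and counts element tags."""

    def __init__(self):
        self.builder = ET.TreeBuilder()
        self.counts = Counter()

    def start(self, tag, attrib):
        self.counts[tag] += 1
        return self.builder.start(tag, attrib)

    def end(self, tag):
        return self.builder.end(tag)

    def data(self, data):
        return self.builder.data(data)

    def close(self):
        # The builder's close() returns the root element of the tree.
        return self.builder.close(), dict(self.counts)


parser = ET.XMLParser(target=CountingTreeTarget())
parser.feed("<element><item>a</item><item>b</item></element>")
root, counts = parser.close()
print(root.tag)
print([child.text for child in root])
print(counts)
```

The same delegation idea should carry over to lxml by passing such a target to etree.XMLParser(target=...) and wrapping etree.TreeBuilder, though that combination is not tested here.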
http://lxml.de/tutorial.html
http://www.ibm.com/developerworks/cn/xml/x-hiperfparse/index.html
http://sebug.net/paper/books/dive-into-python3/xml.html
http://alvinli1991.github.io/python/2013/11/12/python-xml-parser—elementtree/
http://pycoders-weekly-chinese.readthedocs.org/en/latest/issue6/processing-xml-in-python-with-element-tree.html