lxml 笔记

蜘蛛之前用的pyquery做采集,发现对非正常的网页并不适用,正则表达式更是不能用,目标网站网页乱的一团乱麻,打算重新用lxml的迭代解析功能来做吧。

使用目标解析器方法
目标解析器方法对于熟悉 SAX 事件驱动代码的开发人员来说应该不陌生。目标解析器是可以实现以下方法的类:
start 在元素打开时触发。数据和元素的子元素仍不可用。
end 在元素关闭时触发。所有元素的子节点,包括文本节点,现在都是可用的。
data 触发文本子节点并访问该文本。
close 在解析完成后触发。
清单 2 演示了如何创建实现所需方法的目标解析器类(这里称为 TitleTarget)。这个解析器在一个内部列表(self.text)中收集 Title 元素的文本节点,并在到达 close() 方法后返回列表。
清单 2. 一个目标解析器,它返回 Title 标记的所有文本子节点的列表
class TitleTarget(object):
def __init__(self):
self.text = []
def start(self, tag, attrib):
self.is_title = True if tag == ‘Title’ else False
def end(self, tag):
pass
def data(self, data):
if self.is_title:
self.text.append(data.encode(‘utf-8’))
def close(self):
return self.text

parser = etree.XMLParser(target = TitleTarget(),recover=True)

# This and most other samples read in the Google copyright data
infile = ‘copyright.xml’

results = etree.parse(infile, parser)

# When iterated over, ‘results’ will contain the output from
# target parser’s close() method

out = open(‘titles.txt’, ‘w’)
out.write(‘\n’.join(results))
out.close()
在运行版权数据时,代码运行时间为 54 秒。目标解析可以实现合理的速度并且不会生成消耗内存的解析树,但是在数据中为所有元素触发事件。对于特别大型的文档,如果只对其中一些元素感兴趣,那么这种方法并不理想,就像在这个例子中一样。能否将处理限制到选择的标记并获得较好的性能呢?

In [7]: f = StringIO.StringIO(r”””
…:
…: Mark
…: http://diveintomark.org/
…:

…:Dive into history, 2009 edition
…:
…: tag:diveintomark.org,2009-03-27:/archives/20090327172042
…: 2009-03-27T21:56:07Z
…: 2009-03-27T17:20:42Z …:
…:
…:
…:

Putting an entire chapter on one page sounds ⑦
…: bloated, but consider this — my longest chapter so far
…: would be 75 printed pages, and it loads in under 5 seconds…
…: On dialup.

…: ⑧
…: “””)

http://lxml.de/parsing.html#parsers

The target parser interface

As in ElementTree, and similar to a SAX event handler, you can pass a target object to the parser:
>>> class EchoTarget(object):
… def start(self, tag, attrib):
… print(“start %s %r” % (tag, dict(attrib)))
… def end(self, tag):
… print(“end %s” % tag)
… def data(self, data):
… print(“data %r” % data)
… def comment(self, text):
… print(“comment %s” % text)
… def close(self):
… print(“close”)
… return “closed!”

>>> parser = etree.XMLParser(target = EchoTarget())

>>> result = etree.XML(“sometext“,
… parser)
start element {}
data u’some’
comment comment
data u’text’
end element
close

>>> print(result)
closed!
It is important for the .close() method to reset the parser target to a usable state, so that you can reuse the parser as often as you like:
>>> result = etree.XML(“sometext“,
… parser)
start element {}
data u’some’
comment comment
data u’text’
end element
close

>>> print(result)
closed!
Starting with lxml 2.3, the .close() method will also be called in the error case. This diverges(分歧) from the behaviour of ElementTree, but allows target objects to clean up their state in all situations, so that the parser can reuse them afterwards.
>>> class CollectorTarget(object):
… def __init__(self):
… self.events = []
… def start(self, tag, attrib):
… self.events.append(“start %s %r” % (tag, dict(attrib)))
… def end(self, tag):
… self.events.append(“end %s” % tag)
… def data(self, data):
… self.events.append(“data %r” % data)
… def comment(self, text):
… self.events.append(“comment %s” % text)
… def close(self):
… self.events.append(“close”)
… return “closed!”

>>> parser = etree.XMLParser(target = CollectorTarget())

>>> result = etree.XML(“some“,
… parser) # doctest: +ELLIPSIS
Traceback (most recent call last):

lxml.etree.XMLSyntaxError: Opening and ending tag mismatch…

>>> for event in parser.target.events:
… print(event)
start element {}
data u’some’
close
Note that the parser does not build a tree when using a parser target. The result of the parser run is whatever the target object returns from its .close() method. If you want to return an XML tree here, you have to create it programmatically in the target object. An example for a parser target that builds a tree is the TreeBuilder:
>>> parser = etree.XMLParser(target = etree.TreeBuilder())

>>> result = etree.XML(“sometext“,
… parser)

>>> print(result.tag)
element
>>> print(result[0].text)
comment

http://lxml.de/tutorial.html
http://www.ibm.com/developerworks/cn/xml/x-hiperfparse/index.html
http://sebug.net/paper/books/dive-into-python3/xml.html
http://alvinli1991.github.io/python/2013/11/12/python-xml-parser—elementtree/
http://pycoders-weekly-chinese.readthedocs.org/en/latest/issue6/processing-xml-in-python-with-element-tree.html

发表评论

电子邮件地址不会被公开。