An Introductory Tutorial on Python's HTMLParser

　　HTMLParser.HTMLParser()

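　　The HTMLParser module contains the class HTMLParser. By itself this class is not very useful, because it does nothing when an event occurs. To use it, you subclass it and write methods that handle the events you are interested in.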

　　The HTMLParser module defines a class, HTMLParser, that serves as the basis for parsing HTML and XHTML. Unlike the parser in htmllib, this parser is not based on sgmllib.

　　A simple HTMLParser usage example
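　　The listing for this example is not reproduced here, so the following is a minimal sketch rather than the original code. It assumes Python 3, where the class lives in html.parser (in Python 2 it was HTMLParser.HTMLParser). The class name MyHTMLParser, its events list, and the input string are illustrative choices, picked so that the sketch reports the same kind of events as the output in this section.

```python
from html.parser import HTMLParser  # Python 3 location of the class


class MyHTMLParser(HTMLParser):
    """Records one message per parsing event (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.events = []  # illustrative: collected messages, in order

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, e.g. <title>
        self.events.append("Encountered a start tag: " + tag)

    def handle_endtag(self, tag):
        # Called for every closing tag, e.g. </title>
        self.events.append("Encountered an end tag : " + tag)

    def handle_data(self, data):
        # Called for text between tags
        self.events.append("Encountered some data : " + data)


parser = MyHTMLParser()
parser.feed("<html><head><title>Test</title></head>"
            "<body><h1>Parse me!</h1></body></html>")
parser.close()
print("\n".join(parser.events))
```

　　Only the three handlers are overridden; any event without a handler (comments, declarations, and so on) is silently ignored by the base class.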

　　Output

　　Encountered a start tag: html

　　Encountered a start tag: head

　　Encountered a start tag: title

　　Encountered some data : Test

　　Encountered an end tag : title

　　Encountered an end tag : head

　　Encountered a start tag: body

　　Encountered a start tag: h1

　　Encountered some data : Parse me!

　　Encountered an end tag : h1

　　Encountered an end tag : body

　　Encountered an end tag : html

　　If it is important to keep track of the structural position of the current event within the document, you will need to maintain a data structure with this information. If you are certain that the document you are processing is well-formed XHTML, a stack suffices. For example:

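　　If the idea of matching HTML tags with a stack is unfamiliar, it works the same way as the classic bracket-matching problem covered in data structures textbooks (《数据结构》).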
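　　The stack approach can be sketched as follows. This is a reconstruction under assumptions, not the original listing: it assumes Python 3 (html.parser), and the names ShowStructure, show_structure, and HTML_DOC, as well as the sample document itself, are illustrative. Each start tag is pushed, each end tag pops, and every non-blank piece of text is reported together with the path of currently open tags.

```python
from html.parser import HTMLParser

# Illustrative well-formed input; any strictly nested document would do.
HTML_DOC = """<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
   <i>Be strict in what you <b>send</b>.</i></a></p>
</body></html>
"""


class ShowStructure(HTMLParser):
    """Tracks open tags on a plain list used as a stack."""

    def __init__(self):
        super().__init__()
        self.tagstack = []  # currently open tags, outermost first
        self.lines = []     # one "path >> text" entry per text node

    def handle_starttag(self, tag, attrs):
        self.tagstack.append(tag)

    def handle_endtag(self, tag):
        # Safe only for well-formed input: the end tag always
        # matches the tag on top of the stack.
        self.tagstack.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            path = "".join("/" + t for t in self.tagstack)
            self.lines.append("%s >> %s" % (path, data[:40].strip()))


def show_structure(html):
    parser = ShowStructure()
    parser.feed(html)
    parser.close()
    return parser.lines


for line in show_structure(HTML_DOC):
    print(line)
```

　　Because handle_endtag pops blindly, a single unclosed or misnested tag would corrupt every path that follows, which is why this version is only appropriate for well-formed XHTML.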

　　Output

　　/html/head/title >> Advice

　　/html/body/p >> The

　　/html/body/p/a >> IETF admonishes:

　　/html/body/p/a/i >> Be strict in what you

　　/html/body/p/a/i/b >> send

　　/html/body/p/a/i >> .

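　　If the data to be processed is less well-formed, a more sophisticated stack is needed. We can define a new object that removes the most recent start tag matching a given end tag, and that also prevents unclosed <p> and <blockquote> elements from nesting inside one another. An application could do more work along these lines; the TagStack class is a good starting point.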
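　　A minimal sketch of such a TagStack, based only on the behavior described above, might look like this. It is an assumption-laden reconstruction: the attribute name lst follows the text, while PARAGRAPH_LEVEL and the choice of ValueError for a missing tag are illustrative decisions.

```python
class TagStack:
    """A tolerant tag stack for imperfect HTML (illustrative sketch)."""

    # Assumption: these are the tags that may not nest inside each other.
    PARAGRAPH_LEVEL = ("p", "blockquote")

    def __init__(self, lst=None):
        self.lst = list(lst) if lst is not None else []

    def __getitem__(self, pos):
        # Allows indexing and iteration: for tag in stack: ...
        return self.lst[pos]

    def append(self, tag):
        # A new paragraph-level tag implicitly closes any that are
        # still open, so drop them before pushing.
        if tag.lower() in self.PARAGRAPH_LEVEL:
            self.lst = [t for t in self.lst
                        if t not in self.PARAGRAPH_LEVEL]
        self.lst.append(tag)

    def pop(self, tag):
        # Remove the nearest (most recently opened) matching tag,
        # not necessarily the last item on the stack.
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            self.lst.reverse()  # restore order before reporting
            raise ValueError("Tag not on stack: %r" % tag)
        del self.lst[pos]
        self.lst.reverse()


stack = TagStack()
for tag in ("html", "body", "p", "b"):
    stack.append(tag)
stack.append("p")   # the unclosed <p> is dropped first
print(stack.lst)    # ['html', 'body', 'b', 'p']
stack.pop("b")      # removes <b> even though <p> is on top
print(stack.lst)    # ['html', 'body', 'p']
```

　　In a parser subclass, handle_starttag would call append(tag) and handle_endtag would call pop(tag), so a stray or missing end tag no longer throws the whole path off.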

　　A brief note on the pop method, which confused me when I was first learning Python:

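　　pop first reverses lst and then calls self.lst.index(tag). Note that index() returns the position of the first match, so on the reversed list it gives the position of the most recent start tag that matches the end tag.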
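　　The reverse-then-index step can be tried in isolation on a plain list. The helper pop_nearest below is hypothetical and exists only to illustrate the point; it also reverses the list back afterwards so the original order is preserved.

```python
def pop_nearest(lst, tag):
    """Remove the last occurrence of tag from lst, in place."""
    lst.reverse()                # most recent entries now come first
    try:
        pos = lst.index(tag)     # first match == most recent occurrence
        del lst[pos]
    finally:
        lst.reverse()            # restore the original order


tags = ["p", "b", "i", "b"]
pop_nearest(tags, "b")
print(tags)  # ['p', 'b', 'i']
```

　　Without the reversal, index("b") would return 1 and the older <b> would be removed instead of the one opened most recently.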

If you reproduce this article, please credit the source and include the original URL: http://www.divcss5.com/html/h60181.shtml