Python Web Scraping: Common BeautifulSoup Operations

Posted on 2017-01-31

Everything starts with getting a soup object

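# Python 2 (urllib2). get_soup is a hypothetical wrapper name; the bare
# `return` in the original only works inside a function.
import urllib2
from bs4 import BeautifulSoup

def get_soup(url_page):
    try:
        req = urllib2.Request(url_page)
        # The opener keeps cookies between requests and redirects
        url_opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        source_code = url_opener.open(req, timeout=10).read()
        # Pass the raw bytes: unicode(source_code) fails on non-ASCII pages,
        # while BeautifulSoup detects the encoding on its own
        soup = BeautifulSoup(source_code, 'html.parser')
    except Exception as e:
        # handle exceptions here
        print(e)
        return None
    return soup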
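The snippet above uses urllib2, which exists only in Python 2. Here is a rough Python 3 sketch of the same idea, assuming the bs4 package is installed. The names parse_html and fetch_soup are made up for illustration and are not part of any library.

```python
import urllib.request

from bs4 import BeautifulSoup


def parse_html(markup):
    """Build a soup from bytes or str using the built-in html.parser."""
    return BeautifulSoup(markup, "html.parser")


def fetch_soup(url_page, timeout=10):
    """Download url_page and return a soup, or None on URLError or timeout."""
    # Same cookie-aware opener as in the Python 2 version
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
    try:
        with opener.open(url_page, timeout=timeout) as response:
            return parse_html(response.read())
    except OSError as exc:  # URLError and socket timeouts subclass OSError
        print(exc)
        return None
```

Keeping the parsing step in its own function makes it easy to try the later operations on an HTML string without touching the network.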

Get the first DOM element of type div

dom = soup.find('div')

Get a DOM element that has a specific attribute

dom = soup.find('div', {'class': 'info-panel'})
dom = soup.find('a', {'name': 'selectDetail'})
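A small sketch of attribute lookups on a made-up HTML fragment, assuming Python 3 and bs4. Two details are worth showing. First, find returns None when nothing matches. Second, the HTML attribute called name has to go through the attrs dictionary, because the first parameter of find is itself called name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration
html = """
<div class="info-panel"><a name="selectDetail" title="Room A">Details</a></div>
<div class="other">ignored</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so the keyword form is spelled class_
panel = soup.find("div", class_="info-panel")

# The HTML attribute "name" must be passed through attrs
link = soup.find("a", attrs={"name": "selectDetail"})

# No match gives None, so check before using the result
missing = soup.find("div", {"class": "no-such-class"})
if missing is None:
    print("no match")
```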

Get all matching DOM elements

dom_list = soup.findAll('div', {'class': 'info-panel'})
for dom in dom_list:
    # do something to dom
    pass
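To make the loop concrete, here is a sketch that collects a heading from every matching panel, assuming Python 3 and bs4. The markup and the h2 structure are invented for the example. In bs4 the method is spelled find_all, and findAll is kept as an older alias.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration
html = """
<div class="info-panel"><h2>First</h2></div>
<div class="info-panel"><h2>Second</h2></div>
<div class="footer"><h2>Not a panel</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for dom in soup.find_all("div", {"class": "info-panel"}):
    heading = dom.find("h2")
    # A panel without a heading is skipped instead of raising an error
    if heading is not None:
        titles.append(heading.text.strip())

print(titles)
```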

Get the displayed text

dom.text.strip()
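When an element has nested children, dom.text.strip() only trims the two ends and leaves inner whitespace as it is. A sketch of one alternative, assuming Python 3 and bs4, is get_text with a separator and strip=True, which trims each piece of text before joining. The fragment is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with whitespace between the child elements
html = '<div class="info-panel">  <span>3 rooms</span>\n  <span>89 m2</span>  </div>'
soup = BeautifulSoup(html, "html.parser")
dom = soup.find("div")

# Trims only the ends; the newline between the spans survives
raw = dom.text.strip()

# Trims every text piece, drops empty ones, joins the rest with a space
clean = dom.get_text(" ", strip=True)

print(repr(raw))
print(repr(clean))
```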

Get the value of a DOM element's attribute

dom.attrs['title']
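dom.attrs['title'] raises KeyError when the attribute is absent. A short sketch of the other access styles, assuming Python 3 and bs4 and an invented link element:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only for illustration
html = '<a class="btn primary" href="/detail/1" title="Room A">Details</a>'
soup = BeautifulSoup(html, "html.parser")
dom = soup.find("a")

# Indexing the tag is shorthand for dom.attrs["title"]
title = dom["title"]

# get returns None (or a given default) instead of raising KeyError
target = dom.get("target")
rel = dom.get("rel", "none")

# class can hold several values, so bs4 returns it as a list
classes = dom["class"]
```

Using get is usually the safer choice in a scraper, since real pages often omit attributes on some elements.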