Scrape the Web

From Driscollwiki
Revision as of 11:44, 19 June 2010 by Driscoll (Talk | contribs)

Notes

  • Use Firebug to traverse the XHTML tree
  • "Scraping ... is reconstructing someone else's thought processes"

Parsers

  • BeautifulSoup
    • Relies on a deprecated parser (SGMLParser from sgmllib)
    • Works less well under Python 3 (Py3k)
    • Not actively maintained as of this revision
  • html5lib
    • Can return BeautifulSoup objects
  • lxml parse
    • XPath
      • Query language for selecting nodes in a DOM tree
      • Queries can break if the page's tree structure changes
    • CSS selectors (lxml.cssselect)
      • Useful for finding elements by CSS id or class
  • PyQuery
    • Traverse the tree in a similar fashion to jQuery
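
The note that XPath queries have problems when the tree changes can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, rather than lxml, so it runs without extra packages. The XHTML fragment, the function names, and the "redesign" are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment standing in for a fetched page.
PAGE = (
    "<html><body>"
    "<div id='menu'><ul>"
    "<li class='item'>eggplant</li>"
    "<li class='item'>tomato</li>"
    "</ul></div>"
    "</body></html>"
)


def names_by_position(root):
    # Positional path: spells out the exact nesting from the root down.
    return [li.text for li in root.findall("./body/div/ul/li")]


def names_by_class(root):
    # Attribute-based path: matches the class wherever the element sits.
    return [li.text for li in root.findall(".//li[@class='item']")]


original = ET.fromstring(PAGE)

# Same content after a hypothetical redesign wraps the list in an extra <div>.
redesigned = ET.fromstring(
    PAGE.replace("<ul>", "<div><ul>").replace("</ul>", "</ul></div>")
)

print(names_by_position(original), names_by_class(original))
print(names_by_position(redesigned), names_by_class(redesigned))
```

After the redesign the positional query finds nothing, while the class-based query still finds both items. Searching by id or class, as CSS selectors do, is less sensitive to this kind of change.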

Simplest scraper

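import urllib2

url = 'http://example.com/'  # placeholder; substitute the page to scrape
fd = urllib2.urlopen(url)    # Python 2 only; urllib2 became urllib.request in Python 3
print 'eggplant' in fd.read()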
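
The snippet above depends on urllib2, which exists only in Python 2. A rough Python 3 equivalent might look like the sketch below. The function name page_contains and the utf-8 default are assumptions for illustration, not part of the original notes.

```python
from urllib.request import urlopen


def page_contains(url, word, encoding="utf-8"):
    """Return True if `word` occurs in the body fetched from `url`."""
    with urlopen(url) as fd:
        # read() returns bytes in Python 3, so decode before searching.
        body = fd.read().decode(encoding, errors="replace")
    return word in body


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in a real page URL to scrape.
    print(page_contains("data:text/plain,eggplant", "eggplant"))
```

The encoding is assumed here. A more careful scraper would read it from the Content-Type response header.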

Anti-bot measures

  • An &s= parameter in the query string
    • Might be a token to verify that you aren't a bot
  • Also the Referer header (the HTTP spelling), which a site may check
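
As a minimal sketch of working with both measures, the request below carries an s parameter and a Referer header. The URL, the token value, and the helper name build_request are hypothetical, and whether a given site actually checks either one is a guess, as the notes say.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, referer):
    """Build (but do not send) a GET request that carries a Referer header."""
    url = base_url + "?" + urlencode(params)
    return Request(url, headers={"Referer": referer})


# Hypothetical values: in practice the s token would be copied from a link
# or form on a page fetched earlier from the same site.
req = build_request(
    "http://example.com/search",
    {"q": "eggplant", "s": "abc123"},
    referer="http://example.com/",
)

print(req.full_url)
print(req.get_header("Referer"))
```

Passing req to urllib.request.urlopen would send it. Building the request separately makes it easy to inspect the URL and headers first.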