
Scraping HTML with HandsomeSoup in Haskell

The Haskell XML Toolbox (HXT) is a wonderful tool for parsing XML and, by extension, HTML. But HTML has common attributes such as id and class that make CSS selectors very well suited to scraping. HandsomeSoup by Aditya Bhargava provides CSS selectors atop HXT.

This is not an introductory post on HandsomeSoup; I expect Aditya will write that himself. Instead, I would like to write a web scraper with it. Wallbase is a very beautiful website for wallpapers. I love good wallpapers, and I regularly scrape the toplist and download them. We'll write that scraper in Haskell today.

Aditya has also written a web scraper that downloads images, using bare HXT. Ours will look more or less like a modification of it.

The fromUrl function gives us an IOSArrow wrapped in IO. Using (>>>), we compose it with the css function, which takes a CSS selector. The result is another IOSArrow. We can continue this chain until we reach the element we want (although a single css call is almost always sufficient). The (!) function gives us an attribute of the matched elements, and runX runs the arrow and returns the results as a list.

main = do
  doc <- fromUrl "http://wallbase.cc/toplist"
  links <- runX $ doc >>> css ".thdraggable.thlink" ! "href"
  print links

The Wallbase toplist is a grid of thumbnails, each linking to the page of an individual wallpaper. The class="thdraggable thlink" attribute identifies the a tags we want, and the href attribute of each one gives us the link to the page holding the wallpaper.

We take this URL and extract the image, which is wrapped in a <div id="bigwall">. css "#bigwall > img" gets us to the image, and its src attribute gives us the actual image location. The downloadImage function takes that URL and saves the image. The part of the URL after the last '/' gives us the name of the file.
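
To make the file-name step concrete, here is a minimal sketch using only base. The explode helper is the same one used in the full program; fileNameFromUrl is a hypothetical name introduced just for this illustration, and the example.com URL is made up.

```haskell
-- Split a list on a separator, dropping empty segments
-- (same helper as in the full program).
explode :: Eq a => a -> [a] -> [[a]]
explode _ [] = []
explode x (x':xs) | x == x' = explode x xs
explode x xs = takeWhile (/= x) xs : explode x (dropWhile (/= x) xs)

-- Hypothetical helper: the name an image would be saved under.
-- Note: `last` throws on an empty URL, so this is only a sketch.
fileNameFromUrl :: String -> String
fileNameFromUrl = last . explode '/'

main :: IO ()
main = do
  let url = "http://example.com/walls/wallpaper-123.jpg"
  print (explode '/' url)
  putStrLn (fileNameFromUrl url)
```

Because explode skips consecutive separators, the "//" after the scheme does not produce an empty segment, and the last segment is the file name.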

import Text.HandsomeSoup
import Text.XML.HXT.Core
import qualified Data.ByteString.Char8 as B
import Control.Monad.Maybe

explode :: Eq a => a -> [a] -> [[a]]
explode _ [] = []
explode x (x':xs) | x == x' = explode x xs
explode x xs = takeWhile (/=x) xs : explode x (dropWhile (/=x) xs)

downloadImage :: String -> IO ()
downloadImage url = do
  putStrLn $ "Downloading " ++ url
  content <- runMaybeT $ openUrl url
  case content of
    Nothing -> putStrLn $ "Error: " ++ url
    Just content' -> do
      let name = last $ explode '/' url
      B.writeFile name (B.pack content')

extractImage :: String -> IO String
extractImage url = do
  putStrLn $ "Extracting from " ++ url
  doc <- fromUrl url
  link <- runX $ doc >>> css "#bigwall > img" ! "src"
  return $ head link

main = do
  doc <- fromUrl "http://wallbase.cc/toplist"
  links <- runX $ doc >>> css ".thdraggable.thlink" ! "href"
  mapM_ (\l -> extractImage l >>= downloadImage) links
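
One caveat: extractImage calls head on the list of matches, which throws an exception if the selector matches nothing (for example, if the page layout changes). Below is a minimal sketch of a safer pattern using only base. The names firstMatch and describeMatch are hypothetical, and the plain lists stand in for what runX would return.

```haskell
import Data.Maybe (listToMaybe)

-- Hypothetical helper: the first selector match, if there is one.
-- The input list stands in for the result of runX.
firstMatch :: [String] -> Maybe String
firstMatch = listToMaybe

-- Hypothetical helper: what a caller might do with the result.
describeMatch :: [String] -> String
describeMatch srcs = case firstMatch srcs of
  Nothing  -> "No image found"
  Just src -> "Downloading " ++ src

main :: IO ()
main = do
  putStrLn (describeMatch [])
  putStrLn (describeMatch ["http://example.com/a.jpg"])
```

In the scraper this would mean extractImage returning IO (Maybe String) and the caller skipping pages where no image was found.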

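Replace mapM_ with parallel_ from the parallel-io package and suddenly the images are downloaded in parallel. Since parallel_ takes a list of IO actions rather than a function and a list, the call becomes parallel_ $ map (\l -> extractImage l >>= downloadImage) links. HandsomeSoup gives us a very easy way to scrape the web using Haskell.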
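
For readers who want to see the shape of the concurrent version without installing anything, here is a rough sketch using only base's Control.Concurrent. It is a stand-in for parallel_, not its implementation: runAll_ is a hypothetical name, it starts one thread per action with no limit, and the putStrLn actions are placeholders for extractImage l >>= downloadImage.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Exception (finally)

-- Hypothetical stand-in for parallel_: run every action in its own
-- thread and block until all of them have finished.
runAll_ :: [IO a] -> IO ()
runAll_ actions = do
  dones <- mapM fork actions
  mapM_ takeMVar dones
  where
    fork act = do
      done <- newEmptyMVar
      -- Signal completion even if the action throws.
      _ <- forkIO ((act >> return ()) `finally` putMVar done ())
      return done

main :: IO ()
main = do
  let urls = ["http://example.com/1.jpg", "http://example.com/2.jpg"]
  -- In the scraper the action would be: extractImage l >>= downloadImage
  runAll_ (map (\l -> putStrLn ("Fetching " ++ l)) urls)
  putStrLn "All done"
```

The order of the "Fetching" lines is not guaranteed, since the threads run concurrently. For a long toplist, an unbounded number of threads is not polite to the server, which is one reason to prefer a pooled approach in practice.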