The Haskell XML Toolbox (HXT) is a wonderful tool for parsing XML and, by extension, HTML. But HTML has common attributes like
id and class that make CSS selectors very suitable for scraping. HandsomeSoup by Aditya Bhargava provides CSS selectors atop HXT.
This is not an introductory post on HandsomeSoup; I guess that will be done by Aditya himself. Instead, I would like to write a web scraper using it. Wallbase is a very beautiful website for wallpapers. I love good wallpapers, and I regularly scrape the toplist and download them. We'll write that scraper in Haskell today.
Now, a web scraper that downloads images was also written by Aditya. It uses bare HXT. Ours will look more or less like a modification of it.
The fromUrl function gives us an IOSArrow wrapped in an IO. We pass it through the (>>>) function to the css function, which takes CSS selectors. The result is another IOSArrow. We can keep extending this chain until we reach our element (although even one css is almost always sufficient). The (!) function gives us attributes of the matched elements.

    main = do
        doc <- fromUrl "http://wallbase.cc/toplist"
        links <- runX $ doc >>> css ".thdraggable.thlink" ! "href"
        putStrLn $ show links

The Wallbase toplist is made up of thumbnails that lead to the individual pages of those wallpapers. The class="thdraggable thlink" on the a tag lets us select that tag, and the href attribute gives us the link to the page holding the wallpaper.
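
To try the selector without hitting the network, here is a minimal sketch that runs the same css and (!) chain over an inline HTML string using HandsomeSoup's parseHtml. The sample markup and the hrefsIn name are made up for illustration; the real Wallbase page may be structured differently.

```haskell
import Text.HandsomeSoup (css, parseHtml, (!))
import Text.XML.HXT.Core (runX, (>>>))

-- Hypothetical sample markup, loosely modelled on the thumbnails described above.
sampleHtml :: String
sampleHtml = concat
  [ "<div>"
  , "<a class=\"thdraggable thlink\" href=\"/wallpaper/1\">one</a>"
  , "<a class=\"thdraggable thlink\" href=\"/wallpaper/2\">two</a>"
  , "<a class=\"other\" href=\"/about\">about</a>"
  , "</div>"
  ]

-- Hypothetical helper: collect the href of every element matching the
-- selector used in the post.
hrefsIn :: String -> IO [String]
hrefsIn html = runX $ parseHtml html >>> css ".thdraggable.thlink" ! "href"
```

Loading this in GHCi and evaluating hrefsIn sampleHtml should list only the links that carry both classes, which makes it easy to experiment with selectors before pointing the scraper at a live page.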
We take this URI and extract the image, which is wrapped in an element with the id bigwall. css "#bigwall > img" gets us to the image, and the src attribute gives us the actual image location. The downloadImage function takes the URL and saves the image. The part of the URL after the last '/' gives us the name of the file.

    import Text.HandsomeSoup
    import Text.XML.HXT.Core
    import qualified Data.ByteString.Char8 as B
    import Control.Monad.Maybe  -- provides runMaybeT

    -- Split a list on a separator, dropping the separators themselves.
    explode :: Eq a => a -> [a] -> [[a]]
    explode _ [] = []
    explode x (x':xs) | x == x' = explode x xs
    explode x xs = takeWhile (/=x) xs : explode x (dropWhile (/=x) xs)

    -- Fetch the image and save it under the last path segment of its URL.
    downloadImage :: String -> IO ()
    downloadImage url = do
        putStrLn $ "Downloading " ++ url
        content <- runMaybeT $ openUrl url
        case content of
            Nothing -> putStrLn $ "Error: " ++ url
            Just content' -> do
                let name = last $ explode '/' url
                B.writeFile name (B.pack content')

    -- Open a wallpaper page and return the URL of the full-size image.
    extractImage :: String -> IO String
    extractImage url = do
        putStrLn $ "Extracting from " ++ url
        doc <- fromUrl url
        link <- runX $ doc >>> css "#bigwall > img" ! "src"
        return $ head link

    main = do
        doc <- fromUrl "http://wallbase.cc/toplist"
        links <- runX $ doc >>> css ".thdraggable.thlink" ! "href"
        mapM_ (\l -> extractImage l >>= downloadImage) links

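
The explode helper is only there to pick out the part of the URL after the last '/'. As a rough alternative, the same step can be sketched with base alone. The fileNameFromUrl name is hypothetical and not part of the scraper above, and the example URLs are made up.

```haskell
-- Hypothetical helper: everything after the last '/' in a URL.
-- Works by reversing the string, taking characters up to the first '/',
-- and reversing the result back.
fileNameFromUrl :: String -> String
fileNameFromUrl = reverse . takeWhile (/= '/') . reverse

-- A variant that refuses URLs with no usable file name (for example,
-- ones ending in '/'), so the caller can skip them instead of writing
-- to an empty path.
safeFileName :: String -> Maybe String
safeFileName url =
  case fileNameFromUrl url of
    ""   -> Nothing
    name -> Just name
```

One difference worth knowing: for a URL ending in '/', explode followed by last returns the segment before the trailing slash, while this sketch yields an empty name (Nothing from safeFileName). Neither version strips query strings.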
Run the downloads with parallel_ from the parallel-io package instead of mapM_, and suddenly the images are downloaded in parallel. HandsomeSoup gives us a very easy way to scrape the web using Haskell.
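
A minimal sketch of the parallel variant, assuming parallel_ and stopGlobalPool from Control.Concurrent.ParallelIO.Global in parallel-io. The downloadAll name and its fetch parameter are hypothetical; in the scraper above, fetch would be \l -> extractImage l >>= downloadImage.

```haskell
import Control.Concurrent.ParallelIO.Global (parallel_, stopGlobalPool)

-- Hypothetical wrapper: run one action per link on the global thread pool.
-- parallel_ takes a list of IO actions, so we map the per-link action over
-- the links first rather than passing a function and a list as with mapM_.
downloadAll :: (String -> IO ()) -> [String] -> IO ()
downloadAll fetch links = do
  parallel_ (map fetch links)
  -- Shut the pool down once no more parallel work follows.
  stopGlobalPool
```

In main this would replace the mapM_ line with downloadAll (\l -> extractImage l >>= downloadImage) links. The program likely needs to be built with -threaded and run with +RTS -N for the work to actually spread across cores.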