Crawling UTF-8 pages using the Symfony2 DomCrawler component

04 April 2013

Just a small gotcha for anyone using Symfony2’s DomCrawler component. The standard usage of the class (from the current docs) is:

    use Symfony\Component\DomCrawler\Crawler;

    // $html holds the markup to crawl, as a string
    $crawler = new Crawler($html);

    foreach ($crawler as $domElement) {
        print $domElement->nodeName;
    }

However, this will assume the document is encoded as ISO-8859-1, so non-ASCII characters in a UTF-8 page come out garbled. If you want to crawl a UTF-8 page correctly, do it like so:

    $crawler = new Crawler();
    // Pass the charset explicitly so the content is not treated as ISO-8859-1
    $crawler->addHtmlContent(file_get_contents('http://www.columbia.edu/~fdc/utf8/'), 'UTF-8');

    foreach ($crawler as $domElement) {
        print $domElement->nodeName;
    }

The second parameter to addHtmlContent defaults to UTF-8, but I’ve added it to illustrate that you could use other character sets too.
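To make the charset parameter a bit more concrete, here is a minimal sketch that parses the same word once as ISO-8859-1 bytes and once as UTF-8 bytes. The sample markup and variable names are made up for illustration, and it assumes the component was installed with Composer so that `vendor/autoload.php` exists next to the script.

```php
<?php
// Assumption: symfony/dom-crawler was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical sample markup. "\xE9" is the single ISO-8859-1 byte for "é".
$latin1Html = "<html><body><p>caf\xE9</p></body></html>";

$latin1Crawler = new Crawler();
$latin1Crawler->addHtmlContent($latin1Html, 'ISO-8859-1');

// filterXPath() is used so that symfony/css-selector is not required.
$latin1Text = $latin1Crawler->filterXPath('//p')->text();

// The same word as UTF-8 bytes, parsed with the default charset.
$utf8Html = "<html><body><p>caf\xC3\xA9</p></body></html>";

$utf8Crawler = new Crawler();
$utf8Crawler->addHtmlContent($utf8Html);

$utf8Text = $utf8Crawler->filterXPath('//p')->text();

// Both strings should now hold the same UTF-8 text.
print $latin1Text . "\n";
print $utf8Text . "\n";
```

If the charset passed in matches the actual encoding of the bytes, both crawlers should hand back the same text. Passing the wrong one is what produces the garbled characters described above.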