Dealing with different character sets

I've been working on a project that uses MySQL, PHP and QueryPath (an XML/HTML document parser) to store content retrieved from web pages in MySQL tables. I retrieve the pages with standard cURL, parse them with QueryPath, and then insert the data into various tables in a MySQL database.

Everything worked great with US and Canadian websites. Then I tried the same thing on some UK websites, and weird, seemingly random issues started popping up that I couldn't explain. One problem was that my MySQL tables weren't using the UTF-8 character set, which was an easy fix. But even after that, some pages parsed correctly and others didn't, for no apparent reason.
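I never pinned down which characters were tripping things up, but a plausible culprit (an assumption on my part) is something like the pound sign, which is a single byte in ISO-8859-1 but two bytes in UTF-8. A minimal sketch of that mismatch, using PHP's mbstring extension and a made-up sample string:

```php
<?php
// Assumption: the page was served as ISO-8859-1, where the pound sign is the single byte 0xA3.
$latin1 = "Price: \xA3" . "10";

// A lone 0xA3 byte is not valid UTF-8, so anything expecting UTF-8 sees garbage here.
$isValidBefore = mb_check_encoding($latin1, 'UTF-8');

// Converting from the real source encoding yields the two-byte UTF-8 form 0xC2 0xA3.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$isValidAfter = mb_check_encoding($utf8, 'UTF-8');
```

Pages that happen to contain only plain ASCII are byte-for-byte identical in both encodings, which would explain why some pages parsed fine and others didn't.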

After banging my head against the wall for a few hours, it occurred to me that this could also be a character set problem, even though I had never run into it with the US websites. I simply changed the options on my QueryPath object as follows:

$qp = htmlqp($html, NULL, array('convert_to_encoding' => 'utf-8'));

The weird thing is that QueryPath should be able to detect the character set on its own, since it's declared in the document itself, but apparently it doesn't always. Anyway, this fix solved my problems.
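If you'd rather not depend on the parser's detection at all, another option is to normalize the HTML to UTF-8 yourself before parsing. The following is only a rough sketch: `html_to_utf8()` is a hypothetical helper of my own naming, not part of QueryPath, and it assumes the page declares its charset in a `<meta>` tag and that ISO-8859-1 is a sensible fallback when it doesn't.

```php
<?php
/**
 * Hypothetical helper (not part of QueryPath): convert raw HTML to UTF-8
 * based on the charset the document declares in a <meta> tag.
 */
function html_to_utf8(string $html, string $fallback = 'ISO-8859-1'): string
{
    $charset = $fallback;

    // Matches <meta charset="..."> as well as the older
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        $charset = $m[1];
    }

    // Nothing to do if the page already claims to be UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $html;
    }

    // An unknown charset name raises ValueError on PHP 8 and returns false
    // (with a warning) on older versions; handle both by using the fallback.
    try {
        $converted = @mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (\ValueError $e) {
        $converted = false;
    }

    return $converted === false
        ? mb_convert_encoding($html, 'UTF-8', $fallback)
        : $converted;
}
```

The result could then be handed to the parser, for example `htmlqp(html_to_utf8($html))`. Note that the `<meta>` tag in the converted string still names the old charset, and that this sketch ignores the charset in the HTTP Content-Type header, which may disagree with the tag.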