PostgreSQL unicode
I was having trouble storing unicode strings in my postgresql database.
DISCLAIMER: I’m still really struggling with Unicode in Python, but I think I got a little closer to understanding today!
When I installed and configured PostgreSQL for the first time, the default setting for character set was LATIN-1
, which is annoying. I kept running into problems saving bookmarks in subMarks when the page titles had unicode characters in them (the emdash has a lot to answer for).
In order to fix this, I had to convert my database to UTF-8
. You can’t actually do this though, instead, you have to create a new database, and dump the data into it. Creating a new UTF-8
database is fairly straightforward:
$ createdb -E UTF8 -T template0 NEW_DB_NAME
I used some instructions I found for transferring the data, however, I didn’t drop the cluster and create a new one - I have other databases running and I really couldn’t be bothered with all the hassle, I just created a new database and left the old one where it was. Basically, I did this:
$ pg_dump --blobs --oids OLD_DB_NAME > OLD_DB_NAME.latin1.sql
$ pg_dump --blobs --oids --encoding=UTF-8 OLD_DB_NAME > OLD_DB_NAME.utf8.sql
then:
psql --set ON_ERROR_STOP=on NEW_DB_NAME < OLD_DB_NAME.utf8.sql
That all went fine, but the next step was causing me some serious problems: connecting to the database from my python web app.
Since I’m running PostgreSQL 8.4, I couldn’t specify the client_encoding
as part of the DSN, it needs a separate step:
conn = psycopg2.connect(app.config['DSN'])
conn.set_client_encoding('UTF8')
But that isn’t quite enough. UTF-8 is not Unicode so Python was getting upset with the encoding of the strings it got from the database.
My Googlefu was strong at this point though, I discovered the missing piece in a bug filing for Django. Adding the line:
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
to the top of my database code did the job. You can view the code change in this commit.
This got me a little closer to finally getting Python Unicode!