Misanthrope's Thoughts: 03/01/2012

Wednesday, March 28, 2012

I'm a Father Now

On the 23rd of March my wife gave birth to a beautiful girl. We are going to name her Alice (the last name left after we crossed all the others off the list one by one). But I have not yet abandoned my insidious plan to register her (secretly from her mom) as Hatshepsut ;-)

Sunday, March 11, 2012

QGIS and GDAL>=1.9 Encoding Issue: a Workaround

After the significant changes to encoding handling in GDAL>=1.9, it became quite challenging to handle non-Latin attributes (Cyrillic in particular) stored in the .dbf part of a shapefile. So most users of Cyrillic and other non-Latin scripts have to stick with GDAL 1.8.1 for now.

The cause of the issue is described here (Rus). Briefly: GDAL>=1.9 attempts to re-encode the .dbf file to UTF-8 on the basis of the LDID (Language Driver ID) written in the .dbf header. Unfortunately, the LDID is usually missing; in particular, QGIS does not write it to the .dbf files it creates. When the LDID is missing, GDAL>=1.9 assumes that the encoding of the .dbf file is ISO8859_1 (Latin-1), which makes non-Latin characters unreadable.
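To check whether a particular .dbf file has the LDID set, you can look at the header directly. The sketch below assumes the usual dBASE layout, in which the LDID is the single byte at offset 29 of the 32-byte file header; the function name `read_ldid` is my own, not part of GDAL or QGIS. A result of 0 means no LDID was written.

```python
LDID_OFFSET = 29  # assumed position of the Language Driver ID byte in the dBASE header


def read_ldid(dbf_path):
    """Return the LDID byte of a .dbf file as an int (0 means 'not set')."""
    with open(dbf_path, "rb") as f:
        header = f.read(32)  # the fixed part of the header is 32 bytes long
    if len(header) < 32:
        raise ValueError("file is too short to be a .dbf")
    return header[LDID_OFFSET]
```

For example, `read_ldid("layer.dbf")` (a hypothetical file name) returning 0 would suggest the file is affected by the issue described above.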

The workaround I'm currently using is to create an additional .cpg file that contains the ID of the encoding used. For example, if the encoding is Windows-1251, the .cpg file contains the following record: "1251" (without quotes). When the .cpg file is present, GDAL>=1.9 + QGIS works just fine.
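If you have many shapefiles, creating the .cpg files by hand gets tedious. Here is a minimal sketch of how it could be scripted; the helper name `write_cpg` and the default ID "1251" are just my assumptions for illustration, so pass whatever ID fits your data.

```python
from pathlib import Path


def write_cpg(shp_path, encoding_id="1251"):
    """Create <name>.cpg next to <name>.shp containing only the encoding ID."""
    cpg_path = Path(shp_path).with_suffix(".cpg")
    # The file holds nothing but the ID itself, without quotes.
    cpg_path.write_text(encoding_id, encoding="ascii")
    return cpg_path
```

Calling `write_cpg("roads.shp")` (a hypothetical file name) would create `roads.cpg` in the same folder with the content 1251.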

UPD: on some OSes you will need to use the ID from the Additional ID column instead of the Encoding ID column.

UPD2: On Windows you may also try the "unofficial" version of QGIS from here (with the encoding issue solved). Its installer may be in Russian, though.

UPD3: There is another workaround. You can open the .dbf file in LibreOffice Calc (OpenOffice Calc), specifying the needed encoding, and save it from there. This writes the necessary header to the .dbf file, and QGIS will then display the attributes correctly. Note that this also converts field names to upper case.

UPD4: there is a plugin available for fixing the encoding.

Here is a table of the encoding IDs (taken from here):

Encoding ID	Encoding name	Additional ID	Other names
1252	Western	iso-8859-1 except when 128-159 is used, use "Windows-1252"	iso8859-1, iso_8859-1, iso-8859-1, ANSI_X3.4-1968, iso-ir-6, ANSI_X3.4-1986, ISO_646, irv:1991, ISO646-US, us, IBM367, cp367, csASCII, latin1, iso_8859-1:1987, iso-ir-100, ibm819, cp819, Windows-1252
20105		us-ascii	us-ascii, ascii
28592	Central European (ISO)	iso-8859-2	iso8859-2, iso-8859-2, iso_8859-2, latin2, iso_8859-2:1987, iso-ir-101, l2, csISOLatin2
1250	Central European (Windows)	Windows-1250	Windows-1250, x-cp1250
1251	Cyrillic (Windows)	Windows-1251	Windows-1251, x-cp1251
1253	Greek (Windows)	Windows-1253	Windows-1253
1254	Turkish (Windows)	Windows-1254	Windows-1254
932	Japanese (Shift-JIS)	shift_jis	shift_jis, x-sjis, ms_Kanji, csShiftJIS, x-ms-cp932
51932	Japanese (EUC)	x-euc-jp	Extended_UNIX_Code_Packed_Format_for_Japanese, csEUCPkdFmtJapanese, x-euc-jp, x-euc
50220	Japanese (JIS)	iso-2022-jp	csISO2022JP, iso-2022-jp
1257	Baltic (Windows)	Windows-1257	windows-1257
950	Traditional Chinese (BIG5)	big5	big5, csbig5, x-x-big5
936	Simplified Chinese (GB2312)	gb2312	GB_2312-80, iso-ir-58, chinese, csISO58GB231280, csGB2312, gb2312
20866	Cyrillic (KOI8-R)	koi8-r	csKOI8R, koi8-r
949	Korean (KSC5601)	ks_c_5601	ks_c_5601, ks_c_5601-1987, korean, csKSC56011987
1255 (logical)	Hebrew (ISO-logical)	Windows-1255	iso-8859-8i
1255 (visual)	Hebrew (ISO-Visual)	iso-8859-8	ISO-8859-8 Visual, ISO-8859-8 , ISO_8859-8, visual
862	Hebrew (DOS)	dos-862	dos-862
1256	Arabic (Windows)	Windows-1256	Windows-1256
720	Arabic (DOS)	dos-720	dos-720
874	Thai	Windows-874	Windows-874
1258	Vietnamese	Windows-1258	Windows-1258
65001	Unicode UTF-8	UTF-8	UTF-8, unicode-1-1-utf-8, unicode-2-0-utf-8
65000	Unicode UTF-7	UNICODE-1-1-UTF-7	utf-7, UNICODE-1-1-UTF-7, csUnicode11UTF7, utf-7
50225	Korean (ISO)	ISO-2022-KR	ISO-2022-KR, csISO2022KR
52936	Simplified Chinese (HZ)	HZ-GB-2312	HZ-GB-2312
28594	Baltic (ISO)	iso-8859-4	ISO_8859-4:1988, iso-ir-110, ISO_8859-4, ISO-8859-4, latin4, l4, csISOLatin4
28595	Cyrillic (ISO)	iso_8859-5	ISO_8859-5:1988, iso-ir-144, ISO_8859-5, ISO-8859-5, cyrillic, csISOLatinCyrillic, csISOLatin5
28597	Greek (ISO)	iso-8859-7	ISO_8859-7:1987, iso-ir-126, ISO_8859-7, ISO-8859-7, ELOT_928, ECMA-118, greek, greek8, csISOLatinGreek
28599	Turkish (ISO)	iso-8859-9	ISO_8859-9:1989, iso-ir-148, ISO_8859-9, ISO-8859-9, latin5, l5, csISOLatin5
