How To: Read and write shapefile and dBASE files encoded in various code pages
Esri has implemented a 'CODE PAGE CONVERSION' functionality in ArcGIS for Desktop (ArcMap, ArcCatalog, and ArcToolbox) that allows the Desktop applications to read and write shapefile and dBASE files encoded in various code pages. The code page conversion functionality for dBASE files (called 'dbfDefault') is activated by specifying a code page value in the system registry. This is very similar to the &CODEPAGE function used in ArcInfo Workstation.
Prior to ArcGIS 10.2.1, the following procedures can be used to set the desired code page behavior. If ArcGIS for Desktop 10.2.1 or 10.2.2 has been installed, download and install the patches described in Knowledge Base article 42646 before following these instructions.
Note: The header of each shapefile's .DBF file includes a reference to a code page. Prior to ArcGIS 10.2.1, the code page used corresponded to the user's locale. For example, if the user is on a Japanese locale, the code page used in the .DBF file is 'Shift-JIS'. At ArcGIS 10.2.1, the default code page for the shapefile (.DBF) is UTF-8 (Unicode). This is consistent with current internationalization practices and should ensure the data is readable.
What does the dbfDefault setting do?
By setting a code page value in the system registry, users are able to read and write shapefile and dBASE files encoded in that code page. For example, users can export a shapefile encoded in OEM by setting the code page registry value to OEM. Users can also read shapefiles and dBASE files that do not have the code page information stored in the file as long as users know which code page the file is encoded in.
Why set the dbfDefault?
When opening a shapefile or dBASE file, the ArcGIS for Desktop programs look at the Language Driver ID (LDID) in the header of the dBASE file, or at an associated *.CPG file, either of which can define the code page of the file being read. Based on the code page information retrieved, ArcGIS for Desktop performs a code page conversion, if necessary, and displays the strings accordingly. If a dBASE file has neither an LDID nor a .CPG file, ArcGIS assumes the file is encoded in the Windows (ANSI/Multi-byte) code page.
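To check which code page hints a file actually carries, the two sources mentioned above can be inspected directly. The following is a minimal Python sketch, not part of ArcGIS: the function name read_code_page_hints is hypothetical, and it relies on the dBASE header convention that byte 29 holds the Language Driver ID, with 0 taken to mean that none is set.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumption: standard dBASE layout, where byte 29 of the 32-byte
# header is the Language Driver ID (0 = no language driver recorded).
LDID_OFFSET = 29


def read_code_page_hints(dbf_path: str) -> Tuple[int, Optional[str]]:
    """Return (ldid, cpg_text) for a .dbf file.

    cpg_text is None when no sidecar .cpg file exists.
    """
    path = Path(dbf_path)
    with path.open("rb") as f:
        header = f.read(32)  # fixed-size part of the dBASE header
    if len(header) < 32:
        raise ValueError(f"{path} is too short to be a dBASE file")
    ldid = header[LDID_OFFSET]

    cpg_text = None
    # Look for a sidecar file with the same base name, in either casing.
    for suffix in (".cpg", ".CPG"):
        cpg_path = path.with_suffix(suffix)
        if cpg_path.exists():
            cpg_text = cpg_path.read_text(encoding="ascii", errors="replace").strip()
            break
    return ldid, cpg_text
```

A result of (0, None) indicates a file with no code page information, which is the case the dbfDefault setting is meant to address.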
If the Desktop programs read a dBASE file that is encoded in OEM but has neither an LDID nor a .CPG file, the characters do not display correctly. Because no code page information can be found, the Desktop programs assume the file is encoded in the ANSI code page when it is actually encoded in OEM, which causes an incorrect display of the 8-bit characters stored in the file.
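The effect of treating OEM bytes as ANSI can be reproduced outside ArcGIS. The Python sketch below uses code page 850 as an example OEM page and code page 1252 as an example ANSI page; these two pages are illustrative choices, and the actual pages depend on the system locale.

```python
# A string with an 8-bit character, stored the way an OEM-writing
# program would store it (code page 850 is the example OEM page).
original = "Café"
oem_bytes = original.encode("cp850")

# Misread: the same bytes interpreted as ANSI (code page 1252 here).
misread = oem_bytes.decode("cp1252")

# Correct: decode with the code page the data was really written in.
correct = oem_bytes.decode("cp850")

print(repr(misread))
print(repr(correct))
```

The 7-bit characters survive either way; only the 8-bit character is displayed incorrectly, which matches the behavior described above.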
Most shapefiles and dBASE files should have the code page information stored in the file. Some programs, such as Microsoft Access 2000 and Excel 2000, encode dBASE files in OEM but do not include the code page information in the LDID, so ArcGIS does not read the files correctly. To avoid this problem, users can set the dbfDefault to the appropriate code page before opening a file that lacks the code page information.
How does the dbfDefault work?
It is important to note that there is one exception to the dbfDefault setting: shapefiles exported from coverages in ArcCatalog and ArcToolbox in languages other than Spanish and Arabic are encoded in OEM, regardless of the dbfDefault setting. This is because 'Coverage to Shapefile' in ArcToolbox uses the functionality of ArcInfo Workstation, which are defined layers that run on DOS, so the output file is always encoded in the OEM (DOS) code page. Shapefiles exported from coverages in ArcCatalog and ArcToolbox in Spanish and Arabic are encoded in ANSI.
The same logic applies to shapefile and dBASE files that are read into ArcGIS for Desktop; if a shapefile or a dBASE file lacks both an LDID and a .cpg file, ArcGIS assumes the file is encoded in the code page defined by dbfDefault. For example, if the dbfDefault value is set to OEM and a dBASE file lacks both an LDID and a .cpg file, ArcGIS for Desktop assumes the file is encoded in OEM, and therefore performs a code page conversion to display the 8-bit characters in ArcMap and ArcCatalog (since both of the applications are Windows programs that use the ANSI code page to display strings).
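The read-side decision described above can be summarized as a small Python sketch. It is an illustration of the documented behavior, not ArcGIS code: the function name is hypothetical, and checking the LDID before the .cpg file is an assumption, since the article does not state which of the two takes precedence.

```python
from typing import Optional, Tuple


def effective_code_page(
    ldid: int, cpg: Optional[str], dbf_default: Optional[str]
) -> Tuple[str, str]:
    """Return (source, value) describing which code page hint applies."""
    if ldid != 0:
        # The file header carries a Language Driver ID (assumed to win).
        return ("ldid", f"{ldid:#04x}")
    if cpg:
        # A sidecar .cpg file names the code page.
        return ("cpg", cpg)
    if dbf_default:
        # No information in the file: fall back to the registry value.
        return ("dbfDefault", dbf_default)
    # Nothing set anywhere: the Windows (ANSI) code page is assumed.
    return ("assumed", "ANSI")
```

For example, a file with no LDID and no .cpg file resolves to ("dbfDefault", "OEM") when dbfDefault is set to OEM, and to ("assumed", "ANSI") when it is not set.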
Note: If users have the dbfDefault value set to a certain code page, all shapefiles and dBASE files exported in ArcGIS are encoded in that code page. All shapefiles and dBASE files that do not have the code page information are assumed to be in that code page as well. Therefore, it is important to set the dbfDefault value back to its default value (no value) when the task completes.
What are the programs that dbfDefault can be used with?
ArcGIS for Desktop is the only program that is affected by the dbfDefault setting. Other programs, such as ArcInfo Workstation and ArcView 3.x, or other code page settings such as the '&CODEPAGE' function used in ArcInfo Workstation and the Code Page Profile used in ArcView 3.x, are not affected.
In ArcInfo Workstation:
- ARCSHAPE with &CODEPAGE OEM creates a shapefile in OEM
- ARCSHAPE with &CODEPAGE ANSI creates a shapefile in ANSI
- INFODBASE with &CODEPAGE OEM creates a dBASE file in OEM
- INFODBASE with &CODEPAGE ANSI creates a dBASE file in ANSI
- Shapefile and dBASE files are saved in the ANSI code page.
Shapefile and dBASE files are the only data formats that can be used by the dbfDefault setting to specify the code page. Other data formats, such as coverage and personal geodatabase, are not affected by the dbfDefault setting.
In ArcGIS for Desktop (regardless of the dbfDefault setting):
- Personal geodatabases are saved in Unicode
- Personal geodatabase tables are saved in Unicode
- Coverages are saved in the ISO code page
- INFO files are saved in the ISO code page
- Interchange files are saved in the ANSI code page
- Text files are saved in the ANSI code page
Instructions provided describe how to set the dbfDefault value in the system registry. Two options are listed below.
Warning: The instructions below include making changes to essential parts of your operating system. It is recommended that you back up your operating system and files, including the registry, before proceeding. Consult with a qualified computer systems professional, if necessary. Esri cannot guarantee results from incorrect modifications while following these instructions; therefore, use caution and proceed at your own risk.
- Add two keys called 'Common' and 'CodePage' in the system registry.
To add a key:
- Open the Registry Editor: Click Start > Run, type 'regedit', and click OK.
- In the registry tree (in the left pane of the registry window), go to 'My Computer\HKEY_CURRENT_USER\Software\ESRI', and click the registry key 'Desktop10.x'. For ArcGIS Pro, click the registry key 'Pro1.0'. (For version 9.3.1 and earlier versions, go to 'My Computer\HKEY_CURRENT_USER\Software', and click the registry key ESRI.)
- Add a new key called 'Common': on the Edit menu, navigate to New, select Key, type the name 'Common', and press ENTER.
- Click the registry key just created (Common), and add a new key called 'CodePage'.
- Add a new string value, 'dbfDefault', to the CodePage key.
To add a string value:
- Click the key CodePage.
- On the Edit menu, navigate to New, and select 'String Value'.
- Type 'dbfDefault' for the new value, and press ENTER.
The new CodePage key should now contain a string value named 'dbfDefault' in addition to the (Default) entry.
- Enter a code page value.
- Select the entry just added; it is important that dbfDefault is selected and not (Default).
- On the Edit menu, click Modify.
- In Value data, type the new code page value, and click OK.
The following are lists of supported code page identifiers (these are not case-sensitive).
- OEM Code Page Identifiers
708 - Arabic (ASMO 708)
720 - Arabic (Transparent ASMO), Arabic (DOS)
737 - Greek, Greek (DOS)
775 - Baltic, Baltic (DOS)
850 - Multi-lingual Latin 1, Western European (DOS)
852 - Latin 2, Central European (DOS)
855 - Cyrillic
857 - Turkish, Turkish (DOS)
860 - Portuguese, Portuguese (DOS)
861 - Icelandic, Icelandic (DOS)
862 - Hebrew, Hebrew (DOS)
863 - French Canadian, French Canadian (DOS)
864 - Arabic, Arabic (864)
865 - Nordic, Nordic (DOS)
866 - Russian, Cyrillic (DOS)
869 - Modern Greek, Modern Greek (DOS)
932 - Japanese, Japanese (Shift-JIS)
936 - Chinese (simplified): People's Republic of China, Singapore
949 - Korean (Unified Hangul Code)
950 - Traditional Chinese: Taiwan, Hong Kong, People's Republic of China
ALARABI - Sets the code page to 448
- ANSI Code Page Identifiers
1251 - Cyrillic
1252 - Western European
1253 - Greek
1254 - Turkish
1255 - Hebrew
1256 - Arabic
1257 - Baltic languages
1258 - Vietnamese
Big5 - Chinese: Taiwan, Hong Kong, Macau
SJIS - Japanese (Sets the code page to 932)
- ISO Code Page Identifiers
88592 - Latin 2: Central and Eastern European
88593 - Latin 3: Southern European
88594 - Latin 4: Northern European
88595 - Cyrillic
88596 - Arabic
88597 - Greek
88598 - Hebrew
88599 - Latin 5: Turkish
885910 - Latin 6: Nordic
885911 - Thai
885913 - Lithuanian
885915 - Latin 9: Western European (Upgraded from Latin 1)
- Unicode Values
UTF8 - Sets the code page to 65001
Note: Shapefiles can now be stored in UTF-8. However, shapefiles encoded in UTF-8 are only recognized in ArcGIS for Desktop.
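When files written under one of these identifiers need to be inspected outside ArcGIS, it can help to translate the identifier into a codec name. The Python sketch below is an assumption-laden convenience, not an Esri mapping: the function name is hypothetical, numeric identifiers beginning with 8859 are read as ISO 8859 parts, other numeric identifiers as Windows or DOS code pages, and Big5 is mapped to Python's big5 codec. Identifiers that Python has no codec for, such as ALARABI, return None.

```python
import codecs
from typing import Optional

# Named identifiers from the lists above. SJIS -> 932 is stated in the
# list; the Big5 and UTF8 codec names are assumptions for illustration.
_NAMED = {"SJIS": "cp932", "BIG5": "big5", "UTF8": "utf_8"}


def python_codec_for(identifier: str) -> Optional[str]:
    """Map a dbfDefault identifier to a Python codec name, if one exists."""
    ident = identifier.strip().upper()  # identifiers are not case-sensitive
    if ident in _NAMED:
        candidate = _NAMED[ident]
    elif ident.isdigit() and ident.startswith("8859") and len(ident) > 4:
        # e.g. 88592 -> iso8859_2, 885915 -> iso8859_15
        candidate = "iso8859_" + ident[4:]
    elif ident.isdigit():
        # e.g. 850 -> cp850, 1252 -> cp1252
        candidate = "cp" + ident
    else:
        return None
    try:
        codecs.lookup(candidate)  # confirm this Python build has the codec
    except LookupError:
        return None
    return candidate
```

For example, python_codec_for("88592") gives "iso8859_2", which can then be passed to bytes.decode to view field values from a file exported with that setting.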
Alternatively, use a batch file to modify the Windows registry.
- In Notepad, create the file ChangeCodePage.bat, using the following code:
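Code:
@ECHO OFF
REM Exit without making changes if no code page argument was supplied
IF "%1"=="" GOTO :EOF
REM Create or overwrite the dbfDefault string value for the current user
reg add HKEY_CURRENT_USER\Software\ESRI\Desktop10.3\Common\CodePage /v dbfDefault /t REG_SZ /d %1 /f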
Note: Change the path to match the version of ArcGIS on the system that is to be modified (for example, ..\Desktop10.1).
- Save the file to a location on the machine to be modified.
- Open a command prompt window (it may be necessary to use 'Run as Administrator' to execute the batch file).
- To execute the batch file (and change the code page to Japanese in this example), navigate to the location of the batch file and run the following command:
ChangeCodePage SJIS
The registry keys are now created and the code page is set to SJIS.
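The same registry change can also be sketched in Python using the standard winreg module, which is available on Windows only. This is a hedged illustration rather than an Esri-provided script: the function names are hypothetical, the version key (for example 'Desktop10.3') must match the installed version, and clearing the setting by deleting the value is one interpretation of returning dbfDefault to 'no value'.

```python
import sys

VALUE_NAME = "dbfDefault"


def code_page_subkey(version_key: str) -> str:
    """Build the subkey path under HKEY_CURRENT_USER for a version key."""
    return "\\".join(["Software", "ESRI", version_key, "Common", "CodePage"])


def set_dbf_default(version_key: str, code_page: str) -> None:
    """Create the Common\\CodePage keys if needed and set dbfDefault."""
    if sys.platform != "win32":
        raise OSError("the Windows registry is only available on Windows")
    import winreg  # imported here because the module exists only on Windows

    subkey = code_page_subkey(version_key)
    with winreg.CreateKeyEx(
        winreg.HKEY_CURRENT_USER, subkey, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_SZ, code_page)


def clear_dbf_default(version_key: str) -> None:
    """Remove dbfDefault (assumed equivalent to the default of no value)."""
    if sys.platform != "win32":
        raise OSError("the Windows registry is only available on Windows")
    import winreg

    subkey = code_page_subkey(version_key)
    try:
        with winreg.OpenKey(
            winreg.HKEY_CURRENT_USER, subkey, 0, winreg.KEY_SET_VALUE
        ) as key:
            winreg.DeleteValue(key, VALUE_NAME)
    except FileNotFoundError:
        pass  # key or value is already absent; nothing to clear
```

Calling set_dbf_default("Desktop10.3", "SJIS") before an export and clear_dbf_default("Desktop10.3") afterward follows the earlier recommendation to reset dbfDefault when the task completes.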