If you’re here for Visual C++ 6.0, skip to the “For Visual C++ 6.0” section below.

I’m not sure about Visual C++ 7.0 and 7.1 (.NET 2002 and 2003), but since 8.0 (2005) the Visual C++ compiler has supported source files encoded as UTF-8 without a BOM. For it to work as expected, though, you need to follow a rule that no one at Microsoft will ever tell you. It is now 2017 and I have failed to find any thread or article that emphasizes it, which is why I think I need to write this.

img

Here is the test source file we will use today: utf8-string-literal-test.cpp

For Visual C++ 8.0 and above

To get your UTF-8 source files compiled correctly, you must save them as UTF-8 without a BOM, and the system locale (the language for non-Unicode programs) must be English.

img

I know this looks like bullshit and a bit racist, but that is what my test result tells me.

Here is what I get if I compile this file when my system locale is set to Japanese:

img

See? If the source file doesn’t contain a BOM (byte-order mark) and your system locale is not English, VC will assume that your source file is not in Unicode. What a bold assumption. I think this is yet another compatibility trick by Microsoft. If you are sure your source files always use UTF-8 without a BOM, you can mod the compiler so that it stops making that assumption.
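To check whether a given source file carries a BOM, it is enough to look at its first three bytes: the UTF-8 BOM is the sequence EF BB BF. Below is a minimal sketch of that check. The function names `has_utf8_bom` and `strip_utf8_bom` are my own for illustration and are not part of any compiler or library.

```cpp
#include <string>

// Returns true if the buffer starts with the UTF-8 byte-order mark EF BB BF.
// (Hypothetical helper name, used only for this illustration.)
bool has_utf8_bom(const std::string& bytes) {
    return bytes.size() >= 3 &&
           static_cast<unsigned char>(bytes[0]) == 0xEF &&
           static_cast<unsigned char>(bytes[1]) == 0xBB &&
           static_cast<unsigned char>(bytes[2]) == 0xBF;
}

// Returns a copy of the buffer with a leading BOM removed, if one is present.
std::string strip_utf8_bom(const std::string& bytes) {
    return has_utf8_bom(bytes) ? bytes.substr(3) : bytes;
}
```

Read the file in binary mode before passing its contents in, so that no newline translation touches the bytes.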

Mod the compiler (for Visual C++ 8.0 and above only)

I don’t like the idea of changing the system locale, so I mod the compiler instead. This is what it looks like after I have modded my compiler:

img

Before you proceed, you should know that it is always a good idea to back up your compiler files. Better safe than sorry, right?

Step 1: Determine what to mod

Since this is related to character sets, I grep for related APIs such as WideCharToMultiByte, then pick the files I think I need to mod. For this example, I chose cl.exe, c1.dll, c1xx.dll and c2.dll.

img

Step 2: Mod the files to load kernel31 instead

Open the file in a hex editor. Search for the string kernel32.dll and replace it with kernel31.dll. If there are multiple occurrences, choose the one that is surrounded by many API names.
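If you would rather script the search than eyeball it in a hex editor, the idea can be sketched as follows. This is only an illustration under my own assumptions: the helper names `find_ascii` and `patch_at` are hypothetical, the search is ASCII case-insensitive because the name may be stored in a different case, and you still have to decide yourself which offset is the one surrounded by API names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Lowercases one byte for ASCII case-insensitive comparison.
static char lower_ascii(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Returns the offset of every ASCII case-insensitive match of `needle`
// inside `image` (the raw bytes of the file, read in binary mode).
std::vector<std::size_t> find_ascii(const std::string& image,
                                    const std::string& needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty() || image.size() < needle.size()) return offsets;
    for (std::size_t i = 0; i + needle.size() <= image.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() &&
               lower_ascii(image[i + j]) == lower_ascii(needle[j])) {
            ++j;
        }
        if (j == needle.size()) offsets.push_back(i);
    }
    return offsets;
}

// Overwrites the bytes at `offset` with `replacement`. The write is refused
// if it would run past the end, so the file size never changes.
bool patch_at(std::string& image, std::size_t offset,
              const std::string& replacement) {
    if (offset > image.size() || replacement.size() > image.size() - offset) {
        return false;
    }
    image.replace(offset, replacement.size(), replacement);
    return true;
}
```

Because kernel31.dll has exactly the same length as kernel32.dll, the overwrite leaves every other offset in the file untouched, which is what makes this kind of patch safe to apply in place.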

Step 3: Assert codepage 1252 in your kernel31

If you don’t know kernel31, please look forward to a future post I will write someday.

In your kernel31 project, you need to redirect two functions to your own implementations: MultiByteToWideChar and WideCharToMultiByte. The idea is: if the incoming codepage is not CP_UTF8, change it to 1252 (the English codepage), then call the real MultiByteToWideChar or WideCharToMultiByte and return its result. This is how I coded it:

// Replacement for MultiByteToWideChar: same parameters, but every codepage
// other than CP_UTF8 is forced to 1252 before calling the real API.
KERNEL31_API int __stdcall
My_MultiByteToWideChar(
    UINT codepage, DWORD dwFlags, LPCSTR szMulti, int cbMulti,
    LPWSTR szWide, int cchWide)
{
    if (codepage != CP_UTF8) {
        codepage = 1252;
    }
    return MultiByteToWideChar(codepage, dwFlags, szMulti, cbMulti,
        szWide, cchWide);
}

// Replacement for WideCharToMultiByte, using the same codepage override.
KERNEL31_API int __stdcall
My_WideCharToMultiByte(
    UINT codepage, DWORD dwFlags, LPCWSTR szWide, int cchWide,
    LPSTR szMulti, int cbMulti, LPCSTR lpDefChar, LPBOOL lpUsedDefChar)
{
    if (codepage != CP_UTF8) {
        codepage = 1252;
    }
    return WideCharToMultiByte(codepage, dwFlags, szWide, cchWide,
        szMulti, cbMulti, lpDefChar, lpUsedDefChar);
}

If you prefer to download the one I created instead:
Kernel31 for VC8.0 (VC2005)
Kernel31 for VC14.0 (VC2015)

Step 4: Showtime

Just put the kernel31.dll into the compiler directory, and you’re done.

img

Congrats! Now your VC can compile UTF-8 source files without a BOM! No more crappy BOM or UTF-16!

For Visual C++ 6.0

Let me make one thing clear: the Visual C++ 6.0 IDE itself doesn’t know anything about Unicode. The moment you open your UTF-8 source file in the IDE, you’re doomed. You must compile such files from the command line with VC6.0. In fact, the compiler doesn’t know Unicode either, but because UTF-8 is a multi-byte character set, it happens to handle this encoding anyway, with one small issue of course.

Let’s see what would happen if we compile the source file with VC6.0 compiler.

img

This is the issue I was talking about. To fix it, just append a space to the problematic string literal.

UPDATE May 11: A better way has been discovered.

img

Can’t believe it? See for yourself:
utf8-string-literal-test-vc6-exe.7z

UPDATE: File encoding plays a role too

You might think that as long as your system locale is English, which enables MSVC to compile UTF-8 source files, the string literals in your source file will always end up as UTF-8. This is NOT true. Here is the pseudocode that describes the MSVC behavior:

if (ansi_codepage == 1252) {
    open_source_file_with_encoding(utf_8);
}
else {
    open_source_file_with_encoding(ansi_codepage);
}
copy_string_literal_into_exe_with_encoding(file_encoding);

ANSI CP | Source file CP | String literal CP in EXE | Conversion flow
1252    | UTF-8          | UTF-8                    | UTF-8 -> wide -> UTF-8
1252    | 932            | 932                      | UTF-8 -> wide -> 932
1252    | 936            | 936                      | UTF-8 -> wide -> 936
936     | UTF-8          | UTF-8                    | 936 -> wide -> UTF-8
936     | 932            | 932                      | 936 -> wide -> 932
936     | 936            | 936                      | 936 -> wide -> 936
932     | UTF-8          | UTF-8                    | 932 -> wide -> UTF-8
932     | 932            | 932                      | 932 -> wide -> 932
932     | 936            | 936                      | 932 -> wide -> 936
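One way to see for yourself which encoding ended up in the EXE is to dump the raw bytes of a string literal. The sketch below is an illustration only, and `to_hex` is a name I made up for it. For reference, the character あ (U+3042) is the bytes E3 81 82 in UTF-8 and 82 A0 in codepage 932.

```cpp
#include <cstddef>
#include <string>

// Formats each byte of `s` as two uppercase hex digits separated by spaces.
// (Hypothetical helper name, used only for this illustration.)
std::string to_hex(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

In a test program you would pass a real literal such as `to_hex("あ")` and print the result. Which of the two byte sequences comes out tells you which row of the table applies to your setup.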

UPDATE: For VC6.0, the key is LocaleName.

Say no more, I am totally speechless now. As long as your LocaleName is valid and starts with en-, it will compile without issue. I’m not sure why (racist?) but it works. See for yourself.

img

img

The good thing is that this change does not require a reboot, and you can automate it with a batch file. I have wrapped it in a function so you can integrate it into your batch file easily. Note that you need setlocal enableDelayedExpansion. Don’t know where to put it? Put it right after your @echo off.

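:localeName
rem Usage: call :localeName patch ^| unpatch ^| get ^| set VALUE
set _path_="HKCU\Control Panel\International"
set _name_=LocaleName
rem patch: remember the current LocaleName, then switch to en-US
if "%~1"=="patch" (
	call :localeName get _localeName_
	call :localeName set en-US
)
rem unpatch: restore the LocaleName saved by patch
if "%~1"=="unpatch" (
	call :localeName set !_localeName_!
)
rem get: read LocaleName from the registry into _localeName_
if "%~1"=="get" (
	for /f "tokens=3 skip=2" %%i in ('reg query !_path_! /v !_name_!') do (
		set _localeName_=%%i
	)
)
rem set: write the second argument to LocaleName
if "%~1"=="set" (
	reg add !_path_! /v !_name_! /d "%~2" /f >nul || exit/b 1
)
exit/b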

Before compile script: call :localeName patch

After compile script: call :localeName unpatch