When opening a text file in an editor, the encoding must be specified. But how do I determine the correct encoding of a file?
Before I clarify the question of how to determine the correct encoding, I would like to briefly explain what exactly text encoding is:
Text encoding simply explained
Each text file is basically a sequence of bytes, and each byte has a value from 0 to 255. The text encoding defines which characters these bytes represent. In the simplest case, a single-byte encoding, it is a kind of lookup table (character set) that maps each byte value to the character to be displayed. For example, the value 65 (called a code point) corresponds to the character “A” in the encoding “ANSI”. In multi-byte encodings such as UTF-8, a single character can be made up of several bytes.
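The lookup-table idea can be sketched in a few lines of Python. This is only an illustration, and it assumes that “ANSI” means Windows-1252, which Python calls "cp1252"; on other systems “ANSI” may refer to a different code page.

```python
# Assumption: "ANSI" is taken to mean Windows-1252 ("cp1252" in Python).
# The byte value 65 from the text above maps to the letter "A".
letter = bytes([65]).decode("cp1252")
print(letter)  # A

# Four byte values as they might be stored in a file.
raw = bytes([77, 228, 114, 122])

# Look up every byte value in the cp1252 table, one by one.
table = {value: bytes([value]).decode("cp1252") for value in raw}
word = raw.decode("cp1252")

print(table)
print(word)
```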
The text encoding is therefore like a pair of glasses through which the byte sequence in a file is interpreted and displayed. You can change the glasses, that is the encoding (character set), at any time and view the same data differently, which will probably result in character sequences that no longer make sense. If the wrong encoding is selected, certain special characters and umlauts such as “ä” suddenly appear as strange hieroglyphs such as “Ã¤”. As soon as the correct encoding is used again, the characters become readable again.
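The effect of wearing the wrong glasses can be reproduced directly. The following sketch assumes a file that was saved as UTF-8 and is then read as Windows-1252 (used here as a stand-in for “ANSI”); the bytes never change, only their interpretation does.

```python
original = "März"

# The bytes as they would be written to disk in UTF-8: 4d c3 a4 72 7a
stored = original.encode("utf-8")

# Wrong glasses: the two bytes of "ä" are shown as two separate characters.
wrong = stored.decode("cp1252")

# Right glasses: the same bytes are readable again.
right = stored.decode("utf-8")

print(wrong)
print(right)
```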
Different text encodings exist, mainly for historical reasons, so that the characters of different languages can be represented. A few well-known encodings are ASCII, ANSI, ISO 8859-1, UTF-8 and UTF-16.
In the meantime, UTF-8 has become the established encoding. It encodes each character in one to four bytes and can therefore represent every Unicode character. Thanks to this, many languages can be covered, including emojis.
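The variable byte length of UTF-8 is easy to observe. The sample characters below are chosen purely for illustration.

```python
# One character each from plain ASCII, Latin umlauts, currency symbols and emojis.
samples = ["A", "ä", "€", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(ch, n)
```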
But how do I find out which glasses or encoding I should use to read a particular text file?
How you can identify the correct encoding
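Basically, you cannot read the encoding directly from the file, because a plain text file normally does not record which encoding was used to write it. You have to try out different glasses, that is encodings, until you can read the content correctly.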
Trying out encodings manually
This means you look for words that you expect the file to contain and check whether they are displayed correctly with the selected encoding. In German, words with umlauts such as “März” are suitable for this. If the word “März” can be found in the content and read correctly, you have most likely selected the correct encoding.
If in doubt, I recommend starting with UTF-8, as this encoding is very common. If strange characters are displayed, I would continue with ANSI as the second step.
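The manual procedure can also be automated roughly as follows. This is a sketch under assumptions, not how any particular tool works: the function name guess_encoding is hypothetical, “ANSI” is again taken to mean Windows-1252, and the candidate order follows the recommendation above (UTF-8 first, then ANSI).

```python
def guess_encoding(data, known_word, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding under which known_word is readable.

    Hypothetical helper for illustration. Returns None if no candidate fits.
    """
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            # The bytes are not even valid in this encoding; try the next one.
            continue
        if known_word in text:
            return encoding
    return None


# Two files with the same visible content but different encodings.
ansi_file = "Termin im März".encode("cp1252")
utf8_file = "Termin im März".encode("utf-8")

print(guess_encoding(ansi_file, "März"))
print(guess_encoding(utf8_file, "März"))
```

Checking for a known word matters because Windows-1252 accepts almost any byte sequence without raising an error, so a successful decode alone proves little.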
Reliable detection thanks to markers (BOM)
Text files can also be provided with a marker that saves you from having to try out encodings. Such a file begins with a defined byte combination, which is referred to as a byte order mark (BOM). As soon as a text editor detects one of these predefined byte sequences at the beginning of the file, the corresponding encoding can be applied automatically. The editor hides the BOM from the user, as it is only used to identify the correct encoding and does not belong to the content itself.
Unfortunately, not all editors support these markers, so many text files do not contain a BOM. However, the smasi CSV Wizard supports automatic detection using BOM, which makes it easier for you to load files.
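A BOM check of this kind might look like the following sketch. The BOM constants come from Python's standard codecs module; the function names detect_bom and decode_with_bom and the UTF-8 fallback are assumptions made for this example and do not describe how any specific editor implements it.

```python
import codecs

# Longer BOMs first: the UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_with_bom(data, fallback="utf-8"):
    """Decode data, skipping the BOM so it does not show up in the text."""
    encoding, bom_length = detect_bom(data)
    return data[bom_length:].decode(encoding or fallback)


sample = codecs.BOM_UTF16_LE + "März".encode("utf-16-le")
print(detect_bom(sample))
print(decode_with_bom(sample))
```

When no BOM is found, the sketch simply falls back to UTF-8; in that case the encoding still has to be confirmed by trying it out.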
The tool to find the right encoding
As soon as a file contains a Byte Order Mark (BOM), smasi CSV-Wizard will automatically determine the correct encoding. However, if this BOM is missing, you will have to manually determine the right encoding when opening the file.
But smasi CSV-Wizard also helps you with this step:
On the corresponding format page, you can select the various encodings and then check them directly via the “Show content” link.
To be able to try out the encodings manually, you must first select the 'Self-defined format' option at the top. In this example, “ANSI” was selected as the encoding, but the content is not displayed correctly. In this case you have to continue searching with a different character set.
This allows you to determine the correct encoding setting with just a few mouse clicks.