diff --git a/hledger/hledger.m4.md b/hledger/hledger.m4.md
index d3d877c8a..276c059ac 100644
--- a/hledger/hledger.m4.md
+++ b/hledger/hledger.m4.md
@@ -106,14 +106,19 @@ For more about how to do that on your system, see [Common tasks > Setting LEDGER
 
 ## Text encoding
 
-Data files containing non-ascii characters must use UTF-8 encoding, with the exception being `csv` files (see [`encoding`](#encoding) below).
-An optional [byte order mark (BOM)](https://www.unicode.org/faq/utf_bom.html#BOM) is allowed, at the beginning of the file (only).
+hledger input files containing non-ascii characters must use UTF-8 encoding,
+with the exception of CSV (SSV, TSV..) files, which can be read from other encodings (see [`encoding`](#encoding) CSV rule).
 
-Also, your system should be configured with a locale that can decode UTF-8 text.
-On some unix systems, you may need set the `LANG` environment variable, eg.
+In UTF-8 input files, an optional [byte order mark (BOM)](https://www.unicode.org/faq/utf_bom.html#BOM) at the beginning of the file is allowed.
+
+Your system may need to be configured with a locale that understands the input file's encoding.
+Eg on some unix systems, you may need set the `LANG` environment variable.
 You can read more about this in [Unicode characters](#unicode-characters), below.
 
-On unix systems you can check a file's encoding with the `file` command.
+On some unix systems you can use the `file` command to show a file's text encoding.
+On mac, you'll need the version from homebrew: `brew install file-formula`.
+
+hledger's text output is always UTF-8 encoded.
 
 ## Data formats
 
@@ -3215,7 +3220,7 @@ The following kinds of rule can appear in the rules file, in any order.
 | | |
 |-------------------------------------------------|------------------------------------------------------------------------------------------------|
 | [**`source`**](#source) | optionally declare which file to read data from |
-| [**`encoding`**](#encoding) | optionally declare which encoding the data has |
+| [**`encoding`**](#encoding) | optionally declare which text encoding the data has |
 | [**`separator`**](#separator) | declare the field separator, instead of relying on file extension |
 | [**`skip`**](#skip) | skip one or more header lines at start of file |
 | [**`date-format`**](#date-format) | declare how to parse CSV dates/date-times |
@@ -3268,70 +3273,21 @@ See also ["Working with CSV > Reading files specified by rule"](#reading-files-s
 encoding ENCODING
 ```
 
-Specifying `encoding` followed by a valid encoding tells HLedger how to convert a
-csv to be able to make use of it.
-This is most often useful when getting a csv from a bank as they are sometimes
-in an old encoding.
+hledger normally expects non-ascii text to be UTF8-encoded.
+If you need to read CSV files which have some other encoding,
+you can do it by adding `encoding ENCODING` to your CSV rules.
+Eg: `encoding ISO88591`.
 
-If none is given, `utf8` is assumed.
+The following encodings are supported
+(these names are case-insensitive, and can be written with inner spaces or hyphens if you prefer):
+ASCII, UTF8, UTF16, UTF32, ISO88591, ISO88592, ISO88593, ISO88594,
+ISO88595, ISO88596, ISO88597, ISO88598, ISO88599, ISO885910,
+ISO885911, ISO885913, ISO885914, ISO885915, ISO885916, CP1250, CP1251,
+CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258, KOI8R, KOI8U,
+GB18030, MacOSRoman, JISX0201, JISX0208, ISO2022JP, ShiftJIS, CP437,
+CP737, CP775, CP850, CP852, CP855, CP857, CP860, CP861, CP862, CP863,
+CP864, CP865, CP866, CP869, CP874, CP932.
 
-The encoding will be checked case-insensitive with some alternative spellings also allowed.
-The full list of valid encodings is:
-- ASCII
-- UTF8
-- UTF16
-- UTF32
-- ISO88591
-- ISO88592
-- ISO88593
-- ISO88594
-- ISO88595
-- ISO88596
-- ISO88597
-- ISO88598
-- ISO88599
-- ISO885910
-- ISO885911
-- ISO885913
-- ISO885914
-- ISO885915
-- ISO885916
-- CP1250
-- CP1251
-- CP1252
-- CP1253
-- CP1254
-- CP1255
-- CP1256
-- CP1257
-- CP1258
-- KOI8R
-- KOI8U
-- GB18030
-- MacOSRoman
-- JISX0201
-- JISX0208
-- ISO2022JP
-- ShiftJIS
-- CP437
-- CP737
-- CP775
-- CP850
-- CP852
-- CP855
-- CP857
-- CP860
-- CP861
-- CP862
-- CP863
-- CP864
-- CP865
-- CP866
-- CP869
-- CP874
-- CP932
-
-Alternate spellings may be found in the [source code of `encoding`](https://hackage.haskell.org/package/encoding/docs/src/Data.Encoding.html#encodingFromStringExplicit)
 
 ## `separator`
 