doc:text encoding, csv encoding: cleanups [#2319]

2025-02-15 14:47:43 -10:00 · 2025-02-15 14:47:43 -10:00 · 35df9576ec
commit 35df9576ec
parent cdf6ce072d
1 changed files with 24 additions and 68 deletions
--- a/hledger/hledger.m4.md
+++ b/hledger/hledger.m4.md
@@ -106,14 +106,19 @@ For more about how to do that on your system, see [Common tasks > Setting LEDGER

 ## Text encoding

-Data files containing non-ascii characters must use UTF-8 encoding, with the exception being `csv` files (see [`encoding`](#encoding) below).
-An optional [byte order mark (BOM)](https://www.unicode.org/faq/utf_bom.html#BOM) is allowed, at the beginning of the file (only).
+hledger input files containing non-ASCII characters must use UTF-8 encoding,
+with the exception of CSV (SSV, TSV, ...) files, which can be read from other encodings (see the [`encoding`](#encoding) CSV rule).

-Also, your system should be configured with a locale that can decode UTF-8 text.
-On some unix systems, you may need set the `LANG` environment variable, eg.
+In UTF-8 input files, an optional [byte order mark (BOM)](https://www.unicode.org/faq/utf_bom.html#BOM) at the beginning of the file is allowed.
+
+Your system may need to be configured with a locale that understands the input file's encoding.
+For example, on some unix systems you may need to set the `LANG` environment variable.
 You can read more about this in [Unicode characters](#unicode-characters), below.

-On unix systems you can check a file's encoding with the `file` command.
+On some unix systems you can use the `file` command to show a file's text encoding.
+On mac, you'll need the version from homebrew: `brew install file-formula`.
+
+hledger's text output is always UTF-8 encoded.

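Where the `file` command is not available, a short script can answer the same question for the UTF-8 case. The following is a minimal sketch, not part of hledger: `is_utf8` is a hypothetical helper name, and it only reports whether the bytes decode as UTF-8 (tolerating a leading BOM, as described above); it does not guess which other encoding a file might use.

```python
import codecs


def is_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8; a leading BOM is tolerated."""
    # Drop an optional byte order mark at the very start of the data.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

To check a file, pass its raw bytes, for example `is_utf8(open(path, "rb").read())`, where `path` is whatever journal or CSV file you want to inspect.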
 ## Data formats

@@ -3215,7 +3220,7 @@ The following kinds of rule can appear in the rules file, in any order.
 |                                                 |                                                                                                |
 |-------------------------------------------------|------------------------------------------------------------------------------------------------|
 | [**`source`**](#source)                         | optionally declare which file to read data from                                                |
-| [**`encoding`**](#encoding)                     | optionally declare which encoding the data has                                                 |
+| [**`encoding`**](#encoding)                     | optionally declare which text encoding the data has                                            |
 | [**`separator`**](#separator)                   | declare the field separator, instead of relying on file extension                              |
 | [**`skip`**](#skip)                             | skip one or more header lines at start of file                                                 |
 | [**`date-format`**](#date-format)               | declare how to parse CSV dates/date-times                                                      |
@@ -3268,70 +3273,21 @@ See also ["Working with CSV > Reading files specified by rule"](#reading-files-s
 encoding ENCODING
 ```

-Specifying `encoding` followed by a valid encoding tells HLedger how to convert a
-csv to be able to make use of it.
-This is most often useful when getting a csv from a bank as they are sometimes
-in an old encoding.
+hledger normally expects non-ASCII text to be UTF-8 encoded.
+If you need to read CSV files which have some other encoding,
+you can do it by adding `encoding ENCODING` to your CSV rules.
+Eg: `encoding ISO88591`.

-If none is given, `utf8` is assumed.
+The following encodings are supported
+(these names are case-insensitive, and can be written with inner spaces or hyphens if you prefer):
+ASCII, UTF8, UTF16, UTF32, ISO88591, ISO88592, ISO88593, ISO88594,
+ISO88595, ISO88596, ISO88597, ISO88598, ISO88599, ISO885910,
+ISO885911, ISO885913, ISO885914, ISO885915, ISO885916, CP1250, CP1251,
+CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258, KOI8R, KOI8U,
+GB18030, MacOSRoman, JISX0201, JISX0208, ISO2022JP, ShiftJIS, CP437,
+CP737, CP775, CP850, CP852, CP855, CP857, CP860, CP861, CP862, CP863,
+CP864, CP865, CP866, CP869, CP874, CP932.

-The encoding will be checked case-insensitive with some alternative spellings also allowed.
-The full list of valid encodings is:
- ASCII
- UTF8
- UTF16
- UTF32
- ISO88591
- ISO88592
- ISO88593
- ISO88594
- ISO88595
- ISO88596
- ISO88597
- ISO88598
- ISO88599
- ISO885910
- ISO885911
- ISO885913
- ISO885914
- ISO885915
- ISO885916
- CP1250
- CP1251
- CP1252
- CP1253
- CP1254
- CP1255
- CP1256
- CP1257
- CP1258
- KOI8R
- KOI8U
- GB18030
- MacOSRoman
- JISX0201
- JISX0208
- ISO2022JP
- ShiftJIS
- CP437
- CP737
- CP775
- CP850
- CP852
- CP855
- CP857
- CP860
- CP861
- CP862
- CP863
- CP864
- CP865
- CP866
- CP869
- CP874
- CP932
-
-Alternate spellings may be found in the [source code of `encoding`](https://hackage.haskell.org/package/encoding/docs/src/Data.Encoding.html#encodingFromStringExplicit)

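The new text says encoding names are case-insensitive and may contain inner spaces or hyphens. One way to read that rule is sketched below. This is an illustration only, not hledger's actual implementation: `normalize_encoding_name` and `is_supported` are hypothetical helpers, and `SUPPORTED` holds just a few names from the list above.

```python
# A small subset of the documented names, upper-cased for comparison.
SUPPORTED = {"ASCII", "UTF8", "UTF16", "ISO88591", "CP1252", "KOI8R", "MACOSROMAN"}


def normalize_encoding_name(name: str) -> str:
    """Remove spaces and hyphens, then upper-case the name."""
    return name.replace(" ", "").replace("-", "").upper()


def is_supported(name: str) -> bool:
    """Return True if the normalized name is in the (partial) supported set."""
    return normalize_encoding_name(name) in SUPPORTED
```

Under this reading, `iso-8859-1`, `ISO 8859 1` and `ISO88591` would all name the same encoding in a rules file.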
 ## `separator`