doc:text encoding, csv encoding: cleanups [#2319]

Simon Michael 2025-02-15 14:47:43 -10:00
parent cdf6ce072d
commit 35df9576ec


@ -106,14 +106,19 @@ For more about how to do that on your system, see [Common tasks > Setting LEDGER
## Text encoding
Data files containing non-ascii characters must use UTF-8 encoding, with the exception being `csv` files (see [`encoding`](#encoding) below).
An optional [byte order mark (BOM)](https://www.unicode.org/faq/utf_bom.html#BOM) is allowed, at the beginning of the file (only).
hledger input files containing non-ascii characters must use UTF-8 encoding,
with the exception of CSV (SSV, TSV..) files, which can be read from other encodings (see the [`encoding`](#encoding) CSV rule).
Also, your system should be configured with a locale that can decode UTF-8 text.
On some unix systems, you may need to set the `LANG` environment variable, eg.
In UTF-8 input files, an optional [byte order mark (BOM)](https://www.unicode.org/faq/utf_bom.html#BOM) at the beginning of the file is allowed.
Your system may need to be configured with a locale that understands the input file's encoding.
Eg on some unix systems, you may need to set the `LANG` environment variable.
You can read more about this in [Unicode characters](#unicode-characters), below.
On unix systems you can check a file's encoding with the `file` command.
On some unix systems you can use the `file` command to show a file's text encoding.
On mac, you'll need the version from homebrew: `brew install file-formula`.
hledger's text output is always UTF-8 encoded.
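
As a rough illustration of the UTF-8 and BOM rules above, here is a minimal Python sketch that checks whether a file's bytes decode as UTF-8 and whether they start with a byte order mark. It is not part of hledger; the function name `check_utf8` is hypothetical, and the sketch assumes the whole file fits in memory.

```python
import codecs

def check_utf8(data):
    """Return (is_valid_utf8, has_bom) for the raw bytes of an input file.

    Hypothetical helper for illustration only; hledger does its own decoding.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        # The "utf-8-sig" codec decodes UTF-8 and drops one leading BOM if present.
        data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return False, has_bom
    return True, has_bom

# Plain UTF-8, UTF-8 with a BOM, and Latin-1 bytes that are not valid UTF-8.
print(check_utf8("café".encode("utf-8")))
print(check_utf8(codecs.BOM_UTF8 + "café".encode("utf-8")))
print(check_utf8("café".encode("latin-1")))
```

A file that fails this check is a candidate for conversion to UTF-8, or, if it is a CSV file, for the `encoding` CSV rule.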
## Data formats
@ -3215,7 +3220,7 @@ The following kinds of rule can appear in the rules file, in any order.
| | |
|-------------------------------------------------|------------------------------------------------------------------------------------------------|
| [**`source`**](#source) | optionally declare which file to read data from |
| [**`encoding`**](#encoding) | optionally declare which encoding the data has |
| [**`encoding`**](#encoding) | optionally declare which text encoding the data has |
| [**`separator`**](#separator) | declare the field separator, instead of relying on file extension |
| [**`skip`**](#skip) | skip one or more header lines at start of file |
| [**`date-format`**](#date-format) | declare how to parse CSV dates/date-times |
@ -3268,70 +3273,21 @@ See also ["Working with CSV > Reading files specified by rule"](#reading-files-s
encoding ENCODING
```
Specifying `encoding` followed by a valid encoding name tells hledger how to
decode a CSV file so that it can be read.
This is most often useful for CSV files downloaded from a bank, which are sometimes
in an older encoding.
hledger normally expects non-ascii text to be UTF-8 encoded.
If you need to read CSV files which have some other encoding,
you can do it by adding `encoding ENCODING` to your CSV rules.
Eg: `encoding ISO88591`.
If none is given, `utf8` is assumed.
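
Conceptually, declaring an encoding means the raw bytes are decoded from that encoding into text. The Python sketch below shows the idea, under the assumption that a plain decode and re-encode is a fair model of it; it is not hledger's implementation. The helper name `to_utf8` and the sample row are hypothetical, and Python's codec name `iso-8859-1` is used here in place of hledger's `ISO88591`.

```python
def to_utf8(raw, source_encoding="utf-8"):
    """Decode bytes from source_encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration; UTF-8 is the default,
    mirroring the rule that `utf8` is assumed when none is given.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")

# A made-up CSV row as a bank might export it in ISO 8859-1 (Latin-1).
latin1_row = "2025-02-15,Café,12.50\n".encode("iso-8859-1")
utf8_row = to_utf8(latin1_row, "iso-8859-1")
print(utf8_row.decode("utf-8"), end="")
```

Decoding with the wrong source encoding either raises an error or silently produces wrong characters, which is why the encoding has to be declared correctly.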
The following encodings are supported
(these names are case-insensitive, and can be written with inner spaces or hyphens if you prefer):
ASCII, UTF8, UTF16, UTF32, ISO88591, ISO88592, ISO88593, ISO88594,
ISO88595, ISO88596, ISO88597, ISO88598, ISO88599, ISO885910,
ISO885911, ISO885913, ISO885914, ISO885915, ISO885916, CP1250, CP1251,
CP1252, CP1253, CP1254, CP1255, CP1256, CP1257, CP1258, KOI8R, KOI8U,
GB18030, MacOSRoman, JISX0201, JISX0208, ISO2022JP, ShiftJIS, CP437,
CP737, CP775, CP850, CP852, CP855, CP857, CP860, CP861, CP862, CP863,
CP864, CP865, CP866, CP869, CP874, CP932.
The encoding name is matched case-insensitively, and some alternative spellings are also allowed.
The full list of valid encodings is:
- ASCII
- UTF8
- UTF16
- UTF32
- ISO88591
- ISO88592
- ISO88593
- ISO88594
- ISO88595
- ISO88596
- ISO88597
- ISO88598
- ISO88599
- ISO885910
- ISO885911
- ISO885913
- ISO885914
- ISO885915
- ISO885916
- CP1250
- CP1251
- CP1252
- CP1253
- CP1254
- CP1255
- CP1256
- CP1257
- CP1258
- KOI8R
- KOI8U
- GB18030
- MacOSRoman
- JISX0201
- JISX0208
- ISO2022JP
- ShiftJIS
- CP437
- CP737
- CP775
- CP850
- CP852
- CP855
- CP857
- CP860
- CP861
- CP862
- CP863
- CP864
- CP865
- CP866
- CP869
- CP874
- CP932
Alternate spellings may be found in the [source code of `encoding`](https://hackage.haskell.org/package/encoding/docs/src/Data.Encoding.html#encodingFromStringExplicit).
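
The name matching described above (case-insensitive, with optional inner spaces or hyphens) can be pictured with the following Python sketch. It is only an assumption about how such matching might work, not the logic of hledger or of the `encoding` library, which defines the spellings actually accepted. The names `normalize_encoding_name` and `is_supported` are hypothetical; the set of names is copied from the list above.

```python
# Encoding names as listed in the manual, upper-cased for comparison.
SUPPORTED = {
    name.upper()
    for name in """
    ASCII UTF8 UTF16 UTF32 ISO88591 ISO88592 ISO88593 ISO88594
    ISO88595 ISO88596 ISO88597 ISO88598 ISO88599 ISO885910
    ISO885911 ISO885913 ISO885914 ISO885915 ISO885916 CP1250 CP1251
    CP1252 CP1253 CP1254 CP1255 CP1256 CP1257 CP1258 KOI8R KOI8U
    GB18030 MacOSRoman JISX0201 JISX0208 ISO2022JP ShiftJIS CP437
    CP737 CP775 CP850 CP852 CP855 CP857 CP860 CP861 CP862 CP863
    CP864 CP865 CP866 CP869 CP874 CP932
    """.split()
}

def normalize_encoding_name(name):
    """Drop spaces and hyphens, then upper-case (an assumed normalization)."""
    return name.replace(" ", "").replace("-", "").upper()

def is_supported(name):
    """True if the normalized name matches one of the listed encodings."""
    return normalize_encoding_name(name) in SUPPORTED

for candidate in ["iso-8859-1", "Shift JIS", "cp 1252", "EBCDIC"]:
    print(candidate, is_supported(candidate))
```

Under this assumed normalization, `iso-8859-1`, `ISO 8859 1` and `ISO88591` all refer to the same encoding.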
## `separator`