csv: doc: clean up/expand manual after #1095

[ci skip]
2019-11-06 13:08:54 -08:00 · d92351e21a
commit d92351e21a
parent dcfc833d92
1 changed file with 351 additions and 220 deletions
--- a/hledger-lib/hledger_csv.m4.md
+++ b/hledger-lib/hledger_csv.m4.md
@@ -25,7 +25,7 @@ Converting CSV to transactions requires some special conversion rules.
 These do several things:
 - they describe the layout and format of the CSV data
- they can customize the generated journal entries using a simple templating language
+- they can customize the generated journal entries (transactions) using a simple templating language
 - they can add refinements based on patterns in the CSV data, eg categorizing transactions with more detailed account names.
 When reading a CSV file named `FILE.csv`, hledger looks for a
@@ -38,12 +38,245 @@ At minimum, the rules file must identify the date and amount fields.
 It's often necessary to specify the date format, and the number of header lines to skip, also.
 Eg:
 ```
-fields date, _, _, amount1
+fields date, _, _, amount
 date-format  %d/%m/%Y
 skip 1
 ```
-A more complete example:
+More examples in the EXAMPLES section below.
 # CSV RULES
 The following kinds of rule can appear in the rules file, in any order
 (except for `end` which can appear only inside a conditional block).
 Blank lines and lines beginning with `#` or `;` are ignored.
 ## `skip`
 ```rules
 skip N
 ```
 The word "skip" followed by a number (or no number, meaning 1)
 tells hledger to ignore this many non-empty lines preceding the CSV data.
 (Empty/blank lines are skipped automatically.)
 You'll need this whenever your CSV data contains header lines.
 It also has a second purpose: it can be used to ignore certain CSV
 records, see [conditional blocks](#if) below.
 ## `fields`
 ```rules
 fields FIELDNAME1, FIELDNAME2, ...
 ```
 A fields list ("fields" followed by one or more comma-separated field names) is the quick way to assign CSV field values to hledger fields.
 It  (a) names the CSV fields, in order (names may not contain whitespace; fields you don't care about can be left unnamed),
 and (b) assigns them to hledger fields if you use standard hledger field names.
 Here's an example:
 ```rules
 # use the 1st, 2nd and 4th CSV fields as the transaction's date, description and amount,
 # ignore the 3rd, 5th and 6th fields,
 # and name the 7th and 8th fields for later reference:
 #      1     2           3  4       5 6  7          8
 fields date, description, , amount1, , , somefield, anotherfield
 ```
 Here are the standard hledger field names:
 ### Transaction fields
 `date`, `date2`, `status`, `code`, `description`, `comment` can be used to form the
 [transaction's](journal.html#transactions) first line. Only `date` is required.
 (See also [date-format](#date-format) below.)
 ### Posting fields
 `accountN`, where N is 1 to 9, sets the Nth [posting's](journal.html#postings) account name.
 Most often there are two postings, so you'll want to set `account1` and `account2`.
 <!-- (Often, `account1` is fixed and `account2` will be set later by a [conditional block](#if).) -->
 A number of field/pseudo-field names are available for setting posting [amounts](journal.html#amounts):
 - `amountN` sets posting N's amount
 - `amountN-in` and `amountN-out` can be used instead, if the CSV has separate fields for debits and credits
 - `currencyN` sets a currency symbol to be left-prefixed to the amount, useful if the CSV provides that as a separate field
 - `balanceN` sets a (separate) [balance assertion](journal.html#balance-assertions) amount 
   (or when no posting amount is set, a [balance assignment](journal.html#balance-assignments))
 If you write these with no number
 (`amount`, `amount-in`, `amount-out`, `currency`, `balance`),
 it means posting 1.
 Also, if you set an amount for posting 1 only, 
 a second posting that balances the transaction will be generated automatically.
 This helps support CSV rules created before hledger 1.16.
 <!-- XXX check exact behaviour, eg in three-posting example below -->
 Finally, `commentN` sets a [comment](journal.html#comments) on the Nth posting. 
 Comments can of course contain [tags](journal.html#tags).
 ## `(field assignment)`
 ```rules
 HLEDGERFIELDNAME FIELDVALUE
 ```
 Instead of or in addition to a [fields list](#fields), you can
 assign a value to a hledger field by writing its name
 (any of the standard names above) followed by a text value.
 The value may contain interpolated CSV fields, 
 referenced by their 1-based position in the CSV record (`%N`),
 or by the name they were given in the fields list (`%CSVFIELDNAME`).
 Eg:
 ```rules
 # set the amount to the 4th CSV field, with " USD" appended
 amount %4 USD
 ```
 ```rules
 # combine three fields to make a comment, containing note: and date: tags
 comment note: %somefield - %anotherfield, date: %1
 ```
 Interpolation strips any outer whitespace, so a CSV value like `" 1 "`
 becomes `1` when interpolated
 ([#1051](https://github.com/simonmichael/hledger/issues/1051)).
 Note you can only interpolate CSV fields, not the hledger fields being assigned to;
 for more on this, see [TIPS](#tips).
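
The interpolation described above can be sketched in Python. This is only an illustration, not hledger's implementation: the `interpolate` function and its parameters are hypothetical names, and leaving unknown references untouched is an assumption of this sketch that may differ from what hledger does.

```python
import re

def interpolate(template, record, field_names):
    """Expand %N (1-based position) and %NAME references in a field value.

    record      -- list of CSV field values for one record
    field_names -- names given in the fields list ("" for unnamed fields)
    """
    by_name = {name: i for i, name in enumerate(field_names) if name}

    def expand(match):
        ref = match.group(1)
        if ref.isdigit():
            index = int(ref) - 1           # %N is 1-based
        elif ref in by_name:
            index = by_name[ref]
        else:
            return match.group(0)          # assumption: unknown names are left as-is
        if 0 <= index < len(record):
            return record[index].strip()   # outer whitespace is stripped
        return match.group(0)

    return re.sub(r"%([A-Za-z0-9_-]+)", expand, template)
```

With a record of `["2019-11-06", "coffee", "", " 4.50 "]` and field names `["date", "description", "", "amount"]`, the value `%4 USD` expands to `4.50 USD`.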
 ## `date-format`
 ```rules
 date-format DATEFMT
 ```
 This is a helper for the `date` (and `date2`) fields.
 If your CSV dates are not formatted like `YYYY-MM-DD`, `YYYY/MM/DD` or `YYYY.MM.DD`,
 you'll need to specify the format by writing "date-format" followed by 
 a [strptime-like date parsing pattern](http://hackage.haskell.org/packages/archive/time/latest/doc/html/Data-Time-Format.html#v:formatTime),
 which must parse the date field values completely. Examples:
 ``` rules
 # for dates like "11/06/2013":
 date-format %m/%d/%Y
 ```
 ``` rules
 # for dates like "6/11/2013". The - allows leading zeros to be optional.
 date-format %-d/%-m/%Y
 ```
 ``` rules
 # for dates like "2013-Nov-06":
 date-format %Y-%h-%d
 ```
 ``` rules
 # for dates like "11/6/2013 11:32 PM":
 date-format %-m/%-d/%Y %l:%M %p
 ```
 ## `if`
 ```rules
 if PATTERN
 RULE
 if
 PATTERN
 PATTERN
 PATTERN
 RULE
 RULE
 ```
 Conditional blocks apply one or more rules to CSV records which are
 matched by any of the PATTERNs. This allows transactions to be
 customised or categorised based on patterns in the data.
 A single pattern can be written on the same line as the "if";
 or multiple patterns can be written on the following lines, non-indented.
 Patterns are case-insensitive [regular expressions](hledger.html#regular-expressions)
 which try to match any part of the whole CSV record.
 It's not yet possible to match within a specific field.
 Note the CSV record they see is close but not identical to the one in the CSV file;
 eg double quotes are removed, and the separator character becomes comma.
 After the patterns, there should be one or more rules to apply, all
 indented by at least one space. Three kinds of rule are allowed in
 conditional blocks:
 - [field assignments](#field-assignment) (to set a field's value)
 - [skip](#skip) (to skip the matched CSV record)
 - [end](#end) (to skip all remaining CSV records).
 Examples:
 ```rules
 # if the CSV record contains "groceries", set account2 to "expenses:groceries"
 if groceries
 account2 expenses:groceries
 ```
 ```rules
 # if the CSV record contains any of these patterns, set account2 and comment as shown
 if
 monthly service fee
 atm transaction fee
 banking thru software
 account2 expenses:business:banking
 comment  XXX deductible ? check it
 ```
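
As a rough mental model of the matching rule above (any pattern, case-insensitive, anywhere in the whole record), here is a small Python sketch. The function name `block_matches` is hypothetical, and joining the fields with commas only approximates the record that hledger's patterns actually see.

```python
import re

def block_matches(patterns, record_fields):
    """Return True if any pattern matches anywhere in the CSV record."""
    # Approximation: quotes already removed, fields joined with commas.
    record = ",".join(record_fields)
    return any(re.search(p, record, re.IGNORECASE) for p in patterns)
```

For example, the patterns `monthly service fee` and `atm transaction fee` would both be tested against a record like `2019-11-06,ATM Transaction Fee,-2.00`, and the second one matches regardless of case.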
 ## `end`
 As mentioned above, this rule can be used inside conditional blocks
 (only) to cause hledger to stop reading CSV records and proceed with
 command execution. Eg:
 ```rules
 # ignore everything following the first empty record
 if ,,,,
 end
 ```
 ## `include`
 ```rules
 include RULESFILE
 ```
 Include another CSV rules file at this point, as if it were written inline. 
 `RULESFILE` is an absolute file path or a path relative to the current file's directory.
 This can be useful eg for reusing common rules in several rules files:
 ```rules
 # someaccount.csv.rules
 ## someaccount-specific rules
 fields date,description,amount
 account1 some:account
 account2 some:misc
 ## common rules
 include categorisation.rules
 ```
 ## `newest-first`
 hledger always sorts the generated transactions by date.
 Transactions on the same date should appear in the same order as their CSV records,
 as hledger can usually auto-detect whether the CSV's normal order is oldest first or newest first.
 But if all of the following are true:
 - the CSV might sometimes contain just one day of data (all records having the same date)
 - the CSV records are normally in reverse chronological order (newest first)
 - and you care about preserving the order of same-day transactions
 you should add the `newest-first` rule as a hint. Eg:
 ```rules
 # tell hledger explicitly that the CSV is normally newest-first
 newest-first
 ```
 # EXAMPLES
 A more complete example, generating three-posting transactions:
 ```
 # hledger CSV rules for amazon.com order history
@@ -79,264 +312,162 @@ comment3    fees
 For more examples, see [Convert CSV files](https://github.com/simonmichael/hledger/wiki/Convert-CSV-files).
 # TIPS
-# CSV RULES
+## Reading multiple CSV files
-The following seven kinds of rule can appear in the rules file, in any order.
+You can read multiple CSV files at once using multiple `-f` arguments on the command line.
-Blank lines and lines beginning with `#` or `;` are ignored.
+hledger will look for a correspondingly-named rules file for each CSV file.
 If you use the `--rules-file` option, that rules file will be used for all the CSV files.
-## skip
+## Deduplicating, importing
-`skip `*`N`*
+When you download a CSV file repeatedly, eg to get your latest bank
-
+transactions, the new file may contain some of the same records as the
-Skip this many non-empty lines preceding the CSV data.
+old one. The [print --new](hledger.html#print) command is one simple
-(Empty/blank lines are skipped automatically.)
+way to detect just the new transactions. Or better still, the
-You'll need this whenever your CSV data contains header lines. Eg:
+[import](hledger.html#import) command appends those new transactions
-<!-- XXX -->
+to your main journal. This is the easiest way to import CSV data. Eg,
-<!-- hledger tries to skip initial CSV header lines automatically. -->
+after downloading your latest CSV files:
-<!-- If it guesses wrong, use this directive to skip exactly N lines. -->
+```shell
-This can also be used in a conditional block to ignore certain CSV records.
+$ hledger import *.csv [--dry]
 ```rules
 # ignore the first CSV line
 skip 1
 ```
-## date-format
+## Other import methods
-`date-format `*`DATEFMT`*
+A number of other tools and workflows, hledger-specific and otherwise,
 exist for converting, deduplicating, classifying and managing CSV data.
 See:
-When your CSV date fields are not formatted like `YYYY/MM/DD` (or `YYYY-MM-DD` or `YYYY.MM.DD`),
+- <https://hledger.org> -> sidebar -> real world setups
-you'll need to specify the format.
+- <https://plaintextaccounting.org> -> data import/conversion
 DATEFMT is a [strptime-like date parsing pattern](http://hackage.haskell.org/packages/archive/time/latest/doc/html/Data-Time-Format.html#v:formatTime),
 which must parse the date field values completely. Examples:
-``` rules
+## Valid CSV
-# for dates like "11/06/2013":
+
-date-format %m/%d/%Y
+hledger accepts CSV conforming to [RFC 4180](https://tools.ietf.org/html/rfc4180).
 Some things to note when values are enclosed in quotes:
 - you must use double quotes (not single quotes)
 - spaces outside the quotes are [not allowed](https://stackoverflow.com/questions/4863852/space-before-quote-in-csv-field)
 ## Other separator characters
 With the `--separator 'CHAR'` option, hledger will expect the
 separator to be CHAR instead of a comma. Ie it will read other
 "Character Separated Values" formats, such as TSV (Tab Separated Values).
 Note: on the command line, use a real tab character in quotes, not \t. Eg:
 ```shell
 $ hledger -f foo.tsv --separator '	' print
 ```
 (Experimental.)
-``` rules
+## Setting amounts
 # for dates like "6/11/2013" (note the - to make leading zeros optional):
 date-format %-d/%-m/%Y
 ```
-``` rules
+A posting amount can be set in one of these ways:
 # for dates like "2013-Nov-06":
 date-format %Y-%h-%d
 ```
-``` rules
+- by assigning (with a fields list or field assignment) to
-# for dates like "11/6/2013 11:32 PM":
+  `amountN` (posting N's amount) or `amount` (posting 1's amount)
 date-format %-m/%-d/%Y %l:%M %p
 ```
-## field list
+- by assigning to `amountN-in` and `amountN-out` (or `amount-in` and `amount-out`).
  For each CSV record, whichever of these has a non-zero value will be used, with appropriate sign. 
  If both contain a non-zero value, this may not work.
-`fields `*`FIELDNAME1`*, *`FIELDNAME2`*...
+- by assigning to `balanceN` (or `balance`) instead of the above,
-
+  setting the amount indirectly via a 
-This (a) names the CSV fields, in order (names may not contain whitespace; uninteresting names may be left blank),
+  [balance assignment](journal.html#balance-assignments).
 and (b) assigns them to journal entry fields if you use any of these standard field names:
 Fields `date`, `date2`, `status`, `code`, `description` will form transaction description.
 An assignment to any of `accountN`, `amountN`, `amountN-in`, `amountN-out`, `balanceN` or `currencyN` will generate a posting (though it's your responsibility to ensure it is a well formed one). Normally the `N`'s are consecutive starting from 1 but it's not required. One posting will be generated for each unique `N`. If you wish to supply a comment for the posting, use `commentN`, though comment on its own will not cause posting to be generated.
 Fields `amount`, `amount-in`, `amount-out`, `currency`, `balance` and `comment` are treated as aliases for `amount1`, and so on. If your rules file leads to both aliased fields having different values, `hledger` will raise an error.
 Eg:
 ```rules
 # use the 1st, 2nd and 4th CSV fields as the entry's date, description and amount,
 # and give the 7th and 8th fields meaningful names for later reference:
 #
 # CSV field:
 #      1     2            3 4       5 6 7          8
 # entry field:
 fields date, description, , amount1, , , somefield, anotherfield
 ```
 For backwards compatibility, we treat posting 1 specially. If your rules generated just posting 1, another posting would be added to your transaction to balance it. If your rules generated posting 1 and posting 2, but amount in the posting 2 is empty, hledger will fill it out with the opposite of posting 1. This special handling is needed to ensure smooth upgrade path from version 1.15.
 ## field assignment
 *`ENTRYFIELDNAME`* *`FIELDVALUE`*
 This sets a journal entry field (one of the standard names above) to the given text value,
 which can include CSV field values interpolated by name (`%CSVFIELDNAME`) or 1-based position (`%N`).
 <!-- Whitespace before or after the value is ignored. -->
 Eg:
 ```rules
 # set the amount to the 4th CSV field with "USD " prepended
 amount USD %4
 ```
 ```rules
 # combine three fields to make a comment (containing two tags)
 comment note: %somefield - %anotherfield, date: %1
 ```
 Field assignments can be used instead of or in addition to a field list.
 Note, interpolation strips any outer whitespace, so a CSV value like
 `" 1 "` becomes `1` when interpolated ([#1051](https://github.com/simonmichael/hledger/issues/1051)).
 ## conditional block
 `if` *`PATTERN`*\
 &nbsp;&nbsp;&nbsp;&nbsp;*`FIELDASSIGNMENTS`*...
 `if`\
 *`PATTERN`*\
 *`PATTERN`*...\
 &nbsp;&nbsp;&nbsp;&nbsp;*`FIELDASSIGNMENTS`*...
 `if` *`PATTERN`*\
 *`PATTERN`*...\
 &nbsp;&nbsp;&nbsp;&nbsp;*`skip N`*...
 `if` *`PATTERN`*\
 *`PATTERN`*...\
 &nbsp;&nbsp;&nbsp;&nbsp;*`end`*...
 This applies one or more field assignments, only to those CSV records matched by one of the PATTERNs.
 The patterns are case-insensitive regular expressions which match anywhere
 within the whole CSV record (it's not yet possible to match within a
 specific field).  When there are multiple patterns they can be written
 on separate lines, unindented.
 The field assignments are on separate lines indented by at least one space.
 Instead of field assignments you can specify `skip` or `skip 1` to skip this record, `skip N` to skip the next N records (including the one that matched) or `end` to skip the rest of the file.
 Examples:
 ```rules
 # if the CSV record contains "groceries", set account2 to "expenses:groceries"
 if groceries
 account2 expenses:groceries
 ```
 ```rules
 # if the CSV record contains any of these patterns, set account2 and comment as shown
 if
 monthly service fee
 atm transaction fee
 banking thru software
 account2 expenses:business:banking
 comment  XXX deductible ? check it
 ```
 ## include
 `include `*`RULESFILE`*
 Include another rules file at this point. `RULESFILE` is either an absolute file path or
 a path relative to the current file's directory. Eg:
 ```rules
 # rules reused with several CSV files
 include common.rules
 ```
 ## newest-first
 `newest-first`
 Consider adding this rule if all of the following are true: 
 you might be processing just one day of data,
 your CSV records are in reverse chronological order (newest first),
 and you care about preserving the order of same-day transactions.
 It usually isn't needed, because hledger autodetects the CSV order,
 but when all CSV records have the same date it will assume they are oldest first.
 # CSV TIPS
 ## CSV ordering
 The generated [journal entries](journal.html#transactions) will be sorted by date. 
 The order of same-day entries will be preserved 
 (except in the special case where you might need [`newest-first`](#newest-first), see above).
 ## CSV accounts
 Each journal entry will have at least two [postings](journal.html#postings), to `account1` and some other account (usually `account2`).
 It's conventional and recommended to use `account1` for the account whose CSV we are reading.
 ## CSV amounts
 A posting [amount](journal.html#amounts) could be set in one of these ways:
 - with an `amountN` field assignment, which sets the Nth posting's amount
 - (When the CSV has debit and credit amounts in separate fields:)\
  with field assignments for the `amountN-in` and `amountN-out` pseudo
  fields (both of them). Whichever one has a value will be used, with
  appropriate sign. If both contain a value, it might not work so well.
 - with `balanceN` field assignment that creates a [balance assignment](journal.html#balance-assignments) (see below).
 There is some special handling for sign in amounts:
 - If an amount value is parenthesised, it will be de-parenthesised and sign-flipped.
- If an amount value begins with a double minus sign, those will cancel out and be removed.
+- If an amount value begins with a double minus sign, those cancel out and are removed.
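
The two sign rules above can be illustrated with a short Python sketch. The function name `normalise_sign` is hypothetical, and the order in which the rules are applied (parentheses first, then the double minus) is an assumption of this sketch rather than documented behaviour.

```python
def normalise_sign(amount):
    """Apply the two sign rules to a raw CSV amount string."""
    text = amount.strip()
    # Parenthesised amounts are de-parenthesised and sign-flipped.
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1].strip()
    # A leading double minus sign cancels out and is removed.
    if text.startswith("--"):
        text = text[2:]
    return text
```

Under these assumptions `(10.00)` becomes `-10.00`, `--5` becomes `5`, and a plain `12` is returned unchanged.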
 If the currency/commodity symbol is provided as a separate CSV field,
-assign it to the `currency` pseudo field (applicable to the whole transaction) or `currencyN` (applicable to Nth posting only); the symbol will be prepended
+you can assign it to `currency` (affects all posting amounts) or `currencyN` (affects just posting N's amount).
-to the amount 
+The symbol will be prepended to the amount.
-(TODO: <s>when there is an amount</s>).
+Or for more control, you can set both currency symbol and amount with a field assignment, eg:
 Or, you can use an `amountN` [field assignment](#field-assignment) for more control, eg:
 ```
-fields date,description,currency,amount1
+fields date,description,currency,amount
-amount1 %amount1 %currency
+# add currency symbol on the right:
 amount %amount %currency
 ```
-## CSV balance assertions/assignments
+## Referencing other fields
-If the CSV includes a running balance, you can assign that to one of the pseudo fields
+In field assignments, you can interpolate only CSV fields, not hledger
-`balance` (or `balance1`), `balance2`, ... up to `balance9`.
+fields. In the example below, there's both a CSV field and a hledger
-This will generate a [balance assertion](journal.html#balance-assertions) 
+field named amount1, but %amount1 always means the CSV field, not
-(or if the amount is left empty, a [balance assignment](journal.html#balance-assignments)),
+the hledger field:
 on the appropriate posting, whenever the running balance field is non-empty.
-## References to other fields and evaluation order
+```rules
-
+# Name the third CSV field "amount1"
-Field assignments could include references to other fields or even to the same field you are trying to assign:
+fields date,description,amount1
 ```
 fields date,description,currency,amount1
 # Set hledger's amount1 to the CSV amount1 field followed by USD
 amount1 %amount1 USD
 amount1 %amount1 EUR
 amount1 %amount1 %currency
-if SOME_REGEXP
+# Set comment to the CSV amount1 (not the amount1 assigned above)
-    amount1 %amount1 GBP
+comment %amount1
 ```
 This is how this file would be evaluated.
-First, parts of CSV record are assigned according to `fields` directive.
+Here, since there's no CSV amount1 field, %amount1 will produce a literal "amount1":
 ```rules
 fields date,description,csvamount
 amount1 %csvamount USD
 # Can't interpolate amount1 here
 comment %amount1
 ```
-Then all other field assignments -- written at top level, or included in `if` blocks -- are considered to see if they should be applied. They are checked in the order they are written, with later assignment overwriting earlier ones.
+When there are multiple field assignments to the same hledger field,
 only the last one takes effect. Here, comment's value will be B,
 or C if "something" is matched, but never A:
 ```rules
 comment A
 comment B
 if something
 comment C
 ```
-Once full set of field assignments that should be applied is known, their values are computed, and this is when all `%<fieldname>` references are evaluated.
+## How CSV rules are evaluated
-So for a particular row from CSV file, value from fourth column would be assigned to `amount1`.
+Here's how to think of CSV rules being evaluated (if you really need to). First,
-Then `hledger` will decide that `amount1` would have to be amended to `%amount1 USD`, but this will not happen immediately. This choice would be replaced by decision to rewrite `amount1` to `%amount EUR`, which will in turn be thrown away in favor of `%amount1 %currency`. If the `if` block condition will match the row, it will assign `amount1` to `%amount1 GBP`.
+- include - all includes are inlined, from top to bottom, depth first. (At each include point the file is inlined and scanned for further includes, before proceeding.)
-Overall, we will end up with one of the two alternatives for `amount1` - either `%amount1 %currency` or `%amount1 GBP`.
+Then "global" rules are evaluated, top to bottom. If a rule is repeated, the last one wins:
-Now substitution of all referenced values will happen, using the current values for `%amount1` and `currency`, which were provided by the `fields` directive.
+- skip (at top level)
 - date-format
 - newest-first
 - fields - names the CSV fields, optionally sets up initial assignments to hledger fields
 Then for each CSV record in turn:
-## Reading multiple CSV files
+- test all `if` blocks. If any of them contain an `end` rule, skip all remaining CSV records.
  Otherwise if any of them contain a `skip` rule, skip that many CSV records.
  If there are multiple matched skip rules, the first one wins.
 - collect all field assignments at top level and in matched if blocks.
  When there are multiple assignments for a field, keep only the last one.
 - compute a value for each hledger field - either the one that was assigned to it
  (and interpolate the %CSVFIELDNAME references), or a default
 - generate a synthetic hledger transaction from these values, 
  which becomes part of the input to the hledger command that has been selected
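
The "collect all field assignments, keep only the last one" step can be sketched in Python as follows. This is a simplified model, not hledger's code: the rule representation (tuples tagged `"assign"` or `"if"`) and the name `collect_assignments` are invented for illustration, and `skip`/`end` handling is left out.

```python
import re

def collect_assignments(rules, record_fields):
    """Return the effective {hledger field: value template} for one record.

    rules is a list in file order; each item is either
      ("assign", field, value)               -- a top-level field assignment
      ("if", patterns, [(field, value), ...]) -- a conditional block
    """
    record = ",".join(record_fields)
    chosen = {}
    for rule in rules:
        if rule[0] == "assign":
            _, field, value = rule
            chosen[field] = value              # later assignments overwrite earlier ones
        elif rule[0] == "if":
            _, patterns, assignments = rule
            if any(re.search(p, record, re.IGNORECASE) for p in patterns):
                for field, value in assignments:
                    chosen[field] = value
    return chosen
```

Fed the rules `comment A`, `comment B` and an `if something` block assigning `comment C`, this yields `B` for records that do not contain "something" and `C` for those that do, never `A`.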
-You can read multiple CSV files at once using multiple `-f` arguments on the command line,
+## Valid transactions
 and hledger will look for a correspondingly-named rules file for each.
 Note if you use the `--rules-file` option, this one rules file will be used for all the CSV files being read. 
-## Valid CSV
+hledger currently does not post-process and validate transactions
 generated from CSV as thoroughly as transactions read from a journal
 file. This means that if your rules are wrong, you can generate invalid
 transactions. Or, amounts may not be displayed with a canonical
 display style.
-hledger follows [RFC 4180](https://tools.ietf.org/html/rfc4180),
+So when setting up or adjusting CSV rules, you should check your
-with the addition of a customisable separator character.
+results visually with the print command. You can pipe print's output
 through hledger once more to validate and canonicalise fully.
 Eg:
-Some things to note:
+```shell
 $ hledger -f some.csv print | hledger -f- print -I
 ```
-When quoting fields, 
+(The -I/--ignore-assertions flag disables balance assertion checks,
-
+usually needed when re-parsing print output.)
 - you must use double quotes, not single quotes
 - spaces outside the quotes are [not allowed](https://stackoverflow.com/questions/4863852/space-before-quote-in-csv-field).