From d92351e21a3b67debde8285c49c4656a20296ac8 Mon Sep 17 00:00:00 2001 From: Simon Michael Date: Wed, 6 Nov 2019 13:08:54 -0800 Subject: [PATCH] csv: doc: clean up/expand manual after #1095 [ci skip] --- hledger-lib/hledger_csv.m4.md | 571 +++++++++++++++++++++------------- 1 file changed, 351 insertions(+), 220 deletions(-) diff --git a/hledger-lib/hledger_csv.m4.md b/hledger-lib/hledger_csv.m4.md index 4055c857f..42c4ed330 100644 --- a/hledger-lib/hledger_csv.m4.md +++ b/hledger-lib/hledger_csv.m4.md @@ -25,7 +25,7 @@ Converting CSV to transactions requires some special conversion rules. These do several things: - they describe the layout and format of the CSV data -- they can customize the generated journal entries using a simple templating language +- they can customize the generated journal entries (transactions) using a simple templating language - they can add refinements based on patterns in the CSV data, eg categorizing transactions with more detailed account names. When reading a CSV file named `FILE.csv`, hledger looks for a @@ -38,12 +38,245 @@ At minimum, the rules file must identify the date and amount fields. It's often necessary to specify the date format, and the number of header lines to skip, also. Eg: ``` -fields date, _, _, amount1 +fields date, _, _, amount date-format %d/%m/%Y skip 1 ``` -A more complete example: +More examples in the EXAMPLES section below. + +# CSV RULES + +The following kinds of rule can appear in the rules file, in any order +(except for `end` which can appear only inside a conditional block). +Blank lines and lines beginning with `#` or `;` are ignored. + +## `skip` + +```rules +skip N +``` +The word "skip" followed by a number (or no number, meaning 1) +tells hledger to ignore this many non-empty lines preceding the CSV data. +(Empty/blank lines are skipped automatically.) +You'll need this whenever your CSV data contains header lines. 
+ +It also has a second purpose: it can be used to ignore certain CSV +records, see [conditional blocks](#if) below. + +## `fields` + +```rules +fields FIELDNAME1, FIELDNAME2, ... +``` +A fields list ("fields" followed by one or more comma-separated field names) is the quick way to assign CSV field values to hledger fields. +It (a) names the CSV fields, in order (names may not contain whitespace; fields you don't care about can be left unnamed), +and (b) assigns them to hledger fields if you use standard hledger field names. +Here's an example: +```rules +# use the 1st, 2nd and 4th CSV fields as the transaction's date, description and amount, +# ignore the 3rd, 5th and 6th fields, +# and name the 7th and 8th fields for later reference: +# 1 2 3 4 5 6 7 8 + +fields date, description, , amount1, , , somefield, anotherfield +``` + +Here are the standard hledger field names: + +### Transaction fields + +`date`, `date2`, `status`, `code`, `description`, `comment` can be used to form the +[transaction's](journal.html#transactions) first line. Only `date` is required. +(See also [date-format](#date-format) below.) + +### Posting fields + +`accountN`, where N is 1 to 9, sets the Nth [posting's](journal.html#postings) account name. +Most often there are two postings, so you'll want to set `account1` and `account2`. 
+ + +A number of field/pseudo-field names are available for setting posting [amounts](journal.html#amounts): + +- `amountN` sets posting N's amount +- `amountN-in` and `amountN-out` can be used instead, if the CSV has separate fields for debits and credits +- `currencyN` sets a currency symbol to be left-prefixed to the amount, useful if the CSV provides that as a separate field +- `balanceN` sets a (separate) [balance assertion](journal.html#balance-assertions) amount + (or when no posting amount is set, a [balance assignment](journal.html#balance-assignments)) + +If you write these with no number +(`amount`, `amount-in`, `amount-out`, `currency`, `balance`), +it means posting 1. +Also, if you set an amount for posting 1 only, +a second posting that balances the transaction will be generated automatically. +This helps support CSV rules created before hledger 1.16. + + +Finally, `commentN` sets a [comment](journal.html#comments) on the Nth posting. +Comments can of course contain [tags](journal.html#tags). + +## `(field assignment)` + +```rules +HLEDGERFIELDNAME FIELDVALUE +``` + +Instead of or in addition to a [fields list](#fields), you can +assign a value to a hledger field by writing its name +(any of the standard names above) followed by a text value. +The value may contain interpolated CSV fields, +referenced by their 1-based position in the CSV record (`%N`), +or by the name they were given in the fields list (`%CSVFIELDNAME`). +Eg: +```rules +# set the amount to the 4th CSV field, with " USD" appended +amount %4 USD +``` +```rules +# combine three fields to make a comment, containing note: and date: tags +comment note: %somefield - %anotherfield, date: %1 +``` +Interpolation strips any outer whitespace, so a CSV value like `" 1 "` +becomes `1` when interpolated +([#1051](https://github.com/simonmichael/hledger/issues/1051)). +Note you can only interpolate CSV fields, not the hledger fields being assigned to; +for more on this, see [TIPS](#tips). 
+ +## `date-format` + +```rules +date-format DATEFMT +``` +This is a helper for the `date` (and `date2`) fields. +If your CSV dates are not formatted like `YYYY-MM-DD`, `YYYY/MM/DD` or `YYYY.MM.DD`, +you'll need to specify the format by writing "date-format" followed by +a [strptime-like date parsing pattern](http://hackage.haskell.org/packages/archive/time/latest/doc/html/Data-Time-Format.html#v:formatTime), +which must parse the date field values completely. Examples: + +``` rules +# for dates like "11/06/2013": +date-format %m/%d/%Y +``` + +``` rules +# for dates like "6/11/2013". The - allows leading zeros to be optional. +date-format %-d/%-m/%Y +``` + +``` rules +# for dates like "2013-Nov-06": +date-format %Y-%h-%d +``` + +``` rules +# for dates like "11/6/2013 11:32 PM": +date-format %-m/%-d/%Y %l:%M %p +``` + +## `if` + +```rules +if PATTERN + RULE + +if +PATTERN +PATTERN +PATTERN + RULE + RULE +``` + +Conditional blocks apply one or more rules to CSV records which are +matched by any of the PATTERNs. This allows transactions to be +customised or categorised based on patterns in the data. + +A single pattern can be written on the same line as the "if"; +or multiple patterns can be written on the following lines, non-indented. + +Patterns are case-insensitive [regular expressions](hledger.html#regular-expressions) +which try to match any part of the whole CSV record. +It's not yet possible to match within a specific field. +Note the CSV record they see is close but not identical to the one in the CSV file; +eg double quotes are removed, and the separator character becomes comma. + +After the patterns, there should be one or more rules to apply, all +indented by at least one space. Three kinds of rule are allowed in +conditional blocks: + +- [field assignments](#field-assignment) (to set a field's value) +- [skip](#skip) (to skip the matched CSV record) +- [end](#end) (to skip all remaining CSV records). 
+ +Examples: +```rules +# if the CSV record contains "groceries", set account2 to "expenses:groceries" +if groceries + account2 expenses:groceries +``` +```rules +# if the CSV record contains any of these patterns, set account2 and comment as shown +if +monthly service fee +atm transaction fee +banking thru software + account2 expenses:business:banking + comment XXX deductible ? check it +``` + +## `end` + +As mentioned above, this rule can be used inside conditional blocks +(only) to cause hledger to stop reading CSV records and proceed with +command execution. Eg: +```rules +# ignore everything following the first empty record +if ,,,, + end +``` + +## `include` + +```rules +include RULESFILE +``` + +Include another CSV rules file at this point, as if it were written inline. +`RULESFILE` is an absolute file path or a path relative to the current file's directory. + +This can be useful eg for reusing common rules in several rules files: +```rules +# someaccount.csv.rules + +## someaccount-specific rules +fields date,description,amount +account1 some:account +account2 some:misc + +## common rules +include categorisation.rules +``` + +## `newest-first` + +hledger always sorts the generated transactions by date. +Transactions on the same date should appear in the same order as their CSV records, +as hledger can usually auto-detect whether the CSV's normal order is oldest first or newest first. +But if all of the following are true: + +- the CSV might sometimes contain just one day of data (all records having the same date) +- the CSV records are normally in reverse chronological order (newest first) +- and you care about preserving the order of same-day transactions + +you should add the `newest-first` rule as a hint. 
Eg: +```rules +# tell hledger explicitly that the CSV is normally newest-first +newest-first +``` + +# EXAMPLES + +A more complete example, generating three-posting transactions: ``` # hledger CSV rules for amazon.com order history @@ -79,264 +312,162 @@ comment3 fees For more examples, see [Convert CSV files](https://github.com/simonmichael/hledger/wiki/Convert-CSV-files). +# TIPS -# CSV RULES +## Reading multiple CSV files -The following seven kinds of rule can appear in the rules file, in any order. -Blank lines and lines beginning with `#` or `;` are ignored. +You can read multiple CSV files at once using multiple `-f` arguments on the command line. +hledger will look for a correspondingly-named rules file for each CSV file. +If you use the `--rules-file` option, that rules file will be used for all the CSV files. -## skip +## Deduplicating, importing -`skip `*`N`* - -Skip this many non-empty lines preceding the CSV data. -(Empty/blank lines are skipped automatically.) -You'll need this whenever your CSV data contains header lines. Eg: - - - -This can also be used in a conditional block to ignore certain CSV records. -```rules -# ignore the first CSV line -skip 1 +When you download a CSV file repeatedly, eg to get your latest bank +transactions, the new file may contain some of the same records as the +old one. The [print --new](hledger.html#print) command is one simple +way to detect just the new transactions. Or better still, the +[import](hledger.html#import) command appends those new transactions +to your main journal. This is the easiest way to import CSV data. Eg, +after downloading your latest CSV files: +```shell +$ hledger import *.csv [--dry] ``` -## date-format +## Other import methods -`date-format `*`DATEFMT`* +A number of other tools and workflows, hledger-specific and otherwise, +exist for converting, deduplicating, classifying and managing CSV data. 
+See: -When your CSV date fields are not formatted like `YYYY/MM/DD` (or `YYYY-MM-DD` or `YYYY.MM.DD`), -you'll need to specify the format. -DATEFMT is a [strptime-like date parsing pattern](http://hackage.haskell.org/packages/archive/time/latest/doc/html/Data-Time-Format.html#v:formatTime), -which must parse the date field values completely. Examples: +- -> sidebar -> real world setups +- -> data import/conversion -``` rules -# for dates like "11/06/2013": -date-format %m/%d/%Y +## Valid CSV + +hledger accepts CSV conforming to [RFC 4180](https://tools.ietf.org/html/rfc4180). +Some things to note when values are enclosed in quotes: + +- you must use double quotes (not single quotes) +- spaces outside the quotes are [not allowed](https://stackoverflow.com/questions/4863852/space-before-quote-in-csv-field) + +## Other separator characters + +With the `--separator 'CHAR'` option, hledger will expect the +separator to be CHAR instead of a comma. Ie it will read other +"Character Separated Values" formats, such as TSV (Tab Separated Values). +Note: on the command line, use a real tab character in quotes, not \t. Eg: +```shell +$ hledger -f foo.tsv --separator '	' print ``` +(Experimental.) -``` rules -# for dates like "6/11/2013" (note the - to make leading zeros optional): -date-format %-d/%-m/%Y -``` +## Setting amounts -``` rules -# for dates like "2013-Nov-06": -date-format %Y-%h-%d -``` +A posting amount can be set in one of these ways: -``` rules -# for dates like "11/6/2013 11:32 PM": -date-format %-m/%-d/%Y %l:%M %p -``` +- by assigning (with a fields list or field assignment) to + `amountN` (posting N's amount) or `amount` (posting 1's amount) -## field list +- by assigning to `amountN-in` and `amountN-out` (or `amount-in` and `amount-out`). + For each CSV record, whichever of these has a non-zero value will be used, with appropriate sign. + If both contain a non-zero value, this may not work. -`fields `*`FIELDNAME1`*, *`FIELDNAME2`*... 
- -This (a) names the CSV fields, in order (names may not contain whitespace; uninteresting names may be left blank), -and (b) assigns them to journal entry fields if you use any of these standard field names: - -Fields `date`, `date2`, `status`, `code`, `description` will form transaction description. - -An assignment to any of `accountN`, `amountN`, `amountN-in`, `amountN-out`, `balanceN` or `currencyN` will generate a posting (though it's your responsibility to ensure it is a well formed one). Normally the `N`'s are consecutive starting from 1 but it's not required. One posting will be generated for each unique `N`. If you wish to supply a comment for the posting, use `commentN`, though comment on its own will not cause posting to be generated. - -Fields `amount`, `amount-in`, `amount-out`, `currency`, `balance` and `comment` are treated as aliases for `amount1`, and so on. If your rules file leads to both aliased fields having different values, `hledger` will raise an error. - -Eg: -```rules -# use the 1st, 2nd and 4th CSV fields as the entry's date, description and amount, -# and give the 7th and 8th fields meaningful names for later reference: -# -# CSV field: -# 1 2 3 4 5 6 7 8 -# entry field: -fields date, description, , amount1, , , somefield, anotherfield -``` - -For backwards compatibility, we treat posting 1 specially. If your rules generated just posting 1, another posting would be added to your transaction to balance it. If your rules generated posting 1 and posting 2, but amount in the posting 2 is empty, hledger will fill it out with the opposite of posting 1. This special handling is needed to ensure smooth upgrade path from version 1.15. - -## field assignment - -*`ENTRYFIELDNAME`* *`FIELDVALUE`* - -This sets a journal entry field (one of the standard names above) to the given text value, -which can include CSV field values interpolated by name (`%CSVFIELDNAME`) or 1-based position (`%N`). 
- -Eg: -```rules -# set the amount to the 4th CSV field with "USD " prepended -amount USD %4 -``` -```rules -# combine three fields to make a comment (containing two tags) -comment note: %somefield - %anotherfield, date: %1 -``` - -Field assignments can be used instead of or in addition to a field list. - -Note, interpolation strips any outer whitespace, so a CSV value like -`" 1 "` becomes `1` when interpolated ([#1051](https://github.com/simonmichael/hledger/issues/1051)). - -## conditional block - -`if` *`PATTERN`*\ -    *`FIELDASSIGNMENTS`*... - -`if`\ -*`PATTERN`*\ -*`PATTERN`*...\ -    *`FIELDASSIGNMENTS`*... - -`if` *`PATTERN`*\ -*`PATTERN`*...\ -    *`skip N`*... - -`if` *`PATTERN`*\ -*`PATTERN`*...\ -    *`end`*... - -This applies one or more field assignments, only to those CSV records matched by one of the PATTERNs. -The patterns are case-insensitive regular expressions which match anywhere -within the whole CSV record (it's not yet possible to match within a -specific field). When there are multiple patterns they can be written -on separate lines, unindented. -The field assignments are on separate lines indented by at least one space. - -Instead of field assignments you can specify `skip` or `skip 1` to skip this record, `skip N` to skip the next N records (including the one that matchied) or `end` to skip the rest of the file. - -Examples: -```rules -# if the CSV record contains "groceries", set account2 to "expenses:groceries" -if groceries - account2 expenses:groceries -``` -```rules -# if the CSV record contains any of these patterns, set account2 and comment as shown -if -monthly service fee -atm transaction fee -banking thru software - account2 expenses:business:banking - comment XXX deductible ? check it -``` - -## include - -`include `*`RULESFILE`* - -Include another rules file at this point. `RULESFILE` is either an absolute file path or -a path relative to the current file's directory. 
Eg: -```rules -# rules reused with several CSV files -include common.rules -``` - -## newest-first - -`newest-first` - -Consider adding this rule if all of the following are true: -you might be processing just one day of data, -your CSV records are in reverse chronological order (newest first), -and you care about preserving the order of same-day transactions. -It usually isn't needed, because hledger autodetects the CSV order, -but when all CSV records have the same date it will assume they are oldest first. - -# CSV TIPS - -## CSV ordering - -The generated [journal entries](journal.html#transactions) will be sorted by date. -The order of same-day entries will be preserved -(except in the special case where you might need [`newest-first`](#newest-first), see above). - -## CSV accounts - -Each journal entry will have at least two [postings](journal.html#postings), to `account1` and some other account (usually `account2`). -It's conventional and recommended to use `account1` for the account whose CSV we are reading. - -## CSV amounts - -A posting [amount](journal.html#amounts) could be set in one of these ways: - -- with an `amountN` field assignment, which sets the Nth posting's amount - -- (When the CSV has debit and credit amounts in separate fields:)\ - with field assignments for the `amountN-in` and `amountN-out` pseudo - fields (both of them). Whichever one has a value will be used, with - appropriate sign. If both contain a value, it might not work so well. - -- with `balanceN` field assignment that creates a [balance assignment](journal.html#balance-assignments) (see below). +- by assigning to `balanceN` (or `balance`) instead of the above, + setting the amount indirectly via a + [balance assignment](journal.html#balance-assignments). There is some special handling for sign in amounts: - If an amount value is parenthesised, it will be de-parenthesised and sign-flipped. 
-- If an amount value begins with a double minus sign, those will cancel out and be removed. +- If an amount value begins with a double minus sign, those cancel out and are removed. If the currency/commodity symbol is provided as a separate CSV field, -assign it to the `currency` pseudo field (applicable to the whole transaction) or `currencyN` (applicable to Nth posting only); the symbol will be prepended -to the amount -(TODO: when there is an amount). -Or, you can use an `amountN` [field assignment](#field-assignment) for more control, eg: +you can assign it to `currency` (affects all posting amounts) or `currencyN` (affects just posting N's amount). +The symbol will be prepended to the amount. +Or for more control, you can set both currency symbol and amount with a field assignment, eg: ``` -fields date,description,currency,amount1 -amount1 %amount1 %currency +fields date,description,currency,amount +# add currency symbol on the right: +amount %amount %currency ``` -## CSV balance assertions/assignments +## Referencing other fields -If the CSV includes a running balance, you can assign that to one of the pseudo fields -`balance` (or `balance1`), `balance2`, ... up to `balance9`. -This will generate a [balance assertion](journal.html#balance-assertions) -(or if the amount is left empty, a [balance assignment](journal.html#balance-assignments)), -on the appropriate posting, whenever the running balance field is non-empty. +In field assignments, you can interpolate only CSV fields, not hledger +fields. 
In the example below, there's both a CSV field and a hledger +field named amount1, but %amount1 always means the CSV field, not +the hledger field: -## References to other fields and evaluation order - -Field assignments could include references to other fields or even to the same field you are trying to assign: - -``` -fields date,description,currency,amount1 +```rules +# Name the third CSV field "amount1" +fields date,description,amount1 +# Set hledger's amount1 to the CSV amount1 field followed by USD amount1 %amount1 USD -amount1 %amount1 EUR -amount1 %amount1 %currency -if SOME_REGEXP - amount1 %amount1 GBP +# Set comment to the CSV amount1 (not the amount1 assigned above) +comment %amount1 ``` -This is how this file would be evaluated. -First, parts of CVS record are assigned according to `fields` directive. +Here, since there's no CSV amount1 field, %amount1 will produce a literal "amount1": +```rules +fields date,description,csvamount +amount1 %csvamount USD +# Can't interpolate amount1 here +comment %amount1 +``` -Then all other field assignments -- written at top level, or included in `if` blocks -- are considered to see if they should be applied. They are checked in the order they are written, with later assignment overwriting earlier ones. +When there are multiple field assignments to the same hledger field, +only the last one takes effect. Here, comment's value will be B, +or C if "something" is matched, but never A: +```rules +comment A +comment B +if something + comment C +``` -Once full set of field assignments that should be applied is known, their values are computed, and this is when all `%` references are evaluated. +## How CSV rules are evaluated -So for a particular row from CSV file, value from fourth column would be assigned to `amount1`. +Here's how to think of CSV rules being evaluated (if you really need to). 
First, -Then `hledger` will decide that `amount1` would have to be amended to `%amount1 USD`, but this will not happen immediately. This choice would be replaced by decision to rewrite `amount1` to `%amount EUR`, which will in turn be thrown away in favor of `%amount1 %currency`. If the `if` block condition will match the row, it will assign `amount1` to `%amount1 GBP`. +- include - all includes are inlined, from top to bottom, depth first. (At each include point the file is inlined and scanned for further includes, before proceeding.) -Overall, we will end up with one of the two alternatives for `amount1` - either `%amount1 %currency` or `%amount1 GBP`. +Then "global" rules are evaluated, top to bottom. If a rule is repeated, the last one wins: -Now substitution of all referenced values will happen, using the current values for `%amount1` and `currency`, which were provided by the `fields` directive. +- skip (at top level) +- date-format +- newest-first +- fields - names the CSV fields, optionally sets up initial assignments to hledger fields +Then for each CSV record in turn: -## Reading multiple CSV files +- test all `if` blocks. If any of them contain an `end` rule, skip all remaining CSV records. + Otherwise if any of them contain a `skip` rule, skip that many CSV records. + If there are multiple matched skip rules, the first one wins. +- collect all field assignments at top level and in matched if blocks. + When there are multiple assignments for a field, keep only the last one. +- compute a value for each hledger field - either the one that was assigned to it + (and interpolate the %CSVFIELDNAME references), or a default +- generate a synthetic hledger transaction from these values, + which becomes part of the input to the hledger command that has been selected -You can read multiple CSV files at once using multiple `-f` arguments on the command line, -and hledger will look for a correspondingly-named rules file for each. 
-Note if you use the `--rules-file` option, this one rules file will be used for all the CSV files being read. +## Valid transactions -## Valid CSV +hledger currently does not post-process and validate transactions +generated from CSV as thoroughly as transactions read from a journal +file. This means that if your rules are wrong, you can generate invalid +transactions. Or, amounts may not be displayed with a canonical +display style. -hledger follows [RFC 4180](https://tools.ietf.org/html/rfc4180), -with the addition of a customisable separator character. +So when setting up or adjusting CSV rules, you should check your +results visually with the print command. You can pipe print's output +through hledger once more to validate and canonicalise fully. +Eg: -Some things to note: +```shell +$ hledger -f some.csv print | hledger -f- print -I +``` -When quoting fields, - -- you must use double quotes, not single quotes -- spaces outside the quotes are [not allowed](https://stackoverflow.com/questions/4863852/space-before-quote-in-csv-field). +(The -I/--ignore-assertions flag disables balance assertion checks, +usually needed when re-parsing print output.)
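
As a reviewer's aid for the "Referencing other fields" and "How CSV rules are evaluated" text added by this patch, here is a minimal Python sketch of the per-record step: collect the field assignments that apply, keep only the last one per hledger field, then interpolate `%N` and `%CSVFIELDNAME` references from the CSV record only. It is not hledger's implementation (hledger is written in Haskell). The function names, the dict-based rule representation and the sample data are hypothetical, and `skip`, `end`, `include` and default values are left out. Leaving an unknown reference untouched is also an assumption of this sketch.

```python
import re


def interpolate(template, record, field_names):
    """Replace %N (1-based position) and %NAME references with CSV values.

    Only CSV fields can be referenced, never the hledger fields being
    assigned. Interpolated values have outer whitespace stripped.
    """
    def lookup(match):
        ref = match.group(1)
        if ref.isdigit():
            index = int(ref) - 1
        elif ref in field_names:
            index = field_names.index(ref)
        else:
            return match.group(0)  # assumption: unknown names are left as-is
        if 0 <= index < len(record):
            return record[index].strip()
        return match.group(0)

    return re.sub(r"%([A-Za-z0-9_-]+)", lookup, template)


def resolve_assignments(rules, record):
    """Walk the rules in file order; the last applicable assignment wins."""
    whole_record = ",".join(record)  # patterns see the whole record
    chosen = {}
    for rule in rules:
        patterns = rule.get("if")  # None means a top-level assignment
        if patterns is None or any(
            re.search(p, whole_record, re.IGNORECASE) for p in patterns
        ):
            chosen[rule["field"]] = rule["value"]
    return chosen


def evaluate(rules, field_names, record):
    """Compute the final value of each assigned hledger field for one record."""
    chosen = resolve_assignments(rules, record)
    return {
        field: interpolate(value, record, field_names)
        for field, value in chosen.items()
    }


# Hypothetical rules, mirroring the examples in the manual text.
FIELD_NAMES = ["date", "description", "amount1"]
RULES = [
    {"field": "amount1", "value": "%amount1 USD"},
    {"field": "comment", "value": "A"},
    {"field": "comment", "value": "B"},
    {"if": ["something"], "field": "comment", "value": "C %amount1"},
]

matched = evaluate(RULES, FIELD_NAMES, ["2019-11-06", "Something nice", " 10 "])
unmatched = evaluate(RULES, FIELD_NAMES, ["2019-11-06", "coffee", "3"])
```

With this sample data, `matched["comment"]` is `"C 10"` rather than `"C 10 USD"`, because `%amount1` refers to the CSV field and not to the hledger field assigned on the first rule. `unmatched["comment"]` is `"B"`, never `"A"`.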