;import, print: better deduplication docs

Simon Michael 2021-02-18 18:35:06 -08:00
parent f7bbb39a77
commit 554f7a59fd
2 changed files with 58 additions and 25 deletions


@@ -6,19 +6,64 @@ transactions as imported, without actually importing any.
 _FLAGS
-The input files are specified as arguments - no need to write -f before each one.
-So eg to add new transactions from all CSV files to the main journal, it's just:
-`hledger import *.csv`
-New transactions are detected in the same way as print --new:
-by assuming transactions are always added to the input files in increasing date order,
-and by saving `.latest.FILE` state files.
-The --dry-run output is in journal format, so you can filter it, eg
-to see only uncategorised transactions:
+Unlike other hledger commands, with `import` the journal file is an output file,
+and will be modified, though only by appending (existing data will not be changed).
+The input files are specified as arguments, so to import one or more
+CSV files to your main journal, you will run `hledger import bank.csv`
+or perhaps `hledger import *.csv`.
+Note you can import from any file format, though CSV files are the
+most common import source, and these docs focus on that case.
+### Deduplication
+As a convenience `import` does *deduplication* while reading transactions.
+This does not mean "ignore transactions that look the same",
+but rather "ignore transactions that have been seen before".
+This is intended for when you are periodically importing foreign data
+which may contain already-imported transactions.
+So eg, if every day you download bank CSV files containing redundant data,
+you can safely run `hledger import bank.csv` and only new transactions will be imported.
+(`import` is idempotent.)
+Since the items being read (CSV records, eg) often do not come with
+unique identifiers, hledger detects new transactions by date, assuming
+that:
+1. new items always have the newest dates
+2. item dates do not change across reads
+3. and items with the same date remain in the same relative order across reads.
+These are often true of CSV files representing transactions, or true
+enough so that it works pretty well in practice. 1 is important, but
+violations of 2 and 3 amongst the old transactions won't matter (and
+if you import often, the new transactions will be few, so less likely
+to be the ones affected).
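The three assumptions above suggest a simple filter. The following is a minimal Python sketch of that idea, not hledger's actual (Haskell) implementation. The function name `new_items` and the `(date, description)` tuple layout are hypothetical, chosen only for illustration.

```python
from datetime import date


def new_items(items, latest_date, seen_on_latest):
    """Return the items considered new under the three assumptions.

    items: list of (date, description) tuples, in file order (hypothetical layout).
    latest_date: latest date processed on the previous read, or None if no state.
    seen_on_latest: how many items dated latest_date were already processed.
    """
    if latest_date is None:
        return list(items)  # no saved state: every item counts as new

    result = []
    skipped = 0
    for d, desc in items:
        if d > latest_date:
            result.append((d, desc))  # strictly newer date: new
        elif d == latest_date:
            if skipped < seen_on_latest:
                skipped += 1  # already processed on an earlier read
            else:
                result.append((d, desc))  # arrived later on the same date
        # d < latest_date: older than the saved state, treated as already seen
    return result
```

Running the filter a second time with the updated date and count yields nothing, which is the idempotence the text describes.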
+hledger remembers the latest date processed in each input file by
+saving a hidden ".latest" state file in the same directory. Eg when
+reading `finance/bank.csv`, it will look for and update the
+`finance/.latest.bank.csv` state file.
+The format is simple: one or more lines containing the
+same ISO-format date (YYYY-MM-DD), meaning "I have processed
+transactions up to this date, and this many of them on that date."
+Normally you won't see or manipulate these state files yourself.
+But if needed, you can delete them to reset the state (making all
+transactions "new"), or you can construct them to "catch up" to a
+certain date.
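To make the state file format concrete, here is a small Python sketch that reads and writes a file in the layout described above (the same ISO date repeated, one line per transaction processed on that date). It is an illustration only. The helper names `state_path`, `read_state` and `write_state` are hypothetical and are not part of hledger.

```python
from datetime import date
from pathlib import Path


def state_path(input_file):
    """Map eg finance/bank.csv to finance/.latest.bank.csv."""
    p = Path(input_file)
    return p.with_name(".latest." + p.name)


def read_state(input_file):
    """Return (latest_date, count), or (None, 0) if there is no usable state."""
    path = state_path(input_file)
    if not path.exists():
        return None, 0
    lines = [ln.strip() for ln in path.read_text().splitlines() if ln.strip()]
    if not lines:
        return None, 0
    # Every line holds the same date; the number of lines is the count.
    return date.fromisoformat(lines[0]), len(lines)


def write_state(input_file, latest_date, count):
    """Write the date `count` times, one per line."""
    state_path(input_file).write_text((latest_date.isoformat() + "\n") * count)
```

In these terms, deleting the state file corresponds to `read_state` returning `(None, 0)`, and "catching up" corresponds to calling `write_state` with the chosen date.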
+Note deduplication (and updating of state files) can also be done by
+[`print --new`](#print), but this is less often used.
+### Import testing
+With `--dry-run`, the transactions that will be imported are printed
+to the terminal, without affecting your journal.
+The output is in journal format, so you can re-parse it.
+Eg, to see any importable transactions which CSV rules have not categorised:
 ```shell
-$ hledger import --dry ... | hledger -f- print unknown --ignore-assertions
+$ hledger import --dry bank.csv | hledger -f- -I print unknown
 ```
 ### Importing balance assignments
@@ -41,4 +86,4 @@ please test it and send a pull request.)
 ### Commodity display styles
 Imported amounts will be formatted according to the canonical [commodity styles](hledger.html#commodity-display-style)
 (declared or inferred) in the main journal file.


@@ -79,21 +79,9 @@ With `-m`/`--match` and a STR argument, print will show at most one transaction:
 one whose description is most similar to STR, and is most recent. STR should contain at
 least two characters. If there is no similar-enough match, no transaction will be shown.
-With `--new`, for each FILE being read, hledger reads (and writes) a special
-state file (`.latest.FILE` in the same directory), containing the latest transaction date(s)
-that were seen last time FILE was read. When this file is found, only transactions
-with newer dates (and new transactions on the latest date) are printed.
-This is useful for ignoring already-seen entries in import data, such as downloaded CSV files.
-Eg:
-```shell
-$ hledger -f bank1.csv print --new
-(shows transactions added since last print --new on this file)
-```
-This assumes that transactions added to FILE always have same or increasing dates,
-and that transactions on the same day do not get reordered.
-See also the [import](#import) command.
+With `--new`, hledger prints only transactions it has not seen on a previous run.
+This uses the same deduplication system as the [`import`](#import) command.
+(See import's docs for details.)
 This command also supports the
 [output destination](hledger.html#output-destination) and