From 554f7a59fdafe96d36a77e2b9f0f664e53979bc0 Mon Sep 17 00:00:00 2001
From: Simon Michael
Date: Thu, 18 Feb 2021 18:35:06 -0800
Subject: [PATCH] ;import, print: better deduplication docs

---
 hledger/Hledger/Cli/Commands/Import.md | 65 ++++++++++++++++++++++----
 hledger/Hledger/Cli/Commands/Print.md  | 18 ++-----
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/hledger/Hledger/Cli/Commands/Import.md b/hledger/Hledger/Cli/Commands/Import.md
index dfead608c..cdb1a2de3 100644
--- a/hledger/Hledger/Cli/Commands/Import.md
+++ b/hledger/Hledger/Cli/Commands/Import.md
@@ -6,19 +6,64 @@ transactions as imported, without actually importing any.
 
 _FLAGS
 
-The input files are specified as arguments - no need to write -f before each one.
-So eg to add new transactions from all CSV files to the main journal, it's just:
-`hledger import *.csv`
+Unlike other hledger commands, with `import` the journal file is an output file,
+and will be modified, though only by appending (existing data will not be changed).
+The input files are specified as arguments, so to import one or more
+CSV files to your main journal, you will run `hledger import bank.csv`
+or perhaps `hledger import *.csv`.
 
-New transactions are detected in the same way as print --new:
-by assuming transactions are always added to the input files in increasing date order,
-and by saving `.latest.FILE` state files.
+Note you can import from any file format, though CSV files are the
+most common import source, and these docs focus on that case.
 
-The --dry-run output is in journal format, so you can filter it, eg
-to see only uncategorised transactions:
+### Deduplication
+
+As a convenience `import` does *deduplication* while reading transactions.
+This does not mean "ignore transactions that look the same",
+but rather "ignore transactions that have been seen before".
+This is intended for when you are periodically importing foreign data
+which may contain already-imported transactions.
+So eg, if every day you download bank CSV files containing redundant data,
+you can safely run `hledger import bank.csv` and only new transactions will be imported.
+(`import` is idempotent.)
+
+Since the items being read (CSV records, eg) often do not come with
+unique identifiers, hledger detects new transactions by date, assuming
+that:
+
+1. new items always have the newest dates
+2. item dates do not change across reads
+3. and items with the same date remain in the same relative order across reads.
+
+These are often true of CSV files representing transactions, or true
+enough so that it works pretty well in practice. 1 is important, but
+violations of 2 and 3 amongst the old transactions won't matter (and
+if you import often, the new transactions will be few, so less likely
+to be the ones affected).
+
+hledger remembers the latest date processed in each input file by
+saving a hidden ".latest" state file in the same directory. Eg when
+reading `finance/bank.csv`, it will look for and update the
+`finance/.latest.bank.csv` state file.
+The format is simple: one or more lines containing the
+same ISO-format date (YYYY-MM-DD), meaning "I have processed
+transactions up to this date, and this many of them on that date."
+Normally you won't see or manipulate these state files yourself.
+But if needed, you can delete them to reset the state (making all
+transactions "new"), or you can construct them to "catch up" to a
+certain date.
+
+Note deduplication (and updating of state files) can also be done by
+[`print --new`](#print), but this is less often used.
+
+### Import testing
+
+With `--dry-run`, the transactions that will be imported are printed
+to the terminal, without affecting your journal.
+The output is in journal format, so you can re-parse it.
+Eg, to see any importable transactions which CSV rules have not categorised:
 
 ```shell
-$ hledger import --dry ... | hledger -f- print unknown --ignore-assertions
+$ hledger import --dry bank.csv | hledger -f- -I print unknown
 ```
 
 ### Importing balance assignments
@@ -41,4 +86,4 @@ please test it and send a pull request.)
 ### Commodity display styles
 
 Imported amounts will be formatted according to the canonical [commodity styles](hledger.html#commodity-display-style)
-(declared or inferred) in the main journal file.
\ No newline at end of file
+(declared or inferred) in the main journal file.
diff --git a/hledger/Hledger/Cli/Commands/Print.md b/hledger/Hledger/Cli/Commands/Print.md
index c8948a685..345ab4d7e 100644
--- a/hledger/Hledger/Cli/Commands/Print.md
+++ b/hledger/Hledger/Cli/Commands/Print.md
@@ -79,21 +79,9 @@ With `-m`/`--match` and a STR argument, print will show at most one transaction:
 one whose description is most similar to STR, and is most recent. STR should contain at
 least two characters. If there is no similar-enough match, no transaction will be shown.
 
-With `--new`, for each FILE being read, hledger reads (and writes) a special
-state file (`.latest.FILE` in the same directory), containing the latest transaction date(s)
-that were seen last time FILE was read. When this file is found, only transactions
-with newer dates (and new transactions on the latest date) are printed.
-This is useful for ignoring already-seen entries in import data, such as downloaded CSV files.
-Eg:
-
-```shell
-$ hledger -f bank1.csv print --new
-(shows transactions added since last print --new on this file)
-```
-
-This assumes that transactions added to FILE always have same or increasing dates,
-and that transactions on the same day do not get reordered.
-See also the [import](#import) command.
+With `--new`, hledger prints only transactions it has not seen on a previous run.
+This uses the same deduplication system as the [`import`](#import) command.
+(See import's docs for details.)
 
 This command also supports the [output destination](hledger.html#output-destination) and