;doc:import: rewrite deduplication section

Simon Michael 2024-03-24 14:21:44 -10:00
parent 98a93d979a
commit be24d6505f


@@ -21,47 +21,42 @@ or perhaps `hledger import *.csv`.
 Note you can import from any file format, though CSV files are the
 most common import source, and these docs focus on that case.
-### Deduplication
+### "Deduplication"
-`import` does *time-based deduplication*, to detect only the new
-transactions since the last successful import.
-(This does not mean "ignore transactions that look the same",
-but rather "ignore transactions that have been seen before".)
-This is intended for when you are periodically importing downloaded data,
-which may overlap with previous downloads.
-Eg if every week (or every day) you download a bank's last three months of CSV data,
-you can safely run `hledger import thebank.csv` each time and only new transactions will be imported.
+`import` tries to import only the transactions which are new since the last import.
+So if your bank's CSV includes the last three months of data, you can download and `import` it every month (or week, or day)
+and only the new transactions will be imported each time.
-Since the items being read (CSV records, eg) often do not come with
-unique identifiers, hledger detects new transactions by date, assuming
-that:
+It works as follows. For each imported `FILE` (usually a CSV file):
+- It tries to find the latest date seen previously, by reading it from a hidden `.latest.FILE` in the same directory.
+- Then it processes `FILE`, ignoring any transactions on or before the "latest seen" date.
+And after a successful import, it updates the `.latest.FILE`(s) for next time (unless `--dry-run` was used).
+This is simple but fairly effective. It assumes:
 1. new items always have the newest dates
-2. item dates do not change across reads
-3. and items with the same date remain in the same relative order across reads.
+2. item dates are stable across successive CSV downloads
+3. the order of same-date items is stable across CSV downloads
-These are often true of CSV files representing transactions, or true
-enough so that it works pretty well in practice. 1 is important, but
-violations of 2 and 3 amongst the old transactions won't matter (and
-if you import often, the new transactions will be few, so less likely
-to be the ones affected).
+These are true of most CSV files representing transactions, or true enough.
+If you have a bank whose CSV dates or ordering occasionally changes,
+you can reduce the chance of this happening in new transactions by importing more often
+(and in old transactions it doesn't matter).
-hledger remembers the latest date processed in each input file by
-saving a hidden ".latest.FILE" file in FILE's directory
-(after a succesful import).
+Note, `import` avoids reprocessing the same dates across successive runs,
+but it does not detect transactions that are duplicated within a single run.
+So eg if you downloaded but did not import `bank.1.csv`, and later downloaded `bank.2.csv` with overlapping data,
+you should not import both of them in a single run (`hledger import bank.1.csv bank.2.csv`);
+instead, import them one at a time (`hledger import bank.1.csv`, then `hledger import bank.2.csv`).
-Eg when reading `finance/bank.csv`, it will look for and update the
-`finance/.latest.bank.csv` state file.
-The format is simple: one or more lines containing the
-same ISO-format date (YYYY-MM-DD), meaning "I have processed
-transactions up to this date, and this many of them on that date."
-Normally you won't see or manipulate these state files yourself.
-But if needed, you can delete them to reset the state (making all
-transactions "new"), or you can construct them to "catch up" to a
-certain date.
+Normally you can ignore the `.latest.*` files,
+but if needed, you can delete them (to make all transactions unseen),
+or construct/modify them (to catch up to a certain date).
+The format is just a single ISO-format date (`YYYY-MM-DD`), possibly repeated on multiple lines.
+It means "I have seen transactions up to this date, and this many of them occurring on that date".
-Note deduplication (and updating of state files) can also be done by
-[`print --new`](#print), but this is less often used.
+([`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is not often used.)
 Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
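
Reviewer's note: the date-based scheme this section describes can be illustrated with a small Python sketch. It is not hledger's implementation (hledger is written in Haskell), and the function names `parse_latest`, `new_items` and `next_state` are hypothetical. It assumes one reading of the state file format: the repeated date is the latest date seen, and the number of repetitions is how many items on that date were already seen, so items on that date beyond that count are treated as new.

```python
from datetime import date


def parse_latest(text):
    """Parse state-file text: the same ISO date repeated on one or more lines.

    Returns (latest_date, count_seen_on_that_date), or (None, 0) if empty.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None, 0
    return date.fromisoformat(lines[0]), len(lines)


def new_items(items, latest, seen_on_latest):
    """Return the items considered new, given the previous state.

    items: list of (date, description) tuples in file order.
    Assumption (illustrative only): items before `latest` are old, and the
    first `seen_on_latest` items dated `latest` were seen already.
    """
    # sorted() is stable, so same-date items keep their relative order.
    ordered = sorted(items, key=lambda item: item[0])
    if latest is None:
        return ordered
    result = []
    skipped = 0
    for item in ordered:
        item_date = item[0]
        if item_date < latest:
            continue
        if item_date == latest and skipped < seen_on_latest:
            skipped += 1
            continue
        result.append(item)
    return result


def next_state(items):
    """Build the state-file text to save after a successful import."""
    if not items:
        return ""
    newest = max(item[0] for item in items)
    count = sum(1 for item in items if item[0] == newest)
    return (newest.isoformat() + "\n") * count


if __name__ == "__main__":
    latest, seen = parse_latest("2024-03-20\n2024-03-20\n")
    download = [
        (date(2024, 3, 19), "groceries"),
        (date(2024, 3, 20), "rent"),
        (date(2024, 3, 20), "coffee"),
        (date(2024, 3, 20), "books"),
        (date(2024, 3, 21), "salary"),
    ]
    for item in new_items(download, latest, seen):
        print(item[0].isoformat(), item[1])
    print(repr(next_state(download)))
```

The sketch also shows why the three assumptions matter: if an old item's date changed, or same-date items were reordered between downloads, the count-based skip on the latest date would pick the wrong items.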