;doc:import: rewrite deduplication section
parent
98a93d979a
commit
be24d6505f
@@ -21,47 +21,42 @@ or perhaps `hledger import *.csv`.
|
|||||||
Note you can import from any file format, though CSV files are the
|
Note you can import from any file format, though CSV files are the
|
||||||
most common import source, and these docs focus on that case.
|
most common import source, and these docs focus on that case.
|
||||||
|
|
||||||
### Deduplication
|
### "Deduplication"
|
||||||
|
|
||||||
`import` does *time-based deduplication*, to detect only the new
|
`import` tries to import only the transactions which are new since the last import.
|
||||||
transactions since the last successful import.
|
So if your bank's CSV includes the last three months of data, you can download and `import` it every month (or week, or day)
|
||||||
(This does not mean "ignore transactions that look the same",
|
and only the new transactions will be imported each time.
|
||||||
but rather "ignore transactions that have been seen before".)
|
|
||||||
This is intended for when you are periodically importing downloaded data,
|
|
||||||
which may overlap with previous downloads.
|
|
||||||
Eg if every week (or every day) you download a bank's last three months of CSV data,
|
|
||||||
you can safely run `hledger import thebank.csv` each time and only new transactions will be imported.
|
|
||||||
|
|
||||||
Since the items being read (CSV records, eg) often do not come with
|
It works as follows. For each imported `FILE` (usually a CSV file):
|
||||||
unique identifiers, hledger detects new transactions by date, assuming
|
- It tries to find the latest date seen previously, by reading it from a hidden `.latest.FILE` in the same directory.
|
||||||
that:
|
- Then it processes `FILE`, ignoring any transactions on or before the "latest seen" date.
|
||||||
|
|
||||||
|
And after a successful import, it updates the `.latest.FILE`(s) for next time (unless `--dry-run` was used).
|
||||||
|
|
||||||
|
This is simple but fairly effective. It assumes:
|
||||||
|
|
||||||
1. new items always have the newest dates
|
1. new items always have the newest dates
|
||||||
2. item dates do not change across reads
|
2. item dates are stable across successive CSV downloads
|
||||||
3. and items with the same date remain in the same relative order across reads.
|
3. the order of same-date items is stable across CSV downloads
|
||||||
|
|
||||||
These are often true of CSV files representing transactions, or true
|
These are true of most CSV files representing transactions, or true enough.
|
||||||
enough so that it works pretty well in practice. 1 is important, but
|
If you have a bank whose CSV dates or ordering occasionally changes,
|
||||||
violations of 2 and 3 amongst the old transactions won't matter (and
|
you can reduce the chance of this happening in new transactions by importing more often
|
||||||
if you import often, the new transactions will be few, so less likely
|
(and in old transactions it doesn't matter).
|
||||||
to be the ones affected).
|
|
||||||
|
|
||||||
hledger remembers the latest date processed in each input file by
|
Note, `import` avoids reprocessing the same dates across successive runs,
|
||||||
saving a hidden ".latest.FILE" file in FILE's directory
|
but it does not detect transactions that are duplicated within a single run.
|
||||||
(after a succesful import).
|
So eg if you downloaded but did not import `bank.1.csv`, and later downloaded `bank.2.csv` with overlapping data,
|
||||||
|
you should not import both of them in a single run (`hledger import bank.1.csv bank.2.csv`);
|
||||||
|
instead, import them one at a time (`hledger import bank.1.csv`, then `hledger import bank.2.csv`).
|
||||||
|
|
||||||
Eg when reading `finance/bank.csv`, it will look for and update the
|
Normally you can ignore the `.latest.*` files,
|
||||||
`finance/.latest.bank.csv` state file.
|
but if needed, you can delete them (to make all transactions unseen),
|
||||||
The format is simple: one or more lines containing the
|
or construct/modify them (to catch up to a certain date).
|
||||||
same ISO-format date (YYYY-MM-DD), meaning "I have processed
|
The format is just a single ISO-format date (`YYYY-MM-DD`), possibly repeated on multiple lines.
|
||||||
transactions up to this date, and this many of them on that date."
|
It means "I have seen transactions up to this date, and this many of them occurring on that date".
|
||||||
Normally you won't see or manipulate these state files yourself.
|
|
||||||
But if needed, you can delete them to reset the state (making all
|
|
||||||
transactions "new"), or you can construct them to "catch up" to a
|
|
||||||
certain date.
|
|
||||||
|
|
||||||
Note deduplication (and updating of state files) can also be done by
|
([`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is not often used.)
|
||||||
[`print --new`](#print), but this is less often used.
|
|
||||||
|
|
||||||
Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
|
Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
|
||||||
|
|
||||||
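For readers of this change, the date-based skipping described in the new text can be illustrated with a small Python sketch. This is not hledger's implementation (hledger is written in Haskell); the function names `parse_latest`, `new_records` and `render_latest` are hypothetical, and the way the repeated-date count is used (skip that many records on the latest date, keep any further ones) is an assumption based on the "this many of them occurring on that date" wording.

```python
from datetime import date


def parse_latest(text):
    """Parse the contents of a .latest.FILE state file.

    Assumed format: one ISO date (YYYY-MM-DD), repeated once per
    transaction already seen on that date. Returns (date, count),
    or None if the file is empty (nothing seen yet).
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    return date.fromisoformat(lines[0]), len(lines)


def new_records(records, latest):
    """Return the records not covered by the 'latest seen' state.

    records: list of (date, description) tuples, in file order.
    latest:  (date, count) from parse_latest, or None.
    """
    # sorted() is stable, so same-date records keep their file order
    # (the third assumption listed in the docs).
    ordered = sorted(records, key=lambda r: r[0])
    if latest is None:
        return ordered
    latest_date, seen_count = latest
    result = []
    skipped_on_latest = 0
    for rec in ordered:
        if rec[0] < latest_date:
            continue  # older than anything seen: already imported
        if rec[0] == latest_date and skipped_on_latest < seen_count:
            skipped_on_latest += 1  # one of the records seen last time
            continue
        result.append(rec)
    return result


def render_latest(records):
    """Build new state file contents from all records in the file."""
    if not records:
        return ""
    newest = max(r[0] for r in records)
    count = sum(1 for r in records if r[0] == newest)
    return "\n".join([newest.isoformat()] * count) + "\n"
```

With a state of two lines reading `2024-03-05`, a file holding three records on that date and one on a later date would yield the third same-date record plus the later one. The sketch also shows why the three assumptions matter: a new record dated earlier than the saved date, or a reshuffle of same-date records, would be skipped or misattributed.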
|
|||||||