;doc: import: deduplication: updates
parent 4ab55b3757
commit 57ad739227
@@ -21,38 +21,46 @@ or perhaps `hledger import *.csv`.
|
|||||||
Note you can import from any file format, though CSV files are the
|
Note you can import from any file format, though CSV files are the
|
||||||
most common import source, and these docs focus on that case.
|
most common import source, and these docs focus on that case.
|
||||||
|
|
||||||
### Skipping
|
### Deduplication
|
||||||
|
|
||||||
`import` tries to import only the transactions which are new since the last import, "skipping over" any that it saw last time.
|
`import` tries to import only the transactions which are new since the last import, "skipping over" any that it has seen in previous runs.
|
||||||
So if your bank's CSV includes the last three months of data, you can download and `import` it every month (or week, or day)
|
So if your bank's CSV includes the last three months of data, you can download and `import` it every month (or week, or day)
|
||||||
and only the new transactions will be imported each time.
|
and only the new transactions will be imported each time.
|
||||||
|
|
||||||
It works as follows. For each imported `FILE` (usually CSV, but they could be any of hledger's input formats):
|
It works as follows. For each imported `FILE` (usually CSV, but they could be any of hledger's input formats):
|
||||||
|
|
||||||
- It tries to find the latest date seen previously, by reading it from a hidden `.latest.FILE` in the same directory.
|
- It tries to recall the latest date seen previously, reading it from a hidden `.latest.FILE` in the same directory.
|
||||||
- Then it processes `FILE`, ignoring any transactions on or before the "latest seen" date.
|
- Then it processes `FILE`, ignoring any transactions on or before the "latest seen" date.
|
||||||
|
|
||||||
And after a successful import, it updates the `.latest.FILE`(s) for next time (unless `--dry-run` was used).
|
And after a successful import, it updates the `.latest.FILE`(s) for next time (unless `--dry-run` was used).
|
||||||
|
|
||||||
|
This is a limited kind of deduplication, let's call it "date skipping".
|
||||||
|
Within each input file, it avoids reprocessing the same dates across successive runs.
|
||||||
This is a simple system that works fairly well for transaction data.
|
This is a simple system that works fairly well for transaction data.
|
||||||
It assumes:
|
It assumes:
|
||||||
|
|
||||||
1. new items always have the newest dates
|
1. new items always have the newest dates
|
||||||
2. item dates are stable across successive CSV downloads
|
2. item dates are stable across successive downloads
|
||||||
3. the order of same-date items is stable across CSV downloads
|
3. the order of same-date items is stable across downloads
|
||||||
|
4. the name of the input file is stable across downloads
|
||||||
|
|
||||||
These are true of most CSV files representing transactions, or true enough.
|
These are true of most CSV files representing transactions, or true enough.
|
||||||
If you have a bank whose CSV dates or ordering occasionally changes,
|
If you have a bank whose CSV dates or ordering change occasionally,
|
||||||
you can reduce the chance of this happening in new transactions by importing more often
|
you can reduce the chance of this happening in new transactions by importing more often
|
||||||
(and in old transactions it doesn't matter).
|
(and in old transactions it doesn't matter).
|
||||||
|
And remember you can use CSV rules files as input, which is one way to ensure a stable file name.
|
||||||
|
|
||||||
Note, `import` avoids reprocessing the same dates across successive runs,
|
`import` doesn't detect other kinds of duplication, such as duplicate transactions within a single run.
|
||||||
but it does not detect transactions that are duplicated within a single run.
|
(In part, because legitimate duplicate transactions can easily occur in real-world data.)
|
||||||
I'll call these "skipping" and "deduplication" respectively.
|
So, say you downloaded but forgot to import `bank.1.csv`, and a week later you downloaded `bank.2.csv` with overlapping data.
|
||||||
|
Now you should not import both of these at once (`hledger import bank.1.csv bank.2.csv`);
|
||||||
|
the overlapping transactions which appear twice would not be deduplicated since this is considered a single import.
|
||||||
|
Instead, import these files one at a time, and also use the same filename each time for a common "latest seen" state:
|
||||||
|
|
||||||
So for example, say you downloaded but did not import `bank.1.csv`, and later downloaded `bank.2.csv` with overlapping data.
|
```cli
|
||||||
Then you should not import both of them at once (`hledger import bank.1.csv bank.2.csv`), as the overlapping data would appear twice and not be deduplicated.
|
$ mv bank.1.csv bank.csv; hledger import bank.csv
|
||||||
Instead, import them one at a time (`hledger import bank.1.csv; hledger import bank.2.csv`), and the second import will skip the overlapping data.
|
$ mv bank.2.csv bank.csv; hledger import bank.csv
|
||||||
|
```
|
||||||
|
|
||||||
Normally you can ignore the `.latest.*` files,
|
Normally you can ignore the `.latest.*` files,
|
||||||
but if needed, you can delete them (to make all transactions unseen),
|
but if needed, you can delete them (to make all transactions unseen),
|
||||||
@@ -60,7 +68,7 @@ or construct/modify them (to catch up to a certain date).
|
|||||||
The format is just a single ISO-format date (`YYYY-MM-DD`), possibly repeated on multiple lines.
|
The format is just a single ISO-format date (`YYYY-MM-DD`), possibly repeated on multiple lines.
|
||||||
It means "I have seen transactions up to this date, and this many of them occurring on that date".
|
It means "I have seen transactions up to this date, and this many of them occurring on that date".
|
||||||
|
|
||||||
([`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is less often used.)
|
[`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is less often used.
|
||||||
|
|
||||||
Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
|
Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
|
||||||
|
|
||||||
|
|||||||
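As a rough illustration of the "date skipping" described in this diff, the following shell sketch mimics the behaviour on a toy file. It is not hledger's implementation. The sample `bank.csv` contents, the temporary directory, and the assumption that the first CSV field is an ISO date in oldest-first order are all made up for the example. It reads the `.latest.bank.csv` state (the latest seen date, repeated once per transaction already seen on that date) and prints only the rows that would count as new.

```shell
# Sketch only: a stand-in for the documented behaviour, not hledger's real code.
# Assumptions (hypothetical): first CSV field is an ISO date, rows are oldest first.
dir=$(mktemp -d)   # stands in for the directory holding the downloaded CSV

# Toy input file with two transactions sharing the date 2024-03-31.
printf '%s\n' \
  '2024-03-30,coffee' \
  '2024-03-31,rent' \
  '2024-03-31,book' \
  '2024-04-01,lunch' > "$dir/bank.csv"

# Hand-built state file: one line means one transaction on 2024-03-31 was seen.
printf '%s\n' 2024-03-31 > "$dir/.latest.bank.csv"

latest=$(head -n 1 "$dir/.latest.bank.csv")   # latest seen date
seen=$(wc -l < "$dir/.latest.bank.csv")       # how many items seen on that date

# Keep rows dated after the latest seen date, plus rows on that date
# beyond the number already seen. ISO dates compare correctly as strings.
new=$(awk -F, -v latest="$latest" -v seen="$seen" '
  $1 > latest  { print; next }
  $1 == latest { n++; if (n > seen) print }
' "$dir/bank.csv")

printf '%s\n' "$new"
```

Here the state file records one transaction seen on 2024-03-31, so the first row with that date is skipped and the second is treated as new, along with everything dated later.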