;doc: import: deduplication: updates

2024-03-27 09:12:03 -10:00 · 2024-03-27 09:12:03 -10:00 · 57ad739227
commit 57ad739227
parent 4ab55b3757
1 changed files with 21 additions and 13 deletions
--- a/hledger/Hledger/Cli/Commands/Import.md
+++ b/hledger/Hledger/Cli/Commands/Import.md
@@ -21,38 +21,46 @@ or perhaps `hledger import *.csv`.
 Note you can import from any file format, though CSV files are the
 most common import source, and these docs focus on that case.

-### Skipping
+### Deduplication

-`import` tries to import only the transactions which are new since the last import, "skipping over" any that it saw last time.
+`import` tries to import only the transactions which are new since the last import, "skipping over" any that it has seen in previous runs.
 So if your bank's CSV includes the last three months of data, you can download and `import` it every month (or week, or day) 
 and only the new transactions will be imported each time.

 It works as follows. For each imported `FILE` (usually CSV, but they could be any of hledger's input formats):

- It tries to find the latest date seen previously, by reading it from a hidden `.latest.FILE` in the same directory.
+- It tries to recall the latest date seen previously, reading it from a hidden `.latest.FILE` in the same directory.
 - Then it processes `FILE`, ignoring any transactions on or before the "latest seen" date.

 And after a successful import, it updates the `.latest.FILE`(s) for next time (unless `--dry-run` was used).

+This is a limited kind of deduplication; let's call it "date skipping".
+Within each input file, it avoids reprocessing the same dates across successive runs.
 This is a simple system that works fairly well for transaction data.
 It assumes:

 1. new items always have the newest dates
-2. item dates are stable across successive CSV downloads
-3. the order of same-date items is stable across CSV downloads
+2. item dates are stable across successive downloads
+3. the order of same-date items is stable across downloads
+4. the name of the input file is stable across downloads

 These are true of most CSV files representing transactions, or true enough.
-If you have a bank whose CSV dates or ordering occasionally changes,
+If you have a bank whose CSV dates or ordering change occasionally,
 you can reduce the chance of this happening in new transactions by importing more often
 (and in old transactions it doesn't matter).
+And remember you can use CSV rules files as input, which is one way to ensure a stable file name.

-Note, `import` avoids reprocessing the same dates across successive runs,
-but it does not detect transactions that are duplicated within a single run.
-I'll call these "skipping" and "deduplication" respectively.
+`import` doesn't detect other kinds of duplication, such as duplicate transactions within a single run.
+(This is partly because legitimate duplicate transactions can easily occur in real-world data.)
+So, say you downloaded but forgot to import `bank.1.csv`, and a week later you downloaded `bank.2.csv` with overlapping data.
+Now you should not import both of these at once (`hledger import bank.1.csv bank.2.csv`);
+the overlapping transactions would appear twice and would not be deduplicated, since this counts as a single import.
+Instead, import these files one at a time, using the same file name each time so that they share a common "latest seen" state:

-So for example, say you downloaded but did not import `bank.1.csv`, and later downloaded `bank.2.csv` with overlapping data.
-Then you should not import both of them at once (`hledger import bank.1.csv bank.2.csv`), as the overlapping data would appear twice and not be deduplicated.
-Instead, import them one at a time (`hledger import bank.1.csv; hledger import bank.2.csv`), and the second import will skip the overlapping data.
+```cli
+$ mv bank.1.csv bank.csv; hledger import bank.csv
+$ mv bank.2.csv bank.csv; hledger import bank.csv
+```

 Normally you can ignore the `.latest.*` files, 
 but if needed, you can delete them (to make all transactions unseen),
@@ -60,7 +68,7 @@ or construct/modify them (to catch up to a certain date).
 The format is just a single ISO-format date (`YYYY-MM-DD`), possibly repeated on multiple lines.
 It means "I have seen transactions up to this date, and this many of them occurring on that date".

-([`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is less often used.)
+[`hledger print --new`](#print) also uses and updates these `.latest.*` files, but it is less often used.

 Related: [CSV > Working with CSV > Deduplicating, importing](#deduplicating-importing).
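
For readers of this change, the "date skipping" described in the updated text can be sketched roughly as follows. This is a minimal illustration in Python, not hledger's actual (Haskell) implementation. The function names `parse_latest`, `new_transactions` and `latest_lines` are hypothetical. The handling of same-date items (skipping as many as were recorded last time) is an assumption inferred from the documented `.latest.*` format, which records the latest date once per transaction seen on that date.

```python
from datetime import date


def parse_latest(text):
    """Parse the contents of a .latest.FILE (hypothetical helper).

    The file holds one ISO date (YYYY-MM-DD), possibly repeated on several
    lines. Returns (latest_date, how_many_seen_on_that_date), or (None, 0)
    when the text is empty.
    """
    dates = [date.fromisoformat(ln.strip()) for ln in text.splitlines() if ln.strip()]
    if not dates:
        return None, 0
    latest = max(dates)
    return latest, dates.count(latest)


def new_transactions(txns, latest, seen_count):
    """Return the transactions considered new (assumed behaviour).

    txns is a list of (date, description) pairs in file order.
    Anything before the latest seen date is skipped, and so are the first
    seen_count items dated exactly on it. This relies on the order of
    same-date items being stable across downloads.
    """
    if latest is None:
        return list(txns)
    ordered = sorted(txns, key=lambda t: t[0])  # stable: keeps same-date order
    same_day = [t for t in ordered if t[0] == latest]
    newer = [t for t in ordered if t[0] > latest]
    return same_day[seen_count:] + newer


def latest_lines(txns):
    """Lines to write to .latest.FILE after a successful import."""
    if not txns:
        return []
    latest = max(d for d, _ in txns)
    count = sum(1 for d, _ in txns if d == latest)
    return [latest.isoformat()] * count
```

With this model, importing the same file twice yields nothing new on the second run, while passing two overlapping files in one run would not remove the overlap, because each file is checked only against its own "latest seen" state.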