diff --git a/hledger/Hledger/Cli/Commands/Import.txt b/hledger/Hledger/Cli/Commands/Import.txt
index 88523b905..4cd9dbc2d 100644
--- a/hledger/Hledger/Cli/Commands/Import.txt
+++ b/hledger/Hledger/Cli/Commands/Import.txt
@@ -21,22 +21,27 @@ hledger import bank.csv or perhaps hledger import *.csv.
 Note you can import from any file format, though CSV files are the most
 common import source, and these docs focus on that case.
 
-"Deduplication"
+Skipping
 
 import tries to import only the transactions which are new since the
-last import. So if your bank's CSV includes the last three months of
-data, you can download and import it every month (or week, or day) and
-only the new transactions will be imported each time.
+last import, "skipping over" any that it saw last time. So if your
+bank's CSV includes the last three months of data, you can download and
+import it every month (or week, or day) and only the new transactions
+will be imported each time.
 
-It works as follows. For each imported FILE (usually a CSV file): - It
-tries to find the latest date seen previously, by reading it from a
-hidden .latest.FILE in the same directory. - Then it processes FILE,
-ignoring any transactions on or before the "latest seen" date.
+It works as follows. For each imported FILE:
+
+- It tries to find the latest date seen previously, by reading it from
+  a hidden .latest.FILE in the same directory.
+- Then it processes FILE, ignoring any transactions on or before the
+  "latest seen" date.
 
 And after a successful import, it updates the .latest.FILE(s) for next
 time (unless --dry-run was used).
 
-This is simple but fairly effective. It assumes:
+This is a simple system that works fairly well for transaction data
+(usually CSV, but it could be any of hledger's input formats). It
+assumes:
 
 1. new items always have the newest dates
 2. item dates are stable across successive CSV downloads
@@ -49,11 +54,15 @@ by importing more often (and in old transactions it doesn't matter).
 
 Note, import avoids reprocessing the same dates across successive runs,
 but it does not detect transactions that are duplicated within a single
-run. So eg if you downloaded but did not import bank.1.csv, and later
-downloaded bank.2.csv with overlapping data, you should not import both
-of them in a single run (hledger import bank.1.csv bank.2.csv); instead,
-import them one at a time (hledger import bank.1.csv, then
-hledger import bank.2.csv).
+run. I'll call these "skipping" and "deduplication".
+
+So for example, say you downloaded but did not import bank.1.csv, and
+later downloaded bank.2.csv with overlapping data. Then you should not
+import both of them at once (hledger import bank.1.csv bank.2.csv), as
+the overlapping data would appear twice and not be deduplicated.
+Instead, import them one at a time
+(hledger import bank.1.csv; hledger import bank.2.csv), and the second
+import will skip the overlapping data.
 
 Normally you can ignore the .latest.* files, but if needed, you can
 delete them (to make all transactions unseen), or construct/modify them
@@ -63,7 +72,7 @@ seen transactions up to this date, and this many of them occurring on
 that date".
 
 (hledger print --new also uses and updates these .latest.* files, but it
-is less often used.)
+is less often used.)
 
 Related: CSV > Working with CSV > Deduplicating, importing.
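
Review note: the skipping rule described in this patch can be illustrated with a small sketch. This is not hledger's implementation (hledger is written in Haskell), only a rough Python model of the documented behaviour. The names skip_seen and updated_state, the (date, description) tuples, and the in-memory state are all hypothetical. Treating the per-date count as "skip this many entries on the latest seen date" is an assumption drawn from the description of the .latest.* files.

```python
from datetime import date

def skip_seen(transactions, latest_date=None, seen_on_latest=0):
    """Return the transactions treated as new (hypothetical model).

    transactions: list of (date, description) tuples from one FILE.
    latest_date, seen_on_latest: stand-ins for the state that the docs
    say is kept in .latest.FILE (assumed shape, not hledger's format).
    """
    if latest_date is None:
        # No saved state: everything counts as unseen.
        return list(transactions)
    new = []
    same_day_skipped = 0
    # sorted() is stable, so same-day entries keep their file order.
    for txn in sorted(transactions, key=lambda t: t[0]):
        txn_date = txn[0]
        if txn_date < latest_date:
            continue  # before the latest seen date: skip
        if txn_date == latest_date and same_day_skipped < seen_on_latest:
            same_day_skipped += 1
            continue  # one of the entries already seen on that date
        new.append(txn)
    return new

def updated_state(transactions, latest_date=None, seen_on_latest=0):
    """Return the (latest_date, count) to remember after a successful import."""
    if not transactions:
        return latest_date, seen_on_latest
    newest = max(t[0] for t in transactions)
    if latest_date is not None and newest < latest_date:
        return latest_date, seen_on_latest
    # Count of entries in this file that fall on the newest date.
    return newest, sum(1 for t in transactions if t[0] == newest)

# First download: nothing has been seen yet, so all three are new.
first = [
    (date(2024, 1, 1), "salary"),
    (date(2024, 1, 2), "coffee"),
    (date(2024, 1, 2), "groceries"),
]
state = updated_state(first)

# Second download overlaps the first and adds two entries.
second = [
    (date(2024, 1, 2), "coffee"),
    (date(2024, 1, 2), "groceries"),
    (date(2024, 1, 2), "rent"),
    (date(2024, 1, 3), "books"),
]
print(skip_seen(second, *state))
```

In this model the second run keeps only the "rent" and "books" entries. Nothing in the sketch compares transactions within a single run, which matches the patch's distinction between skipping and deduplication.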