html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,issue,performed_via_github_app
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-698178101,https://api.github.com/repos/simonw/sqlite-utils/issues/172,698178101,MDEyOklzc3VlQ29tbWVudDY5ODE3ODEwMQ==,9599,2020-09-24T07:48:57Z,2020-09-24T07:49:20Z,OWNER,"> I wonder if I could make this faster by separating it out into a few steps:
>
> * Create the new lookup table with all of the distinct rows
>
> * Add the blank foreign key column
>
> * run a `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
>
> * Drop the value columns
My prototype of this knocked the time down from 10 minutes to 4 seconds, so I think the change is worth it!
```
% date
sqlite-utils extract salaries.db salaries \
'Department Code' 'Department' \
--table 'departments' \
--fk-column 'department_id' \
--rename 'Department Code' code \
--rename 'Department' name
date
sqlite-utils extract salaries.db salaries \
'Union Code' 'Union' \
--table 'unions' \
--fk-column 'union_id' \
--rename 'Union Code' code \
--rename 'Union' name
date
sqlite-utils extract salaries.db salaries \
'Job Family Code' 'Job Family' \
--table 'job_families' \
--fk-column 'job_family_id' \
--rename 'Job Family Code' code \
--rename 'Job Family' name
date
sqlite-utils extract salaries.db salaries \
'Job Code' 'Job' \
--table 'jobs' \
--fk-column 'job_id' \
--rename 'Job Code' code \
--rename 'Job' name
date
Thu Sep 24 00:48:16 PDT 2020
Thu Sep 24 00:48:20 PDT 2020
Thu Sep 24 00:48:24 PDT 2020
Thu Sep 24 00:48:28 PDT 2020
Thu Sep 24 00:48:32 PDT 2020
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697869886,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697869886,MDEyOklzc3VlQ29tbWVudDY5Nzg2OTg4Ng==,9599,2020-09-23T18:45:30Z,2020-09-23T18:45:30Z,OWNER,"There's something to be said for making this operation pausable and resumable, especially if I'm going to make it available in a Datasette plugin at some point.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697866885,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697866885,MDEyOklzc3VlQ29tbWVudDY5Nzg2Njg4NQ==,9599,2020-09-23T18:43:37Z,2020-09-23T18:43:37Z,OWNER,Also what would happen if the table had new rows added to it while that command was running?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697863116,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697863116,MDEyOklzc3VlQ29tbWVudDY5Nzg2MzExNg==,9599,2020-09-23T18:41:06Z,2020-09-23T18:41:06Z,OWNER,Problem with this approach is it's not compatible with progress bars - but if it's a multiple of times faster it's worth it.,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697859772,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697859772,MDEyOklzc3VlQ29tbWVudDY5Nzg1OTc3Mg==,9599,2020-09-23T18:38:43Z,2020-09-23T18:38:52Z,OWNER,"I wonder if I could make this faster by separating it out into a few steps:
- Create the new lookup table with all of the distinct rows
- Add the blank foreign key column
- run a `UPDATE table SET blah_id = (select id from lookup where thang = table.thang)`
- Drop the value columns","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697835956,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697835956,MDEyOklzc3VlQ29tbWVudDY5NzgzNTk1Ng==,9599,2020-09-23T18:22:49Z,2020-09-23T18:22:49Z,OWNER,"I ran `sudo py-spy top -p 123` against the process while it was running and the most time is definitely spent in `.update()`:
```
Total Samples 1000
GIL: 0.00%, Active: 90.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
38.00% 38.00% 3.85s 3.85s update (sqlite_utils/db.py:1283)
27.00% 27.00% 2.12s 2.12s execute (sqlite_utils/db.py:161)
10.00% 10.00% 0.890s 0.890s execute (sqlite_utils/db.py:163)
10.00% 17.00% 0.870s 1.54s columns (sqlite_utils/db.py:553)
0.00% 0.00% 0.110s 0.210s (sqlite_utils/db.py:554)
0.00% 3.00% 0.100s 0.320s table_names (sqlite_utils/db.py:191)
0.00% 0.00% 0.100s 0.100s __new__ (:1)
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697473247,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697473247,MDEyOklzc3VlQ29tbWVudDY5NzQ3MzI0Nw==,9599,2020-09-23T14:45:13Z,2020-09-23T14:45:13Z,OWNER,"`lookup_table.lookup(lookups)` is doing a SQL lookup. This could be cached in-memory, maybe with a LRU cache, to avoid looking up the primary key for records that we have recently used.
The `.update()` method it is calling first does a `get()` and then does a SQL `UPDATE ... WHERE`:
https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L1244-L1264
Batching those updates may have an effect. Or finding a way to skip the `.get()` since we already know we have a valid record.
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697467833,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697467833,MDEyOklzc3VlQ29tbWVudDY5NzQ2NzgzMw==,9599,2020-09-23T14:42:03Z,2020-09-23T14:42:03Z,OWNER,Here's the loop that's taking the time: https://github.com/simonw/sqlite-utils/blob/1ebffe1dbeaed7311e5b61ed988f4cd701e84808/sqlite_utils/db.py#L892-L897,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,
https://github.com/simonw/sqlite-utils/issues/172#issuecomment-697466497,https://api.github.com/repos/simonw/sqlite-utils/issues/172,697466497,MDEyOklzc3VlQ29tbWVudDY5NzQ2NjQ5Nw==,9599,2020-09-23T14:41:17Z,2020-09-23T14:41:17Z,OWNER,"Steps to produce that database:
```
curl -o salaries.csv 'https://data.sfgov.org/api/views/88g8-5mnd/rows.csv?accessType=DOWNLOAD'
sqlite-utils insert salaries.db salaries salaries.csv --csv
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",707427200,