home / github / issue_comments

Menu
  • Search all tables
  • GraphQL API

issue_comments: 1008232075

This data as json

html_url issue_url id node_id user created_at updated_at author_association body reactions issue performed_via_github_app
https://github.com/simonw/sqlite-utils/issues/369#issuecomment-1008232075 https://api.github.com/repos/simonw/sqlite-utils/issues/369 1008232075 IC_kwDOCGYnMM48GGaL 9599 2022-01-09T05:13:15Z 2022-01-09T05:13:56Z OWNER

I think the query that will help solve this is:

explain query plan select * from ny_times_us_counties where state = 1 and county = 2

In this case, the query planner needs to decide if it should use the index for the state column or the index for the county column. That's where the statistics come into play. In particular:

| tbl | idx | stat | |----------------------|---------------------------------|---------------| | ny_times_us_counties | idx_ny_times_us_counties_date | 2092871 2915 | | ny_times_us_counties | idx_ny_times_us_counties_fips | 2092871 651 | | ny_times_us_counties | idx_ny_times_us_counties_county | 2092871 1085 | | ny_times_us_counties | idx_ny_times_us_counties_state | 2092871 37373 |

Those numbers are explained by this comment in the SQLite C code: https://github.com/sqlite/sqlite/blob/5622c7f97106314719740098cf0854e7eaa81802/src/analyze.c#L41-L55

** There is normally one row per index, with the index identified by the ** name in the idx column. The tbl column is the name of the table to ** which the index belongs. In each such row, the stat column will be ** a string consisting of a list of integers. The first integer in this ** list is the number of rows in the index. (This is the same as the ** number of rows in the table, except for partial indices.) The second ** integer is the average number of rows in the index that have the same ** value in the first column of the index. So that table is telling us that using a value in the county column will filter down to an average of 1,085 rows, whereas filtering on the state column will filter down to an average of 37,373 - so clearly the county index is the better index to use here!

Just one catch: against both mycovid.db and my covid-analyzed.db databases the county index is picked for both of them - so SQLite is somehow guessing that county is a better index even though it doesn't have statistics for that.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
1097091527  
Powered by Datasette · Queries took 2.44ms · About: github-to-sqlite