home / github

Menu
  • Search all tables
  • GraphQL API

issues

Table actions
  • GraphQL API for issues

3 rows where type = "issue" and user = 96218 sorted by updated_at descending

✖
✖
✖

✎ View and edit SQL

This data as json, CSV (advanced)

Suggested facets: comments, created_at (date), updated_at (date), closed_at (date)

state 2

  • closed 2
  • open 1

type 1

  • issue · 3 ✖

repo 1

  • sqlite-utils 3
id node_id number title user state locked assignee milestone comments created_at updated_at ▲ closed_at author_association pull_request body repo type active_lock_reason performed_via_github_app reactions draft state_reason
688670158 MDU6SXNzdWU2ODg2NzAxNTg= 147 SQLITE_MAX_VARS maybe hard-coded too low simonwiles 96218 open 0     7 2020-08-30T07:26:45Z 2021-02-15T21:27:55Z   CONTRIBUTOR  

I came across this while about to open an issue and PR against the documentation for batch_size, which is a bit incomplete.

As mentioned in #145, while:

SQLITE_MAX_VARIABLE_NUMBER ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

it is common that it is increased at compile time. Debian-based systems, for example, seem to ship with a version of sqlite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe this is the case for homebrew installations too.

In working to understand what batch_size was actually doing and why, I realized that by setting SQLITE_MAX_VARS in db.py to match the value my sqlite was compiled with (I'm on Debian), I was able to decrease the time to insert_all() my test data set (~128k records across 7 tables) from ~26.5s to ~3.5s. Given that this about .05% of my total dataset, this is time I am keen to save...

Unfortunately, it seems that sqlite3 in the python standard library doesn't expose the get_limit() C API (even though pysqlite used to), so it's hard to know what value sqlite has been compiled with (note that this could mean, I suppose, that it's less than 999, and even hardcoding SQLITE_MAX_VARS to the conservative default might not be adequate. It can also be lowered -- but not raised -- at runtime). The best I could come up with is echo "" | sqlite3 -cmd ".limits variable_number" (only available in sqlite >= 2015-05-07 (3.8.10)).

Obviously this couldn't be relied upon in sqlite_utils, but I wonder what your opinion would be about exposing SQLITE_MAX_VARS as a user-configurable parameter (with suitable "here be dragons" warnings)? I'm going to go ahead and monkey-patch it for my purposes in any event, but it seems like it might be worth considering.

sqlite-utils 140912432 issue    
{
    "url": "https://api.github.com/repos/simonw/sqlite-utils/issues/147/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
   
688659182 MDU6SXNzdWU2ODg2NTkxODI= 145 Bug when first record contains fewer columns than subsequent records simonwiles 96218 closed 0     2 2020-08-30T05:44:44Z 2020-09-08T23:21:23Z 2020-09-08T23:21:23Z CONTRIBUTOR  

insert_all() selects the maximum batch size based on the number of fields in the first record. If the first record has fewer fields than subsequent records (and alter=True is passed), this can result in SQL statements with more than the maximum permitted number of host parameters. This situation is perhaps unlikely to occur, but could happen if the first record had, say, 10 columns, such that batch_size (based on SQLITE_MAX_VARIABLE_NUMBER = 999) would be 99. If the next 98 rows had 11 columns, the resulting SQL statement for the first batch would have 10 * 1 + 11 * 98 = 1088 host parameters (and subsequent batches, if the data were consistent from thereon out, would have 99 * 11 = 1089).

I suspect that this bug is masked somewhat by the fact that while:

SQLITE_MAX_VARIABLE_NUMBER ... defaults to 999 for SQLite versions prior to 3.32.0 (2020-05-22) or 32766 for SQLite versions after 3.32.0.

it is common that it is increased at compile time. Debian-based systems, for example, seem to ship with a version of sqlite compiled with SQLITE_MAX_VARIABLE_NUMBER set to 250,000, and I believe this is the case for homebrew installations too.

A test for this issue might look like this: python def test_columns_not_in_first_record_should_not_cause_batch_to_be_too_large(fresh_db): # sqlite on homebrew and Debian/Ubuntu etc. is typically compiled with # SQLITE_MAX_VARIABLE_NUMBER set to 250,000, so we need to exceed this value to # trigger the error on these systems. THRESHOLD = 250000 extra_columns = 1 + (THRESHOLD - 1) // 99 records = [ {"c0": "first record"}, # one column in first record -> batch_size = 100 # fill out the batch with 99 records with enough columns to exceed THRESHOLD *[ dict([("c{}".format(i), j) for i in range(extra_columns)]) for j in range(99) ] ] try: fresh_db["too_many_columns"].insert_all(records, alter=True) except sqlite3.OperationalError: raise

The best solution, I think, is simply to process all the records when determining columns, column types, and the batch size. In my tests this doesn't seem to be particularly costly at all, and cuts out a lot of complications (including obviating my implementation of #139 at #142). I'll raise a PR for your consideration.

sqlite-utils 140912432 issue    
{
    "url": "https://api.github.com/repos/simonw/sqlite-utils/issues/145/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed
686978131 MDU6SXNzdWU2ODY5NzgxMzE= 139 insert_all(..., alter=True) should work for new columns introduced after the first 100 records simonwiles 96218 closed 0     7 2020-08-27T06:25:25Z 2020-08-28T22:48:51Z 2020-08-28T22:30:14Z CONTRIBUTOR  

Is there a way to make .insert_all() work properly when new columns are introduced outside the first 100 records (with or without the alter=True argument)?

I'm using .insert_all() to bulk insert ~3-4k records at a time and it is common for records to need to introduce new columns. However, if new columns are introduced after the first 100 records, sqlite_utils doesn't even raise the OperationalError: table ... has no column named ... exception; it just silently drops the extra data and moves on.

It took me a while to find this little snippet in the documentation for .insert_all() (it's not mentioned under Adding columns automatically on insert/update):

The column types used in the CREATE TABLE statement are automatically derived from the types of data in that first batch of rows. Any additional or missing columns in subsequent batches will be ignored.

I tried changing the batch_size argument to the total number of records, but it seems only to effect the number of rows that are committed at a time, and has no influence on this problem.

Is there a way around this that you would suggest? It seems like it should raise an exception at least.

sqlite-utils 140912432 issue    
{
    "url": "https://api.github.com/repos/simonw/sqlite-utils/issues/139/reactions",
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
  completed

Advanced export

JSON shape: default, array, newline-delimited, object

CSV options:

CREATE TABLE [issues] (
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [number] INTEGER,
   [title] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [state] TEXT,
   [locked] INTEGER,
   [assignee] INTEGER REFERENCES [users]([id]),
   [milestone] INTEGER REFERENCES [milestones]([id]),
   [comments] INTEGER,
   [created_at] TEXT,
   [updated_at] TEXT,
   [closed_at] TEXT,
   [author_association] TEXT,
   [pull_request] TEXT,
   [body] TEXT,
   [repo] INTEGER REFERENCES [repos]([id]),
   [type] TEXT
, [active_lock_reason] TEXT, [performed_via_github_app] TEXT, [reactions] TEXT, [draft] INTEGER, [state_reason] TEXT);
CREATE INDEX [idx_issues_repo]
                ON [issues] ([repo]);
CREATE INDEX [idx_issues_milestone]
                ON [issues] ([milestone]);
CREATE INDEX [idx_issues_assignee]
                ON [issues] ([assignee]);
CREATE INDEX [idx_issues_user]
                ON [issues] ([user]);
Powered by Datasette · Queries took 29.528ms · About: github-to-sqlite
  • Sort ascending
  • Sort descending
  • Facet by this
  • Hide this column
  • Show all columns
  • Show not-blank rows