
issue_comments


12 rows where issue = 1426001541 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at ▲ author_association body reactions issue performed_via_github_app
1313128913 https://github.com/simonw/datasette/issues/1866#issuecomment-1313128913 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5ORMHR simonw 9599 2022-11-14T05:48:22Z 2022-11-14T05:48:22Z OWNER

I changed my mind about the "return_rows": true option - I'm going to rename it to "return": true.
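For illustration, this is roughly where that option would sit in a bulk insert request. The endpoint path and payload shape below are assumptions for the sketch, not the finalized API:

```python
import requests  # any HTTP client would do; requests is just for the sketch

# Hypothetical endpoint and payload shape for a bulk insert that asks for
# the inserted rows back via the renamed "return" option.
response = requests.post(
    "https://example.com/data/docs/-/insert",  # hypothetical URL
    json={
        "rows": [{"title": "one"}, {"title": "two"}],
        "return": True,  # renamed from "return_rows"
    },
    headers={"Authorization": "Bearer xxx"},
)
print(response.json())
```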

1295200988 https://github.com/simonw/datasette/issues/1866#issuecomment-1295200988 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NMzLc simonw 9599 2022-10-28T16:29:55Z 2022-10-28T16:29:55Z OWNER

I wonder if there's something clever I could do here within a transaction?

Start a transaction. Write out a temporary in-memory table with all of the existing primary keys in the table. Run the bulk insert. Then run select pk from table where pk not in (select pk from old_pks) to see what has changed.

I don't think that's going to work well for large tables.

I'm going to go with not returning inserted rows by default, unless you pass a special option requesting that.
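A minimal sketch of that primary-key snapshot idea using Python's sqlite3 module; the table and column names are illustrative only:

```python
import sqlite3

# isolation_level=None gives manual transaction control via explicit BEGIN/COMMIT
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("create table docs (id integer primary key, title text)")
db.execute("insert into docs (title) values ('existing row')")

db.execute("begin")
# Snapshot the primary keys that existed before the bulk insert
db.execute("create temp table old_pks as select id from docs")
db.executemany(
    "insert into docs (title) values (?)",
    [("new row 1",), ("new row 2",)],
)
# Anything not in the snapshot must have been inserted by this batch
new_ids = [
    row[0]
    for row in db.execute(
        "select id from docs where id not in (select id from old_pks)"
    )
]
db.execute("commit")

print(new_ids)  # primary keys assigned to the freshly inserted rows
```

As the comment notes, the snapshot table grows with the size of the target table, which is why this approach gets expensive for large tables.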

1294316640 https://github.com/simonw/datasette/issues/1866#issuecomment-1294316640 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NJbRg simonw 9599 2022-10-28T01:51:40Z 2022-10-28T01:51:40Z OWNER

This needs to support the following:

  • Rows do not include a primary key - one is assigned by the database
  • Rows provide their own primary key, any clashes are errors
  • Rows provide their own primary key, clashes are silently ignored
  • Rows provide their own primary key, replacing any existing records
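These four modes line up with SQLite's INSERT variants. A hedged sketch (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table docs (id integer primary key, title text)")

# 1. No primary key supplied: SQLite assigns one (this row becomes id 1)
db.execute("insert into docs (title) values (?)", ["assigned id"])

# 2. Caller supplies the primary key; a clash would raise sqlite3.IntegrityError
db.execute("insert into docs (id, title) values (?, ?)", [2, "explicit id"])

# 3. Clashing primary keys are silently ignored
db.execute("insert or ignore into docs (id, title) values (?, ?)", [2, "ignored"])

# 4. Clashing primary keys replace the existing record
db.execute("insert or replace into docs (id, title) values (?, ?)", [2, "replaced"])

print(db.execute("select id, title from docs").fetchall())
```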

1294306071 https://github.com/simonw/datasette/issues/1866#issuecomment-1294306071 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NJYsX simonw 9599 2022-10-28T01:37:14Z 2022-10-28T01:37:59Z OWNER

Quick crude benchmark:

```python
import sqlite3

db = sqlite3.connect(":memory:")


def create_table(db, name):
    db.execute(f"create table {name} (id integer primary key, title text)")


create_table(db, "single")
create_table(db, "multi")
create_table(db, "bulk")


def insert_singles(titles):
    inserted = []
    for title in titles:
        cursor = db.execute("insert into single (title) values (?)", [title])
        inserted.append((cursor.lastrowid, title))
    return inserted


def insert_many(titles):
    db.executemany("insert into multi (title) values (?)", ((t,) for t in titles))


def insert_bulk(titles):
    db.execute(
        "insert into bulk (title) values {}".format(
            ", ".join("(?)" for _ in titles)
        ),
        titles,
    )


titles = ["title {}".format(i) for i in range(1, 10001)]
```

Then in IPython I ran these:

```
In [14]: %timeit insert_singles(titles)
23.8 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit insert_many(titles)
12 ms ± 520 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit insert_bulk(titles)
2.59 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

So the bulk insert really is a lot faster - 3ms compared to 24ms for single inserts, so ~8x faster.

1294296767 https://github.com/simonw/datasette/issues/1866#issuecomment-1294296767 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NJWa_ simonw 9599 2022-10-28T01:22:25Z 2022-10-28T01:23:09Z OWNER

Nasty catch on this one: I wanted to return the IDs of the freshly inserted rows. But... the insert_all() method I was planning to use from sqlite-utils doesn't appear to have a way of doing that:

https://github.com/simonw/sqlite-utils/blob/529110e7d8c4a6b1bbf5fb61f2e29d72aa95a611/sqlite_utils/db.py#L2813-L2835

SQLite itself added a RETURNING statement which might help, but that is only available from version 3.35 released in March 2021: https://www.sqlite.org/lang_returning.html - which isn't commonly available yet. https://latest.datasette.io/-/versions right now shows 3.34, and https://lite.datasette.io/#/-/versions shows 3.27.2 (from Feb 2019).

Three options then:

  1. Even for bulk inserts do one insert at a time so I can use cursor.lastrowid to get the ID of the inserted record. This isn't terrible since SQLite is very fast, but it may still be a big performance hit for large inserts.
  2. Don't return the list of inserted rows for bulk inserts
  3. Default to not returning the list of inserted rows for bulk inserts, but allow the user to request that - in which case we use the slower path

That third option might be the way to go here.

I should benchmark first to figure out how much of a difference this actually makes.
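For reference, a hedged sketch of what the slower "return the IDs" path could look like, using RETURNING where the underlying SQLite is new enough and falling back to cursor.lastrowid otherwise. The function and table names are illustrative, not Datasette's implementation:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table docs (id integer primary key, title text)")


def insert_returning_ids(db, titles):
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        # RETURNING is available: still one statement per row in this sketch,
        # but it hands back the assigned primary key directly
        return [
            db.execute(
                "insert into docs (title) values (?) returning id", [title]
            ).fetchone()[0]
            for title in titles
        ]
    # Older SQLite: insert one row at a time and read cursor.lastrowid
    ids = []
    for title in titles:
        cursor = db.execute("insert into docs (title) values (?)", [title])
        ids.append(cursor.lastrowid)
    return ids


print(insert_returning_ids(db, ["a", "b", "c"]))  # e.g. [1, 2, 3]
```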

1294282263 https://github.com/simonw/datasette/issues/1866#issuecomment-1294282263 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NJS4X simonw 9599 2022-10-28T01:00:42Z 2022-10-28T01:00:42Z OWNER

I'm going to set the limit at 1,000 rows inserted at a time. I'll make this configurable using a new max_insert_rows setting (for consistency with max_returned_rows).
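Not the real implementation, just a sketch of how that limit check typically looks on the server side; the setting name comes from the comment above, everything else is assumed:

```python
# Hypothetical request-handling fragment: reject oversized insert batches
DEFAULT_MAX_INSERT_ROWS = 1000  # would come from the max_insert_rows setting


def validate_batch(rows, max_insert_rows=DEFAULT_MAX_INSERT_ROWS):
    if len(rows) > max_insert_rows:
        raise ValueError(
            f"Too many rows: got {len(rows)}, max_insert_rows is {max_insert_rows}"
        )
    return rows
```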

1293893789 https://github.com/simonw/datasette/issues/1866#issuecomment-1293893789 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NH0Cd simonw 9599 2022-10-27T18:13:00Z 2022-10-27T18:13:00Z OWNER

If people care about that kind of thing they could always push all of their inserts to a table called _tablename and then atomically rename that once they've uploaded all of the data (assuming I provide an atomic-rename-this-table mechanism).
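A hedged sketch of that atomic swap using plain SQLite; the _tablename staging convention is the hypothetical mechanism described above, not an existing Datasette feature:

```python
import sqlite3

# isolation_level=None lets us wrap the DDL in an explicit transaction
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("create table docs (id integer primary key, title text)")
db.execute("create table _docs (id integer primary key, title text)")
db.execute("insert into _docs (title) values (?)", ["uploaded into the staging table"])

db.execute("begin")
db.execute("drop table docs")                   # discard the old table
db.execute("alter table _docs rename to docs")  # promote the staging table
db.execute("commit")

print(db.execute("select title from docs").fetchall())
```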

1293892818 https://github.com/simonw/datasette/issues/1866#issuecomment-1293892818 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NHzzS simonw 9599 2022-10-27T18:12:02Z 2022-10-27T18:12:02Z OWNER

There's one catch with batched inserts: if your CLI tool fails halfway through you could end up with a partially populated table - since a bunch of batches will have succeeded first.

I think that's OK. In the future I may want to come up with a way to run multiple batches of inserts inside a single transaction, but I can ignore that for the first release of this feature.

1293891876 https://github.com/simonw/datasette/issues/1866#issuecomment-1293891876 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NHzkk simonw 9599 2022-10-27T18:11:05Z 2022-10-27T18:11:05Z OWNER

Likewise for newline-delimited JSON. While it's tempting to accept that as an ingest format (because it's nice to generate and stream), I think it's better to have a client application that can turn a stream of newline-delimited JSON into batched JSON inserts.
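A hedged sketch of what such a client-side shim could look like; the endpoint URL, payload shape and token handling are assumptions, not a finalized Datasette API:

```python
import itertools
import json

import requests  # plain HTTP client, standing in for a future Datasette client


def batched(iterable, size):
    # Yield lists of up to `size` items from an iterable
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def insert_ndjson(path, url, token, batch_size=1000):
    # Stream newline-delimited JSON from disk and send it as batched inserts
    with open(path) as fp:
        rows = (json.loads(line) for line in fp if line.strip())
        for batch in batched(rows, batch_size):
            response = requests.post(
                url,  # hypothetical bulk insert endpoint for the target table
                json={"rows": batch},  # payload shape is an assumption
                headers={"Authorization": f"Bearer {token}"},
            )
            response.raise_for_status()
```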

1293891191 https://github.com/simonw/datasette/issues/1866#issuecomment-1293891191 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NHzZ3 simonw 9599 2022-10-27T18:10:22Z 2022-10-27T18:10:22Z OWNER

So for the moment I'm just going to concentrate on the JSON API. I can consider CSV variants later on, or as plugins, or both.

1293890684 https://github.com/simonw/datasette/issues/1866#issuecomment-1293890684 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NHzR8 simonw 9599 2022-10-27T18:09:52Z 2022-10-27T18:09:52Z OWNER

Should this API accept CSV/TSV etc in addition to JSON?

I'm torn on this one. My initial instinct is that it should not - and there should instead be a Datasette client library / CLI tool you can use that knows how to turn CSV into batches of JSON calls for when you want to upload a CSV file.

I don't think the usability of curl https://datasette/db/table -F 'data=@path/to/file.csv' -H 'Authentication: Bearer xxx' is particularly great compared to something like datasette client insert https://datasette/ db table file.csv --csv (where the command version could store API tokens for you too).

1293887808 https://github.com/simonw/datasette/issues/1866#issuecomment-1293887808 https://api.github.com/repos/simonw/datasette/issues/1866 IC_kwDOBm6k_c5NHylA simonw 9599 2022-10-27T18:07:02Z 2022-10-27T18:07:02Z OWNER

Error handling is really important here.

What should happen if you submit 100 records and one of them has some kind of validation error? How should that error be reported back to you?

I'm inclined to say that it defaults to all-or-nothing in a transaction - but there should be a "continue_on_error": true option (or similar) which causes it to insert the ones that are valid while reporting back the ones that are invalid.
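A hedged sketch of those two behaviours with raw sqlite3; the continue_on_error flag and the error-reporting shape here just illustrate the idea, they are not the eventual API:

```python
import sqlite3


def insert_rows(db, rows, continue_on_error=False):
    # rows are dicts with "id" and "title" keys; the table name is illustrative
    if not continue_on_error:
        # All-or-nothing: any failure rolls the whole batch back
        with db:
            db.executemany(
                "insert into docs (id, title) values (:id, :title)", rows
            )
        return {"inserted": len(rows), "errors": []}

    # Best-effort: keep going and report back which rows failed and why
    inserted, errors = 0, []
    for i, row in enumerate(rows):
        try:
            with db:
                db.execute(
                    "insert into docs (id, title) values (:id, :title)", row
                )
            inserted += 1
        except sqlite3.Error as ex:
            errors.append({"row": i, "error": str(ex)})
    return {"inserted": inserted, "errors": errors}


db = sqlite3.connect(":memory:")
db.execute("create table docs (id integer primary key, title text)")
print(insert_rows(db, [{"id": 1, "title": "ok"}]))
print(insert_rows(db, [{"id": 1, "title": "dupe"}, {"id": 2, "title": "ok"}], continue_on_error=True))
```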


CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);