issue_comments


10 rows where author_association = "CONTRIBUTOR" and issue = 459882902 sorted by updated_at descending

id html_url issue_url node_id user created_at updated_at author_association body reactions issue performed_via_github_app
1259718517 https://github.com/simonw/datasette/issues/526#issuecomment-1259718517 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LFcd1 fgregg 536941 2022-09-27T16:02:51Z 2022-09-27T16:04:46Z CONTRIBUTOR

I think that max_returned_rows is a defense mechanism, just not against connection exhaustion: max_returned_rows is a defense mechanism against memory bombs.

If you are potentially yielding out hundreds of thousands or even millions of rows, you need to be quite careful about data flow so that you don't run out of memory on the server, or on the client.

You have a lot of places in your code that are protective of that right now, but max_returned_rows acts as the final backstop.

So, given that, it makes sense to treat removing max_returned_rows altogether as a non-goal, and instead allow specific code paths (like streaming CSVs) to bypass it.

That could dramatically lower the surface area for a memory-bomb attack.
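As a rough sketch of the kind of data flow that keeps memory bounded on such a code path (a hypothetical stream_csv helper using only the standard library, not Datasette's actual implementation): fetch rows in fixed-size batches and emit each batch before fetching the next.

    import csv
    import io
    import sqlite3

    def stream_csv(conn: sqlite3.Connection, sql: str, batch_size: int = 1000):
        # Hypothetical illustration of a streaming code path, not Datasette's
        # actual code: rows are pulled with fetchmany() so the full result set
        # is never held in memory at once.
        cursor = conn.execute(sql)
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([col[0] for col in cursor.description])  # header row
        yield buffer.getvalue()
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            buffer.seek(0)
            buffer.truncate(0)
            writer.writerows(rows)
            yield buffer.getvalue()

A streaming HTTP response can then send each chunk as it is produced, so server-side memory use is bounded by batch_size rather than by the total size of the result set.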

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258910228 https://github.com/simonw/datasette/issues/526#issuecomment-1258910228 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCXIU fgregg 536941 2022-09-27T03:11:07Z 2022-09-27T03:11:07Z CONTRIBUTOR

I think this feature would be safe: it's really only the time limit that can, and IMO should, protect against long-running queries, since it is pretty easy to write very expensive queries that don't return many rows.

Moving away from max_returned_rows will require some thinking about:

  1. memory usage and data flows to handle potentially very large result sets
  2. how to avoid rendering tens or hundreds of thousands of HTML rows.
{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258878311 https://github.com/simonw/datasette/issues/526#issuecomment-1258878311 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCPVn fgregg 536941 2022-09-27T02:19:48Z 2022-09-27T02:19:48Z CONTRIBUTOR

This SQL query doesn't trip max_returned_rows but does time out:

    with recursive counter(x) as (
      select 0
      union
      select x + 1 from counter
    )
    select * from counter limit 10 offset 100000000

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258871525 https://github.com/simonw/datasette/issues/526#issuecomment-1258871525 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCNrl fgregg 536941 2022-09-27T02:09:32Z 2022-09-27T02:14:53Z CONTRIBUTOR

Thanks @simonw, I learned something I didn't know about SQLite's execution model!

> Imagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.

Why wouldn't the sqlite_timelimit guard prevent that?

On my local version, which has the code to turn off truncation for query CSVs, sqlite_timelimit does protect me.
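For context, here is a minimal sketch of how a progress-handler-based time limit can interrupt a long-running query such as the recursive one above. query_timelimit is a made-up name for illustration; as far as I understand, Datasette's sqlite_timelimit utility relies on the same progress-handler mechanism in the sqlite3 module.

    import sqlite3
    import time
    from contextlib import contextmanager

    @contextmanager
    def query_timelimit(conn, ms):
        # SQLite calls the handler every N virtual-machine instructions; a
        # truthy return value aborts the running statement with "interrupted".
        deadline = time.monotonic() + ms / 1000
        conn.set_progress_handler(lambda: 1 if time.monotonic() > deadline else 0, 1000)
        try:
            yield
        finally:
            conn.set_progress_handler(None, 1000)

    conn = sqlite3.connect(":memory:")
    expensive = """
        with recursive counter(x) as (select 0 union select x + 1 from counter)
        select * from counter limit 10 offset 100000000
    """
    try:
        with query_timelimit(conn, 50):
            conn.execute(expensive).fetchall()
    except sqlite3.OperationalError as err:
        print(err)  # "interrupted" -- the time limit fired before any rows came back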

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258849766 https://github.com/simonw/datasette/issues/526#issuecomment-1258849766 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LCIXm fgregg 536941 2022-09-27T01:27:03Z 2022-09-27T01:27:03Z CONTRIBUTOR

I agree with that concern! But if I'm understanding the code correctly, max_returned_rows does not protect against long-running queries in any way.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258337011 https://github.com/simonw/datasette/issues/526#issuecomment-1258337011 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5LALLz fgregg 536941 2022-09-26T16:49:48Z 2022-09-26T16:49:48Z CONTRIBUTOR

I think the smallest change that gets close to what I want is to change the behavior so that max_returned_rows is not applied in the execute method when we are asking for a CSV of a query.

There are some infelicities to that approach, but I'll make a PR to make it easier to discuss.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1258167564 https://github.com/simonw/datasette/issues/526#issuecomment-1258167564 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5K_h0M fgregg 536941 2022-09-26T14:57:44Z 2022-09-26T15:08:36Z CONTRIBUTOR

Reading the database execute method, I have a few questions.

https://github.com/simonw/datasette/blob/cb1e093fd361b758120aefc1a444df02462389a3/datasette/database.py#L229-L242

Unless I'm missing something (which is very likely!!), the max_returned_rows argument doesn't actually offer any protection against running very expensive queries.

It's not like adding a LIMIT max_rows clause. It makes sense that it isn't, because the query could already have a LIMIT clause. Doing something like select * from (query) limit {max_returned_rows} might be protective, but wouldn't always be.

Instead the code executes the full original query, and if there is still time it fetches out the first max_rows + 1 rows.

This does offer some protection against memory exhaustion, as you won't hydrate a huge result set into Python (however, there are data flow patterns that could avoid that too).
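Put another way, the protection amounts to something like the following simplified sketch (MAX_RETURNED_ROWS and execute_with_limit are stand-ins for illustration, not the exact code in database.py):

    import sqlite3

    MAX_RETURNED_ROWS = 1000  # stand-in for the configured limit

    def execute_with_limit(conn: sqlite3.Connection, sql: str):
        # The query itself runs unmodified -- no LIMIT is injected -- so an
        # expensive query still does all of its work inside SQLite.
        cursor = conn.execute(sql)
        # Only max + 1 rows are ever pulled into Python; the extra row is used
        # solely to detect that the result was truncated.
        rows = cursor.fetchmany(MAX_RETURNED_ROWS + 1)
        truncated = len(rows) > MAX_RETURNED_ROWS
        return rows[:MAX_RETURNED_ROWS], truncated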

Given the current architecture, I don't see how creating a new connection would be of use?

If we just removed the max_returned_rows limitation, then I think most things would be fine except for the QueryViews. Right now, rendering just 5,000 rows takes a lot of client-side memory, so some form of pagination would be required.

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
1254064260 https://github.com/simonw/datasette/issues/526#issuecomment-1254064260 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c5Kv4CE fgregg 536941 2022-09-21T18:17:04Z 2022-09-21T18:18:01Z CONTRIBUTOR

Hi @simonw, this is becoming more of a bother for my labor data warehouse. Is there any research or a spike I could do that would help you investigate this issue?

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
993078038 https://github.com/simonw/datasette/issues/526#issuecomment-993078038 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c47MSsW fgregg 536941 2021-12-14T01:46:52Z 2021-12-14T01:46:52Z CONTRIBUTOR

The nested query idea is very nice, and I stole it for my client-side paginator. However, it won't do the right thing if the original query orders by random().

If you go the nested query route, maybe raise a 4XX status code if the query has such a clause?
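To illustrate the failure mode (a self-contained example, not code from Datasette): offset-based pagination over a nested query re-executes the inner query for every page, so a non-deterministic ORDER BY gives each page an unrelated ordering and the pages stop partitioning the result set.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("create table t(id integer primary key)")
    conn.executemany("insert into t values (?)", [(i,) for i in range(100)])

    original = "select id from t order by random()"
    # Each page wraps the original query and re-runs it, re-shuffling the rows.
    page_1 = conn.execute(f"select * from ({original}) limit 10 offset 0").fetchall()
    page_2 = conn.execute(f"select * from ({original}) limit 10 offset 10").fetchall()

    # Rows can appear on both pages or on neither, so results are duplicated
    # and dropped across the "pages".
    print(set(page_1) & set(page_2))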

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  
992971072 https://github.com/simonw/datasette/issues/526#issuecomment-992971072 https://api.github.com/repos/simonw/datasette/issues/526 IC_kwDOBm6k_c47L4lA fgregg 536941 2021-12-13T22:29:34Z 2021-12-13T22:29:34Z CONTRIBUTOR

Just came by to open this issue. It would make my data analysis in Observable a lot better!

{
    "total_count": 0,
    "+1": 0,
    "-1": 0,
    "laugh": 0,
    "hooray": 0,
    "confused": 0,
    "heart": 0,
    "rocket": 0,
    "eyes": 0
}
Stream all results for arbitrary SQL and canned queries 459882902  

CREATE TABLE [issue_comments] (
   [html_url] TEXT,
   [issue_url] TEXT,
   [id] INTEGER PRIMARY KEY,
   [node_id] TEXT,
   [user] INTEGER REFERENCES [users]([id]),
   [created_at] TEXT,
   [updated_at] TEXT,
   [author_association] TEXT,
   [body] TEXT,
   [reactions] TEXT,
   [issue] INTEGER REFERENCES [issues]([id])
, [performed_via_github_app] TEXT);
CREATE INDEX [idx_issue_comments_issue]
                ON [issue_comments] ([issue]);
CREATE INDEX [idx_issue_comments_user]
                ON [issue_comments] ([user]);
Powered by Datasette · Queries took 27.114ms · About: github-to-sqlite