html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,issue,performed_via_github_app
https://github.com/simonw/datasette/issues/526#issuecomment-1259718517,https://api.github.com/repos/simonw/datasette/issues/526,1259718517,IC_kwDOBm6k_c5LFcd1,536941,2022-09-27T16:02:51Z,2022-09-27T16:04:46Z,CONTRIBUTOR,"i think that `max_returned_rows` **is** a defense mechanism, just not against connection exhaustion.

`max_returned_rows` is a defense mechanism against **memory bombs**. if you are potentially yielding out hundreds of thousands or even millions of rows, you need to be quite careful about data flow to avoid running out of memory on the server, or on the client. you have a lot of places in your code that are protective of that right now, but `max_returned_rows` acts as the final backstop.

so, given that, it makes sense to keep removing `max_returned_rows` altogether as a non-goal, and instead allow specific codepaths (like streaming CSVs) to bypass it. that could dramatically lower the surface area for a memory-bomb attack.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1258910228,https://api.github.com/repos/simonw/datasette/issues/526,1258910228,IC_kwDOBm6k_c5LCXIU,536941,2022-09-27T03:11:07Z,2022-09-27T03:11:07Z,CONTRIBUTOR,"i think this feature would be safe, as it's really only the time limit that can, and imo should, protect against long-running queries: it is pretty easy to make very expensive queries that don't return many rows.

moving away from `max_returned_rows` will require some thinking about:

1. memory usage and data flows to handle potentially very large result sets
2. how to avoid rendering tens or hundreds of thousands of [html rows](#1655).","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1258878311,https://api.github.com/repos/simonw/datasette/issues/526,1258878311,IC_kwDOBm6k_c5LCPVn,536941,2022-09-27T02:19:48Z,2022-09-27T02:19:48Z,CONTRIBUTOR,"this sql query doesn't trip up `max_returned_rows` but does time out

```sql
with recursive counter(x) as (
  select 0
  union
  select x + 1 from counter
)
select * from counter LIMIT 10 OFFSET 100000000
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1258871525,https://api.github.com/repos/simonw/datasette/issues/526,1258871525,IC_kwDOBm6k_c5LCNrl,536941,2022-09-27T02:09:32Z,2022-09-27T02:14:53Z,CONTRIBUTOR,"thanks @simonw, i learned something i didn't know about sqlite's execution model!

> Imagine if Datasette CSVs did allow unlimited retrievals. Someone could hit the CSV endpoint for that recursive query and tie up Datasette's SQL connection effectively forever.

why wouldn't the `sqlite_timelimit` guard prevent that?

---

on my local version, which has the code to [turn off truncation for query CSVs](#1820), `sqlite_timelimit` does protect me.
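for anyone following along, here is a minimal sketch of how a progress-handler guard like `sqlite_timelimit` can cancel a runaway query. the wrapper name, the opcode interval, and the demo are my assumptions, not datasette's actual implementation:

```python
import sqlite3
import time
from contextlib import contextmanager

@contextmanager
def time_limit(conn, ms, n_ops=1000):
    # sqlite calls the handler every n_ops virtual-machine opcodes; a
    # nonzero return aborts the running statement, which surfaces as
    # sqlite3.OperationalError: interrupted
    deadline = time.monotonic() + ms / 1000

    def handler():
        return 1 if time.monotonic() >= deadline else 0

    conn.set_progress_handler(handler, n_ops)
    try:
        yield
    finally:
        conn.set_progress_handler(None, n_ops)

conn = sqlite3.connect(':memory:')
try:
    with time_limit(conn, ms=100):
        # the recursive CTE from upthread: cheap per row, but it must
        # step through 100,000,000 rows before the OFFSET is satisfied
        conn.execute(
            'with recursive counter(x) as (select 0 union select x + 1 from counter) '
            'select * from counter limit 10 offset 100000000'
        ).fetchall()
except sqlite3.OperationalError as err:
    print(err)  # interrupted
```

the key point is that the handler fires between opcodes, so even a single long-running statement gets interrupted without needing a second connection.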
![Screenshot 2022-09-26 at 22-14-31 Error 500](https://user-images.githubusercontent.com/536941/192415680-94b32b7f-868f-4b89-8194-5752d45f6009.png)","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1258849766,https://api.github.com/repos/simonw/datasette/issues/526,1258849766,IC_kwDOBm6k_c5LCIXm,536941,2022-09-27T01:27:03Z,2022-09-27T01:27:03Z,CONTRIBUTOR,"i agree with that concern! but if i'm understanding the code correctly, `max_returned_rows` does not protect against long-running queries in any way.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1258337011,https://api.github.com/repos/simonw/datasette/issues/526,1258337011,IC_kwDOBm6k_c5LALLz,536941,2022-09-26T16:49:48Z,2022-09-26T16:49:48Z,CONTRIBUTOR,"i think the smallest change that gets close to what i want is to change the behavior so that `max_returned_rows` is not applied in the `execute` method when we are asking for a CSV of a query. there are some infelicities in that approach, but i'll make a PR to make it easier to discuss.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1258167564,https://api.github.com/repos/simonw/datasette/issues/526,1258167564,IC_kwDOBm6k_c5K_h0M,536941,2022-09-26T14:57:44Z,2022-09-26T15:08:36Z,CONTRIBUTOR,"reading the database execute method, i have a few questions.

https://github.com/simonw/datasette/blob/cb1e093fd361b758120aefc1a444df02462389a3/datasette/database.py#L229-L242

---

unless i'm missing something (which is very likely!!), the `max_returned_rows` argument doesn't actually offer any protection against running very expensive queries. It's not like adding a `LIMIT max_rows` clause, and it makes sense that it isn't, because the query could already have a `LIMIT` clause. Doing something like `select * from (query) limit {max_returned_rows}` **might** be protective, but wouldn't always be.

Instead, the code executes the full original query and, if it still has time, fetches out the first `max_rows + 1` rows. this *does* offer some protection against memory exhaustion, as you won't hydrate a huge result set into python (however, there are [data flow patterns](https://github.com/simonw/datasette/issues/1727#issuecomment-1258129113) that could avoid that too).

given the current architecture, i don't see how creating a new connection would be of use.

---

If we just removed the `max_returned_rows` limitation, then i think most things would be fine **except** for the QueryViews. Right now, rendering just [5000 rows takes a lot of client-side memory](https://github.com/simonw/datasette/issues/1655), so some form of pagination would be required.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-1254064260,https://api.github.com/repos/simonw/datasette/issues/526,1254064260,IC_kwDOBm6k_c5Kv4CE,536941,2022-09-21T18:17:04Z,2022-09-21T18:18:01Z,CONTRIBUTOR,"hi @simonw, this is becoming more of a bother for my [labor data warehouse](https://labordata.bunkum.us/).
Is there any research or a spike i could do that would help you investigate this issue?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-993078038,https://api.github.com/repos/simonw/datasette/issues/526,993078038,IC_kwDOBm6k_c47MSsW,536941,2021-12-14T01:46:52Z,2021-12-14T01:46:52Z,CONTRIBUTOR,"the nested query idea is very nice, and i stole it for [my client-side paginator](https://observablehq.com/d/1d5da3a3c3f2f347#DatasetteClient). However, it won't do the right thing if the original query orders by `random()`. If you go the nested query route, maybe raise a 4XX status code if the query has such a clause?","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
https://github.com/simonw/datasette/issues/526#issuecomment-992971072,https://api.github.com/repos/simonw/datasette/issues/526,992971072,IC_kwDOBm6k_c47L4lA,536941,2021-12-13T22:29:34Z,2021-12-13T22:29:34Z,CONTRIBUTOR,"just came by to open this issue. would make my data analysis in observable a lot better!","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",459882902,
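Below is a minimal sketch of the nested-query pagination idea from comment 993078038, with the suggested rejection of queries that order by `random()`. The function name and the regex guard are illustrative assumptions, not anything in datasette; the guard is deliberately naive (it would also flag `random()` inside a string literal), so a real implementation would want to inspect the parsed SQL instead:

```python
import re

def paginate(user_query, page_size, offset):
    # the nested-query trick: wrap the original query so limit/offset
    # can be applied without parsing or rewriting it, even if the inner
    # query already has its own LIMIT clause
    if re.search(r'\brandom\s*\(\s*\)', user_query, re.IGNORECASE):
        # order by random() re-shuffles on every request, so pages would
        # overlap and skip rows; an http handler would return a 4xx here
        raise ValueError('query uses random(); pages would not be stable')
    return f'select * from ({user_query}) limit {page_size} offset {offset}'

print(paginate('select x from counter order by x', 100, 200))
# select * from (select x from counter order by x) limit 100 offset 200
```

Note that stable pages also depend on the inner query having a deterministic `order by`; without one, SQLite is free to return rows in any order it likes.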