html_url,issue_url,id,node_id,user,created_at,updated_at,author_association,body,reactions,issue,performed_via_github_app
https://github.com/simonw/datasette/issues/485#issuecomment-1264769569,https://api.github.com/repos/simonw/datasette/issues/485,1264769569,IC_kwDOBm6k_c5LYtoh,9599,2022-10-03T00:04:42Z,2022-10-03T00:04:42Z,OWNER,"I love these tips - tools that can compile a simple machine learning model to a SQL query! Would be pretty cool if I could bundle a model in Datasette itself as a big in-memory SQLite SQL query:
- https://github.com/Chryzanthemum/xgb2sql
- https://github.com/konstantint/SKompiler","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-1264737290,https://api.github.com/repos/simonw/datasette/issues/485,1264737290,IC_kwDOBm6k_c5LYlwK,9599,2022-10-02T21:29:59Z,2022-10-02T21:29:59Z,OWNER,"To clarify: the feature this issue is talking about relates to the way Datasette automatically displays foreign key relationships, for example on this page: https://github-to-sqlite.dogsheep.net/github/commits
Each of those columns is a foreign key to another table. The link text that is displayed there comes from the ""label column"" that has either been configured or automatically detected for that other table.
I wonder if this could be handled with a tiny machine learning model that's trained to help pick the best label column?
Inputs to that model could include:
- The names of the columns
- The number of unique values in each column
- The type of each column (or maybe only `TEXT` columns should be considered)
- How many `null` values there are
- Is the column marked as unique?
- What's the average (or median or some other statistic) string length of values in each column?
Output would be the most likely label column, or some indicator that no likely candidates had been found.
My hunch is that this would be better solved using a few extra heuristics rather than by training a model, but it does feel like an interesting opportunity to experiment with a tiny ML model.
Asked for tips about this on Twitter: https://twitter.com/simonw/status/1576680930680262658
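A minimal sketch of how the inputs listed above could be gathered with the Python `sqlite3` module. The helper names `unique_columns` and `label_column_features` are hypothetical and not part of Datasette, and table and column names are interpolated straight into the SQL, so this assumes trusted names:

```python
import sqlite3


def unique_columns(conn, table):
    # Hypothetical helper: names of columns covered by a single-column unique index.
    unique = set()
    for row in conn.execute(f'pragma index_list([{table}])').fetchall():
        index_name, is_unique = row[1], row[2]
        if not is_unique:
            continue
        cols = conn.execute(f'pragma index_info([{index_name}])').fetchall()
        if len(cols) == 1:
            unique.add(cols[0][2])
    return unique


def label_column_features(conn, table):
    # Hypothetical helper: one dict of candidate model inputs per column.
    # Assumes table and column names are trusted (they are not escaped).
    total = conn.execute(f'select count(*) from [{table}]').fetchone()[0]
    unique = unique_columns(conn, table)
    features = []
    for _cid, name, col_type, _notnull, _default, pk in conn.execute(
        f'pragma table_info([{table}])'
    ).fetchall():
        distinct, nulls, avg_length = conn.execute(
            f'select count(distinct [{name}]), '
            f'sum([{name}] is null), '
            f'avg(length([{name}])) from [{table}]'
        ).fetchone()
        features.append({
            'name': name,
            'type': col_type,
            'is_pk': bool(pk),
            'is_unique': name in unique,
            'total_rows': total,
            'distinct': distinct,
            'nulls': nulls or 0,  # sum() returns NULL on an empty table
            'avg_length': avg_length,
        })
    return features
```

Each dict could serve as one row of training data for a small model, or be inspected directly by hand-written heuristics.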
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-497116074,https://api.github.com/repos/simonw/datasette/issues/485,497116074,MDEyOklzc3VlQ29tbWVudDQ5NzExNjA3NA==,9599,2019-05-29T21:29:16Z,2019-05-29T21:29:16Z,OWNER,Another good rule of thumb: look for text fields with a unique constraint?,"{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-496367866,https://api.github.com/repos/simonw/datasette/issues/485,496367866,MDEyOklzc3VlQ29tbWVudDQ5NjM2Nzg2Ng==,9599,2019-05-28T05:14:06Z,2019-05-28T05:14:06Z,OWNER,"I'm going to generate statistics for every TEXT column.
Any column with more than 90% distinct rows (compared to the total count of rows) will be a candidate for the label.
I will then pick the candidate column with the shortest average length.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-496283728,https://api.github.com/repos/simonw/datasette/issues/485,496283728,MDEyOklzc3VlQ29tbWVudDQ5NjI4MzcyOA==,9599,2019-05-27T18:44:07Z,2019-05-27T18:44:07Z,OWNER,"This code now lives in a method on the new `datasette.database.Database` class, which should make it easier to write unit tests for.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-496039483,https://api.github.com/repos/simonw/datasette/issues/485,496039483,MDEyOklzc3VlQ29tbWVudDQ5NjAzOTQ4Mw==,9599,2019-05-26T23:22:53Z,2019-05-26T23:22:53Z,OWNER,"Comparing these two SQL queries (the one with union and the one without) using explain:
With union: https://latest.datasette.io/fixtures?sql=explain+select+%27name%27+as+column%2C+count+%28distinct+name%29+as+count_distinct%2C+avg%28length%28name%29%29+as+avg_length+from+roadside_attractions%0D%0A++union%0D%0Aselect+%27address%27+as+column%2C+count%28distinct+address%29+as+count_distinct%2C+avg%28length%28address%29%29+as+avg_length+from+roadside_attractions produces 52 rows
Without union: https://latest.datasette.io/fixtures?sql=explain+select%0D%0A++count+(distinct+name)+as+count_distinct_column_1%2C%0D%0A++avg(length(name))+as+avg_length_column_1%2C%0D%0A++count(distinct+address)+as+count_distinct_column_2%2C%0D%0A++avg(length(address))+as+avg_length_column_2%0D%0Afrom+roadside_attractions produces 32 rows
So I'm going to use the one without the union.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-496039267,https://api.github.com/repos/simonw/datasette/issues/485,496039267,MDEyOklzc3VlQ29tbWVudDQ5NjAzOTI2Nw==,9599,2019-05-26T23:19:38Z,2019-05-26T23:20:10Z,OWNER,"Thinking about that union query: I imagine doing this with union could encourage multiple full table scans. Maybe this query would only do one? https://latest.datasette.io/fixtures?sql=select%0D%0A++count+%28distinct+name%29+as+count_distinct_column_1%2C%0D%0A++avg%28length%28name%29%29+as+avg_length_column_1%2C%0D%0A++count%28distinct+address%29+as+count_distinct_column_2%2C%0D%0A++avg%28length%28address%29%29+as+avg_length_column_2%0D%0Afrom+roadside_attractions
```
select
  count(distinct name) as count_distinct_column_1,
  avg(length(name)) as avg_length_column_1,
  count(distinct address) as count_distinct_column_2,
  avg(length(address)) as avg_length_column_2
from roadside_attractions
```
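A rough sketch of how a query of this shape could be generated for any list of columns. The helper name `column_stats_sql` is hypothetical, the sample rows are made up for illustration, and names are assumed to be trusted since they are interpolated without escaping:

```python
import sqlite3


def column_stats_sql(table, columns):
    # Hypothetical helper: a single select with two aggregates per column,
    # mirroring the count_distinct_column_N / avg_length_column_N aliases above.
    parts = []
    for i, column in enumerate(columns, start=1):
        parts.append(f'count(distinct [{column}]) as count_distinct_column_{i}')
        parts.append(f'avg(length([{column}])) as avg_length_column_{i}')
    return 'select\n  ' + ',\n  '.join(parts) + f'\nfrom [{table}]'


# Made-up rows, purely to show the query running.
conn = sqlite3.connect(':memory:')
conn.execute('create table roadside_attractions (name text, address text)')
conn.executemany(
    'insert into roadside_attractions values (?, ?)',
    [('Museum', '1 Main St'), ('Tower', '2 Main St')],
)
sql = column_stats_sql('roadside_attractions', ['name', 'address'])
stats = conn.execute(sql).fetchone()
```

Here `stats` is a single row holding the distinct count and average length for each column in the order they were passed in.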
","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-495085021,https://api.github.com/repos/simonw/datasette/issues/485,495085021,MDEyOklzc3VlQ29tbWVudDQ5NTA4NTAyMQ==,9599,2019-05-23T06:27:57Z,2019-05-26T23:15:51Z,OWNER,"I could attempt to calculate the statistics needed for this in a time-limited SQL query, something like this one: https://latest.datasette.io/fixtures?sql=select+%27name%27+as+column%2C+count+%28distinct+name%29+as+count_distinct%2C+avg%28length%28name%29%29+as+avg_length+from+roadside_attractions%0D%0A++union%0D%0Aselect+%27address%27+as+column%2C+count%28distinct+address%29+as+count_distinct%2C+avg%28length%28address%29%29+as+avg_length+from+roadside_attractions
```
select 'name' as column, count (distinct name) as count_distinct, avg(length(name)) as avg_length from roadside_attractions
union
select 'address' as column, count(distinct address) as count_distinct, avg(length(address)) as avg_length from roadside_attractions
```","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-496038601,https://api.github.com/repos/simonw/datasette/issues/485,496038601,MDEyOklzc3VlQ29tbWVudDQ5NjAzODYwMQ==,9599,2019-05-26T23:08:41Z,2019-05-26T23:08:41Z,OWNER,"The code currently assumes the primary key is called ""id"" or ""pk"" - improving it to detect the primary key using database introspection should work much better.","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,
https://github.com/simonw/datasette/issues/485#issuecomment-495083670,https://api.github.com/repos/simonw/datasette/issues/485,495083670,MDEyOklzc3VlQ29tbWVudDQ5NTA4MzY3MA==,9599,2019-05-23T06:21:52Z,2019-05-23T06:22:36Z,OWNER,"If a table has more than two columns we could do a better job of guessing the label column. A few potential tricks:
* look for a column called name or title
* look for the first column of type text
* check for the text column with the most diversity in values","{""total_count"": 0, ""+1"": 0, ""-1"": 0, ""laugh"": 0, ""hooray"": 0, ""confused"": 0, ""heart"": 0, ""rocket"": 0, ""eyes"": 0}",447469253,