You Wouldn't Download a Hacker News

451 points by jasonthorsness 3 days ago

There's also two DBs I know of that have an updated Hacker News table for running analytics on without needing to download it first.

- BigQuery, (requires Google Cloud account, querying will be free tier I'd guess) — `bigquery-public-data.hacker_news.full`

- ClickHouse, no signup needed, can run queries in browser directly, [1]

[1] https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

kordlessagain 2 days ago

It even finds your comment 'clickhouse': https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...
- ZeWaka 2 days ago
  
  and now yours :)
xnx 2 days ago

The ClickHouse resource is amazing. It even has history! I had already done my own exercise of downloading all the JSON before discovering the Clickhouse HN DBs.
k0ns0l a day ago

+1

mattkevan 2 days ago

I did something similar a while back to the @fesshole Twitter/Bluesky account. Downloaded the entire archive and fine-tuned a model on it to create more unhinged confessions.

Was feeling pretty pleased with myself until I realised that all I’d done was teach an innocent machine about wanking and divorce. Felt like that bit in a sci-fi movie where the alien/super-intelligent AI speed-watches humanity’s history and decides we’re not worth saving after all.

nthingtohide 2 days ago

> an innocent machine about wanking and divorce
Let's say you discovered a pendrive of a long lost civilization and train a model on that text data. How would you or the model know that the pendrive contained data on wanking and divorce without anykind of external grounding to that data?
- dTal a day ago
  
  LLMs learn to translate without explicit Rosetta stones (pairs of identical texts in different languages). I suspect they would learn to translate even in the absence of any identical texts at all. "Meaning" is no more or less than structural isomorphism, and humans tend to talk about similar things in similar ways regardless of the language used. So provided that the pendrive contained similar topics to known texts, and was large enough to extract with statistical significance the long tail of semantically meaningful relationships, then a translation could be achieved.
- alabastervlog 2 days ago
  
  This is much more concise than my usual attempts to explain why LLMs don’t “know” things. I’ll be stealing it. Maybe with a different example corpus, lol.
  
  nthingtohide a day ago
  
  I actually I fashioned this logic out of the philosophy question of why certain neural firings appear as sound in our brain while others appear as vision? What gives?
  
  wfn a day ago
  
  iirc there were some experiments where they rewired optic nerve and inner ear in mice to route (so to speak) to different areas of the brain (different cortical (i think?) destinations), and iirc the higher level biological structures of those areas were built up accordingly (regular visual cortex like neural structures for visual data, etc.) iirc was done on very young baby mice or somesuch (classic creepy stuff, do not remember which decade; Connectionism researchers).
  does not answer the general good abstract question and "how semantics possible thru relative context / relations to other terms only?", but speaks to how different modalities of information (e.g. visual data vs. sound data) are likewise represented, modelled, processed, etc. using different neural structures which presumably encode different aspects of information (e.g. layman obvious guess - temporality / relation-across-time-axis much more important for sound data).
  
  nthingtohide a day ago
  
  In case of a person, the external sensory data provides the grounding. Consider a prisoner who spent a long time in hole in the cell, he starts hallucinating due to no sensory information to ground his neuronal firings.
  
  dTal a day ago
  
  Philosophically speaking, sensory data is no more "external" or grounded than words are. You do not see - your eyes see. You do not hear - your ears hear. You cannot truly interact with the world except at a remove, through various organs. You never perceive the world "as it really is". Your brain attempts to build a consistent model that explains all your sensory input - including words. Where words and sensory input disagree, words can even win - I can tell you that you are in VR, or dreaming, or that immigrants are the cause of all your problems, and if I am careful and charismatic you may believe me...
  tl;dr what you think of as "grounding" is just yet more relative context...
  
  tomcam a day ago
  
  So, wanking and hentai?
  
  moate 2 days ago
  
  I think it only fair to leave that in for posterity. Where would we be without wanking and divorce after all?
  
  harry8 2 days ago
  
  Sexually fulfilled?
falcor84 2 days ago

What's wrong with wanking and divorce? These are respectively a way for people to be happier and more self-reliant, and a way for people to get out of a situation that isn't working out for them. I think both are net positives, and I'm very grateful to live in a society that normalizes them.
- pc86 2 days ago
  
  I'm not implying that divorce should be stigmatized or prohibited or anything, but it is bad (necessary evil?) and most people would be much happier if they had never married that person in the first place rather than married them then gotten divorced.
  So "normalize divorce" is pretty backward when what we should be doing is normalizing making sure you're marrying the right person.
  
  nhod 2 days ago
  
  This reminds me of one of my very favorite essays of all time, "Why You Will Marry the Wrong Person" by Alain de Botton from the School of Life. The title is somewhat misleading, and I resisted reading it for a couple years as a result. It is exquisite writing — it couldn't be said with fewer words, and adding more wouldn't help either — and an extraordinary and ultimately hopeful meditation on love and marriage.
  NYT Gift Article: https://www.nytimes.com/2016/05/29/opinion/sunday/why-you-wi...
  
  Nzen 2 days ago
  
  Alain de Botton also published this in video form, seven years ago [0]. If you want the cliff's notes, his School of Life channel has a shorter version [1].
  [0] https://www.youtube.com/watch?v=-EvvPZFdjyk 22 minutes
  [1] https://www.youtube.com/watch?v=zuKV2DI9-Jg 4 minutess
  
  didgetmaster 2 days ago
  
  I agree. The title is wrong. It should be 'Why you are sure to think, whomever you marry, that they are the wrong person".
  
  tailspin2019 2 days ago
  
  You’re 100% right. That essay is superb and I’m glad I read it!
  Thanks for sharing the link.
  
  cgriswald 2 days ago
  
  Making sure you are marrying the right person is normalized. I’d have never even known my ex wasn’t the right person if I hadn’t married her. I didn’t come out of my marriage worse off.
  Normalize divorce and stop stigmatizing it by calling it bad or evil.
  
  bluefirebrand 2 days ago
  
  > I didn’t come out of my marriage worse off
  This is good for you, but many people do come out of their marriages much worse off in various ways
  > Normalize divorce and stop stigmatizing it by calling it bad or evil
  It's not bad or evil, but let's also not pretend that it isn't damaging
  
  cgriswald 2 days ago
  
  We don't have to pretend. The original poster thinks he knows what the world looks like if every marriage that ends in divorce just never happened. Those marriages do happen, though, and to place all the damage generated by that marriage strictly on the divorce is incorrect. Usually one or both parties know the consequences of the divorce and prefer them to the state of the marriage, because the damages are less than if divorce wasn't an option. Claiming divorce is some kind of undesirable 'damaged' state is just as stigmatizing as claiming it is 'bad' or 'evil'.
  The alternative to divorce isn't perfect marriages, it is failed marriages that are inescapable.
  
  gwerbret 2 days ago
  
  > The alternative to divorce isn't perfect marriages, it is failed marriages that are inescapable.
  I'm sure this has nothing to do with you, but by your comments in this thread, I'm reminded of a conversation I had with a friend on a bus one day. We were talking about the unfortunate tendency, in daytoday, of people to shuffle their elderly parents off to nursing homes, rather than to support said parents in some sort of independent living. A nearby passenger jumped into our conversation to argue that there are situations in which the nursing home situation is for the best. Although we agreed with him, he seemed to dislike the fundamental idea of caring for one's elderly parents at all, and subsequently became quite heated.
  
  pc86 a day ago
  
  Who are you referring to with "the original poster?" I follow from this comment the whole way up to the root of the thread and not a single comment even begins to suggest someone "knows what the world looks like if every marriage that ends in divorce just never happened."
  It's pretty easy to create strawmen arguments and argue against those instead of what people actually say, but it makes for at best boring and at worst confusing reading.
  
  smcin 2 days ago
  
  There are lots of proven viable alternatives to quick no-fault divorce, the most obvious being waiting periods or separation periods ranging from months to years. [0]. Parental alienation can be gamed, and frequently is. Psychologist evals can be gamed or biased. Expert witness reports can be gamed. Move-away scenarios (by the custodial parent) can be gamed. Making false or perjurous allegations can be gamed, sometimes without consequence. Jurisdiction-shopping can be gamed. It seems pretty obvious that if there are huge incentives (or penalties) for certain modes of behavior, some types of people will exploit those. Community property/separate property can be gamed. The timing of all these things can be gamed wrt dicslosures, health events, insurance coverage/eligibility, job change/start/end, stock vesting, SS eligibity, tax filings etc. Divorce settlements can be gamed too by one party BK'ing out of a settlement/division of debts. At-fault divorce also exists (in many US states), and obviously can be gamed.
  It's not a false dichotomy between either a jurisdiction must allow instant no-fault divorce for everyone who petitions for it, or none at all.
  > Usually one or both parties know the consequences of the divorce and prefer them to the state of the marriage, because the damages are less than if divorce wasn't an option.
  Sometimes both parties are reasonably rational and honest and non-adversarial, then again sometimes one or both aren't, and it only takes one party (or their relatives) to make things adversarial. If you as a member of the public want to see it in action, in general you can sit in and observe proceedings in your local courthouse in person, or view the docket of that day's cases, or view the local court calendar online. Often the judge and counsel strongly affect the outcome too, much more than the facts at issue.
  > Claiming divorce is some kind of undesirable 'damaged' state is just as stigmatizing as claiming it is 'bad' or 'evil'.
  It is not necessarily the end-state of being divorced that is objectively quantifiably the most damaging to both parties' finances, wellness, children, and society at large, it's the expensive non-transparent ordeal of family court itself that can cause damage, as much as (or sometimes more than) the end-state of ending up divorced. Or both. Or neither.
  > The alternative to divorce is...
  ...a less broken set of divorce laws, for which there are multiple viable candidates. Or indeed, marriage(/cohabitation/relationships) continuing to fall out of favor. Other than measuring crude divorce rates and comparing their ratio to crude marriage rates (assuming same jurisdiction, correcting for offset by the (estimated) average length of marriage, and assuming zero internal migration), as marriage becomes less and less common, we're losing the ability to form a quantified picture of human behavior viz. when partnerships/relationships start or end; many countries' censuses no longer track this or being pressued to stop tracking it [1]; it could be inferred from e.g. bank, insurance, household bill arrangements, credit information, public records, but obviously privacy needs to be respected.
  [0] https://en.wikipedia.org/wiki/Divorce_law_by_country
  [1]: https://www.pewresearch.org/short-reads/2015/05/11/census-bu...
  
  naikrovek 2 days ago
  
  > It's not bad or evil, but let's also not pretend that it isn't damaging
  It’s not any more damaging than getting married in some cases, or staying married.
  Marriage is not some sacred thing to be treasured. It CAN be, but it isn’t inherently good. Inherently, marriage is a legal thing, and that’s about it; being married changes how taxes, confidential medical information, and death are handled, and that’s about it. Every meaning or significance beyond those legal things is up to the happy couple, including how, if, and when, to end the marriage.
  
  pc86 2 days ago
  
  Something can be both bad and not stigmatized. Divorce is a pretty good example here. It's not stigmatized, and to prove it's not say with a straight face it should be illegal and you won't be able to blink before the backlash hits you. It's not stigmatized at all. Most individuals who get married will get divorced. The way the numbers work out something like 60-70% of all marriages contain at least one divorced partner. Saying it's stigmatized is silly and doesn't line up with reality. But of course it's an objectively bad thing. It's messy, it's expensive, feelings get hurt, often times years or decades of peoples' lives are wasted.
  
  cgriswald 2 days ago
  
  I don't have to say it with a straight face because your sibling poster did it for me. Something can be both common and stigmatized. Yes, divorce can be messy, expensive, emotionally fraught, and take time. Mine was, and it still wasn't 'bad' or even undesirable. Starting a business, learning an instrument, training for a sport can also be all those things. We don't call them 'bad', or 'evil', because we don't assume the end result is undesirable.
  The comparison can't be to an imaginary world where everyone always picks the best partner. It has to be to the real world where people don't always pick the best partner and the absence of divorce means they're stuck with them.
  
  pixl97 2 days ago
  
  Eh, I would say it's quite a bit more complicated than you're giving it credit for.
  >Making sure you are marrying the right person is normalized.
  Absolutely not.
  I live in the southern US and we have the culmination of "Young people should get married" coupled with "divorce is bad/evil" and the disincentivization of actually learning about human behaviors/complications before going through something that could be traumatic.
  There are a lot of relationships that from an outside and balanced perspective give all the signs they will not work out and will be potentially dangerous for one or both partners in the relationship.
  
  wkat4242 a day ago
  
  Yeah and the "sex before marriage is bad" thing makes it even harder to experiment and find a partner that really suits.
- dcuthbertson 2 days ago
  
  The innocent machine can't do either. It's akin to having no mouth, but it must scream (apologies to Harlan Ellison)
  
  falcor84 2 days ago
  
  That is a fair point, but it would then apply to everything else we teach it about, like how we perceive the color of the sky or the taste of champagne. Should we remove these from the training set too?
  Is it not still good to be exposed to the experiences of others, even if one cannot experience these things themself?
  
  dcuthbertson 2 days ago
  
  Thanks for saying it's a fair point, but it's more of an offhand joke about "an innocent machine". In reality, a machine, even an LLM, has no innocence. It's just a machine.
  
  falcor84 2 days ago
  
  Having studied biology, I never accepted the "just a machine" argument. Everything is essentially a machine, but when a machine is sufficiently complex, it is rational to apply the Intentional Stance to it.
  
  pixl97 2 days ago
  
  Gets a bit more complicated when we start giving these machines agency.
- adamc 2 days ago
  
  Having gone through a divorce... no. It would be better if people tried harder to make relationships work. Failing that, it would be better to not marry such a person.
  
  zelphirkalt 2 days ago
  
  The state of having married the wrong person, will always occur. To stigmatize divorce is to put people who made the wrong choice once in a worse spot.
  Marriage should be made less artificially blown up with meaning and divorce should not be stigmatized. Instead, if done with a healthy frequency, people divorcing when they notice it is not working, should be applauded, for looking out for their own health.
  At the same time people also should learn how to make relationships in general work.
  
  wkat4242 a day ago
  
  > Marriage should be made less artificially blown up with meaning and divorce should not be stigmatized. Instead, if done with a healthy frequency, people divorcing when they notice it is not working, should be applauded, for looking out for their own health.
  > At the same time people also should learn how to make relationships in general work.
  And most importantly, knowing when to do the one or the other.
  I think this thought that divorce is bad comes from religion which would end up having to care for abandoned kids (especially when contraception didn't exist so having kids wasn't as much of a choice)
  I don't really hear it so much here in Europe except from very religious people. Most people are totally ok with divorce, many aren't even married (I myself never married and I had a gf for 12 years from a Catholic family who also didn't mind at all) and a lot of them are even polyamorous :) I have a feeling that would not go down so well in rural America.
  
  zelphirkalt a day ago
  
  Europe is less extreme in terms of Christianity and its partly outdated values. Sure, in each country you can find hardliners, but I think much less than in the US.
  
  falcor84 2 days ago
  
  People sometimes grow in different directions. Sometimes the person who was perfect for you at 25 just isn't a good fit for you at age 40, regardless of how hard you try to make it work.

userbinator 2 days ago

I had a 20 GiB JSON file of everything that has ever happened on Hacker News

I'm actually surprised at that volume, given this is a text-only site. Humans have managed to post over 20 billion bytes of text to it over the 18 years that HN existed? That averages to over 2MB per day, or around 7.5KB/s.

sph 2 days ago

2 MB per day doesn't sound like a lot. The amount of posts probably has increased exponentially over the years, especially after the Reddit fiasco when we had our latest, and biggest neverending September.
Also, I bet a decent amount of that is not from humans. /newest is full of bot spam.
- samplatt 2 days ago
  
  Plus the JSON structure metadata, which for the average comment is going to add, what, 10%?
  
  kevincox 2 days ago
  
  I suspect it is closer to 100% increase for the average comment. If the average comment is a few senteces and the metadata has id, parent id, author, timestamp and a vote count that can add up pretty fast.
- FabHK 2 days ago
  
  Around one book every 12 hours.
olalonde 2 days ago

7.5KB/s (aka 7500 characters per second) didn't sound realistic... So I did the math[0] and it turns out it's closer to 34 bytes/s (0.03 KB/s). And it's really lower than that because of all the metadata and syntax in the JSON. You were right about the "over 2MB per day" though.
[0] Well, ChatGPT did the math but it seems to check out: https://chatgpt.com/share/68124afc-c914-800b-8647-74e7dc4f21...
NitpickLawyer 2 days ago

The entire reddit archive was ~4TB sometime close to them removing the API. That's fully compressed, it used to be hosted on the-eye. There are stil arrrr places where you can torrent the files if you're inclined to do so. A lot of that is garbage, but the early years are probably worth a look at, especially before 2018-2019 when smarter bots came to be.
xnx 2 days ago

20 GB JSON is surprising to me. I have an sqlite file of all HN data that is 20 GB, it would be much larger as JSON.
- wolfgang42 2 days ago
  
  20 GB of JSON is correct; here’s the entire dump straight from the API up to last Monday:
  $ du -c ~/feepsearch-prod/datasource/hacker-news/data/dump/*.jsonl | tail -n1 19428360 total
  Not sure how your sqlite file is structured but my intuition is that the sizes being roughly the same sounds plausible: JSON has a lot of overhead from redundant structure and ASCII-formatted values; but sqlite has indexes, btrees, ptrmaps, overflow pages, freelists, and so on.
  
  elcritch 2 days ago
  
  Sqlite also doesn’t have fixed types, but uses a tagged value system to store data. Well according to what I’ve read on the topic.
- kortilla a day ago
  
  SQLite files are optimized for fast querying, not size.
dredmorbius a day ago

The total strikes me as small. That's nearly two decades of contributions from several 100k active members, and a few million total. HN is what would have been a substantial social network prior to Facebook, and (largely on account of its modest size and active moderation) a high-value one.
I did some modelling of how much contributed text data there was on Google+ as that site was shutting down in 2019.
By "text data", I'm excluding both media (images, audio, video), and all the extraneous page throw-weight (HTML scaffolding, CSS, JS).
Given the very low participation rates, and finding that posts on average ran about 120 characters (I strongly suspect that much activity was part of a Twitter-oriented social strategy, though it's possible that SocMed posts just trend short), seven years' of history from a few tens of millions of active accounts (out of > 4 billion registered profiles) only amounted to a few GiB.
This has a bearing on a few other aspects:
- The Archive Team (AT, working with, but unaffiliated with, the Internet Archive, IA) was engaged in an archival effort aimed at G+. That had ... mixed success (much content was archived, one heck of a lot wasn't, very few comments survive (threads were curtailed to the most recent ten or so, absent search it remains fairly useless, those with "vanity accounts" (based on a selected account name rather than a random hash) prove to be even less accessible). In addition to all of that, by scraping full pages and attempting to present the site as it presented online, AT/IA are committing to a tremendous increase in the stored data requirements whilst missing much of what actually made the site actually of interest.
- Those interested in storing text contributions of even large populations face very modest storage requirements. If, say, average online time is 45 minutes daily, typing speed is 45 wpm, and only half of online time is spent writing vs. reading, that's roughly 1,000 words/(person*day), or about 6 KiB/(person*day). That's 6 MiB per 1,000 people, 6 GiB per 1 million, 6 PiB per billion. And ... the true values are almost certainly far lower: I'm pretty certain I've overstated writing time (it's likely closer to 10%), and typing speed (typing on mobile is likely closer to 20--30 wpm, if that). E.g., Facebook sees about 2.45 billion "pieces of content" posted per day, of which half is video. If we assume 120 characters (bytes) per post, that's a surprisingly modest amount, substantially less than 300 GiB/day of text data. (Images, audio, and video will of course inflate that markedly).
- The amount of non-entered data (e.g., location, video, online interactions, commerce) is the bulk of current data collection / surveillance state & capitalism systems.

jakegmaths 2 days ago

Your query for Java will include all instances of JavaScript as well, so you're over representing Java.

smarnach 2 days ago

Similarly, the Rust query will include "trust", "antitrust", "frustration" and a bunch of other words
- sph 2 days ago
  
  A guerilla marketing plan for a new language is to call it a common one word syllable, so that it appears much more prominent than it really is on badly-done popularity contests.
  Call it "Go", for example.
  (Necessary disclaimer for the irony-impaired: this is a joke and an attempt at being witty.)
  
  setopt 2 days ago
  
  Let’s make a language called “A” in that case. (I mean C was fine, so why not one letter?)
  
  TZubiri 2 days ago
  
  Or call it the name of a popular song to appeal to the youngins.
  I present to you "Gangam C"
  
  InDubioProRubio 2 days ago
  
  You also wouldn't acronym hijack overload to boost mental presence in gamers LOL
- matsemann 2 days ago
  
  Reminded me about Scunthorpe problem https://en.wikipedia.org/wiki/Scunthorpe_problem
- Matumio a day ago
  
  Now if we only could disambiguate words based on context. But you'd need a good language model for that, and we don't... wait.
- brian-armstrong a day ago
  
  Amusingly, the chart shows Rust's popularity starting from before its release. The rust hype crowd is so exuberant, they began before the language even existed!
jasonthorsness 2 days ago

Ah right… maybe even more unexpected then to see a decline
- cs02rm0 2 days ago
  
  I'm not so sure, while Java's never looked better to me, it does "feel" to me to be in significant decline in terms of what people are asking for on LinkedIn.
  I'd imagine these days typescript or node might be taking over some of what would have hit on javascript.
  
  cess11 2 days ago
  
  Recruiting Java developers is easy mode, there are rather large consultancies and similar suppliers that will sell or rent them to you in bulk so you don't need to nag with adverts to the same extent as with pythonistas and rubyists and TypeScript.
  But there is likely some decline for Java. I'd bet Elixir and Erlang have been nibbling away on the JVM space for quite some time, they make it pretty comfortable to build the kind of systems you'd otherwise use a JVM-JMS-Wildfly/JBoss rig for. Oracle doesn't help, they take zero issue with being widely perceived as nasty and it takes a bit of courage and knowledge to manage to avoid getting a call from them at your inconvenience.
  
  patates 2 days ago
  
  Speaking as someone who ended up in the corporate Java world somewhat accidentally (wasn't deep in the ecosystem before): even the most invested Java shops seem wary of Oracle's influence now. Questioning Oracle tech, if not outright planning an exit strategy, feels like the default stance.
  
  cess11 2 days ago
  
  Most such places probably have some trauma related to Oracle now. Someone spun up the wrong JVM by accident and within hours salespeople were on the phone with some middle manager about how they would like to pay for it, that kind of thing. Or just the issue of injecting their surveillance trojans everywhere and knowing they're there, that's pretty off-putting in itself.
  Which is a pity, once you learn to submit to and tolerate Maven it's generally a very productive and for the most part convenient language and 'ecosystem'. It's like Debian, even if you fuck up badly there is likely a documented way to fix it. And there are good libraries for pretty much anything one could want to do.
  
  karel-3d 2 days ago
  
  New Java looks actually good, but most of the Java actual ecosystem is stuck in the past.... and you will mostly work within the existing ecosystem
- smcin 2 days ago
  
  a) Does your query for 'JS' return instances of 'JSON'?
  b) The ultimate hard search topic for is 'R' / 'R language'. Check if you think you index it corectly. Or related terms like RStudio, Posit, [R]Shiny, tidyverse, data.table, Hadleyverse...

SilverBirch 2 days ago

What is the netiquette of downloading HN? Do you ping Dang and ask him before you blow up his servers? Or do you just assume at this point that every billion dollar tech company is doing this many times over so you probably won't even be noticed?

internetter 2 days ago

There's literally a public database
https://console.cloud.google.com/marketplace/product/y-combi...
- umvi 2 days ago
  
  What if someone from EU invokes "right to be forgotten" and demands HN delete past comments from years ago. Will those deletions be reflected in the public database? Or could you mine the db to discover deleted data?
  
  jeremyjh 2 days ago
  
  They need to issue their demand to whoever is hosting their data. If HN has deleted it, they are not hosting it.
- dang 2 days ago
  
  That's an entirely third party project so I doubt they should be listing YC as a partner there.
  
  internetter 2 days ago
  
  Huh, yeah that is really misleading. Makes it look like it is by YC.
  
  dang 2 days ago
  
  The question is should we ask them to change it...the thought of embarking into the googlemaze with faint hope of encountering a human being makes me le tired.
  
  Tomte a day ago
  
  It doesn't seem to harm you, trademark issues seem not to be your priority there, and do you really want everyone to mail you asking for permission and/or instructions how to download? I guess the Google thing saves you some trouble.
  It would have been nice to coordinate that with you, though.
  
  dang a day ago
  
  We've gotten quite a few support emails over the years from people asking us to help them with that product. I always used to wonder why, but now I think I know.
krapp 2 days ago

HN has an API, as mentioned in the article, which isn't even rate limited. And all of the data is hosted on Firebase, which is a YC company. It's fine.
- mikeevans 2 days ago
  
  Firebase is owned and operated by Google (has been for a while).
euroderf 2 days ago

Not to mention three-letter agencies, incidentally attaching real names to HN monikers ?
TZubiri 2 days ago

Well, it's called Hacker News, so hacking is fair game, at least in the good sense of the word.
alt227 2 days ago

If something is on the public web, it is already being scraped by thousands of bots.
dangoodmanUT 2 days ago

there's literally an API they promote. Did you read that part before trying to cancel them?

flakiness 2 days ago

I have done something similar. I cheated to use BigQuery dataset (which somehow keeps getting updated) and export the data to parquet, download it and query it using duckdb.

minimaxir 2 days ago

That's not cheating, that's just pragmatic.
- AbstractH24 2 days ago
  
  What a pragmatic way to rationalize most cheating

g8oz 2 days ago

I predict that in the coming years a lot of APIs will begin offer the option of just returning a duckdb file. If you're just going to load the json into a database anyway, why not just get a database in the response.

vdm a day ago

zstd parquets exported from my duckdb 1.2 files compress 2-3x more

bambax 2 days ago

> Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a chinese room oscillator perpetually echoing and recycling the past.

The author said this in jest, but I fear someone, someday, will try this; I hope it never happens but if it does, could we stop it?

icoder 2 days ago

I'm more and more convinced of an old idea that seems to become more relevant over time: to somehow form a network of trust between humans so that I know that your account is trusted by a person (you) that is trusted by a person (I don't know) [...] that is trusted by a person (that I do know) that is trusted by me.
Lots of issues there to solve, privacy being one (the links don't have to be known to the users, but in a naive approach they are there on the server).
Paths of distrust could be added as negative weight, so I can distrust people directly or indirectly (based on the accounts that they trust) and that lowers the trust value of the chain(s) that link me to them.
Because it's a network, it can adjust itself to people trying to game the system, but it remains a question to how robust it will be.
- XorNot 2 days ago
  
  I think technically this is the idea that GPG's web of trust was circling without quite staring at, which is the oddest thing about the protocol: it's used mostly today for machine authentication, which it's quite good at (i.e. deb repos)...but the tooling actually generally is oriented around verifying and trusting people.
  
  wobfan 2 days ago
  
  Yeah exactly, this was exactly the idea behind that. Unfortunately, while on paper it just sounds like a sound idea, at least IMO, though ineffective, it has proven time and time again that the WOT idea in PGP has no chance against the laziness of humans.
- Philpax 2 days ago
  
  https://en.wikipedia.org/wiki/Key_signing_party
  
  genewitch 2 days ago
  
  Matrix protocol or at least the clients agree that several emoji is a key - which is fine - and you verify by looking at the keys (on each client) at the same time in person, ideally. I've only ever signed for people in person, and one remote attestation; but we had a separate verified private channel and attested the emoji that way.
  
  nickdothutton 2 days ago
  
  Do these still happen? They were common (-ish, at least in my circles) in the 90s during the crypto wars, often at the end of conferences and events, but I haven't come across them in recent years.
- drcongo 2 days ago
  
  I actually built this once, a long time ago for a very bizarre social network project. I visualised it as a mesh where individuals were the points where the threads met, and as someone's trust level rose, it would pull up the trust levels of those directly connected, and to a lesser degree those connected to them - picture a trawler fishing net and lifting one of the points where the threads meet. Similarly, a user whose trust lowered over time would pull their connections down with them. Sadly I never got to see it at the scale it needed to become useful as the project's funding went sideways.
  
  icoder 2 days ago
  
  Yeah building something like this is not a weekend project, getting enough traction for it to make sense is another orders of magnitude beyond that.
  I like the idea of one's trust to leverage that of those around them. This may make it more feasible to ask some 'effort' for the trust gain (as a means to discourage duplicate 'personas' for a single human), as that can ripple outward.
  
  all2 2 days ago
  
  How would 'trust' manifest? A karma system?
  How are individuals in the network linked? Just comments on comments? Or something different?
  
  drcongo a day ago
  
  The system I built it for was invite only so the mesh was self-building, and yeah, there was a karma-like system that affected the trust levels, which in turn then gave users extra privileges such as more invites. Most of this was hidden from the users to make it slightly less exploitable, though if it had ever reached any kind of scale I'd imagine some users would work out ways to game it.
- littlestymaar 2 days ago
  
  Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.
  For a mix of ideological reasons and lack of genuine interest for the internet from the legislators, mainly due to the generational factor I'd guess, it hasn't happened yet, but I expect government issued equivalent of IDs and passports for the internet to become mainstream sooner than later.
  
  eadmund 2 days ago
  
  > Ultimately, guaranteeing common trust between citizens is a fundamental role of the State.
  I don’t think that really follows. Businesses credit bureaus and Dun & Bradstreet have been privately enabling trust between non-familiar parties for quite a long time. Various networks of merchants did the same in the Middle Ages.
  
  littlestymaar 2 days ago
  
  > Businesses credit bureaus and Dun & Bradstreet have been privately enabling trust between non-familiar parties for quite a long time.
  Under the supervision of the State (they are regulated and rely on the justice and police system to make things work).
  > Various networks of merchants did the same in the Middle Ages.
  They did, and because there was no State the amount of trust they could built was fairly limited compared to was has later been made possible by the development of modern states (the industrial revolution appearing in the UK has partly been attributed to the institutional framework that existed there early).
  Private actors can, and do, and have always done, build their own makeshift trust network, but building a society-wide trust network is a key pillar of what makes modern states “States” (and it directly derives from the “monopoly of violence”).
  
  lormayna 2 days ago
  
  Havala (https://it.m.wikipedia.org/wiki/Hawala) or other similar way to transfer money abroad are working over a net of trust, but without any state trust system.
  
  littlestymaar 2 days ago
  
  Compare its use to SWIFT and you'll see the difference.
  
  nostrademons 2 days ago
  
  That’s not really what research on state formation has found. The basic definition of a state is “a centralized government with a monopoly on the legitimate use of force”, and as you might expect from the definition, groups generally attain statehood by monopolizing the use of force. In other words, they are the bandits that become big enough that nobody dares oppose them. They attain statehood through what’s effectively a peace treaty, when all possible opposition basically says “okay, we’re submit to your jurisdiction, please stop killing us”. Very often, it actually is a literal peace treaty.
  States will often co-opt existing trust networks as a way to enhance and maintain their legitimacy, as with Constantine’s adoption of Christianity to preserve social cohesion in the Roman Empire, or all the compromises that led the 13 original colonies to ratify the U.S. constitution in the wake of the American Revolution. But violence comes first, then statehood, then trust.
  Attempts to legislate trust don’t really work. Trust is an emotion, it operates person-to-person, and saying “oh, you need to trust such-and-such” don’t really work unless you are trusted yourself.
  
  littlestymaar 2 days ago
  
  > The basic definition of a state is “a centralized government with a monopoly on the legitimate use of force
  I'm not saying otherwise (I've even referred to this in a later comment).
  > But violence comes first, then statehood, then trust.
  Nobody said anything about the historical process so you're not contradicting anyone.
  > Attempts to legislate trust don’t really work
  Quite the opposite, it works very, very well. Civil laws and jurisdiction on contracts have existed since the Roman Republic, and every society has some equivalent (you should read about how the Taliban could get back to power so quickly in big part because they kept doing civil justice in the rural afghan society even while the country was occupied by the US coalition).
  You must have institutions to be sure than the other party is going to respect the contract, so that you don't have to trust them, you just need to trust that the state is going to enforce that contract (what they can do because they have the monopoly of violence and can just force the party violating the contract into submission).
  With the monopoly of violence comes the responsibility to use your violence to enforce contracts, otherwise social structures are going to collapse (and someone else is going to take that job from you, and gone is your monopoly of violence)
  
  icoder 2 days ago
  
  Interestingly, as I've begun to realise the ease by which a State's trust can sway has actually increased my believe that this should come from 'below'. I think a trust network between people (of different countries) can be much more resilient.
- haswell 2 days ago
  
  I’ve also been thinking about this quite a bit lately.
  I also want something like this for a lightweight social media experience. I’ve been off of the big platforms for years now, but really want a way to share life updates and photos with a group of trusted friends and family.
  The more hostile the platforms become, the more viable I think something like this will become, because more and more people are frustrated and willing to put in some work to regain some control of their online experience.
  
  TheOtherHobbes a day ago
  
  They're different application types - friends + family relationship reinforcement, social commenting (which itself varies across various dimensions, from highlighting usefulness to unapologetically mindless entertainment), social content sharing and distribution (interest group, not necessarily personal, not specifically for profit), social marketing (buy my stuff), and political influence/opinion management.
  Meta and X have glommed them all together and made them unworkable with opaque algorithmic control, to the detriment of all of them.
  And then you have all of them colonised by ad tech, which distorts their operation.
  
  jeremyjh 2 days ago
  
  The key is to completely disconnect all ad revenue. I'm skeptical people are willing to put in some money to regain control; not in the kind of percentages that means I can move most of my social graph. Network effects are a real issue.
- brongondwana 2 days ago
  
  Also there's the problem that every human has to have perfect opsec or you get the problem we have now, where there are massive botnets out there of compromised home computers.
- im3w1l 2 days ago
  
  GPG lost, TLS won. Both are actually webs of trust with the same underlying technology. But they have different cultures and so different shapes. GPG culture is to trust your friends and have them trust their friends. With TLS culture you trust one entity (e.g. browser) that trusts a couple dozen entities that (root certificate authorities), that either signs keys directly or can fan out to intermediate authorities that then sign keys. The hierarchical structure has proven much more successful than the decentralized one.
  Frankly I don't trust my friends of friends of friends not to add thirst trap bots.
  
  lxgr 2 days ago
  
  The difference is in both culture and topology.
  TLS (or more accurately, the set of browser-trusted X.509 root CAs) is extremely hierarchical and all-or-nothing.
  The PGP web of trust is non-hierarchical and decentralized (from an organizational point of view). That unfortunately makes it both more complex and less predictable, which I suppose is why it “lost” (not that it’s actually gone, but I personally have about one or maybe two trusted, non-expired keys left in my keyring).
  
  kevin_thibedeau 2 days ago
  
  The issue is key management. TLS doesn't usually require client keys. GPG requires all receivers to have a key.
  
  amenghra 2 days ago
  
  Couple dozen => it’s actually 50-ish, with a mix of private and government entities located all over the world.
  The fact that the Spanish mint can mint (pun!) certificates for any domain is unfortunate.
  Hopefully, any abuse would be noticed quickly and rights revoked.
  It would maybe have made more sense for each country’s TLD to have one or more associated CA (with the ability to delegate trust among friendly countries if desired).
  https://wiki.mozilla.org/CA/Included_Certificates
  
  wkat4242 a day ago
  
  Yes I never understood why the scope of a CA was not previously declared as part of their CA certificate. The purpose is (email, website etc) but not the possible domains. I'm not very happy that the countless Chinese CAs included in Firefox can sign any valid domain I use locally. They should be limited to anything .cn only.
  At least they seem to have kicked out the Russian ones now. But it's weird that such an important decision lies with arbitrary companies like OS and browser developers. On some platforms (Android) it's not even possible to add to the system CA list without root (only the user one which apps can choose to ignore)
- marcusb 2 days ago
  
  Isn't this vaguely how the invite system at Lobsters functions? There's a public invite tree, and users risk their reputation (and posting access) when they invite new users.
  
  withinboredom 2 days ago
  
  I know exactly zero people over there. I am also not about to go brown nose my way into it via IRC (or whatever chat they are using these days). I'd love to join, someday.
  
  somethingsome 2 days ago
  
  Hey I never actually tried lobsters, do you mind if I ask an invite?
- SuperShibe 2 days ago
  
  I think this ideas problem might be the people part, specifically the majority type of people that will click absolutely anything for a free iPad
  
  icoder 2 days ago
  
  Theoretically that should swiftly be reflected in their trust level. But maybe I'm too optimistic.
  I have nothing intrinsically against people that 'will click absolutely anything for a free iPad' but I wouldn't mind removing them from my online interactions if that also removes bots, trolls, spamners and propaganda.
miki123211 2 days ago

How do you know it isn't already happening?
With long and substantive comments, sure, you can usually tell, though much less so now than a year or two ago. With short, 1 to 2 sentence comments though? I think LLMs are good enough to pass as humans by now.
- Joker_vD 2 days ago
  
  But what if LLMs will start leaving constructive and helpful comments? I personally would feel like xkcd [0], but others may disagree.
  [0] https://xkcd.com/810/
  
  gosub100 2 days ago
  
  That's the moment we will realize that it's not the spam that bothers us, but rather that there is no human interaction. How vapid would it be to have a bunch of fake comments saying eat more vegetables, good job for not running over that animal in the road, call mom tonight it's been a while, etc. They mean nothing if they were generated by a piece of silicon.
  
  miki123211 2 days ago
  
  I think a much more important question is what happens when we have no idea who's an LLM and who's a real person.
  Do we accuse everybody of being an LLM? Will most threads devolve into "you're an LLM, no you're the LLM" wars? Will this give an edge to non-native English speakers, because grammatical errors are an obvious tell that somebody is human? Will LM makers get over their squeamishness and make "write like a Mexican who barely speaks English" a prompt that works and produces good results?
  Maybe the whole system of anonymity on the internet gets dismantled (perhaps after uncovering a few successful llm-powered psy-ops or under the guise of child safety laws), and everybody just needs to verify their identity everywhere (or login with Google)? Maybe browser makers introduce an API to do this as anonymously and frictionlessly as possible, and it becomes the new normal without much fuss? Is turnstile ever going to get good enough to make this whole issue moot?
  I think we have a very interesting few years in front of us.
  
  datameta 2 days ago
  
  Also neuronormative individuals sometimes mistake neurodivergent usage of language for LLM-speak which might have similar pattern matching schema reinforced
  
  withinboredom 2 days ago
  
  I believe they mean whatever you mean it to mean. Humanity has existed on religion based on what some dead people wrote down, just fine. Er, well, maybe not "just fine" but hopefully you get the gist: you can attribute whatever meaning you want to the AI, holy text, or other people.
  
  gosub100 2 days ago
  
  Religion is the opposite of AI text generation. It brings people together to be less lonely.
  AI actively tears us apart. We no longer know if we're talking to a human, or if an artists work came from their ability, or if we will continue to have a job to pay for our living necessities.
  
  withinboredom a day ago
  
  Did we ever know those things before? Even talking to a human face-to-face, I’ve had people lie to my face to try and scam/sell me something and people resell art all the time. You have little ability to tell whether an artist’s work is genuine or a copy unless they are famous.
  And the job? I’ve been laid off several times in my career. You never know if you will have a job tomorrow or not.
  AI has changed none of this, it is only exposing these problems to more people than before because it makes these things easier. It also makes good works easier, but I don’t think it cheapens that work if the person had the capability in the first place.
  In essence, we have the same problems we had before and now we are socially forced to deal with them a bit more head-on. I don’t think it’s a bad thing though. We needed to deal with this at some point anyway.
  
  gosub100 15 hours ago
  
  false equivalence fallacy: AI can destroy jobs and its ok unless you can compare it to a 0-layoff baseline.
  
  withinboredom 8 hours ago
  
  AI doesn’t destroy jobs. That’s like saying bulldozers destroyed jobs. No, it makes certain jobs easier. You still have to know what you’re doing. I can have an AI generate a statement of work in 30 seconds, but I still need to validate that output with people. You can sometimes validate it yourself to a degree, just like you can look at hole from a bulldozer and know it’s a hole. You just don’t know if it is a safe hole.
  
  pixl97 2 days ago
  
  >Religion is the opposite of AI text generation. It brings people together to be less lonely
  Eh yes, but also debatable.
  It brings you together if you follow their rules, and excommunicates you to the darkness if you do not. It is a complicated set of controls from the times before the rules of society were well codified.
  
  Pikamander2 2 days ago
  
  I was browsing a Reddit thread recently and noticed that all of the human comments were off-topic one-liners and political quips, as is tradition.
  Buried at the bottom of the thread was a helpful reply by an obvious LLM account that answered the original question far better than any of the other comments.
  I'm still not sure if that's amazing or terrifying.
  
  melagonster 2 days ago
  
  This just another reddit or HN.
nashashmi 2 days ago

We LLMs only output the average response of humanity because we can only give results that are confirmed by multiple sources. On the contrary, many of HN’s comments are quite unique insights that run contrary to the average popular thought. If this is ever to be emulated by an LLM, we would give only gibberish answers. If we had a filter to that gibberish to only permit answers that are reasonable and sensible, our answers would be boring and still be gibberish. In order for our answers to be precise, accurate and unique, we must use something other than LLMs.
r3trohack3r 2 days ago

HN already has a pretty good immune system for this sort of thing. Low-effort or repetitive comments get down-voted, flagged, and rate-limited fast. The site’s karma and velocity heuristics are crude compared with fancy ML, but they work because the community is tiny relative to Reddit or Twitter and the mods are hands-on. A fleet of sock-puppet LLM accounts would need to consistently clear that bar—i.e. post things people actually find interesting—otherwise they’d be throttled or shadow-killed long before they “replace all human text.”
Even if someone managed to keep a few AI-driven accounts alive, the marginal cost is high. Running inference on dozens of fresh threads 24/7 isn’t free, and keeping the output from slipping into generic SEO sludge is surprisingly hard. (Ask anyone who’s tried to use ChatGPT to farm karma—it reeks after a couple of posts.) Meanwhile the payoff is basically zero: you can’t monetize HN traffic, and karma is a lousy currency for bot-herders.
Could we stop a determined bad actor with resources? Probably, but the countermeasures would look the same as they do now: aggressive rate-limits, harsher newbie caps, human mod review, maybe some stylometry. That’s annoying for legit newcomers but not fatal. At the end of the day HN survives because humans here actually want to read other humans. As soon as commenters start sounding like a stochastic parrot, readers will tune out or flag, and the bots will be talking to themselves.
Written by GPT-3o
- stephenhumphrey 2 days ago
  
  Regardless of whether that final line reflects reality or is merely tongue-in-cheek snark, it elevates the whole post into the sublime.
Etheryte 2 days ago

See the Metal Gear franchise [0], the Dead Internet Theory [1], and many others who have predicted this.
> Hideo Kojima's ambitious script in Metal Gear Solid 2 has been praised, some calling it the first example of a postmodern video game, while others have argued that it anticipated concepts such as post-truth politics, fake news, echo chambers and alternative facts.
[0] https://en.wikipedia.org/wiki/Metal_Gear
[1] https://en.wikipedia.org/wiki/Dead_Internet_theory
djoldman 2 days ago

A variant of this was done for 4chan by the fantastic Yannic Kilcher:
https://en.wikipedia.org/wiki/GPT4-Chan
holuponemoment 2 days ago

Does it even matter?
Perhaps I am jaded but most if not all people regurgitate about topics without thought or reason along very predictable paths, myself very much included. You can mention a single word covered with a muleta (Spanish bullfighting flag) and the average person will happily run at it and give you a predictable response.
- bob1029 2 days ago
  
  It's like a Pavlovian response in me to respond to anything SQL or C# adjacent.
  I see the exact same in others. There are some HN usernames that I have memorized because they show up deterministically in these threads. Some are so determined it seems like a dedicated PR team, but I know better...
  
  OccamsMirror 2 days ago
  
  I always love checking the comments on articles about Bevy to see how the metaverse client guy is going.
- gosub100 2 days ago
  
  The paths are going to be predictable by necessity. It's not possible for everyone to have a uniquely derived interpretation about most common issues, whether that's standard lightning rod politics but also extending somewhat into tech socio/political issues.
no_time 2 days ago

I can’t think of an solution that preserves the open and anonymous nature that we enjoy now. I think most open internet forums will go one of the following routes:
- ID/proof of human verification. Scan your ID, give me your phone number, rotate your head around while holding up a piece of paper etc. note that some sites already do this by proxy when they whitelist like 5 big email providers they accept for a new account.
- Going invite only. Self explanatory and works quite well to prevent spam, but limits growth. lobste.rs and private trackers come to mind as an example.
- Playing a whack-a-mole with spammers (and losing eventually). 4chan does this by requiring you to solve a captcha and requires you to pass the cloudflare turnstile that may or may not do some browser fingerprinting/bot detection. CF is probably pretty good at deanonimizing you through this process too.
All options sound pretty grim to me. Im not looking forward to the AI spam era of the internet.
- theasisa 2 days ago
  
  Wouldn't those only mean that the account was initially created by a human but afterwards there are no guarantees that the posts are by humans.
  You'd need to have a permanent captcha that tracks that the actions you perform are human-like, such as mouse movement or scrolling on phone etc. And even then it would only deter current AI bots but not for long as impersonation human behavior would be a 'fun' challenge to break.
  Trusted relationships are only as trustworthy as the humans trusting each other, eventually someone would break that trust and afterwards it would be bots trusting bots.
  Due to bots already filling up social media with their spew and that being used for training other bots the only way I see this resolving itself is by eventually everything becoming nonsensical and I predict we aren't that far from it happening. AI will eat itself.
  
  no_time 2 days ago
  
  >Wouldn't those only mean that the account was initially created by a human but afterwards there are no guarantees that the posts are by humans.
  Correct. But for curbing AI slop comments this is enough imo. As of writing this, you can quite easily spot LLM generated comments and ban them. If you have a verification system in place then you banned the human too, meaning you put a stop to their spamming.
- icoder 2 days ago
  
  I'm sometimes thinking about account verification that requires work/effort over time, could be something fun even, so that it becomes a lot harder to verify a whole army of them. We don't need identification per se, just being human and (somewhat) unique.
  See also my other comment on the same parent wrt network of trust. That could perhaps vet out spammers and trolls. On one and it seems far fetched and a quite underdeveloped idea, on the other hand, social interaction (including discussions like these) as we know it is in serious danger.
- dns_snek 2 days ago
  
  There must be a technical solution to this based on some cryptographic black magic that both verifies you to be a unique person to a given website without divulging your identity, and without creating a globally unique identifier that would make it easy to track us across the web.
  Of course this goes against the interests of tracking/spying industry and increasingly authoritarian governments, so it's unlikely to ever happen.
  
  05 2 days ago
  
  Oh you mean something like Apple's Private Access Tokens?
  https://support.apple.com/en-us/102591
  https://blog.cloudflare.com/eliminating-captchas-on-iphones-...
  
  dns_snek 2 days ago
  
  I don't think that's what I was going for? As far as I can see it relies on a locked down software stack to "prove" that the user is running blessed software on top of blessed hardware. That's one way of dealing with bots but I'm looking for a solution that doesn't lock us out of our own devices.
  
  vvillena 2 days ago
  
  These kinds of solutions are already deployed in some places. A trusted ID server creates a bunch of anonymous keys for a person, the person uses these keys to identify in pages that accept the ID server keys. The page has no way to identify a person from a key.
  The weak link is in the ID servers themselves. What happens if the servers go down, or if they refuse to issue keys? Think a government ID server refusing to issue keys for a specific person. Pages that only accept keys from these government ID servers, or that are forced to only accept those keys, would be inaccessible to these people. The right to ID would have to be enshrined into law.
  
  no_time 2 days ago
  
  As I see it, a technical solution to AI spam inherently must include a way to uniquely identify particular machines at best, and particular humans responsible for said machines at worst.
  This verification mechanism must include some sort of UUID to reign in a single bad actor who happens to validate his/her bot farm of 10000 accounts from the same certificate.
kriro 2 days ago

I think LLMs could be a great driver of private-public key encryption. I could see a future where everyone finally wants to sign their content. Then at least we know it's from that person or an LLM-agent by that person.
Maybe that'll be a use case for blockchain tech. See the whole posting history of the account on-chain.
ahoka 2 days ago

Probably already happening.
drcongo 2 days ago

The internet is going to become like William Basinski's Disintegration Loops, regurgitating itself with worse fidelity until it's all just unintelligible noise.
dangoodmanUT 2 days ago

I imagine LLMs already have this too
genewitch 2 days ago

I have all of n-gate as json with the cross references cross referenced.
Just in case I need to check for plagiarism.
I don't have enough Vram nor enough time to do anything useful on my personal computer. And yes I wrote vram like that to pothole any EE.
_Algernon_ 2 days ago

This is probably already happening to some extent. I think the best we can hope for is xkcd 810: https://xkcd.com/810/
photochemsyn 2 days ago

It's hopeless.
We can still take the mathematical approach: any argument can be analyzed for logical self-consistency, and if it fails this basic test, reject it.
Then we can take the evidentiary approach: if any argument that relies on physical real-word evidence is not supported by well-curated, transparent, verifiable data then it should also be rejected.
Conclusion: finding reliable information online is a needle-in-a-haystack problem. This puts a premium on devising ways (eg a magnet for the needle) to filter the sewer for nuggets of gold.

stefs 2 days ago

please do not use stacked charts! i think it's close to impossible to not to distort the readers impression because a) it's very hard to gauge the height of a certain data point in the noise and b) they're implying a dependency where there _probably_ is none.

jimmySixDOF a day ago

Thats just where a 3D approach fixes the problem because you stack but with some offset there is nothing better for one shot only look once comprehension of high volume data using game engine tech for real world business intelligence please see the work of https://flowimmersive.com/
seabass 2 days ago

My first thought as well! The author of uPlot has a good demo illustrating their pitfalls https://leeoniya.github.io/uPlot/demos/stacked-series.html
jasonthorsness 2 days ago

It's true :( but line charts of the data had too much overlap and were hard to see anything. I was thinking next time maybe multiple line charts aligned and stacked, with one series per region?
dguest 2 days ago

How do you feel about stacked plots on a logarithmic y axis? Some physics experiments do this all the time [1] but I find them pretty uninitiative.
[1]: https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PUBNOTES/ATL-...
- lblume 2 days ago
  
  What is this even supposed to represent? The entire justification I could give for stacked bars is that you could permute the sub-bars and obtain comparable results. Do the bars still represent additive terms? Multiplicative constants? As a non-physicist I would have no idea on how to interpret this.
  
  dguest 2 days ago
  
  It's a histogram. Each color is a different simulated physical process: they can all happen in particle collisions, so the sum of all of them should add up to the data the experiment takes. The data isn't shown here because it hasn't been taken yet: this is an extrapolation to a future dataset. And the dotted lines are some hypothetical signal.
  The area occupied by each color is basically meaningless, though, because of the logarithmic y-scale. It always looks like there's way more of whatever you put on the bottom. And obviously you can grow it without bound: if you move the lower y-limit to 1e-20 you'll have the whole plot dominated by whatever is on the bottom.
  For the record I think it's a terrible convention, it just somehow became standard in some fields.

ashish01 2 days ago

I wrote one a while back https://github.com/ashish01/hn-data-dumps and it was a lot of fun. One thing which will be cool to implement is that more recent items will update more over time making any recent downloaded items more stale than older ones.

jasonthorsness 2 days ago

Yeah I’m really happy HN offers an API like this instead of locking things down like a bunch of other sites…

I used a function based on the age for staleness, it considers things stale after a minute or two initially and immutable after about two weeks old.

    // DefaultStaleIf marks stale at 60 seconds after creation, then frequently for the first few days after an item is
    // created, then quickly tapers after the first week to never again mark stale items more than a few weeks old.

    const DefaultStaleIf = "(:now-refreshed)>" +
 "(60.0*(log2(max(0.0,((:now-Time)/60.0))+1.0)+pow(((:now-Time)/(24.0*60.0*60.0)),3)))"

https://github.com/jasonthorsness/unlurker/blob/main/hn/core...

wslh 2 days ago

It would be great if it is available as a torrent. There also mutable torrents [1]. Not implemented everywhere but there are available ones [2].

[1] https://www.bittorrent.org/beps/bep_0046.html

[2] https://www.npmjs.com/package/bittorrent-dht

Am4TIfIsER0ppos 2 days ago

I hope they snatched my flagged comments. I would be pleased to have helped make the AI into an asshole. Here's hoping for another Tay AI.

9rx 2 days ago

> The Rise Of Rust

Shouldn't that be The Fall Of Rust? According to this, it saw the most attention during the years before it was created!

emilbratt 2 days ago

The chart is a stacked one, so we are looking at the height each category takes up and not the height each category reach.
- vaylian a day ago
  
  What should the label of the y-axis be?

hsbauauvhabzb 2 days ago

Is the raw dataset available anywhere? I really don’t like the HN search function, and grepping through the data would be handy.

Havoc 2 days ago

It’s on firebase/bigquery to avoid people doing what OP did
If you click the api link bottom of page it’ll explain.
- jasonthorsness 2 days ago
  
  I used the API! It only takes a few hours to download your own copy with the tool I used https://github.com/jasonthorsness/unlurker
  I had to CTRL-C and resume a few times when it stalled; it might be a bug in my tool
  
  xnx 2 days ago
  
  Is there any advantage to making all these requests instead of using Clickhouse o BigQuery?
  
  jasonthorsness 2 days ago
  
  Probably not :P. I made the client for another project, https://hn.unlurker.com, and then just jumped straight to using it to download the whole thing instead of searching for an already available full data set.
  
  Havoc 2 days ago
  
  My mistake - apologies. Had misunderstood what you did

byearthithatius 2 days ago

Can you scrape all of HN by just incrementing item?id (since its sequential) and using Python web requests with IP rotation (in case there is rate limiting)?

NVM this approach of going item by item would take 460 days if the average request response time is 1 second (unless heavily parallelized, for instance 500 instances _could_ do it in a day but thats 40 million requests either way so would raise alarms).

deadbabe 2 days ago

Is the 20GB JSON file available?

mike503 2 days ago

Other people have asked, probably for the same reason but I would love an offline version, packaged in zim format or something.

For when the apocalypse happens it’ll be enjoyable to read relatively high quality interactions and some of them may include useful post-apoc tidbits!

shayway 2 days ago

Hah, I've been scraping HN over the past couple weeks to do something similar! Only submissions though, not comments. It was after I went to /newest and was faced with roughly 9/10 posts being AI-related. I was curious what the actual percentage of posts on HN were about AI, and also how it compared to other things heavily hyped in the past like Web3 and crypto.

alt227 2 days ago

Here, the entire history of HN with the ability to run queries on it directly in the browser :)
https://play.clickhouse.com/play?user=play#U0VMRUNUICogRlJPT...

matsemann 2 days ago

One thing I'm curious about, but I guess not visible in any way, is random stats about my own user/usage of the site. What's my upvote/downvote ratio? Are there users I constantly upvote/downvote? Who is liking/hating my comments the most? And some I guessed could be scrapable: Which days/times are I the most active (like the github green grid thingy)? How's my activity changed over the years?

pjc50 2 days ago

I don't think you can get the individual vote interactions, and that's probably a good thing. It is irritating that the "API" won't let me get vote counts; I should go back to my Python scraper of the comments page, since that's the only way to get data on post scores.
I've probably written over 50k words on here and was wondering if I could restructure my best comments into a long meta-commentary on what does well here and what I've learned about what the audience likes and dislikes.
(HN does not like jokes, but you can get away with it if you also include an explanation)
minimaxir 2 days ago

The only vote data that is visible via any HN API is the scores on submissions.
Day/Hour activity maps for a given user are relatively trivial to do in a single query, but only public submission/comment data could be used to infer it.
- ryandrake 2 days ago
  
  Too bad! I’ve always sort of wanted to be able to query things like what were my most upvoted and downvoted comments, how often are my comments flagged, and so on.
  
  saagarjha 2 days ago
  
  I did this once by scraping the site (very slowly, to be nice). It’s not that hard since the HTML is pretty consistent.
nottorp 2 days ago

> Are there users I constantly upvote/downvote?
Hmm. Personally I never look at user names when I comment on something. It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
- vidarh 2 days ago
  
  The exception, to me, is if I'm questioning whether the comment was in good faith or not, where the trackrecord of the user on a given topic could go some way to untangle that. It happens rarely here, compared to e.g. Reddit, but sometimes it's mildly useful.
- pjc50 2 days ago
  
  I recognize twenty or so of the most frequent and/or annoying posters.
  The leaderboard https://news.ycombinator.com/leaders absolutely doesn't correlate with posting frequency. Which is probably a good thing. You can't bang out good posts non-stop on every subject.
- matsemann 2 days ago
  
  Same, which is why it would be cool to see. Perhaps there are people I both upvote and downvote?
- thaumasiotes 2 days ago
  
  > It's too easy to go from "i agree/disagree with this piece of info" to "i like/dislike this guy"...
  ...is that supposed to pose some kind of problem? The problem would be in the other direction, surely?
  
  nottorp 2 days ago
  
  Either you got the direction wrong or you'd support someone who is wrong just because you like them.
  You're wrong in both cases :)
  
  thaumasiotes 2 days ago
  
  Maybe try rereading my comment?
  
  nottorp 2 days ago
  
  You're right. But I still disagree with you. Both ways are wrong if you want to maintain a constructive discussion.
  Maybe you don't like my opinions on cogwheel shaving but you will agree with me on quantum frobnicators. But if you first come across about my comments on cogwheel shaving and note the user name, you may not even read the comments on quantum frobnicators later.
xnx 2 days ago

Some of this data is available through the API (and Clickhouse and BigQuery).
I wrote a Puppeteer script to export my own data that isn't public (upvotes, downvotes, etc.)
9rx 2 days ago

> What's my upvote/downvote ratio?
Undefined, presumably. For what reason would there be to take time out of your day to press a pointless button?
It doesn't communicate anything other than that you pressed a button. For someone participating in good faith, that doesn't add any value. But those not participating in good faith, i.e. trolls, it adds incredible value knowing that their trolling is being seen. So it is actually a net negative to the community if you did somehow accidentally press one of those buttons.
For those who seek fidget toys, there are better devices for that.
- immibis 2 days ago
  
  Actually, its most useful purpose is to hide opinions you disagree with - if 3 other people agree with you.
  Like when someone says GUIs are better than CLIs, or C++ is better than Rust, or you don't need microservices, you can just hide that inconvenient truth from the masses.
  
  9rx 2 days ago
  
  So, what you are saying is that if the masses agree that some opinion is disagreeable, they will hide it from themselves? But they already read it to know it was disagreeable, so... What are they hiding it for, exactly? So that they don't have to read it again when they revisit the same comments 10 years later? Does anyone actually go back and reread the comments from 10 years ago?
  
  jpc0 2 days ago
  
  It’s not so much rereading the comments but more a matter of it being indication to other users.
  The C++ example for instance above, you are likely to be downvoted for supporting C++ over rust and therefore most people reading through the comments (and LLMs correlating comment “karma” to how liked a comment is) will generally associate Rust > C++, which isn’t a nuanced opinion at all and IMHO is just plain wrong a decent amount if times. They are tools and have their uses.
  So generally it shows the sentiment of the group and humans and conditioned to follow the group.
  
  9rx 2 days ago
  
  An indication of what? It is impossible to know why a user pressed an arrow button. Any meaning the user may have wanted to convey remains their own private information.
  All it can fundamentally serve is to act as an impoverished man's read receipt. And why would you want to give trolls that information? Fishing to find out if anyone is reading what they're posting is their whole game. Do not feed the trolls, as they say.
  
  matsemann 2 days ago
  
  Since there are no rules on down voting, people probably use it for different things. Some to show dissent, some to down vote things they think don't belong only, etc. Which is why it would be interesting to see. Am I overusing it compared to the community? Underusing it?
- saagarjha 2 days ago
  
  If Hacker News had reactions I’d put an eye roll here.
  
  9rx 2 days ago
  
  You could have assigned 'eye roll' to one of the arrow buttons! Nobody else would have been able to infer your intent, but if you are pressing the arrow buttons it is not like you want anyone else to understand your intent anyway.

xnx 2 days ago

I have this data and a bunch of interesting analysis to share. Any suggestions on the best method to share results?

I like Tableau Public, because it allows for interactivity and exploration, but it can't handle this many rows of data.

Is there a good tool for making charts directly from Clickhouse data?

texodus 2 days ago

No Clickhouse connector for free accounts yet, but if you can drop a Parquet file on S3 you can try https://prospective.co
- xnx 2 days ago
  
  Thanks! I'll check that out. Thought it was a typo of "Perspective" for a moment: https://perspective.finos.org/
  
  texodus 2 days ago
  
  Yes! This is the pro version, we also develop open source https://github.com/finos/perspective (which Prospective is substantially built on, with some customizations such as a wasm64 runtime).

dredmorbius a day ago

I've been tempted to look into API-based HN access having scraped the front-page archive about two years ago.

One of the advantages of comments is that there's simply so much more text to work with. For the front page, there is up to 80 characters of context (often deliberately obtuse), as well as metadata (date, story position, votes, site, submitter).

I'd initially embarked on the project to find out what cities were mentioned most often on HN (in front-page titles), though it turned out to be a much more interesting project than I'd anticipated.

(I've somewhat neglected it for a while though I'll occasionally spin it up to check on questions or ideas.)

sebastianmestre 2 days ago

Can you remake the stacked graphs with the variable of interest at the bottom? Its hard to see the percentage of Rust when it's all the way at the top with a lot of noise on the lower layers

Edit: or make a non-stacked version?

jasonthorsness 2 days ago

Lots of valid criticism here of these graphs and the queries; I'll write a follow-up article.

tacker2000 2 days ago

Yea, i also get the feeling that these rust evangelists get more annoying every day ;p

febeling a day ago

You wonder what all the Rust talk was about before the programming language's release in Jan 2012.

0x008 a day ago

like other's have said it likely includes partial matches as well (e.g. 'antitrust' etc)

andrewshadura 2 days ago

Funny nobody's mentioned "correct horse battery staple" in the comments yet…

th1nhng0 2 days ago

Can I ask how you draw the chart in the post?

jasonthorsness 2 days ago

lol it was Excel (save as picture / SVG format / edit colors to support dark/light mode)
- th1nhng0 2 days ago
  
  wow, I never expect that xD thanks for let me know

pier25 2 days ago

would love to see the graph of React, Vue, Angular, and Svelte

Too a day ago

And Nextjs

hellostgeroge 2 days ago

[dead]

a3w 2 days ago

Cool project. Cool graphs.

But any GDPR requests for info and deletion in your inbox, yet?

arduanika 2 days ago

Come on, you wouldn't GDPR a whimsical toy project!