Big AI companies are scraping content made by creators like you and me and ingesting it for their own gain.
Google's a good example. It recently updated its privacy policy to say that it'll use publicly available information to train its AI models. In other words… well… everything on the internet.
Obviously, this practice brings up not only privacy concerns but copyright concerns too; in fact, a class action lawsuit against Google for copyright infringement is already underway.
I caught up with fellow indie hackers to get their thoughts.
Courtland Allen of Indie Hackers:
I’m a big fan of AI. It’s fun, it’s interesting, and it gives me superpowers as a developer and business owner.
I don’t care if AI ingests my content. It’s also a moot point. It’s 2023. We’ve seen this battle play out enough times to know that technology will ultimately win. Let’s just accept the inevitable and focus on making the best of it.
A statement from the class-action lawsuit against Google:
Google does not own the internet, it does not own our creative works, it does not own our expressions of our personhood, pictures of our families and children, or anything else simply because we share it online.
Alexander Isora of Unicorn Platform ($13K/mo):
It will be the greatest theft in the entire history of humanity. But is that good or bad?
Creativity is the process of creating something new based on the existing thing(s). If I get inspired by Rembrandt and Andy Warhol and use the neural network in my brain to create a new style, am I a thief?
What if I teach a neural network in my computer to do this?
What if I make a service so anyone can use my network to create millions of works of art in the new style?
I believe it is neither bad nor good. This is progress. It's pointless to judge it. And there is no way to prevent it. We need to get used to the new reality and find a way to adapt to it.
That means we need to at least be aware of the progress and the possibilities of GPT. This is why I advocate following AI news and joining AI discussions like this one.
Mateusz Buda of Scraping Fish ($5K/mo) and Narf AI:
I hold the view that if the content is public and all legal requirements are met, everyone should be allowed to obtain it, store it, process it, and even sell it in one form or another.
What Google does is totally legal, as it only does what website owners allow it to do. And it's pretty easy to prevent your data from being scraped, at the cost of not being freely exposed and discoverable by other people.
Oleg Kulyk of ScrapingAnt ($14K/mo):
Google's robots (web scraping spiders) are supposed to follow sitemaps and robots.txt files so that they don't index web pages that website owners have restricted. So you should be able to protect even publicly available data from scraping. But I'm not quite sure that Google follows the same mechanisms for AI training as it does for search indexing. Still, I assume that it should.
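For anyone who hasn't looked under the hood: robots.txt is just a plain-text file at a site's root that tells well-behaved crawlers which paths they're allowed to fetch. Below is a minimal sketch of a crawler that checks it before scraping, using Python's standard library; the crawler name and URLs are hypothetical placeholders, not anything Google actually runs.

```python
# Minimal robots.txt-respecting fetch (illustrative sketch only).
from urllib import robotparser
import urllib.request

USER_AGENT = "MyIndieCrawler/1.0"           # hypothetical crawler name
TARGET = "https://example.com/blog/post-1"  # hypothetical page to scrape

# Download and parse the site's robots.txt.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Only fetch the page if robots.txt allows this user agent to access it.
if rp.can_fetch(USER_AGENT, TARGET):
    req = urllib.request.Request(TARGET, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        print(f"Fetched {len(resp.read())} bytes from {TARGET}")
else:
    print(f"robots.txt disallows {TARGET} for {USER_AGENT}")
```

The catch, as Oleg notes, is that honoring robots.txt is voluntary: nothing technically stops a crawler from skipping that check, which is exactly why it's unclear whether the same rules apply to scraping for AI training.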
Courtland Allen:
I’m not a big fan of copyright. It puts unnecessary restrictions on ideas, stories, and art, which are things that are naturally better off if they spread.
I don’t buy the argument that copyright is necessary to spur creativity. Lots of people create art without any expectation of profit from licensing, because they love art. And lots of people get paid for their writing without relying on licensing, because they’re good at business.
Expecting a paycheck because ChatGPT or Bard or Claude scraped your writing seems like the worst of both worlds — wanting to do art for the sake of business, but not being good enough at business to figure out how to profit.
I don’t see why it benefits society to have laws that support that kind of behavior.
Mateusz Buda:
The landscape of web scraping isn't equitable for all players. Large corporations like Google and OpenAI enjoy an advantage compared to individual indie hackers. Legal challenges associated with web scraping disproportionately affect these smaller entities, thereby limiting their access to content. Small companies with no marketing budget also cannot afford lawyers when they face cease-and-desist letters.
Oleg Kulyk:
Other web scraping market players are constantly struggling against anti-bot systems, as they don’t have Google’s access.
Google has a big advantage over every other company that scrapes publicly available data: its indexing robots aren't banned by anti-bot systems like Cloudflare, Akamai, etc., so it can get the data it needs pretty easily.
On the flip side, every other company that extracts publicly available data can't simply follow robots.txt and run spiders through a site's links to collect what's there; they get blocked and have to work around those systems. I wouldn't say that should be considered a crime, but it's not the best way to stay on the ethical side.
Mateusz Buda:
All things considered, as long as Google doesn't use its position to access data not available to the public, I don't think it is theft. However, the future will necessitate regulations around the use of publicly available data, similar to the GDPR or CCPA that protect personal data. Ideally, authors would retain the rights to their written content, and explicit consent would be required for its usage. Hopefully, this won't result in even more cluttered websites with additional consent banners that end users have to navigate.
Importantly, tech giants like Google or OpenAI should not receive preferential treatment. All entities, regardless of size, should operate under the same rules concerning data access and usage.
So there you have it. AI, and the scraping that powers it, is problematic. But ultimately, these indie hackers are on board. What other option do they have?
We’re in a new frontier and we’re figuring it out as we go. There’s no going back, so all we can do is stay informed, stay ethical, and adapt to this new environment.
At least that’s what these founders said.
Subscribe for more founder interviews, roundtables, case studies, and tips from people who are in the thick of it. 🪤
Big thanks to @alexanderisora, @csallen, @kami4ka, and @mateuszbuda for participating and lending their expertise! I really appreciate it.
Thanks a lot for having me, James! 🙏
This is the evolution of the internet, but I do believe that, long term, because people will have less incentive to publish content freely, they will start putting up complex blocking mechanisms to prevent scraping.
A paywall sounds good, but in the long term it won't work. Google will just pay your $15/mo subscription and scrape the blogs you've spent thousands of hours working on, so it's going to become an arms race between content creators and scrapers trying to one-up each other.
There's always someone making a gun and someone else making a bulletproof vest. The same cycle will exist with AI.
I agree, a paywall is just a temporary solution.
Content creators have to offer more than just text: build a community, have a strong personal brand, create videos and podcasts, build side projects. That way you'll always be a step ahead.
Tech is progressing fast. If you want to do one thing all your life, you'd better go into farming. Working on the internet requires even more mental flexibility now (which is great for preventing Alzheimer's, though!).
Good points! But if you're right that Google would be up for paying creators to scrape their content, I'd say that solves the problem right there.
I think it ultimately comes down to this quote:
The potential scale of data scraping by big AI companies is undeniably concerning. Safeguarding user privacy and ethical data usage should be a top priority to avoid what could be deemed the "greatest theft in the entire history of humanity." Transparency and responsible practices are crucial to ensure a fair and sustainable AI ecosystem. #DataPrivacy #EthicsInAI 🛡️🌐
Well, AI is going to lead to a world with more centralised power and resources, because the AI keeps getting better and only a few people own it, so I think it's not good, tbh.
Did you get authorized written permission to use other people's content? I truly don't think so!!!!
In my opinion, expecting that Google will not use its position to access data that isn't available to the public is a big mistake.
I think we just need to get on with it. I have two views on this:
If I were to become an author, what I write would be influenced by all the books I've read in my life. AI is no different; it's just read more books (so to speak).
Technology and progress have come along and changed society so many times, from the wheel and fire to robots on car production lines. This is just the "robots on car production lines" of this generation.
Well said
cool.
AI is fun, but it's not to be misused. Some of the information it provides isn't accurate, but it makes the work easier.
This issue raises significant concerns about accountability and responsibility in the use of AI-generated content. While the potential for more accessible data is promising, the risk of monopolization by large entities is real. Indie Hackers could indeed find opportunities in navigating this landscape, but it's crucial for society to establish clear guidelines and ethical frameworks to ensure fair usage and prevent abuse of AI-generated content. Balancing innovation and responsibility will be key in shaping the future of AI for everyone's benefit.
I've been thinking about this a lot, not just in the context of building the models themselves, but also of the downstream users of these models.
If I generate some data or text using an LLM and it happens to produce something much like what's been written before, who can be held liable in such a case? I believe Mateusz's point remains true: it's the large entities that can fight lawsuits who will benefit from this the most.
Perhaps the best outcome could be that data on the web becomes more free and open for all to use and access. The worst-case scenario would be more and more paywalls that silo the web. Both scenarios would yield many opportunities for indie hackers.
Interesting! What opportunities are you seeing for indie hackers?
I think if data becomes heavily siloed, then the potential for niche sites grows, as it's more difficult for data to flow. Segmentation should create more room for smaller players.
If data becomes more free, then indie hackers should have more access to data to build products that may previously have been constrained by insufficient data or lack of access to it. This is especially true if the most powerful LLMs and similar AI models are accessible to indie hackers through APIs; it's one more tool to be leveraged.
This is such an interesting read! Honestly, thank youuu
🙏