• CompostMaterial@lemmy.world
    English
    arrow-up
    29
    arrow-down
    5
    ·
    3 days ago

    I’d like to play devil’s advocate for a sec and ask this question: how is a company scraping information from publicly available sources to train AI models any different from companies scraping that same publicly available data and indexing it for search?

    While the search model is helpful to us all, Google isn’t doing it out of the kindness of their hearts. They have a whole business model based on selling advertising using the information they have freely indexed. Yet very few people complain about search indexers crawling their data the way they do about AI bots.

    Again, just playing devil’s advocate for the sake of curiosity.

    • baggachipz@sh.itjust.works
      English
      arrow-up
      52
      arrow-down
      1
      ·
      3 days ago

      This is all true, with one key difference: search results (used to) point you to the actual source. LLMs answer you with that information as if they thought of it, with no attribution. So at least search results have a benefit for the source of indexed content.

      • CompostMaterial@lemmy.world
        English
        arrow-up
        6
        ·
        3 days ago

        I don’t know about all AI products, but I use the Copilot sidebar built into Edge for work and school questions, and it always provides citations to the source information. In fact, if I ask a question for school and tell it in the prompt to cite all sources in APA format, it gives me everything I need in the proper format.

        • ℍ𝕂-𝟞𝟝@sopuli.xyz
          English
          arrow-up
          8
          ·
          3 days ago

          Yeah, it’s useful, but double check your sources and never hand in anything, even the citations, by just copying and pasting it without scrutiny. It can make up all kinds of bullshit, pretend cited works say something when they don’t, etc.

          You don’t want it to hallucinate you in front of an academic ethics committee. Again, I’m not against using it, but never base anything on stuff it says, only on primary sources it helped you find.

          • CompostMaterial@lemmy.world
            English
            arrow-up
            6
            ·
            3 days ago

            Fully agree. Honestly, it’s why I like the Copilot branding Microsoft used. It is a Copilot, not the Captain. You still need to be in control and verify and scrutinize.

        • morrowind@lemmy.ml
          English
          arrow-up
          1
          arrow-down
          3
          ·
          3 days ago

          That’s not the same. In that case Copilot is also doing a search. They’re talking about the model itself.

    • Dragonstaff@leminal.space
      English
      arrow-up
      41
      arrow-down
      2
      ·
      3 days ago

      the search model is helpful to us all

      You answered your own question. The search engine indexes your page to send traffic to you. The AI bot indexes your page to plagiarize your content.

      Anecdotally, AI bots also routinely ignore sites’ robots.txt and spoof their user agents to try to hide what they’re doing. A lot of site owners are complaining about the costs of delivering content to web scrapers. Where search indexers might hit a site every day, some AI bots are running every hour and just wasting the site’s bandwidth.
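
For anyone unfamiliar with how robots.txt is supposed to work, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page. It uses Python’s standard `urllib.robotparser`. The robots.txt contents, the bot names `ExampleAIBot` and `SearchBot`, and the helper `may_fetch` are all made up for illustration and do not describe any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only. A real crawler would
# download it from the site's /robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/cool-stuff"
    # The site has opted out of the (hypothetical) AI bot entirely.
    print("ExampleAIBot:", may_fetch("ExampleAIBot", page))
    # Other crawlers are only kept out of /private/.
    print("SearchBot:", may_fetch("SearchBot", page))
```

Note that the check depends entirely on the user-agent string the crawler reports about itself. Nothing is enforced on the server side, so a bot that skips the check or announces itself under another name gets the page anyway, which is the behavior being complained about here.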

    • CosmicTurtle0@lemmy.dbzer0.com
      English
      arrow-up
      20
      ·
      3 days ago

      You likely consented to search crawlers. You didn’t consent to having your site slammed by AI bots to regurgitate your site either privately or publicly.

      • CompostMaterial@lemmy.world
        English
        arrow-up
        7
        arrow-down
        2
        ·
        3 days ago

        If memory serves me correctly, nobody consented to the search indexes originally either. It took time for those guard rails to be put in place and respected. I would imagine that this new tech will undergo the same growing pains as guard rails get implemented.

        • ℍ𝕂-𝟞𝟝@sopuli.xyz
          English
          arrow-up
          17
          ·
          3 days ago

          Yeah but the difference is that search engines act in synergy while AI models usually extract value from the site. One is getting your woodworking shop in the phonebook without consent, the other is taking your lathe out the door.

    • jjjalljs@ttrpg.network
      English
      arrow-up
      2
      ·
      2 days ago

      There are credible allegations that the AI companies are not merely scraping publicly available resources, but are also consuming content in violation of the terms of use / copyright law. Like, a site has a robots.txt file that says “no scrapers” and they scrape it anyway. People would be mad about traditional search doing that as well.

      Secondly, if a search service scrapes your site and then directs relevant users to it, that’s probably fine. Most websites want users to visit. A lot of AI stuff sucks up the content, and then the creators of that content get nothing. No users are sent there. The scraper hitting the site takes resources, and gives nothing back.

      Google has also gotten some flak for putting stuff on their own site instead of sending users to the source. Like you do a search and get a snippet on the Google page, and you never click through to example.com/cool-stuff. Well, now the owner of example.com/cool-stuff doesn’t get the click. If they run ads, they get no credit. If they have metrics, they probably don’t see any visitors. If they have, like, forums, people are less likely to engage.

      If the “AI Search” includes links back to the source, that’s not perfect either. One, it’s kind of excessive to use an LLM to parse text when the origin site is already there and readable. If I search for “population of london”, you can just send me to a census website or even Wikipedia. You don’t need to use a whole ass LLM. Two, as I touched on in the previous paragraph, users are less likely to click through if Google is putting the core of the information right there (even if it’s not always accurate). It’s still lessening traffic to the origin site, and traffic is often the lifeblood of websites.

      Lastly, a lot of AI stuff is simply inaccurate or misleading. We’ve all laughed at the “use glue on your pizza” stuff or the “there are two Rs in ‘strawberry’” fuckups. If traditional search was really bad, like you type in “cat food” and you get a webpage that is all jewelry and “buy gold” scams, you’d be annoyed, too. That’s more like how search was before old Google came about. There were a lot more low effort “SEO” hacks like putting a bunch of keywords in tiny print to fool the search indexer. Now Google is the shitty old guard, but they have too much money and power to be easily replaced.

      That’s just off the top of my head. Scraping for AI isn’t the same as scraping to make a searchable index.

    • not_IO@lemmy.blahaj.zone (OP)
      English
      arrow-up
      2
      arrow-down
      2
      ·
      3 days ago

      I find people who “play devil’s advocate” just unnecessarily exhausting to the cause. If you have an opposing opinion, just say it; if not, then don’t. This is real life and not debate club.

      • CompostMaterial@lemmy.world
        English
        arrow-up
        5
        ·
        2 days ago

        Well, I was in a debate club so I suppose that is where it comes from.

        Also, saying I have an opposing opinion is fine, if that is my actual stance. In this case it’s more that I can see both sides of the argument and would like to have a rounded discussion rather than a Reddit echo chamber.

      • Bruz@aussie.zone
        English
        arrow-up
        4
        ·
        edit-2
        2 days ago

        I have to disagree. I often form opinions gradually over time as I learn about an issue, and playing devil’s advocate can help that process. If fewer people planted themselves in certain yay or nay camps, our conversations would be far more honest and productive. Devil’s advocate arguments can sometimes be like thought experiments that help us learn about and understand an issue.