State of the Bots: FIPP and TollBit put the spotlight on AI scraping

There are few issues more pressing for media companies than copyright infringement by AI bots. With research showing website scraping has increased at a rapid rate, publishers are scrambling to find out exactly who is extracting their data, what can be done to block them more effectively and whether there is an opportunity to monetise content.

Turning a spotlight on scraping, FIPP recently sat down for a webinar with AI monitoring and monetisation company TollBit to discuss its Q2 State of the Bots report. The study looks at AI scraping trends, offering an evolving view into how AI companies access content and revealing emerging patterns, behaviours and shifts each quarter.

Joining FIPP CEO Alastair Lewis were Josh Stone, VP of Business Development at TollBit, and Cody McCauley, Partnerships Manager at TollBit – a rapidly growing company that has signed up over 2,000 publisher sites, analysed 65 billion website visits and detected almost half a billion AI bot scrapes.

Also sharing invaluable insights during the webinar were Elsa Esparbé Gener and Álex Poderoso of Spanish publisher RBA, who have firsthand experience with the TollBit platform.

“(Copyright infringement by AI bots) is clearly really important for us as publishers and media owners right now to understand what is happening with our content, who is scraping, who is viewing, who is using and redistributing our content, either with our permission or, more critically, without our permission, and if they are so doing, then what can we do about it?” said Lewis.

“How can we address that and ultimately, and most importantly from most of our perspectives, how can we set ourselves up to protect that copyright and that IP in the future? And is there an opportunity to monetise that?”

[FIPP members] See the webinar here. Not a FIPP member yet? You can join via the webinar page.

An explosion in scraping

One of the most eye-catching findings from the Q2 State of the Bots report is a huge surge in scraping activity, with AI bot traffic nearly doubling in Q1 of this year. Most notably, there has been a significant shift towards real-time retrieval bots – the crawlers behind retrieval-augmented generation (RAG).

“RAG bots are now scraping more than training crawlers, and the RAG scrapes that we’ve seen per site grew 49% quarter-over-quarter – about two and a half times the growth rate of training bots,” revealed McCauley.

“The market has evolved, and training is not the primary use case for many of these AI developers. Retrieval is really the name of the game. And as we’re seeing exponential growth here, that’s only likely to increase moving forward as well.”

Another trend picked up by TollBit is bots bypassing protections. “Many publishers today have been relying on their robots.txt files to try and prevent these developers from scraping and accessing their content – but that’s really not working anymore,” added McCauley. “Many of these AI crawlers are ignoring these rules, bypassing them and continuing to access and scrape content.”
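For reference, this is roughly what that reliance looks like in practice: a minimal robots.txt sketch that asks known AI crawlers to stay out of a site entirely. The user-agent tokens shown are ones the respective AI companies have published, but as McCauley notes, compliance is entirely voluntary – the file is a request, not an enforcement mechanism.

```
# Ask known AI crawlers not to access any part of the site.
# This is advisory only; a crawler can simply ignore it.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```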

According to TollBit, the share of AI bot scrapes that bypass robots.txt files jumped from 3.3% in Q4 2024 to about 13% by the end of Q1 2025. In March alone there were over 26 million disallowed scrapes – sites that had robots.txt files in place asking developers not to extract content, but were scraped anyway.

“With the rise of retrieval, we are seeing this happen more frequently, especially for some of the search and user-based agents that some of these AI developers are using,” said McCauley. “Some of them have even updated their terms to state that when they are going out and retrieving content on behalf of a user’s query, they are not going to listen to robots.txt – they are going to bypass it.”
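To make the idea of a “disallowed scrape” concrete, here is a minimal sketch (not TollBit’s methodology) of how a publisher might tally requests from disallowed crawlers in its own server logs. The log format, the bot list and the file name are assumptions for illustration, and user-agent strings can of course be spoofed – which is exactly why this kind of self-reporting undercounts the problem.

```python
# A minimal sketch (not TollBit's methodology): parse a combined-format
# access log and count requests whose user agent matches a crawler that
# this site's robots.txt disallows. Bot list and log path are assumptions.
import re
from collections import Counter

# Crawlers this hypothetical site disallows in its robots.txt
DISALLOWED = {"GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"}

# Combined log format ends with: "request" status size "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$')

def count_disallowed_scrapes(log_path: str) -> Counter:
    """Tally requests per disallowed bot found in the user-agent string."""
    hits = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.search(line.strip())
            if not match:
                continue
            ua = match.group("ua")
            for bot in DISALLOWED:
                if bot in ua:
                    hits[bot] += 1
    return hits

if __name__ == "__main__":
    for bot, n in count_disallowed_scrapes("access.log").most_common():
        print(f"{bot}: {n} disallowed scrapes")
```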



FIPP WORLD MEDIA CONGRESS

Madrid, Spain, 21-23 October 2025

Balancing AI and editorial integrity will be one of the key topics at this year’s Congress. Join the conversation and shape the future of media.

>> BOOK YOUR PLACE NOW


A one-way street

One thing that has not changed is that referral traffic back from scraping activity remains incredibly low.

“We are seeing publishers getting scraped millions of times a month and getting real human audiences back in the hundreds, or just the low thousands, in exchange for that scraping activity,” said Stone.

“We’re really not seeing the referral traffic value driver come back to publishers in exchange for this scraping activity – which is something, in the early days, the AI companies were talking about a lot, pushing the narrative that, like Google, they would send traffic back. I think there’s broad acknowledgment, from their side as well as in the data we’re seeing, that that’s just not the case.

“These answer engines are designed to answer the question and give the information natively within that experience and the traffic back is about 96% lower than what you would expect if that person asked the same question in a traditional Google search.”

Faced with a huge spike in scraping activity, an increasing number of publishers are taking more aggressive action. Adoption of TollBit’s bot paywall – which blocks bots from scraping content unless they pay for it, and also presents a path to payment – has increased dramatically, by around 730% from Q4 to Q1. TollBit sent almost 100 million bots to its paywall, up from about 11 million in Q4.
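Conceptually, a bot paywall turns a block into a payment prompt. The sketch below illustrates the idea with a tiny WSGI app: identified AI bots receive an HTTP 402 “Payment Required” response pointing at licensing terms, while human readers get the page. This is illustrative only – TollBit’s actual product handles detection, pricing and settlement for publishers, and the bot list and licensing URL here are hypothetical.

```python
# A minimal sketch of the bot-paywall idea, NOT TollBit's implementation.
# Known AI bots get HTTP 402 with a pointer to licensing terms instead of
# the article. Bot list and LICENSE_URL are invented for illustration.
from wsgiref.simple_server import make_server

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")  # assumed list
LICENSE_URL = "https://example-publisher.com/ai-licensing"   # hypothetical

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if any(bot in ua for bot in AI_BOTS):
        # Block the scrape, but present a path to paid access
        start_response("402 Payment Required",
                       [("Content-Type", "text/plain"),
                        ("Link", f'<{LICENSE_URL}>; rel="payment"')])
        return [b"Automated access to this content requires a license.\n"]
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Article content for human readers.</body></html>"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```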

“We think getting better control is the best way to move forward and hopefully push towards an ecosystem where again there is a fair exchange of value for access to your content and creative works,” said Stone.

The paywall is one of three core products TollBit offers publishers. The company also provides analytics tools to determine who is scraping content and what is being scraped the most, and monetisation tools that allow publishers to establish the terms under which their content can be accessed and the rates for access.
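What “establishing terms and rates” might look like in practice is sketched below: a per-crawl rate card that distinguishes training from real-time retrieval, echoing the shift the report describes. The structure, names and prices are entirely invented for illustration and do not represent TollBit’s product.

```python
# A hypothetical rate card a monetisation tool might let a publisher
# define: per-crawl prices differentiated by use case and site section.
# All names and prices are invented for illustration.
RATE_CARD = {
    "default": {"training": 0.001, "retrieval": 0.01},    # USD per crawl
    "/premium/": {"training": 0.005, "retrieval": 0.05},  # higher-value section
}

def price_for(path: str, use_case: str) -> float:
    """Return the per-crawl rate for a path, falling back to the site default."""
    for prefix, rates in RATE_CARD.items():
        if prefix != "default" and path.startswith(prefix):
            return rates[use_case]
    return RATE_CARD["default"][use_case]
```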

“As AI applications and services proliferate, the fundamental economics that have powered the open web for years are at risk,” added Stone. “We need a fundamental new piece of infrastructure, a new protocol, so that we can power a fair value exchange.”

Views from the frontline

Spanish media giant RBA, which publishes 24 magazines across a broad range of topics, approached TollBit first and foremost for data on who was scraping its content.

“We knew people were entering our house,” said Elsa Esparbé Gener, Head of New Business Development for the group, “but we didn’t know how many times and who was coming in.”

According to RBA Digital Director Álex Poderoso, the publisher became aware of RAG scraping thanks to TollBit. “We found out how the real-time retrieval market was growing, so we needed to somehow shift the actions that we were taking on AI.

“It was important for us to have a tool that gave us analytics – not just who is coming in and how many times, but which content was more interesting for them. We didn’t have the full picture on how many different AI drivers were in the market. We had OpenAI, Perplexity, Google and Microsoft on the radar but there were many others we didn’t know about.

“So, the first conversation with our bosses was that maybe archive content for model training is not what these technology companies are looking for. The second thing was that we didn’t know exactly what they were looking for. We thought they were interested in respected brands like National Geographic, but that had to be confirmed with data.”

RBA decided to integrate TollBit software across all 17 of its sites, starting with National Geographic. “The main finding was that we not only have many different third parties coming to our websites, but they have different bots – which gives us information on what these platforms are doing with our content,” said Poderoso.

“We saw that the interest in real-time content was very high, and it was going up. We also discovered that they are not just interested in brands – they are coming to a website more or less in proportion to the traffic that the website has. Maybe we have brands that are not so well known, but they have lots of traffic, so there are lots of users looking for information related to that website. So, the bots are coming a lot too.”

As expected, RBA discovered that the scraping did not lead to many more users: for every 200,000 requests from an OpenAI bot, the publisher gets only 200 users back – a conversion rate of just 0.1%. According to Poderoso, RBA could not get information about who was visiting its sites from Google or Microsoft, as those tech giants do not distinguish between bots and users doing a search.

Given the data it received, RBA decided to start blocking bots and set up ways to monetise its content. “The first thing that we need to do with this data is understand which content is interesting for those platforms,” added Poderoso. “Understanding how all your content performs in terms of AI demand is very important for us, because the first step after blocking is to set up pricing.”

Gener stressed the importance of media companies sharing their experiences when it comes to scraping and valuing their content.

“We have to give value to our IP and start closing doors, because our websites were open,” she said. “As soon as you have the data, you must protect access to your websites and put a price on that content. Hopefully we will start closing deals now with all these companies, and hopefully we will be changing the licensing model.”

[Related] FIPP x PPA webinar: AI update

A special small brew

While it is still early days when it comes to monetisation, the increase in scraping activity also presents opportunities for niche publishers, according to Josh Stone.

“One of the dynamics we find interesting in this potential future is that the value ascribed to the content is actually going to be much more aligned in theory with the quality and differentiation of the content itself,” he said.

“If you think about what has driven value in the last 10 years, it’s really an ability to arbitrage Google or social to drive traffic and then serve ads next to the content. The content obviously has something to do with that, but it’s not the primary driver necessarily. If you’re really good at arbitraging SEO or social, you can get traffic almost regardless of what the article itself says.

“If you think about how these AI companies are pulling in and valuing content, they have specific information they need, specific holes that they need to fill, and so they’ll go out looking for that information. And we don’t know exactly how they will evaluate or assess quality, but you can imagine a world where highly differentiated specific content in the niche can actually drive a really effective premium precisely because it’s differentiated.

“We had a conversation with one of the large AI companies and they were saying essentially that a big publisher who has a list of the top 10 coffee makers is interesting to us, but that actually may not be as interesting as some guy who has a small blog about coffee makers that he’s been writing for 15 years. And while that may not be a massive traffic site, he may be the authority on coffee makers and so that content is actually really interesting to them.”

A pertinent question in the scraping debate is what incentive there is for AI companies to work with smaller publishers without legislation in place to enforce it. Stone described legislation as “critical”.

“A lot of the legislative focus to date has been around copyright law. We also think there needs to be a conversation around legislation of access rules. One of the pervasive issues in this space is bots that pretend to be real people, which makes it much harder to control access. We actually think some legislation around that could be even more impactful than legislation around copyright.

“We also need to make progress on making authorised compensated access faster and easier. What needs to happen in the ecosystem is we need to see the transaction costs associated with unauthorised access rise – legislation, more aggressive measures on blocking and legal risk can help with that.

“But we also need to lower the transaction costs associated with authorised access. How do we make it faster, better, easier to just pay for the content than to try and steal it in another way?”

Until all the rules are clear, Gener advised publishers to take stock of their content. “We need to start organising our archives very well,” she said. “A company like RBA has a lot of content to scan and organise after 30 years. As soon as the rules are clear for all of us, then we must run. The ones that have done the homework will be the first to close good deals.”

This and many other topics relevant to the current and future state of the media industry will be explored further at the FIPP World Media Congress, taking place in Madrid, Spain, from 21-23 October 2025.

The FIPP Congress will bring together media professionals from across the globe for three days of insightful discussions, keynote presentations, workshops, and unparalleled networking opportunities. Whether you’re a seasoned industry leader or a rising innovator, the FIPP Congress promises to be an unforgettable gathering that will shape the future of media. Book now with the Summer Special Rate.
