How Does Bot and Human-Operated Fraud Work?
In Digital Advertising: Request-Based, Browser-Based, Human-Operated
According to statista.com [1], bots accounted for roughly 50% of Internet traffic in 2022. This bot traffic can be split into two parts:
Good bots: Search engines, internet archive, malware scans, etc.
Bad bots: Scraping, scalping, ad fraud, DDoS attacks, etc.
Good and bad bots account for 17.3% and 30.2% of total traffic respectively [1]. The main technical difference between good bots and bad bots is that bad bots try to blend in with human traffic by changing their technical appearance: spoofing the user agent, using residential proxies, changing their TLS fingerprint so it matches the provided user agent, preventing browser automation leaks, etc. Good bots declare themselves; see Table 1 for some examples.
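As a concrete illustration of "good bots declare themselves": a declared crawler can be verified by checking that its IP actually resolves back to its operator. The sketch below is a minimal Python example; the bot names and domains in it are illustrative, and the authoritative lists are published by the operators themselves.

```python
import socket

# Minimal sketch: verify a self-declaring "good bot". Good bots advertise
# themselves in the User-Agent and their IPs resolve back to the operator's
# domain; a bad bot can spoof the User-Agent, but its IP will fail this check.
# Bot names and domains below are illustrative only.
GOOD_BOT_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_declared_bot(user_agent: str, ip: str) -> bool:
    """Return True if the declared crawler resolves back to its operator."""
    for name, domains in GOOD_BOT_DOMAINS.items():
        if name.lower() in user_agent.lower():
            try:
                host, _, _ = socket.gethostbyaddr(ip)   # reverse DNS lookup
                forward = socket.gethostbyname(host)    # forward DNS lookup
            except (socket.herror, socket.gaierror):
                return False
            return forward == ip and host.endswith(domains)
    return False  # no good-bot declaration in the User-Agent at all

# A spoofed Googlebot User-Agent coming from a residential IP fails the check.
print(verify_declared_bot("Mozilla/5.0 (compatible; Googlebot/2.1)", "203.0.113.7"))
```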
One valid question would be: what are all these bots doing? Bots are used to monetize something. This can be scraping information or polling a website for the exact moment of a sneaker release, but also loading ads, clicking ads, etc. There are more flavors, but in general the bot world can technically be split into these two major groups:
Scraping and scalping: harvesting information, purchasing limited goods and tickets
Ad fraud, click fraud, and lead gen fraud
Interestingly, the vendors protecting businesses from these two bot classes can also be split into two groups. The vendors protecting retail websites and ticket sales actively block bots when they see them. Vendors in the advertising world don't block bots, but passively detect bot traffic.
Passive bot detection has advantages. The biggest one is that you don't provide a direct feedback loop to the bot makers: bot developers simply don't know whether their bots are flagged or not, at least not instantly. It also has disadvantages. An ecommerce site with millions of articles means that bots will scrape product info around the clock, costing a lot of bandwidth and CPU cycles, especially when bots scrape the long tail of products, as that information and those images are typically not cached. That's a valid reason to actively block bots, even though the effect is that it forces bot makers to evolve quickly.
In digital marketing, actively blocking bots would mean that detected bots cannot load advertisements, while the business model of all players in the ecosystem is based on volume. Publishers or websites showing advertisements, ad verification vendors, and all middlemen in the ecosystem make money based on volume. More ads means more impressions and ad verifications, which means more money for both the verification companies and the publishers or websites where the ads were shown! Solving the bot problem by actively blocking bots would cost them 20%, 30%, 40% and in some cases over 50% of that volume.
Technology-wise: How do bots work?
In order to run bots at scale the right technology stack needs to be chosen: the less overhead, the better. The cheapest way to run a bot is by generating and firing the HTTPS requests directly, without a browser or App. This is how scrapers were able to scrape price and seat information from airlines, and buy limited-edition sneakers and PS5s on release day, concert tickets, and sports event tickets.
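To make that concrete: a request-based bot in its simplest form is little more than a polling loop over raw HTTPS calls. Below is a minimal Python sketch of the pattern; the URL, headers and JSON field are made up.

```python
import time
import requests

HEADERS = {
    # Spoofed to look like a regular desktop browser.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def poll_stock(url: str, interval: float = 2.0) -> dict:
    """Poll a product endpoint until it reports stock, then return the payload."""
    while True:
        resp = requests.get(url, headers=HEADERS, timeout=10)
        if resp.ok:
            data = resp.json()
            if data.get("inStock"):   # hypothetical field in the shop's API
                return data
        time.sleep(interval)

# Example (hypothetical endpoint):
# poll_stock("https://shop.example.com/api/products/ps5")
```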
In response to scraping, anti-bot vendors started to block bots by simply looking at the combination of the user agent, the TLS fingerprint and some basic JavaScript challenge-response tests. If the returned payload didn't match the expected answer to the challenge, was inconsistent, contained traces of browser automation, etc., the WAF (web application firewall) would simply block access, return HTTP status 403 [3] or 429 [4], and blacklist the IP address for 10, 20 or 30 minutes.
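A strongly simplified sketch of that kind of per-request check is shown below; the fingerprint hashes are invented, and real vendors maintain far larger datasets (JA3-style TLS hashes, many browser families, rotating challenges).

```python
# Made-up JA3-style hashes a vendor might have observed per browser family.
EXPECTED_TLS = {
    "Chrome": {"cd08e31494f9531f560d64c695473da9"},
    "Firefox": {"b20b44b18b853ef29ab773e921b03422"},
}

def waf_verdict(user_agent: str, tls_fingerprint: str,
                challenge_answer: str, expected_answer: str) -> int:
    """Return the HTTP status a WAF would answer with for one request."""
    family = ("Chrome" if "Chrome" in user_agent
              else "Firefox" if "Firefox" in user_agent
              else None)
    if family is None or tls_fingerprint not in EXPECTED_TLS[family]:
        return 403  # user agent and TLS stack don't match: block (and blacklist the IP)
    if challenge_answer != expected_answer:
        return 429  # failed or missing JavaScript challenge: rate limit
    return 200      # looks consistent, let it through
```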
Checking for bot traffic at each request is expensive. That's where access tokens, a.k.a. access cookies, come in. Once a browser is approved by the anti-bot's backend it receives an access token, which expires after e.g. 10 minutes or a maximum number of requests. Each subsequent request within those 10 minutes that presents the access token will be allowed and will receive a normal response.
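In a minimal sketch, the token mechanism boils down to something like this; the TTL, request limit and in-memory storage are illustrative.

```python
import time
import secrets

TOKEN_TTL = 10 * 60    # token expires after 10 minutes...
MAX_REQUESTS = 500     # ...or after a maximum number of requests

_tokens: dict[str, dict] = {}

def issue_token() -> str:
    """Called once the client has passed the full (expensive) bot checks."""
    token = secrets.token_urlsafe(32)
    _tokens[token] = {"expires": time.time() + TOKEN_TTL, "used": 0}
    return token

def is_allowed(token: str) -> bool:
    """Cheap per-request check: no bot detection, just token bookkeeping."""
    entry = _tokens.get(token)
    if entry is None or time.time() > entry["expires"]:
        return False
    entry["used"] += 1
    return entry["used"] <= MAX_REQUESTS
```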
This is exploited by bot makers. They obtain an access token using a fully fledged browser and, once a valid token is obtained, the bot switches to pure HTTP requests and continues until the token expires. The reason is that browsers are slow and cost a lot of CPU and memory, which prevents scaling on a single node. The anti-bot answer, of course, is to validate whether the client requests all non-essential components of a webpage, whether the client's traversal path through the website looks like a normal human path, etc.
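A sketch of that harvesting pattern, assuming a headless browser driven by Playwright; the URL, cookie handling and retry logic are illustrative only.

```python
import requests
from playwright.sync_api import sync_playwright

def harvest_cookies(url: str) -> dict:
    """Pass the anti-bot check once with a real browser and copy the cookies."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")   # let the JS challenge run
        cookies = {c["name"]: c["value"] for c in page.context.cookies()}
        browser.close()
    return cookies

def run_cheap_requests(url: str, paths: list[str]) -> None:
    """Continue with raw HTTP requests until the access token stops working."""
    session = requests.Session()
    session.cookies.update(harvest_cookies(url))   # includes the access token
    for path in paths:
        resp = session.get(url + path, timeout=10)
        if resp.status_code in (403, 429):         # token expired or revoked
            session.cookies.update(harvest_cookies(url))
```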
The stakes are high, and that's why services offering to bypass anti-bot vendors are popular and very profitable. These services are often offered as an API: you send your request to the API, which forwards it like a proxy server to the target website or App backend. The API controls a browser, or software fully emulating the detection JavaScript, network packets and TLS fingerprint, in order to return a perfect and consistent payload to the anti-bot vendor, which responds with an access token allowing free passage for the next several requests.
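From the bot maker's side, consuming such a service typically looks something like the sketch below; the endpoint, parameters and response fields are entirely hypothetical, and every provider has its own variant of this interface.

```python
import requests

SOLVER_API = "https://api.solver.example/v1/request"   # hypothetical endpoint
API_KEY = "YOUR-API-KEY"                                # placeholder

def fetch_via_solver(target_url: str) -> str:
    """Ask the bypass service to fetch the target page with a valid access token."""
    resp = requests.post(SOLVER_API, json={
        "apiKey": API_KEY,
        "url": target_url,
        "render": False,       # the service handles TLS/JS emulation itself
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["body"]  # hypothetical field containing the page HTML
```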
Knowing that the most difficult part of scalping and scraping is maintained by "the 10 best reverse engineers in the world" (see Figure 1) will give you peace of mind when you are trying to set up an organization to resell #taylorswift tickets, #PS5 consoles or limited #nike sneakers.
Luckily there are many more of these APIs available, so there’s always a backup.
So, what do Ticketmaster, Nike, Sony, but also poker and gambling sites, do about this? Because they surely know that these anti-anti-bot services exist and only keep amateur scrapers and hobbyist scalpers away. Their answer is to stack vendors. Some have multiple anti-bot vendors protecting their sites and the backend APIs their Apps communicate with, and the bad guys need to bypass each of the anti-bots. From the brand's perspective the reasoning probably is: the majority vote counts. If 2 out of 3 anti-bot vendors say it's human, it probably is. Bots have evolved quickly because of the direct feedback loop, i.e. being blocked when detected. That feedback enabled bot makers to iterate fast, and now the best ones are winning and monetizing their hard work.
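The majority vote itself is trivial to express; a minimal sketch with made-up vendor verdicts:

```python
def majority_is_human(verdicts: dict[str, bool]) -> bool:
    """verdicts maps vendor name -> True if that vendor says 'human'."""
    human_votes = sum(verdicts.values())
    return human_votes * 2 > len(verdicts)

# 2 out of 3 stacked vendors say human, so the request is treated as human.
print(majority_is_human({"vendorA": True, "vendorB": True, "vendorC": False}))
```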
The number of checks which can be done on the request itself, the network stack, TLS, the browser object models, audio and WebGL/canvas fingerprinting, etc. is limited [5], while browsers become more restricted and lose entropy due to privacy and anti-tracking measures. The answer is that anti-bot vendors randomly change the challenges over time; again, that doesn't seem to affect the services of these API providers.
Anti-bot vendors have tried to make it more difficult by implementing virtual machines in JavaScript. This means the detection logic is not shipped as plain JavaScript, but as bytecode that has to be fetched, decoded and executed by a JavaScript interpreter. Over time the opcodes, encryption methods and keys, etc. change, making static responses useless. Again, this only raises the bar and weeds out amateurs while the professionals laugh about it. And if it becomes too hard to bypass and the profits are high enough, low-wage workers and farms will be used to buy sneakers and tickets manually. The same farms are used to solve new types of CAPTCHAs which cannot be solved by software yet, e.g. sliding puzzles.
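To illustrate the virtual-machine idea (not any vendor's actual implementation, which would be obfuscated JavaScript): a toy bytecode interpreter in Python whose opcode table can be reshuffled at will, which is exactly what makes static, pre-computed responses useless.

```python
import operator

# The opcode numbering is what gets reshuffled over time; values are made up.
OPCODES = {
    0x01: ("PUSH", None),
    0x02: ("ADD", operator.add),
    0x03: ("XOR", operator.xor),
}

def run(bytecode: list[int]) -> int:
    """Execute a tiny stack-based program and return the top of the stack."""
    stack, pc = [], 0
    while pc < len(bytecode):
        name, fn = OPCODES[bytecode[pc]]
        if name == "PUSH":
            pc += 1
            stack.append(bytecode[pc])   # next cell is the literal operand
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(fn(a, b))       # binary operation on the stack
        pc += 1
    return stack.pop()

# (PUSH 7) (PUSH 5) ADD (PUSH 3) XOR  ->  (7 + 5) ^ 3 = 15
print(run([0x01, 7, 0x01, 5, 0x02, 0x01, 3, 0x03]))
```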
Anti-bot vendors vs. ad fraud detection
Why does this 'anti-bot vendor versus scrapers and scalpers' fight look so different from the fight in the digital advertising ecosystem? It can't be explained by the amount of dollars involved. The answer lies in what clients expect their anti-bot vendor to protect: their assets or their spend. When #Sony released the #PS5, a large portion of the consoles were flipped at a 200% markup, and that isn't good for the brand's reputation. The same with Taylor Swift tickets: a lot of unhappy fans having to pay more than double the original ticket price.
In digital advertising a direct feedback loop doesn't exist. If an advertisement has 1,000,000 impressions, it will get between 2,000 and 5,000 clicks at a click-through rate of 0.2% to 0.5%. The number of conversions to leads or sales is in turn between 2% and 8% of the clicks. This means that 1 million impressions convert to between 40 (2% of 2,000, the low estimate) and 400 (8% of 5,000, the high estimate) leads or sales. But if half of the impressions (500,000) were shown to bots and the other half (500,000) converted to 100 leads or sales, the campaign would still be considered successful, even though it could have produced 200 leads or sales.
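Worked out explicitly, the funnel from the paragraph above looks like this:

```python
# The funnel numbers from the text, computed step by step.
impressions = 1_000_000
ctr_low, ctr_high = 0.002, 0.005     # 0.2% - 0.5% click-through rate
conv_low, conv_high = 0.02, 0.08     # 2% - 8% of clicks convert

clicks_low, clicks_high = impressions * ctr_low, impressions * ctr_high
print(clicks_low, clicks_high)                           # 2000.0 5000.0 clicks
print(clicks_low * conv_low, clicks_high * conv_high)    # 40.0 400.0 leads/sales

# If half of the impressions go to bots, only 500,000 can ever convert:
# the same rates then yield half the leads the campaign could have produced.
```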
The difference is that people will start complaining on social media when tickets, PlayStations or sneakers are sold out instantly and only available at a 200%–300% markup, while the people working in digital marketing don't even realize that they could have doubled the business outcome.
Technically, the differences between ad fraud detection vendors in the digital marketing ecosystem and anti-bot vendors protecting brands against scraping and/or scalping are enormous. The anti-bot vendors protecting websites from scrapers and scalpers do block access to bots and are thus required to evolve quicker and find creative ways of detecting bots in order to keep up with the bot makers. In digital marketing, vendors keep using the same detection for years, while on the offensive side the bot creators use the experience from the scraping and scalping world and thus evolve continuously. Ad fraud detection vendors don't have a feedback loop telling them what works and what doesn't.
So, how do ad fraud detection vendors without a feedback loop know if their detection works? They don't, and if they detect high rates of fraud it is most likely based on assumptions. That's why most ad fraud detections in digital marketing are laughable: "These aren't the bots you're looking for. Move along, move along".
In lead generation our clients will contact the prospects and will know whether Oxford BioChronometrics' SecureLead did catch the fraud or not. If our detection were based on assumptions, it would cost them business due to false positives [6], and by not catching the real fraud it would increase litigation risk, simply because following up on generated leads, and thus calling people without their consent, is risky. Some of these callees will start legal action, which has to be settled and will cost a ton of money.
Now what?
If you want to be sure your ad fraud vendor detects fraud and fraud only, you should be asking the right questions. Questions like:
On what criteria does your detection flag bots and/or fraud?
Do you have a feedback loop to know what works?
If not, how do you know your detection works accurately?
The past few years have shown many examples that you simply cannot trust any vendor in the digital marketing ecosystem, e.g. MFA sites, MFA on publishers' subdomains, long-tail websites sold as premium, audience networks, fake Apps loading ads in the background, free gaming apps where the gamer has to click on an ad to continue; should I go on? An ad fraud detection vendor showing a dashboard with decreasing numbers is like the butcher certifying his own meat. Quarterly reports with low bot percentages smell like willful ignorance.
What value do such charts or claims really have? Not much without the option to see why individual bots and/or fraud were flagged (the discrete decision). When a decreased ad fraud percentage can't directly be tied to improved business outcomes, like more sales and higher quality of generated leads, you still don't know whether the fraud detection fails to flag fraud or the quality of traffic improved.
I can assure you that browser automation and bot detection at scale is discrete, and thus a yes/no decision. Human-operated fraud detection has more shades, as it is based on interactional behavior and looking at flow. But luckily, percentage-wise, human-operated fraud is relatively small simply because it is expensive to scale.
Questions? Feel free to comment, connect or DM
#adfraud #leadgeneration #CMO #botdetection
[3] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
[4] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429
[5] https://www.linkedin.com/posts/kouwenhovensander_frauddetection-fingerprinting-activity-7049009901523656705-xrkX
[6] https://www.linkedin.com/posts/kouwenhovensander_adfraud-b2c-digitalmarketing-activity-7125112905296957441-O4Wp