Imagine an Internet without proxy servers. All traffic would arrive from the IP address of the machine that generated it. When bots in a data center load advertisements, click on them, watch videos, or listen to podcasts, you would be able to spot them simply by looking at the usage per IP address. Normal humans don't load thousands of ads per minute or watch hundreds of video streams concurrently. Without proxies, the excessive usage caused by automation would be plainly visible.
Proxy servers enable bots, and other types of fraud, to split their traffic over many different IP addresses. For example, if you run multiple browsers, each browser can connect to a different proxy. At the receiving end you’ll see incoming requests from many different IP addresses.
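To make the effect concrete, here is a minimal sketch of per-IP counting. The IP addresses, the one-minute log, and the threshold of 60 requests per minute are all made-up assumptions for illustration, not measured values.

```python
from collections import Counter

def flag_heavy_ips(request_ips, max_per_minute):
    """Return the IPs whose request count in a one-minute log exceeds the limit."""
    counts = Counter(request_ips)
    return sorted(ip for ip, n in counts.items() if n > max_per_minute)

# Hypothetical one-minute log: one human visitor plus a bot loading 1000 ads.
human = ["198.51.100.7"] * 5
bot_direct = ["203.0.113.10"] * 1000

# The same bot and the same 1000 requests, spread over 500 hypothetical proxy IPs.
proxy_pool = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(500)]
bot_proxied = [proxy_pool[i % len(proxy_pool)] for i in range(1000)]

LIMIT = 60  # assumed threshold, requests per minute per IP

flagged_direct = flag_heavy_ips(human + bot_direct, LIMIT)
flagged_proxied = flag_heavy_ips(human + bot_proxied, LIMIT)

print(flagged_direct)   # the bot's single IP stands out
print(flagged_proxied)  # nothing stands out: each proxy IP only made 2 requests
```

The total number of requests is identical in both cases; only the per-IP view changes, and that is exactly what the fraudster is paying for.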
Legitimate usage?
Proxy servers do have legitimate uses. For example, if you run different digital campaigns per state, you may want to verify that the correct ads are served in each state. A second example is verifying that advertisements don't serve malware to residential IP addresses only. A third example is verifying that an advertisement really advertises what is being sold on the landing page after the click.
The fraudster technique behind the last two examples is 'cloaking': only residential IP addresses with specific user agents arrive at the website offering gambling, counterfeit products, illegal goods, etc. Visitors from data centers, e.g. automated scans by Google, Microsoft, etc., are sent to a different website, so the automated scan concludes that the advertisement is legitimate or that no malware is served.
It becomes a gray area when scrapers, crawlers, and bots using proxies harvest your content in order to train AI models, collect price information, etc. You could disallow these bots by adding their user agents to your robots.txt, but will they adhere to these rules? Do they even declare themselves as bots? And if not, what will you do?
The next valid question is: how do these legitimate cases explain the vast number of companies offering zillions of proxies? And those are not data center or ISP proxies, but residential proxies in your neighborhood.
So, where do these residential proxies come from?
A simple Google search for “high quality proxies proxy servers” shows 19 companies offering proxy services, and there are many more. A screenshot of the results can be seen in Figure 1. Detailed information on the first 10 companies is shown in Table 1. The table contains the company name, the URL with details of the residential proxies, and the number of proxies worldwide, in the US, in the UK, and in Germany. This gives a good idea of how big this ecosystem is: the combined total of these 10 companies worldwide is over half a billion proxy servers! Let that sink in!
Table 1 shows the number of IP addresses and the bulk of these are residential IP addresses. An important question would be: How are these residential addresses obtained? The companies themselves claim they are obtained... ethically.
So, how do these proxy companies source their IP addresses and obtain bandwidth from regular people? Judging by the almost infinite numbers, it must be fairly easy.
Mobile App creators integrate proxy SDKs in (free) Apps. This enables the developer to earn some extra money [1] [2].
Windows or Mac application developers integrate a proxy SDK in their software. This enables the developer to earn some extra money [1] [2].
People directly sublet their home broadband to companies that use the bandwidth to proxy their clients’ web traffic [3] [4].
(Free) VPN apps. You’re using a free app and think you are safe. The catch is that the VPN tunnel works in both directions. Your internet connection is used by someone else who is paying to access the Internet anonymously.
Malware. Legitimate companies will typically not use malware, but if they obtain their proxies from other companies without doing proper checks, this might be the case.
IP range hijacking. Ancient IPv4 ranges that were assigned 20+ years ago, have been forgotten by their administrators, and have no ARIN membership [5] or protections like RPKI or ROA [6] are vulnerable to IP range hijacking [7]. Just like malware, it's not legal, but IP range hijacking does happen, and if successful it generates a lot of money.
You can decide for yourself how ethical each method is. The IP range hijacking and malware proxies are typically used in real criminal activities (hacking, data extraction from companies or governments), as criminals can be sure that these proxies neither log anything nor leave a trace.
Breaking down the money made from bandwidth
If you sublet your own home broadband you'll get compensated for the bandwidth. But how much do you get compared to the money these companies make?
Figure 2 shows that by sharing your internet connection you'll earn $0.20/GByte, and Figure 3 shows that as a proxy user you pay from $4.55/GByte when buying 100 GByte of traffic (residential proxies). That’s a nice margin! The cash-for-bandwidth earnings are capped at $140/month, but the proxy subscribers have to pay for each and every GByte.
But what if an app integrates an SDK that allows the app to act as a proxy? Does the owner of the device know this? And is this person compensated for the bandwidth? Or is the app developer the only one being compensated (at $0.20/GByte)? [8] How ethical is that?
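The margin can be worked out with a few lines of arithmetic. This is a rough sketch using only the three numbers quoted above; it assumes, for illustration only, that every shared GByte is resold at the $4.55 entry price, and it ignores the provider's own costs.

```python
PAYOUT_PER_GB = 0.20        # paid to the person sharing bandwidth (Figure 2)
PRICE_PER_GB = 4.55         # paid by the proxy customer at 100 GByte (Figure 3)
MONTHLY_PAYOUT_CAP = 140.0  # maximum monthly payout to the sharer

# How many times more the customer pays than the sharer receives.
markup = PRICE_PER_GB / PAYOUT_PER_GB

# Traffic a sharer has to relay before hitting the payout cap.
gb_to_reach_cap = MONTHLY_PAYOUT_CAP / PAYOUT_PER_GB

# Assumption: all of that traffic is resold at the entry price.
resale_value_at_cap = gb_to_reach_cap * PRICE_PER_GB

print(f"markup: {markup:.2f}x")
print(f"GByte needed to reach the cap: {gb_to_reach_cap:.0f}")
print(f"resale value of that traffic: ${resale_value_at_cap:,.2f}")
```

Under these assumptions the customer pays roughly 23 times what the sharer earns, and a sharer who hits the $140 cap has relayed about 700 GByte that resells for around $3,185.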
Again, you can decide for yourself how ethical this compensation is. The example above is just one out of many. If you want to look at some other examples, just search for: Honeygain, Repocket, Earnapp, Packetstream, Loadteam, and there are many more.
Fraud detection of proxy servers
Proxy servers only forward specific traffic: browser traffic, i.e. HTTP(S) traffic. Technically the setup looks like this: your browser talks to the proxy, the proxy server talks to the web server, and because of that the web server only sees the IP address of the proxy server. Legitimate proxy servers add HTTP headers, e.g. X-Forwarded-For, to inform the web server that they forwarded the request. But not all proxy servers do that. Other network traffic, such as DNS queries or UDP traffic, is, depending on the proxy type and configuration, not forwarded to the proxy server by your browser but sent directly. This can be leveraged for fraud detection.
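On the receiving end, checking for such headers is straightforward. The sketch below is a minimal illustration: the header list (X-Forwarded-For, Forwarded, Via) is a small assumed selection, the function names are hypothetical, and the example request is made up. Keep in mind that these headers are supplied by the sender and can be forged or omitted, so they are a hint, not proof.

```python
# Headers that well-behaved proxies commonly add (assumed, non-exhaustive list).
FORWARDING_HEADERS = ("x-forwarded-for", "forwarded", "via")

def forwarding_hints(headers):
    """Return the forwarding-related headers present in a request (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in FORWARDING_HEADERS if name in lowered}

def claimed_client_ip(headers):
    """Return the left-most X-Forwarded-For entry, i.e. the client the proxy claims
    to act for, or None if the header is absent. Untrusted: the sender controls it."""
    xff = forwarding_hints(headers).get("x-forwarded-for")
    if not xff:
        return None
    return xff.split(",")[0].strip()

# Hypothetical request as seen by the web server.
request_headers = {
    "Host": "example.com",
    "X-Forwarded-For": "198.51.100.7, 203.0.113.10",
    "Via": "1.1 proxy.example",
}

print(forwarding_hints(request_headers))
print(claimed_client_ip(request_headers))
```

A request without any of these headers tells you nothing: it may be a direct visit, or a proxy that deliberately stays silent.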
WebRTC (Web Real-Time Communication)
A Zoom video call in your browser uses WebRTC. To communicate, your browser tries to reach the Zoom video server directly, bypassing the HTTPS proxy server. If the user, or fraudster, doesn’t realize this, it can be used to determine the true IP address of the client.
DNS (Domain name system)
Domain name resolution works backwards, from right to left. For example, if you want to resolve the IP address of blablabla.phonyurl.com, you ask your local DNS server: what is the IP address of this domain? If the answer isn’t cached, the server starts the full resolve process. First the .com domain is asked: who is phonyurl.com? Then a second DNS query is sent to the DNS server of phonyurl.com to find out the IP address of blablabla.phonyurl.com.
If you are the owner of phonyurl.com, you also control its authoritative DNS server. Now, let’s generate unique random subdomain names which don't exist yet and thus cannot be cached. In that case the phonyurl.com DNS server will see a query for that random name and thus knows which IP address (the visitor's machine or its DNS resolver) tried to resolve it. Tying the IP address from the DNS query back to the IP address of the visitor who downloaded the JavaScript file containing this random DNS name is another way of matching IP addresses. If they match, great; if they differ, that could be because DNS resolution is not sent through the proxy server.
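The matching step can be sketched as follows. This is a simplified illustration, not a production detector: the function names, the log structures, and the IP addresses are hypothetical, and the exact-equality check is an assumption made for brevity (a real deployment would more likely compare the network or country of the DNS source with that of the visitor, since queries usually arrive via a resolver).

```python
import secrets

ZONE = "phonyurl.com"  # the domain whose authoritative DNS server you control

def new_probe_hostname():
    """Create a random, never-used subdomain so the lookup cannot be cached."""
    token = secrets.token_hex(8)
    return token, f"{token}.{ZONE}"

def token_from_query(qname):
    """Extract the random token from a queried name, or None if outside our zone."""
    qname = qname.rstrip(".").lower()
    suffix = "." + ZONE
    if not qname.endswith(suffix):
        return None
    return qname[: -len(suffix)]

def compare_ips(http_log, dns_log):
    """http_log maps token -> IP that fetched the JavaScript file;
    dns_log is a list of (queried name, source IP) seen by the DNS server.
    Returns token -> (http_ip, dns_ip, same_ip)."""
    results = {}
    for qname, dns_ip in dns_log:
        token = token_from_query(qname)
        if token in http_log:
            http_ip = http_log[token]
            results[token] = (http_ip, dns_ip, http_ip == dns_ip)
    return results

# Hypothetical logs with fixed tokens for readability.
http_log = {"aaaa1111": "198.51.100.7", "bbbb2222": "203.0.113.10"}
dns_log = [
    ("aaaa1111.phonyurl.com.", "198.51.100.7"),   # same address: consistent
    ("bbbb2222.phonyurl.com.", "192.0.2.55"),     # differs: DNS bypassed the proxy?
    ("www.unrelated.example.", "192.0.2.99"),     # not our zone: ignored
]

print(compare_ips(http_log, dns_log))
```

Because every token is used exactly once, a DNS query for it can only have been triggered by the one page view that received it.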
Difference between a proxy server and a VPN
One way for a fraudster to overcome this is to use a VPN, which tunnels all network traffic to its endpoint. This includes all protocols, such as HTTPS, DNS, UDP, NTP, etc. Figure 4 shows the differences between a proxy server and a VPN and how fraud detection at the receiving end would be able to see the IP address(es).
Are (residential) proxies a problem in ad fraud and lead generation fraud?
This is a perfectly valid question. If ad verification companies blacklist all data center IP address ranges, prebid requests from those ranges will be ignored and no advertisements will be served to them. If bots running in a data center don't get advertisements, fraudsters in the ad fraud ecosystem will have to use more expensive residential proxies.
I’ll leave it to the reader to ask their ad verification partner whether they flag data center IP address ranges up front and prevent prebids and ads from being served to bots. If they say “Yes, we do block them!”, can you just trust them? Or should you validate their claim yourself using a few random data center proxy servers?
Once you know data center proxies are blocked you might have peace of mind. But residential and mobile proxies still exist. They are more expensive to use than traffic sent directly from a data center, so will ad fraud still be profitable? If you have to pay $4.55 per GByte, the question is: how many advertisements fit in a GByte? That, of course, differs per ad type. A video ad consumes more bandwidth than a display ad. Simply calculate how much a site earns per 1000 ads, and calculate the data usage of these ads. Spoiler: these proxies are too expensive for impression ad fraud, especially video ads. Renting residential proxies eats away most, if not all, of the profit made by impression fraud.
In lead generation the volume, and thus the data usage, is smaller and the profits per generated lead are way higher: dollars instead of fractions of a penny. That's why residential proxies are common in lead generation. A second reason is that when a form is filled out, the contact address and the area code of the phone number have to match the geolocation of the IP address. Proxy providers offer programmatic access to specific proxies by country, state, ZIP code, city, or ASN (Autonomous System Number) / ISP. This enables fraudsters to quickly switch to the desired location matching the PII data to be filled out in the contact form. And, as you already know, the PII data is bought on the dark web and originates from data breaches.
As mentioned, fraudsters continuously rotate IP addresses to stay under the rate-limiting radar while matching the IP address with the contact data. Combined with the availability of 50 million IP addresses, it will take a very long time to exhaust all IP addresses, let alone blacklist them. These simple facts should worry you.
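The calculation suggested above can be sketched in a few lines. Only the $4.55/GByte price comes from the text; the ad sizes and the earnings per 1000 ads (CPM) below are placeholder assumptions chosen purely to show the mechanics, so plug in your own numbers. The sketch also counts only the bytes of the ad itself, not the surrounding page.

```python
PROXY_PRICE_PER_GB = 4.55  # residential proxy price quoted in the text
MB_PER_GB = 1000           # decimal GByte, assumed

def ads_per_gb(ad_size_mb):
    """How many ads of the given size fit in one GByte of proxy traffic."""
    return MB_PER_GB / ad_size_mb

def proxy_cost_per_1000(ad_size_mb, price_per_gb=PROXY_PRICE_PER_GB):
    """Proxy bandwidth cost in dollars for loading 1000 ads."""
    return 1000 / ads_per_gb(ad_size_mb) * price_per_gb

def profit_per_1000(cpm, ad_size_mb):
    """Earnings per 1000 ads minus proxy cost; negative means a loss."""
    return cpm - proxy_cost_per_1000(ad_size_mb)

# Placeholder scenarios: (ad size in MByte, earnings per 1000 ads in dollars).
# These values are illustrative assumptions, not measurements.
SCENARIOS = {"display": (0.5, 2.00), "video": (5.0, 15.00)}

for name, (size_mb, cpm) in SCENARIOS.items():
    cost = proxy_cost_per_1000(size_mb)
    print(f"{name}: {ads_per_gb(size_mb):.0f} ads/GByte, "
          f"proxy cost ${cost:.2f} per 1000, "
          f"profit ${profit_per_1000(cpm, size_mb):.2f} per 1000")
```

With these placeholder values the proxy bill exceeds the earnings in both scenarios; with smaller ads or higher payouts the result changes, which is why it is worth running the numbers for your own inventory.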
So, again in different words: Are proxy servers (and VPNs) the enabler of ad fraud?
Without proxies, protecting against ad fraud and lead generation fraud would be so much simpler. Simply flag an IP address when fraud is detected, and use rate limiting per IP address to prevent excessive usage. These simple filters alone would solve 80% of the problems. But reality is different: proxies (and VPNs) are part of the Internet. And thus yes, they are the enabler of many sorts of fraud: downloading TBytes of data from data breaches, web scraping, spamming, buying Taylor Swift tickets, purchasing all limited-edition sneakers or PlayStations at the moment of release, etc.
Can they be detected reliably? It depends. The companies providing these proxy services keep improving, and because of that it gets harder to detect inconsistencies. Luckily, many fraudsters are not that technically skilled and make mistakes. The professionals, however, are a pain and know how to configure their bots to avoid detection at almost all levels. Luckily, in lead generation humans have to interact with the contact form, and by looking at the human interaction behavior Oxford Biochronometrics is able to determine whether someone interacts with the contact form or something does: for example, a bot replaying a pre-recorded session, or a bot moving the mouse along programmatic lines (e.g. B-spline or Bézier curves).
Human-operated fraud using browsers and residential proxies or VPNs is another level. Luckily, the behavior of these operators differs from that of normal humans. Just like an experienced shop owner is able to spot a thief by looking at human behavior, this works the same in the digital world.
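The two simple filters mentioned above, flagging an IP address and rate limiting per IP address, could look roughly like this. It is a minimal in-memory sketch: the class name, the sliding-window approach, and the limits used in the example are assumptions for illustration, not a description of any particular product.

```python
from collections import defaultdict, deque

class PerIpFilter:
    """Sliding-window rate limiter per IP address, plus a simple blocklist."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests
        self.flagged = set()             # IPs on which fraud was detected

    def flag(self, ip):
        """Block an IP address after fraud has been detected on it."""
        self.flagged.add(ip)

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (seconds) is accepted."""
        if ip in self.flagged:
            return False
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

# Hypothetical limit: at most 3 requests per 60 seconds per IP.
limiter = PerIpFilter(max_requests=3, window_seconds=60)
burst = [limiter.allow("203.0.113.10", t) for t in (0, 1, 2, 3)]
later = limiter.allow("203.0.113.10", 61)
limiter.flag("203.0.113.10")
after_flag = limiter.allow("203.0.113.10", 120)
print(burst, later, after_flag)
```

Both filters key on the IP address, which is precisely why rotating through a large pool of proxies defeats them.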
Want to know more? Questions? Corrections? Suggestions? Feel free to connect, comment or DM
#proxies #adfraud #leadgen #CMO #residentialproxies
[1] https://bright-sdk.com/
[2] https://infatica.io/sdk-monetization/
[3] https://www.getpaidto.com/quick-points/bandwidth/
[4] https://pawns.app/internet-sharing/
[5] https://www.arin.net/
[6] https://www.arin.net/resources/manage/rpki/roa_request/
[7] https://ipv4.global/blog/hijacked-ip-addresses/
[8] https://repocket.com/sdk