Fingerprinting, finger pointing or fingers crossed?
In digital marketing the end of third-party cookies has made digital fingerprinting of visitors' devices the de facto standard for tracking individuals across the Internet. Fingerprinting sounds like an innocent word, implying that you need to touch or click something in order to leave a fingerprint. Unfortunately, this isn't the case. Your digital fingerprint is taken automatically, silently, without notice or consent. Once the digital fingerprint has been taken it becomes finger pointing: we track YOU, and YOU, and hey, it's YOU again. YOU are being tracked across the Internet. How bad is the situation? The blog 'Web fingerprinting is worse than I thought' [1] has some great examples and screenshots of different browsers with different settings, showing the current state of web fingerprinting.
What is a fingerprint? And how do marketeers obtain one? To create a fingerprint, JavaScript code is executed on the visitor's device as soon as they visit a webpage carrying an advertisement, or a webpage with a tracking pixel, e.g. the Meta pixel. The JavaScript code calculates the fingerprint based on the collected data and subsequently conveys it to some central platform collecting the information. The generated fingerprint is as unique as possible while retaining a stable output, which means that over time the fingerprint stays the same. This enables third parties using that central platform to see the same fingerprint over and over again and infer that it is the same device, and thus the same visitor. Based on the URLs of the web pages, search terms, referrer URLs, etc. which are associated with the fingerprint, you will fit a profile and thus get certain advertisements.
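To make this concrete, here is a minimal sketch of what such a script could look like. The handful of signals, the simple FNV-1a hash and the collect endpoint are illustrative assumptions; real trackers gather far more signals and use more robust hashing:

```javascript
// Minimal, simplified illustration of fingerprint collection.
function collectSignals() {
  return [
    navigator.userAgent,
    navigator.language,
    navigator.hardwareConcurrency,
    screen.width + 'x' + screen.height,
    screen.colorDepth,
    new Date().getTimezoneOffset(),
  ].join('|');
}

// Simple FNV-1a hash over the collected signals; real trackers
// use many more signals and stronger hash functions.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash = Math.imul(hash ^ str.charCodeAt(i), 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

const fingerprint = fnv1a(collectSignals());
// Convey it to a central platform (the endpoint URL is hypothetical).
navigator.sendBeacon('https://tracker.example/collect', fingerprint);
```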
In marketing the fingerprints are used to point a finger at individual users: it is YOU that we recognize! And with a large audience you'll need a lot of fingers to point. This is completely different in fraud detection, where typically only two fingers are needed.
In fraud detection, fingerprinting is used to classify a device as either human or fraud, plus a tiny group of outliers requiring manual inspection. The underlying technique might be similar, but the goal is completely different: it is not about tracking people! By the way, I have seen some implementations made by fraud detection companies which use device-tracking-oriented fingerprinting instead of bot/human classification fingerprinting. The irony is that the founders of these companies have their roots in marketing and not in cyber-security, which might explain certain choices. From a fraud detection perspective, device fingerprinting is about distinguishing between regular human-operated browsers on the one hand, and fully automated browsers (bots) and fraudsters using special browsers on the other. It doesn't have to be based purely on JavaScript, as a device's JavaScript fingerprint is just one of many signals. When looking at the total picture, much more of the tech stack can be used for fingerprinting.
The stack in the center of figure 3 below shows the tech stack as a layered model when a browser, or an app with a webview, is running and displaying a webpage. When the browser communicates with the Internet it has to go through the lower layers, i.e. the operating system (OS), the network stack, TLS, etc. These layers all have their own implementation in each browser and OS, and thus their own fingerprint. On the left and upper right, APIs and user-configurable preferences are shown which can be extracted by JavaScript (and thus spoofed) from the browser at runtime. At the lower right, the encryption and TCP/IP network packet implementations are shown, which again differ per browser and per OS.
Scaling is difficult, as always
Running a single bot successfully isn't that difficult. Once you've patched your browser against the most common detection methods you'll be fine, as long as you keep it under the radar. But naively scaling your single bot to a botnet with lots and lots of identical bots causes the same fingerprint to be calculated, and thus to appear over and over again. That's highly suspicious and puts you on the detection radar. That's why bot developers override the default screen and navigator values in the browser with values randomly selected from a database, preferably based on data collected from real devices. The calculated fingerprints will be based on these values.
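A rough sketch of such an override; the profile pool below is made up for illustration, whereas real bots draw from large databases harvested from real devices:

```javascript
// Hypothetical pool of device profiles (real bots use harvested data).
const profiles = [
  { cores: 8,  memory: 8,  width: 1920, height: 1080 },
  { cores: 4,  memory: 4,  width: 1366, height: 768  },
  { cores: 12, memory: 16, width: 2560, height: 1440 },
];
const p = profiles[Math.floor(Math.random() * profiles.length)];

// Shadow the getters that fingerprint scripts read.
Object.defineProperty(navigator, 'hardwareConcurrency', { get: () => p.cores });
Object.defineProperty(navigator, 'deviceMemory', { get: () => p.memory });
Object.defineProperty(screen, 'width',  { get: () => p.width });
Object.defineProperty(screen, 'height', { get: () => p.height });
```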
First degree questions
With this simple technique they have created a botnet with a variety of fingerprints, i.e. devices with a variety of memory, CPU cores, screen sizes, color depths, user agents, user agent data, languages, architectures, codecs, etc. But these values only have value in marketing fingerprints; they don't have any value in fraud detection! Just like an IP address, any value which can be spoofed easily has no real value on its own. These are considered first degree fingerprint values: extracted directly from the browser. They can be overridden directly in the browser without leaving a trace. Again, these values have absolutely no value on their own.
Second degree questions
Second degree fingerprint values are based on the execution of code in the browser: a hash of some graphics rendered on a canvas by the GPU (graphics card), a hash of audio rendered by the browser, Math functions returning tiny floating point differences due to different implementations, etc. This is much harder to fake with a simple find-and-replace spoofed value. With second degree tests the 'question' lies within the JavaScript code and the 'answer' in the implementation of the function within the browser, the GPU, or, in the case of fraudsters, a virtual GPU. To avoid being detected as a botnet using a large army of identical browsers, fraudsters have come up with several answers; one of them is randomization. When the fingerprint code is executed to determine the combination of browser and graphics card, a little bit of random noise is added when the RGB values of the pixels are read to calculate the fingerprint. This causes the pixel values to change slightly, and thus the hash calculated from those pixel values, creating a unique fingerprint every time.
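Both sides can be sketched in a few lines. The rendered scene and hash below are simplified stand-ins for the much more elaborate scenes real fingerprint scripts draw; the second half shows the bot's noise trick:

```javascript
// The 'question': render a scene and hash the resulting pixels.
function canvasFingerprint() {
  const canvas = document.createElement('canvas');
  canvas.width = 200;
  canvas.height = 50;
  const ctx = canvas.getContext('2d');
  ctx.font = '16px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillText('fingerprint me', 2, 20);
  const pixels = ctx.getImageData(0, 0, canvas.width, canvas.height).data;
  let hash = 0x811c9dc5;                      // FNV-1a over the raw RGBA bytes
  for (let i = 0; i < pixels.length; i++) {
    hash = Math.imul(hash ^ pixels[i], 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// The bot's 'answer': patch getImageData so every read is slightly noisy.
const originalGetImageData = CanvasRenderingContext2D.prototype.getImageData;
CanvasRenderingContext2D.prototype.getImageData = function (...args) {
  const image = originalGetImageData.apply(this, args);
  for (let i = 0; i < image.data.length; i += 4) {
    image.data[i] ^= Math.random() < 0.01 ? 1 : 0; // flip a low bit now and then
  }
  return image;
};
```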
Of course, in order to catch bots adding canvas noise, a small lightweight canvas fingerprint with a predictable output is calculated. If the output differs from the expected output, noise has been added: gotcha! Though this can be circumvented as well. That's why a better approach in fraud detection is to match and validate the calculated fingerprint against the reported graphics card, browser, OS and architecture. If you see millions of visitors and suddenly a group of outliers using the same reported hardware, or the same signature with different types of reported hardware: that's suspicious! This is where extracted first degree values, i.e. the vendor and renderer of the graphics card, in combination with second degree values, have added value.
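A minimal noise detector along those lines: a solid rectangle must read back pixel-perfect on any correct implementation, so any deviation means someone is tampering with the pixel data:

```javascript
// Draw something trivially deterministic: a solid opaque rectangle.
function noiseCheck() {
  const canvas = document.createElement('canvas');
  canvas.width = 16;
  canvas.height = 16;
  const ctx = canvas.getContext('2d');
  ctx.fillStyle = '#ff0000';
  ctx.fillRect(0, 0, 16, 16);
  const pixels = ctx.getImageData(0, 0, 16, 16).data;
  // Solid red must read back as exactly (255, 0, 0, 255) everywhere.
  for (let i = 0; i < pixels.length; i += 4) {
    if (pixels[i] !== 255 || pixels[i + 1] !== 0 ||
        pixels[i + 2] !== 0 || pixels[i + 3] !== 255) {
      return 'noise detected: gotcha!';
    }
  }
  return 'clean';
}
```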
The same applies to the reported user agent. This value on its own doesn't have any value, as you can spoof it very easily; you can even change your browser into an infamous 'fartbot' [2]. But if you are pretending to be a Chrome/110 while the really installed browser is Chrome/103, that is detected quite easily, because the TLS fingerprint doesn't match the reported user agent, or because the HTTP header order or HTTP/2 structure differs [3], etc. Gotcha!
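Server-side, such a consistency check could be sketched as follows. The hash table below is a placeholder for illustration; real systems maintain databases of TLS (e.g. JA3) fingerprints observed per browser build:

```javascript
// Hypothetical table mapping browser versions to their known TLS fingerprints.
const knownTlsFingerprints = {
  'Chrome/103': '<hash observed for Chrome 103 handshakes>',
  'Chrome/110': '<hash observed for Chrome 110 handshakes>',
};

// Does the TLS handshake we actually saw match the claimed browser?
function uaMatchesTls(reportedUa, observedTlsHash) {
  const expected = knownTlsFingerprints[reportedUa];
  return expected !== undefined && expected === observedTlsHash;
}

// A client claiming Chrome/110 but handshaking like Chrome/103: gotcha!
console.log(uaMatchesTls('Chrome/110', knownTlsFingerprints['Chrome/103'])); // false
```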
Of course, one of the answers is to use a TLS termination proxy which changes the TLS fingerprint on demand, so that it matches the reported user agent. To show you that this isn't some theoretical situation, you can try this interesting tool to test and look at your own TLS fingerprint [3]. And if you would like to run some tests, you could try curl-impersonate [4], a modified curl that is able to change its TLS fingerprint via a command-line parameter.
Lastly, did you ever consider that simply by looking at how the TCP/IP packets are formatted [5] you could determine the OS? Different OSes have different implementations of TCP/IP, and thus these packets differ slightly. The simplest example is looking at the TTL (time to live) field value. The default initial TTL on Windows is 128, and 64 on Linux. So, if a Windows Chrome/110 browser knocks on your web server's door and, looking at the network packets, it is clearly a Linux machine: gotcha!
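A back-of-the-envelope version of that check, ignoring exotic TTL defaults and assuming fewer than 64 network hops between client and server:

```javascript
// Infer the sender's OS family from the observed TTL of an incoming packet.
// Each router hop decrements TTL by 1, so an initial TTL of 64 (Linux, macOS)
// arrives in the range 1..64, and 128 (Windows) in the range 65..128.
function osFromTtl(observedTtl) {
  if (observedTtl > 64 && observedTtl <= 128) return 'Windows';
  if (observedTtl > 0 && observedTtl <= 64) return 'Linux-like';
  return 'unknown';
}

function checkUaAgainstTtl(reportedUa, observedTtl) {
  const claimsWindows = reportedUa.includes('Windows');
  const looksWindows = osFromTtl(observedTtl) === 'Windows';
  return claimsWindows === looksWindows ? 'consistent' : 'gotcha!';
}

// A 'Windows' browser whose packets arrive with TTL 52 (so initial TTL 64):
console.log(checkUaAgainstTtl('Mozilla/5.0 (Windows NT 10.0) Chrome/110', 52)); // 'gotcha!'
```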
Now the question arises: with so many different ways of detecting real versus reported, how do bot developers keep up to date? That's quite easy. Most anti-bot vendors protecting their clients' websites kick bots out: softly, by showing CAPTCHAs, or hard, by returning an HTTP 429 status code (too many requests) or just an HTTP 403 error (forbidden). I'm not talking about legacy ad fraud detection vendors, which have passive solutions reporting after the fact, if they are able to detect advanced bots and fraud at all.
Blocking bots provides an immediate feedback loop to the bot developers, telling them: "You need to upgrade your stuff!!" In any other case it's like "these aren't the bots we're looking for..".
Large popular brands do know this; some recent examples: ticket sales for Taylor Swift, sneaker releases, etc. For some reason bots are still able to continue their scheme, which means they are able to 'perfectly' blend in. The word 'perfectly' of course depends on the quality of the bot and of the fraud detection.
The avoid-being-detected trick lies in providing the correct answers to the questions asked. Without any patches the bot would reveal its true browser and its nature, i.e. a bot with webdriver=true, or the existence of $cdc_blablabla in case of Selenium, etc., and the canvas fingerprints would reveal the usage of virtual screens or no screen at all. That's why these bots patch these indicators, by overriding or completely removing them, and use randomization to trick the questions. But that's fingers crossed for the bot devs. They might avoid detection, but, as always, scaling becomes a problem, simply because all these sessions appear in the long tail of unique visitors. They are outliers because of the randomization, and thus can be finger pointed at.
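For illustration, the unpatched indicators mentioned above can be probed with a few lines of JavaScript (a simplified sketch; real detection scripts run hundreds of such checks):

```javascript
// Probe two classic automation giveaways.
function basicBotChecks() {
  const hits = [];
  // Standard WebDriver flag, true in unpatched automated browsers.
  if (navigator.webdriver) hits.push('navigator.webdriver is true');
  // ChromeDriver leaks a document property whose name starts with $cdc_.
  for (const key of Object.keys(document)) {
    if (key.startsWith('$cdc_')) hits.push('ChromeDriver marker: ' + key);
  }
  return hits;
}
```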
So, what's the solution to perfectly blend in with the human audience? The simplest way is to test and score a bot against publicly available tests like creepjs [6], fingerprint.com and many others. But, as can be expected, any bot or fingerprint test that is publicly available only covers the basics. The real deal is to know the questions being asked and to prepare the correct answers to perfectly blend in with a large group of humans. But these questions are deeply hidden in the detection JavaScript(s). This means that the JavaScripts which contain those questions need to be protected, so that the questions cannot be extracted easily.
The next article will show how JavaScripts can be protected against reverse engineering and which different techniques can be used. All techniques have the same goal: make it hard, harder, or near impossible to understand what is going on. Again, it's all about economics: time is money. If it is too hard or too much work to reverse engineer, then "this isn't the website you're looking for..", "move along.."
Questions? Just leave a comment or DM
Links:
[1] https://www.bitestring.com/posts/2023-03-19-web-fingerprinting-is-worse-than-I-thought.html
[2] https://www.linkedin.com/pulse/you-can-see-better-do-augustine-fou
[4] https://github.com/lwthiker/curl-impersonate
[5] https://en.wikipedia.org/wiki/TCP/IP_stack_fingerprinting