What data can be retrieved from CDN access logs?
Last updated: 16 October 2023
Did you know that a lot of personal data is collected at scale without any consent? This data is generated automatically, just by surfing the web and consuming digital media, without the need for cookies or trackers.
How? Through CDN access logs. In this blog post, we reveal the personal data that is automatically logged and stored.
CDN access logs are a very reliable source for analytics, but they also contain a lot of data you may not want to end up with non-EU vendors who won’t or can’t protect this data.
Who owns the data?
Websites don’t ask for consent to log access data with clouds, CDNs and services. Who actually owns the data? Do you? Do your vendors? They can rightfully claim that it’s their service generating the data, so they can freely share, sell and merge data about your services, your business, and your audience, train AIs on it, and use it to compete with you using your own data.
Digital service usage and user consumption data are valuable assets. Usage analytics can be used to tune services to better accommodate users. The data can also be used to technically tune services to improve the quality. Such data can be collected from various sources to create a more personalised and improved online experience for users.
Content Delivery Networks (CDNs) and clouds play a pivotal role in delivering media efficiently. These infrastructures collect detailed information. For example, the access logs generated by CDNs and clouds (and the services built on top of them) can potentially reveal more about you than you might think.
In this blog post, we’ll dive into what personal data can be retrieved from CDN access logs, including content, IP addresses, location, and fingerprinting based on details such as browser version, screen size, plugins, and fonts.
Content data tells a lot about you that you may not want to share
The type of content that you consume is in access logs. Are you binge-watching a sitcom, or are you following political debates? Are you taking digital courses and training in a certain language? Or are you watching medical instruction videos? The content you consume shares deeply private information about your interests, culture, purchasing interests, beliefs and sometimes even the most intimate personal feelings, desires, and fears.
IP Addresses, The Digital Identifier: who are you?
CDNs act as intermediaries between users and the origin of the content they request. In the process, CDNs log various information about each interaction. One of the most fundamental pieces of data stored is the user’s IP address. This numerical label is akin to a digital identifier, as it can often be used to trace the user’s identity, the user’s approximate location and their Internet Service Provider (ISP).
Location Data: A Demographic Profile: what are you?
Your IP address can be a treasure trove of information. With it, it’s possible to determine your geolocation with reasonable accuracy. While this data might not point directly to your home address, it can provide insights into your country, region, city, and even neighbourhood. By matching the location derived from your IP address with other public and private demographic databases, it is possible to infer detailed information about your income, home situation, religion and political persuasions.
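To make the lookup step concrete, here is a minimal sketch. It assumes a tiny hand-made table that maps IP prefixes to places. Real geolocation relies on large commercial or open databases, so the table, the function name and the locations below are purely illustrative. The prefixes are reserved documentation ranges, not real users.

```python
import ipaddress

# Hypothetical prefix-to-location table. The networks are reserved
# documentation ranges; the places assigned to them are made up.
GEO_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): ("NL", "Noord-Holland", "Amsterdam"),
    ipaddress.ip_network("198.51.100.0/24"): ("DE", "Berlin", "Berlin"),
    ipaddress.ip_network("2001:db8::/32"): ("BE", "Brussels", "Brussels"),
}

def locate(ip_text):
    """Return (country, region, city) for an IP address, or None if unknown."""
    ip = ipaddress.ip_address(ip_text)
    best = None
    for network, place in GEO_TABLE.items():
        # Only compare networks of the same IP version; keep the longest match.
        if ip.version == network.version and ip in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, place)
    return best[1] if best else None

print(locate("192.0.2.45"))   # an address inside the first prefix
print(locate("203.0.113.9"))  # an address that is not in the table
```

Once a place is attached to every log line, joining it with demographic data is an ordinary database lookup on the region or city.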
Browser Fingerprinting: uniqueness in diversity
Beyond the consumed content and IP addresses, CDN access logs capture a wealth of data about your device and browser. This includes details like the browser version, operating system, screen size, browser size, installed plugins, and even the fonts available on the device. This information is collectively known as a “browser fingerprint.”
A browser fingerprint is often unique, or close to it, much like a human fingerprint. By analysing these characteristics, companies and websites can track your online behaviour, even if you’ve taken measures to protect your identity. Ironically, the use of ad blockers, tracker blockers and fingerprinting blockers, in combination with the browser version, can itself help to create a unique fingerprint.
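A rough sketch of the idea: attributes that look harmless on their own are combined and hashed into one stable identifier. The attribute names and sample values are assumptions for illustration, and real fingerprinting uses many more signals than these three.

```python
import hashlib

def fingerprint(attributes):
    """Hash a dict of device and browser attributes into a short identifier."""
    # Sort the keys so the same attributes always produce the same hash.
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Invented sample visitor.
visitor_a = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64; rv:118.0) Gecko/20100101 Firefox/118.0",
    "accept_language": "nl-NL,nl;q=0.8,en;q=0.5",
    "screen": "2560x1440",
}
# Same browser and language, but a different screen size.
visitor_b = dict(visitor_a, screen="1920x1080")

print(fingerprint(visitor_a))
print(fingerprint(visitor_b))
print(fingerprint(visitor_a) == fingerprint(visitor_b))  # False: one attribute differs
```

No cookie is involved. As long as the attributes stay the same, the same identifier comes back on every request.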
There’s more data that can be combined and trained
Logs contain the date and time of the request, the referrer (the site you are browsing), the technical request status, and custom data including sessions for the service owner to identify you and match your user account to the content being requested. This data, too, can be used for profiling. With the rise of AIs, it is becoming increasingly easy to match multiple datasets, process billions of logs per hour, and train machines.
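To show how easily these fields come out of a log, here is a minimal sketch. It assumes the widely used "combined" access log format known from web servers; actual CDN log formats differ per vendor. The sample line is invented, and its IP address comes from a reserved documentation range.

```python
import re
from datetime import datetime

# Pattern for the "combined" access log format (an assumption; CDN
# vendors each define their own field order and extra fields).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Invented sample line for illustration.
SAMPLE = (
    '192.0.2.45 - - [16/Oct/2023:20:15:42 +0200] '
    '"GET /courses/spanish-b1/lesson-07.m3u8?session=abc123 HTTP/1.1" '
    '200 1834 "https://learn.example/spanish" '
    '"Mozilla/5.0 (X11; Linux x86_64; rv:118.0) Gecko/20100101 Firefox/118.0"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp and status code into proper Python types.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

record = parse_line(SAMPLE)
for field in ("ip", "time", "path", "status", "referrer", "user_agent"):
    print(f"{field}: {record[field]}")
```

A single line already ties an IP address, a moment in time, a piece of content, a session and a browser together.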
This data is collected without consent
Note that all this data is automatically collected without your consent, since consent typically only covers cookies, not logs. And please note that when you watch a simple stream, dozens of service vendors such as CDNs, clouds, video players, analytics, and ad networks can each collect their own logs, using a wide variety of clouds and CDNs.
In conclusion, access logs contain data that can be used for good causes, for example to provide content providers with detailed statistics and performance data so that they can improve their services to your benefit.
However, these logs can also contain the most intimate and private data about you, which can be turned against you for political or commercial reasons.
Therefore, we should not underestimate the importance of protecting this data, not just by law but also through best practices: minimising, anonymising and securing the data.
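As one possible illustration of minimising and anonymising, the sketch below truncates the IP address and drops the query string and referrer before a record is stored. The record layout and function names are hypothetical, and the /24 and /48 prefix lengths are a common choice rather than a standard. Truncation reduces identifiability; it does not guarantee anonymity.

```python
import ipaddress
from urllib.parse import urlsplit

def truncate_ip(ip_text):
    """Zero the host part of an IP: keep /24 for IPv4 and /48 for IPv6."""
    ip = ipaddress.ip_address(ip_text)
    prefix = 24 if ip.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)

def minimise(record):
    """Return a reduced copy of a log record with less personal detail."""
    return {
        "ip": truncate_ip(record["ip"]),
        # Keep the path but drop the query string, which may hold session IDs.
        "path": urlsplit(record["path"]).path,
        "status": record["status"],
        # The referrer is deliberately not copied.
    }

# Invented sample record; the field names are assumptions.
raw = {
    "ip": "192.0.2.45",
    "path": "/courses/spanish-b1/lesson-07.m3u8?session=abc123",
    "status": 200,
    "referrer": "https://learn.example/spanish",
}
print(minimise(raw))
```

The reduced record still supports statistics per title and per status code, which is often all a content provider needs.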
Fortunately, in the EU, we have the GDPR, which protects your rights when it comes to this data.
Unfortunately, the USA does not provide the same safeguards and rights, and most CDNs and clouds are US-owned. It is questionable whether temporary transatlantic agreements cover your rights, and the same goes for Standard Contractual Clauses.
Fortunately, there is an increasing number of EU-owned and EU-operated services, such as Jet-Stream, that proactively safeguard your data.