Over the past two weeks, several news articles about “data breaches” affecting Clubhouse, LinkedIn and Facebook have been shared widely. I’d like to add my two cents and clarify two points that keep coming up:
- Scraping is not a crime
- Scraping is not a data leak
Scraping is not a crime
The first time I read about scraping was in 2009, when the company “MeinVZ”, also known as “StudiVZ” and “SchuelerVZ”, attracted attention in Germany with a data breach (at least that is what the media called it). In 2009, MeinVZ was the leading social media platform in Germany and, with around 16 million members, had 5 million more members than Facebook.
While the media described the incident at the time as a “security breach” and a “data breach”, no verdict was ever reached. The suspect, a 20-year-old, had written a script to read out publicly available information such as photo, name and school attended from MeinVZ. After the MeinVZ operators lured him into a trap in Berlin, he was arrested and took his own life the same day. Apart from the blackmail he was accused of, however, he had not committed any crime. The accusation of “data theft” was hardly tenable in this case, according to his defense attorney Ulrich Dost: “It is obvious that data that are published freely accessible to everyone cannot be spied on by a third party.” An acquittal was to be expected.
So you can see how scraping cases, then and now, can be hyped up by the media. Maybe this is exactly what the blackmailers want, since they are trying to sell publicly visible data sets that have no apparent added value.
Scraping is not a data leak
Database dumps like the one from Clubhouse are often incorrectly referred to as data leaks. In general, this is wrong: in most cases the data is public and can be accessed by anyone. The Clubhouse database has no value at all. Any script kiddie can create such a database by scraping publicly available information.
However, this does not apply to the case of Facebook, where functionality in an API was abused. Here, a bug in the “Find contacts” API allowed a nearly unlimited amount of contact data to be read, including the mobile phone number of Mark Zuckerberg.
It is difficult to defend against such scraping. However, you can make it more difficult for an attacker by:
- Using UUIDs as identifiers, which prevents simply counting through (iterating over) IDs.
- Enabling rate limiting to reduce the number of possible calls and slow down the scraping process.
- Enforcing captchas in case of conspicuous/bot-like behaviour.
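To make the first two measures more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how any of the platforms mentioned actually do it: the names `new_profile_id` and `SlidingWindowRateLimiter`, the limits, and the use of an IP address as client key are all hypothetical.

```python
import time
import uuid
from collections import defaultdict, deque


def new_profile_id() -> str:
    """Return a random (version 4) UUID instead of a sequential integer ID.

    Random IDs cannot be enumerated by counting 1, 2, 3, ...
    """
    return str(uuid.uuid4())


class SlidingWindowRateLimiter:
    """Allow at most `max_calls` per client within `window_seconds`."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable, so the limiter can be tested without sleeping
        self._calls = defaultdict(deque)  # client key -> timestamps of recent calls

    def allow(self, client_key: str) -> bool:
        now = self.clock()
        calls = self._calls[client_key]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False  # over the limit: reject, delay, or show a captcha
        calls.append(now)
        return True


if __name__ == "__main__":
    print(new_profile_id())  # a different random ID on every run
    limiter = SlidingWindowRateLimiter(max_calls=3, window_seconds=60)
    # 203.0.113.7 is a documentation-only example address.
    print([limiter.allow("203.0.113.7") for _ in range(5)])
    # prints [True, True, True, False, False]
```

Keying the limiter on an IP address is the simplest option, but an attacker with many addresses can spread the requests. Limiting per account or per API token is usually harder to get around.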
Generally, most websites actually want to be scraped, because search engines like Google also use web scraping to locate and index websites.
Scraping is a regular tool in the age of the web and is also used in the field of OSINT (Open Source Intelligence). Google itself is probably the best-known example. But there are also tools such as Sherlock, which can locate a username on various social media portals (Snapchat, Steam, PlayStation Network, etc.).
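The basic idea behind such a username lookup can be sketched in a few lines of Python. This is not Sherlock's actual code, only a simplified illustration: the site list, the example.com URLs and the function names are made up, and treating HTTP status 200 as "profile exists" is an assumption that fails on sites that also answer 200 for missing profiles.

```python
import urllib.error
import urllib.request

# Hypothetical URL templates; a real tool maintains a much longer list.
SITE_TEMPLATES = {
    "example-forum": "https://forum.example.org/u/{username}",
    "example-social": "https://social.example.com/users/{username}",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 when the profile does not exist
    except urllib.error.URLError:
        return 0  # DNS failure, refused connection, timeout, ...


def find_username(username, fetch_status=http_status, sites=SITE_TEMPLATES):
    """Return the names of all sites on which the profile URL answers with 200."""
    found = []
    for site, template in sorted(sites.items()):
        url = template.format(username=username)
        if fetch_status(url) == 200:
            found.append(site)
    return found
```

Passing the status lookup in as `fetch_status` keeps the matching logic separate from the network access, so it can be tried out with a fake lookup before pointing it at real sites.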
How to bypass Anti-Scraping – Using WhatsApp Online Status as an example
Also this week, there were reports of stalkerware that is used to monitor the WhatsApp online status of contacts. Such a tool violates WhatsApp’s terms of service. Over time I have developed some scraping tools of this kind myself, including scripts to fake YouTube likes and tools for monitoring Facebook online time. Therefore, I would like to briefly discuss the present case of WhatsApp.
For scripters, captchas and rate limiting are barriers that make scraping particularly interesting. Social media platforms can learn a thing or two here, from online games in particular. Game operators are much more advanced in fighting automated bots and have implemented good anti-scraping techniques.