Anonymized data: You’re doing it wrong

Richi Jennings, Industry analyst and editor, RJAssociates

Anonymized data isn’t as anonymous as people think. That’s the conclusion of recently published research—a cross-discipline collaboration between a quidditch-playing computer scientist and a child-prodigy statistician.

But wait, is this really news? Or is it simply a slow news week?

Whichever side you come down on, the results should be a wake-up call for IT/DevSecOps. In this week’s Security Blogwatch, we do not forgive.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: MVP in black.

We do not forget

What’s the craic? Karl Bode’s well—'Anonymized' Data Is Even Less Anonymous Than We Thought:

Analysis from students at Harvard University shows that anonymization isn’t the magic bullet companies like to pretend it is. [They] recently built a tool that combs through vast troves of consumer datasets exposed from breaches.

[They] analyzed thousands of datasets from data scandals ranging from the 2015 hack of Experian, to the hacks and breaches [of] services from MyHeritage to porn websites. Despite many of these datasets containing “anonymized” data, the students say that identifying actual users wasn’t all that difficult.

Independently they may not identify you, but collectively they reveal numerous intimate details even your closest friends and family may not know. … The public is dramatically underestimating the impact on privacy and security these leaks, hacks, and breaches have in total.

The problem is compounded by the fact that the United States still doesn’t have even a basic privacy law for the internet era. … We’re left relying on the promises of corporations who’ve repeatedly proven their privacy promises aren’t worth all that much.

Could we have a more sweary summary? Shoshana Wodinsky obliges—'Anonymized' Data Is Meaningless Bull****:

Data that’s “anonymized” [is] a concept we’ve seen to be bull**** time and again.

Big shadowy data brokers, by and large, aren’t going to store anything explicitly personal about you … because there’s no value in it. Even though the ads stalking us around the web might seem to suggest otherwise, marketers give no ****s about your hopes, your dreams, your fears, the gym you go to, or how you sexually identify.

What is an issue is what happens when those “anonymized” data points inevitably bleed out of the marketing ecosystem and someone even more nefarious uses it. … When one data broker springs a leak, it’s bad enough—but when dozens spring leaks over time, someone can piece that data together in a way that’s not only identifiable but chillingly accurate.

That’s why the “anonymized data” defense from marketers and data brokers is so ****ed.

Sauce? Adam Zewe channels Dasha Metropolitansky and Kian Attari—leaks pose greater risks than most people realize:

That’s the conclusion reached by two students at the Harvard John A. Paulson School of Engineering and Applied Sciences, who explored data leaks for their final project in Privacy and Technology (CS 105), taught by [Professor] Jim Waldo. [They] wondered if they could identify an individual across [many] leaks that have occurred, combining stolen personal information from perhaps hundreds of sources.

“The program takes in a list of personally identifiable information … and searches across the leaks for all the credential data it can find for each person,” [Attari] said. … “Once it is leaked, it’s gone. That is something everyone has to realize.”

“What we were able to do is alarming because we can now find vulnerabilities in people’s online presence very quickly,” Metropolitansky said. … “A cyber criminal [can] search for victims who meet a certain set of criteria.”

For example [she] pulled up a list of senior-level politicians, revealing the credit scores, phone numbers, and addresses of three U.S. senators, three U.S. representatives, the mayor of Washington, D.C., and a Cabinet member. … “Linking different leaked data sources can reveal much more than any of us are aware of, or comfortable with, being out in the network,” said Waldo.

But who’s cross-referencing the data today? Shotgun claims to have worked for one:

I worked for a company called Maxpoint [Valassis]. Their whole spiel was combing through data sources to draw a picture of a user in order to target advertising.

I am not being a doomsayer when I tell you that privacy is dead. It is just a simple "water is wet" fact.

The only thing that is going to ever get even a modicum of privacy back is a law that prohibits companies from being able to share data, even if you agreed to it in those multi-page legalese agreements.

So what’s a concerned citizen to do? SarDeliac suggestifies thuswise:

The only real tactic left is to keep as low a profile as possible and try to be more noise than signal.

The recent controversy over Avast/AVG selling customers’ “anonymized” data via its Jumpshot subsidiary has focused minds, including MiguelC’s:

There's a lesson to be learned there: once your customers jump ship because they're not happy about your product (or your company) and start using something else, it's really hard to bring them back. Best bet should be not to **** off customers in the first place.

Wait. Pause. Our old friend gweihir says there’s no “there” there:

Anybody that actually looked into this has known [it] for a long time. Anonymization breaks down when you have multiple anonymized data sets with the same people in them or at least significant overlap. This is neither new, nor is it surprising in any way.
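What does that breakdown look like in practice? Here is a minimal sketch, not the Harvard students' tool: two made-up "anonymized" datasets, neither containing a name, joined on the fields they happen to share. The records, the field names, and the `link_records` helper are all hypothetical, invented purely for illustration.

```python
# Hypothetical toy data: two "anonymized" releases, no names in either.
fitness = [
    {"zip": "02138", "birth_year": 1984, "gender": "F", "gym": "Gym A"},
    {"zip": "02139", "birth_year": 1990, "gender": "M", "gym": "Gym B"},
]
shopping = [
    {"zip": "02138", "birth_year": 1984, "gender": "F", "email_hash": "a1b2"},
    {"zip": "94105", "birth_year": 1975, "gender": "M", "email_hash": "c3d4"},
]

# Assumption: these fields appear in both datasets and act as quasi-identifiers.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def link_records(left, right, keys=QUASI_IDENTIFIERS):
    """Join two datasets on shared quasi-identifiers (a simple hash join)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    linked = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            # Merge the two partial profiles into one richer profile.
            linked.append({**row, **match})
    return linked


linked = link_records(fitness, shopping)
print(linked)
```

In this toy case one person shows up in both sets, so the gym record and the hashed email end up in a single profile. Add a third or a tenth dataset and the profile only gets richer.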

But is there a bigger issue? Drew M draws this conclusion: [You’re fired—Ed.]

A bigger issue is our definition of what it means to have anonymized data. There are valuable and socially beneficial uses for anonymous medical data and, frankly, I am hopeful that we will start using more of this sort of information for the public good.

That said, it needs to be truly anonymized. Things like IP addresses, device IDs, etc. that enable us to string data together need to be fully scrubbed from each and every dataset to be truly considered anonymous data.
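As a rough sketch of the scrubbing step described here, the snippet below drops the linkable fields from every record before a dataset is shared. The field names (`ip_address`, `device_id`) and the `scrub` helper are assumptions made for the example, not anything from a real product.

```python
# Hypothetical field names for the identifiers that let datasets be strung together.
LINKABLE_FIELDS = frozenset({"ip_address", "device_id"})


def scrub(records, drop=LINKABLE_FIELDS):
    """Return copies of the records with the linkable fields removed."""
    return [{k: v for k, v in rec.items() if k not in drop} for rec in records]


# Made-up sample record, for illustration only.
raw = [
    {"ip_address": "203.0.113.7", "device_id": "dev-001",
     "zip": "02138", "diagnosis": "asthma"},
]

scrubbed = scrub(raw)
print(scrubbed)
```

Note what survives: the ZIP code is still there. Dropping explicit identifiers is necessary, but the remaining fields can still act as quasi-identifiers when combined with other datasets, so scrubbing alone may not be enough.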

Meanwhile, Antique Geekmeister adds another worry:

Unfortunately, some of them anonymize it after they've collected it, not during it: I had a fascinating chat with a company several years [ago] who only realized when we spoke that activating Splunk for their logs, tapping data for immediate analysis, would defeat all their existing anonymization practices.

The moral of the story?

If you’re storing anonymized data, consider red-teaming; it’s probably not as anonymous as you think.
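One cheap first step for such a red-team exercise might be to count how many records are unique on their quasi-identifiers, which is the idea behind k-anonymity. The sketch below is a simplified illustration under assumed field names and made-up data, not a complete audit.

```python
from collections import Counter


def group_sizes(records, keys):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(rec[k] for k in keys) for rec in records)


def k_anonymity(records, keys):
    """Smallest group size: the dataset is k-anonymous for this value of k."""
    return min(group_sizes(records, keys).values())


def unique_records(records, keys):
    """Records whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(records, keys)
    return [rec for rec in records if sizes[tuple(rec[k] for k in keys)] == 1]


# Hypothetical sample data; the field names are assumptions.
records = [
    {"zip": "02138", "birth_year": 1984, "score": 710},
    {"zip": "02138", "birth_year": 1984, "score": 655},
    {"zip": "02139", "birth_year": 1990, "score": 690},
    {"zip": "94105", "birth_year": 1975, "score": 720},
]
keys = ("zip", "birth_year")

k = k_anonymity(records, keys)
exposed = unique_records(records, keys)
print(k, len(exposed))
```

A result of k = 1 means at least one record stands alone on those fields, so anyone holding a second dataset with the same fields could single that person out.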

And finally

“Minimum viable product that technically qualifies as a movie”

You have been reading Security Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites … so you don’t have to. Hate mail may be directed to @RiCHi or sbw@richi.uk. Ask your doctor before reading. Your mileage may vary. E&OE.

Image source: Wendelin Jacober (Pexels)
