
How Can We Ensure the Accuracy of Data Mining While Anonymizing the Data? - by Lance Winslow


Okay so, this topic is meaningful; the question was recently raised in a government publication on Internet privacy, smartphone personal data, and social network security features. And indeed, it is a good question, because we need the bulk raw data for many things: planning IT backbone infrastructure, allotting communication frequencies, tracking flu pandemics, chasing cancer clusters, national security, and so on. This data is very important.


Still, the question remains: "How can we ensure the accuracy of data mining while anonymizing the data?" Well, if you don't collect any data in the first place, you know what you've collected is accurate, right? No data collected = no errors! But that's not exactly what everyone has in mind, of course. Now then, if you don't have sources for the data points, and if all the data is anonymized in advance because of the use of screen names on social networks, then none of the data can be taken as truthful on its face.


Okay, but that doesn't mean some of the data isn't correct, right? And if you know the percentage of data you cannot trust, you can get better results. How about an example: during Barack Obama's campaign there were numerous polls in the media, and many of the online polls showed a landslide-sized margin that never materialized in the actual election. Why? Simple: there were folks gaming the system, and the online crowd skewed toward a younger group that participated in greater abundance.
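
To make this concrete, here is a minimal back-of-the-envelope sketch of how a known untrusted share could be backed out of a raw online-poll figure. The function name, the assumed bias of the gamed responses, and the example numbers are all hypothetical, purely for illustration.

# Hypothetical illustration: correct a raw online-poll share when an
# estimated fraction of the responses is believed to be gamed.

def adjusted_share(raw_share, untrusted_fraction, untrusted_bias):
    """Back out the trusted-only share from a mixed sample.

    raw_share          -- share observed in the online poll (0..1)
    untrusted_fraction -- estimated fraction of gamed responses (0..1)
    untrusted_bias     -- assumed support among the gamed responses (0..1)
    """
    trusted_fraction = 1.0 - untrusted_fraction
    # The observed share is a mixture of trusted and gamed responses,
    # so subtract the gamed contribution and rescale.
    return (raw_share - untrusted_fraction * untrusted_bias) / trusted_fraction

# Example: a poll shows 65% support, we suspect 20% of responses were gamed,
# and nearly all of those (95%) favored the same candidate.
print(round(adjusted_share(0.65, 0.20, 0.95), 3))  # -> 0.575

The point is not this particular formula; it is that a stated untrusted percentage lets you correct the headline number rather than discard it.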


Back to the topic: perhaps what's needed is for data from sources that don't qualify as trusted to be sidelined, marked with a question mark, and counted toward the margin of error. And if a piece of data appears to be fake, a flag could be attached to it, so that flagged records can be deleted when doing the data mining.


Alternatively, perhaps a subsystem could allow tracing and tracking, but only at the national-security level, taking the information all the way down to the individual ISP and the actual user's identity. And if data were found to be false, it could simply be red-flagged as unreliable.


The reality is that you can't trust sources online, or any of the information you see there, just as you cannot take the newspapers word for word. Perhaps 95% of all intelligence gathered is junk; the trick is to sift through it, find the 5% that is grounded in reality, and recognize that even the misinformation often contains clues.


Thus, if questionable data is flagged prior to anonymization, you can widen your margin of error appropriately without ever retaining the actual identity behind any one piece of data in the whole database or data mine. Margins of error are often understated to purport better accuracy, usually to the detriment of the information and of the conclusions, solutions, or decisions made from that data.
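
As a rough picture of what flag-before-anonymize could look like in practice, here is a small sketch; the record fields, the hashing step, and the crude widening of the margin of error are all assumptions made for illustration, not a prescribed method.

import hashlib
import math

def flag_and_anonymize(records, trusted_sources):
    """Flag questionable records while the source is still known, then drop identities.

    records         -- list of dicts with 'source' and 'value' keys (hypothetical schema)
    trusted_sources -- set of source identifiers considered trustworthy
    """
    anonymized = []
    for rec in records:
        anonymized.append({
            # A one-way hash keeps records distinct but discards the identity.
            "id": hashlib.sha256(rec["source"].encode()).hexdigest()[:12],
            "value": rec["value"],
            # The trust decision is made *before* the identity is thrown away.
            "questionable": rec["source"] not in trusted_sources,
        })
    return anonymized

def widened_margin_of_error(anon_records, z=1.96):
    """Simple proportion-style margin of error, inflated by the share of flagged data."""
    n = len(anon_records)
    flagged = sum(r["questionable"] for r in anon_records)
    base = z * math.sqrt(0.25 / n)    # worst-case proportion p = 0.5
    return base * (1 + flagged / n)   # crude widening for the untrusted share

sample = [
    {"source": "user_a", "value": 1},
    {"source": "user_b", "value": 0},
    {"source": "sockpuppet_9", "value": 1},
]
anon = flag_and_anonymize(sample, trusted_sources={"user_a", "user_b"})
print(anon)
print(round(widened_margin_of_error(anon), 3))

The ordering is the whole trick: the flag is computed while the source is still visible, and only the flag, not the source, survives into the mined data set.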


And then there is the fudge factor: what about when you are collecting data to prove yourself right? Okay, let's talk about that, shall we? You really can't trust data to be unbiased if the dissemination, collection, processing, and accounting were done by a human being. Likewise, we know we cannot fully trust government data or projections.


Consider, if you will, the problems with trusting the OMB numbers and economic data on the financial reform bill, or the cost of the ObamaCare healthcare bill. Other economic data has also been known to be false, and even the bank stress tests in China, the EU, and the United States are questionable. For instance, consumer and investor confidence is very important, so false data is often put out, or real data is massaged before it is released to the public. Hey, I am not an anti-government guy, and I realize we need the bureaucracy for some things, but I am wise enough to realize that humans run the government, there is a lot of power involved, and humans like to retain and acquire more of that power. We can expect that.


And we can expect folks putting out information under fake screen names or pen names to be less than trustworthy as well; that's all I am saying here. Look, it's not just the government; corporations do it too as they attempt to put a good spin on their quarterly earnings and balance sheets, move assets around, or give forward-looking projections.


Even when we look at the data from the Fed's Beige Book, we could say that almost all of it is hearsay, because the Fed's district banks generally do not indicate exactly which of their clients, customers, or friends in industry gave them which pieces of information. Thus we don't know what we can trust, and so we must assume we can't trust any of it unless we can identify the source prior to its inclusion in the research, report, or mined data query.


This is nothing new; it's the same for all information, whether we read it in the newspaper or our intelligence community learns of new details. Check sources. If we don't check the sources in advance, the correct thing to do is to raise the assumed probability that the information is incorrect; at some point the margin of error goes hyperbolic on you and you have to throw the whole thing out, and then I ask: why collect it in the first place?
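
As a purely illustrative calculation of why unchecked sourcing eventually "goes hyperbolic" on you: if each unchecked item independently carries some chance of being wrong, the odds that a conclusion resting on all of them is sound collapse quickly. The 10% per-item error rate below is an assumed number, not anything measured.

# Hypothetical back-of-the-envelope: probability that every one of n
# unchecked items is correct, assuming a fixed per-item error rate.
def chance_all_correct(per_item_error, n_items):
    return (1.0 - per_item_error) ** n_items

for n in (1, 5, 20, 50):
    print(n, round(chance_all_correct(0.10, n), 3))
# 1 -> 0.9, 5 -> 0.59, 20 -> 0.122, 50 -> 0.005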


Ah hell, this is all just philosophy on the accuracy of data mining. Grab yourself a cup of coffee, think about it and email your comments and questions.


By Lance Winslow



Comment by Eric Garland on September 5, 2010 at 11:11am
Lance, thanks for this, and hoping to read more great stuff from you in this space.

You are right on time and topic with this issue. As professional gatherers of data (and hopefully professional analysts of same) it is impossible to ignore the creation of an unprecedented treasure trove of information waiting for insight to be mined from its unplumbed depths. And yet, at the same time, the non-anonymous nature of this information raises massive questions about the future of freedom, privacy and even democracy. And I daresay that such questions are far, far more philosophical than the average ethical debates that normally impact intelligence. This is HUGE and nobody has ever had to face issues of this magnitude. The history of market research is such that we could derive insight about the customer without ever really following each individual person. Cars or clothes or pharmaceuticals might be improved by such research, without serious damage to any one individual's liberty. Now, with the availability of person-specific Facebook data (available for sale, mind you!), FourSquare reports of your physical whereabouts, and increasing self-identification through Twitter and LinkedIn, we'll be able to use the macroscopic information of general trends, but for the first time in history, drill down to what YOU, BOB JENKINS or MARIA GONZALEZ want in terms of food, furniture, or software features. We must consider how powerful this is and be prepared to develop codes of ethics, or we shall collectively suffer the consequences.

Then you dive into questions of government transparency...man...that's a whole 'nother set of timeless questions.

Question to CI Ning members - will the "codes of ethics" to which we adhere need to change in the future?
