<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Datamining for terrorists would be lovely if it worked</title>
	<atom:link href="http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/</link>
	<description>Ben Goldacre&#039;s Bad Science column from the Guardian and more...</description>
	<lastBuildDate>Fri, 10 Feb 2012 11:24:40 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: annsaet</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-34960</link>
		<dc:creator>annsaet</dc:creator>
		<pubDate>Thu, 21 Oct 2010 22:07:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-34960</guid>
		<description>I&#039;m very happy to see this point - the one about the poor predictive value of positive hits using even the most precise algorithm when using it to pick out a small number of &quot;true positive&quot; cases in a large, innocent population - getting more traction. The bigger the haystack, the harder it is to find true needles. See also my article &quot;Nothing to hide, nothing to fear?&quot; in the International Criminal Justice Review (2007). I&#039;m also happy to see people picking up on the point of what we do have to fear if we give up our privacy in a moment of moral panic over terrorism. For instance: While it is hard to find terrorists in the haystack, it is relatively easy to find whistleblowers. Those in power whose misdeeds have been &quot;blown&quot; will know exactly what journalist got the story and approximately when. Not so hard to find out who said journalist was talking to at the time if you have her phone records, eh? And what about tracking your political opposition? This isn&#039;t hypothetical. See http://www.truth-out.org/attention-left-liberal-and-radical-groups-pennsylvania-has-been-monitoring-you63957. After all ... who gets to say what counts as &quot;terrorism&quot; once we give some mighty protector the right to sift through whatever he thinks he needs to see in order to protect us against it? None other than that mighty protector himself, is who! We know they&#039;re already abusing the powers they have (and some they don&#039;t). Why on earth would we voluntarily give them more? And who&#039;s being &quot;naive&quot; if we do?</description>
		<content:encoded><![CDATA[<p>I&#8217;m very happy to see this point &#8211; the one about the poor predictive value of positive hits using even the most precise algorithm when using it to pick out a small number of &#8220;true positive&#8221; cases in a large, innocent population &#8211; getting more traction. The bigger the haystack, the harder it is to find true needles. See also my article &#8220;Nothing to hide, nothing to fear?&#8221; in the International Criminal Justice Review (2007). I&#8217;m also happy to see people picking up on the point of what we do have to fear if we give up our privacy in a moment of moral panic over terrorism. For instance: While it is hard to find terrorists in the haystack, it is relatively easy to find whistleblowers. Those in power whose misdeeds have been &#8220;blown&#8221; will know exactly what journalist got the story and approximately when. Not so hard to find out who said journalist was talking to at the time if you have her phone records, eh? And what about tracking your political opposition? This isn&#8217;t hypothetical. See <a href="http://www.truth-out.org/attention-left-liberal-and-radical-groups-pennsylvania-has-been-monitoring-you63957" rel="nofollow">www.truth-out.org/attention-left-liberal-and-radical-groups-pennsylvania-has-been-monitoring-you63957</a>. After all &#8230; who gets to say what counts as &#8220;terrorism&#8221; once we give some mighty protector the right to sift through whatever he thinks he needs to see in order to protect us against it? None other than that mighty protector himself, is who! We know they&#8217;re already abusing the powers they have (and some they don&#8217;t). Why on earth would we voluntarily give them more? And who&#8217;s being &#8220;naive&#8221; if we do?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robert Carnegie</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25613</link>
		<dc:creator>Robert Carnegie</dc:creator>
		<pubDate>Mon, 16 Mar 2009 16:10:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25613</guid>
		<description>Regarding the IRA incidentally: the newly active again Regal IRA and Continental IRA have been estimated recently at 300 people.  If the figure of 400 members of the previous IRA is correct, then allowing for retirements that must be nearly all of them, with the possible exception of Martin McGuinness.  None of which surprises me.  Every time there is an Irish peace deal, the entire IRA membership disbands and reforms under the guise of being Different Republicans.</description>
		<content:encoded><![CDATA[<p>Regarding the IRA incidentally: the newly active again Regal IRA and Continental IRA have been estimated recently at 300 people.  If the figure of 400 members of the previous IRA is correct, then allowing for retirements that must be nearly all of them, with the possible exception of Martin McGuinness.  None of which surprises me.  Every time there is an Irish peace deal, the entire IRA membership disbands and reforms under the guise of being Different Republicans.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dslick</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25446</link>
		<dc:creator>dslick</dc:creator>
		<pubDate>Fri, 06 Mar 2009 19:46:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25446</guid>
		<description>Several posts have noted that iterative or recursive methods of data mining such as using neural network learning systems might lead to development of sufficiently accurate algorithms for separating signal (terrorists) from noise (non-terrorists) that it would be cost effective to thoroughly investigate all &quot;possible terrorists&quot; so identified. Such an approach has in fact been used fruitfully in other areas of science. However, this approach assumes that essential properties of the objects being detected do not change fundamentally over time. I&#039;m not confident that such an approach would work well in a situation where terrorists&#039; methods for communicating, organizing themselves, recruiting, raising funds, etc. actively evolve over time with a primary goal of remaining undetected.

Another concern with data mining is that sophisticated terrorist organizations that are able to learn about important variables in data mining algorithms would not only be able to change their operations to avoid detection, but might actually set up operations designed to massively increase false positives in order to bog down the system. For example, they might set up an &quot;operation&quot; that is specifically designed to be discovered for purposes of implicating very large numbers of non-terrorists through linkages used by the search algorithm for identifying suspects (e.g., email communications).</description>
		<content:encoded><![CDATA[<p>Several posts have noted that iterative or recursive methods of data mining such as using neural network learning systems might lead to development of sufficiently accurate algorithms for separating signal (terrorists) from noise (non-terrorists) that it would be cost effective to thoroughly investigate all &#8220;possible terrorists&#8221; so identified. Such an approach has in fact been used fruitfully in other areas of science. However, this approach assumes that essential properties of the objects being detected do not change fundamentally over time. I&#8217;m not confident that such an approach would work well in a situation where terrorists&#8217; methods for communicating, organizing themselves, recruiting, raising funds, etc. actively evolve over time with a primary goal of remaining undetected.</p>
<p>Another concern with data mining is that sophisticated terrorist organizations that are able to learn about important variables in data mining algorithms would not only be able to change their operations to avoid detection, but might actually set up operations designed to massively increase false positives in order to bog down the system. For example, they might set up an &#8220;operation&#8221; that is specifically designed to be discovered for purposes of implicating very large numbers of non-terrorists through linkages used by the search algorithm for identifying suspects (e.g., email communications).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: misterjohn</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25437</link>
		<dc:creator>misterjohn</dc:creator>
		<pubDate>Thu, 05 Mar 2009 18:34:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25437</guid>
		<description>Just to put in my twopennyworth about the use of tests.

Question

Suppose you were tested in a large-scale screening programme for a disease known to affect one person in a hundred.

The test is 90% accurate.
You test POSITIVE.

What is the probability that you have the disease?


Answer

Imagine testing 1000 people: 

10 have the disease, so at 90% accuracy we get 9 hits, 1 miss 

990 have no disease, but at 90% accuracy we will also get 99 false positives 


Therefore:

You are one of 108 people to get a positive result, but only 9 of them have the disease

P = 9/108 = 1/12

And that&#039;s when we know both how accurate the test is and the likelihood of having the disease. As has already been said, would anyone like to suggest prior probabilities for &quot;being a terrorist&quot;, or the degree of accuracy of these tests?</description>
		<content:encoded><![CDATA[<p>Just to put in my twopennyworth about the use of tests.</p>
<p>Question</p>
<p>Suppose you were tested in a large-scale screening programme for a disease known to affect one person in a hundred.</p>
<p>The test is 90% accurate.<br />
You test POSITIVE.</p>
<p>What is the probability that you have the disease?</p>
<p>Answer</p>
<p>Imagine testing 1000 people: </p>
<p>10 have the disease, so at 90% accuracy we get 9 hits, 1 miss </p>
<p>990 have no disease, but at 90% accuracy we will also get 99 false positives </p>
<p>Therefore:</p>
<p>You are one of 108 people to get a positive result, but only 9 of them have the disease</p>
<p>P = 9/108 = 1/12</p>
<p>And that&#8217;s when we know both how accurate the test is and the likelihood of having the disease. As has already been said, would anyone like to suggest prior probabilities for &#8220;being a terrorist&#8221;, or the degree of accuracy of these tests?</p>
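<p>The worked answer above can be checked with a short Python sketch. It assumes that &#8220;90% accurate&#8221; means both 90% sensitivity and 90% specificity, which the comment does not state explicitly; the function name <code>positive_predictive_value</code> is illustrative only.</p>

```python
from fractions import Fraction


def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return P(has condition | tests positive) using Bayes' theorem."""
    true_pos = prevalence * sensitivity               # diseased and flagged
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy but flagged
    return true_pos / (true_pos + false_pos)


# Assumption: "90% accurate" is read as 90% sensitivity AND 90% specificity.
ppv = positive_predictive_value(
    prevalence=Fraction(1, 100),
    sensitivity=Fraction(9, 10),
    specificity=Fraction(9, 10),
)
print(ppv)  # 1/12
```

<p>Exact fractions are used so the result matches the 9/108 = 1/12 figure without rounding.</p>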
]]></content:encoded>
	</item>
	<item>
		<title>By: heng</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25372</link>
		<dc:creator>heng</dc:creator>
		<pubDate>Tue, 03 Mar 2009 17:52:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25372</guid>
		<description>njdowrick @72
Any algorithm ideally uses *all* available information. It is invariably worse to consider individual tests on individual pieces of information and then attempt to combine the results than it is to do tests on all the data at once. The reason for this is that it is much harder to consider the correlations between pieces of data if you process them separately. Of course, if there are no correlations then you don&#039;t lose anything, but you don&#039;t gain anything either.</description>
		<content:encoded><![CDATA[<p>njdowrick @72<br />
Any algorithm ideally uses *all* available information. It is invariably worse to consider individual tests on individual pieces of information and then attempt to combine the results than it is to do tests on all the data at once. The reason for this is that it is much harder to consider the correlations between pieces of data if you process them separately. Of course, if there are no correlations then you don&#8217;t lose anything, but you don&#8217;t gain anything either.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: njdowrick</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25366</link>
		<dc:creator>njdowrick</dc:creator>
		<pubDate>Tue, 03 Mar 2009 17:23:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25366</guid>
		<description>69 Queex

Of course genes *can* be correlated, but they don&#039;t have to be. Surely you&#039;d agree that if you are looking for a person who you know lives in Birmingham, regularly eats Cornish Yarg, and collects calculators you&#039;ll get a far smaller set through using all three filters than through using each one individually and ignoring the others. That&#039;s really the only point I&#039;m making.

I am certainly not arguing that the proposed system would work (still less that it would be a good thing). All I&#039;m saying is that Ben&#039;s order-of-magnitude estimates just aren&#039;t convincing to me given the huge range of evidence that could potentially be available. Ben&#039;s general discussion failed to persuade me that the idea was obviously nonsense; of course, I fully realize that this may say as much about my intelligence (or my knowledge of the field) as it does about Ben&#039;s arguments! To convince me I&#039;d need to know in detail what was proposed; rough estimates that might be reasonable in different fields might not apply here.</description>
		<content:encoded><![CDATA[<p>69 Queex</p>
<p>Of course genes *can* be correlated, but they don&#8217;t have to be. Surely you&#8217;d agree that if you are looking for a person who you know lives in Birmingham, regularly eats Cornish Yarg, and collects calculators you&#8217;ll get a far smaller set through using all three filters than through using each one individually and ignoring the others. That&#8217;s really the only point I&#8217;m making.</p>
<p>I am certainly not arguing that the proposed system would work (still less that it would be a good thing). All I&#8217;m saying is that Ben&#8217;s order-of-magnitude estimates just aren&#8217;t convincing to me given the huge range of evidence that could potentially be available. Ben&#8217;s general discussion failed to persuade me that the idea was obviously nonsense; of course, I fully realize that this may say as much about my intelligence (or my knowledge of the field) as it does about Ben&#8217;s arguments! To convince me I&#8217;d need to know in detail what was proposed; rough estimates that might be reasonable in different fields might not apply here.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: memotypic</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25275</link>
		<dc:creator>memotypic</dc:creator>
		<pubDate>Tue, 03 Mar 2009 13:57:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25275</guid>
		<description>A ficus for their anger. Hmm.

Got a gripe? Get a fig!  :)

F*o*cus...</description>
		<content:encoded><![CDATA[<p>A ficus for their anger. Hmm.</p>
<p>Got a gripe? Get a fig!  <img src='http://www.badscience.net/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>F*o*cus&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: memotypic</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25274</link>
		<dc:creator>memotypic</dc:creator>
		<pubDate>Tue, 03 Mar 2009 13:56:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25274</guid>
		<description>This is all very well, but either way the system is a loser (and I&#039;m not sure which outcome is worse).

1. System is pants: Little change except a few innocents get harassed and probably we lose a few more civil liberties along the way.
2. System works by some miracle and isn&#039;t polluted by savvy people spamming innocents with attack plans and making spurious car journeys etc.

Frankly, god help us if it is (2). If there really were an efficient system it would really become a (~civil) war at that point. Nothing recruits people (who are largely drifting and disappointed with life and looking for a way out and a ficus for their anger) like having a &#039;real&#039; enemy who is good at fighting you. A good system would make things much worse. Every new prisoner a martyr, so the martyr count soars. Every relative has grounds to protest against a secret assessment system (they could not reveal how it works/weights because that would kill it).

The only comfort is that governments are usually bloody awful at anything like this. They can barely spell &#039;computer&#039; let alone use them effectively (because those purchasing and those using are largely different groups, each of which consists of committees of committees). And they don&#039;t seem to be able to keep hold of them either. 

Basically the &#039;Dream of Putin&#039; (/Cheney+/whoever) could never be a reality. Reality is complicated -- just ask those in a police state who never did quite manage to get everyone (partly because it is tricky, but probably mostly because for every one sibling/lover/child/parent they captured, they made another two dissenters -- hydra-style).

You can&#039;t crush jelly in your hands.</description>
		<content:encoded><![CDATA[<p>This is all very well, but either way the system is a loser (and I&#8217;m not sure which outcome is worse).</p>
<p>1. System is pants: Little change except a few innocents get harassed and probably we lose a few more civil liberties along the way.<br />
2. System works by some miracle and isn&#8217;t polluted by savvy people spamming innocents with attack plans and making spurious car journeys etc.</p>
<p>Frankly, god help us if it is (2). If there really were an efficient system it would really become a (~civil) war at that point. Nothing recruits people (who are largely drifting and disappointed with life and looking for a way out and a ficus for their anger) like having a &#8216;real&#8217; enemy who is good at fighting you. A good system would make things much worse. Every new prisoner a martyr, so the martyr count soars. Every relative has grounds to protest against a secret assessment system (they could not reveal how it works/weights because that would kill it).</p>
<p>The only comfort is that governments are usually bloody awful at anything like this. They can barely spell &#8216;computer&#8217; let alone use them effectively (because those purchasing and those using are largely different groups, each of which consists of committees of committees). And they don&#8217;t seem to be able to keep hold of them either. </p>
<p>Basically the &#8216;Dream of Putin&#8217; (/Cheney+/whoever) could never be a reality. Reality is complicated &#8212; just ask those in a police state who never did quite manage to get everyone (partly because it is tricky, but probably mostly because for every one sibling/lover/child/parent they captured, they made another two dissenters &#8212; hydra-style).</p>
<p>You can&#8217;t crush jelly in your hands.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Queex</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25273</link>
		<dc:creator>Queex</dc:creator>
		<pubDate>Tue, 03 Mar 2009 10:57:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25273</guid>
		<description>65 njdowrick:

The trouble is it&#039;s not necessarily the case that those genes are independent. If they&#039;re close to one another on the chromosome, they are in fact quite highly correlated.

I don&#039;t know about other fields in genetics, but in genome-wide association studies a &#039;SNP&#039; with a rarity of 1% is on the cusp of being rejected for being too rare to work with reliably.

But how independent can any of the factors in the proposed system be? Someone who buys a book on the history of Islamist terrorism is more likely to use flagged words or phrases in their email, independently of whether or not they have any terrorist leanings themselves. The lack of independence makes a big difference to how much better your test gets.

One point Ben originally made is that the positive predictive value goes down as what you&#039;re looking for becomes rarer. Even if you have a super-effective test, if the proportion of &#039;cases&#039; is super-small, you are still hip-deep in the cacky. Trying to dodge this problem by winnowing down the candidates first doesn&#039;t avoid the problem; it&#039;s just another way of trading away sensitivity for specificity. So to get a workable system in this context (generously allowing 1000 &#039;terrorists&#039; out of 60 million) your test needs to have an accuracy that many precise chemical or physical tests struggle to meet. Expecting to do so with behavioural analysis is foolish.</description>
		<content:encoded><![CDATA[<p>65 njdowrick:</p>
<p>The trouble is it&#8217;s not necessarily the case that those genes are independent. If they&#8217;re close to one another on the chromosome, they are in fact quite highly correlated.</p>
<p>I don&#8217;t know about other fields in genetics, but in genome-wide association studies a &#8216;SNP&#8217; with a rarity of 1% is on the cusp of being rejected for being too rare to work with reliably.</p>
<p>But how independent can any of the factors in the proposed system be? Someone who buys a book on the history of Islamist terrorism is more likely to use flagged words or phrases in their email, independently of whether or not they have any terrorist leanings themselves. The lack of independence makes a big difference to how much better your test gets.</p>
<p>One point Ben originally made is that the positive predictive value goes down as what you&#8217;re looking for becomes rarer. Even if you have a super-effective test, if the proportion of &#8216;cases&#8217; is super-small, you are still hip-deep in the cacky. Trying to dodge this problem by winnowing down the candidates first doesn&#8217;t avoid the problem; it&#8217;s just another way of trading away sensitivity for specificity. So to get a workable system in this context (generously allowing 1000 &#8216;terrorists&#8217; out of 60 million) your test needs to have an accuracy that many precise chemical or physical tests struggle to meet. Expecting to do so with behavioural analysis is foolish.</p>
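<p>To put a rough number on &#8220;an accuracy that many precise tests struggle to meet&#8221;, here is a hedged Python sketch using the figures above (1000 cases in 60 million). The perfect sensitivity and the target that half of all flagged people are genuine cases are assumptions chosen purely for illustration, and <code>required_specificity</code> is a hypothetical helper name.</p>

```python
def required_specificity(population, cases, sensitivity, target_ppv):
    """Specificity needed so that a fraction `target_ppv` of positives are real cases."""
    true_pos = cases * sensitivity
    # Number of false positives we can tolerate for the target predictive value.
    allowed_false_pos = true_pos * (1 - target_ppv) / target_ppv
    # Spread that allowance over everyone who is NOT a case.
    return 1 - allowed_false_pos / (population - cases)


# Assumptions: the test never misses a real case (sensitivity 1.0) and we
# only ask that half of the flagged people turn out to be real cases.
spec = required_specificity(60_000_000, 1_000, sensitivity=1.0, target_ppv=0.5)
print(f"{spec:.5%}")
```

<p>Under these generous assumptions the test may wrongly flag only about 1 innocent person in 60,000, i.e. a specificity of roughly 99.998%.</p>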
]]></content:encoded>
	</item>
	<item>
		<title>By: jodyaberdein</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25268</link>
		<dc:creator>jodyaberdein</dc:creator>
		<pubDate>Mon, 02 Mar 2009 20:52:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25268</guid>
		<description>One irritation I have is with the word terrorist.  Often it seems that there is deliberate circumspection as to what the word actually means.  Certainly if you commit an act of violence or threaten it then detection becomes quite easy.  So how to detect those covertly cooking up an act of destruction?  I don&#039;t know.  I am quite interested in how another criterion for screening might apply though: that there has to be a treatment that works and is acceptable once you have your screen positive population. Internment anyone?</description>
		<content:encoded><![CDATA[<p>One irritation I have is with the word terrorist.  Often it seems that there is deliberate circumspection as to what the word actually means.  Certainly if you commit an act of violence or threaten it then detection becomes quite easy.  So how to detect those covertly cooking up an act of destruction?  I don&#8217;t know.  I am quite interested in how another criterion for screening might apply though: that there has to be a treatment that works and is acceptable once you have your screen positive population. Internment anyone?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: richard_p_auckland</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25267</link>
		<dc:creator>richard_p_auckland</dc:creator>
		<pubDate>Mon, 02 Mar 2009 20:45:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25267</guid>
		<description>I&#039;d point something else out regarding the figure of 10,000 &quot;terrorists&quot;.

The IRA were reckoned to have had no more than 400 active members during the Troubles. With this small group, they were able to achieve death and disruption on a pretty regular basis.

The alleged Islamic terrorists operating in the UK have mounted one fatal and two or three non-fatal attacks in the last eight years. I&#039;d deduce from this that they are either a bunch of fantasists with either no ability or no inclination to mount attacks, or that there are very, very few of them.

Data mining for 20 terrorists would be even harder than for 10,000.</description>
		<content:encoded><![CDATA[<p>I&#8217;d point something else out regarding the figure of 10,000 &#8220;terrorists&#8221;.</p>
<p>The IRA were reckoned to have had no more than 400 active members during the Troubles. With this small group, they were able to achieve death and disruption on a pretty regular basis.</p>
<p>The alleged Islamic terrorists operating in the UK have mounted one fatal and two or three non-fatal attacks in the last eight years. I&#8217;d deduce from this that they are either a bunch of fantasists with either no ability or no inclination to mount attacks, or that there are very, very few of them.</p>
<p>Data mining for 20 terrorists would be even harder than for 10,000.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ajw</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25265</link>
		<dc:creator>ajw</dc:creator>
		<pubDate>Mon, 02 Mar 2009 18:47:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25265</guid>
		<description>This report says much the same thing:

Effective Counterterrorism and the Limited Role of Predictive Data Mining
Jeff Jonas and Jim Harper
Policy Analysis No 584
Cato Institute
11 December 2006

Full report: http://www.cato.org/pub_display.php?pub_id=6784

Abstract

The terrorist attacks on September 11, 2001, spurred extraordinary efforts intended to protect America from the newly highlighted scourge of international terrorism. Among the efforts was the consideration and possible use of &quot;data mining&quot; as a way to discover planning and preparation for terrorism. Data mining is the process of searching data for previously unknown patterns and using those patterns to predict future outcomes.

Information about key members of the 9/11 plot was available to the U.S. government prior to the attacks, and the 9/11 terrorists were closely connected to one another in a multitude of ways. The National Commission on Terrorist Attacks upon the United States concluded that, by pursuing the leads available to it at the time, the government might have derailed the plan.

Though data mining has many valuable uses, it is not well suited to the terrorist discovery problem. It would be unfortunate if data mining for terrorism discovery had currency within national security, law enforcement, and technology circles because pursuing this use of data mining would waste taxpayer dollars, needlessly infringe on privacy and civil liberties, and misdirect the valuable time and energy of the men and women in the national security community.

What the 9/11 story most clearly calls for is a sharper focus on the part of our national security agencies—their focus had undoubtedly sharpened by the end of the day on September 11, 2001—along with the ability to efficiently locate, access, and aggregate information about specific suspects.</description>
		<content:encoded><![CDATA[<p>This report says much the same thing:</p>
<p>Effective Counterterrorism and the Limited Role of Predictive Data Mining<br />
Jeff Jonas and Jim Harper<br />
Policy Analysis No 584<br />
Cato Institute<br />
11 December 2006</p>
<p>Full report: <a href="http://www.cato.org/pub_display.php?pub_id=6784" rel="nofollow">www.cato.org/pub_display.php?pub_id=6784</a></p>
<p>Abstract</p>
<p>The terrorist attacks on September 11, 2001, spurred extraordinary efforts intended to protect America from the newly highlighted scourge of international terrorism. Among the efforts was the consideration and possible use of &#8220;data mining&#8221; as a way to discover planning and preparation for terrorism. Data mining is the process of searching data for previously unknown patterns and using those patterns to predict future outcomes.</p>
<p>Information about key members of the 9/11 plot was available to the U.S. government prior to the attacks, and the 9/11 terrorists were closely connected to one another in a multitude of ways. The National Commission on Terrorist Attacks upon the United States concluded that, by pursuing the leads available to it at the time, the government might have derailed the plan.</p>
<p>Though data mining has many valuable uses, it is not well suited to the terrorist discovery problem. It would be unfortunate if data mining for terrorism discovery had currency within national security, law enforcement, and technology circles because pursuing this use of data mining would waste taxpayer dollars, needlessly infringe on privacy and civil liberties, and misdirect the valuable time and energy of the men and women in the national security community.</p>
<p>What the 9/11 story most clearly calls for is a sharper focus on the part of our national security agencies—their focus had undoubtedly sharpened by the end of the day on September 11, 2001—along with the ability to efficiently locate, access, and aggregate information about specific suspects.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: njdowrick</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25263</link>
		<dc:creator>njdowrick</dc:creator>
		<pubDate>Mon, 02 Mar 2009 17:57:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25263</guid>
		<description>Re: #59 (Queex)

I think that I probably do misunderstand something; I&#039;m certainly no expert. Thank you for trying to enlighten me!

The picture I had in mind was something like this. Let&#039;s suppose there&#039;s a (completely accurate) database containing DNA profiles for everyone in the country, including me. Suppose that you know that one of my genes is carried by only 1% of the population. You search the entire database, and find 6*10^5 matches. Although these include me, the result is not much use.

But if I have two other genes, each carried by 1% of the population, and if the probability of finding one of these genes in one person&#039;s genome is independent of whether either of the other two are present, then I can combine the tests by looking only for people who carry all three genes. In this case I&#039;d find about 60 matches, including me, which is not so bad.

That&#039;s all I was saying: if there are several genuinely independent filters that can be applied to the set of people being considered, the specificity of the test becomes much better. Whether this can in practice be applied to the case Ben was writing about, I don&#039;t know, but to me (knowing nothing about the subject) his claim that a specificity of 99.9% is unrealistically high did not seem to be obvious, given the possibility that multiple filters might be applied.

I hope this clarifies what I had in mind when I wrote my previous responses.</description>
		<content:encoded><![CDATA[<p>Re: #59 (Queex)</p>
<p>I think that I probably do misunderstand something; I&#8217;m certainly no expert. Thank you for trying to enlighten me!</p>
<p>The picture I had in mind was something like this. Let&#8217;s suppose there&#8217;s a (completely accurate) database containing DNA profiles for everyone in the country, including me. Suppose that you know that one of my genes is carried by only 1% of the population. You search the entire database, and find 6*10^5 matches. Although these include me, the result is not much use.</p>
<p>But if I have two other genes, each carried by 1% of the population, and if the probability of finding one of these genes in one person&#8217;s genome is independent of whether either of the other two are present, then I can combine the tests by looking only for people who carry all three genes. In this case I&#8217;d find about 60 matches, including me, which is not so bad.</p>
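<p>(A rough sketch of that arithmetic in Python. The 60 million population is an assumption, implied by the 6*10^5 figure above, and the function name is made up for illustration.)</p>

```python
# Expected number of database matches when several filters, each passed
# by a given fraction of people, are applied together.
# Assumption: the filters are independent, so pass rates simply multiply.
POPULATION = 60_000_000  # assumed; implied by 6*10^5 matches at 1%


def expected_matches(population, frequencies):
    """Scale the population by each filter's pass rate in turn."""
    matches = float(population)
    for frequency in frequencies:
        matches *= frequency
    return matches


one_gene = expected_matches(POPULATION, [0.01])
three_genes = expected_matches(POPULATION, [0.01, 0.01, 0.01])
print(round(one_gene), round(three_genes))
```

<p>The whole reduction rests on the independence assumption; if the filters overlap, the real number of matches will be higher.</p>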
<p>That&#8217;s all I was saying: if there are several genuinely independent filters that can be applied to the set of people being considered, the specificity of the test becomes much better. Whether this can in practice be applied to the case Ben was writing about, I don&#8217;t know, but to me (knowing nothing about the subject) his claim that a specificity of 99.9% is unrealistically high did not seem to be obvious, given the possibility that multiple filters might be applied.</p>
<p>I hope this clarifies what I had in mind when I wrote my previous responses.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Frank</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25261</link>
		<dc:creator>Mark Frank</dc:creator>
		<pubDate>Mon, 02 Mar 2009 17:52:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25261</guid>
		<description>Re #61

&lt;i&gt;The same can be said of all screening, including that for cancer research. But the critical flaw still remains. When screening for a rare phenomenon, you’re going to get several orders of magnitude more false positives than true positives. &lt;/i&gt;

OK - let&#039;s get some examples on the table.

1) Suppose the question you are trying to answer is - do we need to raise the national threat level? Then you are not trying to identify a terrorist. You are just asking whether there is at least one unidentified terrorist somewhere in the UK. This changes the maths completely. It is like running a screen which tests for smallpox (if there is one) to answer the question: is there at least one case of smallpox in the country?

2) Suppose intelligence information suggests an imminent threat from a group based in Algeria who recently sent operatives to Leeds. You then run a pattern which is restricted to people living near Leeds and recently arrived from Algeria. Three things are going on:

(a) the question is more focussed - we are looking for a specific terrorist

(b) the specificity and sensitivity have shot up to very high levels while the base rate has also increased (although less dramatically)

(c) you only have to identify one of the operatives and then follow them to forestall the operation

I doubt anyone would attempt to quantify the specificity and sensitivity but they could run the test and see how many matches they get. It might be thousands which would be impossibly large. But for such a restricted search it might well be less than a hundred. Of course you have missed out on all sorts of possible positives who are not living in Leeds and recently arrived from Algeria. But that&#039;s not the problem right now. We are responding to this particular intelligence.

As the NAP book says - this is pretty much the kind of thing the intelligence services would do anyway. Are you suggesting it would be cheaper to do it manually?</description>
		<content:encoded><![CDATA[<p>Re #61</p>
<p><i>The same can be said of all screening, including that for cancer research. But the critical flaw still remains. When screening for a rare phenomenon, you’re going to get several orders of magnitude more false positives than true positives. </i></p>
<p>OK &#8211; let&#8217;s get some examples on the table.</p>
<p>1) Suppose the question you are trying to answer is &#8211; do we need to raise the national threat level? Then you are not trying to identify a terrorist. You are just asking whether there is at least one unidentified terrorist somewhere in the UK. This changes the maths completely. It is like running a screen which tests for smallpox (if there is one) to answer the question: is there at least one case of smallpox in the country?</p>
<p>2) Suppose intelligence information suggests an imminent threat from a group based in Algeria who recently sent operatives to Leeds. You then run a pattern which is restricted to people living near Leeds and recently arrived from Algeria. Three things are going on:</p>
<p>(a) the question is more focussed &#8211; we are looking for a specific terrorist</p>
<p>(b) the specificity and sensitivity have shot up to very high levels while the base rate has also increased (although less dramatically)</p>
<p>(c) you only have to identify one of the operatives and then follow them to forestall the operation</p>
<p>I doubt anyone would attempt to quantify the specificity and sensitivity but they could run the test and see how many matches they get. It might be thousands which would be impossibly large. But for such a restricted search it might well be less than a hundred. Of course you have missed out on all sorts of possible positives who are not living in Leeds and recently arrived from Algeria. But that&#8217;s not the problem right now. We are responding to this particular intelligence.</p>
<p>As the NAP book says &#8211; this is pretty much the kind of thing the intelligence services would do anyway. Are you suggesting it would be cheaper to do it manually?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mikewhit</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25260</link>
		<dc:creator>mikewhit</dc:creator>
		<pubDate>Mon, 02 Mar 2009 17:15:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25260</guid>
		<description>It&#039;s surely entirely possible that all this talk by the Govt. is just for show/window-dressing/discouraging not-so-clever would-be villains ... like those ads about &quot;The Database&quot; that knows about you - was that TV Licensing or tax dodging ?

Also the times that &quot;satellite tracking&quot; gets used by the Govt/media leaving behind the belief in the ill-informed that a satellite is actually watching your every move, whereas it&#039;s just a GPS receiver.</description>
		<content:encoded><![CDATA[<p>It&#8217;s surely entirely possible that all this talk by the Govt. is just for show/window-dressing/discouraging not-so-clever would-be villains &#8230; like those ads about &#8220;The Database&#8221; that knows about you &#8211; was that TV Licensing or tax dodging ?</p>
<p>Also the times that &#8220;satellite tracking&#8221; gets used by the Govt/media leaving behind the belief in the ill-informed that a satellite is actually watching your every move, whereas it&#8217;s just a GPS receiver.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: brachyury</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25259</link>
		<dc:creator>brachyury</dc:creator>
		<pubDate>Mon, 02 Mar 2009 15:57:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25259</guid>
		<description>we are going round in circles-- with people pretending that datamining = supervised learning against the whole population using the highest bandwidth lowest utility form of data.

I presume they only keep all the data because they don&#039;t know who might become a suspect.

There are perfectly sensible ways in which you might search subsets of subjects with immediate contacts to known suspects-- either using low powered supervised learning-- or using unsupervised (exploratory datamining) to look for patterns amongst the suspects and their immediate contacts.

If you don&#039;t like the idea of surveillance- just say you don&#039;t like it - don&#039;t construct ludicrous straw methodologies to diss.</description>
		<content:encoded><![CDATA[<p>we are going round in circles&#8211; with people pretending that datamining = supervised learning against the whole population using the highest bandwidth lowest utility form of data.</p>
<p>I presume they only keep all the data because they don&#8217;t know who might become a suspect.</p>
<p>There are perfectly sensible ways in which you might search subsets of subjects with immediate contacts to known suspects&#8211; either using low powered supervised learning&#8211; or using unsupervised (exploratory datamining) to look for patterns amongst the suspects and their immediate contacts.</p>
<p>If you don&#8217;t like the idea of surveillance- just say you don&#8217;t like it &#8211; don&#8217;t construct ludicrous straw methodologies to diss.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: The Biologista</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25258</link>
		<dc:creator>The Biologista</dc:creator>
		<pubDate>Mon, 02 Mar 2009 15:42:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25258</guid>
		<description>Mark Frank: &quot;Of course it is true that you can’t assess the sensitivity and specificity of some kind of test for terrorists, especially as they are going to vary wildly from one situation to another. But that’s not my reason for suggesting the cancer screening model is not appropriate. The most important reasons are that the security forces may be trying to answer a different question and that merging with other information sources can make a dramatic difference to the usefulness of the model.&quot;

The same can be said of all screening, including that for cancer research. But the critical flaw still remains. When screening for a rare phenomenon, you&#039;re going to get several orders of magnitude more false positives than true positives. When that is scaled up to a population in the millions and a rarity on the order of one in hundreds of thousands, just what sort of other techniques are really going to prevent thousands of innocent people from becoming suspects?</description>
		<content:encoded><![CDATA[<p>Mark Frank: &#8220;Of course it is true that you can’t assess the sensitivity and specificity of some kind of test for terrorists, especially as they are going to vary wildly from one situation to another. But that’s not my reason for suggesting the cancer screening model is not appropriate. The most important reasons are that the security forces may be trying to answer a different question and that merging with other information sources can make a dramatic difference to the usefulness of the model.&#8221;</p>
<p>The same can be said of all screening, including that for cancer research. But the critical flaw still remains. When screening for a rare phenomenon, you&#8217;re going to get several orders of magnitude more false positives than true positives. When that is scaled up to a population in the millions and a rarity on the order of one in hundreds of thousands, just what sort of other techniques are really going to prevent thousands of innocent people from becoming suspects?</p>
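<p>(To put rough numbers on that, here is a minimal sketch in Python. Every figure in it is an illustrative assumption rather than a measured value: 60 million people, 1 in 100,000 of them a genuine target, a test that catches 99% of targets and has the 99.9% specificity discussed earlier in the thread.)</p>

```python
def screen(population, prevalence, sensitivity, specificity):
    """Return (true_positives, false_positives, ppv) for a one-shot screen."""
    actual = population * prevalence
    innocent = population - actual
    true_pos = actual * sensitivity
    false_pos = innocent * (1.0 - specificity)
    # Positive predictive value: the share of flagged people who are real targets.
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv


# Illustrative assumptions only, not real-world estimates.
tp, fp, ppv = screen(60_000_000, 1 / 100_000, 0.99, 0.999)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"chance a flagged person is a real target: {ppv:.1%}")
```

<p>With these made-up inputs there are roughly a hundred innocent people flagged for every real target, and the ratio gets worse as the specificity drops or the target becomes rarer.</p>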
]]></content:encoded>
	</item>
	<item>
		<title>By: heng</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25256</link>
		<dc:creator>heng</dc:creator>
		<pubDate>Mon, 02 Mar 2009 15:17:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25256</guid>
		<description>Queex @58 made the comment I was about to make (although probably a bit more politely).

Following from his discussion, if 2 algorithms, A and B yielded better combined results than algorithm C, then we would have a new algorithm, D, which split the data into 2 and passed it through A and B and combined the result. It&#039;s nonsensical to talk about an algorithm as anything other than a black box with inputs (all data) and outputs (terrorist or diseased or stolen credit card or whatever).

For those thinking that medical diagnostics is anything other than data mining, frankly, you are wrong. It&#039;s *exactly* the same problem. You have inputs (blood tests, histories, CT images, MRI images) and outputs (have disease X?). Diagnostic doctors are just Bayesian inference machines. If you could perfectly input the information into a computer, the computer would probably do a better job. Compare this to data mining for terrorists. You have inputs (phone taps, emails, CCTV, informants, police interviews etc) and you have outputs (is terrorist?). Once you&#039;ve defined your feature vector, in both cases you could use any of the widely researched machine learning algorithms. For the moment, assume you have enough training data, how would you define your feature vector?

In principle, if you had enough computation, your feature vector could be everything: full frame CCTV images (because anything in the image could be of interest), every email on the internet (which of course can&#039;t be encrypted), lossless PCM speech digitised from every phone call in the country (you need to keep accents and so on), police reports etc. Put them all into one big black box algorithm, along with masses of training data, and see what pops out the end. Of course, in the process you&#039;ll pretty much solve several of the biggest machine intelligence problems around today. It would need to use every email because any self respecting terrorist is not going to flood the net with information.

Alternatively, because you don&#039;t have infinite computing power, you could reduce your feature vector by passing all the data through a data reduction algorithm. This would either be &quot;tuned&quot; (concentrating on individuals, or areas of the country or whatever), which comes down to profiling, or it would be random, throwing away random pieces of information until we have a computationally tractable problem (this is problematic for the reason given above). Whether what&#039;s left is even potentially useful is anyone&#039;s guess. The real problem here is that (unlike credit card transactions) nobody has the slightest clue what real terrorists actually do that&#039;s different to non-terrorists - mostly because terrorists realise this and go out of their way to be normal.

Now we come onto the problem of training data. How many terrorists have there been in the UK in the last 50 years (1000 tops?). Assuming that all the data these terrorists generate can be data mined for terrorist traits, and that these traits are correlated across all the data, so useful learning can actually be performed, what sort of problem do we have? Let&#039;s assume that for every terrorist, there are 10 years of incriminating chatter to store and analyse. That means we have a sum total of 10,000 person years worth of incriminating data to analyse and train on. Compare this to the total chatter from the rest of the population: 3e9 person years. That swamps the amount of positive data with hideously noisy negative data. We require a non-trivial (!) amount of computation to train on that. 3 billion years worth of every CCTV in the country, every phone call in the country, every email in the country. Of course, this relies on all the information being available for every known terrorist...

Already it&#039;s a completely ludicrous proposal. I can&#039;t be bothered to think what the requirements would be to actually run such a preposterous scheme. The absolute, fundamental problem, is that the vast majority of people and the vast majority of data (even from terrorists) is in no way suggestive of terrorists. It&#039;s useless information that still has to be processed. Nothing beats: a) solving the problem in the first place; b) actually acquiring useful information (that means police and intelligence services work).</description>
		<content:encoded><![CDATA[<p>Queex @58 made the comment I was about to make (although probably a bit more politely).</p>
<p>Following from his discussion, if 2 algorithms, A and B yielded better combined results than algorithm C, then we would have a new algorithm, D, which split the data into 2 and passed it through A and B and combined the result. It&#8217;s nonsensical to talk about an algorithm as anything other than a black box with inputs (all data) and outputs (terrorist or diseased or stolen credit card or whatever).</p>
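<p>(As a toy sketch of that point, in Python: the field names and thresholds below are invented purely for illustration and are not anyone&#8217;s actual method. Algorithm D is just A and B wrapped up as a single black box.)</p>

```python
# Hypothetical stand-ins for algorithms A and B, each looking at a
# different part of the record. All names and thresholds are made up.
def algorithm_a(record):
    return record["calls_to_flagged_numbers"] >= 3


def algorithm_b(record):
    return record["unusual_purchases"] >= 2


def algorithm_d(record):
    """The combined black box: the whole record in, one yes/no answer out."""
    return algorithm_a(record) and algorithm_b(record)


person = {"calls_to_flagged_numbers": 4, "unusual_purchases": 0}
print(algorithm_a(person), algorithm_b(person), algorithm_d(person))
```

<p>From the outside, D is simply one more algorithm with its own sensitivity and specificity, which is why the single-test arithmetic still applies to it.</p>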
<p>For those thinking that medical diagnostics is anything other than data mining, frankly, you are wrong. It&#8217;s *exactly* the same problem. You have inputs (blood tests, histories, CT images, MRI images) and outputs (have disease X?). Diagnostic doctors are just Bayesian inference machines. If you could perfectly input the information into a computer, the computer would probably do a better job. Compare this to data mining for terrorists. You have inputs (phone taps, emails, CCTV, informants, police interviews etc) and you have outputs (is terrorist?). Once you&#8217;ve defined your feature vector, in both cases you could use any of the widely researched machine learning algorithms. For the moment, assume you have enough training data, how would you define your feature vector?</p>
<p>In principle, if you had enough computation, your feature vector could be everything: full frame CCTV images (because anything in the image could be of interest), every email on the internet (which of course can&#8217;t be encrypted), lossless PCM speech digitised from every phone call in the country (you need to keep accents and so on), police reports etc. Put them all into one big black box algorithm, along with masses of training data, and see what pops out the end. Of course, in the process you&#8217;ll pretty much solve several of the biggest machine intelligence problems around today. It would need to use every email because any self respecting terrorist is not going to flood the net with information.</p>
<p>Alternatively, because you don&#8217;t have infinite computing power, you could reduce your feature vector by passing all the data through a data reduction algorithm. This would either be &#8220;tuned&#8221; (concentrating on individuals, or areas of the country or whatever), which comes down to profiling, or it would be random, throwing away random pieces of information until we have a computationally tractable problem (this is problematic for the reason given above). Whether what&#8217;s left is even potentially useful is anyone&#8217;s guess. The real problem here is that (unlike credit card transactions) nobody has the slightest clue what real terrorists actually do that&#8217;s different to non-terrorists &#8211; mostly because terrorists realise this and go out of their way to be normal.</p>
<p>Now we come onto the problem of training data. How many terrorists have there been in the UK in the last 50 years (1000 tops?). Assuming that all the data these terrorists generate can be data mined for terrorist traits, and that these traits are correlated across all the data, so useful learning can actually be performed, what sort of problem do we have? Let&#8217;s assume that for every terrorist, there are 10 years of incriminating chatter to store and analyse. That means we have a sum total of 10,000 person years worth of incriminating data to analyse and train on. Compare this to the total chatter from the rest of the population: 3e9 person years. That swamps the amount of positive data with hideously noisy negative data. We require a non-trivial (!) amount of computation to train on that. 3 billion years worth of every CCTV in the country, every phone call in the country, every email in the country. Of course, this relies on all the information being available for every known terrorist&#8230;</p>
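<p>(The back-of-envelope sums, spelled out in Python. The terrorist count and the 10 years of chatter come from the paragraph above; the 60 million population and 50-year window are assumptions chosen because they reproduce the 3e9 figure.)</p>

```python
# Figures from the comment, plus two assumptions (population, time window)
# that together give the quoted 3e9 person-years.
terrorists = 1_000
years_of_chatter_each = 10
population = 60_000_000   # assumed
years_covered = 50        # assumed

positive_person_years = terrorists * years_of_chatter_each
total_person_years = population * years_covered
ratio = total_person_years // positive_person_years

print(positive_person_years)  # 10000
print(total_person_years)     # 3000000000
print(ratio)                  # 300000
```

<p>So on these assumptions each person-year of incriminating data is buried under about 300,000 person-years of everything else.</p>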
<p>Already it&#8217;s a completely ludicrous proposal. I can&#8217;t be bothered to think what the requirements would be to actually run such a preposterous scheme. The absolute, fundamental problem, is that the vast majority of people and the vast majority of data (even from terrorists) is in no way suggestive of terrorists. It&#8217;s useless information that still has to be processed. Nothing beats: a) solving the problem in the first place; b) actually acquiring useful information (that means police and intelligence services work).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Queex</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25254</link>
		<dc:creator>Queex</dc:creator>
		<pubDate>Mon, 02 Mar 2009 10:46:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25254</guid>
		<description>njdowrick:

I think you&#039;re misunderstanding how meta-analysis of different algorithms would actually work.

To have two independent algorithms A and B, each must operate on different subsets of the data. You can combine the results to improve on either individually, but there&#039;s no reason to think it will be any better than an algorithm C working on all the data.

If both algorithms work on the full data (or indeed overlap at all), then they are not truly independent, which can greatly reduce the specificity gain from combining them- and there&#039;s still no reason to think they would beat C.

False positives don&#039;t just arise from quirks of the algorithm used, they also arise from quirks of the data itself. The point is, even with the ambitious estimates of accuracy, for every true positive there will be a number of false positives and the only way to tell them apart is by going out and finding more data.

Meta-analysis is not a panacea to wipe away false positive problems; it&#039;s a toolbox to help get at the results you would have obtained had you done a single big study. If this hypothetical single study is still not specific enough to be practical, you&#039;re SOL.

A lot of the apologists for the approach seem to think that as long as some gain can be shown, it&#039;s reason to support the idea. It&#039;s not. You have to show that it&#039;s more effective than spending a similar amount of money on other approaches. When you look at the likely cost (in terms of infrastructure and the expense of sorting wheat from chaff once you&#039;ve done your mining), even the most unrealistically optimistic guesses as to its efficacy still make it worse than, say, hiring a few dozen extra agents or funding a public education programme.</description>
		<content:encoded><![CDATA[<p>njdowrick:</p>
<p>I think you&#8217;re misunderstanding how meta-analysis of different algorithms would actually work.</p>
<p>To have two independent algorithms A and B, each must operate on different subsets of the data. You can combine the results to improve on either individually, but there&#8217;s no reason to think it will be any better than an algorithm C working on all the data.</p>
<p>If both algorithms work on the full data (or indeed overlap at all), then they are not truly independent, which can greatly reduce the specificity gain from combining them- and there&#8217;s still no reason to think they would beat C.</p>
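<p>(One way to see how much independence matters, as a small Python sketch. It assumes two yes/no tests with the same false positive rate, and uses the standard identity for two binary variables with equal marginals: P(both) = p^2 + rho * p * (1 - p), where rho is the correlation between their errors. The 1% rate is an illustrative choice.)</p>

```python
def joint_false_positive_rate(fpr, correlation):
    """Probability that both tests wrongly flag the same innocent person.

    Assumes both tests share the false positive rate `fpr` and that
    `correlation` is the Pearson correlation between their errors.
    """
    return fpr ** 2 + correlation * fpr * (1.0 - fpr)


# correlation 0.0 means fully independent tests, 1.0 means identical ones.
for rho in (0.0, 0.1, 0.5, 1.0):
    print(rho, joint_false_positive_rate(0.01, rho))
```

<p>Even a correlation of 0.1 leaves the combined false positive rate about eleven times higher than in the fully independent case, and at a correlation of 1 the second test adds nothing at all.</p>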
<p>False positives don&#8217;t just arise from quirks of the algorithm used, they also arise from quirks of the data itself. The point is, even with the ambitious estimates of accuracy, for every true positive there will be a number of false positives and the only way to tell them apart is by going out and finding more data.</p>
<p>Meta-analysis is not a panacea to wipe away false positive problems; it&#8217;s a toolbox to help get at the results you would have obtained had you done a single big study. If this hypothetical single study is still not specific enough to be practical, you&#8217;re SOL.</p>
<p>A lot of the apologists for the approach seem to think that as long as some gain can be shown, it&#8217;s reason to support the idea. It&#8217;s not. You have to show that it&#8217;s more effective than spending a similar amount of money on other approaches. When you look at the likely cost (in terms of infrastructure and the expense of sorting wheat from chaff once you&#8217;ve done your mining), even the most unrealistically optimistic guesses as to its efficacy still make it worse than, say, hiring a few dozen extra agents or funding a public education programme.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark Frank</title>
		<link>http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/comment-page-2/#comment-25253</link>
		<dc:creator>Mark Frank</dc:creator>
		<pubDate>Mon, 02 Mar 2009 10:15:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.badscience.net/2009/02/datamining-would-be-lovely-if-it-worked/#comment-25253</guid>
		<description>Re #54 Biologista
“Yes indeed. We can assess the sensitivity, specificity etc for a cancer screening. We can’t assess a screen for terrorist detection because they don’t like to sit still for tests designed to catch them.

Given the target (largely predictable cancer cells versus largely unpredictable people) there are bound to be statistical differences. The data mining is surely far weaker.”

Of course it is true that you can’t assess the sensitivity and specificity of some kind of test for terrorists, especially as they are going to vary wildly from one situation to another. But that&#039;s not my reason for suggesting the cancer screening model is not appropriate. The most important reasons are that the security forces may be trying to answer a different question and that merging with other information sources can make a dramatic difference to the usefulness of the model.

A full explanation is a bit long for a comment so I have put it on my &lt;a href=&quot;http://mark_frank.blogspot.com/2009/03/data-mining-for-terrorists.html&quot; rel=&quot;nofollow&quot;&gt;personal blog&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p>Re #54 Biologista<br />
“Yes indeed. We can assess the sensitivity, specificity etc for a cancer screening. We can’t assess a screen for terrorist detection because they don’t like to sit still for tests designed to catch them.</p>
<p>Given the target (largely predictable cancer cells versus largely unpredictable people) there are bound to be statistical differences. The data mining is surely far weaker.”</p>
<p>Of course it is true that you can’t assess the sensitivity and specificity of some kind of test for terrorists, especially as they are going to vary wildly from one situation to another. But that&#8217;s not my reason for suggesting the cancer screening model is not appropriate. The most important reasons are that the security forces may be trying to answer a different question and that merging with other information sources can make a dramatic difference to the usefulness of the model.</p>
<p>A full explanation is a bit long for a comment so I have put it on my <a href="http://mark_frank.blogspot.com/2009/03/data-mining-for-terrorists.html" rel="nofollow">personal blog</a>.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

