24 September 2006

How translucency could defuse the Turnitin/McLean High controversy

Great (and fun) piece by Jon Udell on using one-way hash functions to build a database of document fingerprints, against which other documents can be compared to detect plagiarism.

Jon Udell: How translucency could defuse the Turnitin/McLean High controversy

...
Turnitin's business is (or should be) only to detect plagiarism. To do that, it must build a database. But surprisingly and counterintuitively, the documents stored in that database need not be readable by human beings. To meet the business requirement, they need only be machine-readable versions derived from the human-readable originals.

Suppose the previous sentence appears in a student assignment. A cryptographic hash function can convert that sentence into this sequence of characters:

119ffe6a7c1f54b96beb6e38d822ebd0cb8df63d

The operation is called a one-way hash because although it will reliably and repeatedly convert the same sentence into the same sequence of characters, you cannot reverse it. The sentence is not recoverable from its derived sequence. What's more, it's very unlikely that two different sentences will yield the same sequence.

So here's a strategy for Turnitin. Convert each sentence of each student document into its corresponding sequence of characters, store only that sequence in the database, and discard the original sentence. Now the database contains no intellectual property subject to misuse. Even if it wanted to, Turnitin couldn't improperly mine the database. Neither could anyone who bought or stole the data.

But Turnitin can use the database for its sole valid purpose: to detect plagiarism. How? By deriving one-way hashes from each sentence of each document that it checks for plagiarism, and then by searching its database for those derived sequences of characters.
...
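
To make the strategy concrete, here is a minimal sketch in Python. It is not Turnitin's method or Udell's code. It assumes SHA-1 as the hash (the 40-character sequence in the excerpt is the length of a SHA-1 hex digest, but the article excerpt does not name the function), a naive sentence splitter, and light normalization of case and whitespace. The names sentence_hashes, HashDatabase, add_document and overlap are all made up for illustration.

```python
import hashlib
import re


def sentence_hashes(text):
    """Return the set of SHA-1 hex digests, one per sentence in text."""
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    digests = set()
    for sentence in sentences:
        # Normalize case and spacing so trivial reformatting keeps the same hash.
        normalized = " ".join(sentence.lower().split())
        if normalized:
            digests.add(hashlib.sha1(normalized.encode("utf-8")).hexdigest())
    return digests


class HashDatabase:
    """Hypothetical store that keeps only digests, never the original text."""

    def __init__(self):
        self._digests = set()

    def add_document(self, text):
        # Only the derived digests are retained; the text itself is not stored.
        self._digests |= sentence_hashes(text)

    def overlap(self, text):
        """Fraction of the document's distinct sentences already in the database."""
        hashes = sentence_hashes(text)
        if not hashes:
            return 0.0
        return len(hashes & self._digests) / len(hashes)


db = HashDatabase()
db.add_document("To do that, it must build a database. The originals are discarded.")
score = db.overlap("To do that, it must build a database. This sentence is new.")
print(score)
```

One limitation of this sketch: it only catches exact sentence matches after normalization. Changing a single word in a sentence produces a completely different digest, so a real detector would need something more tolerant than whole-sentence hashing.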

The important point here is that some properties of data can be selectively exposed while the "clear text" itself stays masked. In other words, a data steward has degrees of freedom in what can safely be revealed. It's not just "all in the clear" or "all in the dark."

One question in the article that's a real brain-teaser: Does cryptographic transformation create a derivative work also subject to copyright protection?
