
March 26, 2006

Detecting Near-duplicate Documents

Approximately 30% of the pages on the web are (near-)duplicates. Google has a patent on improved duplicate and near-duplicate detection techniques.

"From the perspective of users, duplicate and near-duplicate documents raise problems. More specifically, when users submit a query to a search engine, most do not want links to (and descriptions of) Web pages which have largely redundant information. For example, search engines typically respond to search queries by providing groups of ten results. If pages with duplicate content were returned, many of the results in one group may include the same content. Thus, there is a need for a technique to avoid providing search results associated with (e.g., having links to) Web pages having duplicate content."

One idea might be to index the keywords in the documents and compare the percentage of terms shared by the two documents, but that is highly inefficient.
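To make the comparison concrete, here is a rough Python sketch of that idea (not Google's actual method); the tokenizer and the overlap measure used here are assumptions made purely for illustration:

    import re

    def terms(text):
        # Naive tokenizer: lowercase alphanumeric words (an assumption for illustration).
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def term_overlap(doc_a, doc_b):
        # One way to define "percentage of shared terms": distinct terms in common
        # divided by all distinct terms in either document.
        a, b = terms(doc_a), terms(doc_b)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

    # Every document has to be compared with every other document,
    # which is what makes this approach impractical at web scale.
    print(term_overlap("the quick brown fox", "the quick brown dog"))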

Or you can try to compute the edit distance (Damerau-Levenshtein distance) between the two documents. The edit distance between two input strings is the minimum cost of a sequence of edit operations (substitution of one symbol for another, insertion of an extraneous symbol, deletion of a symbol, and transposition of adjacent symbols) needed to change one input string into the other.
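Here is a minimal Python sketch of the restricted Damerau-Levenshtein distance (optimal string alignment), with unit cost for each operation; it only illustrates the definition, and its quadratic dynamic program is far too slow to run on whole web pages:

    def damerau_levenshtein(s, t):
        # Substitution, insertion, deletion, and transposition of
        # adjacent symbols, each at cost 1.
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
                if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    print(damerau_levenshtein("google", "goolge"))  # 1: one transposition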

A much better method for detecting duplicate and near-duplicate documents involves generating "fingerprints" (hashes) for elements (paragraphs, sentences, words, shingles) of documents. Two documents are considered near-duplicates if they share more than a predetermined number of fingerprints.
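A hedged Python sketch of that fingerprinting idea: MD5 stands in for whatever fingerprint function the patent actually uses, and the sharing threshold is an arbitrary value chosen for illustration:

    import hashlib

    def fingerprints(elements):
        # Hash each element (e.g. a sentence or shingle) to a 64-bit fingerprint.
        return {int(hashlib.md5(e.encode("utf-8")).hexdigest()[:16], 16)
                for e in elements}

    def near_duplicates(elements_a, elements_b, threshold=3):
        # Flag two documents as near-duplicates if they share more than
        # `threshold` fingerprints (threshold chosen arbitrarily here).
        shared = fingerprints(elements_a) & fingerprints(elements_b)
        return len(shared) > threshold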

A k-shingle is a sequence of k consecutive words from a document. If S(A) is the set of shingles contained in A, we can compute the resemblance of A and B as |S(A) ∩ S(B)| divided by |S(A) ∪ S(B)|. The problem is that computing this exactly for every pair of pages is too expensive, so the resemblance has to be estimated by sampling the shingles.
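The following Python sketch shows k-shingling, the exact resemblance, and a min-hash style estimate of it; the choice of k=4, the 100 hash functions, and the use of Python's built-in hash are assumptions for illustration, not the parameters from Broder's papers:

    import random

    def shingles(text, k=4):
        # All sequences of k consecutive words (k=4 is an arbitrary choice).
        words = text.lower().split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def resemblance(a, b):
        # Exact resemblance: |S(A) ∩ S(B)| / |S(A) ∪ S(B)|.
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    def estimated_resemblance(a, b, num_hashes=100, seed=42):
        # Min-hash estimate: for each random hash function, compare the
        # minimum hash value of the two shingle sets; the fraction of
        # agreeing minima approximates the exact resemblance.
        sa, sb = shingles(a), shingles(b)
        if not sa or not sb:
            return 0.0
        rng = random.Random(seed)
        matches = 0
        for _ in range(num_hashes):
            salt = rng.getrandbits(64)
            h = lambda s: hash((salt, s))
            if min(map(h, sa)) == min(map(h, sb)):
                matches += 1
        return matches / num_hashes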

Learn more from Andrei Broder's course at Princeton University [PDF, html version].

"Search without a box" - A chat with Andrei Broder

2 comments:

  1. I don't believe there is a problem with duplicate content. Google already has the highly effective PageRank to sort out the most relevant pages for any given search phrase.

    Why do they need to remove duplicate content when they already return the most relevant pages to their visitors?

    At my blog I link to a 13-minute movie I made debunking the Duplicate Content Penalty Myth.

    Neil Shearing.

  2. The author is right, this is a big problem for search engines. But it is also a problem for any big company that produces lots of documents. Just think about the space they take, the maintenance they require and the value you get. If you need to find duplicates or near-duplicates, you can try this program in Java: http://softcorporation.com/products/neardup/

