w-shingling

In natural language processing a w-shingling is a set of unique "shingles" (n-grams, contiguous subsequences of tokens in a document) that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.

The document, "a rose is a rose is a rose" can be tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (4-grams) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } = { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.

See also

References

External links

This article is issued from Wikipedia - version of the 9/27/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.