case class ApproximateWhitelist(filter: BitSet) extends Product with Serializable
An ApproximateWhitelist is a basic Bloom filter intended for holding natural-language
vocabularies. It deals with String values natively and can be trained from a sequence or from
an RDD of any element type T, as long as there is an implicit conversion in scope
from T to String.
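As an illustration of the training contract described above, here is a minimal sketch of a Bloom filter backed by a BitSet. It is not this library's implementation: the names SketchWhitelist, train, mightContain, bits, and hashes are hypothetical, and seeding MurmurHash3 several times is an assumed choice of hash functions, not the one this class uses.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical stand-in for ApproximateWhitelist, for illustration only.
final case class SketchWhitelist(filter: BitSet, bits: Int, hashes: Int) {
  // True if every bit position derived from `s` is set. False positives are possible.
  def mightContain(s: String): Boolean =
    SketchWhitelist.indices(s, bits, hashes).forall(filter.contains)
}

object SketchWhitelist {
  // Derive `hashes` bit positions by seeding MurmurHash3 differently each time (an assumption).
  def indices(s: String, bits: Int, hashes: Int): Seq[Int] =
    (0 until hashes).map(seed => Math.floorMod(MurmurHash3.stringHash(s, seed), bits))

  // Accepts any element type T for which an implicit T => String is in scope.
  def train[T](items: Seq[T], bits: Int = 1 << 16, hashes: Int = 3)(
      implicit toStr: T => String): SketchWhitelist = {
    val set = items.foldLeft(BitSet.empty) { (acc, item) =>
      acc ++ indices(toStr(item), bits, hashes)
    }
    SketchWhitelist(set, bits, hashes)
  }
}

object SketchWhitelistDemo {
  // Hypothetical element type with a conversion to String.
  final case class Token(text: String)
  implicit val tokenToString: Token => String = _.text

  val fromTokens: SketchWhitelist = SketchWhitelist.train(Seq(Token("alpha"), Token("beta")))
  val fromStrings: SketchWhitelist = SketchWhitelist.train(Seq("alpha", "beta"))
}
```

A trained filter never reports a stored word as absent, so SketchWhitelistDemo.fromTokens.mightContain("alpha") is true. A word that was not stored may still be reported as present, which is why the whitelist is approximate.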
Known limitation: although this filter uses several hash functions, some of them exhibit
unusually high collision rates on strings that are permutations of one another (anagrams, for
example). If the filter performs poorly on a given vocabulary, for instance with an unexpectedly
high false-positive rate, check whether the vocabulary contains many such strings. The choice of
hash functions is subject to change in a future release.
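To see why permutations can collide, consider a hash that combines characters with a commutative operation. The sketch below uses a hypothetical additiveHash. This documentation does not say which hash functions the filter actually uses, so treat this as an example of the failure mode rather than a description of the implementation.

```scala
object PermutationCollisionSketch {
  // Hypothetical order-insensitive hash: addition is commutative,
  // so any rearrangement of the same characters yields the same value.
  def additiveHash(s: String): Int = s.foldLeft(0)(_ + _)

  // Two words that are permutations of one another.
  val words: Seq[String] = Seq("on", "no")

  // Collides: both words sum to the same character total.
  val additive: Seq[Int] = words.map(additiveHash)

  // String.hashCode weights each character by its position, so these two words differ.
  val positional: Seq[Int] = words.map(_.hashCode)
}
```

If a vocabulary contains many such pairs, an order-insensitive hash contributes fewer distinct bit positions than expected, which would be one place to look when the filter performs poorly.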