Reference:

Matti Pöllä and Timo Honkela. Negative selection of written language using character multiset statistics. Journal of Computer Science and Technology, 25(6):1256–1266, November 2010.

Abstract:

We study the combination of symbol frequency analysis and negative selection for anomaly detection of discrete sequences where conventional negative selection algorithms are not practical due to data sparsity. Theoretical analysis on ergodic Markov chains is used to outline the properties of the presented anomaly detection algorithm and to predict the probability of successful detection. Simulations are used to evaluate the detection sensitivity and the resolution of the analysis on both generated artificial data and real-world language data including the English Wikipedia. Simulation results on large reference corpora are used to study the effects of the assumptions made in the theoretical model in comparison to real-world data.

Suggested BibTeX entry:

@article{polla10jcst,
    author = {Matti P{\"o}ll{\"a} and Timo Honkela},
    journal = {Journal of Computer Science and Technology},
    month = {November},
    number = {6},
    pages = {1256--1266},
    publisher = {Springer Boston},
    title = {Negative Selection of Written Language Using Character Multiset Statistics},
    volume = {25},
    year = {2010},
}

See dx.doi.org ...