Stylometry

Deanonymization based on a user's linguistic style.
Stylometry Threat Description
No operating system obfuscates a user's writing style. Consequently, unless precautions are taken (see below), users are at risk of stylometric analysis based on their linguistic style. Research suggests that only a few thousand words (or fewer) may be enough to positively identify an author, and a host of software tools are available to conduct this analysis.
This technique is used by advanced adversaries to attribute authorship to anonymous documents, online texts (web pages, blogs etc.), electronic messages (emails, tweets, posts etc.) and more.
Stylometry Tracking Techniques
The field is dominated by AI techniques such as neural networks and statistical pattern recognition, and it is critical to privacy and security. Current anonymity and circumvention systems focus on location-based privacy but ignore identity leakage through the content of the data itself, which allows authorship recognition with high accuracy (90%+ probability). [1]
There are multiple ways to conduct statistical analysis on "anonymous" texts, including: [1] [2]
- Keystroke fingerprinting, for example in conjunction with JavaScript.
- Stylistic flourishes.
- Abbreviations.
- Spelling preferences and misspellings.
- Language preferences.
- Word frequency.
- Number of unique words.
- Regional linguistic preferences in slang, idioms and so on.
- Sentence/phrasing patterns.
- Word collocation (pairs).
- Use of formal/informal language.
- Function words.
- Vocabulary usage and lexical density.
- Character count with white space.
- Average sentence length.
- Average syllables per word.
- Synonym choice.
- Expressive elements like colors, layout, fonts, graphics, emoticons and so on.
- Analysis of grammatical structure and syntax.
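To make a few of the surface features listed above concrete, the following is a minimal sketch of how they could be computed from a text sample. It is an illustration only: the function name `stylometric_features`, the short `FUNCTION_WORDS` set, and the naive word and sentence splitting are assumptions made for this example, and real attribution tools use far richer feature sets and trained classifiers.

```python
import re
from collections import Counter

# Small illustrative set of English function words.
# This is an assumption for the example, not a standard list.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to",
    "in", "on", "for", "with", "that", "is", "it",
}


def stylometric_features(text: str) -> dict:
    """Compute a few simple surface features of a text sample."""
    # Words: runs of ASCII letters and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Sentences: naive split on ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]

    counts = Counter(words)
    total = len(words)
    function_hits = sum(counts[w] for w in FUNCTION_WORDS)

    return {
        "char_count_with_whitespace": len(text),
        "word_count": total,
        "unique_words": len(counts),
        "avg_sentence_length": total / len(sentences) if sentences else 0.0,
        "function_word_ratio": function_hits / total if total else 0.0,
        "top_words": counts.most_common(3),
    }


if __name__ == "__main__":
    sample = "The cat sat on the mat. The dog barked!"
    for name, value in stylometric_features(sample).items():
        print(name, value)
```

Even these crude numbers tend to differ between authors, which is why a profile built from many such features over a few thousand words can be distinctive.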
Stylometry Defenses
Fortunately, research suggests that if users purposefully obfuscate their linguistic style or imitate the style of other known authors, this is largely successful in defeating all stylometric analysis methods, so that they perform no better than randomly guessing the correct author of a document. However, automated methods such as machine translation services do not appear to be a viable means of circumvention. [1]
AI Based Stylometry Defense
In the future, Artificial Intelligence (AI) might help with rephrasing text and thereby removing traces of a writer's style. No specific AI recommendation is available yet. Some general criteria:
- Local, Offline AI: The AI should run on the user's local computer and not in the cloud. It should not be based on a web interface unless that interface runs locally. This is because a cloud-based AI provider such as ChatGPT should be assumed to log all conversations and, as with any website, there is also a keystroke fingerprinting risk.
- Sandboxed AI: The AI might not be usable inside a virtual machine (VM) because it may require access to a real graphics card for performance reasons. Ideally the AI would be used on an offline computer. Since that might be too impractical, it should at least be sandboxed so that it has no internet access.
- Text auto-correct AI: Optional. An AI specifically trained for text auto-correction without other skills (such as conversion, programming or mathematics) might have lower system requirements.
The Whonix project has not yet identified any Open Source AI. None is available from packages.debian.org at the time of writing. This will take time. In fact, many companies misuse the term "open source" for things that clearly are not Open Source. The term Open Source AI is also not yet well defined; some stakeholders, such as Debian, are working on it. For an elaboration and further references, see Real Open Source Artificial Intelligence (AI).
llama.cpp is an interface, a tool that facilitates the use of AI models. I have tested ollama, which is similar.
I have tested local freeware [3] models such as DeepSeek R1, Facebook’s LLaMA, and various distilled models on fast consumer hardware with the latest gaming GPU. In my testing, the quality and performance were not sufficient for practical use. The better models, for which the quality might be usable, were too slow. A reply to “hi” took 30 minutes. Distilled (“simplified, faster”) models did not produce acceptable output quality for my use cases, including typo / grammar review and code review.
It might be feasible with an NVIDIA H100 for an approximately $38K USD purchase, or at $2.40/hr when rented in a data center. Renting in a data center raises privacy and security concerns. The data center could log / tamper with all inputs and outputs. I have not tried that yet.
(Source: forum post by a Whonix developer)
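As an illustration of the "local, offline" criterion, the following is a minimal sketch of sending text to a locally running ollama server for rephrasing. It is not a recommendation. It assumes ollama's HTTP endpoint `/api/generate` on its default port 11434; the model name `example-model` is a hypothetical placeholder, and the helper names `is_loopback`, `build_request` and `rephrase` are invented for this example. As noted above, output quality and speed on consumer hardware may not be sufficient in practice.

```python
import ipaddress
import json
import urllib.request
from urllib.parse import urlparse

# Assumptions: an ollama server on its default local port, and a
# placeholder model name. Neither is a recommendation.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"
MODEL = "example-model"  # hypothetical; substitute a locally installed model


def is_loopback(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback IP."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote.
        return False


def build_request(text: str, model: str = MODEL) -> dict:
    """Build the JSON payload asking the model to rewrite the text."""
    prompt = (
        "Rewrite the following text in plain, neutral English. "
        "Keep the meaning and output only the rewritten text.\n\n" + text
    )
    return {"model": model, "prompt": prompt, "stream": False}


def rephrase(text: str, url: str = OLLAMA_URL) -> str:
    """Send the text to a local server; refuse any non-loopback endpoint."""
    if not is_loopback(url):
        raise ValueError("refusing to send text to a non-local endpoint: " + url)
    data = json.dumps(build_request(text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    # An empty ProxyHandler ignores proxy environment variables, so the
    # request is not silently routed through a remote HTTP proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=600) as response:
        return json.loads(response.read())["response"]
```

The loopback check only guards against accidentally pointing the script at a remote service. It does not replace sandboxing or running on an offline computer, because the AI software itself could still attempt network access.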
Other Mitigation
Mitigation strategies for stylometric analysis threats are further documented on this page: Surfing Posting Blogging.
See also:
Stylometry Future Research
Areas of interest for further research and development include:
- Add user documentation for Remote Administration, Keystroke Fingerprinting, Stylometry
- A4NT - Translating writing styles for anonymity [4]
- Java Graphical Authorship Attribution Program [5]
Footnotes
- ↑ 1.0 1.1 1.2 https://web.archive.org/web/20160304062339/https://www.cs.drexel.edu/~sa499/papers/adversarial_stylometry.pdf
- ↑ https://en.wikipedia.org/wiki/Stylometry
- ↑ These models are sometimes labeled Open Source, but they fail the definitions of the OSD, FSF, and Debian DFSG. So these are neither Open Source nor Free Software as typically defined.
- ↑ https://github.com/rakshithShetty/A4NT-author-masking/issues/3
- ↑ https://github.com/evllabs/JGAAP/issues/107#issuecomment-476758584



