π Extracting Important Words Using Weirdness


Learn how to identify and extract rare or meaningful words from a corpus by leveraging the concept of “weirdness.” This approach enriches language processing applications, enabling you to highlight unique vocabulary and gain new linguistic insights.
What Is Word Weirdness?
“Weirdness” measures how unusual or infrequently a word appears in a specific context compared to a general corpus. Words with high weirdness scores often reflect domain-specific jargon, neologisms, or outliersβuseful for tasks such as keyword extraction, content filtering, and building rich glossaries.
The Extract-important-words-using-weirdness Project
The Extract-important-words-using-weirdness tool, available on GitHub, helps you automate the process of extracting these unusual words from your own datasets. Although the README is currently limited, similar tools identify key features in text, often using statistical comparisons between domain and reference corpora.
Common pipeline steps include:
- Collecting text data (your domain corpus)
- Comparing term frequencies with those in a general corpus
- Scoring words using weirdness (e.g., by dividing relative frequency in the domain by that in the reference)
- Extracting and exporting the words with the highest scores
Why Use Weirdness?
Extracting uncommon words is valuable when you want to:
- Build specialized dictionaries or lexicons
- Identify emerging trends or concepts
- Enhance search engines or NLP models with domain-specific awareness
- Support linguistic exploration
How Weirdness is Calculated
For each word:
Weirdness = (ws/ts) / (wg/tg)
Where:
- ws = frequency of the word in the specialist corpus
- ts = total words in the specialist corpus
- wg = frequency in the general corpus (Google Unigrams)
- tg = total words in the general corpus
Words with high weirdness are considered important or distinctive for the specialist domain.
Example Usage
Here is a practical example (conceptual pseudocode):
from extract_words import calculate_word_frequencies_from_text, extract_important_words_by_weirdness
specialist_text = """
The investors were excited about the supercritical fluid technology.
Dollars and cents were discussed, and the pressurization of the fluid was achieved.
Supercritical fluids are important in modern finance and chemistry.
The investors agreed that the dollars invested in supercritical fluid research would yield returns.
"""
specialist_corpus_freqs = calculate_word_frequencies_from_text(specialist_text)
general_corpus_file = "Eng_GoogleUnigrams.csv"
important_words = extract_important_words_by_weirdness(
specialist_corpus_freqs, general_corpus_file, top_n=10, min_weirdness=1.0
)
for word, score in important_words:
print(f"{word}: {score:.2f}")
References
Ahmad, K., Gillam, L., & Tostevin, L. (1999). University of Surrey Participation in TREC8: Weirdness Indexing for Logical Document Extrapolation and Retrieval (WILDER). In E. M. Voorhees & D. K. Harman (Eds.), Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999 (NIST Special Publication, Vol. 500-246). National Institute of Standards and Technology (NIST). PDF
BibTeX:
@inproceedings{weirdness_org,
author = {Khurshid Ahmad and
Lee Gillam and
Lena Tostevin},
editor = {Ellen M. Voorhees and
Donna K. Harman},
title = {University of Surrey Participation in {TREC8:} Weirdness Indexing
for Logical Document Extrapolation and Retrieval {(WILDER)}},
booktitle = {Proceedings of The Eighth Text REtrieval Conference, {TREC} 1999,
Gaithersburg, Maryland, USA, November 17-19, 1999},
series = {{NIST} Special Publication},
volume = {500-246},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {1999},
url = {http://trec.nist.gov/pubs/trec8/papers/surrey2.pdf},
timestamp = {Tue, 30 Jun 2020 17:23:13 +0200},
biburl = {https://dblp.org/rec/conf/trec/AhmadGT99.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}