Contrastive Lexical Semantic Shift Detection in Historical Corpora

Chapter 1 Introduction

The study of language evolution fundamentally relies on understanding how word meanings transform over extended periods, a phenomenon known as semantic shift. In the context of historical linguistics, detecting these shifts manually is a formidable task given the vast volume of text available in digital archives. Consequently, the field has moved towards computational approaches, specifically Contrastive Lexical Semantic Shift Detection. This methodology serves as a pivotal mechanism for quantifying and qualifying the divergence in word usage between two distinct historical time frames. Unlike traditional qualitative analysis, which depends heavily on the intuition of the philologist, this approach leverages distributional semantics to model words as vectors within a high-dimensional space. By comparing the geometric positioning of a specific word in a historical corpus against its positioning in a modern or later corpus, researchers can calculate the magnitude and direction of semantic change with mathematical precision.

The operational procedure of Contrastive Lexical Semantic Shift Detection begins with the curation and preprocessing of two separate diachronic corpora representing the time periods under investigation. Following data cleaning and lemmatization, a distinct word embedding model is trained on each corpus. These models, typically produced by algorithms such as Word2Vec or GloVe, map words to dense vectors based on their contextual co-occurrences within the text. The underlying assumption, the distributional hypothesis, posits that the meaning of a word is derived from the company it keeps; therefore, a shift in context results in a shift in vector position. Because the two models are trained independently, their coordinate systems are arbitrary with respect to one another, so an alignment step is required before vectors from the two corpora can be compared. Orthogonal Procrustes alignment is frequently employed for this purpose: it finds the rotation (possibly combined with a reflection) that minimizes the summed squared distance between the two vectors of each anchor word, that is, each word assumed to be semantically stable across the periods. Because the transformation is orthogonal, it preserves distances and angles within each space. This alignment allows for a direct comparison of the target word’s vector across the two time slices.
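The alignment step can be illustrated with a minimal sketch of orthogonal Procrustes, assuming NumPy is available and that the anchor-word vectors of both spaces have already been stacked row by row in the same order. The function name `procrustes_align` and the toy matrices are illustrative assumptions, not part of the framework described here.

```python
import numpy as np

def procrustes_align(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F.

    source, target: (n_anchors, dim) arrays whose rows are the vectors of the
    same anchor words in the two independently trained spaces.
    """
    # The SVD of the cross-covariance matrix gives the closed-form solution.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Toy data (invented): the "modern" space is the "historical" space rotated
# by 90 degrees, so a perfect alignment exists.
historical = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
modern = historical @ rotation

R = procrustes_align(historical, modern)
aligned = historical @ R  # historical vectors expressed in the modern space
```

In practice the matrix learned on the anchor words would be applied to every vector of the historical space, including the target words whose shift is to be measured.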

Following the alignment phase, the implementation pathway shifts toward measuring the degree of semantic divergence. This is typically achieved by calculating the cosine similarity between the historical vector and the modern vector of the target word. A low cosine similarity score indicates a significant angular shift in the vector space, signifying that the word’s usage context has changed drastically over time. Beyond the aggregate score, analyzing the nearest neighbors of the target word in both time slices provides granular insight into the nature of the change. For instance, a word originally associated with agricultural terms might, in a later period, show nearest neighbors related to technology, illustrating a metaphorical or functional shift. This contrastive analysis does not merely identify that a change has occurred but also characterizes the trajectory of that change by contrasting the semantic fields the word inhabited in the past versus those it inhabits in the present.
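The two measurements described above, the cosine similarity of a target word with itself across periods and the comparison of its nearest neighbors, can be sketched as follows. The sketch assumes the two spaces are already aligned; the words and two-dimensional vectors are invented toy values chosen only to make the contrast visible.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(target, space, k=2):
    """Return the k words in `space` most similar to `target`."""
    scores = {
        word: cosine_similarity(space[target], vec)
        for word, vec in space.items()
        if word != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented toy spaces: only the vector of "mouse" differs between periods.
context = {
    "rat": [0.9, 0.2], "barn": [0.8, 0.3],
    "keyboard": [0.1, 1.0], "screen": [0.2, 0.9],
}
historical_space = {**context, "mouse": [1.0, 0.1]}
modern_space = {**context, "mouse": [0.2, 1.0]}

similarity = cosine_similarity(historical_space["mouse"], modern_space["mouse"])
hist_neighbors = nearest_neighbors("mouse", historical_space)
mod_neighbors = nearest_neighbors("mouse", modern_space)
```

A low `similarity` signals that a shift occurred, while the two neighbor lists indicate which semantic field the word left and which it entered.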

The practical application value of this technology extends significantly into the realms of lexicography, digital humanities, and natural language processing. For lexicographers, automated detection tools serve as a powerful aid in revising historical dictionaries, pinpointing entries that require updated definitions or etymological notes. In the sphere of digital humanities, it enables the macro-analysis of cultural trends, allowing scholars to trace how concepts like "democracy" or "freedom" have evolved in public discourse over centuries. Furthermore, in natural language processing, understanding semantic shift is crucial for improving the performance of diachronic information retrieval systems and training robust models that can handle the temporal variability of language. By standardizing the detection of semantic change into a reproducible computational pipeline, researchers can uncover patterns of linguistic evolution that were previously obscured by the limitations of manual analysis, thereby providing a more objective and comprehensive understanding of the dynamic nature of language.

Chapter 2 Contrastive Lexical Semantic Shift Detection Framework for Historical Corpora

2.1 Construction of Parallel Historical-Contemporary Lexical Contrast Datasets

The construction of parallel historical-contemporary lexical contrast datasets serves as the foundational infrastructure for detecting semantic shifts, functioning as the empirical anchor against which computational models are calibrated and validated. This process involves a rigorous architectural design where textual data from distinct historical epochs are systematically paired to facilitate a direct, granular comparison of identical lexical entries across time. The core objective is to create a structured resource where a target word in a historical source is meaningfully juxtaposed with its contemporary counterpart, preserving the context while accounting for linguistic evolution. This parallelization is not merely a data alignment task but a semantic mapping exercise that ensures observed differences are attributable to genuine semantic shift rather than noise from orthographic or stylistic variances.

The operational procedure begins with the meticulous selection and preprocessing of source materials. For historical corpora, the primary challenge lies in the heterogeneity of orthographic systems, which requires sophisticated text normalization strategies. This stage involves the transformation of archaic spellings and character variants into a standardized format compatible with contemporary processing tools. Historical texts undergo specific cleaning protocols to remove digital artifacts and bibliographic metadata, followed by time period annotation to firmly situate the text within a specific temporal bracket. Word segmentation alignment is subsequently applied to reconcile the often-variable boundary delimitation in pre-modern texts with modern standards, ensuring that the tokenization of the historical lexeme accurately reflects its grammatical and semantic identity. Conversely, contemporary corpus materials undergo parallel preprocessing, focusing on cleaning noise, tokenizing text according to modern linguistic norms, and ensuring a broad domain coverage to match the variety found in historical records. This dual-track preprocessing ensures that both sides of the parallel dataset are structurally symmetrical and methodologically comparable.
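The orthographic normalization stage might look like the following minimal sketch. It assumes a lookup table of archaic spellings is available; the table `SPELLING_VARIANTS` and the function `normalize_historical` are hypothetical, and a real system would need a far larger, period-specific resource and a language-appropriate tokenizer.

```python
import re

# Hypothetical variant table mapping archaic spellings to modern forms.
SPELLING_VARIANTS = {"haue": "have", "loue": "love", "vnto": "unto"}

def normalize_historical(text):
    """Lowercase, replace the long s, tokenize, and map known variants."""
    text = text.replace("\u017f", "s")   # long s (ſ) to ordinary s
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SPELLING_VARIANTS.get(tok, tok) for tok in tokens]

tokens = normalize_historical("I haue \u017feen Loue")
```

Applying the same tokenization rule to the contemporary corpus keeps the two sides of the dataset structurally comparable.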

Following preprocessing, the framework establishes strict sampling criteria for target lexical entries to ensure robust contextual coverage. The selection process prioritizes lexical items that exhibit sufficient frequency in both temporal subsets to support statistical analysis, as low-frequency tokens often yield unreliable semantic representations. The sampling strategy aims for a balanced distribution across parts of speech and semantic domains to prevent dataset bias. For each selected target entry, a comprehensive set of contextual instances is extracted from both the historical and contemporary corpora. These instances are not randomly selected but are curated to represent the diverse syntactic and collocational environments in which the target word appears. This depth of coverage is critical, as semantic shift is often context-dependent; a word may retain its original meaning in some registers while evolving in others. By maximizing the breadth of contextual samples, the dataset captures the full spectrum of the word’s usage, providing the necessary granularity for models to detect subtle semantic transitions.
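The frequency criterion can be sketched as a simple filter over token counts. The threshold `min_count` and the toy token lists are assumptions for illustration; the text does not state the cut-off actually used, and balancing across parts of speech and semantic domains would require additional annotation not shown here.

```python
from collections import Counter

def select_targets(historical_tokens, modern_tokens, min_count=2):
    """Keep words that reach `min_count` occurrences in both corpora."""
    hist = Counter(historical_tokens)
    mod = Counter(modern_tokens)
    # Counter returns 0 for unseen words, so absent words are filtered out.
    return sorted(
        w for w in hist if hist[w] >= min_count and mod[w] >= min_count
    )

# Invented toy token streams.
historical_tokens = ["gay", "gay", "broadcast", "broadcast", "thee", "thee"]
modern_tokens = ["gay", "gay", "broadcast", "broadcast", "broadcast", "internet"]

targets = select_targets(historical_tokens, modern_tokens)
```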

To validate these computational observations and provide a ground truth for evaluation, a manual annotation protocol is implemented. This protocol involves expert linguists reviewing pairs of historical and contemporary contexts for the target lexical items to assign gold-standard semantic shift labels. The annotation guidelines distinguish between various types of semantic change, such as generalization, specialization, amelioration, and pejoration, providing a nuanced classification beyond binary shift detection. Annotators assess whether the meaning of the target word in the historical context maps directly to a definition found in the contemporary data or if a conceptual divergence has occurred. This human-in-the-loop approach is indispensable for resolving ambiguities that automated algorithms might misinterpret, thereby creating a high-fidelity benchmark against which algorithmic performance can be rigorously measured. The resulting dataset is structured to include the target lexical entry, paired historical and contemporary sentence contexts, and the annotated semantic shift label, offering a complete resource for training and testing.
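One possible in-memory layout for a single dataset entry is sketched below. The class and field names are hypothetical; the four shift labels follow the annotation guidelines described above, while the `STABLE` label is an added assumption for words judged unchanged. The two example sentences are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class ShiftLabel(Enum):
    STABLE = "stable"                  # assumed label for unchanged words
    GENERALIZATION = "generalization"
    SPECIALIZATION = "specialization"
    AMELIORATION = "amelioration"
    PEJORATION = "pejoration"

@dataclass(frozen=True)
class ContrastRecord:
    target: str                 # the target lexical entry
    historical_context: str     # sentence from the historical corpus
    contemporary_context: str   # sentence from the contemporary corpus
    label: ShiftLabel           # gold-standard annotation

record = ContrastRecord(
    target="meat",
    historical_context="They shared meat and drink at the table.",
    contemporary_context="She stopped eating meat last year.",
    label=ShiftLabel("specialization"),
)
```

Using an enumeration rather than free text means that a mistyped label is rejected when the record is created instead of silently entering the benchmark.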

The practical application of this dataset construction is significant, as it transforms raw historical texts into a quantifiable analytic resource. By standardizing the contrast between diachronic linguistic stages, researchers can move from anecdotal evidence of language change to measurable, replicable data science. The final parallel dataset provides the statistical infrastructure necessary to train machine learning models capable of identifying semantic drift automatically. Furthermore, the detailed structure of the dataset, encompassing orthographic variations, contextual alignments, and validated semantic labels, ensures that subsequent studies can rely on a solid empirical foundation. This rigorous construction process ultimately enhances the precision of contrastive lexical semantic shift detection, enabling a deeper understanding of the temporal dynamics that shape language evolution.

2.2 Design of Context-Aware Contrastive Semantic Representation Models

The design of the context-aware contrastive semantic representation model constitutes the technical core of this research, aiming to precisely capture the dynamic semantic evolution of lexical items within historical corpora. At a fundamental level, this model is engineered to transform raw textual data from distinct temporal stages into high-dimensional vector representations where semantic relationships are quantitatively measurable. The core principle governing this design relies on the assumption that the meaning of a word is intrinsically determined by its surrounding linguistic environment, necessitating an encoding mechanism that is sensitive to both local syntactic structures and broader diachronic contexts. By integrating contextual information from different time periods, the model captures time-sensitive semantic features, allowing for the differentiation between stable usages and those that have undergone significant semantic shift.

To address the pervasive challenge of data sparsity often encountered in historical documents, the framework adopts a transfer learning approach anchored in pre-trained language models. Historical corpora frequently lack the voluminous data required to train deep neural architectures from scratch, which poses a risk of overfitting and poor generalization. By leveraging models that have been pre-trained on vast contemporary datasets, the system transfers robust syntactic and semantic generalizations to the historical domain. This process involves fine-tuning the pre-trained parameters on the specific parallel historical-contemporary datasets. During this phase, the encoding mechanism adjusts its internal weights to accommodate archaic spellings, obsolete grammatical structures, and period-specific vocabulary, effectively bridging the distributional gap between modern pre-training data and historical target data without sacrificing the representational power of the neural network.

Central to the operational procedure of this framework is a contrastive learning objective. This objective function structures the embedding space by adjusting the distance between vector pairs according to their semantic similarity. In practical terms, the model treats pairs of contextual usages of a target word that express the same meaning as positive samples, and pairs in which the word's usage has drifted or changed as negative samples. Training minimizes the distance between positive pairs, pulling their representations closer together in the vector space, while simultaneously maximizing the distance between negative pairs, pushing them farther apart. This mechanism ensures that the resulting semantic representations are not merely context-aware but are explicitly optimized to highlight subtle semantic differences across time, making the detection of shifts more robust than with traditional static embeddings.
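The text does not specify the exact form of the objective. As one common instance, the sketch below implements an InfoNCE-style loss with a temperature parameter for a single anchor usage, one positive and a set of negatives. The choice of InfoNCE, the function names and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor.

    The loss is small when the anchor is much more similar to the positive
    than to any negative, and large in the opposite case.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Invented toy usage vectors.
anchor = [1.0, 0.0]
same_sense = [0.9, 0.1]
drifted_sense = [0.0, 1.0]

good = info_nce(anchor, same_sense, [drifted_sense])   # well-separated pair
bad = info_nce(anchor, drifted_sense, [same_sense])    # roles swapped
```

A lower temperature sharpens the contrast between positives and negatives, which is why it is usually tuned together with the learning rate and batch size.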

The detailed model architecture is structured around a transformer-based encoder, which processes input sequences to generate contextualized token representations. Input data first undergoes a preprocessing pipeline that normalizes text and aligns temporal segments before being fed into the encoder. The attention layers within the architecture allow the model to weigh the importance of surrounding words dynamically, creating a unique representation for the target word based on its specific usage in a given sentence. Hyperparameter setting is a critical step in this workflow, involving the careful tuning of variables such as learning rate, batch size, and the temperature parameter used within the contrastive loss function. The learning rate is typically set lower during the fine-tuning stage to preserve the acquired pre-trained knowledge while adapting to the historical domain. The batch size is determined based on available computational resources and the size of the parallel datasets to ensure stable gradient updates. Through this rigorous architectural design and parameter optimization, the model achieves a high degree of precision in mapping semantic trajectories, providing a reliable foundation for the subsequent contrastive analysis of lexical change.

2.3 Development of Shift Type and Magnitude Quantification Metrics

The quantitative assessment of lexical semantic shift requires a rigorous framework that distinguishes between the qualitative nature of the change and the degree to which the meaning has evolved. To achieve this, the development of specific metrics focuses on two primary dimensions: the typological classification of the shift and the magnitude of the semantic displacement. This dual approach ensures that the analysis captures not only how much a word's meaning has changed over time but also the specific trajectory of that change within the historical linguistic record.

Establishing the typology of semantic shift forms the foundation of the qualitative assessment. In historical linguistics, semantic shifts are generally categorized into four principal classes based on the directionality of the change in scope and evaluative connotation. Meaning broadening refers to the process where a word's semantic scope expands to cover a wider range of referents than it did in the historical period. Conversely, meaning narrowing occurs when the application of a term becomes more restricted and specific over time. Beyond scope changes, shifts in connotation are quantified through meaning amelioration, where a term acquires a more positive or dignified status, and meaning pejoration, where the term undergoes a deterioration in meaning to become more negative or disparaging. To detect these types computationally, the framework utilizes context-aware contrastive semantic representations. By analyzing the distributional features of a target word in historical and contemporary corpora, specific classification features are engineered. For instance, the degree of overlap or divergence in neighboring lexical clusters can indicate broadening or narrowing, while shifts in the sentiment scores of the contextual vectors serve as robust indicators for amelioration and pejoration.
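Two of the classification features mentioned above can be sketched in a few lines: the overlap between a word's historical and contemporary neighbor sets, and the change in average sentiment of those neighbors. The Jaccard index as the overlap measure, the polarity lexicon and the neighbor lists are assumptions for illustration only.

```python
def jaccard(a, b):
    """Overlap of two neighbor sets: 1.0 means identical, 0.0 disjoint."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def sentiment_delta(hist_neighbors, mod_neighbors, polarity):
    """Mean polarity of modern neighbors minus that of historical ones.

    Positive values hint at amelioration, negative values at pejoration.
    Words missing from the lexicon are treated as neutral (0.0).
    """
    def mean_polarity(words):
        scores = [polarity.get(w, 0.0) for w in words]
        return sum(scores) / len(scores) if scores else 0.0
    return mean_polarity(mod_neighbors) - mean_polarity(hist_neighbors)

# Hypothetical polarity lexicon and neighbor lists.
polarity = {"foolish": -1.0, "silly": -0.5, "kind": 1.0, "pleasant": 1.0}
hist_neighbors = ["foolish", "silly"]
mod_neighbors = ["kind", "pleasant"]

overlap = jaccard(hist_neighbors, mod_neighbors)
delta = sentiment_delta(hist_neighbors, mod_neighbors, polarity)
```

Such features would serve as inputs to a classifier rather than as decisions in themselves; a low overlap alone does not distinguish broadening from narrowing without also comparing the size or diversity of the two neighborhoods.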

While classification identifies the type of change, quantifying the magnitude of the shift is essential for understanding the extent of semantic evolution. This is achieved by calculating the geometric distance between the historical and contemporary semantic representations within a shared embedding space. The underlying principle posits that as a word's usage context changes, its position in the high-dimensional vector space shifts accordingly. The magnitude of the shift is therefore measured as the distance between the historical vector and the modern vector. Cosine similarity is commonly employed as the underlying measure of proximity. However, since cosine similarity ranges from negative one to one and is highest when the two vectors point in the same direction, it must be transformed to serve as an intuitive magnitude metric. The usual transformation subtracts the cosine similarity from one, yielding a cosine distance between zero and two, where zero indicates perfect semantic stability and higher values indicate greater divergence.
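Written out, with the symbols \mathbf{h}_w and \mathbf{m}_w introduced here as assumed notation for the aligned historical and modern vectors of a word w, the magnitude metric and a small worked example read:

```latex
\[
d(w) \;=\; 1 - \cos\bigl(\mathbf{h}_w, \mathbf{m}_w\bigr)
     \;=\; 1 - \frac{\mathbf{h}_w \cdot \mathbf{m}_w}
                    {\lVert \mathbf{h}_w \rVert \, \lVert \mathbf{m}_w \rVert},
\qquad 0 \le d(w) \le 2 .
\]

% Worked example with invented vectors:
\[
\mathbf{h}_w = (1,\, 0), \quad \mathbf{m}_w = (0.6,\, 0.8)
\;\Longrightarrow\;
\cos\bigl(\mathbf{h}_w, \mathbf{m}_w\bigr) = \frac{0.6}{1 \cdot 1} = 0.6,
\qquad d(w) = 0.4 .
\]
```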

To ensure the robustness and comparability of these metrics across different vocabulary items, rigorous normalization processing is applied. Raw distance scores can be influenced by the frequency of the word and the specific dimensionality of the embedding model. Consequently, the metrics undergo standardization: z-score normalization expresses each magnitude in standard deviations from the vocabulary mean, while min-max scaling maps the magnitudes onto the unit interval. This process allows for the objective ranking of words based on their degree of semantic change, distinguishing between minor fluctuations and substantial semantic change. By integrating these normalized distance scores with the typological classification features, the framework provides a comprehensive operational procedure. This enables researchers to identify whether a word has undergone a specific type of shift, such as pejoration, and simultaneously understand the intensity of that transformation relative to other lexical changes in the corpus. This systematic quantification is vital for empirical studies in historical linguistics, offering a data-driven pathway to validate theories of semantic change and uncover patterns of language evolution that are not immediately apparent through qualitative reading alone.
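Both normalization schemes are short enough to sketch directly. The distance values below are invented, and the use of the population standard deviation for the z-score is an assumption; both functions also assume that the input values are not all identical.

```python
import statistics

def z_scores(values):
    """Express each value in standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Invented raw cosine distances for three hypothetical words.
distances = {"w1": 0.1, "w2": 0.2, "w3": 0.6}
words = list(distances)
raw = [distances[w] for w in words]

standardized = dict(zip(words, z_scores(raw)))
scaled = dict(zip(words, min_max(raw)))
ranking = sorted(words, key=scaled.get, reverse=True)
```

Both transformations are monotonic, so the ranking of words is the same under either; they differ only in how the scores are interpreted.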

2.4 Validation of the Framework on Multilingual Historical Corpora

The validation of the proposed contrastive lexical semantic shift detection framework constitutes a critical phase in establishing its empirical viability and theoretical robustness. This process involves rigorously testing the system against multilingual historical corpora to determine whether the underlying computational models can accurately identify and categorize semantic changes across different languages and time periods. The fundamental objective is to verify that the framework not only captures diachronic linguistic shifts but also generalizes effectively across diverse language families, thereby providing a reliable tool for historical linguistics and computational lexicography.

To ensure the integrity of the validation process, the experimental setup utilizes a selection of multilingual historical datasets that represent distinct language families and cover significant chronological spans. These corpora are chosen to provide a comprehensive testing ground that includes languages with rich morphological structures as well as those with analytic tendencies. By incorporating datasets such as the Chinese Historical Texts, the Latin Library, and, for English, the Corpus of Historical American English (COHA), the validation covers a wide spectrum of linguistic evolution. The time spans for these datasets are selected to encompass periods of known substantial social and linguistic change, thereby providing fertile ground for detecting semantic shifts.

The operational procedure begins with the preprocessing of these historical texts, which includes tokenization, lemmatization, and the removal of noise to ensure high-quality input data. Following this, the framework constructs diachronic word embeddings for each distinct time period within the corpora. The core contrastive mechanism is then applied, where word vectors from different time slices are aligned and compared to quantify the degree of semantic drift. To establish a benchmark for performance, the proposed framework is evaluated against established baseline models, including traditional alignment methods and static distributional approaches. The evaluation protocol employs standard metrics such as the Rank Biased Overlap score and precision at specific thresholds to quantify the accuracy of shift detection.
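As an illustration of the threshold-based metric, the sketch below computes precision at k, one plausible reading of "precision at specific thresholds": the fraction of the k words ranked as most shifted that the gold standard also marks as shifted. The function name, the ranking and the gold set are invented for illustration.

```python
def precision_at_k(ranked_words, gold_shifted, k):
    """Share of the top-k ranked words that are annotated as shifted."""
    top_k = ranked_words[:k]
    hits = sum(1 for word in top_k if word in gold_shifted)
    return hits / k

# Invented system ranking (most shifted first) and gold annotations.
ranked_words = ["w_a", "w_b", "w_c", "w_d"]
gold_shifted = {"w_a", "w_c"}

p_at_2 = precision_at_k(ranked_words, gold_shifted, k=2)
p_at_3 = precision_at_k(ranked_words, gold_shifted, k=3)
```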

Experimental results indicate that the proposed framework consistently outperforms baseline models across all tested languages. In the English datasets, the system demonstrates high precision in identifying well-documented semantic shifts such as the change in meaning of the word "gay" or the evolution of "broadcast." Similarly, in the Chinese and Latin corpora, the framework successfully detects subtle shifts in polysemous terms that often escape detection by non-contrastive models. The performance analysis reveals that the framework excels particularly in detecting gradual, metaphoric shifts, while maintaining a robust performance in detecting rapid, coercive changes driven by socio-political events.

A detailed ablation analysis is conducted to understand the contribution of individual components within the framework. This involves systematically removing key modules, such as the contrastive loss function or the contextual alignment layer, to observe the impact on overall performance. The results of this analysis highlight that the contrastive component is essential for distinguishing between true semantic shift and mere noise caused by corpus variance. Without this component, the model suffers from a significant drop in precision, falsely identifying stable words as shifted. Furthermore, the analysis confirms that the alignment mechanism plays a pivotal role in maintaining the topological consistency of the vector space across time, which is crucial for accurate comparison.

The practical value of this validation lies in its demonstration of the framework’s capability to handle the complexities of real-world historical data. The ability to operate effectively across multiple languages suggests that the underlying principles of the framework are language-agnostic to a large degree, relying on universal distributional properties of semantic change rather than language-specific rules. However, the limitations observed provide important insights for future development. The framework, for instance, shows a relative decrease in performance when dealing with low-frequency words or languages with highly inflectional morphology where data sparsity is a concern.

In conclusion, the validation on multilingual historical corpora confirms that the contrastive lexical semantic shift detection framework offers a sophisticated and effective solution for diachronic semantic analysis. By combining rigorous experimental protocols with a robust theoretical foundation, the system successfully bridges the gap between computational linguistics and historical semantics. The findings underscore the importance of contrastive learning in capturing the nuances of semantic evolution and establish a strong precedent for the application of deep learning techniques in the study of language history.

Chapter 3 Conclusion

This research into contrastive lexical semantic shift detection synthesizes a theoretical framework with practical computational methodologies applied to historical corpora, demonstrating that the evolution of language is not merely a cultural phenomenon but a quantifiable process driven by specific linguistic mechanisms. At its fundamental level, lexical semantic shift refers to the alteration in the meaning of a word over time, and the contrastive approach operationalizes this definition by comparing a word's usage across two distinct historical time frames and, in the multilingual setting, by comparing the resulting shifts across languages. This comparative perspective is crucial because it moves beyond the analysis of isolated language changes to uncover the broader cognitive and sociocultural patterns that influence semantic evolution across different linguistic boundaries. The core principle governing this investigation rests on the distributional hypothesis, which posits that the meaning of a word is determined by the company it keeps within the textual data. By operationalizing this principle through vector space models, the study translates abstract linguistic concepts into concrete geometric representations, allowing for the measurement of semantic distance and direction.

Implementing this operational pathway requires a rigorous sequence of data processing and analytical steps to ensure validity. The procedure begins with the preprocessing of the diachronic corpora, where texts from different time periods are lemmatized and normalized to create a stable foundation for comparison. Following this, the extraction of high-dimensional word embeddings allows the research to capture the subtle contextual nuances of the target lexis. The critical technical phase involves the alignment of these vector spaces across temporal spans, utilizing techniques such as orthogonal Procrustes to make the independently trained spaces comparable. Once aligned, the semantic shift is quantified by calculating the cosine distance between the vector representations of a word in the early period and in the later period. To achieve the contrastive element across languages, these individual trajectories are mapped against the corresponding shifts observed in the other languages' corpora, identifying patterns of convergence, divergence, or independent evolution. This methodological structure ensures that the detection of semantic change is not arbitrary but is grounded in statistical evidence derived from the aggregate usage patterns of the linguistic community.

The practical application of these findings extends significantly beyond the confines of computational linguistics into the fields of digital humanities and lexicography. For lexicographers, the ability to narrow down the period and the specific contexts in which a word’s meaning began to drift provides a powerful tool for refining historical dictionary entries and etymological studies. Furthermore, this approach offers a standardized framework for analyzing cultural contact and historical influence. By observing whether two languages undergo similar semantic shifts due to contact or distinct shifts due to internal structural pressures, researchers can gain deeper insights into the sociopolitical history of the periods under study. The value of this research lies in its capacity to complement subjective interpretations of historical texts with reproducible, data-driven evidence. It bridges the gap between the qualitative richness of philological traditions and the quantitative rigor of modern data science, offering a reproducible and scalable model for future diachronic research. Ultimately, this work underscores the importance of viewing language as a dynamic system, providing a robust technical foundation for exploring how human cognition and social interaction are encoded and transformed through the medium of lexical semantics over centuries.
