hackslash dot org

Show HN: I modeled the Voynich Manuscript with SBERT to test for structure

Posted: 2025-05-18 16:09:01

The author used Sentence-BERT (SBERT), a semantic similarity model, to analyze the Voynich Manuscript, hoping to uncover hidden structure. They treated each line of "Voynichese" as a separate sentence and embedded them using SBERT, then visualized these embeddings in a 2D space using UMAP. While visually intriguing patterns emerged, suggesting some level of semantic organization within sections of the manuscript, the author acknowledges that this doesn't necessarily mean the text is meaningful or decipherable. They released their code and data, inviting further exploration and analysis by the community. Ultimately, the project demonstrated a novel application of SBERT to a historical mystery but stopped short of cracking the code itself.

A Hacker News user, "brianmg," has shared a project exploring the enigmatic Voynich Manuscript using modern natural language processing (NLP) techniques. The central hypothesis of this project is that despite the manuscript's unknown script and language, underlying structural patterns might be discernible through computational analysis. Specifically, the project utilizes Sentence-BERT (SBERT), a powerful model designed to generate semantically meaningful sentence embeddings. These embeddings represent the meaning of text as numerical vectors, allowing for comparisons of semantic similarity between different passages.
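To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity, a common way to score how close two embedding vectors are. The summary does not say which similarity measure the project used, so treat the choice of cosine as an assumption; the three-dimensional vectors are toy values, not real SBERT output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not taken from the project).
passage_a = [1.0, 0.0, 1.0]
passage_b = [2.0, 0.0, 2.0]  # points in the same direction as passage_a
passage_c = [0.0, 3.0, 0.0]  # orthogonal to passage_a

similar = cosine_similarity(passage_a, passage_b)    # close to 1
unrelated = cosine_similarity(passage_a, passage_c)  # exactly 0 here
```

A score near 1 means two passages point the same way in embedding space; a score near 0 means the model sees them as unrelated.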

The author's approach involved dividing the Voynich Manuscript into sections based on its "folios," or pages. Each section of text within a folio was then treated as a separate "sentence" for the purposes of generating embeddings with SBERT. The choice of SBERT was motivated by the hope that its embeddings would capture relationships between passages even in the absence of a known language, potentially revealing hidden structure within the manuscript.
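The segmentation and embedding step might look roughly like the sketch below. Everything specific in it is an assumption: the (folio, line) input format and the placeholder tokens are hypothetical, `group_by_folio` and `embed_texts` are illustrative names, and `all-MiniLM-L6-v2` is simply a commonly used sentence-transformers model, since the summary does not say which checkpoint the author chose.

```python
def group_by_folio(rows):
    """Collect transliterated lines under their folio label, keeping page order."""
    folios = {}
    for folio, text in rows:
        folios.setdefault(folio, []).append(text)
    return folios

def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Embed each string with a Sentence-BERT model.

    Assumption: the sentence-transformers package is installed. The model
    name is a stand-in, not the checkpoint used in the project.
    """
    from sentence_transformers import SentenceTransformer  # lazy import
    model = SentenceTransformer(model_name)
    return model.encode(list(texts))

# Hypothetical input: (folio label, transliterated line) pairs.
# The tokens are placeholders, not an actual transliteration.
rows = [
    ("f1r", "qo ke dy"),
    ("f1r", "da in ol"),
    ("f1v", "ol da in dy"),
]
folios = group_by_folio(rows)

# Two possible units of analysis: one "sentence" per folio, or one per line.
per_folio = [" ".join(lines) for lines in folios.values()]
per_line = [text for _, text in rows]

# embeddings = embed_texts(per_folio)  # downloads the model on first use
```

Which unit is embedded (a line or a whole folio section) changes how many points end up in the later plots, so it is worth fixing that choice before comparing results.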

After generating these embeddings, the project employed a variety of clustering and dimensionality reduction techniques. Clustering algorithms were used to group similar sections of the manuscript together based on the semantic proximity of their embeddings. Dimensionality reduction, specifically using t-SNE and UMAP, was implemented to visualize these high-dimensional embeddings in a more interpretable two-dimensional space. This allowed for a visual representation of potential clusters and relationships between different sections of the manuscript.
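The clustering step can be illustrated with a small k-means implementation in plain Python. The summary does not name the clustering algorithm, so k-means is swapped in here purely as an example, and the two-dimensional points stand in for real embeddings. The t-SNE and UMAP projections are not reproduced in this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Toy 2-D "embeddings": two well-separated groups (hypothetical values).
points = [
    [0.0, 0.1], [5.0, 5.1], [0.2, 0.0],
    [5.2, 4.9], [0.1, 0.2], [4.9, 5.0],
]
labels, centroids = kmeans(points, k=2)
```

On real embeddings the number of clusters `k` is itself a modeling choice, which is one reason cluster plots alone are weak evidence of underlying topics.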

The ultimate goal of the project was to test whether any discernible structure emerges from this analysis. The presence of distinct clusters, for instance, might suggest the existence of different topics or themes within the manuscript. Furthermore, the visualization of the embeddings could reveal patterns in the ordering or arrangement of these topics throughout the manuscript. While the project does not claim to have deciphered the Voynich Manuscript, it offers a novel computational approach to exploring its structure and potentially uncovering clues about its content. The code and data for the project are publicly available on GitHub for further exploration and scrutiny by other researchers.

Summary of Comments (28)
https://news.ycombinator.com/item?id=44022353

HN commenters are generally skeptical of the analysis presented. Several point out the small sample size and the risk of overfitting when dealing with such limited data. One commenter notes that previous NLP analysis using Markov chains produced similar results, suggesting the observed "structure" might be an artifact of the method rather than a genuine feature of the manuscript. Another expresses concern that the approach doesn't account for potential cipher keys or transformations, making the comparison to known languages potentially meaningless. There's a general feeling that while interesting, the analysis doesn't provide strong evidence for or against any particular theory about the Voynich Manuscript's origins. A few commenters request more details about the methodology and specific findings to better assess the claims.
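One way to probe the "artifact of the method" concern is a control experiment: generate text that has local statistical regularity but no intended meaning, then run it through the same pipeline and see whether similar clusters appear. The sketch below builds a first-order, word-level Markov chain for that purpose. It illustrates the idea only; neither the author nor the commenters are described as having done this, and the tokens are placeholders.

```python
import random
from collections import defaultdict

def build_chain(lines):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for line in lines:
        tokens = line.split()
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain

def generate_line(chain, start, length, rng):
    """Random walk through the chain; stops early at a token with no successor."""
    tokens = [start]
    while len(tokens) < length:
        successors = chain.get(tokens[-1])
        if not successors:
            break
        tokens.append(rng.choice(successors))
    return " ".join(tokens)

# Placeholder source lines (hypothetical tokens, not a real transliteration).
source_lines = [
    "qo ke dy qo te dy",
    "da in qo ke ol",
    "ol da in dy",
]
chain = build_chain(source_lines)
rng = random.Random(0)  # fixed seed so the control text is reproducible
control_line = generate_line(chain, start="qo", length=8, rng=rng)
```

If embeddings of such control text cluster as cleanly as the manuscript's, the clusters say more about the pipeline than about the manuscript.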

The Hacker News post "Show HN: I modeled the Voynich Manuscript with SBERT to test for structure," which links to a GitHub repository detailing the analysis, sparked a moderate discussion with a few intriguing comments. Several commenters engaged with the methodology and findings, while others offered alternative perspectives or pointed towards related research.

One commenter questioned the fundamental assumption of the analysis, suggesting that treating the Voynich Manuscript as a single unified document might be flawed. They proposed that it could potentially be a collection of disparate texts bound together, which would complicate any analysis attempting to find overall structure. This raises the possibility that searching for a singular structure might be a red herring, and that a more fruitful approach might involve segmenting the manuscript before applying analytical techniques.

Another commenter brought up the intriguing possibility of the Voynich Manuscript being an elaborate hoax. They pointed to the lack of clear corrections in the text, which is unusual for handwritten documents of that length. This absence of corrections could suggest that the manuscript was deliberately constructed, perhaps using some sort of generative system or cipher, rather than being a genuine record of language. This comment highlights the ever-present skepticism surrounding the Voynich Manuscript and the difficulty in definitively ruling out a sophisticated hoax.

A third comment referenced a previous attempt to analyze the manuscript using similar NLP techniques. This earlier analysis suggested the presence of meaningful linguistic structure, potentially indicating a real language. The commenter compared these earlier findings with the current analysis, showing how different analytical approaches can yield varying interpretations and why consensus about the nature of the Voynich Manuscript remains elusive.

Finally, one commenter expressed appreciation for the author's clear and concise write-up of their methodology and findings. They specifically praised the effective use of visualization, making the analysis more accessible to a wider audience. This underscores the importance of clear communication in scientific research, especially when dealing with complex topics like the Voynich Manuscript.

While the number of comments isn't extensive, they represent a range of perspectives and offer valuable insights into the complexities of analyzing the Voynich Manuscript and the ongoing debate surrounding its origins and meaning. The discussion reflects both the challenges in deciphering this enigmatic document and the continued fascination it holds for researchers and enthusiasts alike.
