The blog post explores visualizing the "ISBN space" by treating ISBN-13s as coordinates in 13-dimensional space and projecting them down to 2D using dimensionality reduction techniques like t-SNE and UMAP. The author uses a dataset of over 20 million book records from Open Library, coloring the resulting visualizations by publication year or language. The resulting scatter plots reveal interesting clusters, suggesting that ISBNs, despite being assigned sequentially, exhibit some grouping based on book characteristics. The visualizations also highlight the limitations of these dimensionality reduction methods, as some seemingly close points in the 2D projection are actually quite distant in the original 13-dimensional space.
This blog post, titled "Visualizing all books of the world in ISBN-Space," by Phiresky, explores a fascinating, albeit ultimately flawed, approach to visualizing the relationships between all published books using their International Standard Book Numbers (ISBNs) as coordinates in a multi-dimensional space. The author's core concept involves treating the digits of an ISBN – specifically the 10-digit ISBNs prevalent before 2007 – as dimensions in a 10-dimensional space. Each book, therefore, occupies a unique point within this hypothetical space, defined by its ISBN.
Phiresky begins by acknowledging the inherent abstractness of a 10-dimensional space, which is impossible for humans to directly visualize. To overcome this, the author employs dimensionality reduction techniques. Specifically, they utilize Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), both commonly used methods for reducing high-dimensional data to a more manageable number of dimensions, typically two or three, while attempting to preserve important relationships between data points.
The author's process involves retrieving a dataset of ISBNs, converting each ISBN into a numerical vector with one component per digit, and then applying PCA and t-SNE to these vectors. The resulting two- or three-dimensional coordinates are then plotted, creating a visual representation of "ISBN-space." Several visualization attempts are presented, including a static 2D scatter plot colored by publication year and an interactive 3D visualization.
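The digit-vector and PCA steps described above can be sketched as follows. This is a minimal illustration, not the author's code: the function names `isbn10_to_vector` and `pca_project` are hypothetical, the sample ISBN strings are placeholders rather than records from the post's dataset, the mapping of a trailing `X` to the value 10 is an assumption, and PCA is computed directly with NumPy's SVD instead of whatever library the post may have used.

```python
import numpy as np


def isbn10_to_vector(isbn: str) -> np.ndarray:
    """Turn an ISBN-10 string into a length-10 numeric vector.

    Assumption for illustration: a trailing 'X' is mapped to 10.
    """
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        raise ValueError(f"expected 10 characters, got {len(chars)}")
    return np.array([10 if c == "X" else int(c) for c in chars], dtype=float)


def pca_project(points: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row vectors onto their first principal components via SVD."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Placeholder ISBN-10 strings, used only to exercise the functions.
sample_isbns = [
    "0-306-40615-2",
    "0-13-110362-8",
    "3-16-148410-X",
    "0-306-40616-0",
    "0-13-110370-9",
]

vectors = np.stack([isbn10_to_vector(s) for s in sample_isbns])
coords = pca_project(vectors, n_components=2)  # one (x, y) pair per ISBN
```

Each row of `coords` could then be passed to a scatter plot. t-SNE would replace `pca_project` with a nonlinear embedding, which is not shown here.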
Phiresky discusses the interpretation of these visualizations, pointing out clusters and patterns that seem to emerge. For example, books published in similar years appear to cluster together, suggesting that parts of the ISBN structure might relate to publication date. The author also notes the influence of the check digit, the final character of a 10-digit ISBN (a digit or the letter X), which is computed from the preceding nine digits as a weighted sum modulo 11 so that common transcription errors can be detected. Because the check digit is fully determined by the other nine digits, it adds no independent information, and this dependency influences the arrangement of points in the visualized space.
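To make that dependency concrete, here is a small sketch of the standard ISBN-10 check-digit rule: the nine leading digits are weighted 10 down to 2, and the check value is whatever brings the weighted total to a multiple of 11, with a value of 10 written as `X`. The function name `isbn10_check_digit` is a hypothetical helper and does not come from the post.

```python
def isbn10_check_digit(first_nine: str) -> str:
    """Return the ISBN-10 check character for the nine leading digits."""
    if len(first_nine) != 9 or not (first_nine.isascii() and first_nine.isdigit()):
        raise ValueError("expected exactly nine ASCII digits")
    # Weights run 10, 9, ..., 2 from the first digit to the ninth.
    total = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    # The check value makes the full weighted sum divisible by 11.
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


print(isbn10_check_digit("030640615"))  # prints 2
print(isbn10_check_digit("316148410"))  # prints X
```

Since the tenth coordinate is a function of the first nine, the points in the 10-dimensional digit space actually lie on a lower-dimensional structure, which is one reason a projection can show regular patterns unrelated to the books themselves.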
However, the author crucially acknowledges the significant limitations of this approach. The primary issue stems from the nature of ISBNs themselves. While designed for unique identification, ISBNs are not inherently semantically meaningful. The assignment of ISBNs reflects factors such as publisher and publication order rather than the content or subject matter of the books. Therefore, the proximity of two books in "ISBN-space" does not necessarily indicate any genuine relationship between them beyond potentially sharing a publisher or being published around the same time. The observed patterns and clusters are likely artifacts of the ISBN allocation system and not indicative of deeper connections between the books.
Ultimately, the author concludes that while visually interesting, visualizing books in ISBN-space doesn't offer meaningful insights into the literary world. The imposed structure of ISBNs drives the visualizations rather than inherent relationships between books. The project serves as an exploration of data visualization techniques applied to an unusual dataset, highlighting both the potential and the pitfalls of interpreting patterns in high-dimensional data.
Summary of Comments (35)
https://news.ycombinator.com/item?id=42897120
Commenters on Hacker News largely praised the visualization and the author's approach to exploring the ISBN dataset. Several pointed out interesting patterns revealed by the visualization, such as the clustering of books by language and subject matter. Some discussed the limitations of using ISBNs for this kind of analysis, noting that not all books have ISBNs (especially older ones) and the system itself has undergone changes over time. Others offered suggestions for improvements or further exploration, such as incorporating data about book sales or using different dimensionality reduction techniques. A few commenters shared related projects or resources, including visualizations of other datasets and tools for working with ISBNs. The overall sentiment was one of appreciation for the project and its insightful presentation of complex data.
The Hacker News post "Visualizing all books of the world in ISBN-Space" generated a fair amount of discussion, with several commenters intrigued by the visualization and the underlying data.
One of the most compelling threads revolved around the "holes" or gaps in the ISBN space visualized. Commenters discussed the reasons for these gaps, speculating about blocks of ISBNs being allocated but not used, books published without ISBNs, or simply limitations in the data source used for the visualization. This led to further discussion about the efficiency of ISBN allocation and the potential for wasted ISBN ranges. Some users with experience in publishing shared insights into how ISBNs are assigned and managed, offering a more practical perspective on the observed gaps.
Another interesting thread explored the limitations of using ISBNs for such a visualization. Some commenters pointed out that ISBNs don't perfectly represent all published books, as some books, especially older ones, might not have ISBNs. This led to a discussion about alternative ways to visualize the "world of books," such as using Library of Congress Control Numbers (LCCNs) or other bibliographic identifiers. The challenges and benefits of each approach were discussed.
Several commenters also expressed interest in the technical aspects of the visualization itself, inquiring about the tools and techniques used to create it. The original poster (OP) provided some details about the data processing and visualization methods, sparking a brief exchange about data visualization best practices and libraries.
Beyond these main threads, several individual comments offered observations of their own. Some commenters noted the patterns visible in the visualization, such as ISBNs clustering into dense regions. Others shared anecdotes about their own experiences with ISBNs and the publishing industry. A few questioned the practical value of the visualization, while others defended its artistic and exploratory merits. Overall, the comments section touched on technical, practical, and philosophical aspects of the project.