Applying Textual Embeddings for Numerical Data Clustering

Publisher: 한국공공가치학회
Authors: Aaditya Yadav, Min Seo Park, Ikshita Yadav
Publication: Journal of Public Value, Vol. 9, pp. 85-98 (14 pages)
Subject classification: Social Sciences > Social Welfare
File format: PDF
Publication date: 2025.06.30

Abstract

Purpose: This study investigates whether text-based embedding techniques, originally designed for natural language processing, can be effectively applied to numerical data. Method: We transform numerical datasets into space-separated strings and encode them using five embedding techniques: DistilBERT, TF-IDF, Doc2Vec, Multilingual-e5, and SFR-Mistral. To manage the resulting high-dimensional vectors, we reduce their dimensionality using both local and global configurations of UMAP. Clustering algorithms, including K-Means, Agglomerative, BIRCH, GMM, Genie, K-Medoids, K-Modes, LDA, MiniBatch K-Means, and Spectral Co-Clustering, are applied to these embeddings and compared against two baselines: clustering on raw numerical data and clustering on UMAP-reduced numerical data. Performance is evaluated using Normalized Clustering Accuracy (NCA) across a diverse set of benchmark datasets. Results: While text-based embeddings do not universally outperform traditional methods, several configurations, especially those using Multilingual-e5 and SFR-Mistral, demonstrate consistent improvements in clustering accuracy. In certain cases, embedding-based approaches yield dramatic gains (an increase of over 500% in NCA compared to raw data). Algorithms such as K-Means, K-Medoids, and Spectral Co-Clustering benefit most from the transformed representations. Visual analyses on datasets such as Graves Dense, Ring, and ZigZag show enhanced cluster separability and more balanced densities after embedding. Conclusion: Textual embeddings can serve as a viable alternative preprocessing strategy for numerical clustering tasks, offering substantial improvements in specific contexts. These findings encourage further research into hybrid embedding techniques tailored for numerical data, potentially involving the training of specialized models or integration with tabular-focused architectures, to capitalize fully on the observed benefits.
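
The first two steps of the Method (serializing numeric rows as space-separated strings, then encoding them as text) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function names, the two-decimal formatting, the toy data, and the particular TF-IDF variant (raw counts with idf = ln(N / df) + 1) are all assumptions made for illustration, since the abstract does not specify them. The UMAP reduction and the clustering step would follow on the resulting vectors.

```python
import math
from collections import Counter


def rows_to_strings(rows, decimals=2):
    """Serialize each numeric row as one space-separated string.

    The rounding precision (`decimals`) is an assumption for this sketch;
    the abstract does not say how values were formatted.
    """
    return [" ".join(f"{value:.{decimals}f}" for value in row) for row in rows]


def tfidf_vectors(documents):
    """Encode whitespace-tokenized documents as plain TF-IDF vectors.

    Uses raw term counts and idf = ln(N / df) + 1, one common variant;
    the paper's exact TF-IDF settings are not given in the abstract.
    """
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {token: math.log(n_docs / doc_freq[token]) + 1.0 for token in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[token] * idf[token] for token in vocabulary])
    return vocabulary, vectors


# Hypothetical toy data: three 2-D points, some sharing a coordinate value.
points = [(0.5, 1.25), (0.5, 3.0), (4.0, 3.0)]
texts = rows_to_strings(points)
vocabulary, vectors = tfidf_vectors(texts)
```

Note that in this sketch each formatted number is an opaque token: two rows contribute to the same dimension only when a value matches exactly after rounding, so numerically close values such as 0.50 and 0.51 are unrelated as far as TF-IDF is concerned. The neural encoders named in the abstract tokenize strings differently and would not share this property in the same form.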

Keywords

Text embeddings, UMAP, Numerical Data Clustering, DistilBERT, SFR-Mistral, Multilingual-e5, Doc2Vec, TF-IDF
