Product Classification and Clustering

Donated on 8/6/2023

This dataset was collected from PriceRunner, a popular product comparison platform. It contains 35311 product offers from 10 categories, provided by 306 different merchants. It is well suited to evaluating classification, clustering, and entity matching algorithms. Although the data is product-related, it can also be applied to other problems involving text or short-text mining.

Dataset Characteristics

Tabular, Text

Subject Area

Business

Associated Tasks

Classification, Clustering, Other

Feature Type

Categorical, Integer

# Instances

35311

# Features

7

Dataset Information

For what purpose was the dataset created?

Evaluating product classification, clustering, and entity matching methods, as well as short-text clustering algorithms.
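As a rough illustration of the entity matching task, the sketch below scores two product titles by token-level Jaccard similarity. This is a generic baseline chosen for illustration, not the method of the introductory paper; the function names `title_tokens` and `title_jaccard` and the example titles are hypothetical.

```python
from typing import Set


def title_tokens(title: str) -> Set[str]:
    """Split a title into a set of lowercase whitespace-delimited tokens."""
    return set(title.lower().split())


def title_jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A & B| / |A | B| between the token sets of two titles."""
    tokens_a, tokens_b = title_tokens(a), title_tokens(b)
    if not tokens_a and not tokens_b:
        # Two empty titles are treated as identical (an assumption of this sketch).
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    # Made-up titles, used only to show the call.
    print(title_jaccard("apple iphone 8 64gb", "apple iphone 8 plus 64gb"))
```

The two example titles share four of their five distinct tokens, so the score is 0.8. Offers whose titles score above a chosen threshold could be treated as candidate matches; the threshold itself would need tuning against the cluster labels.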

Who funded the creation of the dataset?

No funding

What do the instances in this dataset represent?

Product offers by various merchants

Are there recommended data splits?

No

Does the dataset contain data that might be considered sensitive in any way?

No

Was there any data preprocessing performed?

Case folding and punctuation removal were applied to the product titles (column 2).
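For readers who want to apply a comparable normalization to new titles, here is a minimal sketch of case folding followed by punctuation removal. The exact procedure used for the dataset is not documented here, so the details are assumptions: this version deletes ASCII punctuation outright (rather than replacing it with spaces) and collapses repeated whitespace. The name `normalize_title` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_title(title: str) -> str:
    """Case-fold a title, drop ASCII punctuation, and collapse whitespace."""
    folded = title.casefold()
    without_punct = folded.translate(_PUNCT_TABLE)
    # split() with no arguments also trims leading and trailing whitespace.
    return " ".join(without_punct.split())


if __name__ == "__main__":
    print(normalize_title("Apple iPhone 8 Plus, 64GB!"))
```

Deleting punctuation joins hyphenated terms (for example, "Wi-Fi" becomes "wifi"); replacing punctuation with a space instead would keep them as separate tokens.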

Has Missing Values?

No

Introductory Paper

A self-verifying clustering approach to unsupervised matching of product titles

By Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis, C. Makris. 2020

Published in Artificial Intelligence Review

Variables Table

Variable Name | Role | Type | Description | Units | Missing Values
Product ID | Feature | Integer | | | no
Product Title | Feature | Categorical | | | no
Merchant ID | Feature | Integer | | | no
Cluster ID | Feature | Integer | | | no
Cluster Label | Feature | Categorical | | | no
Category ID | Feature | Integer | | | no
Category Label | Feature | Categorical | | | no

Dataset Files

File | Size
pricerunner_aggregate.csv | 3.7 MB
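The sketch below shows one way to read the file into typed records using only the Python standard library. It assumes the CSV has a header row followed by seven comma-separated columns in the order given in the Variables Table; that layout is an assumption and should be checked against the downloaded file. The `Offer` class, the `read_offers` function, and the sample row are hypothetical and for illustration only.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass(frozen=True)
class Offer:
    product_id: int
    title: str
    merchant_id: int
    cluster_id: int
    cluster_label: str
    category_id: int
    category_label: str


def read_offers(handle: TextIO, has_header: bool = True) -> List[Offer]:
    """Parse offers from an open text handle, assuming the seven-column order above."""
    reader = csv.reader(handle, skipinitialspace=True)
    if has_header:
        next(reader, None)  # skip the header row if one is present
    offers = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != 7:
            raise ValueError(f"expected 7 columns, got {len(row)}: {row!r}")
        pid, title, mid, cid, clabel, catid, catlabel = (cell.strip() for cell in row)
        offers.append(
            Offer(int(pid), title, int(mid), int(cid), clabel, int(catid), catlabel)
        )
    return offers


if __name__ == "__main__":
    # Made-up row standing in for the real file contents.
    sample = io.StringIO(
        "Product ID,Product Title,Merchant ID,Cluster ID,"
        "Cluster Label,Category ID,Category Label\n"
        "1,example phone 64gb,10,100,Example Phone 64GB,2000,Mobile Phones\n"
    )
    print(read_offers(sample))
```

To read the real file, pass a handle opened with `open("pricerunner_aggregate.csv", newline="", encoding="utf-8")`; the encoding is also an assumption.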

Creators

Leonidas Akritidis

lakritidis@ihu.gr

International Hellenic University
