Several papers on sentence classification use the Yelp dataset (two versions: 5 classes vs. 2 classes). I'm looking for this dataset and cannot find it.

The ULMFiT paper says the 5-class dataset has 650K samples, while the binary one has 560K samples. It cites the paper on character-level convnets from NIPS 2015. That paper says the authors took 1,569,264 samples from the Yelp Dataset Challenge 2015 and constructed two classification tasks, but it does not describe the details.

The current version of the Yelp dataset has ~6M reviews. The version on Kaggle has 5.2M samples.

Does anyone know how to obtain the version used in the papers?
