Hacker News | ccgreg's comments

I don't know of anyone who uses Common Crawl as pre-training data without filtering it. We have an annotation system that lets people pick and choose which subsets they'd like to use.


Common Crawl is a sample of the web, so it's not directly helpful for someone wanting to make a product price dataset.


I'm a life-long hacker, and my crawler crawls with consent.


> and the data that I’ve experimented with from 2014 seemed high quality

That's because it's from the blekko search engine.


Sounds like blekko had a larger impact on the early URLs than I thought. Out of curiosity, do you remember how large blekko’s index was at its peak?


The largest index we had was 4 billion, which is tiny. Our crawl frontier was much larger.


That's already been happening for more than a year now.


Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles. Common Crawl Foundation


The complete list hides in the web graph:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...

and the specific file that's every host we've seen in the latest 3 crawls is:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...


> Common Crawl, with over one billion, nine hundred and seventy thousand web pages in their archive: 345TB.

Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is 1 of our 122 crawls.


oh, i didn't see that the 1.97 billion pages were crawled in an 11-day period earlier this month. either way, nearly 2,000,000,000 pages fit in about a third of a petabyte...

p.s. thanks for correcting me, i was using this information for something else, and now it's correct!
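
For anyone who wants to check the arithmetic on the single-crawl figures above (1.97 billion pages, 345TB), here is a rough sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the thread doesn't specify, and the function names are just illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the thread.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).

TB = 10**12
PB = 10**15


def avg_bytes_per_page(total_bytes: int, pages: int) -> float:
    """Average stored size of one page, in bytes."""
    return total_bytes / pages


def fraction_of_petabyte(total_bytes: int) -> float:
    """Size of the crawl expressed as a fraction of one petabyte."""
    return total_bytes / PB


crawl_bytes = 345 * TB        # size quoted for the single crawl
crawl_pages = 1_970_000_000   # 1.97 billion pages

if __name__ == "__main__":
    per_page_kb = avg_bytes_per_page(crawl_bytes, crawl_pages) / 1000
    print(f"average page size: about {per_page_kb:.0f} KB")
    print(f"fraction of a petabyte: {fraction_of_petabyte(crawl_bytes):.3f}")
```

So "about a third of a petabyte" holds up under those assumptions, at roughly 175 KB per page on average.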


Common Crawl has been running a low-resource language project for 1.5 years now -- it's a hard problem.


The guts on the inside changed several times during that timespan.

