ccgreg · 2026-04-26T22:33:45 1777242825

I don't know of anyone who uses Common Crawl as pre-training data without filtering it. We have an annotation system that lets people pick and choose which subsets they'd like to use.

ccgreg · 2026-04-26T06:18:33 1777184313

Common Crawl is a sample of the web, so it's not that directly helpful for someone wanting to make a product price dataset.

ccgreg · 2026-04-26T06:14:14 1777184054

I'm a life-long hacker, and my crawler crawls with consent.

ccgreg · 2026-04-16T01:22:29 1776302549

> and the data that I’ve experimented with from 2014 seemed high quality

That's because it's from the blekko search engine.

n1xis10t · 2026-04-16T15:27:41 1776353261

Sounds like blekko had a larger impact on the early URLs than I thought. Out of curiosity, do you remember how large blekko’s index was at its peak?

ccgreg · 2026-04-16T15:57:32 1776355052

The largest index we had was 4 billion, which is tiny. Our crawl frontier was much larger.

ccgreg · 2026-04-10T09:56:44 1775815004

That's already been happening for more than a year now.

ccgreg · 2026-04-01T07:41:49 1775029309

Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles. Common Crawl Foundation

ccgreg · 2026-03-27T18:32:08 1774636328

The complete list hides in the web graph:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...

and the specific file that's every host we've seen in the latest 3 crawls is:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main...

ccgreg · 2026-03-27T02:00:12 1774576812

> Common Crawl, with over one billion, nine hundred and seventy thousand web pages in their archive: 345TB.

Common Crawl is 300 billion webpages and 10 petabytes. I suppose your number is 1 of our 122 crawls.

genewitch · 2026-03-27T03:16:18 1774581378

oh, i didn't see that the 1.97 billion pages were crawled in an 11-day period earlier this month. either way, nearly 2,000,000,000 pages fit in about a third of a petabyte...

p.s. thanks for correcting me, i was using this information for something else, and now it's correct!
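
For anyone who wants to check the arithmetic, here is a rough sketch in Python using only the two figures quoted in this exchange (1.97 billion pages, 345TB). It assumes decimal units (1 TB = 10**12 bytes), and the names `PAGES_IN_CRAWL`, `CRAWL_BYTES`, and `average_page_bytes` are made up for illustration. Whether the 345TB figure is compressed or uncompressed isn't stated here, so treat the per-page number as a ballpark only.

```python
# Figures quoted in the thread. Decimal units are an assumption.
PAGES_IN_CRAWL = 1_970_000_000      # "1.97 billion pages"
CRAWL_BYTES = 345 * 10**12          # "345TB", taken as 345 * 10^12 bytes


def average_page_bytes(total_bytes: int, pages: int) -> float:
    """Return the mean number of stored bytes per page."""
    return total_bytes / pages


per_page = average_page_bytes(CRAWL_BYTES, PAGES_IN_CRAWL)
crawl_petabytes = CRAWL_BYTES / 10**15

print(f"average page size: about {per_page / 1000:.0f} kB")
print(f"crawl size: {crawl_petabytes:.3f} PB")
```

Under those assumptions the crawl works out to 0.345 PB, which is the "about a third of a petabyte" mentioned above.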

ccgreg · 2026-03-21T20:21:13 1774124473

Common Crawl has been running a low-resource language project for 1.5 years now -- it's a hard problem.

ccgreg · 2026-02-24T23:23:47 1771975427

The guts on the inside changed several times during that timespan.