> One of the strengths of having most of the vectors contain 0 values is that it makes computation much, much easier, even when the vectors are very large. I can quite quickly dismiss any pair of vectors that have no dimensions in common.
I agree, and depending on the data it might well be worth sticking with this for simplicity. One does have to consider whether it is safe to dismiss pairs that have no dimensions in common, though. Say one url has the tags "people, tech, jokes" and another has "ppl, technology, fun": they share no tags exactly, yet maybe they should not be missed. On the other hand, because of the tag suggestions you mentioned earlier, the tags might be fairly homogeneous.
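To make the quick-dismissal idea concrete, here is a minimal sketch (with made-up urls and tags) of an inverted index from tag to urls: only pairs that co-occur under at least one tag are ever compared, so pairs with no tags in common, like the "tech" vs "technology" example above, are skipped without any vector arithmetic.

```python
from collections import defaultdict
from itertools import combinations

# toy tag sets per url; note "tech" and "technology" never match exactly
urls = {
    "a.com": {"people", "tech", "jokes"},
    "b.com": {"ppl", "technology", "fun"},
    "c.com": {"tech", "startups"},
}

# inverted index: tag -> set of urls carrying that tag
index = defaultdict(set)
for url, tags in urls.items():
    for tag in tags:
        index[tag].add(url)

# only url pairs that co-occur under some tag become candidates;
# (a.com, b.com) is dismissed for free because they share no tag
candidates = set()
for members in index.values():
    candidates.update(combinations(sorted(members), 2))

print(candidates)  # {('a.com', 'c.com')}
```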
Your intuition about PCA is spot on. The way it works is that it splits your matrix into a product of two matrices. Say you have a large matrix with n rows (urls) and 60000 columns (tags); call this X.
PCA will give you X = A*S (approximately).
Matrix A will have n rows and, say, 20 columns, and matrix S will have 20 rows and 60000 columns. Notice that there are far fewer entries in total, so this is a compression of the data: for n = 100000 urls, X has 100000*60000 = 6 billion entries, while A and S together have only 100000*20 + 20*60000 = 3.2 million.
Matrix A tells you how the original urls are represented in the new features, and matrix S tells you how the new features are represented in the old features (a linear mixture of the old features, a.k.a. the tags). Depending on how you do this factorization into two matrices you get different methods, but PCA is optimal in certain important ways.
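If you want to try this, here is a minimal sketch using scikit-learn. Since the url-by-tag matrix is sparse, it uses TruncatedSVD, which is essentially PCA without the mean-centering step that would densify the matrix; the shapes and interpretation match the description above, and all sizes and data are made-up toy values.

```python
# Minimal sketch of the X ~= A*S factorization described above.
# TruncatedSVD works directly on a sparse url-by-tag matrix (plain PCA
# would densify it by mean-centering). Sizes here are toy values.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

n_urls, n_tags, k = 300, 2000, 20

# hypothetical sparse url x tag matrix X
# (random nonzeros stand in for tag indicators)
X = sparse_random(n_urls, n_tags, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=k, random_state=0)
A = svd.fit_transform(X)   # n_urls x k: urls expressed in the new features
S = svd.components_        # k x n_tags: new features as mixtures of tags

print(A.shape, S.shape)    # (300, 20) (20, 2000)

# X is approximately A @ S; the error shrinks as k grows
err = np.linalg.norm(X.toarray() - A @ S) / np.linalg.norm(X.toarray())
print(f"relative reconstruction error: {err:.2f}")
```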
There are lots of nice intros if you search for "pca tutorial", and there are ready implementations in most languages. The basic ideas are simple, but then the rabbit hole goes deeper and deeper, just like this thread, so I'll stop here :)