It depends: how many documents are in the corpus? Is the difference in size uniform (i.e., are there only two sizes among all the documents)?
Does the order of the text matter, or will a bag-of-words analysis suffice? If the latter, I'd assume TF-IDF should work just fine: it down-weights common words and highlights the most distinctive terms, so documents get compared on those. That said, I've seen a BERT classifier give superior results, though it is not interpretable (i.e., it is good, but you won't know why it is good).
How about running TF-IDF and comparing it side by side with BERT?
I've compared 20 documents against 1,000 documents, and they do vary in size. I've also compared 5 documents against 125k documents, and those vary in size too. Bag of words suffices, so the order of the text does not matter. I must be doing something wrong or not understanding something, though... I thought I did do TF-IDF and tried cosine similarity, but I could not get the matrix to be M x M. I also tried doing a dot product of the vectors. ELI5 please lol.
Hard to say without actually seeing the code… and at that point, Stack Overflow would be more appropriate.
I found the link below! Hope it helps:
https://stackoverflow.com/questions/44862712/td-idf-find-cosine-similarity-between-new-document-and-dataset
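On the M x M confusion, here is a rough sketch of how I understand the shapes, not anyone's actual code from this thread. If you compare M query documents against N corpus documents, the similarity matrix should come out M x N (e.g. 20 x 1000), not M x M. You only get a square matrix when you compare a set against itself. The sketch uses only the standard library so the shapes are easy to see. The function names, the smoothed IDF formula, and the toy documents are all my own assumptions for illustration. The important parts are that the IDF weights are fit once on the corpus and reused for the queries, and that every vector is L2-normalised so document length stops mattering and a plain dot product equals cosine similarity.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercased whitespace split; a real pipeline would also strip punctuation.
    return text.lower().split()


def fit_idf(corpus_tokens):
    # Smoothed IDF, log((1 + N) / (1 + df)) + 1, fit on the corpus only.
    n_docs = len(corpus_tokens)
    df = Counter(term for tokens in corpus_tokens for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1
            for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse {term: weight} vector, L2-normalised.
    # Terms that were not seen when fitting the IDF are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_matrix(queries, corpus):
    # Returns a list of M rows (one per query), each with N columns (one per corpus doc).
    corpus_tokens = [tokenize(doc) for doc in corpus]
    idf = fit_idf(corpus_tokens)
    corpus_vecs = [tfidf_vector(tokens, idf) for tokens in corpus_tokens]
    query_vecs = [tfidf_vector(tokenize(q), idf) for q in queries]
    return [[cosine(q, c) for c in corpus_vecs] for q in query_vecs]


# Toy data (made up): N = 3 corpus documents, M = 2 query documents.
corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply today",
]
queries = [
    "the cat sat on the mat",
    "stock markets fell today",
]

sims = similarity_matrix(queries, corpus)
print(len(sims), "x", len(sims[0]))  # M x N
for row in sims:
    print([round(s, 3) for s in row])
```

For 125k documents you would not want pure-Python loops. As far as I know, scikit-learn's `TfidfVectorizer` (call `fit_transform` on the corpus, then `transform` on the new documents) together with `cosine_similarity` follows the same pattern with sparse matrices, and it gives the same M x N shape.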