java - Plagiarism Analyzer (compared against Web Content)

Question

Hi everyone all over the world,

Background

I am a final year student of Computer Science. I've proposed my Final Double Module Project which is a Plagiarism Analyzer, using Java and MySQL.

The Plagiarism Analyzer will:

Scan every paragraph of an uploaded document and report what percentage of each paragraph was copied, and from which website.
Highlight only the words in each paragraph that were copied verbatim, along with the website they came from.

My main objective is to develop something like Turnitin, improved if possible.

I have less than 6 months to develop the program. I have scoped the following:

Web Crawler Implementation. I will probably either use the Lucene API or develop my own crawler (which one is better in terms of development time and usability?).
Hashing and Indexing. To speed up the searching and analysis.

Questions

Here are my questions:

Can MySQL store that much information?
Did I miss any important topics?
What are your opinions concerning this project?
Any suggestions or techniques for performing the similarity analysis?
Can a paragraph be hashed, as well as words?

Thanks in advance for any help and advice. ^^

score 4 · Accepted Answer

Have you considered another project that isn't doomed to failure on account of lack of resources available to you?

If you really want to go the "Hey, let's crawl the whole web!" route, you're going to need to break out things like HBase and Hadoop and lots of machines. MySQL will be grossly insufficient. TurnItIn claims to have crawled and indexed 12 billion pages. Google's index is more like [redacted]. MySQL, or for that matter, any RDBMS, cannot scale to that level.

The only realistic way you're going to pull this off is to do something astonishingly clever: figure out how to construct queries to Google that will reveal plagiarism of documents already present in Google's index. I'd recommend using a message queue and accessing the search API synchronously. The message queue will also let you throttle your queries down to a reasonable rate.

Avoid stop words, but remember you're still looking for near-exact matches, so queries should look like: "* quick brown fox jumped over * lazy dog". Don't bother running queries that end up like: "* * went * * *". And ignore results that come back with 94,000,000 hits. Those won't be plagiarism; they'll be famous quotes or overly general queries. You're looking for either under 10 hits, or a few thousand hits that all have an exact match on your original sentence, or some similar metric. Even then, this should just be a heuristic: don't flag a document unless there are lots of red flags. Conversely, if everything comes back as zero hits, the author is being unusually original. Book search typically needs more precise queries.

Sufficiently suspicious stuff should trigger HTTP requests for the original pages, and final decisions should always be the purview of a human being. If a document cites its sources, that's not plagiarism, and you'll want to detect that. False positives are inevitable, and will likely be common, if not constant.
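To make the query shape above concrete, here is a minimal Java sketch that turns a sentence into a quoted wildcard query and skips sentences that are mostly stop words. The class name `PhraseQueryBuilder`, the tiny stop-word list, and the `MIN_CONTENT_WORDS` threshold are all assumptions made for illustration; it only builds the query string and does not call any search API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: builds quoted phrase queries with stop words replaced by "*".
final class PhraseQueryBuilder {
    // Tiny illustrative stop-word list (assumption); a real list would be much longer.
    private static final Set<String> STOP_WORDS = Set.of(
            "a", "an", "the", "of", "to", "in", "and", "or",
            "he", "she", "it", "was", "is", "then", "there");

    // Assumed threshold: a query needs at least this many non-stop words to be worth sending.
    private static final int MIN_CONTENT_WORDS = 4;

    /** Returns a quoted wildcard query, or empty if the sentence is too generic to search for. */
    static Optional<String> buildQuery(String sentence) {
        List<String> terms = new ArrayList<>();
        int contentWords = 0;
        // Split on anything that is not a letter, a digit, or an apostrophe.
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            if (STOP_WORDS.contains(token)) {
                terms.add("*"); // keep the position, drop the word
            } else {
                terms.add(token);
                contentWords++;
            }
        }
        if (contentWords < MIN_CONTENT_WORDS) {
            return Optional.empty(); // e.g. "* * went * * *" is not worth a request
        }
        return Optional.of('"' + String.join(" ", terms) + '"');
    }

    public static void main(String[] args) {
        // prints "* quick brown fox jumped over * lazy dog" (including the quotes)
        buildQuery("The quick brown fox jumped over the lazy dog.").ifPresent(System.out::println);
        // prints false
        System.out.println(buildQuery("And then he went to it.").isPresent());
    }
}
```

Queries that survive this filter would then go onto the message queue, so a single consumer can send them at whatever rate the search API tolerates.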

Be aware that the TOS prohibit permanently storing any portion of the Google index.

Regardless, you have chosen to do something exceedingly hard, no matter how you build it, and likely very expensive and time-consuming unless you involve Google.

score 1 · Accepted Answer

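1) Are you going to write your own web crawler? That task alone could easily eat up all the time you have. Try to use a standard solution for it; it is not the heart of your program.

You will still have the chance to write your own, or to try a different one, later (if you have time!). Your program should work only on local files, so that it is not tied to a specific crawler/API.

You may even need to use different crawlers for different sites.

2) Hashing a whole paragraph is possible. You can hash any string. But of course that means you can only check for whole paragraphs that were copied exactly. A sentence would probably be a better unit to test. You should also "normalize" (transform) the sentences/paragraphs before hashing, to smooth over minor differences such as upper/lower case.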
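As a rough illustration of point 2, the sketch below normalizes a sentence and then hashes it. The class name `SentenceHasher`, the particular normalization (lowercase, punctuation and whitespace runs collapsed to one space), and the choice of SHA-256 are assumptions for the example, not something this answer prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper: normalize a sentence, then hash the normalized form.
final class SentenceHasher {

    /** Lowercases and collapses every run of non-letter, non-digit characters to one space. */
    static String normalize(String sentence) {
        return sentence.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
    }

    /** Returns the SHA-256 digest of the normalized sentence as a 64-character hex string. */
    static String hash(String sentence) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalize(sentence).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both variants normalize to "the quick brown fox", so this prints true.
        System.out.println(hash("The  Quick, brown FOX!").equals(hash("the quick brown fox")));
    }
}
```

Equal hashes then mean "same sentence up to case and punctuation"; anything looser, such as reordered or substituted words, needs a different technique than exact hashing.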

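3) MySQL can store a lot of data.

The usual advice is to stick to standard SQL. If you find you have far too much data, you will still have the option of switching to another SQL implementation.

But of course, if you have too much data, start by looking at ways to reduce it, or at least to reduce what goes into MySQL. For example, you could store the hashes in MySQL but keep the original pages (if you need them) in plain files.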

score 0 · Accepted Answer

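Online code is usually distributed under an open-source license, and most of it is just tutorials. By your logic, copying anything from any website is plagiarism, which means you could not accept and use any answer you get here. If you really want to finish your project, just write a system that compares the code of students in the same class and in previous classes. That is far more efficient. An example of such a system is MOSS (there is also a paper describing how it works). It is very effective even without a web crawler.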
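For a feel of how a document-versus-document comparison can work without any crawler, here is a much simplified Java sketch of winnowing-style fingerprinting, the idea described in the published paper behind MOSS. This is not MOSS's actual implementation: the class name `Winnowing`, the use of `String.hashCode()` as the k-gram hash, the whitespace/case normalization, and the parameters `k` and `w` are all assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified winnowing-style fingerprinting (illustrative sketch only).
final class Winnowing {

    /** Hashes every k-gram, then keeps the minimum hash of each window of w consecutive k-grams. */
    static Set<Integer> fingerprints(String text, int k, int w) {
        // Assumed normalization: ignore case and all whitespace.
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
        Set<Integer> selected = new HashSet<>();
        int n = s.length() - k + 1; // number of k-grams
        if (k <= 0 || w <= 0 || n <= 0) {
            return selected; // text too short to fingerprint
        }
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = s.substring(i, i + k).hashCode();
        }
        int window = Math.min(w, n); // short texts get a single, smaller window
        for (int start = 0; start + window <= n; start++) {
            int min = hashes[start];
            for (int j = start + 1; j < start + window; j++) {
                min = Math.min(min, hashes[j]);
            }
            selected.add(min);
        }
        return selected;
    }

    /** Jaccard similarity of two fingerprint sets: shared fingerprints over all fingerprints. */
    static double similarity(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // nothing to compare
        }
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return (double) common.size() / union;
    }

    public static void main(String[] args) {
        Set<Integer> original = fingerprints("int total = a + b; return total;", 5, 4);
        Set<Integer> reformatted = fingerprints("INT total=a+b;  return   total;", 5, 4);
        // Only case and whitespace differ, so the fingerprints are identical: prints 1.0
        System.out.println(similarity(original, reformatted));
    }
}
```

Comparing every pair of submissions in a class this way is cheap, and any pair with a high score can be handed to a human for review.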
