Programmatically compare two Word, Excel, or PowerPoint documents using PHP or JavaScript

Question:

The following are some requirements for my new project.

An admin will upload a file in MS Word 2007, MS Excel 2007, or MS PowerPoint 2007 format.

Let's say the admin has uploaded a file named demo1.docx.

demo1.docx is now the master file.

Other users will then upload their own files, such as demo2.docx, demo3.docx, etc.

I want to compare demo2.docx and demo3.docx with the master file demo1.docx.

Files uploaded by other users must be copies of the master file. I mean that the number of characters, the text, and the formatting have to be the same as in the master file.

If it is an Excel file, then the number of sheets and the number of filled cells have to be the same, and the same applies to PowerPoint files.

I want to do this using PHP or JavaScript.

So can you please tell me whether this is possible? If it is, please suggest some ways to accomplish this task.

Thanks in advance.

To match them byte for byte, the most efficient way is to compare file hashes:

if (hash_file('sha1', $pathToFile1) === hash_file('sha1', $pathToFile2)) { /* the files are identical */ }

If that's too exact, you could strip whitespace first. This works for text files, not for binary files like docx or xlsx.

if (hash('sha1', str_replace(' ', '', file_get_contents($pathToFile1))) === hash('sha1', str_replace(' ', '', file_get_contents($pathToFile2)))) { /* same text, ignoring spaces */ }

Or something like that to normalize the text. For binary file types you will have to use a library for that file type to convert them to text first.

In other words, you will have to come up with some way to normalize the text contents of the file, such as upper-casing everything and removing spaces or other acceptable differences.

Normalizing is a fancy way of saying "removing the differences". A simple example is this:

Some text

Now, is that the same as "Some text."? Or "Some Text", or "some Text"? That depends. But normalizing them might turn all of these into "sometext", with no punctuation, spaces, or casing. It's up to you to decide how you normalize them.
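As a minimal sketch of that idea, the snippet below lowercases the text and drops everything except ASCII letters and digits. The function name normalizeText and the exact rules are assumptions for illustration; which differences count as acceptable is your call.

```php
<?php
// Hypothetical helper: lowercase the text and keep only ASCII letters and digits.
// These rules are one possible choice of normalization, not the only valid one.
function normalizeText(string $text): string
{
    return preg_replace('/[^a-z0-9]+/', '', strtolower($text));
}

// All the variants from the example above collapse to the same string.
var_dump(normalizeText('Some text.') === normalizeText('some Text')); // bool(true)
```

Note that this version discards non-ASCII characters entirely, so it would need different rules for text in other alphabets.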

Because you mention binary formats, I can't help you much there: you will need to find a way to open them in PHP, which will require some third-party libraries.
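For the plain text of a .docx file specifically, one possible starting point is PHP's zip extension, since a .docx file is a ZIP archive whose main body is stored in word/document.xml. The sketch below is an assumption-laden illustration, not a full solution: extractDocxText is a hypothetical name, it requires the zip extension to be enabled, and it ignores formatting, headers, footnotes, and the xlsx and pptx formats entirely.

```php
<?php
// Hypothetical helper: returns the visible body text of a .docx file, or null on failure.
// Assumes the zip extension (ZipArchive) is available.
function extractDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // not a readable ZIP archive
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // archive has no Word document body
    }
    // Add a newline after each paragraph so paragraphs do not run together.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    // Drop the XML tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    return trim($text);
}

// Usage (file names taken from the question):
// $same = extractDocxText('demo1.docx') === extractDocxText('demo2.docx');
```

The extracted text can then be normalized and hashed like any other string. Comparing formatting would mean inspecting the XML itself rather than stripping it.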

Your question is very broad, so I can only give you a broad overview of how to do it.

Hashing is nice because it takes a file of any size and reduces it to a fixed-length string (40 hexadecimal characters in the case of sha1), which is a lot easier to store in a DB or to look at. I mention the DB because you can roughly halve the work by normalizing and hashing the known file (the master file) ahead of time. This reduces the overall cost of comparing them.
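A rough sketch of that pre-hashing idea follows. The function names normalizedHash and matchesMaster are hypothetical, the normalization used here (lowercasing and removing whitespace) is only an example, and storing the hash in a database is left out.

```php
<?php
// Hypothetical helper: normalize the text (lowercase, no whitespace), then hash it.
function normalizedHash(string $text): string
{
    $normalized = preg_replace('/\s+/', '', strtolower($text));
    return hash('sha1', $normalized);
}

// Compare an upload against the hash that was stored for the master file.
function matchesMaster(string $masterHash, string $uploadedText): bool
{
    return hash_equals($masterHash, normalizedHash($uploadedText));
}

// Computed once when the admin uploads the master; 40 hex characters to store.
$masterHash = normalizedHash('The same text');

var_dump(matchesMaster($masterHash, 'the  same TEXT'));  // bool(true)
var_dump(matchesMaster($masterHash, 'the same texts'));  // bool(false)
```

Each later upload then costs one normalization and one hash instead of two.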

UPDATE

Here is an example:

echo hash('sha1', 'The same text') === hash('sha1', 'the same text') ? 'true' : 'false';

The output will be false. However, if you do this:

echo hash('sha1', strtolower('The same text')) === hash('sha1', strtolower('the same text')) ? 'true' : 'false';

The output will be true.

A small amount of text is no different than a large amount. The difference between the two pieces of code above is that I normalized the text in one and not in the other.

UPDATE1

"OK. Do you know software like Typing Tutor, which gives typing tests? There is one fixed paragraph, and the user types that paragraph into a text box with the same formatting."

For that, you can match the new text against the words of the original:

$old = 'The same text';
$arr_old = explode(' ', $old);
$new = 'the same text';

$pattern = '/\b('.implode(')\b|\b(', array_map('preg_quote', $arr_old)).')\b/';

preg_match_all($pattern, $new, $matches );

print_r($matches);

Output

  Array
(
    [0] => Array
        (
            [0] => same
            [1] => text
        )

    [1] => Array
        (
            [0] => 
            [1] => 
        )

    [2] => Array
        (
            [0] => same
            [1] => 
        )

    [3] => Array
        (
            [0] => 
            [1] => text
        )

) 

It's important to mention that the index of each capture group, minus 1, corresponds to the index of the word. For example, in the output above $matches[1] contains no match. It corresponds to The, which is the first item in $arr_old = explode(' ', $old); that is, [0=>'The', 1=>'same', 2=>'text']. Because the capture groups are 1-based and the array is 0-based, you have to subtract 1.

P.S. To check these, I would do something like:

$len = count($matches);
for ($i = 1; $i < $len; $i++) {
    // A group with at least one non-empty entry means that word was found.
    if (!empty(array_filter($matches[$i]))) echo "match ".$arr_old[$i-1]."\n";
}

Output:

match same
match text
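The regular expression above finds the original words anywhere in the new text. For a typing test it may be more useful to compare word by word in order, so here is a small sketch of that variant. It is a different technique from the one above, and diffWords is a hypothetical name.

```php
<?php
// Hypothetical helper: compare two texts word by word, by position.
// Returns the positions that differ, with the expected and the typed word.
function diffWords(string $expected, string $typed): array
{
    $expectedWords = preg_split('/\s+/', trim($expected), -1, PREG_SPLIT_NO_EMPTY);
    $typedWords    = preg_split('/\s+/', trim($typed), -1, PREG_SPLIT_NO_EMPTY);

    $mismatches = [];
    $count = max(count($expectedWords), count($typedWords));
    for ($i = 0; $i < $count; $i++) {
        $want = $expectedWords[$i] ?? null; // null when the typed text is longer
        $got  = $typedWords[$i] ?? null;    // null when the typed text is shorter
        if ($want !== $got) {
            $mismatches[$i] = ['expected' => $want, 'typed' => $got];
        }
    }
    return $mismatches;
}

// Only position 0 is reported: 'The' was expected but 'the' was typed.
print_r(diffWords('The same text', 'the same text'));
```

An empty result means every word matched exactly, including case. Lowercasing both inputs first would make the comparison case-insensitive.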

I hope that helps.