2013년 8월 8일 목요일

intersection speed test

I want to get intersection of files of different two folders.(A∩B)
environment is as follows

$ ls fd1 |wc -l
7254

$ ls fd2 | wc -l
6797

$ uname -a

Linux cent.desktop 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo 
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping : 7
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6784.92
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual

power management:
(total 8 core)

$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ go version
go version go1.1rc3 linux/amd64

$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


$ R --version
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.


$ python --version
Python 2.6.6

$ java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

$ ruby --version

ruby 1.9.3p385 (2013-02-06) [x86_64-linux]


No file name is duplicated. So searching time should be the worst case.(There is no need to break as soon as we find the target file.)



comm command


$ time comm -12 <(ls fd1) <(ls fd2)

real 0m0.023s
user 0m0.035s
sys 0m0.003s


easiest way, and very fast.


bash

$ cat diff.sh
#!/bin/bash
FD1=`ls fd1`
FD2=`ls fd2`
for i in $FD1
    do for j in $FD2
        do if [[ $i == $j ]]
            then echo $j
        fi
    done
done


$ time ./diff.sh

real 7m16.008s
user 7m13.976s
sys 0m0.431s

tremendously slow.


go

$ cat diff.go
package main

import "io/ioutil"
import "fmt"

func main() {
    files1, _ := ioutil.ReadDir("fd1")
    files2, _ := ioutil.ReadDir("fd2")
    for _, fi1 := range files1 {
        for _, fi2 := range files2 {
            if fi1.Name() == fi2.Name() {
                fmt.Println(fi1.Name())
            }
        }
    }
}

$ go build diff.go
$ time ./diff

real 0m0.808s
user 0m0.762s

sys 0m0.043s


C++

$ cat diff.cpp
#include <iostream>
#include <boost/filesystem.hpp>

// I will use boost library
using namespace boost::filesystem;

int main() {
    path fd1("fd1");
    path fd2("fd2");
    for(directory_iterator dit1(fd1); dit1 != directory_iterator(); dit1++) {
        for(directory_iterator dit2(fd2); dit2 != directory_iterator(); dit2++) {
            if(dit1->path().filename().compare( dit2->path().filename() ) == 0)
            {
                std::cout << dit1->path().filename() << std::endl;
            }
        }
    }
    return 0;
}

$ time ./a.out 

real 1m12.773s
user 1m0.730s
sys 0m11.786s


Looks weird. 
slower then go?
maybe because of use of iterators.
(std::set_intersection uses iterators too. So I think using STL would not make big difference. Moreover, it needs sorting. Same is true for std::set_difference.)


C

$ cat diff.c
#include <dirent.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_SIZE 10000

#define MAX_CHAR 100

int main() {

    char *flist1[MAX_SIZE], *flist2[MAX_SIZE];
    int i, j;
    for(i=0; i<MAX_SIZE; i++) {
        flist1[i] = (char*)malloc(sizeof(char)*MAX_CHAR);
        flist2[i] = (char*)malloc(sizeof(char)*MAX_CHAR);
    }

    DIR *dir = opendir("fd1");

    DIR *dir2 = opendir("fd2");
    struct dirent *ent;
    int list1size = 0, list2size = 0;
    for(; (ent = readdir(dir)) != 0; list1size++) {
        strncpy(flist1[list1size], ent->d_name, MAX_CHAR);
    }
    closedir(dir);
    for(; (ent = readdir(dir2)) != 0; list2size++) {
        strncpy(flist2[list2size], ent->d_name, MAX_CHAR);
    }
    closedir(dir2);

    for(i=0; i<list1size; i++) {

        for(j=0; j<list2size; j++) {
            if(strcmp(flist1[i], flist2[j]) == 0) {
                printf("%s\n", flist1[i]);
            }
        }
    }
    return 0;
}

$ time ./diffc.out 
.
..

real 0m0.272s
user 0m0.266s
sys 0m0.004s


very long and tedious code.
but fast.


python

$ cat diff.py
import os
print set(os.listdir('fd1')) & set(os.listdir('fd2'))

$ time python diff.py
set([])

real 0m0.015s
user 0m0.006s
sys 0m0.008s


much faster then C !
(why?)


Java


$ cat Diff.java 
import java.io.File;

public class Diff {
    public static void main(String[] args) {
        String[] fns1 = new File("fd1").list();
        String[] fns2 = new File("fd2").list();
        for(String a: fns1)
            for(String b: fns2)
                if(a.equals(b))
                    System.out.println(a);
    }
}


$ time java Diff

real 0m0.357s
user 0m0.400s
sys 0m0.013s

Interestingly, user time is bigger than real time. It may means JVM optimised this code to use multicore functionality.



$ cat Diff2.java 
import java.io.File;
import java.util.Arrays;
import java.util.Set;
import java.util.HashSet;

public class Diff2 {
    public static void main(String[] args) {
        Set<String> s1 = new HashSet<String>(Arrays.asList(new File("fd1").list()));
        Set<String> s2 = new HashSet<String>(Arrays.asList(new File("fd2").list()));
        s1.retainAll(s2);
        for(String a: s1)
            System.out.println(a);
    }
}


$ time java Diff2

real 0m0.141s
user 0m0.180s
sys 0m0.020s


very fast.(considering it's Java)


R

(at R console)
> diff <- function() {
+ fd1 = list.files('fd1')
+ fd2 = list.files('fd2')
+ fd1[fd1 %in% fd2]

+ }
> a = proc.time(); diff(); proc.time()-a
character(0)
user  sys  real  

  0.050   0.002   0.053


> a = proc.time(); intersect(list.files('fd1'), list.files('fd2')); proc.time()-a
character(0)
user  sys  real  

  0.044   0.003   0.048


not bad.

ruby

$ cat diff.rb
puts Dir.entries('fd1') & Dir.entries('fd2')

$ time ruby diff.rb
.
..

real 0m0.021s
user 0m0.014s
sys 0m0.006s


I knew ruby is slow. It looks like somehow basic set operation is well written.
And the most simple code ever. (except comm command)



coarse speed test result

languagetime(sec, min(real, user))
python0.006
ruby0.014
comm(command)0.023
R0.044
Java0.141
C0.266
go0.762
C++(using boost)60.730
bash433.976


Then, how about 'difference of sets'(A-B)

Some scripts(or languages) will be a little more complicated.(and a bit slower. I guess)
I have tested it using and ruby.
(R: fd1[!(fd1 %in% fd2)]
ruby: Dir.entries('fd1') - Dir.entries('fd2') )
Results was almost same with the cases of getting intersection.



Conclusion

It seems like languages which have their own built-in function about set operations are more advantageous for this speed test. Top five ​​all had a built-in function in any way.

댓글 없음:

댓글 쓰기