environment is as follows
$ ls fd1 |wc -l
7254
$ ls fd2 | wc -l
6797
$ uname -a
Linux cent.desktop 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 42
model name : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
stepping : 7
cpu MHz : 1600.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips : 6784.92
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
(total 8 core)
$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ go version
go version go1.1rc3 linux/amd64
$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ R --version
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
Copyright (C) 2012 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
$ python --version
Python 2.6.6
$ java -version
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
$ ruby --version
ruby 1.9.3p385 (2013-02-06) [x86_64-linux]
No file name is duplicated. So searching time should be the worst case.(There is no need to break as soon as we find the target file.)
comm command
$ time comm -12 <(ls fd1) <(ls fd2)
real 0m0.023s
user 0m0.035s
sys 0m0.003s
easiest way, and very fast.
bash
$ cat diff.sh
#!/bin/bash
FD1=`ls fd1`
FD2=`ls fd2`
for i in $FD1
do for j in $FD2
do if [[ $i == $j ]]
then echo $j
fi
done
done
$ time ./diff.sh
real 7m16.008s
user 7m13.976s
sys 0m0.431s
FD1=`ls fd1`
FD2=`ls fd2`
for i in $FD1
do for j in $FD2
do if [[ $i == $j ]]
then echo $j
fi
done
done
$ time ./diff.sh
real 7m16.008s
user 7m13.976s
sys 0m0.431s
tremendously slow.
go
$ cat diff.go
package main
import "io/ioutil"
import "fmt"
func main() {
files1, _ := ioutil.ReadDir("fd1")
files2, _ := ioutil.ReadDir("fd2")
for _, fi1 := range files1 {
for _, fi2 := range files2 {
if fi1.Name() == fi2.Name() {
fmt.Println(fi1.Name())
}
}
}
}
$ go build diff.go
$ time ./diff
real 0m0.808s
user 0m0.762s
sys 0m0.043s
$ go build diff.go
$ time ./diff
real 0m0.808s
user 0m0.762s
sys 0m0.043s
C++
$ cat diff.cpp
#include <iostream>
#include <boost/filesystem.hpp>
// I will use boost library
using namespace boost::filesystem;
int main() {
path fd1("fd1");
path fd2("fd2");
for(directory_iterator dit1(fd1); dit1 != directory_iterator(); dit1++) {
for(directory_iterator dit2(fd2); dit2 != directory_iterator(); dit2++) {
if(dit1->path().filename().compare( dit2->path().filename() ) == 0)
{
std::cout << dit1->path().filename() << std::endl;
}
}
}
return 0;
}
$ time ./a.out
real 1m12.773s
user 1m0.730s
sys 0m11.786s
user 1m0.730s
sys 0m11.786s
Looks weird.
slower then go?
maybe because of use of iterators.
maybe because of use of iterators.
(std::set_intersection uses iterators too. So I think using STL would not make big difference. Moreover, it needs sorting. Same is true for std::set_difference.)
C
$ cat diff.c#include <dirent.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#define MAX_SIZE 10000
#define MAX_CHAR 100
int main() {
char *flist1[MAX_SIZE], *flist2[MAX_SIZE];
int i, j;
for(i=0; i<MAX_SIZE; i++) {
flist1[i] = (char*)malloc(sizeof(char)*MAX_CHAR);
flist2[i] = (char*)malloc(sizeof(char)*MAX_CHAR);
}
DIR *dir = opendir("fd1");
DIR *dir2 = opendir("fd2");
struct dirent *ent;
int list1size = 0, list2size = 0;
for(; (ent = readdir(dir)) != 0; list1size++) {
strncpy(flist1[list1size], ent->d_name, MAX_CHAR);
}
closedir(dir);
for(; (ent = readdir(dir2)) != 0; list2size++) {
strncpy(flist2[list2size], ent->d_name, MAX_CHAR);
}
closedir(dir2);
for(i=0; i<list1size; i++) {
for(j=0; j<list2size; j++) {
if(strcmp(flist1[i], flist2[j]) == 0) {
printf("%s\n", flist1[i]);
}
}
}
return 0;
}
$ time ./diffc.out
.
..
real 0m0.272s
user 0m0.266s
sys 0m0.004s
very long and tedious code.
but fast.
python
$ cat diff.pyimport os
print set(os.listdir('fd1')) & set(os.listdir('fd2'))
$ time python diff.py
set([])
real 0m0.015s
user 0m0.006s
sys 0m0.008s
much faster then C !
(why?)
Java
$ cat Diff.java
import java.io.File;
public class Diff {
public static void main(String[] args) {
String[] fns1 = new File("fd1").list();
String[] fns2 = new File("fd2").list();
for(String a: fns1)
for(String b: fns2)
if(a.equals(b))
System.out.println(a);
}
}
$ time java Diff
real 0m0.357s
user 0m0.400s
sys 0m0.013s
Interestingly, user time is bigger than real time. It may means JVM optimised this code to use multicore functionality.
$ cat Diff2.java
import java.io.File;
import java.util.Arrays;
import java.util.Set;
import java.util.HashSet;
public class Diff2 {
public static void main(String[] args) {
Set<String> s1 = new HashSet<String>(Arrays.asList(new File("fd1").list()));
Set<String> s2 = new HashSet<String>(Arrays.asList(new File("fd2").list()));
s1.retainAll(s2);
for(String a: s1)
System.out.println(a);
}
}
$ time java Diff2
real 0m0.141s
user 0m0.180s
sys 0m0.020s
very fast.(considering it's Java)
R
(at R console)> diff <- function() {
+ fd1 = list.files('fd1')
+ fd2 = list.files('fd2')
+ fd1[fd1 %in% fd2]
+ }
> a = proc.time(); diff(); proc.time()-a
character(0)
user sys real
0.050 0.002 0.053
> a = proc.time(); intersect(list.files('fd1'), list.files('fd2')); proc.time()-a
character(0)
user sys real
0.044 0.003 0.048
not bad.
ruby
$ cat diff.rbputs Dir.entries('fd1') & Dir.entries('fd2')
$ time ruby diff.rb
.
..
real 0m0.021s
user 0m0.014s
sys 0m0.006s
I knew ruby is slow. It looks like somehow basic set operation is well written.
And the most simple code ever. (except comm command)
coarse speed test result
language | time(sec, min(real, user)) |
python | 0.006 |
ruby | 0.014 |
comm(command) | 0.023 |
R | 0.044 |
Java | 0.141 |
C | 0.266 |
go | 0.762 |
C++(using boost) | 60.730 |
bash | 433.976 |
Then, how about 'difference of sets'(A-B)
Some scripts(or languages) will be a little more complicated.(and a bit slower. I guess)I have tested it using R and ruby.
(R: fd1[!(fd1 %in% fd2)]
ruby: Dir.entries('fd1') - Dir.entries('fd2') )
Results was almost same with the cases of getting intersection.
댓글 없음:
댓글 쓰기