There is a wealth of information in the archive of US Department of Justice v Microsoft, but it is locked away in PDF images. As part of my contribution to FLOSS, I have begun to run the exhibits through the Tesseract Optical Character Recognition programme. Tesseract does a fine job, but it is not well documented and is a bit fussy. By lots of trial and error I worked out a reasonable script to do most of the work. I used ImageMagick to convert the PDFs to TIFFs for Tesseract. It is slow (tesseract seems to use only one core for much of the work); Beast will take several days to complete this:
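#!/bin/bash
# OCR every page of every PDF in the current directory into tifs/NAME.txt
mkdir -p tifs
for f in *.pdf; do
  # identify prints one line per page (frame) of the PDF
  lines=$(identify "$f" | wc -l)
  echo "$f has $lines frames"
  h=${f%.pdf}
  for ((g = 0; g < lines; g++)); do
    echo "$g"
    # render page g as an uncompressed 1-bit TIFF at 900 dpi
    convert -density 900 "$f[$g]" -compress None -monochrome -depth 1 "tifs/$h-$g.tif"
    # tesseract adds the .txt extension to the output name itself
    tesseract "tifs/$h-$g.tif" "tifs/$h-$g"
    # append this page to the per-document text file, then clean up
    cat "tifs/$h-$g.txt" >> "tifs/$h.txt"
    rm "tifs/$h-$g.txt" "tifs/$h-$g.tif"
  done
done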
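Since tesseract seems to stick to one core, one possible speed-up is to run several pages at once. What follows is only a sketch, not something I have run over the whole archive: the helper names ocr_page, page_numbers and ocr_pdf_parallel are made up for illustration, the default of 4 jobs is arbitrary, and it assumes an xargs that supports -P (GNU and BSD xargs do). Several 900 dpi conversions running together may need a lot of RAM.

```shell
#!/bin/bash
# Sketch only: OCR the pages of one PDF with several tesseract processes at once.
# ocr_page, page_numbers and ocr_pdf_parallel are hypothetical helper names.

# Convert page $2 (zero-based) of PDF $1 to a TIFF and OCR it into tifs/.
ocr_page() {
  local f=$1 g=$2 h=${1%.pdf}
  convert -density 900 "$f[$g]" -compress None -monochrome -depth 1 "tifs/$h-$g.tif" &&
    tesseract "tifs/$h-$g.tif" "tifs/$h-$g" &&
    rm "tifs/$h-$g.tif"
}

# Print 0 .. N-1, one number per line.
page_numbers() {
  local n=$1 g
  for ((g = 0; g < n; g++)); do echo "$g"; done
}

# OCR every page of PDF $1 using up to $2 parallel jobs (default 4, an arbitrary choice).
ocr_pdf_parallel() {
  local f=$1 jobs=${2:-4} h=${1%.pdf} lines g
  mkdir -p tifs
  lines=$(identify "$f" | wc -l)
  export -f ocr_page   # make the function visible to the child shells
  page_numbers "$lines" |
    xargs -P "$jobs" -I{} bash -c 'ocr_page "$1" "$2"' _ "$f" {}
  # Pages finish in any order, so join the text files in page order afterwards.
  for ((g = 0; g < lines; g++)); do
    cat "tifs/$h-$g.txt" && rm "tifs/$h-$g.txt"
  done > "tifs/$h.txt"
}
```

Usage would be something like ocr_pdf_parallel somefile.pdf 8 inside the for f in *.pdf loop, with the job count set to the number of cores.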
That script could use improvement. Some of the documents are rotated 90 degrees. I will have to fix those up manually, but I have to do that anyway for unrecognizable text.
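The rotated pages might be fixable automatically. The sketch below rests on assumptions I have not checked against this archive: a Tesseract recent enough to accept --psm 0 (orientation and script detection) with the osd traineddata installed, and that it prints a line of the form "Rotate: 90". The helper names rotation_from_osd and fix_rotation are made up. Try it on one sample page first to confirm the rotation goes the right way.

```shell
#!/bin/bash
# Sketch only: detect a page's rotation with Tesseract's OSD mode and undo it.
# rotation_from_osd and fix_rotation are hypothetical helper names.

# Read OSD text on stdin and print the value of its "Rotate:" line (0 if absent).
rotation_from_osd() {
  awk -F': *' '/^Rotate:/ { print $2; found = 1; exit }
               END { if (!found) print 0 }'
}

# Straighten the TIFF $1 in place if OSD reports a non-zero rotation.
# Assumption: "Rotate: N" means N degrees clockwise, which is also what
# ImageMagick's -rotate does for positive N.
fix_rotation() {
  local tif=$1 deg
  deg=$(tesseract "$tif" stdout --psm 0 2>/dev/null | rotation_from_osd)
  if [ "$deg" != 0 ]; then
    convert "$tif" -rotate "$deg" "$tif"
  fi
}
```

In the main script, fix_rotation "tifs/$h-$g.tif" would go between the convert and tesseract lines.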
The PDFs came from http://www.justice.gov/atr/cases/ms_exhibits.htm and there is a description at http://www.justice.gov/atr/cases/exhibits/mslist.pdf. I pulled everything in using wget and kept local copies.
Once the OCRing is complete, search engines will be able to find the documents and index the contents. While the documents are old they give a chilling insight into the development and maintenance of the Wintel monopoly.
Here is a snippet from a PDF about Bill Gates wanting an opinion about ease of porting C++ code to Java…
The OCR version:
“This is all somewhat interesting because Microsoft cant ligure out howto run things like IE4, Trident and Oflice97 on platforms like Mac and Win 3.1 and yet people who do Java applications seem to make us look like fools – particularly with the upcoming Java native code compilers (which for some stupid reason is not an explicit part of our plan – we will be forced to do it).
Ironically our original application strategy was based on the portability of Pcode to many platforms – we ran Multlplan on the VAX. UNIX. Datapoint, TI 9900 and Commodore 64 among other platforms.”
The PDF looks like this on my screen:

Chuckle. While touting that other OS as the right way to do IT, M$ actually realized it was a pain for ISVs and developers. They used this pain as part of the lock-in. Now, more than a decade later, M$ is struggling against its own lock-in to produce “8”.
Bible (KJV) Revelation 13:10:
“He that leadeth into captivity shall go into captivity: he that killeth with the sword must be killed with the sword. Here is the patience and the faith of the saints.”
I wish the PDFs were all that easy. BTW, Google does have that one indexed. They can do OCR, too, so searching for site:justice.gov “ironically” “java” finds it. My copy, however, will not disappear when US DOJ v M$ falls off the radar at US DOJ…
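The same kind of search can be run against the local text files once the OCR is done. A minimal sketch, assuming the per-document .txt files sit in one directory such as tifs/; the function name find_all_words is made up for illustration:

```shell
#!/bin/bash
# Sketch only: list the .txt files in a directory that contain every given word,
# ignoring case. find_all_words is a hypothetical helper name.
find_all_words() {
  local dir=$1 f w ok
  shift
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue          # skip the unexpanded glob if no files match
    ok=1
    for w in "$@"; do
      # -q quiet, -i ignore case, -F treat the word as a fixed string
      if ! grep -qiF -- "$w" "$f"; then
        ok=0
        break
      fi
    done
    if [ "$ok" = 1 ]; then
      echo "$f"
    fi
  done
  return 0
}
```

For example, find_all_words tifs ironically java lists every document whose OCR text contains both words.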



