Robert Pogson

One man, closing all the windows.

Turning Tesseract Loose on US DOJ v M$

  • Feb 21 / 2012
technology

There is a wealth of information in the archive of US Department of Justice v Microsoft but it is locked away in PDF images. As part of my contribution to FLOSS I have begun to run the exhibits through the Tesseract Optical Character Recognition programme. Tesseract does a fine job but it is not well documented and is a bit fussy. After lots of trial and error I worked out a reasonable script to do most of the work. I used ImageMagick to convert the PDFs to TIFFs for Tesseract. It is slow (tesseract seems to use only one core for much of the work), so Beast will take several days to complete this:
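#!/bin/bash
# OCR every page of every PDF in the current directory into tifs/<name>.txt
mkdir -p tifs
for f in *.pdf; do
  # identify prints one line per page (frame) of the PDF
  lines=$(identify "$f" | wc -l)
  echo "$f has $lines frames"
  h=${f%.pdf}
  for ((g = 0; g < lines; g++)); do
    echo "$g"
    # render page g at 900 dpi as an uncompressed 1-bit TIFF
    convert -density 900 "$f[$g]" -compress None -monochrome -depth 1 "tifs/$h-$g.tif"
    # tesseract adds .txt to the output base name
    tesseract "tifs/$h-$g.tif" "tifs/$h-$g"
    # append this page's text, then remove only this page's intermediates
    # (a blanket rm tifs/*-*.txt would also delete results for names containing "-")
    cat "tifs/$h-$g.txt" >> "tifs/$h.txt"
    rm -f "tifs/$h-$g.txt" "tifs/$h-$g.tif"
  done
done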
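
Since tesseract seems to keep only one core busy, one possible speed-up is to OCR several pages at once. The following is a minimal sketch, not a tested replacement for the script: it assumes GNU xargs (for -P) and nproc from coreutils, and the helper names list_pages and ocr_pdf_parallel are made up for illustration. Pages rendered at 900 dpi are large, so running one job per core may need a lot of RAM.

```shell
#!/bin/bash
# Sketch only: OCR the pages of one PDF in parallel, then join them in page order.
# list_pages and ocr_pdf_parallel are hypothetical helper names.

# Print the page indices 0..N-1, one per line.
list_pages() {
  local n="$1" g=0
  while [ "$g" -lt "$n" ]; do
    echo "$g"
    g=$((g + 1))
  done
}

ocr_pdf_parallel() {
  local f="$1" h="${1%.pdf}" pages g=0
  mkdir -p tifs
  # identify prints one line per page (frame) of the PDF.
  pages=$(identify "$f" | wc -l)
  # One job per page, at most one running job per CPU core.
  # Inside the child shell: $1 = PDF name, $2 = base name, $3 = page index.
  list_pages "$pages" | xargs -P "$(nproc)" -I{} sh -c '
    convert -density 900 "$1[$3]" -compress None -monochrome -depth 1 "tifs/$2-$3.tif" &&
    tesseract "tifs/$2-$3.tif" "tifs/$2-$3" &&
    rm -f "tifs/$2-$3.tif"' _ "$f" "$h" {}
  # Join in numeric page order; a glob would sort page 10 before page 2.
  while [ "$g" -lt "$pages" ]; do
    cat "tifs/$h-$g.txt" >> "tifs/$h.txt" && rm -f "tifs/$h-$g.txt"
    g=$((g + 1))
  done
}
```

Usage would be something like: for f in *.pdf; do ocr_pdf_parallel "$f"; done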

That script could use improvement. Some of the documents are rotated 90 degrees. I will have to fix those up manually but I have to do that anyway for unrecognizable texts.
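
Some of the rotated pages might be caught automatically. This is only a sketch under assumptions: it needs a Tesseract build recent enough to accept --psm 0 (orientation detection) with the osd traineddata installed, it assumes the report goes to standard output and contains a line of the form "Rotate: N", and it assumes N is the clockwise correction that ImageMagick's -rotate expects. The helper names parse_rotate and fix_rotation are made up.

```shell
#!/bin/bash
# Sketch only: detect a rotated page with Tesseract's orientation detection
# and straighten it before OCR. Helper names are hypothetical.

# Read an orientation report on stdin and print the suggested rotation
# in degrees (0 when no "Rotate:" line is present).
parse_rotate() {
  awk -F': *' '/^Rotate:/ { r = $2 } END { print r + 0 }'
}

# Rotate a TIFF in place if orientation detection asks for it.
# Assumption: the reported value is a clockwise correction.
fix_rotation() {
  local tif="$1" deg
  deg=$(tesseract "$tif" stdout --psm 0 2>/dev/null | parse_rotate)
  if [ "$deg" -ne 0 ]; then
    convert "$tif" -rotate "$deg" "$tif"
  fi
}
```

In the script, fix_rotation "tifs/$h-$g.tif" would go between the convert and tesseract steps. Pages with too little text for a confident guess would still need fixing by hand.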

The PDFs came from http://www.justice.gov/atr/cases/ms_exhibits.htm and there is a description at http://www.justice.gov/atr/cases/exhibits/mslist.pdf. I pulled everything in using wget and kept local copies.
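
For anyone wanting to repeat the download, something along these lines should work. It is a sketch, not the exact command used: it assumes the PDFs are linked directly from the index page, and the names fetch_exhibits, looks_like_pdf and the pdfs directory are made up for illustration.

```shell
#!/bin/bash
# Sketch only: mirror the PDFs linked from the exhibit index page.
# fetch_exhibits and looks_like_pdf are hypothetical helper names.

fetch_exhibits() {
  local index_url="http://www.justice.gov/atr/cases/ms_exhibits.htm"
  # -r -l 1 : follow links one level down from the index page
  # -np     : never ascend to the parent directory
  # -nd     : save everything into one directory, no host/path tree
  # -A pdf  : keep only files ending in .pdf
  # -w 1    : wait a second between requests to be polite to the server
  # -P pdfs : put the downloads in ./pdfs
  wget -r -l 1 -np -nd -A pdf -w 1 -P pdfs "$index_url"
}

# Succeed when a file starts with the PDF magic bytes, to catch downloads
# that are really HTML error pages saved under a .pdf name.
looks_like_pdf() {
  [ "$(head -c 4 "$1")" = "%PDF" ]
}
```

After fetching, a loop such as for f in pdfs/*.pdf; do looks_like_pdf "$f" || echo "bad: $f"; done would flag anything that needs downloading again.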

Once the OCRing is complete, search engines will be able to find the documents and index the contents. While the documents are old they give a chilling insight into the development and maintenance of the Wintel monopoly.

Here is a snippet from a PDF about Bill Gates wanting an opinion about ease of porting C++ code to Java…

The OCR version:
“This is all somewhat interesting because Microsoft cant ligure out howto run things like IE4, Trident and Oflice97 on platforms like Mac and Win 3.1 and yet people who do Java applications seem to make us look like fools – particularly with the upcoming Java native code compilers (which for some stupid reason is not an explicit part of our plan – we will be forced to do it).

Ironically our original application strategy was based on the portability of Pcode to many platforms – we ran Multlplan on the VAX. UNIX. Datapoint, TI 9900 and Commodore 64 among other platforms.”

The PDF looks like this on my screen:

Chuckle. While touting that other OS as the right way to do IT, M$ actually realized it was a pain for ISVs and developers. They used this pain as part of the lock-in. Now, more than a decade later, M$ is struggling against its own lock-in to produce “8”.

Bible (KJV) Revelation 13:10:
“He that leadeth into captivity shall go into captivity: he that killeth with the sword must be killed with the sword. Here is the patience and the faith of the saints.”

I wish the PDFs were all that easy. BTW, Google does have that one indexed. They can do OCR, too, so searching for site:justice.gov “ironically” “java” finds it. My copy, however, will not disappear when US DOJ v M$ falls off the radar at US DOJ… ;-)
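
The same kind of search can be run against a local copy of the OCR output. A minimal sketch, assuming the text files sit in one directory such as tifs/; the function name find_all_words is made up. OCR mistakes (like "ligure" for "figure") mean an exact-word search will miss some pages.

```shell
#!/bin/bash
# Sketch only: list the OCR text files in DIR that contain every given word,
# ignoring case. find_all_words is a hypothetical helper name.
find_all_words() {
  local dir="$1" f w ok
  shift
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue
    ok=1
    for w in "$@"; do
      # -q quiet, -i ignore case, -F treat the word as a fixed string
      if ! grep -qiF -- "$w" "$f"; then
        ok=0
        break
      fi
    done
    if [ "$ok" -eq 1 ]; then
      echo "$f"
    fi
  done
  return 0
}
```

For example, find_all_words tifs ironically java would be the local equivalent of the search above.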

15 Comments

  1. oiaohm

    “please don’t tell me Linux can run x86 code on ARM without migrating the native code to ARM”

    Yes this is true. qemu usermode translation and TransARM-IU. Mostly not required.

    http://www.youtube.com/watch?v=mMzTFMpAQVM Watch the video, Phenom. Yes, that is an x86 binary on an ARM chip in an old N900. That is 1/6 of the speed on that old ARM core, and yes, that is a single core. This is better than what Itanium achieved at its best. At its best the Itanium was 1/20 of the arch speed running 32-bit x86.

    On the more modern ARM cores that speed hit reduces.

    There is faster than QEMU: TransARM-IU. It's faster for the simple fact that it only emulates the x86 bit and will use ARM native libraries where able. Yes, using the TransARM-IU method you can call x86 code from ARM code. This starts making x86 run on ARM with about the same speed hit as .NET, since you are no longer lifting a full x86 environment. C calls from x86 code go to the ARM C library and the like. Massive speed-up, since those no longer have to be translated. Even more fun, endianness in Linux ARM and Linux x86 lines up perfectly, so memory pointers and everything like that work without translation.

    Why is Itanium so bad at running x86? Itanium is big-endian and x86 is little-endian, so every pointer has to be translated.

    ARM, the scary bugger, is bi-endian. Linux distributions build little-endian by default but you can build the Linux kernel big-endian. If you do build the Linux kernel big-endian, forget any decent performance when running emulation of x86.

    The difference between the Itanium and x86 is massive. The difference between ARM and x86 is not that much. Why? Because the ARM RISC core is very similar to the RISC core inside an x86 chip. The big thing ARM is lacking is the translator chip from x86 to RISC that the x86 chips have.

    Phenom
    “puny CPU like ARM has simply no chance to provide any bearable experience.”

    The puny ARM has a better chance of providing a bearable emulation experience than the Itanium. Just because something is bigger does not mean it's suited to do the job. Design choices in Itanium make it unsuitable for running x86 code.

    Loongson, which is MIPS64, has 200 instructions to assist with x86 emulation and does quite a decent job. If ARM ever added the equivalent, x86 emulation would be pretty much a walk in the park.

    Phenom
    http://www.indeed.com/jobtrends?q=c%23%2Cphp%2Cpython%2Cruby%2Cc%2Cc%2B%2B%2Cjava%2Cjavascript

    Yes, I was not kidding when I said mostly a failure. If you flick that across to relative you will find the thing with the highest growth is Ruby.

    Stats that are not taken broadly enough give a false idea. Yes, the idea that C is still close to the most demanded kind of coder is shocking to most people.

    Yes, it is shocking to most people that C outnumbers C# or C++. PHP and Python are also outgrowing C#.

    Yes, there are a lot of Linux desktop applications written in Python and Ruby, so they are cross-platform already. There is QML script as well. Then there are the rest of the scripting languages on Linux, and Java as well.

    So x86 requirement is not that high.

  2. Phenom

    The keyword, Pogs, is “compile”.

    Any company can bother to compile its code for ARM and be good and ready. Of course, few companies will do that, because it would be outright stupid. If you don’t optimize for the low-resource devices which ARMs all are, you are doomed. If you don’t make use of the new UI and usability paradigms on tablets, you are just annoying users.

    Therefore, companies will either stick to the desktop, or revamp their apps to make the best of tablets – the UI and multitouch experience. Or go both ways, and have two versions of their apps.

  3. Robert Pogson

    Phenom wrote, “please don’t tell me Linux can run x86 code on ARM without migrating the native code to ARM”

    Much of the Debian GNU/Linux repository runs on ARM. See http://packages.debian.org/wheezy/armel/allpackages?format=txt.gz; that’s 45000 packages and counting.

    Much of this software is written in C, a “high-level” language and porting is not particularly difficult because GNU/Linux is modular and everything is not connected to everything as in that other OS. That makes porting much easier. You compile the C-code which links to libraries. Port the libraries and you are done. Debian GNU/Linux has run on ARM for years.

  4. Phenom

    The word you have is false, Pogs. Or you twist it to your own liking, ignoring reality, which I hope you do not. For the sake of sensible discussion, let me summarize the case with 8 and software.

    1) 8 on x86/64 will run any software which 7 can run.

    2) 8 on ARM will be able to run only software based on pure .NET, or software compiled natively for ARM. (Now, please don’t tell me Linux can run x86 code on ARM without migrating the native code to ARM).

    The key here is .NET. It enables developers target both ARM and x86.

    The reason ARM will not emulate x86 code is simple – performance. A monster like Itanium is struggling to emulate x86 to compare even with PIII, and a puny CPU like ARM has simply no chance to provide any bearable experience.

    Btw, I am quite impressed that MS managed both to roll out a new version of Office with new features and to migrate it to ARM native code at the same time. For me, that is a sign of quite a pure code base, with little platform-specific or Win32 API magic. Not bad from a software engineering point of view.

  5. Robert Pogson

    Phenom, if “8” is “ready”, why does it lack many of the basic features of an OS, like the ability to run software? The word I have is that it won’t run any legacy apps except IE and M$’s office suite… Where’s the lock-in? Most PCs on the Earth do not run those apps today. Are you sure it has “copy-and-paste”? Phoney “7” did not have that when issued.

    It’s not yet released for beta-testing, so it must still be alpha.

  6. Phenom

    “8” is basically ready, Mr Pogson. It is currently under heavy testing, and being polished and optimized.

  7. Robert Pogson

    oldman wrote, ““The World is moving on without Microsoft.”

    Are you sure Mr. K?”

    I am sure, oldman. While M$’s revenue overall is looking healthy, the client division is definitely losing traction with more non-Wintel personal computers produced last year than Wintel and “8” still vapourware. How could one not hold that opinion? M$ does seem to have a lock on business desktops and client management and some servers but the whole rest of IT is wide open for FLOSS. I just posted today that US VA and NASA are both welcoming FLOSS. Some parts of the world are moving on without M$.

    Change is happening in IT faster than M$ can manage. They are a bottleneck we don’t need.

  8. Clarence Moon

    Ah, Mr. Oiaohm, still fixated on the power drain! Seen any Marines hanging around your submarine lately? You had better watch out!

  9. Kozmcrae

    Clarence and his Ego said:

    “Tiger Woods is no longer the leader of the pack in his milieu, either, but it does not diminish his prior achievements.”

    You missed the analogy entirely. Tiger Woods is still playing great golf but the world has moved on to soccer. Microsoft is still at the top of their game, but fewer people are playing their game. The World is moving on without Microsoft.

  10. oiaohm

    Clarence Moon, really, .NET has mostly been a failure.

    I really would not call Mono that successful, mostly due to it losing every head-to-head test. Tomboy vs Gnote: Gnote, the C++ clone, is faster. Vala clones of Mono applications are also faster. Vala has a syntax related to .NET's. ASP.NET vs HipHop also sees .NET have its ass kicked.

    You go make applications running MonoTouch. Your users notice a higher battery consumption rate than with one built for Dalvik on Android. So in time your competition beats the crap out of you as you get a bad reputation for producing shoddy product.

    You also run into the same problem of too high power usage if you put Java on Dalvik and just build. Dalvik does have a few things you need to take into consideration, like telling Dalvik that losing data from these objects is not a problem since regeneration is not that costly.

    Just because some company lets you do something does not mean it's going to be good for your company image.

    Xamarin is one of those things.

  11. Clarence Moon

    Well, I am aware that a lot of phones and tablets were sold last year, Mr. Pogson. As were a huge number of bananas grown and harvested. Microsoft is not an important factor in either business. I don’t despair for Microsoft in either instance myself.

    I am equally aware that Java is the lingua franca for Android apps, but I am reasonably adept at it (as a language) as I am with C# also. I have found a product, Mono (and MonoTouch), from a company called Xamarin that lets you create Android or iOS apps using C# and Visual Studio. I may prefer that for my own work. Else it is off to burn the midnight oil with Eclipse.

    As an aside, if Microsoft is now locked in a private cell, as you suggest, it is an opulently padded cell and is still an important business. I don’t see any disgrace in having time pass you by. Bill Gates is no longer setting the pace in personal computing devices, I agree. Neither is the late Steve Jobs. But both are going to be a notable part of history. Tiger Woods is no longer the leader of the pack in his milieu, either, but it does not diminish his prior achievements.

    In terms of changing society, Gates is among the most influential of all time. You can argue that he could have done better, but the facts are that no one else did do any better.

  12. Robert Pogson

    Clarence Moon wrote, “that shows an ultimate triumph for Gates, but perhaps you are satisfied with chuckling over the irony anyway.”

    Chuckle. I am amused that you don’t comprehend that more smart thingies were produced last year than PCs running that other OS and a lot of smart thingies run apps originally written in Java so Java has indeed had the last laugh. That PDF with contents so old is still relevant today. You may choose to ignore history but the world repeats it. M$ has locked the world into Wintel and the world has moved on, leaving M$ in its private cell.

  13. Clarence Moon

    Every boy needs a hobby, Mr. Pogson, and it seems like you have found something suited to your interests!

    As my own retirement looms large for this year, I am looking myself. I plan to revisit my engineering days and go into Android app creation for my Kindle Fire and perhaps I will get a compatible phone and work with it as well. I have a few programs that I have created in the past that work for me and my interests on Windows PC and I think that I can adapt them to mobile devices.

    Regarding your poke at Bill Gates and company’s wrestling with future directions of personal computing, I can only say that this cabbage has already been chewed so many times by you anti-MS fellows that further discussion seems meaningless. I know that Microsoft’s early struggles with Java resulted in the creation of .NET and C# and F#, which have become major elements of the modern Windows software scene, and Microsoft’s profits have risen immensely since those days.

    To me, that shows an ultimate triumph for Gates, but perhaps you are satisfied with chuckling over the irony anyway.
