Sony Reader - Library Link

Today Sony released details regarding their upcoming digital reader products.

One of the big hardware features is a wireless 3G connection, much like Amazon's Kindle. On the software side, Sony announced that the reader can be synchronized with a user's local library (provided the library offers such a service). The combination of these two features has me pretty excited.

From Ars Technica:
According to Sony's Haber, the new version of its online book store will allow users to enter their ZIP code, and determine whether the local library offers electronic versions of its books. These books can be downloaded, at which point they'll have a 21-day expiration date—no late fees, as Haber was happy to point out. The New York Public Library's representative announced that his organization would be taking part in the service. That's a rather significant announcement, given that he said that the NYPL's website was the second-most visited online library, behind only the Library of Congress.

I think this might have a big impact on students and anyone else who wants or needs access to many books. Imagine going to college and pairing your reader with your school's library. Suddenly, buying textbooks becomes a thing of the past. Need to look something up, but don't have the book? No problem: search the library's catalog and pull in the book over the 3G connection.

This scenario might be wishful thinking for a little longer, but I think Sony's announcement today goes a long way toward making it reality.

Wolfram|Alpha

I've spent a fair amount of time using Wolfram|Alpha recently. Here are my impressions.

The weeks leading up to the launch of Wolfram|Alpha generated an incredible amount of hype surrounding the project. Its creator, Stephen Wolfram, has done little to lessen it. Even the project's description sets a rather lofty goal:

Wolfram|Alpha's long-term goal is to make all systematic knowledge immediately computable and accessible to everyone.


The comparisons to Google were immediate and plentiful. After all, Google describes its goal as making all the world's information searchable and catalogued. However, Alpha is not Google. Here, the term computable roughly means that the data is formatted in such a way that a computer can read, evaluate, and manipulate it. A great example is producing a graph of a data set over time (such as population). Google, on the other hand, is best used as a catalogue that ultimately points you to data. It does very little analysis of its own beyond what is required to build its index.

Alpha is a calculator, only a calculator that can understand natural language, access the relevant data and present it in a form that is easily understood by humans. As such, it's an interesting tool. Alpha's ability to display data on a webpage is particularly impressive. Take a look at what you get when you query Alpha about the International Space Station.

Alpha displaying the current location of the ISS.

The results are impressive when Alpha knows how to handle your input. The cases where it doesn't (and there are many) leave much to be desired. For instance, searching for general terms does not yield much beyond basic information, and sometimes that information is only loosely related to the core concept. Consider the next image, where I searched for Computer Science. The result is no real information, only some related queries that Alpha can display results for.

Alpha's results for the term "Computer Science."

If you are interested in results that are analytic or mathematical in nature, Alpha might be a really good resource. For encyclopedia-like results, Wikipedia still rules the day. Of course, Alpha is not positioning itself to compete with Google or Wikipedia, but I think the comparisons are fair insofar as all of these services intend to provide information to people who are seeking it.

 

This brings me to the question, "Exactly how useful is Wolfram|Alpha?" I thought this might be the type of question that Alpha itself could help me answer. I started by searching for information about the United States' current population.

US Population Results.

These results are good and quite useful. Now, I wanted to get some information about education in the US.

Alpha's results for education in the US.

Not as much as I'd like to see, and the numbers are a bit out of date. Here I'm a little disappointed. I'd assume that raw numeric data such as this could certainly be computed upon to reach further conclusions. In my next search, I try something a little more advanced: let's calculate the current portion of the population that has obtained a college degree. This should be a simple calculation, given the data. The term for this value is "educational attainment," something I learned through a Google search that led to the US Census Bureau's page.


Alpha's results for "educational attainment."

Nothing. Ok, maybe I need to be more verbose and use terms that I know already lead to data. So, I searched for "education usa, population usa."

Alpha's results for "education usa, population usa."

What? The calculation isn't even complete. Education enrollment has no result, even though I had already found those numbers before. And even if all the numbers had appeared, no useful calculation or comparison would have been performed.

The number that I'm actually searching for is 28%. This represents the portion of the population over the age of 25 that has obtained at least a Bachelor's degree. This information comes from a Census Bureau page that turned up in a Google search. It would have saved me a lot of time if Alpha had produced this number. Given the data that it has access to (the US Census Bureau is listed as one of Alpha's sources), the calculation is fairly straightforward as well.

The reason I'm searching for this number, 28%, is that I believe the majority of people interested in the types of results Alpha can provide likely have a background in mathematics or science. Most of the impressive results are related to these fields. This is not hard to understand when you realize that "computable" human knowledge is much more likely to come from these fields than from, say, literature. Compare the results of searching for Poisson distribution with those for Gulliver's Travels.

Search: "poisson distribution" Search: "poisson distribution"

Search: "Gulliver's Travels" Search: "Gulliver's Travels"

I think you can see the vast difference in the quality of the result returned. Now, back to that 28% number. I'd assume that people with a background in math or science are likely to have a college degree. Of the people with a college degree, only a portion will have studied math and/or science, but let's just keep using the 28% number. What is 28% of the US population? Alpha can answer that.

Alpha's result for 28% of the US population.

The number: 85.63 million people. I'd conclude that this is a high estimate of the number of people who might ever be interested in the Alpha project. I think the majority of the population might type a query, get a graph, determine this is not what they want and move on.
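(For reference, the arithmetic behind that figure is simple and implies Alpha was working from a population estimate of roughly 306 million, which fits the US population at the time: 0.28 × 305.8 million ≈ 85.6 million.)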

People who are interested in these types of results will likely want to dig in deeper, as I tried to do above. I think at that point they will become disappointed in Alpha's inability to understand what information they are trying to elicit. Furthermore, if they are able to obtain just the data they need, I wonder what use it is beyond satisfying their own curiosity. It is difficult, if not impossible, to determine where, exactly, Alpha got the source information it needed to reach its conclusions. Even if this source information were available, the user still has no idea what equations, algorithms, or processes were applied to the data. Without this information, I find it difficult to believe that anyone would be able to use Alpha's results in any sort of document or report.

I think a fundamental improvement to Alpha would be to return precise source information that ultimately points to the raw data, along with a bit of Mathematica (or other) code that performs the calculation. Then the user could take this information, verify its validity, modify it as necessary, and incorporate it into their own work. Until this is possible, I'm afraid Alpha will only be used for curiosity's sake.

Despite these criticisms, I find Alpha to be a fascinating project, if only for its data presentation and cataloguing engine. The project is in an early stage, so perhaps the flaws I see are simply due to that. Perhaps the desired features are already in the pipeline. I hope that they are, because I'm a fan of what Wolfram is trying to do here. If they can improve the project to the point where these issues are no longer a problem, I can see Alpha becoming the world-changing product that Wolfram would have you believe it is. Until then, however, you might be better off searching somewhere else.

Living in the Clouds

More than a few months ago, Richard Stallman commented on cloud computing and made some striking remarks.

It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign. –RMS

I understand why Stallman said what he said; however, I think he’s looking at the problem from only one direction when it should be viewed from several.

Stallman’s position is that “cloud computing” equates to handing your data over to organizations and systems that are out of your control. In many cases, this is absolutely true and even dangerous. To this end, I say that when using apps in the cloud, one should make sure that there is some relatively painless way to export all of the data that the application may have.

I think Google Mail is a great example of an application that makes retrieving your data easy, chiefly by supporting known standards (namely POP and IMAP) and providing export functions for things like your contact list. For this reason, I feel pretty comfortable letting Google manage my mail for the time being. In return, I get some nice things that would be difficult to provide on my own, such as really powerful search, a consistent interface no matter what computer I use, and replication of my data with copies likely spread across multiple physical locations.

Now, where Stallman starts to make a lot of sense is when you consider organizations and users that need to have absolute control over their data. You can’t get this with a cloud application by the very fact that you don’t control (Stallman would probably say own) all of the software and the systems that are managing your data. This, obviously, can be a big problem if you need to make guarantees about the system to your customers.

In general, I try to remain aware of the above concerns. The reason I like Open Source Software is that I feel it empowers users. It is wonderful to know that you don’t have to pay a tax to any set of companies in order to use the hardware that you own to its fullest potential. Computation is much too important to have it controlled by any limited set of corporations or people. From this follows support for open standards and formats. I don’t want my data locked into some format that can only be manipulated by program X.

Carrying this idea forward,  I see a large opening for cloud computing. What Stallman failed to realize is that the concept of cloud computing can be used to provide greater access to the data that I do control. To me, cloud computing, at its heart, is about making users’ data more accessible. Why should this idea not carry over to the data on my desktop system? Why should the software that I run not be able to take full advantage of the Internet and the connectivity it provides?

Following Stallman’s central argument, I think he would favor some way of sharing photos online other than using a service such as Flickr. But for a large portion of users, I think, the main draw of a service like Flickr is the community. I could always share my photos by running my own web site that simply provides a way to navigate through my images, but this misses the point.

What I think should be investigated, for example, is some way to take the photos on my computer and have them, without effort, become part of the cloud. I don’t lose control of what I’ve created and what I own, but others are able to access my data, comment on my work and, generally, participate in a community.

I think this blog post by Google gives good insight into the sorts of advantages that cloud computing brings.

In coming years, computer processing, storage, and networking capabilities will continue up the steeply exponential curve they have followed for the past few decades. By 2019, parallel-processing computer clusters will be 50 to 100 times more powerful in most respects. Computer programs, more of them web-based, will evolve to take advantage of this newfound power, and Internet usage will also grow: more people online, doing more things, using more advanced and responsive applications.

By harnessing the power of a collection of computing systems, interesting things can be accomplished. What I would like to do is be able to plug my own system into the cloud.

Instead of having to sync my smartphone with my system when I’m at my desk, my phone should simply have access to my data, whether that data lives in a box under my desk or on a server in Google’s cluster. Access should be seamless, and updates should be available on every system without me having to connect a cable to each one individually. If I add a contact on my phone, I should be able to view that data on any device that has access to the web.

This is what I’d like to see cloud computing bring to the table. If the open source / GNU community doesn’t recognize this or attempt to support it, I think they will be doing their community a disservice. Open source software can make an impact by making cloud computing more peer-to-peer rather than the largely client-server model it follows today. The open source movement would likely not exist if it were not for the Internet, so it would seem contrary to its own culture not to make an effort to take that connectivity to the next level.

What could be more open than that?

Those who are interested should read this article on Ars Technica for another opinion that mirrors my own as well as to see how some current open source projects are embracing the idea of the cloud.

Stay on Target: Real Life Tron on an Apple IIgs

This story is so great that I had to link to it. Go read it now.

Stay on Target: Real Life Tron on an Apple IIgs.

As soon as they started explaining how missiles were added to the game, I knew where this was going. I didn't expect the result to be so entertaining and interesting. It's certainly a spectacular bug.

It also reminded me of my early programming days in high school when I learned that you could crash Windows 95 by doing something similar to this.



I remember being disappointed when this didn't work in Windows 98.

Cutting out the fat

Mac OS X has the unfortunate problem of trying to support two chip architectures without causing headaches for users. The solution is to compile applications as "fat binaries," meaning code for more than one architecture is part of the binary package.

I'm not really a fan of this solution, since it wastes hard drive space for everyone. The only upside is that it allows users who know nothing about what chip is in their computer to remain ignorant. Also, it typically doesn't cause problems for anyone.

Unless you are a developer.

If you happen to straddle the fence between open source and Apple provided libraries and tools, you could run into problems.

For instance, software compiled and installed using Fink is built only for the architecture that you are running on. There's no reason to waste space by building PPC code if you are using an x86 chip.

This can cause problems if you are trying to build a Python module that links against Fink installed libraries. Most Python distros for OS X are fat binaries. If you get a Python module that builds from source using distutils, it will try to build the code to match your architecture. In my case this means it creates a fat binary.

However, if you are missing the fat versions of libraries that the module references (say things from Fink), then you get compile errors when it can't find the library code for the other architecture.

I've run into this problem more than once, and until recently I hadn't been able to find a way to tell distutils to build code for only one architecture. So, I'm documenting the solution here (mostly for myself, and perhaps for some other poor soul who is able to find this using Google).

The solution comes by modifying the Makefile that is part of the Python distribution:
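The original screenshot of this step is missing, so here is a small sketch (not the original commands) showing one way to locate that Makefile and inspect the relevant flags from Python itself; the exact path and flag values will vary with your Python version and installation:

    # Locate the Makefile that distutils reads its build flags from,
    # and print the variables that typically carry the -arch flags.
    from distutils import sysconfig

    print(sysconfig.get_makefile_filename())
    # e.g. .../lib/python2.6/config/Makefile

    for var in ("CFLAGS", "LDFLAGS", "LDSHARED"):
        print("%s = %s" % (var, sysconfig.get_config_var(var)))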



Now, edit the Makefile to remove instances of "-arch x" for the architectures that you are not using. Here is an example of the edits I needed to make:
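Since the original example of the edit is also missing, the following is only a hypothetical illustration of the kind of change involved (the actual variable names and flag values depend on how your Python was built). The idea is simply to delete the -arch entries for architectures you don't need:

    # Before (universal build, hypothetical values):
    CFLAGS=  -arch ppc -arch i386 -fno-common -DNDEBUG -g -O3
    LDFLAGS= -arch ppc -arch i386

    # After (Intel only):
    CFLAGS=  -arch i386 -fno-common -DNDEBUG -g -O3
    LDFLAGS= -arch i386

After the edit, rebuilding the module (for example with python setup.py build) should produce a single-architecture binary.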



And there you have it. Module builds using distutils will automatically pick up the change and stop building for unused architectures.

Thanks to Wongo, linked below, for writing the blog post that I found and that allowed me to solve this problem.

Wongo’s Scraps of Code » Does my Mac really love me?.

Threads and the future of computer programming

Threads are evil.

There, I said it, and I stand by it too. If you find this sentiment a strange one, I suggest you start with a little reading.

Here is a great paper that says what I’d like to say better than I ever will in this blog: The Problem with Threads.

Illuminating, isn’t it? I agree 100% with what was written there. I suppose at this point it would be useful to clarify that it’s not so much that threads themselves are evil. Instead, it is the hoops that programmers must jump through in order to write correct multi-threaded code when they are working with most popular procedural languages (think C/C++, Java, Python, and so on).

I’m going to repeat a choice phrase from the paper here:

To offer a third analogy, a folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs.

This is not really hyperbole. To exhaustively test a multi-threaded program, one must consider all possible execution orders of the atomic instructions that make up the individual threads. In practice, this never happens because it’s not possible to reproduce all execution sequences. Furthermore, if you are coding in a high-level language like Python, it is probably not even clear what sorts of operations make up any one method or language operation.

Modern computer systems are organized chaos, meaning that when considered as a whole, the system is nondeterministic. While each program and thread may be deterministic, the way the system interleaves them, scheduling them with respect to one another on the hardware, is not.

Because of this, programmers are required to guard against this nondeterminism and ensure that memory is modified in a sequential and controlled manner. Beyond the most basic embarrassingly parallel problems, programmers are forced to deal with mutexes, semaphores, and other synchronization tools. I feel that every moment spent thinking about these items is a moment wasted.
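To make the problem concrete, here is a minimal Python sketch (mine, not from the paper linked above) of the classic lost-update race and the mutex that guards against it. Depending on the interpreter version and how the threads happen to be scheduled, the unsynchronized run can come up short:

    import threading

    ITERATIONS = 100000
    counter = 0
    lock = threading.Lock()

    def unsafe_increment():
        # Classic read-modify-write race: another thread can run between
        # reading `counter` and writing it back, so updates get lost.
        global counter
        for _ in range(ITERATIONS):
            value = counter
            counter = value + 1

    def safe_increment():
        # The lock forces each read-modify-write to complete as one unit.
        global counter
        for _ in range(ITERATIONS):
            with lock:
                counter += 1

    def run(worker, num_threads=4):
        global counter
        counter = 0
        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counter

    print("unsynchronized: %d" % run(unsafe_increment))  # may be less than 400000
    print("with a lock:    %d" % run(safe_increment))    # always 400000

The point is not the lock itself but the fact that the programmer has to reason about every possible interleaving to know whether the lock is even needed.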

To top it off, it is becoming increasingly difficult to write code without giving any thought to multiple threads. Try writing even a basic GUI without threads and you will find that many of the simplest cases really require additional threads to do the interesting work.

Even worse is the fact that individual processors are not becoming any faster. Instead, multiple processing units are placed on one chip. This requires the programmer to use multiple threads to make full use of the system’s hardware. This has implications beyond the application level, see SMP Scaling Considered Harmful for information about how operating system performance is affected by SMP systems.

So, I think it has clearly been established that concurrent programming, in its current form, is bad and will not be able to support a future where hardware becomes massively parallel (think 32 or 64 computation cores on a single system).

What can be done?

Well, there are a few ideas that are being worked on. One is to add more features to programming languages to support concurrent execution. If we take good old C as our example, it is clear that this is a language that was not designed for concurrent computing. Access to threads is supplied as a library (something like pthreads) and is not a part of the language itself. Some have proposed extensions to C to enable concurrent programming, possibly supported by the compiler. However, approaches similar to this don’t seem to have gained much traction lately.

Instead, I think a better alternative to extending a language like C will be to develop novel languages that allow programmers to express their design in cooperation with a parallel computation environment. I don’t have any brilliant examples to offer, but I imagine that such a language might offer synchronization methods and safe ways to communicate information from one computation element to another without having to burden the programmer with the details of synchronization.

Still, it seems that it will be some time before such languages ever gain any sort of traction for every day programming. An alternative to entirely new languages might be frameworks designed to support parallel computations. One such framework, made famous by Google, is MapReduce (see Hadoop for an open source implementation of the same concept). I think MapReduce is a great idea because it allows the programmer to concentrate on the problem while the MapReduce framework handles the details of actually performing the computation in parallel.
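As a rough illustration of the programming model (a toy sketch in plain Python, not Google's MapReduce or Hadoop's actual API), here is a word count expressed as map, shuffle, and reduce steps, with the map and reduce phases farmed out to a process pool:

    from collections import defaultdict
    from multiprocessing import Pool

    def map_phase(document):
        # Map: emit (word, 1) pairs for one document.
        return [(word, 1) for word in document.split()]

    def reduce_phase(item):
        # Reduce: sum the counts gathered for one word.
        word, counts = item
        return (word, sum(counts))

    if __name__ == "__main__":
        documents = ["the cat sat", "the dog sat", "the cat ran"]
        pool = Pool()

        # The map step runs in parallel across worker processes.
        mapped = pool.map(map_phase, documents)

        # Shuffle: group the intermediate pairs by key.
        grouped = defaultdict(list)
        for pairs in mapped:
            for word, count in pairs:
                grouped[word].append(count)

        # The reduce step can run in parallel as well.
        counts = dict(pool.map(reduce_phase, grouped.items()))
        pool.close()
        pool.join()

        print(counts)  # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}

The appeal is that the programmer only writes the map and reduce functions; the framework (here, just a process pool) decides how to spread the work across the available hardware.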

It seems likely that frameworks with goals similar to MapReduce’s will be created to allow the programmer to write code that executes in parallel. Another example of a framework that supports concurrent computation is CUDA, which is developed by NVIDIA to support computation on GPUs. Development with CUDA is largely based on the concept of many threads operating on independent areas of memory. While CUDA doesn’t guarantee thread-safe code, it goes a long way toward providing the resources necessary to make writing such code easier.

Another avenue of development lies along the lines of entirely new types of hardware. As the number of transistors that can be placed on a single chip increases, processors are going to gain more features and, most likely, more memory. Currently, the workings of cache memory are hidden from the programmer. That is, you cannot control the contents of the cache; the data that is present is a product of previously requested memory locations.

I think that future systems will provide programmers ways to control what resides in the fast memory that is available on the chip. An example of such a chip is the Cell processor where many vector units operate on local memory only. I think that systems similar to the Cell as well as different NUMA designs will only become more common as transistor counts increase.

Another example of new hardware is the GPU, which I already mentioned above with the CUDA framework. When exposed through CUDA, a GPU is really nothing more than a very fast vector processor with its own on-board memory. I think that CPUs of the future will likely begin to more closely resemble the Cell and GPU chips that are available now. Expect these chips to provide more access to the low-level memory that directly feeds the computation units. Expect these new processors to resemble a system on a chip instead of just a processing unit.

The hurdle right now is to create software to support these sorts of systems and I think there is a lot of development that can be done here. Imagine taking a language similar to Python and, without further involving the programmer, having it fully utilize all the computation units available. For some types of computations, this may be trivial, for others it is likely very difficult.

In short, I think it might become necessary for programmers to rethink how they approach describing computation to the computer. It seems that, in addition to dividing our programs into subroutines or objects and then tying them together, new abstractions will be created that make it easier for compilers and interpreters to produce machine code that works efficiently on massively parallel hardware.

Whew, this has been quite a long and rambling first post to this blog. The problems mentioned here are ones that I’ve been giving more thought to recently. They are certainly problems that should have a lot of brains thrown at them, as they stand in my mind as some of the fundamental problems facing computer science in the upcoming years. Outside of specialized code, hardware is quickly outpacing the state of the art in software. New and powerful chips aren’t useful if no one can program for them. I think John McCarthy stated it best when he said, “Such systems tend to be immune to programming.”

I hope you enjoyed this little rant. I’d like to finish by thanking the fine folks (namely btilly and Hangar) from the XKCD forum for posting links to the papers I’ve cited here. It’s a good thread and I’ve contributed some of my thoughts to it.

http://forums.xkcd.com/viewtopic.php?f=12&t=22614