| Fast Luhn-digit generation algorithm |
[okt. 16-a, 2005|12:48 pm] |
| [ | Tags | | | algorithms, java | ] |
| [ | Nuna humoro |
| | relaxed | ] |
| [ | Nuna muziko |
| | Comalies (Lacuna Coil) | ] |
I referenced the Luhn mod-10 check for credit card numbers in my last post.
The fastest implementation I'd encountered so far was my dad's, which was also pleasantly simple.
The weird part about the Luhn algorithm, from the point of view of a modern binary computer, is that it's oriented toward base-10 digits. One step of the algorithm doubles each second digit, and adds the digits of the result together until one digit remains. Popular with numerologists, but semi-complex if your data is in binary.
My dad simply built a table of 10 integers, containing the result of doubling-and-adding each possible digit — in other words, { 0, 2, 4, 6, 8, 1, 3, 5, 7, 9 }. In hindsight, this is a pretty obvious optimization, but boy, does it speed things up.
I've taken it a step farther. Here's mine in pseudocode.
deltas := { 0, 1, 2, 3, 4, -4, -3, -2, -1, 0 }. sum := the sum of all digits in the input. for every other digit d in the input: sum := sum + deltas[d].
check := 10 - (sum mod 10).
The deltas array contains the correction for doubling the digits — the difference between adding the digit straight, and adding it with the double-and-add algorithm.
In C, this is 150% faster than the original (already fast) implementations. It also lends itself reasonably well to Altivec optimization, not that it really matters, since the input data sets are so small.
The strange thing? I've seen a lot of Luhn implementations. I've never seen anyone do this. Weird. |
|
|
| Generating large numbers of valid credit card numbers for fun and !profit. |
[okt. 15-a, 2005|07:32 pm] |
| [ | Tags | | | java | ] |
| [ | Nuna muziko |
| | random cafe jazz | ] |
So, let's say you need a Java program to generate a large number of "valid" credit card numbers quickly. The credit card numbers need to be passed into another function as 16-position byte arrays filled with base-10 ASCII digits.
By "valid" numbers, I mean that they're of appropriate length, format, BIN, and meet the Luhn mod-10 check. They may or may not map to an actual credit card number.
This would be useful if one were, hypothetically, brute-force colliding an SHA-1 hash on a Saturday night to prove a point. But I digress.
How best to do it?
( The evolution of an algorithm )
Using the current semi-final algorithm, my little iBook can generate a valid number per 27ns — a sustained throughput rate of 565MB/sec.
Anyone have any additional suggestions? |
|
|
| Java splash screens. Yay. |
[sep. 30-a, 2005|01:02 pm] |
Through a convoluted circuit of weblogs, I discovered an article on the new Splash Screen functionality that's going into Java 1.6/6.0/60/whatever.
To cut to the chase, I think this is a spectacularly bad idea. For a more involved discussion, see the dialog with the article that follows:
Splash screens are a standard part of any modern graphical user interface (GUI) application.
That's stretching the definition of the word "modern." Splash screens are a part of many GUI applications that have poor startup times. They are frequently modal, which irritates the hell out of those of us trying to multitask.
Their primary purpose is to let the user know that the application is starting up. An application that displays a polished and professional-looking splash screen can occupy the user's attention and gain the user's confidence that the application is starting.
Once upon a time, my Nova was having engine problems. It would often require several seconds of pumping the gas and holding the ignition before it would start. According to the theory referenced here by Sun, the solution to this would be a happy little "I'm starting!" light on the dashboard, to "gain the user's confidence" by reassuring me that, yes, the car would eventually start.
In the real world, of course, the solution is to make the car start when I tell it to, dammit, and the same goes for software. Splash screens are a useless distraction from the slow application startup, which itself is consuming what could be productive time.
Sure, there's going to be some latency at startup for any application, but proper design can minimize the effects. Cesta, my network analysis app (which is progressing slower now that I have an actual job), starts up nigh-instantaneously on outdated hardware, because of a number of techniques (most of which have the word "lazy" in their name).
The user should be able to get to a productive point as quickly as possible, even if the application is still loading stuff in the background.
Now, this does tend to be difficult in Java, for a number of reasons, but Sun should be addressing that issue, not giving us splash screen libraries. |
|
|
| Palm is dead. |
[sep. 25-a, 2005|11:28 am] |
Well, Palm (the hardware company) released a Treo running WinCE.
I think this is probably the death-knell for the Palm platform, in the sense of PalmOS. I have mixed feelings on this.
On the one hand, I've always considered Palm to be far superior to WinCE for organization and such. Even on underpowered hardware, it was always so much more responsive, largely because it never had to shuffle data around in memory. (Programs and data ran out of their storage space.)
On the other hand, the Palm platform has not advanced since, arguably, 4.0. Sure, 5.0 introduced some snazzy new APIs and ARM compatibility — but most user code is still 68k. If you want to take advantage of the ARM processor (without an emulation layer), you have to write these odd little code fragments called ARMlets, and miss out on most of the system API that makes Palm so powerful.
Not to mention the fact that there's still no solid support for background tasks, memory protection, or modern languages. I run a third-party Java VM on my Tungsten C, but it, like ARM code, can't access the APIs. Gotta code in 68k C for that.
It might be that Palm has been dead for several years, continuing only out of momentum. That's sure what their new-sale market share suggests. But it makes me sad. Some tasks are particularly well-suited for a handheld computer, but WinCE is (ironically) not. Nor is Linux, in many ways.
Sigh. |
|
|
| Python is strange. |
[sep. 23-a, 2005|11:20 am] |
Python's doing something weird to me.
I can't really blame it, because I'm sure as hell doing something weird to it.
Given the following code:
array = [1, 2, 3, 4, 5]
funcs = [(lambda y: x * y) for x in array]
...one would think funcs would contain five functions, which would each multiply their sole argument by a number between 1 and 5.
So, then, what do we get from this:
[foo(2) for foo in funcs]
Should print [2, 4, 6, 8, 10] — we're evaluating each function, in sequence, with the integer 2 as the argument.
What does it print?
[10, 10, 10, 10, 10]
But if you print the funcs array, you see that the five lambdas are, indeed, distinct and separate objects!
It seems all the lambdas bind to the same variable, temporarily named x, which is updated each iteration. This strikes me bad, or at least counterintuitive.
I sincerely hope there's a cleaner way of doing this, but here's the easiest way I've found to make it work:
funcs = [eval("lambda y : %s * y" % str(x)) for x in array]
*shudder*
I'm sure some Pythonite out there can explain to me why this makes sense. I'd like to hear it. After that, please explain the results of this one:
[x(2) for x in [(lambda y: x * y) for x in [1,2,3,4,5]]]
(Nested lexical scopes my ass.) |
|
|
| Source to the AddressBook SIM example |
[sep. 22-a, 2005|11:03 am] |
| [ | Tags | | | java, sim, swing | ] |
| [ | Nuna humoro |
| | amused | ] |
| [ | Nuna muziko |
| | Jolene (Cake) | ] |
Here's the full example sources — SIM and Java — for the AddressBook example.
This does not include the SIM sources. You won't be able to build this. But it makes for interesting reading.
( Sources! )
The end result? Click the button, a person is added to the list, the table updates instantly, etc. It just works. |
|
|
| A more interesting SIM example |
[sep. 21-a, 2005|09:55 am] |
| [ | Tags | | | java, sim, swing | ] |
| [ | Nuna humoro |
| | amused | ] |
| [ | Nuna muziko |
| | Honey (Moby) | ] |
Here's a slightly more interesting SIM example. This one presents a traditional master-detail interface for a simple address book.
New features:
- Splitters, which provide a draggable divider between two components.
- Tables, which are lists of items with multiple columns.
- Bindings between UI components themselves, instead of direct to the controller.
<?xml version="1.0"?>
<s:sim-ui xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:s="http://www.cliff.biffle.org/sim/ui.xsd"
xsi:schemaLocation="http://www.cliff.biffle.org/sim/ui.xsd ./sim.xsd">
<!-- A vertical splitter presents two components, one above the other. -->
<s:splitter axis="vertical" id="addressPanel">
<!-- The table gets its data from the owner's 'people' property,
and provides a virtual bean called 'person' for each row to bind to. -->
<s:table id="peopleList" var="person"
value="#{.people}">
<s:tableColumn name="Name" value="#{person.name}"/>
<s:tableColumn name="Phone" value="#{person.phone}"/>
</s:table>
<s:grid columns="2">
<s:col autosize="false"/>
<s:col autosize="true"/>
<!-- The text fields bind to properties of the table's selectedObject object,
causing them to update as the selection changes. -->
<s:label for="name" value="Name:" mnemonic="n"/>
<s:textField id="name" value="#{peopleList.selectedObject.name}"/>
<s:label for="phone" value="Phone:" mnemonic="p"/>
<s:textField id="phone" value="#{peopleList.selectedObject.phone}"/>
</s:grid>
</s:splitter>
</s:sim-ui>
I'll post a demo JAR once I'm not at work (no external ssh here). For now, some details:
- The table cells and text fields automatically become editable or non-editable, depending on whether their bound property can be changed.
- All the same two-way bound-property mojo from the last example exists here. That is, for editable properties, the table cells affect the text fields, the text fields affect the table cells, and programmatic updates affect both.
- Column sorting in the table will work soon. Swing makes this marginally more difficult than it should be (though, I would argue, not as difficult as Cocoa).
|
|
|
| SIM: Simple Interface Markup |
[sep. 19-a, 2005|07:02 pm] |
| [ | Tags | | | java, sim, swing | ] |
| [ | Nuna muziko |
| | No Education (Apocalyptica) | ] |
Over the past few months, I've been working with JavaServer Faces. As with any 1.0 technology, it has its share of warts — but despite them, it really is a generational advance over and above previous frameworks.
I've also been doing a lot of Cocoa development, using Xcode and Interface Builder.
Then, I tried to put together a Swing app. DEAR GOD. And to think I used to do Swing development for a living!
So, as I mentioned in a previous post, I've been working on desuckifying Swing. Here's my opening volley.
Allow me to present SIM, Simple Interface Markup. SIM allows you to define your user interface using a simple XML structure, and dynamically bind it to your controller and model layers.
SIM uses basically the same model as Cocoa NIB files, with a binding architecture inspired by JSF. As an example, here's a Fahrenheit-Celsius temperature converter, stolen from Apple's Cocoa Bindings examples. This is written in a very preliminary dialect of SIM, but it works:
<?xml version="1.0"?>
<s:sim-ui xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:s="http://www.cliff.biffle.org/sim/ui.xsd"
xsi:schemaLocation="http://www.cliff.biffle.org/sim/ui.xsd ./sim.xsd">
<s:grid columns="2">
<s:col autosize="false"/>
<s:col autosize="true"/>
<s:label for="degreesF" value="F:" mnemonic="f"/>
<s:textField id="degreesF" value="#{fahrenheit}"/>
<s:spacer/>
<s:slider min="-40" max="300" value="#{fahrenheit}"/>
<s:label for="degreesC" value="C:" mnemonic="c"/>
<s:textField id="degreesC" value="#{celsius}"/>
<s:spacer/>
<s:slider min="-40" max="300" value="#{celsius}"/>
</s:grid>
</s:sim-ui>
A dissection: - The grid element defines a display panel (in the Swing implementation, a JPanel) with the specified number of columns. Layout is much like an HTML table: columns and rows expand as needed to house their contents. - The col elements (which are optional) specify that the second column should autosize as the panel is resized. - The label, textField, and slider elements declare UI controls. The textField and slider controls in this example use value bindings to two properties, "fahrenheit" and "celsius." - The spacer element simply eats a position in the grid.
Like NIBs in Cocoa, the SIM file is instantiated at runtime with a particular object as its owner. All bound properties (like "fahrenheit" and "celsius") are connected to this owner object through standard JavaBeans methods. Using a simple F/C converter bean, one can drag the sliders and watch everything update in real-time.

If you've done this in Swing, you'll probably recognize how much work we just saved. (Hint: clubbing GridBagLayout into doing what we want, writing event listeners to keep properties in sync without excessive method calls, doing value conversions and validation to/from the model objects.)
And unlike a certain offense against nature *coughJSF-on-JSPcough*, the "for" attribute on the labels will work even if the labels come before their peer components. :-)
I'm packaging up a demo, which will appear here when ready. |
|
|
| JFrame minimum size stupidity! |
[sep. 14-a, 2005|08:59 pm] |
I'm writing a neat library. It lets you define a UI in simple, JSF-inspired XML, and use Cocoa-inspired bindings to connect it to your controller and model objects.
Basically, it aims to take the suck out of Swing.
But that's proving more difficult than anticipated. It's clear to me that the Swing layout managers, in particular, weren't designed test-first. It's not just that they don't lend themselves to a specific use — they really don't lend themselves to use, in general. Doing the World's Most Common UI Layout — controls with text labels on the left — is surprisingly difficult.
But that's nothing.
I discovered, with the working prototype of my engine in hand, that JFrames will happily shrink to tiny sizes. They seem to totally ignore the "minimum size" of their contents. They have a setMinimumSize() method that does nothing.
Yay.
The recommended method? Register a ComponentListener with the JFrame, which will get notification when the size changes. If the size is outside your desired bounds, set it back.
Yay. If the system is slow enough (or on the Mac, where the UI updates strangely), you can visibly see the frame bounce between sizes. |
|
|
| Safari's WebArchive format |
[sep. 10-a, 2005|01:27 pm] |
I've seen several discussions online about Safari's WebArchive format (new in 10.4).
This thread on CocoaBuilder claims the byte-level format is proprietary, but that you can load them with the WebArchive class in WebKit.
An entry from Paul Thurrott's blog (search for Safari, the permalink is broken) claims that Safari's WebArchive format is Safari-proprietary just like IE's.
Seriously. Does nobody use a hexeditor anymore?
It took me about 30 seconds to determine that Safari's WebArchive is, in fact, a standard Mac OS X plist file, albeit in the new binary encoding (introduced in Tiger along with WebArchives). To confirm this, save any page as a WebArchive (Safari: File -> Save As, choose Web Archive as the format). Then, with this file (which we'll call foo.webarchive), do this:
open -a 'property list editor' foo.webarchive
Alternatively, you can rename the file to foo.plist and double-click it.
Either way, if you've got the developer tools installed, you'll get the WebArchive open in tree format. Here's what my sample contains:
/WebMainResource/WebResourceData: a binary data chunk containing the page source exactly as received. Property List Editor (the worst program on the Mac) presents this only as a string of hex digits; pipe this through xxd -r -p to decode. /WebMainResource/WebResourceMIMEType: the mime type of WebResourceData, as described by the server. /WebMainResource/WebResourceTextEncodingName: the text encoding used in WebResourceData. This is only apparent for text data, like HTML. /WebMainResource/WebResourceURL: the URL from which the resource was originally retrieved. /WebSubResources: an array of dictionaries describing linked objects, each of which contains identical keys to the /WebMainResource dict.
I'm working on a tool to manipulate these from the command line, just to prove my point.
This isn't exactly an open format, but at least it's an obvious format. Hooray obvious formats!
Update: With my new command-line WebArchive tools in hand, I've discovered that there's an additional key in each WebSubResources entry: WebResourceResponse. It contains -- get this -- another binary plist file, embedded as binary data.
The embedded plist is actual serialized ObjC objects respresenting the server response.
I'm working on code to deserialize and inspect them now. |
|
|
| Selfish packet decoding |
[sep. 4-a, 2005|01:14 pm] |
Haunted Xtreme Bean last night. They had the same folk guy they have every Saturday, who did the same folk Ace of Base cover.
*shudder* Thank god for folk-reducing headphones.
Anyway: at about 1AM on a yerba mate-fueled coding binge, I hit on something neat.
At the moment, I'm referring to it as "selfish decoding" to distinguish it from "lazy decoding." In short, the decode engine only decodes packets that contain data in which it is interested. All others are simply stored away (if requested).
The engine is 'interested' in a particular datum if a user-specified filter or display makes use of it. For example, let's say the user has asked to see all URLs of all HTTP requests. These are represented internally by a unique attribute name, which we'll call http.url for the sake of discussion.
Now, here's the key: since I control the entire protocol stack, I have rich information about each layer. For example, I know that http.url attributes are defined for only one type of event: an HTTP request. These can come about two ways.
First, an HTTP request can be a sequence of bytes in an HTTP stream. The only defined transport for such streams is TCP at the moment, and the only defined transport for TCP is IP.
Second, an HTTP request can be a UDP datagram complying with the Multicast-HTTP standard. UDP, again, is transported over IP. (Yes, I know other transports are possible, but this is what Cesta's decoders support right now.)
IP packets can be arrived upon in a number of ways, including IP-in-IP tunnelling, AOL's ludicrous TCP port 5190 tunnels, or plain ol' layer-3 transmission.
So. If the user is only interested, at the moment, in the aforementioned http.url attribute, how many ways can we derive it?
Link → IP → TCP → (stream) → HTTP → Request → http.url Link → IP → UDP → MHTTP → Request → http.url Link → (some tunneling scheme comprising multiple layers) → IP → (either of the above)
We can merge these paths together into a decision tree that tells us, at each point, whether it's possible to derive an http.url attribute from the current data. If, at any point, the decode engine detects that it's not possible, decoding for that unit stops.
For example, if we receive an Ethernet frame that indicates that it contains a Netware packet, don't bother decoding. IP packet that contains ICMP? Don't bother. TCP segment for a stream that was previously determined to contain, say, an SSH session? Foo.
The goal? To do as little work as possible to get the requested information, but no less. |
|
|
| Network analysis: a user-centric approach |
[sep. 2-a, 2005|02:19 pm] |
| [ | Tags | | | cesta | ] |
| [ | Nuna humoro |
| | contemplative | ] |
| [ | Nuna muziko |
| | Amp (Fluke) | ] |
For nearly two years now, I've been hacking on my network analyzer, Cesta. It's a nice piece of work, if I do say so myself, and can already has a number of unique features.
However, of late, I've been feeling a little muddled in my directions with the product.
So, I've decided to step away for a moment and look at the user side. What would the user (in this case, me) want to do with a network analyzer? "Viewing packets" isn't a very interesting use case; what else can I come up with?
Here's the list I'm working with so far. Suggestions would be appreciated. ( What do you do with a network analyzer? ) |
|
|
| java.lang.Cloneable is not remotely.freaking.Usable |
[aŭg. 20-a, 2005|04:12 pm] |
| [ | Tags | | | java | ] |
| [ | Nuna humoro |
| | aggravated | ] |
| [ | Nuna muziko |
| | His Truth is Marching On (Mike Doughty) | ] |
I discovered today, after much fighting, that JSF's marvelous h:dataTable component won't work with a Set — it requires an array or List (or an SQL ResultSet, if you have no layers). However, it won't actually tell you this — instead, it will present itself as a table with a single row, which is your Set. Yay.
Sets are my most-commonly-used Collection, so I needed a conversion method. Easy enough, I thought, thanks to generics:
public <E> List<E> listFromSet(Set<E> items) {
return new ArrayList<E>(items);
}
Easy enough — arguably, easy enough that it doesn't deserve a method. However, I soon discovered that I needed to deep-copy the Set to force lazy initialization of its contents. The Collections framework doesn't provide a deep-copy method anywhere, which is probably for the best: I didn't need a true deep copy, so much as a shallow copy of the Set's contents. So I wrote one:
public <E extends Cloneable> List<E> listFromSet(Set<E> items) {
List<E> list = new ArrayList<E>();
for(E item : items) {
list.add(item.clone());
}
}
If you see the problem in the above code, you're wiser than I.
For those on my level, let me point out that java.lang.Cloneable does not actually include the clone() method. In fact, there is no interface in the JDK that specifies a public override of clone()!
As a result, even this one-level-deep copying cannot be implemented generically.
Yay. |
|
|
| Spring, EasyMock, and making Java OLTP not suck. |
[aŭg. 7-a, 2005|12:35 am] |
| [ | Tags | | | hibernate, java, spring | ] |
| [ | Nuna humoro |
| | amused | ] |
| [ | Nuna muziko |
| | And She Was (Talking Heads) | ] |
So, that neato whiz-bang annotation I described in my last post?
Somebody beat me to it, by a good margin. It's one of the features of the Spring Framework.
I had looked at Spring some months ago, and decided it was trying to do too many things. I've looked again.
As many of you already know, my great programming quest is to reduce the amount of code I must write and maintain. CIS, the system I maintain at work, is a great example of how not to do this. Duplication of labor and code, combined with a few seemingly innocent design decisions, have made the system tightly coupled and difficult to reuse.
I've been maintaining a number of CIS-alikes, testbeds for new technology. This weekend I created a new one, based on Spring, Hibernate, and JSF.
Holy crap. It's like all the things EJB promised to be, but without having to touch EJB.
I've got about a quarter of the domain layer implemented. The largest portion is 386 lines of simple JavaBeans for the business objects. Persistence is handled by Hibernate; wiring, transactions, and session state is handled by Spring.
The equivalent layer in CIS is roughly 30 thousand lines of code. Its classes are tightly coupled and nearly impossible to unit test. Testing the domain objects in the Spring/Hibernate system is trivial using EasyMock; I've already got a full test suite for the completed portion.
The damn thing Just Works for once. (Though I wish I could say the same for JSF, but more on that later.) |
|
|
| Transaction control with annotations |
[jul. 31-a, 2005|09:27 pm] |
It seems like the body of every damn business method written in Java looks something like this:
try {
beginTransaction();
doSomeWork();
commit();
} finally {
endTransaction();
}
In other words, you get a database connection, start a unit of work, do some stuff, and commit if it succeeds. Whether or not it succeeds, you call some end-of-transaction method, that rolls back if there hasn't been a commit() and returns the database resources.
I'm reworking CIS (the system I support), and this idiom was a wart on an otherwise clean redesign. Unacceptable.
I thought, "Hey, this seems like a job for a macro!" But, Java lacks a hygenic macro system, so that was right out. Instead, I've duct-taped together a solution using Java 5 annotations and load-time code generation.
The annotation, @Persistence, has two modes. READ_ONLY indicates that the function needs no separate unit-of-work; the code generator ensures that a database connection is available, and handles the cleanup on exit. READ_WRITE, on the other hand, manages the unit-of-work. The function will commit if it completes successfully (with a 'return') or roll-back automatically if an exception is not caught.
For example:
@Persistence(READ_ONLY)
public List<Customer> findCustomers(String namePattern) {
/° whatever */
}
@Persistence(READ_WRITE)
public void deleteCustomer(CustID id) {
/* make customer go bye-bye */
}
In general, I feel like Java 5 annotations are going to be overused. I don't like mixing configuration information into the source code. However, I find them terribly handy for controlling bytecode instrumentation.
This task was needlessly complex, however. We need an annotation-driven macro system. |
|
|
| Medians! |
[jul. 24-a, 2005|08:17 pm] |
|
After my most recent Google interview (and pursuant to subsequent discussion with mge), I've been looking into the various definitions of the median.
I've encountered three thus far. I've given them names for the purposes of discussion; these are not formal names by any means.
- The statistical median is what I learned in stats, and is what is used by social scientists (like psych folks) for reporting the median. To avoid ambiguity, it is best stated graphically: if all the elements in your data are placed on a histogram, the statistical median is the value that divides the graph area exactly in half, were you to draw a vertical line through it.
- The arithmetic median is apparently taught in some elementary schools. If you have an odd number of data points, sorted in order, the arithmetic median is the middle value. If you have an even number, choose the two middle values and average them.
- The median element is frequently used in discussions of algorithms in computer science. When working with numbers, it is identical to the arithmetic median. When working with non-number items — say, letters — it is more loosely defined as the middle element, which may be either of the two "middles" if there are an even number of data points (depending on the algorithm).
In the order I've presented them, these are basically approximations. The arithmetic median is a decent approximation of the statistical median: if the data set does not have a number of equal values clustered at the median, they produce the same result. Likewise, the median element is a good approximation of the arithmetic median — and if the operation "average the two middle elements" makes sense for the data, they are equivalent.
Why the disconnect? Why have three methods with the same name? Here's how I see it:
- The median element algorithm is popular in computer science because it's a special case of a more general problem: the k-smallest problem. Frequently, one will wish to find the k-smallest element of some data set — the 1st-smallest is the smallest element, and so on. Finding the median element for data of size n is finding the n/2-smallest element; most methods for finding the median element can be generalized to finding the k-smallest for any value of k. (It's also the only version of "median" that makes sense if one is dealing with something other than numbers, which is quite common in computer science.)
- The statistical median exists as an exact measure of central tendency; the two other popular measures are the mean, or average, and the mode. These are frequently used in statistics. It accurately reflects skewing of the data set toward the median, but only makes sense if your data lies along a continuum — it won't work if discrete math is necessary.
I'm not sure what the purpose of the arithmetic median is, except as a simplified case of the statistical median. Its results will hold in some cases, but be off by ±1 element in others. Clearly, users of the arithmetic median want to take skew into account — otherwise, they'd just compute min(data) + (max(data) - min(data))/2, which will equal the median in a normal (unskewed) distribution.
I'm vaguely annoyed at the communities involved for this disconnect, and at Google for asking for a median algorithm rather than a k-smallest algorithm. In hindsight, they were nudging me toward deriving Rivest's algorithm for computing the median element, which unfortunately won't work for the statistical median. (I derived a distributed, low-O way of approximating the median using a few other measures of central tendency and some measures of skew, which is probably more complex than they were expecting.)
If anyone can come up with a reason for the arithmetic median to exist, I'm interested.
|
|
|
| Sonofabitch. |
[jul. 18-a, 2005|11:38 am] |
| [ | Nuna humoro |
| | bored | ] |
| [ | Nuna muziko |
| | Girlfriend for a Day (Radio Free America) | ] | Well, color me embarrassed. I misinterpreted parts of the JVM spec.
I've been working on the code generation for Kona's once-methods. The semantics for these methods are:
- The method body will execute precisely once.
- The return value from that execution will be cached, and returned from all future calls.
Thinking myself clever, I was generating code that followed this rough pattern:
private Foo theFoo;
public once Foo getFoo() {
if(theFoo == null) {
synchronized(this) {
if(theFoo == null) theFoo = new Foo();
}
}
return theFoo;
}
This was an attempt at avoiding the overhead of a synchronized method. However, I got to thinking about it, and it ain't gonna work, for the following reasons:
- At the level of the VM (and the compiled JIT code), object allocation and object initialization (via the constructor) are separate steps. In the face of reordering (both of code and writes), I'm pretty sure the method could return an uninitialized object instance. (I need to do some disassembly to see if the JIT will really do this.)
- Moreover, threads are capable of making their own copy of a variable, including instance fields. It's possible for a thread to see its copy of
theFoo as null even when another thread has already initialized it. volatile may help there, but last time I tried it, it didn't do anything.
- The
monitorexit op, used to end a synchronized block, does not flush state to memory.
- Don't even get me started on issues like cache coherency in multiprocessor systems.
So, I've gotta quit doing that. I'm looking at two alternative approaches for once-methods: synchronizing the whole method, or using a Strategy object to implement the method.
Synchronizing the method may be bad, since the author didn't mark it as synchronized. That's a recipe for surprise deadlocks.
The Strategy object has higher memory overhead, and requires an additional virtual function call.
We'll see which wins. The choice may differ depending on context. |
|
|
| Kona, and significant-bits |
[jul. 16-a, 2005|02:03 pm] |
|
I've put up a preliminary page for my superset of Java, tentatively called Kona. It's looking more and more like I'll need to implement my own javac — which is fine, because javac sucks in the warnings department.
If only it were properly open-source, eh?
Currently, the only area in which Kona is incompatible with Java — that is, where source that compiles as Java will not compile as Kona — has to do with implicit conversion to floating-point. Java will implicitly convert an int (signed, 32-bit integer) to a float (signed, 32-bit IEEE-754 floating-point value). This is bad, because while a float has a greater dynamic range than an int, it is of lower precision — 22 bits worth, to be exact. (The rest of the word is reserved for exponent and sign bits.) Thus, any int with more than 22 significant bits may change in value when converted to a float.
Kona will have a warning for this case, just as javac warns about float⇒int conversions.
However, it got me to thinking more about the notion of 'significant bits,' and how a compiler can track this. I'm sure someone else in CS has already done this, but I haven't read their work, so feel free to correct me where I screw this up:
( Computation and uses of significant bits ) |
|
|
| JVM bytecodes <--> Java mismatches |
[jul. 12-a, 2005|04:28 pm] |
The JVM was designed to run Java — and, most of the time, it shows. As of Java 1.0, the language mapped more-or-less directly onto JVM bytecodes and classfile structures.
Some features have deviated further — inner classes in 1.1; assertions in 1.4; autoboxing and the new for loop in 1.5/5.0. But in all these cases, the Java language became a superset of the JVM, which still mapped to Java's basic concepts.
However, there are a few JVM operations that Java doesn't expose.
Three operations, lcmp, dcmp, and fcmp (each with a g and l variant) work like the "flying-saucer" operator <=> from FORTRAN: a <=> b is -1 if a is less than b, 0 if the two are equal, or 1 if a is greater than b. Java has no such operator; it uses the cmp instructions with its many integer-comparison ops to compare non-integers.
It looks like Java's generic method-invokation operation, invokevirtual, may permit overloading on return type, in addition to argument types.
The JVM has a goto op, for use within methods. This is used to implement a few flow-control constructs.
The op used to implement Object casts, checkcast, does what its name implies: it checks that a value is assignment-compatible with another type. If execution passes the op, the 'cast' succeeded. This means the pattern of "Use instanceof to check class, and if compatible, create a new variable to hold the casted value" is completely unnecessary. In fact, conditional assignment (from Eiffel or Mongoose) is closer to the actual operation performed.
The invokestatic op is simply a non-OO function call.
jsr and ret ops are available which allow subroutine branching within methods. The linkage (return) address can be stashed into local variables, and nested jsrs can be returned from out-of-order. This could be used for some complex flow control.
monitorenter and monitorexit, the ops used for either end of a synchronized block, appear to be able to overlap. That is, one could run "lock a; lock b; unlock a; unlock b;". This may be nixed by the verifier, and needs testing.
In addition to lookupswitch — the op used to implement switch — the JVM has a tableswitch op which is, in essence, a computed goto. This seems to be used to optimize switch blocks with a small, consecutive range of case values, but could create more interesting flow control. |
|
|
| navigation |
| [ |
viewing |
| |
most recent entries |
] |
| [ |
go |
| |
earlier |
] |
| |
|
|