Re: GEOM architecture and the (lack of) need for foot-shooting

From: Poul-Henning Kamp <phk_at_phk.freebsd.dk>
Date: Fri, 08 Apr 2005 00:57:55 +0200
In message <440f480855b36bcc43281835e1e3781d_at_xcllnt.net>, Marcel Moolenaar writes:

>One can argue that these tools weren't broken before GEOM came along
>and that the implementation of GEOM in FreeBSD could have been done
>slightly better? (see also below)

[Don't shoot the piano player, he's doing the best he can]

>Questionable. What about the following reasoning:
>
>The partition table on a disk is there to help the firmware and OS
>to identify the kinds of file systems on that disk and their bounds.
>Once the OS has been loaded and has obtained all the information it
>cares about, the partition table is not needed anymore. Its existence
>has become irrelevant. Removal of the partition table does not in
>any way invalidate the file systems that are on that disk, nor make
>them inaccessible to the CURRENTLY RUNNING OS. It is only when the
>partitions are to be found again across a reboot that the partition
>table needs to be there and needs to be valid.

I think that is a recipe for disaster and the fact that all operating
systems which implement it have resorted to all but forcing a reboot
right after any change seems to validate my point.

Which view do you offer the user if he enters the partitioning tool
a second time before he reboots ? 

The in-memory or the on-disk ?

What about crash-safety ?

How do you deliver a credibly convincing argument that the user's
system is going to boot again ?  At the very least, the disk partitioning
tool needs to say "This will hose your system  Abort/Retry/Ignore"

You basically need to implement a high level of overview and inference
in the tools to be able to protect the novice users.

That is why I didn't go that way.  Instead I opted for a system where
we are at all times in a single consistent state, and where we do
not allow operations which take us into an inconsistent state.

>Even if a replacement partition table encodes a completely different
>layout, it does not have to be a problem. The OS just needs to ignore
>the partition table.

It does not have to be a problem, but how do you implement code to
find out _if_ it is a problem so you can warn the user ?

In one of my GEOM prototypes I had a protocol where a provider could
ask the consumer which bits it really cared about.  That way you
could tell a mountpoint to shrink what bit of the disk it used
and afterwards reclaim that bit of the partition.

In the end I decided that this was waaaay too much code to
justify the functionality.

If you want to go the way you describe, you will need to do something
like that because otherwise you have no way of making sure that
you are really having two *functionally* identical views.

The reason I tried to go that way was the overcommit of the first
8k of a BSD partition for both filesystem and bsd-label.  I wanted
the filesystem to say "I'm really only using [8192....N] of this disk"
so that the BSD-label would be under a different protection domain.

In the end I realized that I was trying to accommodate a bug which
only one filesystem in the world ever made, and dropped it.
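
A minimal sketch of the kind of range-query protocol described above,
for illustration only. It is not GEOM code and this protocol never
shipped; the names Consumer, Provider, ranges_in_use and shrink_to are
hypothetical. The provider asks each consumer which byte ranges it
really cares about and refuses to shrink below any of them.

```python
class Consumer:
    """Hypothetical consumer (e.g. a filesystem) attached to a provider."""

    def __init__(self, name, used):
        self.name = name
        self.used = used  # list of (start, end) byte ranges, end exclusive

    def ranges_in_use(self):
        # Answer the provider's question: "which bits do you really care about?"
        return list(self.used)


class Provider:
    """Hypothetical provider (e.g. a partition) of a given size in bytes."""

    def __init__(self, size):
        self.size = size
        self.consumers = []

    def shrink_to(self, new_size):
        # Refuse if any consumer reports bytes at or beyond the new end.
        for consumer in self.consumers:
            for _start, end in consumer.ranges_in_use():
                if end > new_size:
                    return False
        self.size = new_size
        return True


# A filesystem that says "I'm really only using [8192....N] of this disk".
fs = Consumer("fs", [(8192, 1_000_000)])
part = Provider(2_000_000)
part.consumers.append(fs)
```

With this in place, shrinking to 1,500,000 bytes would be accepted and
shrinking to 900,000 bytes would be refused, because the filesystem
reports data up to byte 1,000,000.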

>Thus:
>Is it actually the right behaviour to invalidate the OS's notion of
>disk partitions whenever the on-disk tables are changed or removed
>and if so does that hold in all cases?

We don't invalidate "whenever the on-disk tables are changed", we
veto the change if it would jeopardize the currently opened providers
under the presumption that all our current users (filesystems)
explode if you look at them wrong from that angle.

You can safely extend a partition, because the filesystems will not
really notice, but you cannot shrink it (for the lack of the above
mentioned protocol).

You can remove any non-open partition/slice/whatever.  (The current
trouble is that you have no way to communicate this intent; if the
communication succeeded, the removal would happen).
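
The veto rule in the last three paragraphs can be sketched as follows.
This is an illustration under stated assumptions, not the GEOM
implementation: Partition, open_count and veto_reason are hypothetical
names, partitions are matched by name, and sizes are in sectors.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    start: int           # first sector
    size: int            # length in sectors
    open_count: int = 0  # nonzero while something holds the partition open


def veto_reason(current, proposed):
    """Return None if the proposed table may be written, else a reason."""
    new = {p.name: p for p in proposed}
    for old in current:
        if old.open_count == 0:
            continue  # non-open partitions may be changed or removed
        p = new.get(old.name)
        if p is None:
            return f"{old.name} is open and would disappear"
        if p.start != old.start:
            return f"{old.name} is open and would move"
        if p.size < old.size:
            return f"{old.name} is open and would shrink"
        # Same start and same or larger size: the user will not notice.
    return None


current = [Partition("a", 0, 100, open_count=1), Partition("b", 100, 50)]
```

Under these assumptions, a new table that extends the open partition
"a" and drops the non-open "b" is accepted, while one that shrinks or
drops "a" is vetoed.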

>> The correct way to do that is to use the g_ctl() api because what
>> is needed is an out-of-band mechanism to tell that we want to lose
>> one of the partitions.
>
>Such mechanism would be needed only to inform the OS that it should
>forget about partitions it currently knows about (whether mounted or
>not).

I think you presuppose a much higher level of ability than the majority
of our superusers lay claim to.  Your scheme would require a much
stronger set of userland utilities to avoid unintentional footshooting.

Considering the sorry state of our current tools, not to mention
libdisk/sysinstall, I think switching to such a scheme is a non-starter:
the amount of code to write is simply prohibitive.

>> 1. Find out which partition format we migrate to instead of BSDlabel
>>    which runs out of steam around 2TB.  GPT has been proposed but
>>    seems to be a rather dead end with Itanic sinking fast.
>
>Itanium is not sinking fast. It's submerged, but holding for now. The
>Open Source community simply isn't the audience for it yet and may very
>well never be. This doesn't mean that there isn't an audience at all.

Well, it is out of the commercial market more or less completely
and therefore the earlier presumption that it would rule and force
GPT on everybody is no longer by definition true.

I don't particularly care which format we select, it's a political/
marketing decision which should be made so that 7.0 can use the new
format so we don't have to retrofit it in that release.  7.0 is
probably too late, but 6.0 is not feasible.

>> Anybody who expect me to do all of this singlehandedly can take a peek
>> here http://people.freebsd.org/%7Epeter/srcsys.window.txt and go stick
>> their head in a bucket of cold water before telling me I have to work
>> harder.
>
>Ah yes, the classic trick of showing quantity when quality is questioned.

No, the classical trick of telling people to pitch in instead of whining.

And by pitch in I don't mean "try to take every shortcut imaginable rather
than offer a bit of help getting things in the right direction" like
some writers of GEOM classes I could mention!

>unanswered and so far the problems have remained unsolved. What I fail to
>see is the proverbial "let's take a step back and look at it again from a
>distance" attitude from you. Instead everybody else's got it wrong or is
>missing bits and pieces from the puzzle. Fine, that's certainly possible,
>but you're not making a good case for it and I remain unconvinced 
>(FWIW).

I don't think anybody else has spent as long a time as I have on
this subject and if you want to spend the next ten years of your
life studying it and at some point come with an implementation which
works better you are more than welcome, heck, I wish you had done
that 10 years ago!

>So, maybe it's time to step back and take a look at it again. Define the
>problems that have been raised, describe the cause (real or artificial) 
>and identify possible solutions, not just yours, and build consensus for the
>best solution. Chances are that you actually get other people to help 
>out implementing the solution.

That would be great.

The first requirement for that to be a success is that people stop
trying to find a quick fix so they don't have to think about the
problems.

The problem is that that's all people care about: Make it work
for me right now.  Nobody seems to pay much attention to the long
view or the architectural view.  And if they take the long or architectural
view, the outcome is usually "somebody should", rather than "I'll do".

Yes, I am frustrated, and this pisses me off far more than it should,
and it pisses me off in particular when it comes from somebody who has
provided such enthusiastic resistance all the way without ever really
taking the time to think things through before hitting me on the head
with halfbaked proposals.

I would suggest that you go off and do a prototype of your scheme,
all the GEOM work has given you some very nice interfaces to the
relevant pieces of the system, all you have to do is offer geom_dev,
geom_disk and geom_vfs outwards APIs and you're sailing.

Let us know when you have a prototype we can try out.

In the meantime please shut up and get out of the way.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk_at_FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.
Received on Thu Apr 07 2005 - 20:57:59 UTC