Thursday, May 5, 2011

Restrict word list in XML schema

I'm writing an XML schema and need to prevent the text of an element from matching certain values (e.g., the variableName element cannot be 'int', 'byte', 'string', etc.).

I have tried using a restriction with a pattern element similar to "^(int|byte|string)", but without success.

Do you know the way to format the regular expression, or any other way to make this work?

From stackoverflow
  • Without negative lookahead, this is pretty tedious. Below is a regex that works, along with some unit tests. It is written in Perl, not XSD, but it's pretty basic regex, so it should work... You should remove the whitespace from the regex before using it. I added the whitespace just to make it a little easier to read.

    Note: I don't know if "\A" and "\z" are allowed in XSD. If not, replace with "^" and "$" respectively.

    use Test::More 'no_plan';
    
    my $re = qr/\A(\z|[^ibs]                                                        
       |i(\z|[^n]|n(\z|[^t]|t.))                                                    
       |b(\z|[^y]|y(\z|[^t]|t(\z|[^e]|e.)))                                         
       |s(\z|[^t]|t(\z|[^r]|r(\z|[^i]|i(\z|[^n]|n(\z|[^g]|g.))))))/x;
    
    for my $str ( qw(inter bytes ins str strings in sdgsdfger i b s by byt bite st
    str stri strin strink) ) {
       like($str, $re, $str);
    }
    
    for my $str ( qw(int byte string) ) {
       unlike($str, $re, $str);
    }
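
For comparison, here is roughly what the "easy" version looks like in an engine that does support negative lookahead. This is only a sketch in Python, not something an XSD pattern facet will accept; the names RESERVED, LOOKAHEAD_RE and is_allowed are made up for illustration.

```python
import re

# Reserved words taken from the question's example.
RESERVED = ("int", "byte", "string")

# Negative lookahead: reject the value only if it is exactly one reserved word.
# This works in Perl/Python-style engines; XSD patterns have no (?!...) syntax.
LOOKAHEAD_RE = re.compile(
    r"(?!(?:" + "|".join(map(re.escape, RESERVED)) + r")\Z).*",
    re.DOTALL,
)

def is_allowed(name: str) -> bool:
    """Return True when name is not exactly one of the reserved words."""
    return LOOKAHEAD_RE.fullmatch(name) is not None
```

With this, is_allowed("inter") is true and is_allowed("int") is false, which is the same behaviour the longer lookahead-free regex above has to spell out letter by letter.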
    
  • Does it have to be W3C Schema (aka "xml schema")? Or would a standard alternative like RelaxNG work? I may be wrong, but I thought it had a few improvements on combining restrictions, including the ability to do intersections.

    andypaxo : This schema will be part of a system which already uses W3C schema and includes support for this standard. So yes, I'm afraid it does have to be W3C.
  • After triple-checking that XML Schema (XSD) regexes really don't support any of the features that would make this task reasonably easy (particularly lookaheads and anchors), I've come up with an approach that seems to work. This is written in free-spacing mode to make it easier to read, but (of course) XSD regexes don't support that either. :-/

    /
      [^ibs].* |
      i(.{0,1} | [^n].* | n[^t].* | nt.+) |
      b(.{0,2} | [^y].* | y[^t].* | yt[^e].* | yte.+) |
      s(.{0,4} | [^t].* | t[^r].* | tr[^i].* | tri[^n].* | trin[^g].* | tring.+)
    /
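
Since the whitespace has to go before the pattern can be used, here is a small sketch in Python that collapses it and tries the result against a few sample names. It assumes Python's re.fullmatch is a fair stand-in for XSD's whole-value matching, which is an approximation (the two regex dialects are not identical). SPACED, COLLAPSED and xsd_like_match are illustrative names only.

```python
import re

# The pattern from the answer, copied with its free-spacing layout intact.
SPACED = r"""
  [^ibs].* |
  i(.{0,1} | [^n].* | n[^t].* | nt.+) |
  b(.{0,2} | [^y].* | y[^t].* | yt[^e].* | yte.+) |
  s(.{0,4} | [^t].* | t[^r].* | tr[^i].* | tri[^n].* | trin[^g].* | tring.+)
"""

# Strip every whitespace character to get the single-line form.
COLLAPSED = "".join(SPACED.split())

def xsd_like_match(value: str) -> bool:
    """Approximate an XSD pattern check: the whole value must match."""
    return re.fullmatch(COLLAPSED, value) is not None

if __name__ == "__main__":
    print(COLLAPSED)
    for name in ("inter", "byt", "strink", "int", "byte", "string"):
        print(name, xsd_like_match(name))
```

The collapsed string is what would go into the pattern's value attribute. Note that the empty string is rejected too, since every alternative consumes at least one character.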
    

    The first alternative, obviously, matches anything that doesn't start with the initial letter of any of the keywords. Each top-level alternative after that matches all strings that start with the same letter as a keyword but:

    • are shorter than the keyword,
    • have a different second letter, different third letter, etc., or
    • are longer than the keyword.

    Although XSD regexes don't support explicit anchors (i.e., ^, $, \A, \z), all matches are implicitly anchored at both ends. If the list of keywords is long, you might run up against a limit on the sheer length of the regex. Barring that (and much to my surprise), it looks like this job may actually be doable. :-)
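
If the keyword list gets long, writing this by hand is error-prone, so it may be worth generating the pattern. Below is a minimal sketch in Python of the construction described above (shorter, differs at some letter, or longer). It is not part of the original answer: exclusion_pattern is a hypothetical helper, and it assumes every keyword is alphanumeric and starts with a different character. Keywords sharing a first letter (say 'int' and 'integer') would need a trie-style merge that this sketch does not attempt.

```python
import re

def exclusion_pattern(keywords):
    """Build a lookahead-free pattern matching any non-empty string
    except the given keywords (assumes whole-value matching, as in XSD)."""
    first_chars = [k[:1] for k in keywords]
    if not all(k.isalnum() for k in keywords):
        raise ValueError("keywords must be non-empty and alphanumeric")
    if len(set(first_chars)) != len(first_chars):
        raise ValueError("keywords must start with distinct characters")

    # Anything that starts with none of the keywords' first letters.
    alternatives = ["[^" + "".join(first_chars) + "].*"]

    for word in keywords:
        head, rest = word[0], word[1:]
        parts = []
        if rest:
            # Shorter than the keyword.
            parts.append(f".{{0,{len(rest) - 1}}}")
        for j, ch in enumerate(rest):
            # Same prefix, but a different letter at position j.
            parts.append(f"{rest[:j]}[^{ch}].*")
        # The whole keyword followed by at least one more character.
        parts.append(f"{rest}.+")
        alternatives.append(head + "(" + "|".join(parts) + ")")

    return "|".join(alternatives)

if __name__ == "__main__":
    pattern = exclusion_pattern(["int", "byte", "string"])
    print(pattern)
    # re.fullmatch approximates XSD's implicit anchoring at both ends.
    print(re.fullmatch(pattern, "int") is None)
```

For the three keywords from the question, this produces the same pattern as the hand-written one once its whitespace is removed.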

    andypaxo : Wow, thanks for putting the effort into coming up with this. It's long, but it does work brilliantly.
