Sunday, March 6, 2011

Java: change and move a non-standard XML file

I am using a third party application and would like to change one of its files. The file is stored in XML but with an invalid doctype.

When I try to read it with a DocumentBuilder, it errors out because the doctype contains "file:///ReportWiz.dtd" (as shown, with quotes) and I get a file-not-found exception. Is there a way to tell the DocumentBuilder to ignore this? I have tried setValidating(false) and setNamespaceAware(false) on the DocumentBuilderFactory.

The only solutions I can think of are

  • copy the file line by line into a new file, omitting the offending line, do what I need to do, then copy it into another new file and insert the offending line back in, or
  • do mostly the same as above but work with a FileStream of some sort (though I am not clear on how I could do this; help?)

DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setValidating(false);
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(file);
From stackoverflow
  • My first thought was dealing with it as a stream. You could make a new adapter at some level and just copy input to output except for the offending text.

    If the file is shortish (under half a gig or so) you could also read the entire thing into a byte array and make your modifications there, then create a new stream from the byte array into your builder.

    That's the advantage of the amazingly bulky way Java handles streams: you actually have a lot of flexibility.

    Adam Lerman : could you maybe help me with some example code(or a link), this sounds a lot like what I want to do.
    Bill K : Looks like what you want to do is subclass FilterInputStream and override read(). When your read is called, call super.read() to get the data, scan & modify the data, and return it. I'll fool around with it if I get some time, but it shouldn't be too hard.
    Bill K : Here is an example that has very simple filtering (it excludes unprintable characters from the stream I believe). http://www.cafeaulait.org/slides/sd2000west/javaio/44.html Your case is harder because you need to recognize a multi-character pattern.
  • Another thing I was debating was storing all of the file in a String, then doing my manipulations and writing the String out to a file. None of these seem clean or easy, but what is the best way to do this?

  • Handle resolution of the DTD manually, either by returning a copy of the DTD file (loaded from the classpath) or by returning an empty one. You can do this by setting an entity resolver on your document builder:

    EntityResolver er = new EntityResolver() {
        @Override
        public InputSource resolveEntity(String publicId, String systemId)
                throws SAXException, IOException {
            if ("file:///ReportWiz.dtd".equals(systemId)) {
                System.out.println(systemId);
                // Hand the parser an empty DTD instead of the missing file.
                InputStream zeroData = new ByteArrayInputStream(new byte[0]);
                return new InputSource(zeroData);
            }
            // null tells the parser to resolve any other entity as usual.
            return null;
        }
    };
    docBuilder.setEntityResolver(er);
    
    Adam Lerman : More complex than I needed. I didn't try this, but I was really only looking for a way to ignore it completely.
  • Tell your DocumentBuilderFactory to ignore the DTD declaration like this:

    docFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    

    See here for a list of available features.

    You also might find JDOM a lot easier to work with than org.w3c.dom:

    org.jdom.input.SAXBuilder builder = new org.jdom.input.SAXBuilder();
    builder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    org.jdom.Document doc = builder.build(file);
    
    Adam Lerman : EXACTLY what I needed. THANKS!! Welcome to SO.
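
For anyone who wants the setFeature suggestion above in runnable form, here is a minimal sketch using only the JDK's built-in parser. The class name IgnoreExternalDtd, the helper parseIgnoringDtd, and the sample element names are made up for illustration. The feature URI is Xerces-specific, so this assumes the parser behind DocumentBuilderFactory recognizes it; if it does not, setFeature throws ParserConfigurationException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original thread.
class IgnoreExternalDtd {
    // Xerces feature that controls loading of the external DTD in non-validating mode.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Parses XML without trying to fetch the DTD named in the DOCTYPE.
    static Document parseIgnoringDtd(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newDocumentBuilder().parse(in);
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document with the DOCTYPE from the question.
        String sample = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE report SYSTEM \"file:///ReportWiz.dtd\">\n"
                + "<report><title>demo</title></report>";
        Document doc = parseIgnoringDtd(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

For a file on disk, the same helper can be fed a FileInputStream opened on that file.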

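The byte-array idea from the first answer (read everything, cut out the offending text, parse from memory) could look roughly like the sketch below. DoctypeStripper and its methods are hypothetical names. It assumes the file is UTF-8 and that the DOCTYPE is a simple one-line declaration with no internal subset and no '>' inside its quoted values. A regex is a shortcut here, not a general DOCTYPE parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original thread.
class DoctypeStripper {
    // Removes the first simple DOCTYPE declaration (no internal subset) from the text.
    static String stripDoctype(String xml) {
        return xml.replaceFirst("<!DOCTYPE[^>\\[]*>", "");
    }

    // Decodes the raw bytes (assumed UTF-8), drops the DOCTYPE, and parses from memory.
    static Document parseWithoutDoctype(byte[] raw)
            throws ParserConfigurationException, SAXException, IOException {
        String xml = new String(raw, StandardCharsets.UTF_8);
        byte[] cleaned = stripDoctype(xml).getBytes(StandardCharsets.UTF_8);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(cleaned));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document with the DOCTYPE from the question.
        String sample = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE report SYSTEM \"file:///ReportWiz.dtd\">\n"
                + "<report><title>demo</title></report>";
        Document doc = parseWithoutDoctype(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

With a real file, java.nio.file.Files.readAllBytes can supply the byte array. Note that the DOCTYPE line then has to be put back by hand when the result is written out.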