An XML-Based Framework for Language Neutral Program Representation and Generic Analysis. By Raihan Al-Ekram and Kostas Kontogiannis
XML-based tools are very useful in re-engineering, because an XML representation of source code offers a flexible way to analyse high-level information about a system starting from its text-based source code. XML is easy to understand, portable, extensible, interoperable and an open standard. An Abstract Syntax Tree (AST) is a hierarchical representation of source code that is tightly coupled to the grammar of the programming language. Various XML-based applications are available to represent source code written in different languages. JavaML, CppML, srcML, PascalML, XMLizer, Agile Parsing and GXL are some of them; they produce complete or partial ASTs for specific programming languages. However, an AST abstracts the program at a very fine level of granularity and is therefore not suitable for direct use in higher-level, more sophisticated program analysis. For a complete analysis of source code, we need a generic program analysis tool that supports various languages and makes it possible to analyse high-level abstractions such as the data flow and control flow among blocks/components. The data flow and control flow of a program can be represented as graphs in a programming-language-independent way. The following graphs are some of the representations used for such high-level abstractions.
A Control Flow Graph (CFG) represents the execution paths of a program and is widely used for code optimization, data flow analysis and testing. A Program Dependence Graph (PDG) represents both data and control dependencies and is used for code optimization, parallelization and loop fusion. A System Dependency Graph (SDG) is an extension of the PDG, constructed by connecting the individual PDGs with edges. A Call Graph represents the relationship between callers and callees among the procedures of a program and is used for traditional inter-procedural analysis. A Program Summary Graph (PSG) is an extension of the Call Graph that takes into account global variables and reference parameters at the individual call points.
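To make the idea of a CFG concrete, here is a minimal sketch in Python. It assumes the graph is stored as a plain adjacency dictionary; the node names and the two helper functions are hypothetical and chosen only for illustration, they are not taken from the paper.

```python
from collections import deque

# Hypothetical CFG for a small fragment of the form:
#   entry; if (cond) { then } else { else }; exit
# Each key is a node, each value lists the nodes control may flow to next.
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}


def reachable(graph, start):
    """Return the set of nodes reachable from start (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for successor in graph[node]:
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def execution_paths(graph, node, goal, path=()):
    """Enumerate the acyclic paths from node to goal as lists of node names."""
    path = path + (node,)
    if node == goal:
        return [list(path)]
    paths = []
    for successor in graph[node]:
        if successor not in path:  # skip nodes already on this path (no loops)
            paths.extend(execution_paths(graph, successor, goal, path))
    return paths
```

Calling `execution_paths(cfg, "entry", "exit")` lists one path through each branch of the condition. A Call Graph could be stored in the same way, with procedures as nodes and caller-to-callee pairs as edges.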
The following figure shows the architecture of the "Generic Program Representation Framework", which is built on XML applications developed for specific programming languages and for concepts such as object-oriented programming, data flow diagrams and control flow. It consists of three major abstraction layers.
1. Source Code [Layer 0]: The original source text of the program to be analysed.
2. AST Level Representation [Layer 1]: The first level of abstraction, in which the program is represented by its AST [JavaML, CppML, CML, PascalML]. This layer also contains a sub-layer with the Object Oriented Mark-up Language [OOML] and the Procedural Mark-up Language [ProML], which are generic models for representing object-oriented languages and procedural languages respectively. These models are derived by generalising the mark-up language models for OO and procedural languages respectively.
3. Higher Level Representation [Layer 2]: The next level of abstraction, in terms of intra-procedural and inter-procedural graphs. This layer has two sub-layers: Layer 2.1 represents the basic facts of the program in FactML format, and Layer 2.2 represents the intra-procedural and inter-procedural dependence and flow graphs of the program, expressed as CFGML, PDGML, SDGML and CGML, where:
FactML is used to represent the building blocks of a program, such as classes and the associations among them, using XML and a corresponding DTD. These building blocks include Types, Variables, Statements, Functions and association classes. CFGML: a CFG is a directed graph of the basic blocks in a program and the possible flows of control from one block to another. It is also represented using XML and a corresponding DTD. PDGML and CGML are the XML representations of the PDG and the Call Graph. These XML-based representations are derived from UML class diagrams for each level.
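As an illustration of how a CFG could be written down in XML, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The element and attribute names (`cfg`, `block`, `statement`, `flow`, `from`, `to`) are assumptions made for this example; they are not the actual CFGML DTD from the paper, which is not reproduced in this summary.

```python
import xml.etree.ElementTree as ET


def cfg_to_xml(function_name, blocks, edges):
    """Build a CFGML-like XML tree (hypothetical schema, not the paper's DTD)."""
    root = ET.Element("cfg", {"function": function_name})
    # One <block> element per basic block, holding its statements as text.
    for block_id, statements in blocks.items():
        block = ET.SubElement(root, "block", {"id": block_id})
        for text in statements:
            ET.SubElement(block, "statement").text = text
    # One <flow> element per possible transfer of control between blocks.
    for source, target in edges:
        ET.SubElement(root, "flow", {"from": source, "to": target})
    return root


def successors(root, block_id):
    """Return the ids of the blocks that control can reach from block_id."""
    return [flow.get("to") for flow in root.findall("flow")
            if flow.get("from") == block_id]


# Hypothetical basic blocks of a small absolute-value function.
blocks = {
    "b1": ["x = read()", "if x > 0"],
    "b2": ["y = x"],
    "b3": ["y = -x"],
    "b4": ["print(y)"],
}
edges = [("b1", "b2"), ("b1", "b3"), ("b2", "b4"), ("b3", "b4")]

cfg_xml = cfg_to_xml("abs_value", blocks, edges)
xml_text = ET.tostring(cfg_xml, encoding="unicode")
```

Because the graph is ordinary XML, any XML-aware tool can query it. Here `successors(cfg_xml, "b1")` reads the outgoing flows of the first block straight from the tree.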
Transformation tools are used to convert a representation from one level to another. A transformer may be a source code transformer or an XML-to-XML transformer.
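The two kinds of transformer can be sketched as a small pipeline. The Python example below is only an illustration under stated assumptions: it uses Python's own `ast` module as a stand-in for a language-specific parser (the framework itself targets languages such as Java, C++, C and Pascal), and the output element names (`callgraph`, `call`) are hypothetical rather than the paper's CGML.

```python
import ast
import xml.etree.ElementTree as ET


def source_to_ast_xml(source):
    """Source code transformer: parse source text and emit its AST as XML."""
    def to_xml(node):
        elem = ET.Element(type(node).__name__)
        for field, value in ast.iter_fields(node):
            items = value if isinstance(value, list) else [value]
            children = [item for item in items if isinstance(item, ast.AST)]
            if children:
                # Wrap child nodes in an element named after the AST field.
                holder = ET.SubElement(elem, field)
                for child in children:
                    holder.append(to_xml(child))
            elif isinstance(value, (str, int, float)):
                # Simple values (names, literals) become XML attributes.
                elem.set(field, str(value))
        return elem
    return to_xml(ast.parse(source))


def ast_xml_to_call_graph(ast_xml):
    """XML-to-XML transformer: derive a call graph from the AST-level XML.

    Only direct calls by plain name are captured; calls inside nested
    functions are also attributed to the enclosing function.
    """
    graph = ET.Element("callgraph")
    for func in ast_xml.iter("FunctionDef"):
        for callee in func.findall(".//Call/func/Name"):
            ET.SubElement(graph, "call", {"caller": func.get("name"),
                                          "callee": callee.get("id")})
    return graph


# Hypothetical input program with three small functions.
source = "def f():\n    g()\n    h(1)\n\ndef g():\n    h(2)\n\ndef h(x):\n    return x\n"
call_graph = ast_xml_to_call_graph(source_to_ast_xml(source))
call_edges = [(c.get("caller"), c.get("callee")) for c in call_graph.findall("call")]
```

The second step never looks at the source text, only at the XML produced by the first step. That separation is what lets higher-level analyses stay independent of the source language.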
Thanks
TS

