{{Data transformation}}


{{Expert needed|Computing|reason=The information appears outdated and requires sources both historical (history of data mapping) and current (how is data mapping performed today).|date=May 2018}}
{{inline|date=June 2010}}
In [[computing]] and [[data management]], '''data mapping''' is the process of creating [[data element]] [[Map (mathematics)|mapping]]s between two distinct [[data model]]s. Data mapping is used as a first step for a wide variety of [[data integration]] tasks, including:<ref name="ShahbazData15">{{cite book |url=https://books.google.com/books?id=pRChCgAAQBAJ |title=Data Mapping for Data Warehouse Design |author=Shahbaz, Q. |publisher=Elsevier |pages=180 |year=2015 |isbn=9780128053355 |access-date=29 May 2018}}</ref>


* [[Data transformation]] or [[data mediation]] between a data source and a destination
* Identification of data relationships as part of [[data lineage]] analysis
* Discovery of hidden sensitive data, such as the last four digits of a Social Security number hidden in another user ID, as part of a data masking or [[de-identification]] project
* [[Data consolidation|Consolidation]] of multiple databases into a single database and identifying redundant columns of data for consolidation or elimination


For example, a company that would like to transmit and receive purchase orders and invoices with other companies might use data mapping to create data maps from its own data to standardized [[ANSI ASC X12]] messages.
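At its simplest, a data map pairs each element of the source model with an element of the destination model, and a transformation applies those pairs to a record. The sketch below illustrates the idea in Python; the field names are hypothetical and do not come from any real X12 message definition.

```python
# A minimal data map: each entry pairs a source field with a destination
# field. Names are illustrative only, not taken from ANSI ASC X12.
PURCHASE_ORDER_MAP = {
    "cust_name": "BuyerName",
    "po_num": "PurchaseOrderNumber",
    "order_date": "PurchaseOrderDate",
}

def apply_map(record, field_map):
    """Transform a source record into the destination model using the map."""
    return {dest: record[src] for src, dest in field_map.items()}

source_record = {
    "cust_name": "Acme Corp",
    "po_num": "PO-1001",
    "order_date": "2023-09-02",
}
mapped = apply_map(source_record, PURCHASE_ORDER_MAP)
# mapped == {"BuyerName": "Acme Corp", "PurchaseOrderNumber": "PO-1001",
#            "PurchaseOrderDate": "2023-09-02"}
```

Real integration tools generate far richer transformations than this field-renaming sketch, but every data map reduces to source-to-destination element pairings of this kind.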


==Standards==
X12 standards are generic [[Electronic Data Interchange]] (EDI) standards designed to allow a [[company (law)|company]] to exchange [[data]] with any other company, regardless of industry. The standards are maintained by the Accredited Standards Committee X12 (ASC X12), with the [[American National Standards Institute]] (ANSI) accredited to set standards for EDI. The X12 standards are often called [[ANSI ASC X12]] standards.


The [[W3C]] introduced [https://www.w3.org/TR/r2rml/ R2RML] as a standard for mapping data in a [[relational database]] to data expressed in terms of the [[Resource Description Framework]] (RDF).

In the future, tools based on [[semantic web]] languages such as RDF, the [[Web Ontology Language]] (OWL), and standardized [[metadata registry|metadata registries]] may make data mapping a more automatic process. This process would be accelerated if each application performed [[metadata publishing]]. Fully automated data mapping is a very difficult problem (see [[semantic translation]]).


==Hand-coded, graphical manual ==
Data mappings can be done in a variety of ways: using procedural code, creating [[XSLT]] transforms, or using graphical mapping tools that automatically generate executable transformation programs. Graphical tools allow a user to "draw" lines from fields in one set of data to fields in another. Some graphical data mapping tools allow users to "auto-connect" a source and a destination; this feature depends on the source and destination [[data element name]]s being the same. Transformation programs are automatically created in SQL, XSLT, [[Java (programming language)|Java]], or [[C++]]. These kinds of graphical tools are found in most [[Extract, transform, load|ETL]] (extract, transform, and load) tools as the primary means of entering data maps to support data movement. Examples include SAP BODS and Informatica PowerCenter.
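The "auto-connect" behavior described above can be sketched in a few lines: pair up fields whose names match exactly in source and destination. This is a simplifying assumption for illustration; real mapping tools may also normalize case or punctuation before comparing names.

```python
def auto_connect(source_fields, dest_fields):
    """Sketch of a graphical mapper's "auto-connect": connect each source
    field to the destination field with the identical name, leaving all
    other fields unmapped."""
    dest_set = set(dest_fields)
    return {name: name for name in source_fields if name in dest_set}

source = ["OrderID", "FirstName", "Amount"]
dest = ["OrderID", "PersonGivenName", "Amount", "Currency"]
print(auto_connect(source, dest))
# {'OrderID': 'OrderID', 'Amount': 'Amount'}
```

Note that `FirstName` stays unmapped even though `PersonGivenName` is the corresponding destination element; connecting those requires the synonym lookup described under semantic mapping below.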


==Data-driven mapping==
This is the newest approach in data mapping: actual data values in two data sources are evaluated simultaneously, using heuristics and statistics to automatically discover complex mappings between two data sets. The approach finds transformations between the data sets, discovering substrings, concatenations, [[arithmetic]], and case statements, as well as other kinds of transformation logic. It also discovers data exceptions that do not follow the discovered transformation logic.
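A toy version of this idea can be written by testing a small set of candidate transformations against actual column values and keeping the one that explains the most rows, along with the rows it fails to explain. The candidate set here (identity, suffix substring, concatenation) is illustrative; production tools search a far larger space of transformations.

```python
def discover_mapping(source_rows, target_values):
    """Toy data-driven mapper: score candidate transformations against
    actual data values and return the best candidate's name plus the
    exception rows it does not explain."""
    columns = list(source_rows[0])
    candidates = {}
    for c in columns:
        candidates[c] = lambda r, c=c: r[c]                      # identity
        candidates[f"{c}[-4:]"] = lambda r, c=c: r[c][-4:]       # substring
        for d in columns:
            if d != c:                                           # concatenation
                candidates[f"{c}+' '+{d}"] = lambda r, c=c, d=d: r[c] + " " + r[d]
    best_name, best_fn = max(
        candidates.items(),
        key=lambda kv: sum(kv[1](r) == t for r, t in zip(source_rows, target_values)),
    )
    exceptions = [r for r, t in zip(source_rows, target_values) if best_fn(r) != t]
    return best_name, exceptions

rows = [{"first": "Ada", "last": "Lovelace"}, {"first": "Alan", "last": "Turing"}]
targets = ["Ada Lovelace", "Alan Turing"]
print(discover_mapping(rows, targets))
# ("first+' '+last", [])
```

Here the concatenation of `first` and `last` explains every target value, so the exception list is empty; a row whose target did not follow that pattern would be reported as an exception rather than silently dropped.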


==Semantic mapping==
[[Semantic mapper|Semantic mapping]] is similar to the auto-connect feature of data mappers with the exception that a [[metadata registry]] can be consulted to look up data element synonyms. For example, if the source system lists ''FirstName'' but the destination lists ''PersonGivenName'', the mappings will still be made if these data elements are listed as [[synonyms]] in the metadata registry. Semantic mapping is only able to discover exact matches between columns of data and will not discover any transformation logic or exceptions between columns.
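The synonym lookup can be sketched as follows. The registry contents are hypothetical; a real metadata registry (for example, one built to ISO/IEC 11179) holds far richer element metadata than these bare synonym sets.

```python
# Hypothetical synonym registry: each set groups data element names
# that a metadata registry records as equivalent.
SYNONYMS = [
    {"FirstName", "PersonGivenName", "GivenName"},
    {"LastName", "PersonSurName", "FamilyName"},
]

def semantic_connect(source_fields, dest_fields):
    """Pair source and destination fields that are identical or registered
    as synonyms. No transformation logic or exceptions are discovered;
    only exact (or synonym-exact) matches are made."""
    mapping = {}
    for s in source_fields:
        for d in dest_fields:
            if s == d or any(s in group and d in group for group in SYNONYMS):
                mapping[s] = d
    return mapping

print(semantic_connect(["FirstName", "LastName"],
                       ["PersonGivenName", "PersonSurName"]))
# {'FirstName': 'PersonGivenName', 'LastName': 'PersonSurName'}
```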

Data lineage tracks the life cycle of each piece of data as it is ingested, processed, and output by an analytics system. This provides visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the data flow for step-wise debugging or regenerating lost output. Database systems have already used such information, called data provenance, to address similar validation and debugging challenges.<ref>De, Soumyarupa (2012). ''Newt: an architecture for lineage based replay and debugging in DISC systems''. UC San Diego: b7355202. Retrieved from https://escholarship.org/uc/item/3170p7zn</ref>
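A minimal sketch of lineage recording: each value carries the list of pipeline steps that produced it, so any output can be traced back through the steps it passed. This is a deliberately tiny illustration; real lineage systems capture provenance at much finer granularity and persist it for replay.

```python
def step(name, fn, items):
    """Apply one pipeline step while recording lineage: each output value
    carries the history of step names that produced it."""
    return [(fn(value), history + [name]) for value, history in items]

# Ingest two values, then run them through a two-step pipeline.
data = [(3, ["ingest"]), (4, ["ingest"])]
data = step("square", lambda x: x * x, data)
data = step("add_one", lambda x: x + 1, data)
print(data)
# [(10, ['ingest', 'square', 'add_one']), (17, ['ingest', 'square', 'add_one'])]
```

With the history attached, tracing an unexpected output back to its source, or replaying only the steps after a given point, becomes a matter of reading and re-running the recorded step list.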


==See also==
* [[Data integration]]
* [[Data wrangling]]
* [[Identity transform]]
* [[ISO/IEC 11179]] – the ISO/IEC metadata registry standard
* [[Metadata]]
* [[Metadata publishing]]
* [[Schema matching]]
* [[Semantic heterogeneity]]
* [[Semantic mapper]]
* [[Semantic translation]]
* [[Semantics]]
* [[XSLT]] – XML transformation language


==References==
{{reflist}}

{{DEFAULTSORT:Data Mapping}}
